[INTTEST-6] Leader election takes too long after shutdown instance starts Created: 09/Feb/16  Updated: 19/Oct/17  Resolved: 25/Feb/16

Status: Resolved
Project: integration-test
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File node_start.txt    

 Description   

This is occurring sporadically in the cluster system test:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-beryllium/565/

See the attached karaf log excerpt for the instance start. After 5 minutes, no leader has been elected in this particular instance.



 Comments   
Comment by Luis Gomez [ 09/Feb/16 ]

Attachment node_start.txt has been added with description: karaf log node start

Comment by Luis Gomez [ 09/Feb/16 ]

The full karaf log for the failing instance is in the odl1_karaf.log.xz file of the job run mentioned above:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-beryllium/565/

Comment by Tom Pantelis [ 10/Feb/16 ]

At the start of "030 Car Failover Crud On New Leader", odl1 (member-1 @ 10.30.11.161) was the leader. The test stops the leader and leadership transitioned to odl3 (member-3 @ 10.30.11.142). The test then does some CRUD with the remaining 2 nodes and then restarts the previous leader odl1.

odl1 started up at:

2016-02-08 13:28:10,994 | INFO | ult-dispatcher-2 | kka://opendaylight-cluster-data) | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.161:2550] - Starting up...

and then joined itself and declared itself as cluster leader at:

2016-02-08 13:28:16,074 | INFO | lt-dispatcher-23 | kka://opendaylight-cluster-data) | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.161:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.11.161:2550] is JOINING, roles [member-1]
2016-02-08 13:28:17,062 | INFO | lt-dispatcher-23 | kka://opendaylight-cluster-data) | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.161:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.11.161:2550] to [Up]

So odl1 didn't connect to odl3 and formed an island with itself as leader. This is because odl1 is the first seed node, and akka has special behavior for the first seed node: it first tries to connect to another seed node and, failing that, declares itself leader. This is governed by the seed-node-timeout setting, which is 5s by default. We had seen this issue before and changed the seed-node-timeout to 12s in the akka.conf we ship. Judging from the timestamps above, the tests are still using the default 5s. I believe the deployment script copies over its own akka.conf, so the akka.conf template needs to be updated to the latest. It would be nice if the tests could get the latest akka.conf from the build and substitute the appropriate settings on the fly so we don't keep running into issues with a stale akka.conf.
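The timing argument can be checked directly from the two karaf log timestamps quoted above. The following is a minimal sketch; the helper name elapsed_seconds is hypothetical and only the two timestamps come from the log.

```python
from datetime import datetime

# karaf log timestamp layout: date, time, comma, milliseconds
FMT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two karaf log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the "Starting up..." and "is JOINING" log lines.
starting_up = "2016-02-08 13:28:10,994"
joining = "2016-02-08 13:28:16,074"

waited = elapsed_seconds(starting_up, joining)
print(f"odl1 waited {waited:.3f}s before joining itself")
```

A gap close to 5 seconds is consistent with the default seed-node-timeout rather than the 12s value.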

Comment by Tom Pantelis [ 10/Feb/16 ]

Wrt stale akka.conf, this will happen in real environments on upgrade. What we need is a factory akka.conf that we overwrite on feature install and a custom akka.conf that users modify and we preserve. The custom akka.conf would be overlaid/merged with the factory akka.conf. We can do this for Bo and maybe backport to Be SR1.
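As a rough illustration of the overlay idea, the sketch below layers custom settings on top of factory settings using plain nested dictionaries. It is not the controller's implementation: the function name overlay and the sample settings are assumptions made for the example, and a real akka.conf is HOCON rather than a Python dict.

```python
def overlay(factory, custom):
    """Return a new dict with custom settings layered over factory settings."""
    merged = dict(factory)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides have a nested section: merge it key by key.
            merged[key] = overlay(merged[key], value)
        else:
            # Custom value wins over (or adds to) the factory value.
            merged[key] = value
    return merged


# Hypothetical factory defaults, overwritten on every feature install.
factory = {"akka": {"cluster": {"seed-node-timeout": "12s", "seed-nodes": []}}}

# Hypothetical user-owned settings, preserved across upgrades.
custom = {
    "akka": {
        "cluster": {
            "seed-nodes": ["akka.tcp://opendaylight-cluster-data@10.30.11.161:2550"]
        }
    }
}

effective = overlay(factory, custom)
print(effective["akka"]["cluster"])
```

With this split, a changed factory default such as seed-node-timeout reaches the effective configuration on upgrade, while user-specific values such as the seed node list survive.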

Comment by Luis Gomez [ 10/Feb/16 ]

Good comments, Tom. I think we will have to change the way the cluster files are fed into the deploy job, so I am moving this to the integration Bugzilla.

Comment by Tom Pantelis [ 10/Feb/16 ]

For now you can just update the akka.conf template to match what we ship in the controller.

Going forward we can implement the factory/custom approach I mentioned, so the tests would pick up factory changes automatically. That means you don't need to invest any time in doing something like that in the test framework. The akka.conf doesn't change that often anyway.

Comment by Luis Gomez [ 10/Feb/16 ]

Right, short term (today) we will update the templates in integration. Medium term (after Be release) we will update deploy scripts to use distribution scripts.

Comment by Luis Gomez [ 10/Feb/16 ]

So after looking at the cluster deploy scripts, they use (now I remember Shaleen changed this) the right templates from the distribution itself:

AKKACONF=/tmp/${BUNDLEFOLDER}/configuration/initial/akka.conf
MODULESCONF=/tmp/${BUNDLEFOLDER}/configuration/initial/modules.conf
MODULESHARDSCONF=/tmp/${BUNDLEFOLDER}/configuration/initial/module-shards.conf

This means the issue is still there with the right cluster templates, so I am returning this bug to controller.

Comment by Tom Pantelis [ 10/Feb/16 ]

I don't think it is. The seed-node-timeout is set to 12s in the akka.conf we ship. The timestamps from the log indicate akka only waited 5s (13:28:10,994 -> 13:28:16,074 ~ 5 sec). I would suggest checking the seed-node-timeout setting in the akka.conf on each node after deploy.
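One way to run that check is to read the deployed akka.conf on each node and pull out the seed-node-timeout value. The sketch below assumes the setting appears on its own line as "seed-node-timeout = <value>"; the function name and the sample text are hypothetical, and the sample is not a copy of the shipped file.

```python
import re

# Matches a line such as "seed-node-timeout = 12s" (also accepts ":" as separator).
SEED_TIMEOUT = re.compile(
    r'^\s*seed-node-timeout\s*[=:]\s*"?(\d+[ \t]*[a-z]*)', re.MULTILINE
)


def find_seed_node_timeout(conf_text):
    """Return the configured seed-node-timeout as text, or None if it is not set."""
    match = SEED_TIMEOUT.search(conf_text)
    return match.group(1).strip() if match else None


# Illustrative fragment only; the real akka.conf on a node will contain more settings.
SAMPLE = """
odl-cluster-data {
  akka {
    cluster {
      seed-node-timeout = 12s
    }
  }
}
"""

timeout = find_seed_node_timeout(SAMPLE)
if timeout is None:
    print("seed-node-timeout not set; akka falls back to its default")
else:
    print(f"seed-node-timeout = {timeout}")
```

In practice the text would come from the akka.conf file on each deployed node; a result of None means the node is running with the default timeout.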

Comment by Luis Gomez [ 11/Feb/16 ]

You are right. After a deeper look, we are still using the integration templates for this, so I am assigning the bug to myself. Sorry for the ping-pong.

BR/Luis

Comment by Tom Pantelis [ 11/Feb/16 ]

NP - as long as we get it right in the end

Comment by Luis Gomez [ 11/Feb/16 ]

Tom, can you please review this:

https://git.opendaylight.org/gerrit/#/c/34432

Comment by Vratko Polak [ 25/Feb/16 ]

> https://git.opendaylight.org/gerrit/#/c/34432

Merged, so setting as FIXED.
