[CONTROLLER-1613] Clustering: Member fails to re-start sometimes in csit -all- jobs Created: 27/Feb/17  Updated: 25/Jul/23  Resolved: 10/Apr/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Andrej Mak
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 7858

 Description   

The first symptom detected by the Robot suites is a constant 404 on the jolokia URL. This Carbon bug does not happen in the -only- jobs. This weekend it hit both Netconf and Controller -all- jobs; previously it was seen only once [0].
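As an illustration (not part of the original report), here is a minimal Python sketch of the kind of readiness probe the Robot suites perform: poll the member's jolokia URL and report the HTTP status. A healthy member answers 200, while a member that failed to restart keeps returning 404. The host, port and credentials below are assumed ODL defaults, not values taken from the job logs.

# Hypothetical probe; host, port and credentials are assumptions (typical ODL defaults).
import time
import urllib.error
import urllib.request

def poll_jolokia(host, port=8181, user="admin", password="admin", attempts=30, delay=10):
    url = f"http://{host}:{port}/jolokia"
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(password_mgr))
    for attempt in range(1, attempts + 1):
        try:
            with opener.open(url, timeout=5) as resp:
                print(f"attempt {attempt}: HTTP {resp.status}")  # 200 means the member is up
                return True
        except urllib.error.HTTPError as exc:
            print(f"attempt {attempt}: HTTP {exc.code}")  # a constant 404 is the failure symptom
        except OSError as exc:
            print(f"attempt {attempt}: {exc}")  # member not reachable at all
        time.sleep(delay)
    return False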

The karaf.log for this bug contains DeadlockMonitor reports about runtime-generated-mapping not finishing, followed by a "giving up" error from AbstractDataStore:

2017-02-26 04:02:37,092 | WARN | saction-32-34'}} | DeadlockMonitor | 131 - org.opendaylight.controller.config-manager - 0.6.0.SNAPSHOT | ModuleIdentifier{factoryName='runtime-generated-mapping', instanceName='runtime-mapping-singleton'} did not finish after 169982 ms
2017-02-26 04:02:40,606 | ERROR | Event Dispatcher | AbstractDataStore | 216 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up

Possibly this is just a performance bug (startup taking longer than expected), but when the restart succeeds [1], the instance is created in around 10 seconds. More probably, some ODL project causes WaitingServiceTracker to not find BindingToNormalizedNodeCodec.

[0] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-all-carbon/221/
[1] https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-all-carbon/199/



 Comments   
Comment by Robert Varga [ 27/Mar/17 ]

Is this still present?

Comment by Vratko Polak [ 28/Mar/17 ]

> Is this still present?

Yes, this still affects around half of the netconf runs [2]. Both the 404 in Robot [3] and the "giving up" message in karaf.log [4] are still there.

[2] https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-all-carbon/
[3] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-all-carbon/222/archives/log.html.gz#s1-s6-t14-k2-k2-k5-k1-k2-k1-k1-k2-k1-k4-k5
[4] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-all-carbon/222/archives/odl2_karaf.log.gz

Comment by Tomas Cere [ 31/Mar/17 ]

It seems the underlying cause is that the rejoining node could not rejoin the cluster:

2017-03-27 08:54:01,343 | WARN | ult-dispatcher-5 | JoinSeedNodeProcess | 155 - com.typesafe.akka.slf4j - 2.4.17 | Couldn't join seed nodes after [15] attmpts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.29.12.12:2550, akka.tcp://opendaylight-cluster-data@10.29.13.54:2550]

These warnings are all over the logs, which explains why the shard is not able to elect a leader. Now, why can't the node rejoin? Is it possible that there is an environment issue or an issue with the rejoin script?
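For completeness, a minimal Python sketch that could help distinguish an environment problem from a software one: it checks whether the akka remoting port of each seed node is reachable from the member that fails to rejoin. The addresses are taken from the WARN line above; the port (2550) and the idea that plain TCP reachability is the relevant check are assumptions.

# Hypothetical diagnostic; assumes the default akka remoting port 2550.
import socket

SEED_NODES = [("10.29.12.12", 2550), ("10.29.13.54", 2550)]  # addresses from the WARN line above

for host, port in SEED_NODES:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")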

Comment by Vratko Polak [ 04/Apr/17 ]

> is it possible that there is an environment issue or an issue with the rejoin script?

Unlikely. The suite is the same as in the -only- job [5], where it passes.

[5] https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-only-carbon

Comment by Andrej Mak [ 10/Apr/17 ]

The last runs of https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-all-carbon/ passed, so the issue seems to be resolved for now.
