[CONTROLLER-1524] Clustering: Jolokia fails to report joining after 2 members were killed Created: 06/Jun/16  Updated: 25/Jul/23  Resolved: 14/Jun/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 6013

 Description   

This is a recent Boron regression detected by system tests.

The test scenario starts with a 3-node cluster, kills 2 members, and then boots one member back up. The failure occurs after the booted-up member reports it is in sync: the symptom is that 30 seconds is not enough for a (car-config) shard leader to appear.

During May, sync was typically achieved in 36 seconds [0], and the shard leader was confirmed in under 1 second [1]. In June, sync status is achieved in 18 seconds [2], but the search for a shard leader still shows an NPE in the Jolokia output after 30 seconds [3].
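
For reference, the shard-leader check is essentially a Jolokia read of the shard MBean. Below is a minimal Java sketch (not the CSIT/Robot code) that polls the car-config shard MBean for a Leader attribute within the 30-second window. The port 8181, the /jolokia/read path, the admin:admin credentials and the "Leader" attribute name are assumptions based on a default odl-jolokia installation; the member name and IP are taken from the logs only for illustration.

// Hypothetical sketch of the shard-leader poll the test performs via Jolokia.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ShardLeaderPoll {

    private static final String MBEAN =
        "org.opendaylight.controller:Category=Shards,"
        + "name=member-1-shard-car-config,type=DistributedConfigDatastore";

    public static void main(String[] args) throws Exception {
        long deadline = System.currentTimeMillis() + 30_000;   // the 30 s test timeout
        while (System.currentTimeMillis() < deadline) {
            String json = readJolokia("http://10.30.11.155:8181/jolokia/read/" + MBEAN);
            // A real client would parse the JSON; here we only look for a non-empty Leader value.
            if (json.contains("\"Leader\":\"member-")) {
                System.out.println("car-config shard leader found");
                return;
            }
            Thread.sleep(1_000);
        }
        System.out.println("no car-config shard leader within 30 s");
    }

    private static String readJolokia(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Authorization", "Basic YWRtaW46YWRtaW4=");  // admin:admin, assumed default
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}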

Looking at the karaf.log of the booting-up member (index 1) [4], a 44-second gap is seen, with no explanation of what Jolokia was waiting for:

2016-06-06 09:02:52,218 | INFO | config-pusher | ConfigPusherImpl | 102 - org.opendaylight.controller.config-persister-impl - 0.5.0.SNAPSHOT | Successfully pushed configuration snapshot
10-restconf-service.xml(odl-restconf,odl-restconf)
2016-06-06 09:03:36,456 | INFO | lt-dispatcher-43 | kka://opendaylight-cluster-data) | 139 - com.typesafe.akka.slf4j - 2.4.4 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.155:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] is JOINING, roles [member-2]

And this is from the karaf.log of the member that was still up (index 3):
2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: Received role changed for member-3-shard-entity-ownership-operational from Candidate to Leader
2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
2016-06-06 09:02:51,003 | WARN | lt-dispatcher-23 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:02:56,014 | WARN | lt-dispatcher-31 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:01,065 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:06,088 | WARN | lt-dispatcher-16 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:11,135 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:16,153 | WARN | lt-dispatcher-15 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:21,163 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:26,174 | WARN | lt-dispatcher-42 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:31,223 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:37,180 | INFO | ult-dispatcher-2 | ShardInformation | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | updatePeerAddress for peer member-2-shard-default-config with address akka.tcp://opendaylight-cluster-data@10.30.11.182:2550/user/shardmanager-config/member-2-shard-default-config

Note that the data manipulation test cases were successful while member-2 was down, yet both karaf.log segments show activity only with respect to member-2.

Strangely, after member-2 was brought up, member-1 started reporting its shard status correctly, but member-2 started to show the NPE instead (making subsequent tests fail).
Here is the beginning of the NPE:
{"request":

{"mbean":"org.opendaylight.controller:Category=Shards,name=member-2-shard-car-config,type=DistributedConfigDatastore","type":"read"}

,"stacktrace":"java.lang.NullPointerException\n\tat com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210)\n\tat com.google.common.base.Joiner.toString(Joiner.java:433)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:359)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:393)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:419)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:408)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:330)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.toStringMap(ShardStats.java:319)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.getPeerAddresses(ShardStats.java:315)\n
...
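
For context, the Guava MapJoiner used by ShardStats.toStringMap throws exactly this kind of NullPointerException (via Preconditions.checkNotNull) whenever any map value is null, unless useForNull() is configured. The sketch below reproduces the failure mode; the map contents are invented for illustration, and the null presumably stands for a peer address that has not been resolved yet. It is not the code from the linked fix.

// Hypothetical reproduction of the MapJoiner NPE seen in the Jolokia response.
import com.google.common.base.Joiner;
import java.util.HashMap;
import java.util.Map;

public class MapJoinerNpeDemo {
    public static void main(String[] args) {
        Map<String, String> peerAddresses = new HashMap<>();
        peerAddresses.put("member-1-shard-car-config",
            "akka.tcp://opendaylight-cluster-data@10.30.11.155:2550/user/shardmanager-config/member-1-shard-car-config");
        peerAddresses.put("member-3-shard-car-config", null);  // peer whose address is not resolved yet

        Joiner.MapJoiner joiner = Joiner.on(", ").withKeyValueSeparator(": ");

        // Null-tolerant variant: prints "member-3-shard-car-config: null" instead of throwing.
        System.out.println(joiner.useForNull("null").join(peerAddresses));

        // This mirrors the failing call path and throws NullPointerException
        // from Preconditions.checkNotNull, as in the stacktrace above.
        System.out.println(joiner.join(peerAddresses));
    }
}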

[0] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k2-k5
[1] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k3
[2] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k2-k5
[3] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k3-k14-k1-k1-k1-k1-k5-k2-k1-k5-k1-k4-k1
[4] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/lastSuccessfulBuild/artifact/odl1_karaf.log.xz



 Comments   
Comment by Vratko Polak [ 06/Jun/16 ]

Also note that there are previous suites which pass. Car_Failover_Crud kills at most 1 member at a time, Car_Persistence_Recovery kills all 3 members (and then brings them up) at once.

Comment by Luis Gomez [ 08/Jun/16 ]

This is a regression vs Beryllium, therefore raising priority to critical.

Comment by Tom Pantelis [ 14/Jun/16 ]

The NPE is fixed by https://git.opendaylight.org/gerrit/#/c/40173/.

Comment by Luis Gomez [ 14/Jun/16 ]

I agree, controller cluster jobs are back GREEN.
