Details
Type: Bug
Status: Resolved
Resolution: Done
Operating System: All
Platform: All
6013
Description
This is a recent Boron regression detected by system tests.
The test scenario starts with a 3-node cluster, kills 2 members, and then boots one member back up. The failure occurs after the rebooted member reports it is in sync: the symptom is that 30 seconds is not enough for a (car-config) shard leader to appear.
During May, sync was typically achieved in 36 seconds [0], and the shard leader was confirmed in under 1 second [1]. In June, sync status is achieved in 18 seconds [2], but searching for the shard leader still shows an NPE in the Jolokia output after 30 seconds [3].
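For reference, the shard-leader check the suite performs is essentially a Jolokia read of the shard's ShardStats MBean. Below is a minimal Java sketch of such a probe; the host, credentials, and exact MBean object name are assumptions, not taken from the test code:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ShardLeaderProbe {
    public static void main(String[] args) throws Exception {
        // Host and credentials are assumptions; the run above used 10.30.11.x nodes.
        String host = "10.30.11.155";
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        // Object name modeled on the ShardStats MBean seen in the stack trace below;
        // the exact property layout is an assumption.
        String mbean = "org.opendaylight.controller:Category=Shards,"
                + "name=member-1-shard-car-config,type=DistributedConfigDatastore";

        URL url = new URL("http://" + host + ":8181/jolokia/read/" + mbean);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Authorization", "Basic " + auth);

        // The suite presumably parses the JSON reply for a non-empty Leader attribute;
        // the failing runs instead get a reply carrying the "stacktrace" field quoted below.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}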
Looking at the karaf.log of the rebooted member (index 1) [4], a 44-second gap is seen, with no explanation of what Jolokia was waiting for:
2016-06-06 09:02:52,218 | INFO | config-pusher | ConfigPusherImpl | 102 - org.opendaylight.controller.config-persister-impl - 0.5.0.SNAPSHOT | Successfully pushed configuration snapshot
10-restconf-service.xml(odl-restconf,odl-restconf)
2016-06-06 09:03:36,456 | INFO | lt-dispatcher-43 | kka://opendaylight-cluster-data) | 139 - com.typesafe.akka.slf4j - 2.4.4 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.155:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] is JOINING, roles [member-2]
And this is from the karaf.log of the member that was still up (index 3):
2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: Received role changed for member-3-shard-entity-ownership-operational from Candidate to Leader
2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
2016-06-06 09:02:51,003 | WARN | lt-dispatcher-23 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:02:56,014 | WARN | lt-dispatcher-31 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:01,065 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:06,088 | WARN | lt-dispatcher-16 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:11,135 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:16,153 | WARN | lt-dispatcher-15 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:21,163 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:26,174 | WARN | lt-dispatcher-42 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:31,223 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
2016-06-06 09:03:37,180 | INFO | ult-dispatcher-2 | ShardInformation | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | updatePeerAddress for peer member-2-shard-default-config with address akka.tcp://opendaylight-cluster-data@10.30.11.182:2550/user/shardmanager-config/member-2-shard-default-config
Note that data manipulation test cases were successful while member-2 was down, yet both karaf.log segments show activity only with respect to member-2.
Strangely, after member-2 was brought back up, member-1 started reporting its shard status correctly, but member-2 started showing the NPE instead (making subsequent tests fail).
Here is the beginning of the NPE:
{"request":
,"stacktrace":"java.lang.NullPointerException\n\tat com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210)\n\tat com.google.common.base.Joiner.toString(Joiner.java:433)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:359)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:393)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:419)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:408)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:330)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.toStringMap(ShardStats.java:319)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.getPeerAddresses(ShardStats.java:315)\n
...
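The trace points at Guava's MapJoiner null-checking the values of the peer-address map in ShardStats.toStringMap, so the likely trigger is a peer (presumably member-2) whose address was still null when getPeerAddresses was read. A minimal sketch that reproduces the same exception; the map contents are a guess, not the actual ShardStats state:

import com.google.common.base.Joiner;
import java.util.HashMap;
import java.util.Map;

public class PeerAddressJoinDemo {
    public static void main(String[] args) {
        Map<String, String> peerAddresses = new HashMap<>();
        peerAddresses.put("member-3-shard-car-config",
                "akka.tcp://opendaylight-cluster-data@10.30.11.155:2550"
                        + "/user/shardmanager-config/member-3-shard-car-config");
        // Assumption: a peer whose address has not yet been resolved is stored as null.
        peerAddresses.put("member-2-shard-car-config", null);

        // MapJoiner null-checks every key and value, so the null address throws the
        // same NullPointerException as in the stack trace above.
        Joiner.on(", ").withKeyValueSeparator(": ").join(peerAddresses);

        // A tolerant variant avoids the NPE:
        // Joiner.on(", ").withKeyValueSeparator(": ").useForNull("null").join(peerAddresses);
    }
}

If that is the actual failure mode, having ShardStats skip unresolved peers (or use a null placeholder as in the commented variant) would let the Jolokia read succeed during the window before updatePeerAddress arrives.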
[0] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k2-k5
[1] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k3
[2] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k2-k5
[3] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k3-k14-k1-k1-k1-k1-k5-k2-k1-k5-k1-k4-k1
[4] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/lastSuccessfulBuild/artifact/odl1_karaf.log.xz