controller / CONTROLLER-1524

Clustering: Jolokia fails to report joining after 2 members were killed


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Done
    • clustering
    • Operating System: All
      Platform: All
    • 6013

    Description

      This is a recent Boron regression detected by system tests.

      The test scenario starts with a 3-node cluster, kills 2 members, and then boots one member back up. The failure occurs after the booted-up member reports it is in sync; the symptom is that 30 seconds is not enough for a (car-config) shard leader to appear.
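
      For context, the shard-leader check boils down to a repeated Jolokia read of the shard's ShardStats mbean until a leader shows up or the 30-second limit expires. Below is a minimal sketch of such a poll (a hypothetical stand-alone client, not the Robot keyword itself; it assumes the default Jolokia endpoint on port 8181 and admin:admin credentials, and reuses the mbean name quoted in the NPE response further down, with the member index and IP adjusted for the probed node):

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.io.InputStreamReader;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.charset.StandardCharsets;
      import java.util.Base64;

      // Hypothetical poller illustrating the check; endpoint, port and credentials are assumptions.
      public class ShardLeaderCheck {
          public static void main(String[] args) throws Exception {
              // Jolokia read of the car-config ShardStats mbean on the probed member
              // (IP taken from the logs below; adjust member name and address as needed).
              String url = "http://10.30.11.155:8181/jolokia/read/"
                      + "org.opendaylight.controller:Category=Shards,"
                      + "name=member-1-shard-car-config,type=DistributedConfigDatastore";
              String auth = Base64.getEncoder()
                      .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

              long deadline = System.currentTimeMillis() + 30_000; // the 30 second limit used by the test
              while (System.currentTimeMillis() < deadline) {
                  try {
                      HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                      conn.setRequestProperty("Authorization", "Basic " + auth);
                      StringBuilder body = new StringBuilder();
                      try (BufferedReader in = new BufferedReader(
                              new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                          String line;
                          while ((line = in.readLine()) != null) {
                              body.append(line);
                          }
                      }
                      // ShardStats is expected to expose a Leader attribute; a member name there
                      // means a leader has been elected for the shard.
                      if (body.indexOf("\"Leader\":\"member-") >= 0) {
                          System.out.println("car-config shard leader reported");
                          return;
                      }
                  } catch (IOException e) {
                      // node not answering yet, keep polling
                  }
                  Thread.sleep(1_000);
              }
              System.out.println("no car-config shard leader reported within 30 s");
          }
      }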

      During May, sync was typically achieved in 36 seconds [0], and a shard leader was confirmed in under 1 second [1]. In June, sync status is achieved in 18 seconds [2], but searching for the shard leader still shows an NPE in the Jolokia output after 30 seconds [3].

      Looking at the karaf.log of the booting-up member (index 1) [4], a 44-second gap is seen, with no explanation of what Jolokia was waiting for:

      2016-06-06 09:02:52,218 | INFO | config-pusher | ConfigPusherImpl | 102 - org.opendaylight.controller.config-persister-impl - 0.5.0.SNAPSHOT | Successfully pushed configuration snapshot 10-restconf-service.xml(odl-restconf,odl-restconf)
      2016-06-06 09:03:36,456 | INFO | lt-dispatcher-43 | kka://opendaylight-cluster-data) | 139 - com.typesafe.akka.slf4j - 2.4.4 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.155:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] is JOINING, roles [member-2]

      And this is from the karaf.log of the member that was still up (index 3):
      2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: Received role changed for member-3-shard-entity-ownership-operational from Candidate to Leader
      2016-06-06 09:02:50,297 | INFO | ult-dispatcher-4 | ShardManager | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
      2016-06-06 09:02:51,003 | WARN | lt-dispatcher-23 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:02:56,014 | WARN | lt-dispatcher-31 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:01,065 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:06,088 | WARN | lt-dispatcher-16 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:11,135 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:16,153 | WARN | lt-dispatcher-15 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:21,163 | WARN | ult-dispatcher-4 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:26,174 | WARN | lt-dispatcher-42 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:31,223 | WARN | lt-dispatcher-43 | ReliableDeliverySupervisor | 139 - com.typesafe.akka.slf4j - 2.4.4 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.11.182:2550]] Caused by: [Connection refused: /10.30.11.182:2550]
      2016-06-06 09:03:37,180 | INFO | ult-dispatcher-2 | ShardInformation | 155 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.SNAPSHOT | updatePeerAddress for peer member-2-shard-default-config with address akka.tcp://opendaylight-cluster-data@10.30.11.182:2550/user/shardmanager-config/member-2-shard-default-config

      Note that data manipulation test cases succeeded while member-2 was down, yet both karaf.log segments show only member-2-related activity.

      Strangely, after member-2 was brought back up, member-1 started reporting its shard status correctly, but member-2 started to show the NPE instead (making subsequent tests fail).
      Here is the beginning of the NPE response:
      {"request":

      {"mbean":"org.opendaylight.controller:Category=Shards,name=member-2-shard-car-config,type=DistributedConfigDatastore","type":"read"}

      ,"stacktrace":"java.lang.NullPointerException\n\tat com.google.common.base.Preconditions.checkNotNull(Preconditions.java:210)\n\tat com.google.common.base.Joiner.toString(Joiner.java:433)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:359)\n\tat com.google.common.base.Joiner$MapJoiner.appendTo(Joiner.java:393)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:419)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:408)\n\tat com.google.common.base.Joiner$MapJoiner.join(Joiner.java:330)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.toStringMap(ShardStats.java:319)\n\tat org.opendaylight.controller.cluster.datastore.jmx.mbeans.shard.ShardStats.getPeerAddresses(ShardStats.java:315)\n
      ...
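
      The top frames point at Guava's Joiner: ShardStats evidently joins the shard's peer-address map into a string, and Joiner.toString() rejects null keys and values via Preconditions.checkNotNull(). The likely trigger is therefore a peer whose address is still null (unresolved) in that map, which would fit member-2 having just rejoined. A minimal, hypothetical reproduction of just the Joiner behaviour (addresses invented for illustration; this is not the actual ShardStats code):

      import com.google.common.base.Joiner;

      import java.util.LinkedHashMap;
      import java.util.Map;

      // Stand-alone illustration of why joining a peer-address map with a null value throws.
      public class JoinerNpeDemo {
          public static void main(String[] args) {
              Map<String, String> peerAddresses = new LinkedHashMap<>();
              peerAddresses.put("member-1-shard-car-config",
                      "akka.tcp://opendaylight-cluster-data@10.30.11.155:2550"); // invented, shortened address
              // A peer whose address has not been resolved yet, e.g. a member that is still down.
              peerAddresses.put("member-3-shard-car-config", null);

              // MapJoiner runs each key and value through Joiner.toString(), which calls
              // Preconditions.checkNotNull() - the exact top frame of the reported stack trace.
              String joined = Joiner.on(", ").withKeyValueSeparator(": ").join(peerAddresses);
              System.out.println(joined); // never reached: NullPointerException thrown above

              // Tolerating unresolved peers would need something along the lines of:
              // Joiner.on(", ").useForNull("null").withKeyValueSeparator(": ").join(peerAddresses);
          }
      }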

      [0] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k2-k5
      [1] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/513/robot/report/log.html#s1-s5-t3-k3
      [2] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k2-k5
      [3] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/529/robot/report/log.html#s1-s5-t3-k3-k14-k1-k1-k1-k1-k5-k2-k1-k5-k1-k4-k1
      [4] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-clustering-only-boron/lastSuccessfulBuild/artifact/odl1_karaf.log.xz


      People

        Assignee: Unassigned
        Reporter: Vratko Polak (vrpolak)
        Votes: 0
        Watchers: 2
