CONTROLLER-1693

UnreachableMember during remove-shard-replica prevents a new leader from being elected

    • Type: Bug
    • Resolution: Done
    • None
    • None
    • clustering
    • None
    • Operating System: All
      Platform: All

    • 8524

      This manifests as a CSIT failure [0]. The UnreachableMember events themselves are a separate issue (see CONTROLLER-1645, for example). It is possible that cluster members end up with an inconsistent shard configuration.

      Karaf.log on member-1 [1] shows the replica removal started at 01:55:43,472, then this happened:
      2017-05-22 01:56:06,569 | WARN | lt-dispatcher-32 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.0.Carbon | member-1-shard-default-config: Failed to transfer leadership in 10.01 s
      2017-05-22 01:56:06,572 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | Stopping Shard member-1-shard-default-config

      Finally, the test teardown started adding the replica back at 01:56:28,959.
      Thus, even though the test waited about 45 seconds after the removal started, the remaining members had only about 22 seconds (counting from the shard stopping at 01:56:06,572) to realize the previous leader was gone (we can add more time to the test if needed).
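The timing argument can be checked with simple arithmetic on the timestamps quoted from the member-1 log. The following is only a sketch: the variable names are hypothetical, and it assumes the two timestamps given without a date fall on 2017-05-22, the date shown in the log lines.

```python
from datetime import datetime

# Karaf timestamps use a comma before the milliseconds.
FMT = "%Y-%m-%d %H:%M:%S,%f"


def parse(ts: str) -> datetime:
    """Parse a karaf.log style timestamp into a datetime."""
    return datetime.strptime(ts, FMT)


# Assumption: all three events happened on 2017-05-22.
removal_started = parse("2017-05-22 01:55:43,472")  # remove-shard-replica begins
shard_stopped = parse("2017-05-22 01:56:06,572")    # "Stopping Shard" on member-1
readd_started = parse("2017-05-22 01:56:28,959")    # teardown adds the replica back

# Total time the test effectively waited after starting the removal.
test_wait = (readd_started - removal_started).total_seconds()

# Time the other members had to notice the old leader was gone.
election_window = (readd_started - shard_stopped).total_seconds()

print(f"test wait:       {test_wait:.3f} s")
print(f"election window: {election_window:.3f} s")
```

The second number is the one that matters here: most of the test's wait was consumed by the failed leadership transfer before the shard was stopped.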

      As the member-3 karaf.log [2] shows no activity between 01:56:03,165 and 01:56:56,244, it looks as if member-1 was perhaps somehow still the leader. However, the "has no leader" response [3] from member-1 when adding the replica back proves there really was no leader, at least from member-1's point of view.
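A silent period like the one in the member-3 log can be located mechanically by scanning for the longest gap between consecutive timestamped lines. This is a minimal sketch, not a tool used in the original analysis: the `largest_gap` helper is hypothetical, the sample lines are placeholders that reuse only the two timestamps quoted above, and the date 2017-05-22 is assumed.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm" of a karaf.log entry.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def largest_gap(lines):
    """Return (seconds, start, end) for the longest silence between entries."""
    prev = None
    best = (0.0, None, None)
    for line in lines:
        match = TS.match(line)
        if not match:
            continue  # continuation lines (e.g. stack traces) have no timestamp
        cur = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev is not None:
            gap = (cur - prev).total_seconds()
            if gap > best[0]:
                best = (gap, prev, cur)
        prev = cur
    return best


# Placeholder entries; only the timestamps come from the report.
sample = [
    "2017-05-22 01:56:03,165 | INFO | (hypothetical entry)",
    "    (hypothetical continuation line)",
    "2017-05-22 01:56:56,244 | INFO | (hypothetical entry)",
]

gap_seconds, gap_start, gap_end = largest_gap(sample)
print(f"longest silence: {gap_seconds:.3f} s from {gap_start} to {gap_end}")
```

To run it against a real log, pass the lines of the decompressed karaf.log instead of `sample`.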

      Every member shows multiple UnreachableMember messages. Not sure if the subsequent ones are the cause or the result of missing the leader.

      [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s36-t1-k2-k13-k1-k3-k1
      [1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/odl1_karaf.log.gz
      [2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/odl3_karaf.log.gz
      [3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s36-t1-k2-k14-k2-k3-k1-k4-k7-k1

            Assignee: Unassigned
            Reporter: Vratko Polak (vrpolak)