CONTROLLER-1738

RequestTimeoutException due to "Shard has no current leader" after shutdown-shard-replica with ShardLeaderStateChanged not delivered


    Details

    • Type: Bug
    • Status: Confirmed
    • Resolution: Unresolved
    • Affects Version/s: unspecified
    • Fix Version/s: None
    • Component/s: clustering
    • Labels:
      None
    • Environment:

      Operating System: All
      Platform: All

    • External issue ID:
      8794

      Description

      This is similar to CONTROLLER-1717 but a different message has been lost here.
      I believe CONTROLLER-1714 was exactly this, but with less evidence in karaf log.

      The Robot failure [0] is the usual 120 s timeout, a symptom that multiple bugs can cause (seen here from a transaction writer for a module-based shard, using the tell-based protocol):
      RequestTimeoutException: Timed out after 120.029805238seconds

      Looking at the karaf log [1] of member-1 (writer, old leader), we can see that leadership was successfully transferred at 04:04:59,250, but the notification about the new leader was lost:
      2017-07-04 04:04:59,252 | INFO | lt-dispatcher-42 | LocalActorRef | 174 - com.typesafe.akka.slf4j - 2.4.18 | Message [org.opendaylight.controller.cluster.datastore.messages.ShardLeaderStateChanged] from Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#145361760] to Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config/member-1-shard-default-config-notifier#-591265397] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

      So the member knew it was a Follower, but it was unable to tell the client who the new leader is.
      2017-07-04 04:05:19,248 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | Failed to resolve shard
      java.util.concurrent.TimeoutException: Shard has no current leader

      Perhaps there is a common underlying bug that causes occasional undelivered messages, and we see different symptoms depending on which message gets lost.
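
      One way to check this hypothesis across several runs is to tally which message types show up in Akka dead-letter lines of a karaf log. The following is only a minimal sketch: the regular expression is inferred from the single dead-letter line quoted above, and the helper names (count_dead_letters, count_in_gzip) are hypothetical, not part of any OpenDaylight tooling.

```python
import gzip
import re
from collections import Counter

# Pattern inferred from the dead-letter line quoted in this report (an assumption,
# not an official format specification): "Message [<class>] from ... was not delivered".
DEAD_LETTER = re.compile(r"Message \[(?P<msg>[^\]]+)\] from .* was not delivered")


def count_dead_letters(lines):
    """Tally undelivered messages by the class name reported in each log line."""
    counts = Counter()
    for line in lines:
        match = DEAD_LETTER.search(line)
        if match:
            counts[match.group("msg")] += 1
    return counts


def count_in_gzip(path):
    """Same tally for a gzip-compressed karaf log stored locally (hypothetical helper)."""
    with gzip.open(path, "rt", errors="replace") as log:
        return count_dead_letters(log)
```

      Running count_in_gzip on a downloaded copy of a log such as [1] and comparing the tallies between passing and failing runs would show whether lost ShardLeaderStateChanged messages correlate with the timeout. Note that Akka stops logging dead letters after a configurable limit ('akka.log-dead-letters'), so the tally is a lower bound.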

      [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/log.html.gz#s1-s20-t1-k2-k8
      [1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/odl1_karaf.log.gz

              People

              Assignee:
              Unassigned
              Reporter:
              Vratko Polak (vrpolak)