CONTROLLER-1707: Client not reconnecting successfully after leader movement


    • Type: Bug
    • Resolution: Done
    • Component: clustering
    • Operating System: All
      Platform: All


      This is somewhat similar to CONTROLLER-1705, but this time the initial leader movement was caused by make-leader-local. The test was running on the Sandbox with codebase [0]. Of 6 similar test cases, this happened in only one.
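
      For context, the leadership move in such tests is normally triggered by invoking the cluster-admin:make-leader-local RPC on the node that should take over. The following is only an illustrative sketch of such an invocation over RESTCONF; the host, port, credentials and the exact way the suite calls the RPC are assumptions, not taken from this test run.

      // Hypothetical helper: POST cluster-admin:make-leader-local for the default
      // config shard via RESTCONF. Endpoint, credentials and JSON body follow
      // Carbon-era defaults and are assumptions for illustration only.
      import java.io.OutputStream;
      import java.net.HttpURLConnection;
      import java.net.URL;
      import java.nio.charset.StandardCharsets;
      import java.util.Base64;

      public class MakeLeaderLocal {
          public static void main(String[] args) throws Exception {
              // Ask the default config shard to move its leader to this node.
              String body = "{\"input\":{\"shard-name\":\"default\",\"data-store-type\":\"config\"}}";
              URL url = new URL("http://127.0.0.1:8181/restconf/operations/cluster-admin:make-leader-local");
              HttpURLConnection conn = (HttpURLConnection) url.openConnection();
              conn.setRequestMethod("POST");
              conn.setRequestProperty("Content-Type", "application/json");
              conn.setRequestProperty("Authorization", "Basic "
                  + Base64.getEncoder().encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8)));
              conn.setDoOutput(true);
              try (OutputStream out = conn.getOutputStream()) {
                  out.write(body.getBytes(StandardCharsets.UTF_8));
              }
              // A 200 response only means the shard was asked to transfer leadership here.
              System.out.println("make-leader-local returned HTTP " + conn.getResponseCode());
          }
      }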

      The test case was using a module-based shard and the tell-based protocol, and it was moving leadership away from member-3, where the transaction-writer was running.
      Its huge karaf.log [1] contains many debug lines; here is the relevant non-debug segment:

      2017-06-06 17:36:11,744 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier

      {frontend=member-3-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811, sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-config, generation=0}

      , cookie=0, backend=ShardBackendInfo{actor=Actorakka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811, sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}
      2017-06-06 17:36:11,744 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-config: refreshing backend for shard 0
      2017-06-06 17:36:11,756 | INFO | lt-dispatcher-31 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Leader): Term 3 in "RequestVote [term=3, candidateId=member-1-shard-default-config, lastLogIndex=21003, lastLogTerm=2]" message is greater than leader's term 2 - switching to Follower
      2017-06-06 17:36:11,757 | INFO | lt-dispatcher-31 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3
      2017-06-06 17:36:11,757 | INFO | lt-dispatcher-30 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
      2017-06-06 17:36:11,757 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | RoleChangeNotifier for member-3-shard-default-config , received role change from Leader to Follower
      2017-06-06 17:36:11,757 | INFO | lt-dispatcher-22 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-default-config from Leader to Follower
      2017-06-06 17:36:11,769 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): The prevLogIndex 21001 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 2 - lastIndex: 21003, snapshotIndex: 21002
      2017-06-06 17:36:11,769 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=21003, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
      2017-06-06 17:36:11,770 | INFO | ult-dispatcher-5 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-3-shard-default-config: Successfully transferred leadership to member-1-shard-default-config in 445.5 ms
      2017-06-06 17:36:11,770 | INFO | ult-dispatcher-5 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=member-1-shard-default-config, leaderPayloadVersion=5]
      2017-06-06 17:36:11,771 | INFO | ult-dispatcher-5 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: All Shards are ready - data store config is ready, available count is 0
      2017-06-06 17:36:11,772 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false
      2017-06-06 17:36:11,787 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): The prevLogIndex 21001 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 2 - lastIndex: 21003, snapshotIndex: 21002
      2017-06-06 17:36:11,787 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=21003, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
      2017-06-06 17:36:11,787 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done true

      It is not clear why no reconnect succeeded after this:

      2017-06-06 17:36:11,820 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.control
      java.util.concurrent.TimeoutException: Connection attempt failed
          at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBacken
          at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(Ab
          at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
          at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
          at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443)
          at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
          at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
          at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
          at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
      Caused by: org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-clu
          at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:436)
          at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:306)
          at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
          at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntype
          at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
          at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
          at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
          at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
          at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scal
          at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
          at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
          at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
          at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
          at akka.actor.ActorCell.invoke(ActorCell.scala:495)
          at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
          at akka.dispatch.Mailbox.run(Mailbox.scala:224)
          at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
          at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
          at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
          at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
          at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
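
      The trace suggests the fresh connect attempt reached a shard that was not (or no longer) the leader and was rejected with NotLeaderException, which AbstractShardBackendResolver.wrap() then surfaced as a TimeoutException("Connection attempt failed"). Below is a very simplified, hypothetical sketch of that wrapping path; only the exception message, the NotLeaderException cause and the wrap() name come from the trace, everything else is illustrative.

      // Simplified, hypothetical model of the failure path seen above: the backend
      // rejects the connect because it is not the leader, and the client-side
      // resolver reports any such failure as a connection-attempt timeout.
      import java.util.concurrent.CompletableFuture;
      import java.util.concurrent.TimeoutException;

      final class ConnectPathSketch {
          /** Stand-in for the shard: rejects connect requests while it is not the leader. */
          static CompletableFuture<String> connectToShard(boolean isLeader) {
              CompletableFuture<String> reply = new CompletableFuture<>();
              if (!isLeader) {
                  // In the real system this is a NotLeaderException sent back to the client.
                  reply.completeExceptionally(new IllegalStateException("not the current leader"));
              } else {
                  reply.complete("ShardBackendInfo{...}");
              }
              return reply;
          }

          public static void main(String[] args) {
              // Stand-in for AbstractShardBackendResolver.wrap(): any failure of the
              // connect stage is wrapped as TimeoutException("Connection attempt failed").
              connectToShard(false).whenComplete((info, failure) -> {
                  if (failure != null) {
                      TimeoutException wrapped = new TimeoutException("Connection attempt failed");
                      wrapped.initCause(failure);
                      wrapped.printStackTrace();
                  } else {
                      System.out.println("connected to " + info);
                  }
              });
          }
      }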

      Finally, the writer failed with RequestTimeoutException [2].

      [0] https://git.opendaylight.org/gerrit/#/c/58355/2
      [1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl3_karaf.log.gz
      [2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/log.html.gz#s1-s6-t1-k2-k11-k1

            Assignee: Robert Varga (rovarga)
            Reporter: Vratko Polak (vrpolak)