[CONTROLLER-1707] Client not reconnecting successfully after leader movement Created: 07/Jun/17  Updated: 25/Jul/23  Resolved: 24/Aug/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1737 Current transaction in state READY le... Resolved
External issue ID: 8618

 Description   

This is somewhat similar to CONTROLLER-1705, but this time the initial leader movement was caused by make-leader-local. The test was running on Sandbox with codebase [0]. Of the 6 similar test cases, this happened in only one.

The test case was using a module-based shard and the tell-based protocol, and it was moving leadership away from member-3, where the transaction writer was running.
Its huge karaf.log [1] contains many debug lines; here is the relevant non-debug segment:

2017-06-06 17:36:11,744 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}
2017-06-06 17:36:11,744 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-config: refreshing backend for shard 0
2017-06-06 17:36:11,756 | INFO | lt-dispatcher-31 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Leader): Term 3 in "RequestVote [term=3, candidateId=member-1-shard-default-config, lastLogIndex=21003, lastLogTerm=2]" message is greater than leader's term 2 - switching to Follower
2017-06-06 17:36:11,757 | INFO | lt-dispatcher-31 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3
2017-06-06 17:36:11,757 | INFO | lt-dispatcher-30 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-06-06 17:36:11,757 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | RoleChangeNotifier for member-3-shard-default-config , received role change from Leader to Follower
2017-06-06 17:36:11,757 | INFO | lt-dispatcher-22 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-default-config from Leader to Follower
2017-06-06 17:36:11,769 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): The prevLogIndex 21001 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 2 - lastIndex: 21003, snapshotIndex: 21002
2017-06-06 17:36:11,769 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=21003, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
2017-06-06 17:36:11,770 | INFO | ult-dispatcher-5 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-3-shard-default-config: Successfully transferred leadership to member-1-shard-default-config in 445.5 ms
2017-06-06 17:36:11,770 | INFO | ult-dispatcher-5 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=member-1-shard-default-config, leaderPayloadVersion=5]
2017-06-06 17:36:11,771 | INFO | ult-dispatcher-5 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: All Shards are ready - data store config is ready, available count is 0
2017-06-06 17:36:11,772 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false
2017-06-06 17:36:11,787 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): The prevLogIndex 21001 was found in the log but the term -1 is not equal to the append entriesprevLogTerm 2 - lastIndex: 21003, snapshotIndex: 21002
2017-06-06 17:36:11,787 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=21003, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]
2017-06-06 17:36:11,787 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done true

It is not clear why no reconnect succeeded after this:

2017-06-06 17:36:11,820 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.control
java.util.concurrent.TimeoutException: Connection attempt failed
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBacken
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(Ab
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
    at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-clu
    at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:436)
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:306)
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntype
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scal
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Finally, the writer failed with RequestTimeoutException [2].
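
For context on the stack trace above: the failure of the connect attempt (here the NotLeaderException from the shard) is surfaced to the frontend as a TimeoutException by a completion-stage wrapper. The following is a minimal, hypothetical Java sketch of that wrapping pattern; the method names echo the trace (wrap, connectShard), but the body is illustrative and not the actual cds-access-client code.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

final class ConnectWrapSketch {
    // Re-wraps any failure of the connect future (e.g. NotLeaderException) as a
    // TimeoutException, which is what the WARN line above reports.
    static <T> CompletableFuture<T> wrap(final CompletableFuture<T> connectFuture) {
        final CompletableFuture<T> ret = new CompletableFuture<>();
        connectFuture.whenComplete((result, failure) -> {
            if (failure != null) {
                final TimeoutException wrapped = new TimeoutException("Connection attempt failed");
                wrapped.initCause(failure);            // original cause is retained
                ret.completeExceptionally(wrapped);
            } else {
                ret.complete(result);
            }
        });
        return ret;
    }
}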

[0] https://git.opendaylight.org/gerrit/#/c/58355/2
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl3_karaf.log.gz
[2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/log.html.gz#s1-s6-t1-k2-k11-k1



 Comments   
Comment by Tomas Cere [ 07/Jun/17 ]

member-1 switched to leader at 17:36:11,800; it seems the PrimaryShardInfoCache is not flushed on NotLeaderException.
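
A minimal, hypothetical sketch of what this comment suggests, assuming a simple per-shard cache of primary-shard info (names are illustrative, not the actual PrimaryShardInfoCache API): the missing piece would be evicting the cached entry when the backend answers with NotLeaderException, so the next lookup re-resolves the new leader.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PrimaryShardCacheSketch<T> {
    private final ConcurrentMap<String, T> primaryByShard = new ConcurrentHashMap<>();

    T lookup(final String shardName) {
        return primaryByShard.get(shardName);
    }

    void populate(final String shardName, final T primaryInfo) {
        primaryByShard.put(shardName, primaryInfo);
    }

    // Suggested missing piece: drop the stale entry when the shard replies with
    // NotLeaderException, so the next lookup re-resolves the new leader.
    void invalidateOnNotLeader(final String shardName) {
        primaryByShard.remove(shardName);
    }
}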

Comment by Vratko Polak [ 12/Jun/17 ]

This is affecting one longevity job [0]. See karaf.log [1] between 04:43:03,671 and 04:45:08,911.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/10/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/10/odl2_karaf.log.gz

Comment by Vratko Polak [ 12/Jun/17 ]

Also, this affects the "remote to remote" test case [5] of the module-shard make-leader-local test, which was consistently passing previously.

Looking at karaf.log [6], frequent re-syncing on followers might be one of the reasons why this Bug happens quite frequently:

2017-06-10 10:43:50,814 | INFO | rd-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false
2017-06-10 10:43:50,826 | INFO | rd-dispatcher-53 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done true
2017-06-10 10:43:50,859 | INFO | rd-dispatcher-37 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false
2017-06-10 10:43:50,859 | INFO | rd-dispatcher-37 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done true

[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/738/log.html.gz#s1-s24-t3-k2-k10
[6] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/738/odl3_karaf.log.gz

Comment by Vratko Polak [ 14/Jun/17 ]

A recent build shows another symptom:
2017-06-14 05:17:49,461 | INFO | rd-dispatcher-46 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-06-14 05:17:49,847 | INFO | rd-dispatcher-39 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.Carbon | member-2-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=2acd, message=CommitLocalTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-5472-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-1700534907], coordinated=false}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.
2017-06-14 05:17:49,849 | INFO | rd-dispatcher-39 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.Carbon | member-2-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=2ace, message=CommitLocalTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-5473-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-1700534907], coordinated=false}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.

That last message was repeated not just the two times shown, but 15 times.

Comment by Robert Varga [ 14/Jun/17 ]

The trouble seems to be in ShardBackendResolver, where we do not invalidate unsuccessful attempts and hold on to the failed state.
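
A hypothetical sketch of the described defect and the obvious fix, assuming the resolver caches one connect future per shard cookie (names and structure are illustrative, not the actual ShardBackendResolver code): if a failed future is never evicted, every subsequent resolution keeps returning the same failed attempt.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class BackendResolverSketch<B> {
    private final ConcurrentMap<Long, CompletableFuture<B>> backends = new ConcurrentHashMap<>();

    CompletableFuture<B> resolveBackend(final Long cookie) {
        return backends.computeIfAbsent(cookie, this::connect);
    }

    private CompletableFuture<B> connect(final Long cookie) {
        final CompletableFuture<B> attempt = startConnectAttempt(cookie);
        // The fix: evict the cached entry when the attempt fails, so the next
        // resolveBackend() call starts a fresh attempt against the new leader
        // instead of holding on to the failed state.
        attempt.whenComplete((info, failure) -> {
            if (failure != null) {
                backends.remove(cookie, attempt);
            }
        });
        return attempt;
    }

    private CompletableFuture<B> startConnectAttempt(final Long cookie) {
        // Placeholder for the actual ConnectClientRequest exchange with the shard.
        return new CompletableFuture<>();
    }
}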

Comment by Robert Varga [ 14/Jun/17 ]

https://git.opendaylight.org/gerrit/58936

Comment by Robert Varga [ 14/Jun/17 ]

That last batch of messages should be fixed by the above patch.

The info messages:

2017-06-10 10:43:50,814 | INFO | rd-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false
2017-06-10 10:43:50,826 | INFO | rd-dispatcher-53 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done true

Tom, I think it is time to make the sync threshold tuneable and to log where we are relative to it. With pipelining and batching we can send out a batch of append operations, and hence the indexes will be jumping by the size of those batches.
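
A minimal sketch of the idea, assuming a tracker that reports a follower as in sync once its commit index is within a configurable lag of the leader's (class and field names are illustrative, not the actual SyncStatusTracker): with batched append entries the indexes jump, so a very small threshold makes the status flap exactly the way the log lines above show.

final class SyncStatusSketch {
    private final long syncThreshold;   // the proposed configuration knob
    private boolean inSync = false;

    SyncStatusSketch(final long syncThreshold) {
        this.syncThreshold = syncThreshold;
    }

    // Reports "sync done" only once the follower's commit index is within
    // syncThreshold entries of the leader's.
    boolean update(final long leaderCommitIndex, final long followerCommitIndex) {
        final boolean nowInSync = leaderCommitIndex - followerCommitIndex <= syncThreshold;
        if (nowInSync != inSync) {
            inSync = nowInSync;
            // Real code would emit the "follower initial sync status" notification here.
        }
        return inSync;
    }
}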

Comment by Robert Varga [ 15/Jun/17 ]

Configuration knob: https://git.opendaylight.org/gerrit/58991

Comment by Tom Pantelis [ 15/Jun/17 ]

I think you're right about the batching causing false out-of-sync status. I think the Follower should update the SyncStatusTracker after it updates its commitIndex.
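
A hypothetical sketch of the ordering being proposed (illustrative names, not the actual Follower code): update the follower's own commit index from the AppendEntries message first, and only then evaluate the sync status, so the decision is not based on a stale commit index.

final class FollowerSyncOrderSketch {
    private long commitIndex = -1;
    private boolean inSync = false;

    void onAppendEntries(final long leaderCommit, final long lastLogIndex, final long threshold) {
        // 1. Apply the leader's commit index to our own log first ...
        commitIndex = Math.min(leaderCommit, lastLogIndex);
        // 2. ... and only then evaluate sync status against the updated value.
        inSync = leaderCommit - commitIndex <= threshold;
    }

    boolean isInSync() {
        return inSync;
    }
}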

Comment by Robert Varga [ 15/Jun/17 ]

Tom, do you mean something like: https://git.opendaylight.org/gerrit/59037 ?

Comment by Tom Pantelis [ 15/Jun/17 ]

(In reply to Robert Varga from comment #10)
> Tom, do you mean something like: https://git.opendaylight.org/gerrit/59037 ?

exactly

Comment by Vratko Polak [ 20/Jun/17 ]

> PrimaryShardInfoCache is not flushed on NotLeaderException

I think this, or something similar still needs to be fixed.

A recent RelEng run showed RequestTimeoutException [0] after a module-based shard replica was removed from the leader, which was located on a different member than the writer.

The karaf.log on the writer [1] shows the initial NotLeaderException at 06:47:44,331, the replaced connection to the new leader at 06:48:49,054, and the transaction error at 06:49:44,761.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/755/log.html.gz#s1-s20-t3-k2-k9
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/755/odl2_karaf.log.gz

Comment by Robert Varga [ 20/Jun/17 ]

2017-06-20 06:48:49,058 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | member-3-shard-default-config: No transactions enqueued while attempting to start canCommit on org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort$Normal@389c8969

would explain that. Debug logs are needed to move forward.

Comment by Vratko Polak [ 03/Jul/17 ]

It seems this can still happen, but with such reduced frequency that a longevity test is needed to reproduce it reliably [9].

In this run and iteration (the previous 76 passed), the leader is being moved away from a member where a single writer is running (the same scenario as in the initial description of this Bug).

> would explain that.

Yes, "No transactions enqueued" happened (in the log [10]), that is (one more reason) why I believe this Bug is distinct CONTROLLER-1737.

2017-07-02 00:17:03,690 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-1-shard-default-config: No transactions enqueued while attempting to start canCommit on org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort$Normal@4887c30b

Also a Sandbox run sees the same failure [11] and message (log [12] at 11:33:40,546). Another run with more verbose logs will come shortly.

[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/13/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[10] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/13/odl1_karaf.log.gz
[11] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/1/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[12] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/1/odl2_karaf.log.gz

Comment by Robert Varga [ 03/Jul/17 ]

I think the trouble comes from this bit:

2017-07-03 11:33:32,294 | WARN | rd-dispatcher-35 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-8-txn-7141-0 has timed out after 15000 ms in state READY
2017-07-03 11:33:40,546 | WARN | rd-dispatcher-49 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: No transactions enqueued while attempting to start canCommit on org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort$Normal@3b1d9491

https://git.opendaylight.org/gerrit/59890 adds more logging, but I do believe the second line points to the same transaction as the one aborted in READY state.

When that happens, we only remove the head of the queue, but do not propagate that to the cohort (ShardDataTree line 1044).

Comment by Robert Varga [ 03/Jul/17 ]

This scenario can happen if we are performing a reconnect operation concurrently with the transaction being submitted – we have to flush the state to backend (hence the transaction is READY), but the actual commit request can be delayed.

From long-term perspective, I don't think ShardDataTree should be pruning ready transactions, as they should not be holding up forward progress. Unfortunately that requires interaction with leader movement logic and addition of per-chain queues (to detect the bad case when a subsequent transaction is actually scheduled for commit).

In the short term, https://git.opendaylight.org/gerrit/59895 propagates the timeout in READY state towards the cohort – hence rather than throwing a warning on the backend and taking the usual 120 seconds to time out (because the backend will not produce a response), we will report the timeout as a result of the CAN_COMMIT stage.
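
A hypothetical sketch combining the defect from the previous comment and this short-term fix (the queue and cohort types below are illustrative, not the actual ShardDataTree/SimpleShardDataTreeCohort API): when the READY transaction at the head of the queue times out, its cohort must be failed explicitly, otherwise the frontend only learns about it through the much longer request timeout.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeoutException;

final class ReadyQueueSketch {
    interface CohortSketch {
        void failCanCommit(Exception cause);   // hypothetical callback towards the frontend
    }

    private final Queue<CohortSketch> readyTransactions = new ArrayDeque<>();

    void onTransactionTimeout() {
        final CohortSketch head = readyTransactions.poll();
        if (head != null) {
            // Without this call the queue entry silently disappears and the frontend
            // only finds out through the request timeout.
            head.failCanCommit(new TimeoutException("Backend timeout in state READY"));
        }
    }
}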

Comment by Vratko Polak [ 03/Jul/17 ]

Sandbox run with a logging change [13] failed here [14].

[13] https://git.opendaylight.org/gerrit/#/c/59890/2
[14] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/4/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 04/Jul/17 ]

The comment [15] was intended for this Bug.

[15] https://bugs.opendaylight.org/show_bug.cgi?id=8494#c26

Comment by Vratko Polak [ 04/Jul/17 ]

Copy of [15] with changed link numbers:

With [16] the message [17] is different:
"Backend timeout in state READY after 15000ms" is returned 35 seconds after the leader movement.

[16] https://git.opendaylight.org/gerrit/#/c/59895/1
[17] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/5/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 04/Jul/17 ]

Code [18] shows the same Robot behavior [19] as [16].

[18] https://git.opendaylight.org/gerrit/#/c/59896/1
[19] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/6/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 04/Jul/17 ]

> shows the same Robot behavior

As does code [21], behavior [20].
If anything, the failures happen after fewer iterations, compared to current stable/carbon.

[20] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/7/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[21] https://git.opendaylight.org/gerrit/#/c/59898/1

Comment by Vratko Polak [ 04/Jul/17 ]

The next patch set [22] still looks [23] the same.

[22] https://git.opendaylight.org/gerrit/#/c/59898/2
[23] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/8/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 06/Jul/17 ]

Another patch [24], the same result [25].

[24] https://git.opendaylight.org/gerrit/#/c/59998/1
[25] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/9/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Robert Varga [ 06/Jul/17 ]

Since the tell-based protocol can replay state from the frontend, but requires the command stream to be processed in order, we have a different set of constraints on what needs to happen on the backend during leadership transfer.

The first part of this is that the Shard needs to know when a pauseLeader() operation no longer runs, i.e. has been aborted. Patch https://git.opendaylight.org/gerrit/60032 introduces an unpauseLeader() hook.

The second part will be blowing up LeaderFrontendState and purging it from ShardDataTree when pauseLeader() is invoked – which is fine, as we do not allow the tell-based protocol to make any progress while we are not an active leader. If the transition process succeeds, we end up being a follower or shut down, which is fine. If the transition fails, we can restore LeaderFrontendState from the journal, the same way we do when we transition from being a follower.
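
A hypothetical sketch of the transfer handling outlined above (the method names mirror the comment, the bodies are illustrative only): pauseLeader() purges the tell-based frontend state, and unpauseLeader() restores it from the journal if the transfer is aborted.

final class LeaderTransferSketch {
    private Object leaderFrontendState;   // stand-in for per-client tell-based metadata

    void pauseLeader() {
        // Purge tell-based frontend state; no progress is allowed while we are not an
        // active leader, and a new leader can rebuild the state from client replay.
        leaderFrontendState = null;
    }

    void unpauseLeader() {
        // Transfer was aborted: restore the state from the journal, the same way a
        // former follower does when it becomes leader.
        leaderFrontendState = restoreFromJournal();
    }

    private Object restoreFromJournal() {
        return new Object();   // placeholder for journal replay
    }
}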

Comment by Vratko Polak [ 10/Jul/17 ]

> "Backend timeout in state READY after 15000ms" is returned 35 second after the leader movement.

With later code [26], the behavior [27] is still the same.

[26] https://git.opendaylight.org/gerrit/#/c/60033/5
[27] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/1/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 10/Jul/17 ]

This still happens [28] on occasion in RelEng functional tests, as identified by "No transactions enqueued" in the new leader's karaf.log [29] (at 09:21:00,668).

[28] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/776/log.html.gz#s1-s20-t3-k2-k8
[29] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/776/odl3_karaf.log.gz

Comment by Robert Varga [ 10/Jul/17 ]

The failure in comment 24 looks like it is caused by synchronization in the test driver: the frontend thread is prevented from executing. Specifically, it is suspended between:

2017-07-10 10:06:47,396 | DEBUG | ult-dispatcher-5 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Completing TransmittedConnectionEntry{request=ModifyTransactionRequest{target=member-3-datastore-config-fe-0-chn-8-txn-7786-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#1563295464], modifications=0, protocol=SIMPLE}, enqueuedTicks=1154543474499, sessionId=12, txSequence=7698} with Envelope{sessionId=c, txSequence=1e12, message=TransactionCommitSuccess{target=member-3-datastore-config-fe-0-chn-8-txn-7786-0, sequence=1}}

and

2017-07-10 10:07:12,575 | DEBUG | ult-dispatcher-5 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Future #7786 completed successfully

During this time, the test driver is busily introducing new transactions and holding the lock...
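
A minimal, hypothetical illustration of that pattern (not the actual clustering-it-provider test driver code): the driver submits transactions in a loop while holding a lock, and the completion callback of an earlier future needs that same lock, so completion handling is delayed until the driver lets go.

import java.util.concurrent.CompletableFuture;

final class DriverLockSketch {
    private final Object lock = new Object();

    void runDriver() {
        synchronized (lock) {
            for (int i = 0; i < 10_000; i++) {
                submitTransaction(i).whenComplete((result, failure) -> {
                    synchronized (lock) {
                        // Runs on a dispatcher thread and blocks here until the driver
                        // loop releases the lock - which is what delays the
                        // "Future #N completed successfully" handling seen in the log.
                    }
                });
            }
        }
    }

    private CompletableFuture<Void> submitTransaction(final int seq) {
        // Placeholder: the real driver sends a transaction to the datastore and the
        // future completes later on a different thread.
        return CompletableFuture.runAsync(() -> { });
    }
}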

Comment by Robert Varga [ 10/Jul/17 ]

https://git.opendaylight.org/gerrit/60033
https://git.opendaylight.org/gerrit/60137
