[CONTROLLER-1707] Client not reconnecting successfully after leader movement Created: 07/Jun/17 Updated: 25/Jul/23 Resolved: 24/Aug/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 8618 |
| Description |
|
This is somewhat similar to

The test case was using a module-based shard and the tell-based protocol, and it was moving leadership away from member-3, where the transaction-writer was running.

2017-06-06 17:36:11,744 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-default-config#1462585811], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}

It is not clear why no reconnect succeeded after this:

2017-06-06 17:36:11,820 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.control

Finally, the writer failed with RequestTimeoutException [2].

[0] https://git.opendaylight.org/gerrit/#/c/58355/2 |
| Comments |
| Comment by Tomas Cere [ 07/Jun/17 ] |
|
member-1 switched to leader at 17:36:11,800; this looks like the PrimaryShardInfoCache is not being flushed on NotLeaderException. |
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
This is affecting one longevity job [0]. See karaf.log [1] between 04:43:03,671 and 04:45:08,911.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/10/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10 |
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
Also, this affects the "remote to remote" test case [5] of the module-shard make-leader-local test, which was consistently passing previously. Looking at karaf.log [6], frequent syncing on followers might be one of the reasons why this Bug happens quite frequently:

2017-06-10 10:43:50,814 | INFO | rd-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false

[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/738/log.html.gz#s1-s24-t3-k2-k10 |
| Comment by Vratko Polak [ 14/Jun/17 ] |
|
A recent build shows another symptom: the last message was repeated not two times, but 15 times. |
| Comment by Robert Varga [ 14/Jun/17 ] |
|
The trouble seems to be in ShardBackendResolver, where we do not invalidate unsuccessful attempts and hold on to the failed state. |
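For illustration, a minimal sketch (names are hypothetical, not the actual ShardBackendResolver code) of the kind of fix being described: a resolver cache that forgets entries whose resolution failed, so the next lookup retries instead of holding on to the failed state.

{code:java}
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Illustrative only: evict failed resolution attempts so a later call retries
// the backend lookup instead of replaying the cached failure.
final class BackendInfoCache<K, V> {
    private final ConcurrentMap<K, CompletionStage<V>> cache = new ConcurrentHashMap<>();

    CompletionStage<V> resolve(final K key, final Function<K, CompletionStage<V>> resolver) {
        final CompletionStage<V> future = cache.computeIfAbsent(key, resolver);

        // Key point: a failed attempt must not stay cached. Remove the exact
        // future that failed, so a concurrent successful replacement is kept.
        future.whenComplete((value, failure) -> {
            if (failure != null) {
                cache.remove(key, future);
            }
        });
        return future;
    }
}
{code}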
| Comment by Robert Varga [ 14/Jun/17 ] |
|
That last batch of messages should be fixed by the above patch.

As for the info messages:

2017-06-10 10:43:50,814 | INFO | rd-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | shard-manager-config Received follower initial sync status for member-3-shard-default-config status sync done false

Tom, I think it is time to make the sync threshold tuneable and to log where we are relative to it: with pipelining and batching we can send out a batch of append operations, and hence the indexes will be jumping by the size of those batches. |
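For reference, a rough sketch (field and method names are assumptions, not the actual SyncStatusTracker API) of what a configurable lag threshold looks like: the follower counts as in sync whenever its commit index is within the threshold of the leader's, which tolerates the index jumps caused by batched appends.

{code:java}
// Illustrative sketch of a follower sync-status check with a configurable
// threshold; names are assumptions, not the real SyncStatusTracker API.
final class SyncStatusSketch {
    private final long syncThreshold; // would come from the new configuration knob
    private boolean inSync;

    SyncStatusSketch(final long syncThreshold) {
        this.syncThreshold = syncThreshold;
    }

    // leaderCommit: commit index advertised by the leader in AppendEntries
    // followerCommit: this follower's commit index after applying the batch
    boolean update(final long leaderCommit, final long followerCommit) {
        // With pipelining and batching, the lag legitimately jumps by the
        // batch size, so "in sync" means "within the threshold", not "equal".
        inSync = leaderCommit - followerCommit <= syncThreshold;
        return inSync;
    }
}
{code}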
| Comment by Robert Varga [ 15/Jun/17 ] |
|
Configuration knob: https://git.opendaylight.org/gerrit/58991 |
| Comment by Tom Pantelis [ 15/Jun/17 ] |
|
I think you're right about the batching causing false out-of-sync status. I think Follower should update the SyncStatusTracker after it updates its commitIndex. |
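A minimal sketch of the ordering being suggested (types and names are hypothetical): the follower advances its commit index first and only then updates the sync tracker, so the reported status reflects the state after the whole batch has been applied.

{code:java}
// Ordering sketch only; names are assumptions, not the real Follower code.
final class FollowerSyncOrdering {
    interface SyncTracker {
        void update(long leaderCommit, long followerCommit);
    }

    private long commitIndex;

    void onAppendEntries(final long leaderCommit, final long lastAppendedIndex,
            final SyncTracker tracker) {
        // 1. advance the follower's commit index first
        commitIndex = Math.min(leaderCommit, lastAppendedIndex);
        // 2. only then update the tracker, using the post-batch commit index
        tracker.update(leaderCommit, commitIndex);
    }
}
{code}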
| Comment by Robert Varga [ 15/Jun/17 ] |
|
Tom, do you mean something like: https://git.opendaylight.org/gerrit/59037 ? |
| Comment by Tom Pantelis [ 15/Jun/17 ] |
|
(In reply to Robert Varga from comment #10) exactly |
| Comment by Vratko Polak [ 20/Jun/17 ] |
|
> PrimaryShardInfoCache is not flushed on NotLeaderException

I think this, or something similar, still needs to be fixed. A recent RelEng run showed RequestTimeoutException [0] after a module-based shard replica was removed from the leader, which was located on a different member than the writer. Karaf.log on the writer [1] shows the initial NotLeaderException at 06:47:44,331, the replaced connection to the new leader at 06:48:49,054, and the transaction error at 06:49:44,761.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/755/log.html.gz#s1-s20-t3-k2-k9 |
| Comment by Robert Varga [ 20/Jun/17 ] |
|
2017-06-20 06:48:49,058 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | member-3-shard-default-config: No transactions enqueued while attempting to start canCommit on org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort$Normal@389c8969

would explain that. Debug logs are needed to move forward. |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
It seems this can still happen, but with such reduced frequency that the longevity test is needed to reproduce it reliably [9]. In this run and iteration (the previous 76 passed), the leader is being moved away from a member where a single writer is running (the same scenario as in the initial description of this Bug).

> would explain that.

Yes, "No transactions enqueued" happened (in the log [10]); that is (one more reason) why I believe this Bug is distinct.

2017-07-02 00:17:03,690 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-1-shard-default-config: No transactions enqueued while attempting to start canCommit on org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort$Normal@4887c30b

A Sandbox run also sees the same failure [11] and message (log [12] at 11:33:40,546). Another run with more verbose logs will come shortly.

[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/13/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10 |
| Comment by Robert Varga [ 03/Jul/17 ] |
|
I think the trouble comes from this bit:

2017-07-03 11:33:32,294 | WARN | rd-dispatcher-35 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-8-txn-7141-0 has timed out after 15000 ms in state READY

https://git.opendaylight.org/gerrit/59890 adds more logging, but I do believe the second line points to the same transaction as the one aborted in READY state. When that happens, we only remove the head of the queue but do not propagate that to the cohort (ShardDataTree line 1044). |
| Comment by Robert Varga [ 03/Jul/17 ] |
|
This scenario can happen if we are performing a reconnect operation concurrently with the transaction being submitted: we have to flush the state to the backend (hence the transaction is READY), but the actual commit request can be delayed.

From a long-term perspective, I don't think ShardDataTree should be pruning ready transactions, as they should not be holding up forward progress. Unfortunately that requires interaction with the leader-movement logic and the addition of per-chain queues (to detect the bad case where a subsequent transaction is actually scheduled for commit).

In the short term, https://git.opendaylight.org/gerrit/59895 propagates the timeout in READY state towards the cohort. Hence, rather than throwing a warning on the backend and taking the usual 120 seconds to time out (because the backend will not produce a response), we will report the timeout as a result of the CAN_COMMIT stage. |
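A rough sketch of the short-term behavior described above (class and method names are hypothetical, not the actual ShardDataTree code): when a READY transaction times out and is pruned from the queue, its cohort is failed immediately, so the frontend sees the failure at the CAN_COMMIT stage instead of waiting out the full request timeout.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Illustrative only: prune timed-out READY transactions and propagate the
// failure to their cohorts instead of silently dropping them.
final class ReadyQueuePruner {
    static final class Cohort {
        final String txId;
        final long readyAtMillis;
        final CompletableFuture<Boolean> canCommit = new CompletableFuture<>();

        Cohort(final String txId, final long readyAtMillis) {
            this.txId = txId;
            this.readyAtMillis = readyAtMillis;
        }
    }

    private final Queue<Cohort> readyQueue = new ArrayDeque<>();
    private final long timeoutMillis;

    ReadyQueuePruner(final long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void ready(final Cohort cohort) {
        readyQueue.add(cohort);
    }

    void pruneTimedOut(final long nowMillis) {
        Cohort head;
        while ((head = readyQueue.peek()) != null
                && nowMillis - head.readyAtMillis >= timeoutMillis) {
            readyQueue.remove();
            // Key point: do not just drop the head of the queue -- fail its
            // cohort so the frontend gets an immediate CAN_COMMIT failure
            // rather than waiting for the 120-second request timeout.
            head.canCommit.completeExceptionally(new TimeoutException(
                "Backend timeout in state READY after " + timeoutMillis + " ms for " + head.txId));
        }
    }
}
{code}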
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
A Sandbox run with a logging change [13] failed here [14].

[13] https://git.opendaylight.org/gerrit/#/c/59890/2 |
| Comment by Vratko Polak [ 04/Jul/17 ] |
|
The comment [15] was intended for this Bug. |
| Comment by Vratko Polak [ 04/Jul/17 ] |
|
Copy of [15] with changed link numbers:

With [16] the message [17] is different:

[16] https://git.opendaylight.org/gerrit/#/c/59895/1 |
| Comment by Vratko Polak [ 04/Jul/17 ] |
|
Code [18] shows the same Robot behavior [19] as [16].

[18] https://git.opendaylight.org/gerrit/#/c/59896/1 |
| Comment by Vratko Polak [ 04/Jul/17 ] |
|
> shows the same Robot behavior

As does [20], with code [21].

[20] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/7/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10 |
| Comment by Vratko Polak [ 04/Jul/17 ] |
|
The next patch set [22] still looks [23] the same.

[22] https://git.opendaylight.org/gerrit/#/c/59898/2 |
| Comment by Vratko Polak [ 06/Jul/17 ] |
|
Another patch [24], the same result [25].

[24] https://git.opendaylight.org/gerrit/#/c/59998/1 |
| Comment by Robert Varga [ 06/Jul/17 ] |
|
Since the tell-based protocol can replay state from the frontend, but requires the command stream to be processed in order, we have a different set of constraints on what needs to happen on the backend during leadership transfer.

The first part of this is that Shard needs to know when a pauseLeader() operation no longer runs and has been aborted. Patch https://git.opendaylight.org/gerrit/60032 introduces an unpauseLeader() hook.

The second part will be blowing up LeaderFrontendState and purging it from ShardDataTree when pauseLeader() is invoked, which is fine, as we do not allow the tell-based protocol to make any progress while we are not an active leader. If the transition process succeeds, we end up being a follower or shut down, which is fine. If the transition fails, we can restore LeaderFrontendState from the journal, the same way we do when we transition from being a follower. |
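For illustration, a rough sketch of the transfer flow just described (the pauseLeader()/unpauseLeader() names follow the comment, but the surrounding type and the journal-replay placeholder are hypothetical):

{code:java}
// Sketch of the leadership-transfer handling described above; not the actual
// Shard/ShardDataTree code.
final class LeaderTransferSketch {
    private Object leaderFrontendState; // stand-in for LeaderFrontendState

    // Leadership transfer starts: the tell-based protocol makes no progress
    // while we are not an active leader, so the in-memory state is dropped.
    void pauseLeader() {
        leaderFrontendState = null;
    }

    // Transfer was aborted and we are still the leader: rebuild the frontend
    // state, e.g. by replaying the journal, the same way it is rebuilt after
    // becoming leader from a follower.
    void unpauseLeader() {
        leaderFrontendState = recoverFromJournal();
    }

    private Object recoverFromJournal() {
        return new Object(); // placeholder for journal replay
    }
}
{code}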
| Comment by Vratko Polak [ 10/Jul/17 ] |
|
> "Backend timeout in state READY after 15000ms" is returned 35 second after the leader movement. Later code [26], behavior [27] still the same. [26] https://git.opendaylight.org/gerrit/#/c/60033/5 |
| Comment by Vratko Polak [ 10/Jul/17 ] |
|
This still happens [28] on occasion in RelEng functional tests, as identified by "No transactions enqueued" in the new leader's karaf.log [29] (at 09:21:00,668).

[28] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/776/log.html.gz#s1-s20-t3-k2-k8 |
| Comment by Robert Varga [ 10/Jul/17 ] |
|
The failure in comment 24 looks like it is caused by synchronization in the test driver. More specifically, the frontend thread is prevented from executing; it is suspended between:

2017-07-10 10:06:47,396 | DEBUG | ult-dispatcher-5 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Completing TransmittedConnectionEntry{request=ModifyTransactionRequest{target=member-3-datastore-config-fe-0-chn-8-txn-7786-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#1563295464], modifications=0, protocol=SIMPLE}, enqueuedTicks=1154543474499, sessionId=12, txSequence=7698} with Envelope{sessionId=c, txSequence=1e12, message=TransactionCommitSuccess{target=member-3-datastore-config-fe-0-chn-8-txn-7786-0, sequence=1}}

and

2017-07-10 10:07:12,575 | DEBUG | ult-dispatcher-5 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Future #7786 completed successfully

During this time the test driver is busily introducing new transactions and holding the lock... |
| Comment by Robert Varga [ 10/Jul/17 ] |
|
https://git.opendaylight.org/gerrit/60033 |