[CONTROLLER-1675] Leadership transfer failed: Follower is not ready to become leader Created: 15/May/17 Updated: 25/Jul/23 |
|
| Status: | Confirmed |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: |
|
| External issue ID: | 8446 |
| Description |
|
This happened [0] in CSIT for a module-based shard with the tell-based protocol. Karaf.log [1] shows a large number of warnings such as:

2017-05-14 06:18:34,666 | WARN | lt-dispatcher-22 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | member-1-shard-default-config: Unknown history for purged transaction member-3-datastore-config-fe-0-chn-2-txn-1522-0, ignoring

Here is the error and the five infos before it:

2017-05-14 06:18:34,674 | INFO | lt-dispatcher-28 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | shard-manager-config Received follower initial sync status for member-1-shard-default-config status sync done false
2017-05-14 06:18:34,716 | INFO | ult-dispatcher-2 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | shard-manager-config Received follower initial sync status for member-1-shard-default-config status sync done true
2017-05-14 06:18:34,716 | INFO | ult-dispatcher-2 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | shard-manager-config Received follower initial sync status for member-1-shard-default-config status sync done false
2017-05-14 06:18:34,724 | INFO | lt-dispatcher-29 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | shard-manager-config Received follower initial sync status for member-1-shard-default-config status sync done true
2017-05-14 06:18:34,724 | INFO | lt-dispatcher-29 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-05-14 06:18:44,695 | ERROR | lt-dispatcher-21 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.0.Carbon | Leadership transfer failed for shard default.
org.opendaylight.controller.cluster.raft.LeadershipTransferFailedException: Failed to transfer leadership to member-1-shard-default-config. Follower is not ready to become leader
	at org.opendaylight.controller.cluster.raft.RaftActor$1.onFailure(RaftActor.java:304)
	at org.opendaylight.controller.cluster.raft.RaftActorLeadershipTransferCohort.finish(RaftActorLeadershipTransferCohort.java:187)
	at org.opendaylight.controller.cluster.raft.RaftActorLeadershipTransferCohort.abortTransfer(RaftActorLeadershipTransferCohort.java:138)
	at org.opendaylight.controller.cluster.raft.RaftActorLeadershipTransferCohort$1.doCancel(RaftActorLeadershipTransferCohort.java:112)
	at org.opendaylight.controller.cluster.raft.TimedRunnable.cancel(TimedRunnable.java:47)
	at org.opendaylight.controller.cluster.raft.TimedRunnable.lambda$new$0(TimedRunnable.java:33)
	at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:264)
	at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)
	at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
	at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
	at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
	at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
	at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)
	at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
	at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/log.html.gz#s1-s29-t3-k2-k5-k3-k1-k4-k7-k1 |
| Comments |
| Comment by Jakub Morvay [ 15/May/17 ] |
|
I will try to look into the logs for more details, but generally I would suggest retrying requests to change shard leadership to the local node in case of the above-mentioned failure. There are some scenarios where we cannot grant leadership to a follower node, because that follower is not up to date with the current leader. |
| Comment by Robert Varga [ 15/May/17 ] |
|
This is a problem of the follower not being fully caught up. As a first step, create a dedicated subclass of LeadershipTransferFailedException which will indicate that the followers are not fully caught up. |
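A minimal sketch of what such a subclass might look like; the class name, and the assumption that LeadershipTransferFailedException exposes a String-message constructor, are illustrative rather than the actual implementation:

```java
// Illustrative sketch only; the class name and constructor shape are assumptions.
package org.opendaylight.controller.cluster.raft;

/**
 * Indicates that a leadership transfer failed because the designated follower
 * had not fully caught up with the leader's log. Callers may treat this as a
 * potentially transient, retriable condition.
 */
public class FollowerNotReadyException extends LeadershipTransferFailedException {
    private static final long serialVersionUID = 1L;

    public FollowerNotReadyException(final String message) {
        super(message);
    }
}
```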
| Comment by Jakub Morvay [ 17/May/17 ] |
|
Can you please provide debug logs for org.opendaylight.controller.cluster.raft? |
| Comment by A H [ 17/May/17 ] |
|
We are looking to build Carbon RC2 tomorrow, 5/18, at 23:59 UTC, assuming there are no blocker bugs. Is there an ETA for when a fix can be merged and this bug resolved for the stable/carbon branch? |
| Comment by Jakub Morvay [ 18/May/17 ] |
|
(In reply to A H from comment #4) I am still analyzing this, so I cannot give you an ETA yet. But I will do my best to update you with an ETA and possibly resolve this today. |
| Comment by Jakub Morvay [ 19/May/17 ] |
|
We cannot reproduce this anymore on the latest builds. |
| Comment by Vratko Polak [ 22/May/17 ] |
|
This happened again [2]. Looks like a heisenbug. |
| Comment by Robert Varga [ 22/May/17 ] |
|
At least some of this seems to be timing-dependent:

2017-05-22 00:46:21,801 | INFO | ult-dispatcher-2 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.Carbon | member-2-frontend-datastore-config: refreshing backend for shard 0

I.e. just a millisecond after we time out, we get a success message and the target actor initiates new elections. I think the RequestLeadership message should include a timeout (or absolute deadline) by which the transfer needs to occur. |
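A hypothetical sketch of a deadline-carrying transfer request; the class name, fields, and methods are assumptions for illustration, not the existing message definition:

```java
// Hypothetical sketch; not the actual RequestLeadership message in the codebase.
import java.io.Serializable;

public final class RequestLeadershipWithDeadline implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String requestedFollowerId;
    // Absolute deadline (epoch millis) by which the transfer must have happened;
    // the current leader aborts the attempt once this deadline has passed.
    private final long deadlineMillis;

    public RequestLeadershipWithDeadline(final String requestedFollowerId, final long deadlineMillis) {
        this.requestedFollowerId = requestedFollowerId;
        this.deadlineMillis = deadlineMillis;
    }

    public String getRequestedFollowerId() {
        return requestedFollowerId;
    }

    public boolean isExpired(final long nowMillis) {
        return nowMillis >= deadlineMillis;
    }
}
```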
| Comment by Colin Dixon [ 22/May/17 ] |
|
Does this happen only with the tell-based protocol? Also, do we think this is a regression from earlier releases or not? I'm trying to assess if this is really something we want to hold Carbon up for. |
| Comment by Jakub Morvay [ 23/May/17 ] |
|
(In reply to Colin Dixon from comment #9) This is new functionality, so no regression here. Downgrading importance to normal. |
| Comment by Robert Varga [ 23/May/17 ] |
|
Well, it still affects leader transfer on shutdown – which is not new functionality. Anyway, after thinking about this a bit more, I think the problem comes from the timeout value. The code is:

// We'll wait an election timeout period for a new leader to be elected plus some cushion to take into

This would be fine if the follower started acting on the request immediately, but under load the follower mailbox can have quite a few requests outstanding, and it will not start acting on our request until it has finished going through them – which introduces additional delay that we need to account for. The hard part is figuring out how much delay is tolerable, for which there is no hard-and-fast rule. What we need to ensure is that the target client knows that it needs to initiate elections and then that they were kicked off – and the only tools available to us are RaftRPCs. Hence I think this really needs to be a two-step process involving the journal, as we need to know when the target follower has received the message and when it is expected to start acting on it. A first-cut idea is to store a journal entry and follow what happens to its replication – starting the current timer only after the target follower has appended it in its journal... but that needs more thought. |
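A rough, entirely hypothetical illustration of the journal-marker idea (none of these names exist in the codebase): the leader appends a no-op marker entry, watches the target follower's match index, and arms the transfer timer only once the marker is known to have reached the follower's journal.

```java
// Hypothetical illustration of the two-step transfer idea; not actual controller code.
import java.util.concurrent.TimeUnit;

final class TransferMarkerTracker {
    private final long markerIndex;       // index of the no-op marker entry appended by the leader
    private final long timeoutNanos;      // election timeout plus cushion, counted only after the marker lands
    private long timerArmedAtNanos = -1;  // -1 means the transfer timer has not been started yet

    TransferMarkerTracker(final long markerIndex, final long timeoutMillis) {
        this.markerIndex = markerIndex;
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    /** Called whenever the leader learns the target follower's new match index. */
    void onFollowerMatchIndex(final long matchIndex, final long nowNanos) {
        if (timerArmedAtNanos < 0 && matchIndex >= markerIndex) {
            // The follower has appended the marker, so it has seen everything up to the
            // transfer request; only now does the transfer timeout start counting down.
            timerArmedAtNanos = nowNanos;
        }
    }

    /** Returns true once the armed timer has elapsed and the transfer should be aborted. */
    boolean isExpired(final long nowNanos) {
        return timerArmedAtNanos >= 0 && nowNanos - timerArmedAtNanos >= timeoutNanos;
    }
}
```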
| Comment by Tom Pantelis [ 23/May/17 ] |
|
I'm not sure appending a special journal entry is necessary just to transfer leadership. Transferring leadership on shutdown is really not something that needs to be done or completed, hence the timeout - it's a best effort to speed up new leader election, but we don't want to hold up shutdown indefinitely. If the transfer doesn't complete, a new leader will eventually get elected anyway once a follower's election timeout occurs. |
| Comment by Robert Varga [ 23/May/17 ] |
|
That is true for the shutdown case, but not for the pure transfer case. What we are seeing is that the attempt to transfer indicates a failure (follower not ready), after which (milliseconds later) the follower actually initiates a new election. Since this is exposed to applications via an asynchronous API, indicating a false negative is harmful, certainly more harmful than a longer transfer period. |
| Comment by Colin Dixon [ 23/May/17 ] |
|
Really simple question: is not using the tell-based protocol a workaround? |
| Comment by Colin Dixon [ 23/May/17 ] |
|
> Well, it still affects leader transfer on shutdown – which is not new functionality.

While not new functionality, is this a regression from prior releases or just a new bug we've found that was present before? |
| Comment by Tom Pantelis [ 23/May/17 ] |
|
It doesn't matter if it's a requested transfer or not. It's still a best effort - no guarantee to fulfill the request - it still needs to be time-boxed. If it's a matter of it taking longer than expected when under load, then the test needs to take that into account, i.e. wait longer or try the request again. Regardless, this does not need to block Carbon - there has been no change in behavior or regression - it's just that no one has tested leadership transfer under load with certain expectations. |
| Comment by Tom Pantelis [ 23/May/17 ] |
|
(In reply to Colin Dixon from comment #15) It's not a regression, nor is it really a bug with leadership transfer. As I've mentioned, it's a best effort with a deadline. Maybe there are things we can do elsewhere to eliminate or lessen the chance of failure in the test case (e.g. switch to Artery) but, for now, we probably need to lower expectations accordingly in the test. |
| Comment by Robert Varga [ 23/May/17 ] |
|
Well, the severity is already normal, so that should clear up the Carbon angst. Tom, I am not arguing the shutdown case, although it may be interesting to reduce the churn potential even then. What I am arguing is the switchover case, which is a process exposed to application developers. The use case there is a per-OF-switch shard, which you want to have co-located with the active connection. My concern here is what the application sees and how it can interpret things. As things stand right now, we tell the application that the movement has failed (authoritatively) and milliseconds after that the move is initiated, but does not complete (I think). Now, from the end user's perspective, the request to move a leader should be non-disruptive, i.e. it should be able to complete even under load. The way I see this happening is:

1) OF switch connects

Now it is up to OFP to decide what to do if the move fails, but certainly another attempt is a valid choice – which would mean that OFP is going to retry, further causing backend election churn – slowing things down. Hence I would really like to provide a result which is authoritative: the move will not happen and there will be no further artefacts on the backend due to the attempt. Or the movement succeeds in due time – if the app is willing to wait for the result, it can take minutes for all it cares. Since the process is based on a CompletionStage, the app can wait for a certain time for an authoritative response, or decide that it no longer cares (and enters unknown territory). |
| Comment by Tom Pantelis [ 23/May/17 ] |
|
I assume you're referring to the "CompletionStage<Void> makeLeaderLocal()" API. There are several ways a transfer request can fail:

1) There's no current leader to fulfill the request

So there are basically two phases to leadership transfer - first try to get the requested (or some) follower up-to-date such that if it initiates an election, it could win. After that, send TimeoutNow to immediately start an election. However, there's still no guarantee the follower will win the election (or even receive the TimeoutNow message in a timely manner, or at all). It should get the prior leader's vote, but if there are more than 3 nodes, it still needs a vote from at least one other node.

For 1), it could fail fast or wait for a period of time to see if a leader emerges and then retry the request (it might already do that in the ShardManager). This is an authoritative failure.

For 2) and 3), a deadline has been put on it and the transfer aborted, so these are authoritative failures.

For 4), it may succeed and thus may not be an authoritative failure. For these failures, a client retry could succeed (we could add a flag in the exception). It also seems to make sense for the client to supply the deadline.

For 5), it's an invalid request, so fail fast (authoritative failure). This is a case where the move will not ever happen (unless the voting configuration is changed by the user), so there is no point in retrying. |
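A sketch of the client-side retry pattern being discussed, assuming a makeLeaderLocal()-style operation that returns CompletionStage<Void>; the helper class and its names are illustrative only:

```java
// Illustrative client-side retry around a makeLeaderLocal()-style call.
// The Supplier stands in for whatever object actually exposes the operation.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

final class LeaderLocalHelper {
    private LeaderLocalHelper() {
    }

    /**
     * Invokes the supplied makeLeaderLocal()-style operation, retrying up to
     * maxAttempts times on failure. Retrying is only sensible when the failure is
     * potentially transient (e.g. the follower was not yet caught up).
     */
    static CompletionStage<Void> makeLeaderLocalWithRetry(
            final Supplier<CompletionStage<Void>> operation, final int maxAttempts) {
        final CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(operation, maxAttempts, result);
        return result;
    }

    private static void attempt(final Supplier<CompletionStage<Void>> operation,
            final int attemptsLeft, final CompletableFuture<Void> result) {
        operation.get().whenComplete((ok, failure) -> {
            if (failure == null) {
                result.complete(null);
            } else if (attemptsLeft > 1) {
                // A flag on the exception (or a dedicated subclass) could tell us whether
                // the failure is authoritative; here we simply retry a bounded number of times.
                attempt(operation, attemptsLeft - 1, result);
            } else {
                result.completeExceptionally(failure);
            }
        });
    }
}
```

Whether a retry is worthwhile depends on the failure being non-authoritative, which is exactly where a flag in the exception (or the dedicated subclass suggested earlier) would help the client decide.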
| Comment by Vratko Polak [ 23/Jun/17 ] |
|
Something very similar to this happened [3] after a considerable time of not being visible. I do not see any undelivered messages in the Karaf log of either the old leader [4] or the expected new leader [5] (between movement initiation at 11:02:30,526 and giving up at 11:02:57,006), so this is not

But I see multiple out-of-sync lines on the expected new leader, and there is also sync status flapping I did not expect after [6], so the new failure fits the comments of this Bug.

[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/759/log.html.gz#s1-s36-t3-k2-k9-k3-k1-k4-k7-k1 |
| Comment by Vratko Polak [ 18/Sep/17 ] |
|
Lowering severity to Minor, as no failures of this kind have been seen in around two months. From the comments, I have a feeling such failures are still possible in suites with greater data churn. |
| Comment by Robert Varga [ 06/Nov/21 ] |
|
The unknown history part might be caused by |