[CONTROLLER-1706] Large transaction traffic prevents the leader from being moved Created: 06/Jun/17 Updated: 25/Jul/23 Resolved: 21/Jun/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Jakub Morvay |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 8606 |
| Description |
|
This symptom is affecting current cluster testing, but after other bugs are fixed it might no longer be critical. It has only been seen on Sandbox so far. The robot symptom "Leader not found" [0] is similar to:

2017-06-06 07:51:36,160 | WARN | lt-dispatcher-28 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-1-shard-default-config: Failed to transfer leadership in 10.01 s

I suspect the large transaction is the creation of a large list at the start of write-transactions, see CONTROLLER-1703. But I am not sure whether the following proves or disproves that:

2017-06-06 07:52:40,474 | INFO | ternal.Finalizer | lientBackedTransaction$Finalizer | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Aborted orphan transaction ClientSnapshot {identifier=member-1-datastore-config-fe-0-txn-7-0}

[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/log.html.gz#s1-s18-t1-k2-k12-k1-k3-k1 |
| Comments |
| Comment by Vratko Polak [ 06/Jun/17 ] |
|
Note that this happens for both module-based and prefix-based shards (tell-based protocol), but only when 3 writers/producers (and a single listener) are used. A similar suite with a single writer/producer passes (mostly). |
| Comment by Vratko Polak [ 07/Jun/17 ] |
|
With the initial big transaction removed [2] (now merged to stable/carbon), this failure still affects remove-shard-replica when there are frequent small transactions, but only for module-based shards (not prefix-based ones) for some reason. Recent sandbox run, robot log [3], member-1 karaf.log (huge) [4]:

2017-06-06 19:14:27,516 | WARN | lt-dispatcher-25 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-1-shard-default-config: Failed to transfer leadership in 10.03 s

[2] https://git.opendaylight.org/gerrit/58355 |
| Comment by Vratko Polak [ 07/Jun/17 ] |
|
> (not prefix-based ones) for some reason.

For comparison, prefix-based shard performance:

2017-06-06 19:23:12,676 | INFO | ult-dispatcher-3 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Successfully transferred leadership to null in 2.031 s |
| Comment by Jakub Morvay [ 09/Jun/17 ] |
|
I have analyzed the logs from run [0]. The only differences between this job and the jobs Vratko mentioned above are that we use PoisonPill to kill the shard leader replica (not the remove-shard-replica rpc) and that debug logging for raft messages is enabled. Moreover, I have been looking at the Clean Leader Shutdown suite, not the Remote Listener suite Vratko mentioned, but I believe they fail for the same reason.

The scenario goes as follows. We have a three-node cluster and, on the leader (member-3 in this particular run), we start writing data to the shard-default-config shard. Then we kill the shard-default-config shard on the leader node (we send PoisonPill to the member-3-shard-default-config actor). We expect the followers to elect a new leader quickly, within a few election timeouts. But this is not the case: a leader is not elected within one minute and the test fails.

From the logs we can see the leader shard going down at 11:06:31,494. Looking at the followers, they keep processing AppendEntries messages for the shard-default-config shard from the leader for a short while, but they do time out their elections properly, roughly 10 seconds after the leader goes down. We can see this on member-2:

2017-06-08 11:06:41,479 | TRACE | lt-dispatcher-30 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | Received message ElectionTimeout

This is repeated every 10 seconds until the test ends in failure. We can see the same logs on member-1.

Looking closely at how followers handle election timeouts, we can see that followers also check the leader's cluster status before switching to candidate, and they keep doing so for 180 seconds (with an election timeout of 10 seconds). This was introduced in https://git.opendaylight.org/gerrit/#/c/43265/ as a fix for a different issue. Clearly, this prevents quick elections in the case where the leader shard actor is down but its cluster node is still up and leadership transfer was not successful. For example, this is the case if you send PoisonPill to the leader shard actor, or if RaftActor's pauseLeader [1] method fails during a leadership transfer attempt after a Shutdown message has been sent to the leader shard actor (this can happen as part of a remove-shard-replica rpc operation).

[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/17/log.html.gz#s1-s2 |
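To make the follower-side check concrete, the sketch below models the behaviour described here. This is not the actual sal-akka-raft Follower code; the class, the ClusterView helper and the fail-safe constant are assumptions for illustration only.

```java
/**
 * Illustrative sketch only - not the real sal-akka-raft Follower code.
 * On each ElectionTimeout the follower first checks whether the leader's
 * cluster member is still reachable, and only switches to candidate if the
 * member is unreachable/down or a fail-safe number of timeouts has passed
 * (18 x 10 s election timeout, i.e. the ~180 s discussed in this thread).
 */
final class FollowerElectionGate {
    private static final int MAX_SKIPPED_TIMEOUTS = 18; // assumed fail-safe factor

    private final ClusterView cluster; // hypothetical view of Akka cluster membership
    private int skippedTimeouts;

    FollowerElectionGate(final ClusterView cluster) {
        this.cluster = cluster;
    }

    /** Called each time ElectionTimeout fires while this node is a follower. */
    boolean shouldBecomeCandidate(final String leaderMemberName) {
        skippedTimeouts++;
        final boolean leaderUnreachable = cluster.isUnreachableOrDown(leaderMemberName);
        final boolean failSafeExpired = skippedTimeouts >= MAX_SKIPPED_TIMEOUTS;
        // A PoisonPill'ed leader actor leaves its cluster member "up", so only
        // the fail-safe path eventually allows a new election in that case.
        return leaderUnreachable || failSafeExpired;
    }

    /** Hypothetical accessor for Akka cluster reachability state. */
    interface ClusterView {
        boolean isUnreachableOrDown(String memberName);
    }
}
```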
| Comment by Tom Pantelis [ 09/Jun/17 ] |
|
Yes - I warned in the patch about the side-effects of just blindly killing the shard in the tests with PoisonPill. On election timeout, followers check akka's cluster state - this is to avoid spurious elections if the leader shard is busy for a period of time. We want to avoid deposing the leader unless we can be reasonably sure the node is not available. Normally in production the leader node is either shut down or becomes unavailable due to a network partition. In either case, akka's cluster state will mark the node as unreachable/down, so followers will initiate a new election on timeout. On graceful leader shutdown, we try to transfer leadership to speed up the process. We do have an upper deadline to initiate a new election on election timeout as a fail-safe, as you mentioned. That would only be hit if the leader shard actor had a hard failure and couldn't recover, or the process is hosed (low memory, out of memory, etc.), either of which should be uncommon.

In a 3-node setup, what are you really trying to test by killing the leader shard actor? Why don't you just shut the leader node down, kill the process, or remove the shard replica? If you really want to stop the shard while keeping the process up, then I would suggest stopping it gracefully so it transfers leadership. |
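For comparison, the two ways of stopping the leader shard discussed here look roughly like this from a test's point of view. The PoisonPill call uses the standard Akka API; the graceful path relies on the raft Shutdown request mentioned elsewhere in this thread, whose exact class is assumed here rather than confirmed.

```java
import akka.actor.ActorRef;
import akka.actor.PoisonPill;

/**
 * Sketch of the two shutdown styles discussed above (illustrative only).
 */
final class ShardStopExamples {

    /** Abrupt stop: the shard actor dies without transferring leadership,
     *  so followers fall back to the ~180 s fail-safe before re-electing. */
    static void killUngracefully(final ActorRef leaderShard) {
        leaderShard.tell(PoisonPill.getInstance(), ActorRef.noSender());
    }

    /** Graceful stop: ask the RaftActor to shut down so it can pause the
     *  leader, catch a follower up and hand leadership over first. */
    static void stopGracefully(final ActorRef leaderShard, final Object shutdownMessage) {
        // shutdownMessage would be the raft Shutdown request, e.g.
        // org.opendaylight.controller.cluster.raft.client.messages.Shutdown
        // (exact class assumed here, not verified against the source).
        leaderShard.tell(shutdownMessage, ActorRef.noSender());
    }
}
```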
| Comment by Tom Pantelis [ 09/Jun/17 ] |
|
(In reply to Jakub Morvay from comment #4)

> Clearly, this prevents quick elections in the case the leader shard actor is

Yes - but a failure in pauseLeader or leadership transfer indicates follower unavailability (normally, at least): either the followers are down or the leader is partitioned. For the latter, the followers would detect the partition as well and initiate a new election. |
| Comment by Jakub Morvay [ 09/Jun/17 ] |
|
(In reply to Tom Pantelis from comment #5)

We have used the remove-shard-replica rpc for this and it is still used on releng. We are going to try to stop the shard gracefully in our tests; I'm currently working on such rpcs. However, I believe this failure will still be present, because the pauseLeader method will fail. |
| Comment by Jakub Morvay [ 09/Jun/17 ] |
|
(In reply to Tom Pantelis from comment #6)

Well, is that true for a pauseLeader method "timeout"? If the leader cannot complete all pending tasks within the election timeout period, it just aborts the leadership transfer and does nothing else (apart from shutting itself down, becoming non-voting, etc.). If we know the leader is stepping down anyway, shouldn't we send a TimeoutNow message to a random follower to speed things up, instead of waiting 180 seconds for a new leader to emerge? |
| Comment by Tom Pantelis [ 09/Jun/17 ] |
|
(In reply to Jakub Morvay from comment #7)

pauseLeader in Shard tries to wait for all pending transactions to complete within the election timeout interval. Why would you expect this to time out in the test suite? In any event, if you anticipate that leadership transfer could fail, you can increase the expected deadline in the suite to take the potential 180 second timeout into account. |
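As a rough model of the pauseLeader behaviour described above (not the actual Shard code), pausing the leader amounts to waiting for the pending transactions to drain, but only up to the election timeout; the method and parameter names below are made up for the sketch.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Illustrative sketch of a pauseLeader-style bounded wait. The real Shard
 * implementation registers a callback with its pending-transaction queue;
 * this version just models "drain or give up after the election timeout".
 */
final class PauseLeaderSketch {

    /**
     * @param pendingTxsDrained completes when all pending transactions finish
     * @param electionTimeout   e.g. 10 seconds
     * @return true if the leader paused cleanly, false if it timed out
     *         (today leadership transfer is then aborted; per this discussion
     *         it arguably should still be attempted)
     */
    static boolean pauseLeader(final CompletableFuture<Void> pendingTxsDrained,
            final Duration electionTimeout) {
        try {
            pendingTxsDrained.get(electionTimeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```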
| Comment by Jakub Morvay [ 09/Jun/17 ] |
|
(In reply to Tom Pantelis from comment #9)

Yeah, well, we have seen such a case, where the leader couldn't complete all txs within the election timeout interval. |
| Comment by Tom Pantelis [ 09/Jun/17 ] |
|
(In reply to Jakub Morvay from comment #8)

Keep in mind that hitting the 180 second deadline should be very uncommon. You hit it because you blindly killed the leader actor ungracefully while keeping the process up, which is abnormal. If pauseLeader fails, it most likely means it can't get consensus for a transaction, in which case leadership transfer would fail as well because there aren't enough available nodes to elect a new leader. Of course, it could also time out for uncommon reasons, e.g. akka's transport is backed up, or the followers are too busy or in a bad state, or the volume of transactions is large.

Just sending a TimeoutNow message to a random follower isn't ideal, as the follower may not be able to collect the votes to become leader. This is why the leader first attempts to "catch up" at least one follower. If leadership transfer times out, could you just send TimeoutNow to all followers? We could, but there could be unwanted side-effects. In the common case where a majority of followers isn't available, sending TimeoutNow probably won't help. Also, the TimeoutNow message may get queued in akka's transport layer and only be delivered at some later point, which may not be desirable. The same could happen for the uncommon reasons above. |
| Comment by Tom Pantelis [ 09/Jun/17 ] |
|
(In reply to Jakub Morvay from comment #10)

When you're testing at scale you usually have to be more lenient with the test expectations and deadlines. The test VM environment can randomly cause delays in system and network resources and result in intermittent failures. Even if you forced a TimeoutNow message, if there's a large volume of pending transactions then the follower's queue will likely be backed up, so there is no guarantee when the TimeoutNow message would be delivered. |
| Comment by Tom Pantelis [ 10/Jun/17 ] |
|
Thinking about this some more: if pauseLeader times out, I think it does make sense to continue with leadership transfer instead of aborting. The shard may have a lot of transactions queued up which it can't finish in time, but there may still be a follower that is caught up (i.e. whose matchIndex equals the leader's lastIndex) or that would be caught up if leadership transfer continued. The worst case is that no follower is available and the "catch up" phase of leadership transfer also times out, which would lengthen shutdown time, but that should be fine. I can look into that.

As a last resort, if leadership transfer times out, we could force the TimeoutNow message to "active" followers, i.e. those that have responded recently and so should be up. |
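A minimal sketch of the target-selection idea outlined here, under the assumption of hypothetical accessors for follower log state and reply times; the real RaftActorLeadershipTransferCohort logic will differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

/**
 * Illustrative sketch of picking a leadership-transfer target as discussed
 * above: prefer a follower that is already caught up (matchIndex equals the
 * leader's lastIndex), otherwise fall back to "active" followers that have
 * replied recently. Types and accessors here are assumptions.
 */
final class TransferTargetSketch {

    record FollowerState(String id, long matchIndex, Instant lastReplyTime) {}

    static Optional<String> pickTarget(final List<FollowerState> followers,
            final long leaderLastIndex, final Duration activityWindow, final Instant now) {
        // 1. A follower whose log already matches the leader's last index
        //    can be sent TimeoutNow immediately.
        final Optional<String> caughtUp = followers.stream()
                .filter(f -> f.matchIndex() == leaderLastIndex)
                .map(FollowerState::id)
                .findFirst();
        if (caughtUp.isPresent()) {
            return caughtUp;
        }
        // 2. Otherwise, as a last resort, consider only followers that have
        //    replied within the recent activity window, so TimeoutNow is not
        //    queued towards a node that is down or backed up.
        return followers.stream()
                .filter(f -> Duration.between(f.lastReplyTime(), now).compareTo(activityWindow) <= 0)
                .map(FollowerState::id)
                .findFirst();
    }
}
```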
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
"Failed to transfer leadership" still happens occasionally ([2] at 09:14:29,865). For example after remove-shard-replica with single transaction-writer [3] (on a node separate to leader of module-based shard, tell-based protocol). [2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/740/odl1_karaf.log.gz |
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
The original "Leader not found" symptom is also seen occasionally [4]. |
| Comment by Tom Pantelis [ 12/Jun/17 ] |
|
(In reply to Vratko Polák from comment #14)

No matter what we do, leadership transfer is best-effort and may not complete successfully in time under load, for the reasons I've mentioned. You have to take that into account in the tests with more lenient deadlines. |
| Comment by Tom Pantelis [ 12/Jun/17 ] |