[CONTROLLER-1138] Clustering: Leader and followers get into an inconsistent state when Leader transactions fail Created: 05/Feb/15 Updated: 19/Sep/16 Resolved: 19/Sep/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Kamal Rameshan | Assignee: | Kamal Rameshan |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 2678 |
| Description |
|
Consider a scenario where there are concurrent writes to the same node on the Leader. When we try to apply these writes to the state, one of them might fail (with an OptimisticLockFailedException) on the Leader. But since the transactions are serialized on the Follower end, they might both go through there. Since we don't have a way to roll back, this would leave the Leader and the Follower(s) in an inconsistent state. |
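For illustration, a minimal sketch of the race described above against the Helium-era MD-SAL binding API. The class and helper method are hypothetical, not code from the controller; the broker, path, and payloads are supplied by the caller:

```java
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class ConcurrentWriteRace {
    /** Issues two overlapping writes to the same node (hypothetical helper). */
    static <T extends DataObject> void race(DataBroker broker,
            InstanceIdentifier<T> path, T first, T second) {
        WriteTransaction tx1 = broker.newWriteOnlyTransaction();
        WriteTransaction tx2 = broker.newWriteOnlyTransaction();

        tx1.put(LogicalDatastoreType.CONFIGURATION, path, first);
        tx2.put(LogicalDatastoreType.CONFIGURATION, path, second);

        // Both commits race on the Leader; the loser's canCommit can fail
        // with OptimisticLockFailedException. The concern in this bug is that
        // Followers, applying the replicated modifications serially, might
        // still apply both, leaving the replicas inconsistent.
        tx1.submit();
        tx2.submit();
    }
}
```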
| Comments |
| Comment by Tom Pantelis [ 05/Feb/15 ] |
|
Did you observe this issue in a running cluster or is it theory? On the Leader, when applying state after consensus, we perform the commit phase (Shard#finishCommit). commit() on the IMDS doesn't (to my knowledge) raise OptimisticLockFailedException - that is raised in the canCommit phase. We really shouldn't get a failure in commit() on the Leader. It can throw an IllegalArgumentException if the current tree root differs from the candidate base root (computed during preCommit()), but that would mean preCommit() and commit() weren't done atomically, e.g. there was a preCommit() for txn 1, then a preCommit() for txn 2, then a commit() for txn 1. This would be a bug - the Shard is supposed to allow only one txn's 3-phase commit at a time. |
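For reference, a sketch of how the three phases map onto the yangtools DataTree calls and where each failure mentioned here can surface. The wrapper class and method are illustrative; the DataTree calls are the Helium-era yangtools API:

```java
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTree;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeCandidate;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeModification;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataValidationFailedException;

final class ThreePhaseCommit {
    /** Illustrative mapping of the Shard's 3-phase commit onto DataTree calls. */
    static void commit(DataTree tree, DataTreeModification mod)
            throws DataValidationFailedException {
        mod.ready(); // seal the modification before starting the 3-phase commit

        // canCommit: conflict detection happens here; a conflicting concurrent
        // change surfaces as a DataValidationFailedException subclass, which
        // the broker reports as OptimisticLockFailedException.
        tree.validate(mod);

        // preCommit: computes the candidate against the current tree root.
        DataTreeCandidate candidate = tree.prepare(mod);

        // commit: data is already validated; the remaining check is that the
        // tree root still equals the candidate's base root, which fails with
        // IllegalArgumentException ("Data tree root differs from candidate
        // base") if another candidate was committed in between.
        tree.commit(candidate);
    }
}
```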
| Comment by Moiz Raja [ 05/Feb/15 ] |
|
Tom, this was theory. I was somehow under the impression that commit could throw an exception. In either case, it looks like when a transaction is successfully replicated it may be OK to just create a new transaction and commit it; that would ensure we perform the exact same operations on the leader and followers. Anyway, if we can confirm that commit can never throw an exception, we probably do not need any changes for now. |
| Comment by Tom Pantelis [ 05/Feb/15 ] |
|
It can throw a "Data tree root differs from candidate base" exception as I mentioned before. In fact I've seen this in a couple of Jenkins test runs. But it's a bug and shouldn't happen in normal operation, so it's not something we should need to handle. Other than that I don't see anything else in InMemoryDataTree#commit that can fail (except paranoid arg checking which shouldn't trip). At this point the data has been validated and should be good to go. At least that's what I can tell from looking at the IMDS code - it would be interesting to get Robert's opinion on this; he knows the IMDS code much better than I do. We probably could create a new transaction and commit it, but that would repeat what we did previously, i.e. apply the mods to the tx and preCommit it. |
| Comment by Robert Varga [ 05/Feb/15 ] |
|
The only time the 'differs' exception is thrown is when you have two DataTreeCandidates prepare()d against the same data tree and attempt to commit() the second one. DataTree is inherently single-threaded. Once you commit() a candidate, the next logical thing you are allowed to do is a validate() or a prepare(). |
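To make the single-threaded contract concrete, a sketch of the illegal interleaving being described (illustrative only, not code from the Shard; mod1 and mod2 are two ready()'d modifications based on the same snapshot):

```java
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTree;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeCandidate;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeModification;

final class DifferingRootExample {
    /** Demonstrates the 'differs' failure on the second candidate. */
    static void illegalInterleaving(DataTree tree,
            DataTreeModification mod1, DataTreeModification mod2) {
        DataTreeCandidate c1 = tree.prepare(mod1);
        DataTreeCandidate c2 = tree.prepare(mod2); // prepared before c1 commits

        tree.commit(c1); // OK: c1's base root matches the current tree root
        tree.commit(c2); // throws IllegalArgumentException: c2 was computed
                         // against the pre-c1 root, which no longer matches
    }
}
```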
| Comment by Tom Pantelis [ 05/Feb/15 ] |
|
(In reply to Robert Varga from comment #4) Right - that's what I thought. The Shard code should only allow one 3-phase commit at a time, i.e. the sequence preCommit 1 then preCommit 2 should never happen. |
| Comment by Tom Pantelis [ 05/Feb/15 ] |
|
The sequence preCommit 1, preCommit 2, commit 2 can happen in the scenario where the replicate replies for txn 1 are delayed long enough that txn 1 expires and is no longer the current transaction, txn 2 begins its 3-phase commit, and consensus for txn 1 is then reached.
I was originally thinking that if replicate replies are delayed long enough to expire the current tx but we later receive consensus, then we should apply the state. But if another tx is in progress, this can cause the differing-root error. Perhaps we should just drop the commit if the txn is no longer current (i.e. expired)? Or only drop it if another txn is current? |
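A minimal sketch of the "drop if no longer current" option. The names here (LateConsensusGuard, onConsensusReached, currentTransactionId) are hypothetical; the actual Shard message handling differs:

```java
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTree;
import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeCandidate;

final class LateConsensusGuard {
    private final DataTree dataTree;
    private String currentTransactionId; // txn currently in its 3-phase commit

    LateConsensusGuard(DataTree dataTree) {
        this.dataTree = dataTree;
    }

    /** Hypothetical handler invoked when consensus is reached for a txn. */
    void onConsensusReached(String txnId, DataTreeCandidate candidate) {
        if (!txnId.equals(currentTransactionId)) {
            // The txn expired while waiting for replicate replies and another
            // txn may already be in its 3-phase commit; committing now could
            // hit the differing-root error, so drop the late commit.
            return;
        }
        dataTree.commit(candidate);
    }
}
```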
| Comment by Tom Pantelis [ 05/Feb/15 ] |
|
(In reply to Tom Pantelis from comment #6) Another option - re-send the CanCommitMessage to self for the txn so it's queued. If there was a subsequent conflicting change then it should fail, which should be fine. However, I'm not sure the DOM txn would support repeating canCommit and preCommit. |
| Comment by Moiz Raja [ 02/Jun/15 ] |
|
I think we should first create a test that reproduces this issue before we even consider it a real problem. If we cannot, then we should probably target this for Beryllium. |
| Comment by Ananthi Palaniswamy [ 19/Sep/16 ] |
|
The bug is not reproducible; we do not get the exception (OptimisticLockFailedException) while performing the transactions as described in the bug report. |
| Comment by Tom Pantelis [ 19/Sep/16 ] |
|
The issue in the original description would not occur because OptimisticLockFailedException happens in the canCommit phase, as noted earlier. The scenario I outlined on 2015-02-05 12:33:18 does not appear to be possible either. If the leader was isolated when it sent out the replicate messages, the majority partition would elect a new leader. If the majority partition did actually receive the replicate message but the reply was dropped/lost, the new leader would finish the commit in the PreLeader state; we have unit tests for this case. If the replicate messages were not received by the majority partition, then when the partition was healed, the previous leader would revert to follower due to the higher term and its log entry would get overwritten by the new leader. In the case where all nodes were partitioned, the pending transaction would remain in the queue, as we no longer time out transactions in the PENDING_COMMIT phase. I'll close this bug. |
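For context, the log-overwrite behavior referenced above ("its log entry would get overwritten by the new leader") is the standard Raft log-repair rule. A generic sketch of that rule, not the actual ODL RaftActor code:

```java
import java.util.ArrayList;
import java.util.List;

/** Generic Raft follower log repair (illustrative; not ODL's RaftActor). */
final class FollowerLogRepair {
    static final class Entry {
        final long term;
        final byte[] payload;
        Entry(long term, byte[] payload) { this.term = term; this.payload = payload; }
    }

    private final List<Entry> log = new ArrayList<>();

    /** Handles AppendEntries from a leader whose (higher) term was accepted. */
    void appendEntries(int prevLogIndex, List<Entry> leaderEntries) {
        int index = prevLogIndex + 1;
        for (Entry entry : leaderEntries) {
            if (index < log.size() && log.get(index).term != entry.term) {
                // Conflict: the healed ex-leader's un-replicated entry (e.g.
                // the pending transaction) is truncated and overwritten here.
                log.subList(index, log.size()).clear();
            }
            if (index == log.size()) {
                log.add(entry);
            }
            index++;
        }
    }
}
```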