[CONTROLLER-1138] Clustering: Leader and followers get to an inconsistent state when Leader transactions fail Created: 05/Feb/15  Updated: 19/Sep/16  Resolved: 19/Sep/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Kamal Rameshan Assignee: Kamal Rameshan
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2678

 Description   

Consider a scenario where there are concurrent writes to the same node on the Leader.
These writes would be replicated to the followers as well.

When we try to apply them to the state, one of those writes might fail (with an OptimisticLockFailedException) on the Leader's side.

But since the transactions are serialized on the Follower's side, they might all go through there.

Since we don't have a way to roll back, this would result in an inconsistent state between the Leader and the Follower(s).
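The reported divergence can be modeled with a toy leader/follower pair (illustrative only, not OpenDaylight code; all names are assumptions): the leader validates writes optimistically against a snapshot version, while the follower replays the replicated log serially without re-validation.

```java
import java.util.List;

public class DivergenceSketch {
    /** A write captured against a snapshot version of the store. */
    record Write(long baseVersion, String value) {}

    static class LeaderStore {
        long version = 0;
        String data = "";

        /** Optimistic check: fail if the base snapshot is stale. */
        boolean tryApply(Write w) {
            if (w.baseVersion() != version) {
                return false; // models the OptimisticLockFailedException case
            }
            data = w.value();
            version++;
            return true;
        }
    }

    static class FollowerStore {
        String data = "";

        /** Followers replay the replicated log serially, without re-checking. */
        void apply(Write w) { data = w.value(); }
    }

    public static void main(String[] args) {
        LeaderStore leader = new LeaderStore();
        FollowerStore follower = new FollowerStore();

        // Two concurrent writes, both based on version 0, and both already
        // replicated before the leader applied either of them.
        Write w1 = new Write(0, "A");
        Write w2 = new Write(0, "B");
        List<Write> replicatedLog = List.of(w1, w2);

        leader.tryApply(w1); // succeeds, version moves to 1
        leader.tryApply(w2); // fails: base version 0 is now stale

        for (Write w : replicatedLog) {
            follower.apply(w); // the follower applies both
        }

        // leader=A follower=B: the inconsistent state the report warns about
        System.out.println("leader=" + leader.data + " follower=" + follower.data);
    }
}
```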



 Comments   
Comment by Tom Pantelis [ 05/Feb/15 ]

Did you observe this issue in a running cluster or is it theory?

On the Leader, when applying state after consensus, we perform the commit phase (Shard#finishCommit). commit() on the IMDS doesn't (to my knowledge) raise OptimisticLockFailedException - this is raised in the canCommit phase.

We really shouldn't get a failure in commit() on the Leader. It can throw an IllegalArgException if the current tree root differs from the candidate base root (computed during preCommit()), but that means preCommit() and commit() weren't done atomically, e.g. there was a preCommit() for txn 1, then a preCommit() for txn 2, then a commit() for txn 1. This would be a bug: Shard is supposed to allow only one transaction's 3-phase commit at a time.
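The one-at-a-time invariant Tom describes could be enforced with a small guard, sketched below (assumed names, not the actual Shard implementation): at most one transaction may sit between preCommit() and commit(), so the sequence preCommit(1), preCommit(2), commit(1) is rejected outright.

```java
// Toy sketch of a single-outstanding-3PC guard (hypothetical, not Shard code).
public class SingleCommitGuard {
    private String currentTxn; // the txn holding the commit "slot", if any

    public synchronized void preCommit(String txnId) {
        if (currentTxn != null) {
            throw new IllegalStateException(
                "Transaction " + currentTxn + " is already in its commit phase");
        }
        currentTxn = txnId;
        // ... a real implementation would compute the candidate against
        // the current tree root here ...
    }

    public synchronized void commit(String txnId) {
        if (!txnId.equals(currentTxn)) {
            throw new IllegalStateException(
                "commit() for " + txnId + " but current txn is " + currentTxn);
        }
        // ... apply the candidate; the base root is guaranteed current ...
        currentTxn = null;
    }
}
```

With this guard, preCommit() for a second transaction fails fast instead of producing the "root differs" error later in commit().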

Comment by Moiz Raja [ 05/Feb/15 ]

Tom, this was theory. I was somehow under the impression that commit could throw an exception. In either case, it looks like when a transaction is successfully replicated it may be OK to just create a new transaction and commit it. That would ensure that we are doing the exact same operations on the leader and the followers.

Anyway, if we can confirm that a commit can never throw an exception, we probably do not need any changes for now.

Comment by Tom Pantelis [ 05/Feb/15 ]

It can throw a "Data tree root differs from candidate base" exception, as I mentioned before. In fact, I've seen this in a couple of Jenkins test runs. But it's a bug and shouldn't happen normally, such that we'd need to handle it. Other than that, I don't see anything else in InMemoryDataTree#commit (except paranoid argument checking, which shouldn't trip). At this point, the data has been validated and should be good to go. At least, that's what I can tell looking at the IMDS code. It would be interesting to get Robert's opinion on this; he knows the IMDS code much better than I do.

We probably could create a new transaction and commit it but that would repeat what we did previously, i.e. apply the mods to the tx and preCommit it.

Comment by Robert Varga [ 05/Feb/15 ]

The only time the 'differs' exception is thrown is when you have two DataTreeCandidates prepare()d against a data tree and attempt to commit() the second one. DataTree is inherently single-threaded: once you commit() a candidate, the next logical thing you are allowed to do is a validate() or a prepare().
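The contract Robert describes can be modeled with a toy single-threaded tree (assumed names, not the real InMemoryDataTree): each prepare() captures the current root, and commit() refuses any candidate whose base root is no longer the tree's current root.

```java
// Toy model of the DataTree candidate contract (hypothetical names).
public class DataTreeContract {
    public static class Candidate {
        final long baseRoot; // root version this candidate was prepared against
        Candidate(long baseRoot) { this.baseRoot = baseRoot; }
    }

    private long root = 0;

    /** prepare() captures the current root as the candidate's base. */
    public Candidate prepare() {
        return new Candidate(root);
    }

    /** commit() rejects a candidate prepared against a stale root. */
    public void commit(Candidate c) {
        if (c.baseRoot != root) {
            throw new IllegalStateException("Data tree root differs from candidate base");
        }
        root++; // committing produces a new root
    }
}
```

Preparing two candidates against the same root and committing both reproduces the error: the first commit advances the root, so the second candidate's base no longer matches.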

Comment by Tom Pantelis [ 05/Feb/15 ]

Right - that's what I thought. The Shard code should only allow one 3-phase commit at a time, i.e. the sequence preCommit 1 then preCommit 2 should never happen.


Comment by Tom Pantelis [ 05/Feb/15 ]

The sequence preCommit 1, preCommit 2, commit 2 can happen with the following scenario:

  • txn 1 is preCommitted
  • txn 1 is replicated to followers however the message or reply is delayed (network issues)
  • 30 seconds elapse and txn 1 is timed out, aborted, and removed as the current txn (Shard#handleTransactionCommitTimeoutCheck).
  • this allows txn 2 to proceed - preCommit and replicate
  • the replicate replies are finally received for txn 1
  • state is applied for txn 1 and Shard#finishCommit is called
  • however txn 1 is no longer the current tx. In that case, if it can find the CohortEntry in the cache, it creates a new txn to preCommit and commit.
  • the replicate replies are received for txn 2
  • state is applied for txn 2 and Shard#finishCommit is called
  • it is still the current tx, so it commits to the IMDS
  • however, the data tree current root now differs from the candidate base

I was originally thinking that if replicate replies are delayed long enough to expire the current tx but we later receive consensus, then we should apply the state. But if another tx is in progress this can cause the differing root error.

Perhaps we should just drop the commit if the txn is no longer current (i.e. expired)? Or only drop it if there's another txn current?
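The "drop the commit if the txn is no longer current" option could look roughly like this (hypothetical helper, not actual Shard code): when late replicate replies arrive for a transaction that has already been timed out and replaced, skip the commit rather than re-creating the transaction.

```java
// Sketch of a finishCommit-style guard that drops stale commits
// (assumed names; illustrative only).
public class FinishCommitGuard {
    private String currentTxnId; // txn currently in its commit phase, if any

    public void setCurrent(String txnId) {
        currentTxnId = txnId;
    }

    /** Returns true if the commit was applied, false if dropped as stale. */
    public boolean finishCommit(String txnId, Runnable applyToDataTree) {
        if (!txnId.equals(currentTxnId)) {
            // Late consensus for an expired txn: dropping avoids committing
            // against a tree root that has moved on (the 'differs' error).
            return false;
        }
        applyToDataTree.run();
        currentTxnId = null;
        return true;
    }
}
```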

Comment by Tom Pantelis [ 05/Feb/15 ]

Another option - re-send the CanCommitMessage to self for the txn so it's queued. If there was a subsequent conflicting change then it should fail which should be fine. However I'm not sure if the DOM txn would support repeating canCommit and preCommit.


Comment by Moiz Raja [ 02/Jun/15 ]

I think we should first create a test to reproduce this issue before we even consider it to be an issue. If we cannot, then we should probably target this for Beryllium (Be).

Comment by Ananthi Palaniswamy [ 19/Sep/16 ]

The bug is not reproducible; the exception (OptimisticLockFailedException) is not raised while performing the transactions described in the bug report.
Tested scenario: log out the Leader node. Of the remaining two nodes, one gets elected as the new Leader and the other remains a follower. No exception is raised while performing transactions on the leader and follower nodes.

Comment by Tom Pantelis [ 19/Sep/16 ]

The issue in the original description would not occur because OptimisticLockFailedException happens in the canCommit phase, as noted earlier.

The scenario I outlined on 2015-02-05 12:33:18 does not appear likely to occur either. If the leader was isolated when it sent out replicate messages, the majority partition would elect a new leader. If the majority partition did actually receive the replicate message but the reply was dropped or lost, then the new leader would finish the commit in the PreLeader state. We have unit tests for this case. If the replicate messages were not received by the majority partition, then when the partition was healed, the previous leader would revert to follower due to the higher term and its log entry would get overwritten by the new leader's.

In the case where all nodes were partitioned, the pending transaction would remain in the queue as we no longer time out the transaction in the PENDING_COMMIT phase.

I'll close this bug.

Generated at Wed Feb 07 19:54:47 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.