Details
Bug
Status: Resolved
Resolution: Won't Do
Helium
Operating System: Linux
Platform: Other
3938
Description
I found this problem in our custom application running on Helium, but from the source code it looks like the same issue can also arise (with much lower probability) on the current (Lithium/master) codebase.
Here is what I found:
1) ShardCommitCoordinator.queuedCohortEntries grows to the point where the next transaction times out (with AskTimeoutException) while executing ThreePhaseCommitCohortProxy.canCommit.
2) After ThreePhaseCommitCohortProxy.canCommit finishes with that exception, AbortTransaction is sent to the Shard actor.
3) Shard.doAbortTransaction is called, but it only handles the case where ShardCommitCoordinator has already started executing doCanCommit for the transaction being aborted.
4) Some time later, ShardCommitCoordinator starts executing doCanCommit for the transaction that was already aborted. cohortEntry.getCohort().canCommit().get() returns true, and CanCommitTransactionReply(true) is sent to the "internal ask actor" of an ask that has already timed out.
5) ShardCommitCoordinator does not start working on the next item in ShardCommitCoordinator.queuedCohortEntries until some code tries to abort the transaction again because of some other timeout.
This results in a further slowdown of ShardCommitCoordinator (up to several seconds on the Helium codebase), and all new transactions fail with AskTimeoutException.
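To make the stall easier to follow, here is a minimal sketch of the queueing behaviour described in steps 1 to 5. It is a simplified model, not the actual ShardCommitCoordinator code: the class CommitQueueSketch, its methods, and the removeQueuedOnAbort flag are all hypothetical names introduced for illustration. With the flag off, an abort for a transaction that is still waiting in the queue is ignored, so the aborted transaction later becomes the current one and blocks everything behind it. With the flag on, the sketch shows one possible mitigation (dropping the queued entry on abort), which is an assumption on my part and not necessarily how the project addressed it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified, hypothetical model of a commit queue; not the real ShardCommitCoordinator. */
class CommitQueueSketch {
    private final Deque<String> queued = new ArrayDeque<>();
    private final boolean removeQueuedOnAbort; // hypothetical switch for the possible mitigation
    private String current;                    // transaction currently in its canCommit phase

    CommitQueueSketch(boolean removeQueuedOnAbort) {
        this.removeQueuedOnAbort = removeQueuedOnAbort;
    }

    /** Starts canCommit immediately if idle, otherwise queues the transaction. */
    void canCommit(String txId) {
        if (current == null) {
            current = txId;
        } else {
            queued.add(txId);
        }
    }

    /** Abort as described in step 3: only the current transaction is handled. */
    void abort(String txId) {
        if (txId.equals(current)) {
            current = null;
            startNext();
        } else if (removeQueuedOnAbort) {
            queued.remove(txId); // possible mitigation: drop the entry before it is ever started
        }
        // Otherwise the abort is silently ignored and the entry stays queued.
    }

    /** The current transaction completes, so the next queued entry is started. */
    void finishCurrent() {
        current = null;
        startNext();
    }

    private void startNext() {
        current = queued.poll(); // null when the queue is empty
    }

    String current() {
        return current;
    }

    int queuedCount() {
        return queued.size();
    }

    public static void main(String[] args) {
        for (boolean mitigate : new boolean[] {false, true}) {
            CommitQueueSketch q = new CommitQueueSketch(mitigate);
            q.canCommit("tx1");
            q.canCommit("tx2");
            q.canCommit("tx3");
            q.abort("tx2");      // caller of tx2 timed out and sent AbortTransaction
            q.finishCurrent();   // tx1 completes
            System.out.println("removeQueuedOnAbort=" + mitigate
                    + " current=" + q.current() + " queued=" + q.queuedCount());
        }
    }
}
```

In the first run the coordinator ends up holding tx2, whose caller has already given up, while tx3 waits behind it; this corresponds to step 5, where nothing moves until another abort arrives. In the second run tx3 becomes current right away.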