[CONTROLLER-1386] Repeating AskTimeoutException when commits not getting executed fast enough Created: 06/Jul/15  Updated: 23/Oct/15  Resolved: 23/Oct/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Anton Frolov Assignee: Tom Pantelis
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Other


External issue ID: 3938

 Description   

I found this problem in our custom application running on Helium, but from reading the source code it appears that the same issue can also arise, with much lower probability, on the current (Lithium/master) codebase.
Here is what I found:
1) ShardCommitCoordinator.queuedCohortEntries grows to the point where the next transaction times out (with AskTimeoutException) while executing ThreePhaseCommitCohortProxy.canCommit.
2) After ThreePhaseCommitCohortProxy.canCommit fails with that exception, AbortTransaction is sent to the Shard actor.
3) Shard.doAbortTransaction is called, but it only handles the case where ShardCommitCoordinator has already started executing doCanCommit for the transaction being aborted.
4) Some time later, ShardCommitCoordinator starts executing doCanCommit for the transaction that was already aborted. cohortEntry.getCohort().canCommit().get() returns true, and CanCommitTransactionReply(true) is sent to the "internal ask actor" of an ask that has already timed out.
5) ShardCommitCoordinator does not start working on the next item in ShardCommitCoordinator.queuedCohortEntries until some code tries to abort the transaction again because of some other timeout.

This slows ShardCommitCoordinator down further (by up to several seconds on the Helium codebase), and all new transactions then fail with AskTimeoutException.
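
To make the sequence in steps 1 to 5 easier to follow, here is a minimal single-threaded sketch in Java. It is a toy model, not the controller code: the class CommitQueueSketch, its methods, the transaction IDs, and the dropQueuedOnAbort switch are all hypothetical names introduced for illustration. Only the field names currentCohortEntry and queuedCohortEntries echo the report. With the switch off, an abort for a queued transaction is ignored, so the aborted transaction later becomes current and blocks everything behind it. With the switch on, the sketch shows one possible mitigation (removing the queued entry on abort); nothing in this report confirms that this is what the upstream code does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the commit queue described in this report; not the controller code. */
class CommitQueueSketch {
    private final Deque<String> queuedCohortEntries = new ArrayDeque<>();
    private final boolean dropQueuedOnAbort; // hypothetical mitigation switch
    private String currentCohortEntry;       // transaction currently being processed
    final List<String> canCommitReplies = new ArrayList<>(); // replies sent, in order

    CommitQueueSketch(boolean dropQueuedOnAbort) {
        this.dropQueuedOnAbort = dropQueuedOnAbort;
    }

    /** A canCommit request arrives: run it now if idle, otherwise queue it. */
    void handleCanCommit(String txId) {
        if (currentCohortEntry == null) {
            doCanCommit(txId);
        } else {
            queuedCohortEntries.add(txId);
        }
    }

    /** The current transaction finished committing; move on to the next one. */
    void handleCommitDone(String txId) {
        if (txId.equals(currentCohortEntry)) {
            currentCohortEntry = null;
            startNext();
        }
    }

    /** Abort as in step 3: only the current transaction is handled by default. */
    void handleAbort(String txId) {
        if (txId.equals(currentCohortEntry)) {
            currentCohortEntry = null;
            startNext();
        } else if (dropQueuedOnAbort) {
            queuedCohortEntries.remove(txId);
        }
        // Otherwise a queued transaction stays queued even though its
        // client has already given up on it.
    }

    String current() { return currentCohortEntry; }
    int queued() { return queuedCohortEntries.size(); }

    private void doCanCommit(String txId) {
        currentCohortEntry = txId;
        canCommitReplies.add(txId); // stands in for CanCommitTransactionReply(true)
    }

    private void startNext() {
        String next = queuedCohortEntries.poll();
        if (next != null) {
            doCanCommit(next);
        }
    }

    /** Replays the scenario and returns the coordinator for inspection. */
    static CommitQueueSketch replay(boolean dropQueuedOnAbort) {
        CommitQueueSketch c = new CommitQueueSketch(dropQueuedOnAbort);
        c.handleCanCommit("tx-1");  // becomes current
        c.handleCanCommit("tx-2");  // queued behind tx-1
        c.handleCanCommit("tx-3");  // queued behind tx-2
        c.handleAbort("tx-2");      // client ask timed out while tx-2 was queued
        c.handleCommitDone("tx-1"); // coordinator picks the next queued entry
        return c;
    }

    public static void main(String[] args) {
        CommitQueueSketch stalled = replay(false);
        System.out.println("current=" + stalled.current() + " queued=" + stalled.queued());
        CommitQueueSketch mitigated = replay(true);
        System.out.println("current=" + mitigated.current() + " queued=" + mitigated.queued());
    }
}
```

In the first replay the coordinator ends up with the already aborted tx-2 as its current entry and tx-3 still waiting behind it, which corresponds to the stall in step 5. In the second replay tx-2 never becomes current and tx-3 proceeds as soon as tx-1 is done.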



 Comments   
Comment by Vratko Polak [ 23/Oct/15 ]

No further Service Releases of Helium are planned. Feel free to re-open if similar symptoms are present in Lithium or later.
