  1. controller
  2. CONTROLLER-1386

Repeating AskTimeoutException when commits are not executed fast enough


Details

    • Bug
    • Status: Resolved
    • Resolution: Won't Do
    • Helium
    • None
    • clustering
    • None
    • Operating System: Linux
      Platform: Other

    • 3938

    Description

      I found this problem in our custom application running on Helium, but from reading the source code it appears that the same issue can also occur (with much lower probability) on the current (Lithium/master) codebase.
      Here is what I found:
      1) ShardCommitCoordinator.queuedCohortEntries grows to the point where the next transaction times out (with AskTimeoutException) while executing ThreePhaseCommitCohortProxy.canCommit.
      2) After ThreePhaseCommitCohortProxy.canCommit fails with that exception, AbortTransaction is sent to the Shard actor.
      3) Shard.doAbortTransaction is called, but it only handles the case where ShardCommitCoordinator has already started executing doCanCommit for the transaction being aborted. An entry that is still queued is left in place.
      4) Some time later, ShardCommitCoordinator starts executing doCanCommit for the transaction that was already aborted; cohortEntry.getCohort().canCommit().get() returns true, and CanCommitTransactionReply(true) is sent to the "internal ask actor" of an ask that has already timed out.
      5) ShardCommitCoordinator does not start working on the next item in ShardCommitCoordinator.queuedCohortEntries until some code tries to abort the transaction again because of some other timeout.

      This further slows down ShardCommitCoordinator (by up to several seconds on the Helium codebase), and all new transactions fail with AskTimeoutException.
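
The sequence in steps 1) to 5) can be reproduced in a small stand-alone model. The sketch below is not the ShardCommitCoordinator source: the class CommitQueueSketch, its methods, and the dropQueuedOnAbort flag are hypothetical names introduced only for illustration, and the "drop queued entry on abort" branch is one possible mitigation, not a fix taken from the project.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of the queueing behaviour described above.
// Only the field name queuedCohortEntries mirrors the report; everything
// else is an assumption made for illustration.
final class CommitQueueSketch {
    private final Deque<String> queuedCohortEntries = new ArrayDeque<>();
    private final boolean dropQueuedOnAbort; // hypothetical mitigation switch
    private String currentTx;                // transaction currently in canCommit

    CommitQueueSketch(boolean dropQueuedOnAbort) {
        this.dropQueuedOnAbort = dropQueuedOnAbort;
    }

    // A canCommit request runs at once if the coordinator is idle,
    // otherwise it waits in the queue.
    void canCommit(String txId) {
        if (currentTx == null) {
            currentTx = txId;
        } else {
            queuedCohortEntries.add(txId);
        }
    }

    // The current transaction is done (committed or aborted); move on.
    void finishCurrent() {
        currentTx = queuedCohortEntries.poll(); // null when the queue is empty
    }

    // Abort as described in step 3: only the in-progress transaction is handled.
    void abort(String txId) {
        if (txId.equals(currentTx)) {
            finishCurrent();
        } else if (dropQueuedOnAbort) {
            queuedCohortEntries.remove(txId); // mitigation: forget the queued entry
        }
        // Otherwise the abort is ignored and the entry stays queued.
    }

    String current() {
        return currentTx;
    }

    // Steps 1) to 4): tx2 times out while queued, is aborted, then tx1 finishes.
    static String runScenario(boolean dropQueuedOnAbort) {
        CommitQueueSketch c = new CommitQueueSketch(dropQueuedOnAbort);
        c.canCommit("tx1"); // starts immediately
        c.canCommit("tx2"); // queued behind tx1
        c.canCommit("tx3"); // queued behind tx2
        c.abort("tx2");     // the ask for tx2 timed out while it was still queued
        c.finishCurrent();  // tx1 completes, the coordinator advances
        return c.current();
    }

    public static void main(String[] args) {
        System.out.println("abort ignored for queued entry: current = " + runScenario(false));
        System.out.println("queued entry dropped on abort:  current = " + runScenario(true));
    }
}
```

In this model, when the abort of a queued entry is ignored, the coordinator ends up holding the already-aborted tx2 as its current transaction, and tx3 waits until something aborts tx2 again (step 5). With the hypothetical flag enabled, tx2 is removed from the queue and tx3 is picked up directly.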

          People

            tpantelis Tom Pantelis
            anton.frolov@pacnet.com Anton Frolov
            Votes: 0
            Watchers: 3
