[CONTROLLER-1927] Transaction can become stuck in COMMIT_PENDING when a node flaps leader -> follower -> leader Created: 17/Jan/20  Updated: 27/Apr/20  Resolved: 23/Mar/20

Status: Resolved
Project: controller
Component/s: None
Affects Version/s: Oxygen SR4, Fluorine SR3, Sodium SR1, Neon SR3
Fix Version/s: Sodium SR3, Magnesium SR1, 2.0.0

Type: Bug Priority: High
Reporter: Tomas Cere Assignee: Tomas Cere
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to CONTROLLER-1535 sal-akka-raft: Eliminate ClientReques... In Review
relates to CONTROLLER-1928 Regression detected in CSIT Resolved

 Description   

Normally, an entry is applied as follows:

  1. The leader sends the append entry off to persistence and replicates it to followers.
  2. The leader creates a ClientRequestTracker for the entry.
  3. When persistence and replication of the entry are done, the leader moves its commit index.
  4. Part of moving the commit index is sending an ApplyState message, which finalizes the entry application in the DataTree.
  5. Creating the ApplyState message involves checking whether a ClientRequestTracker is present; if it is, an identifier is added to the message.
    • This determines how the entry application is finalized in the DataTree.
    • If the tracker is present, the entry is applied as if it originated on the leader.
    • If it is not present, the entry is applied as if the node is a follower.
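
The decision made in step 5 can be sketched roughly as below. This is a simplified, hypothetical model and not the actual sal-akka-raft code: the class name ApplyStateSketch, its methods, and the map keyed by log index are all assumptions made for illustration. Only the names ClientRequestTracker and ApplyState come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of steps 2, 4 and 5; not the real sal-akka-raft classes.
final class ApplyStateSketch {

    /** Stand-in for the ApplyState message; the identifier is null when no tracker was found. */
    static final class ApplyState {
        final long logIndex;
        final String identifier;

        ApplyState(long logIndex, String identifier) {
            this.logIndex = logIndex;
            this.identifier = identifier;
        }

        /** True when the entry will be finalized as if it originated on this leader. */
        boolean appliedAsLeader() {
            return identifier != null;
        }
    }

    // Stand-in for the ClientRequestTracker bookkeeping (assumed to be keyed by log index).
    private final Map<Long, String> trackers = new HashMap<>();

    /** Step 2: the leader records a tracker for the entry it has just submitted. */
    void trackClientRequest(long logIndex, String identifier) {
        trackers.put(logIndex, identifier);
    }

    /** Steps 4 and 5: moving the commit index builds an ApplyState, with an identifier only if a tracker exists. */
    ApplyState advanceCommitIndex(long logIndex) {
        String identifier = trackers.remove(logIndex);
        return new ApplyState(logIndex, identifier);
    }
}
```

In this sketch, an entry with a recorded tracker yields an ApplyState for which appliedAsLeader() is true, while an entry without one is finalized along the follower path.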

The problem arises when the node flaps through a leader -> follower -> leader transition after step 2 and before step 4. The new leader state no longer has the ClientRequestTracker that was created in the previous leader state, so when it reaches step 5 it creates the ApplyState without an identifier and the entry is applied as if the node were a follower.

As a result, the entry is applied without finishCommit, and the transaction remains stuck in COMMIT_PENDING state until the node is restarted.
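
The failure sequence can be reproduced in a toy model such as the one below. Everything here is a hypothetical sketch rather than the controller's implementation: the class LeaderFlapSketch, the TxState enum, and the assumption that tracker state is simply dropped during the leader -> follower -> leader transition are illustrative only. The names finishCommit and COMMIT_PENDING are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the bug; not the actual controller code.
final class LeaderFlapSketch {
    enum TxState { COMMIT_PENDING, COMMITTED }

    // Stand-in for ClientRequestTracker bookkeeping, keyed by log index (assumption).
    private final Map<Long, String> trackers = new HashMap<>();
    private final Map<String, TxState> transactions = new HashMap<>();

    /** Steps 1 and 2: the leader submits the entry and records a tracker for it. */
    void submitAsLeader(long logIndex, String txId) {
        transactions.put(txId, TxState.COMMIT_PENDING);
        trackers.put(logIndex, txId);
    }

    /** Assumption: trackers from the previous leader state do not survive the flap. */
    void flapLeaderFollowerLeader() {
        trackers.clear();
    }

    /** Steps 3 to 5: finishCommit runs only when a tracker, and hence an identifier, is found. */
    void advanceCommitIndex(long logIndex) {
        String txId = trackers.remove(logIndex);
        if (txId != null) {
            finishCommit(txId);
        }
        // Otherwise the entry is applied follower-style and nothing completes the transaction.
    }

    private void finishCommit(String txId) {
        transactions.put(txId, TxState.COMMITTED);
    }

    TxState stateOf(String txId) {
        return transactions.get(txId);
    }

    public static void main(String[] args) {
        LeaderFlapSketch stable = new LeaderFlapSketch();
        stable.submitAsLeader(1L, "tx-1");
        stable.advanceCommitIndex(1L);
        System.out.println(stable.stateOf("tx-1")); // prints COMMITTED

        LeaderFlapSketch flapping = new LeaderFlapSketch();
        flapping.submitAsLeader(1L, "tx-2");
        flapping.flapLeaderFollowerLeader(); // happens after step 2 and before step 4
        flapping.advanceCommitIndex(1L);
        System.out.println(flapping.stateOf("tx-2")); // prints COMMIT_PENDING
    }
}
```

In this model the second transaction never leaves COMMIT_PENDING, because no later event calls finishCommit for it.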

