Last chunk of snapshot timing out on larger snapshots (CONTROLLER-1907)

[CONTROLLER-2067] Install Snapshot process gets stuck indefinitely Created: 02/Jan/23  Updated: 03/Jan/23

Status: Open
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Sub-task Priority: Medium
Reporter: Tibor Král Assignee: Tibor Král
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


Description

If the Follower starts SnapshotCapture roughly at the same time as the Leader starts SnapshotInstall for said Follower, these two processes can collide in a nasty way.

The SnapshotCapture switches the Follower's SnapshotManager to the PERSISTING state. If the Leader manages to send all Snapshot chunks while the Follower is in this state, the Follower reconstructs the Snapshot and tries to apply it. However, this call to ApplySnapshot is not handled by the SnapshotManager.Idle implementation:

@Override
public void apply(final ApplySnapshot toApply) {
    SnapshotManager.this.applySnapshot = toApply;
    lastSequenceNumber = context.getPersistenceProvider().getLastSequenceNumber();
    log.info("lastSequenceNumber prior to persisting applied snapshot: {}", lastSequenceNumber);
    context.getPersistenceProvider().saveSnapshot(toApply.getSnapshot());
    SnapshotManager.this.currentState = PERSISTING;
} 

Since the Follower's SnapshotManager is still persisting the Snapshot from the CaptureSnapshot process, the Leader's ApplySnapshot call is handled by the PERSISTING state instead, which only logs a message and discards the request:

@Override
public void apply(final ApplySnapshot snapshot) {
    log.debug("apply should not be called in state {}", this);
} 

Here the InstallSnapshot gets stuck. Since the Snapshot was not applied, no InstallSnapshotFinished message is sent back to the Leader. The Leader therefore never closes the LeaderInstallSnapshotState and assumes the installation is still ongoing. It neither sends nor re-sends any chunks, since the Follower has already received all of them. It cannot replicate any AppendEntries either, since the LeaderInstallSnapshotState is still in place. The Shard remains in this stalemate forever.
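The sequence described above can be reproduced with a small, self-contained model. This is only a sketch: FollowerSnapshotManager, Leader, and every method below are hypothetical stand-ins invented for illustration, not the real controller classes, and message passing is collapsed into direct method calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the collision. None of these types are
// the real controller classes; they only mirror the behaviour described above.
final class SnapshotCollisionSketch {
    enum State { IDLE, PERSISTING }

    /** Stand-in for the Follower's SnapshotManager with two states. */
    static final class FollowerSnapshotManager {
        State state = State.IDLE;
        final List<String> log = new ArrayList<>();

        /** Local snapshot capture: moves the manager to PERSISTING. */
        void capture() {
            if (state == State.IDLE) {
                state = State.PERSISTING;
            }
        }

        /** Returns true only if the snapshot was accepted for persisting. */
        boolean apply(String snapshot) {
            if (state == State.IDLE) {
                state = State.PERSISTING;
                return true;
            }
            // Mirrors the PERSISTING implementation: log and drop the request.
            log.add("apply should not be called in state " + state);
            return false;
        }

        /** Persistence finished: the manager returns to IDLE. */
        void persistenceComplete() {
            state = State.IDLE;
        }
    }

    /** Stand-in for the Leader's per-follower install tracking. */
    static final class Leader {
        // Models the presence of a LeaderInstallSnapshotState for the follower.
        boolean installInProgress;

        /** Sends all chunks; the follower tries to apply the rebuilt snapshot. */
        void installSnapshot(FollowerSnapshotManager follower, String snapshot) {
            installInProgress = true;
            if (follower.apply(snapshot)) {
                follower.persistenceComplete();
                onInstallSnapshotFinished(); // reply only exists if applied
            }
            // Otherwise nothing comes back, and no chunk is left to re-send.
        }

        void onInstallSnapshotFinished() {
            installInProgress = false;
        }

        /** AppendEntries replication is blocked while an install is tracked. */
        boolean canReplicate() {
            return !installInProgress;
        }
    }

    /** Runs one install, optionally colliding with a local capture. */
    static boolean canReplicateAfterInstall(boolean captureInFlight) {
        FollowerSnapshotManager follower = new FollowerSnapshotManager();
        Leader leader = new Leader();
        if (captureInFlight) {
            follower.capture();
        }
        leader.installSnapshot(follower, "snapshot-from-leader");
        // Even once the follower's own capture is persisted, nothing wakes the leader.
        follower.persistenceComplete();
        return leader.canReplicate();
    }

    public static void main(String[] args) {
        System.out.println("collision: canReplicate=" + canReplicateAfterInstall(true));
        System.out.println("no collision: canReplicate=" + canReplicateAfterInstall(false));
    }
}
```

In this model the Leader can replicate again only when the Follower was idle at the moment the last chunk arrived. When the capture is in flight, the dropped apply leaves installInProgress set, and nothing in the model ever clears it, which matches the stalemate in the report.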

