• Type: Sub-task
    • Resolution: Unresolved
    • Priority: Medium
    • clustering

      If the Follower starts SnapshotCapture roughly at the same time as the Leader starts SnapshotInstall for said Follower, these two processes can collide in a nasty way.

      The SnapshotCapture switches the Follower's SnapshotManager to state PERSISTING. If the Leader manages to send all Snapshot chunks while the Follower is in this state, the Follower reconstructs the Snapshot and tries to apply it. This call to ApplySnapshot, however, is not handled by the SnapshotManager.Idle implementation:

      @Override
      public void apply(final ApplySnapshot toApply) {
          SnapshotManager.this.applySnapshot = toApply;
          lastSequenceNumber = context.getPersistenceProvider().getLastSequenceNumber();
          log.info("lastSequenceNumber prior to persisting applied snapshot: {}", lastSequenceNumber);
          context.getPersistenceProvider().saveSnapshot(toApply.getSnapshot());
          SnapshotManager.this.currentState = PERSISTING;
      } 

      Since the Follower's SnapshotManager is still persisting the Snapshot from the CaptureSnapshot process, the call to ApplySnapshot from the Leader is handled by the PERSISTING state instead, which only logs a debug message and drops the request:

      @Override
      public void apply(final ApplySnapshot snapshot) {
          log.debug("apply should not be called in state {}", this);
      } 

      Here the InstallSnapshot gets stuck. Since the Snapshot wasn't applied, no InstallSnapshotFinished message is sent back to the Leader. Thus the Leader doesn't close the LeaderInstallSnapshotState and assumes the installation is still ongoing. It neither sends nor re-sends any chunks, since all of them were already received by the Follower. It can't replicate any AppendEntries either, since the LeaderInstallSnapshotState is still there. The Shard remains in this stalemate forever.
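
      The sequence described above can be illustrated with a small, self-contained toy model. This is only a sketch under simplifying assumptions, not the actual controller code: the classes Follower and Leader, their methods, and the use of a plain string for the InstallSnapshotFinished reply are hypothetical stand-ins for the SnapshotManager states and the LeaderInstallSnapshotState.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the collision. All names are hypothetical, not the real controller classes. */
class SnapshotCollisionSketch {

    enum State { IDLE, PERSISTING }

    /** Simplified stand-in for the Follower's SnapshotManager. */
    static class Follower {
        State state = State.IDLE;
        final List<String> repliesToLeader = new ArrayList<>();

        /** Local capture: the manager stays in PERSISTING until the save completes. */
        void startCapture() {
            state = State.PERSISTING;
        }

        /** Local save completed: back to IDLE. */
        void captureSaved() {
            state = State.IDLE;
        }

        /** Called once all chunks from the Leader are reassembled. Returns true if a reply was sent. */
        boolean applySnapshot(String snapshot) {
            if (state != State.IDLE) {
                // Mirrors the "apply should not be called in state" branch:
                // the request is dropped and nothing is reported to the Leader.
                return false;
            }
            state = State.PERSISTING;
            // Assumption: in this sketch the save of the applied snapshot completes immediately.
            state = State.IDLE;
            repliesToLeader.add("InstallSnapshotFinished");
            return true;
        }
    }

    /** Simplified stand-in for the Leader's per-follower install state. */
    static class Leader {
        boolean installInProgress = true;

        void onReply(String reply) {
            if ("InstallSnapshotFinished".equals(reply)) {
                installInProgress = false;
            }
        }

        /** AppendEntries are held back while an install is considered ongoing. */
        boolean canSendAppendEntries() {
            return !installInProgress;
        }
    }

    /** Runs one install attempt; returns whether the Leader can replicate afterwards. */
    static boolean runInstall(boolean captureInFlight) {
        Follower follower = new Follower();
        Leader leader = new Leader();
        if (captureInFlight) {
            follower.startCapture();
        }
        follower.applySnapshot("snapshot-from-leader");
        follower.repliesToLeader.forEach(leader::onReply);
        return leader.canSendAppendEntries();
    }

    public static void main(String[] args) {
        System.out.println("no collision, leader can replicate: " + runInstall(false));
        System.out.println("collision, leader can replicate: " + runInstall(true));
    }
}
```

      In the collision case the toy Follower never produces a reply, so the toy Leader never clears its install state. This corresponds to the stalemate described above. The sketch does not model chunk transfer or any timeout.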

            tibor.kral Tibor Král