  1. controller
  2. CONTROLLER-1236

Clustering: Log entries may be missing or not applied on persistence recovery after prior snapshot

Details

    • Bug
    • Status: Resolved
    • Resolution: Done
    • Post-Helium
    • None
    • mdsal
    • None
    • Operating System: All
      Platform: All

    • 2948
    • High

    Description

      I created 60K entries in a YANG list, which triggered 3 snapshots. However, after restarting, I saw only 59945 entries; 55 were missing. The last snapshot contained the last 55 unapplied log entries. They were present in the in-memory journal log but weren't applied to the state. There should have been 55 ApplyJournalEntries messages recovered, but none were.

      The problem is that the ApplyJournalEntries messages were deleted from the persisted journal log when we saved the snapshot. When we initiate a snapshot, the unapplied log entries are captured immediately, but the snapshot data is captured asynchronously, so subsequent ReplicatedLogEntry and ApplyJournalEntries messages could have been saved to persistence by the time we save the snapshot. On save snapshot success, we delete messages up to the sequence number provided by Akka in the SaveSnapshotSuccess message. However, this is the sequence number obtained at the time the snapshot is saved, so the deletion also covers any messages saved after the snapshot was captured.
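
The race described above can be illustrated with a toy model. This is only a sketch: `ToyJournal` and `run_buggy_snapshot` are hypothetical names invented for illustration, the message strings are placeholders, and nothing here is the Akka persistence API or the controller code.

```python
# Toy model of a persisted journal. NOT the Akka API; all names are hypothetical.
class ToyJournal:
    def __init__(self):
        self.messages = {}  # sequence number -> persisted message
        self.last_seq = 0   # highest sequence number assigned so far

    def persist(self, message):
        self.last_seq += 1
        self.messages[self.last_seq] = message

    def delete_up_to(self, seq):
        # Remove every message whose sequence number is <= seq.
        for s in [s for s in self.messages if s <= seq]:
            del self.messages[s]


def run_buggy_snapshot():
    journal = ToyJournal()
    journal.persist("ReplicatedLogEntry(index=1)")

    # Snapshot capture starts here: entry 1 is still unapplied, so it is
    # included in the snapshot. The state data is gathered asynchronously,
    # so more messages are persisted before the snapshot is saved.
    journal.persist("ApplyJournalEntries(toIndex=1)")
    journal.persist("ReplicatedLogEntry(index=2)")

    # The snapshot is saved now, and the success reply reports the sequence
    # number at save time, which already covers the two later messages.
    seq_at_save = journal.last_seq
    journal.delete_up_to(seq_at_save)
    return journal.messages
```

In this model the journal ends up empty: the ApplyJournalEntries message for entry 1 and the later log entry are both deleted, so on recovery entry 1 would be restored from the snapshot but never applied, and entry 2 would be missing.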

      Solution:

      We can obtain the last sequence number from Akka and record it when we capture the snapshot. On save snapshot success, we use the recorded sequence number, rather than the one in the SaveSnapshotSuccess message, when deleting messages.
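
A minimal sketch of the proposed fix, again as a toy model rather than the actual change: `ToyPersistentActor`, its methods, and `run_fixed_snapshot` are hypothetical names for illustration, and the sequence-number bookkeeping is simulated instead of being obtained from Akka.

```python
# Toy model of the fix. NOT the Akka API; all names are hypothetical.
class ToyPersistentActor:
    def __init__(self):
        self.journal = {}                 # sequence number -> persisted message
        self.last_sequence_nr = 0         # highest sequence number assigned so far
        self.captured_sequence_nr = None  # recorded when a snapshot capture starts

    def persist(self, message):
        self.last_sequence_nr += 1
        self.journal[self.last_sequence_nr] = message

    def capture_snapshot(self):
        # Record the sequence number at capture time, before any later
        # messages can be persisted.
        self.captured_sequence_nr = self.last_sequence_nr

    def on_save_snapshot_success(self, reported_sequence_nr):
        # Ignore the sequence number reported at save time and delete only
        # up to the one recorded at capture time.
        limit = self.captured_sequence_nr
        for s in [s for s in self.journal if s <= limit]:
            del self.journal[s]


def run_fixed_snapshot():
    actor = ToyPersistentActor()
    actor.persist("ReplicatedLogEntry(index=1)")
    actor.capture_snapshot()

    # Messages persisted while the snapshot data is captured asynchronously.
    actor.persist("ApplyJournalEntries(toIndex=1)")
    actor.persist("ReplicatedLogEntry(index=2)")

    actor.on_save_snapshot_success(actor.last_sequence_nr)
    return actor.journal
```

With the recorded sequence number as the deletion limit, the two messages persisted after capture remain in the journal and can be replayed on recovery.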

          People

            tpantelis Tom Pantelis
            tpantelis Tom Pantelis