Details
- Type: Bug
- Bug ID: 4850
- Status: Resolved
- Resolution: Done
- Version: Beryllium
- Operating System: All
- Platform: All
- Priority: High
Description
I ran into an issue in a 3-node cluster where a continuous out-of-sync loop occurs with AppendEntries even though the followers are actually in sync. This results in continuous install snapshots. The issue occurs after a couple of restarts and is related to the initial snapshot taken after recovery and to the logic in the Follower's out-of-sync checking with respect to the replicatedToAllIndex.
To reproduce:
1) Start a 3-node cluster with empty journals
2) Commit some data
3) Restart the cluster nodes
4) With no additional data committed, restart the cluster again
5) After leader election completes and all followers are connected, the leader goes into a continuous cycle of failed AppendEntries followed by install snapshots.
Log snippets on the leader:
14:36:45,301 | AbstractLeader | 299 | member-1-shard-car-config (Leader): handleAppendEntriesReply - FollowerLogInformation for member-3-shard-car-config updated: matchIndex: 10, nextIndex: 11
14:36:45,301 | AbstractLeader | 555 | member-1-shard-car-config (Leader): Checking sendAppendEntries for follower member-3-shard-car-config: active: true, followerNextIndex: 11, leaderLastIndex: 10, leaderSnapShotIndex: 10
14:36:45,809 | AbstractLeader | 218 | member-1-shard-car-config (Leader): handleAppendEntriesReply: received unsuccessful reply: AppendEntriesReply [success=false, logLastIndex=10, logLastTerm=1, followerId=member-3-shard-car-config, payloadVersion=3, forceInstallSnapshot=false]
14:36:46,333 | AbstractLeader | 218 | member-1-shard-car-config (Leader): handleAppendEntriesReply: received unsuccessful reply: AppendEntriesReply [success=false, logLastIndex=10, logLastTerm=1, followerId=member-3-shard-car-config, payloadVersion=3, forceInstallSnapshot=false]
14:36:46,333 | AbstractLeader | 579 | member-1-shard-car-config (Leader): InitiateInstallSnapshot to follower: member-3-shard-car-config, follower-nextIndex: 9, leader-snapshot-index: 10, leader-last-index: 10
14:36:46,334 | SnapshotManager | 142 | member-1-shard-car-config: Capturing Snapshot : lastLogEntry is null. Using lastAppliedIndex 10 and lastAppliedTerm 1 instead.
14:36:46,335 | SnapshotManager$Idle | 237 | member-1-shard-car-config: Initiating snapshot capture CaptureSnapshot [lastAppliedIndex=10, lastAppliedTerm=1, lastIndex=10, lastTerm=1, installSnapshotInitiated=true, replicatedToAllIndex=-1, replicatedToAllTerm=-1, unAppliedEntries size=0] to install on member-3-shard-car-config
14:36:46,346 | SnapshotManager$Creating | 310 | member-1-shard-car-config: Persisting of snapshot done: Snapshot [lastIndex=10, lastTerm=1, lastAppliedIndex=10, lastAppliedTerm=1, unAppliedEntries size=0, state size=842, electionTerm=44, electionVotedFor=member-1-shard-car-config, ServerConfigPayload=null]
14:36:46,346 | SnapshotManager$Creating | 360 | member-1-shard-car-config: Removed in-memory snapshotted entries, adjusted snaphsotIndex: 10 and term: 1
14:36:46,346 | AbstractLeader | 660 | member-1-shard-car-config (Leader): sendInstallSnapshot
14:36:46,350 | AbstractLeader | 702 | member-1-shard-car-config (Leader): InstallSnapshot sent to follower Vector(user, shardmanager-config, member-3-shard-car-config), Chunk: 1/1
...
14:36:46,353 | DEBUG | ult-dispatcher-3 | AbstractLeader | 218 | 227 - org.opendaylight.controller.sal-akka-raft - 1.3.0.SNAPSHOT | | member-1-shard-car-config (Leader): handleAppendEntriesReply: received unsuccessful reply: AppendEntriesReply [success=false, logLastIndex=10, logLastTerm=1, followerId=member-3-shard-car-config, payloadVersion=3, forceInstallSnapshot=false]
Log snippets on a follower:
Follower| 103 | member-3-shard-car-config (Follower): handleAppendEntries: AppendEntries [leaderId=member-1-shard-car-config, prevLogIndex=-1, prevLogTerm=-1, leaderCommit=10, replicatedToAllIndex=9, payloadVersion=3, entries=[]]
Follower | 290 | member-3-shard-car-config (Follower): Cannot append entries because the replicatedToAllIndex 9 does not appear to be in the in-memory journal
...
Follower | 354 | member-3-shard-car-config (Follower): InstallSnapshot received from leader member-1-shard-car-config, datasize: 842 , Chunk: 1/1
I started member-1 and then member-2. member-1 was elected leader and member-2 was initially in sync at log index 10 with next index 11. AppendEntries heartbeats with an empty log entry list succeeded as expected.
The issue begins when member-3 is started. At this point the replicatedToAllIndex on the leader is -1. The initial AppendEntries to member-3 with log index 10 succeeds. After that the leader sees that the match index for all followers is 10 and thus sets replicatedToAllIndex to 9. The next AppendEntries to the followers fails with "Cannot append entries because the replicatedToAllIndex 9 does not appear to be in the in-memory journal" due to this logic in the Follower's out-of-sync checking:
else if (appendEntries.getPrevLogIndex() == -1
        && appendEntries.getPrevLogTerm() == -1
        && appendEntries.getReplicatedToAllIndex() != -1
        && !isLogEntryPresent(appendEntries.getReplicatedToAllIndex())) {
    // This AppendEntries comes from a leader whose log has been aggressively
    // trimmed and so does not have the previous entry in its in-memory journal
    LOG.debug("{}: Cannot append entries because the replicatedToAllIndex...", ...);
}
In this case the leader has no actual log entries because everything is in the snapshot; the same is true for the followers. Thus the prevLogIndex in the AppendEntries is passed as -1 because the follower's previous index, 9, isn't in the leader's log. Similarly, the replicatedToAllIndex, 9, isn't in the follower's log. Thus the follower thinks it's out of sync even though it really isn't.
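To make that concrete, here's a minimal toy model of the in-memory journal (hypothetical and simplified; the real ReplicatedLogImpl has more to it) showing why isLogEntryPresent(9) comes out false after the second restart:

import java.util.ArrayList;
import java.util.List;

class JournalModel {
    long snapshotIndex = -1;                        // highest index folded into the snapshot
    final List<String> journal = new ArrayList<>(); // entries after the snapshot

    boolean isLogEntryPresent(long index) {
        long adjusted = index - snapshotIndex - 1;  // offset into the journal list
        return adjusted >= 0 && adjusted < journal.size();
    }

    public static void main(String[] args) {
        JournalModel follower = new JournalModel();

        // After the first restart: nothing trimmed yet, entries 0..10 in memory.
        for (int i = 0; i <= 10; i++) {
            follower.journal.add("entry-" + i);
        }
        System.out.println(follower.isLogEntryPresent(9)); // true

        // After the second restart: recovery snapshotted at lastAppliedIndex 10,
        // so snapshotIndex is 10 and the journal is empty.
        follower.snapshotIndex = 10;
        follower.journal.clear();
        System.out.println(follower.isLogEntryPresent(9)); // false -> the guard above fires
    }
}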
This logic was added back in July with https://git.opendaylight.org/gerrit/#/c/24546/. It was also put into stable/lithium. I'm unclear as to the exact scenario that caused this logic to be added. It seems we need some additional checking here, e.g. check whether the follower's log is actually empty or its lastIndex equals the snapshot index; a rough sketch follows.
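Something along these lines, as an untested sketch (the context.getReplicatedLog().size() call is my assumption about what's available in the surrounding Follower code):

else if (appendEntries.getPrevLogIndex() == -1
        && appendEntries.getPrevLogTerm() == -1
        && appendEntries.getReplicatedToAllIndex() != -1
        && !isLogEntryPresent(appendEntries.getReplicatedToAllIndex())
        // Sketch of the extra guard: if the follower's journal is empty, all of
        // its entries are in the snapshot, so the "missing" replicatedToAllIndex
        // doesn't actually indicate divergence from the leader.
        && context.getReplicatedLog().size() > 0) {
    ...
}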
However, this code has been in place for about 6 months and I hadn't seen this issue before. I think it was introduced as a side effect of snapshotting after recovery. After the first restart, when it snapshots, the snapshot index in the in-memory log is set to -1 and the in-memory log isn't trimmed. In the persisted snapshot the lastAppliedIndex is 10 with no unapplied entries. So on the next restart, the snapshot index in the in-memory log is set to 10 and the in-memory log is empty. This leads to the issue once replicatedToAllIndex is updated, since there's no previous entry in the log.
Normally we snapshot on replication when the log index or memory usage trips a threshold. The snapshot includes the lastAppliedIndex and there's normally at least one unapplied entry. In the in-memory log, we trim either to the lastAppliedIndex or to the replicatedToAllIndex, so there's always a previous log index. The difference between the two trims is illustrated below.
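A toy illustration of that difference, using the index values from this report (hypothetical, simplified code):

import java.util.ArrayList;
import java.util.List;

class TrimScenarios {
    // Drop every entry with index <= snapshotIndex, as a snapshot trim does.
    static void trimTo(List<Long> journal, long snapshotIndex, String label) {
        journal.removeIf(index -> index <= snapshotIndex);
        System.out.println(label + ": snapshotIndex=" + snapshotIndex + ", journal=" + journal
                + (journal.isEmpty() ? "  <-- no previous log index left" : ""));
    }

    public static void main(String[] args) {
        // Normal replication-triggered snapshot: lastAppliedIndex is 10 but
        // entry 11 isn't applied yet, so it survives the trim and a previous
        // log index always exists.
        trimTo(new ArrayList<>(List.of(8L, 9L, 10L, 11L)), 10, "normal snapshot");

        // Snapshot on recovery: lastAppliedIndex == lastIndex == 10 with no
        // unapplied entries, so the trim empties the journal entirely.
        trimTo(new ArrayList<>(List.of(8L, 9L, 10L)), 10, "recovery snapshot");
    }
}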
This issue doesn't occur if the snapshot on recovery is disabled but that seems like a band-aid. The logic in the Follower still looks like it has a flaw.