[CONTROLLER-1630] Follower not sync'ing up after rejoining cluster Created: 10/Apr/17  Updated: 25/Jul/23  Resolved: 22/May/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Srini Seetharaman Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8199

 Description   

I frequently see cases where the follower prints the following after the cluster is recovered. It keeps printing this indefinitely and does not seem to recover.

2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): The log is not empty but the prevLogIndex 1685 was not found in it - lastIndex: 1685, snapshotIndex: -1
2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false, followerId=member-2-shard-default-operational, logLastIndex=1685, logLastTerm=9, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]



 Comments   
Comment by Tom Pantelis [ 12/Apr/17 ]

I think a simple solution is to handle that particular out-of-sync scenario by having the leader force an install snapshot.

However, the out-of-sync scenario may be a symptom of another issue that resulted in an index gap. To determine that, we need to know what led up to the issue. It would be helpful to reproduce with debug logging enabled for org.opendaylight.controller.cluster.datastore.Shard and to provide the logs, the exact steps taken to reproduce, and the timestamps.
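For illustration only, here is a rough sketch of the "force an install snapshot" idea above, written in the style of leader-side reply handling. The method and field names (getReplicatedLog, sendInstallSnapshot, followerLogInfo, logName) are assumptions for the sketch, not the actual sal-akka-raft implementation:

    // Hypothetical leader-side handling of a negative AppendEntriesReply.
    private void handleAppendEntriesReply(final String followerId, final AppendEntriesReply reply) {
        if (!reply.isSuccess()) {
            final long followerLastIndex = reply.getLogLastIndex();

            // The follower's reported last index cannot be located in our log
            // (or the follower could not find the prevLogIndex). Rather than
            // decrementing nextIndex one entry at a time forever, fall back to
            // transferring a full snapshot so the follower can resync from scratch.
            if (getReplicatedLog().get(followerLastIndex) == null) {
                log.info("{}: follower {} is out-of-sync at index {} - forcing install snapshot",
                        logName(), followerId, followerLastIndex);
                sendInstallSnapshot(followerId);
                return;
            }

            // Normal Raft behavior: back up nextIndex and retry AppendEntries.
            followerLogInfo(followerId).decrNextIndex();
        }
    }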

Comment by Tom Pantelis [ 15/Apr/17 ]

Looking at the log message, the leader's previous index, 1685, is the same as the follower's last index, so the follower should find that index in its journal log. Since the snapshotIndex was -1, there was no prior snapshot, which means this check was satisfied in AbstractReplicatedLogImpl#get:

if (adjustedIndex < 0 || adjustedIndex >= journal.size()) {
    // physical index should be less than list size and >= 0
    return null;
}

and the lookup failed. Clearly the actual journal size is at most 1685 (the adjustedIndex) even though the last entry's index is 1685, which means there's a gap in the journal list/array. The only known way this happens in Boron (that I recall) is if the operational persistence flag is changed from false to true, or if one node has persistence enabled while another doesn't. If that wasn't the case in this scenario, then there's another issue and we'd need to reproduce with debug on.
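To make the index math concrete, a small self-contained sketch (not the actual AbstractReplicatedLogImpl, just the same check quoted above, assuming adjustedIndex = index - (snapshotIndex + 1)) showing how a gap in the in-memory journal list makes the lookup fail even though the last entry's index matches the leader's prevLogIndex:

    import java.util.ArrayList;
    import java.util.List;

    public class JournalGapDemo {
        public static void main(String[] args) {
            // Simulated journal: with snapshotIndex == -1 the entry with logical
            // index N should sit at list position N. Here some entries are missing,
            // so the list is smaller than (lastIndex + 1) even though the last
            // entry's index is 1685.
            List<Long> journal = new ArrayList<>();
            for (long i = 0; i <= 1600; i++) {   // indices 0..1600 present
                journal.add(i);
            }
            journal.add(1685L);                  // gap: 1601..1684 missing, then 1685

            long snapshotIndex = -1;
            long prevLogIndex = 1685;

            // Same check as the quoted AbstractReplicatedLogImpl#get condition.
            long adjustedIndex = prevLogIndex - (snapshotIndex + 1);  // == 1685
            boolean found = adjustedIndex >= 0 && adjustedIndex < journal.size();

            System.out.println("journal size  = " + journal.size());   // 1602
            System.out.println("adjustedIndex = " + adjustedIndex);    // 1685
            System.out.println("found         = " + found);            // false -> negative reply
        }
    }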

Comment by Robert Varga [ 24/Aug/17 ]

Does this happen with a newer release train (Carbon/Nitrogen)?
