[CONTROLLER-1630] Follower not sync'ing up after rejoining cluster Created: 10/Apr/17 Updated: 25/Jul/23 Resolved: 22/May/18

| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Srini Seetharaman | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 8199 |
| Description |
I frequently see cases where the follower prints the following after the cluster is recovered. It prints this forever and does not seem to recover:

2017-04-07 11:24:41,471 | INFO | lt-dispatcher-20 | Shard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.Boron-SR3 | member-2-shard-default-operational (Follower): The log is not empty but the prevLogIndex 1685 was not found in it - lastIndex: 1685, snapshotIndex: -1
| Comments |
| Comment by Tom Pantelis [ 12/Apr/17 ] |
I think a simple solution is to treat that particular out-of-sync scenario such that the leader forces an install snapshot. However, the out-of-sync scenario may be a symptom of another issue that resulted in an index gap. To determine that, we need to know what led up to the issue. It would be helpful to reproduce with debug logging enabled for org.opendaylight.controller.cluster.datastore.Shard and to provide the logs, the exact steps taken to reproduce, and the corresponding timestamps.
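The "force an install snapshot" idea could be sketched roughly as follows. This is a hypothetical illustration, not the actual sal-akka-raft code; LeaderOutOfSyncHandler, FollowerLogInfo, SnapshotInstaller, and all method names are assumptions made for the example.

// Hypothetical sketch of the suggested leader-side fallback. When the
// follower's reported lastIndex equals the prevLogIndex it just rejected,
// its journal must have an internal gap, so retrying AppendEntries with a
// decremented nextIndex can never converge.
final class LeaderOutOfSyncHandler {

    interface FollowerLogInfo {
        long getNextIndex();               // next entry the leader will send
        void setNextIndex(long index);
    }

    interface SnapshotInstaller {
        void beginInstallSnapshot(String followerId);  // full state transfer
    }

    private final SnapshotInstaller installer;

    LeaderOutOfSyncHandler(SnapshotInstaller installer) {
        this.installer = installer;
    }

    // Called when a follower rejects AppendEntries for a given prevLogIndex.
    void handleRejection(String followerId, FollowerLogInfo info,
            long prevLogIndex, long followerLastIndex) {
        if (followerLastIndex == prevLogIndex) {
            // The follower's lastIndex says the entry should exist, yet it
            // reported "prevLogIndex ... was not found": force a snapshot
            // transfer instead of backing off.
            installer.beginInstallSnapshot(followerId);
        } else {
            // Normal Raft back-off: step nextIndex back and retry.
            info.setNextIndex(Math.max(0, info.getNextIndex() - 1));
        }
    }
}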
| Comment by Tom Pantelis [ 15/Apr/17 ] |
Looking at the log message, the leader's previous index, 1685, is the same as the follower's last index, so it should find the index in the journal log. Since the snapshotIndex was -1, there was no prior snapshot, which means this check in AbstractReplicatedLogImpl#get was satisfied and the lookup failed:

if (adjustedIndex < 0 || adjustedIndex >= journal.size()) {
    // physical index should be less than list size and >= 0
    return null;
}

Clearly the actual journal size is at most 1685 (the adjustedIndex) even though the last entry's index is 1685, which would require 1686 entries - so there's a gap in the journal list/array. The only known way this happens in Boron (that I recall) is if the operational persistent flag is changed from false to true, or if one node has persistence enabled while another doesn't. If that wasn't the case in this scenario, then there's another issue and we'd need to reproduce with debug on.
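To make the arithmetic concrete, here is a minimal, runnable illustration of the failure mode. The adjustment formula (adjustedIndex = index - snapshotIndex - 1) is an assumption consistent with the values in the log message, and the class and journal size are invented for the example - this is not the actual AbstractReplicatedLogImpl source.

import java.util.ArrayList;
import java.util.List;

public class JournalGapDemo {
    public static void main(String[] args) {
        long snapshotIndex = -1;    // no prior snapshot, as in the log message
        long prevLogIndex = 1685;   // leader's prevLogIndex == follower's lastIndex

        // Simulate a follower journal with an internal gap: indices 0..1685
        // require 1686 entries, but the in-memory list holds fewer.
        List<Object> journal = new ArrayList<>();
        for (int i = 0; i < 1600; i++) {
            journal.add(new Object());
        }

        // Assumed adjustment: with snapshotIndex == -1, adjustedIndex == index.
        long adjustedIndex = prevLogIndex - snapshotIndex - 1;  // == 1685

        if (adjustedIndex < 0 || adjustedIndex >= journal.size()) {
            // Same branch as the check quoted above: get() returns null and the
            // follower logs "prevLogIndex 1685 was not found in it".
            System.out.println("lookup failed: adjustedIndex=" + adjustedIndex
                + ", journal.size()=" + journal.size());
        }
    }
}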
| Comment by Robert Varga [ 24/Aug/17 ] |
Does this happen with a newer release train (Carbon/Nitrogen)?