[CONTROLLER-1770] Follower failed to synchronize data after its snapshot synchronization failed Created: 14/Sep/17  Updated: 25/Jul/23

Status: In Progress
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Guan Zhenhang Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 9163

 Description   

Three nodes make up a cluster. Shut down one of the nodes and delete the snapshot and journal files on that node. Restart the node, then shut it down again as soon as possible after the restart. Finally, restart the node once more: some shard data on that node cannot be synchronized from the other nodes.



 Comments   
Comment by Guan Zhenhang [ 14/Sep/17 ]

Cause Analysis:
The persisted data was removed after the first shutdown of the node. When the node is restarted, each of its shards is therefore in the initial state because the local persistent storage is empty; that is, the logIndex of each shard is -1 (indicating that the shard has not yet synchronized any data).
When the node joins the cluster, the shard leaders on the other nodes synchronize data to the shards on this node through the Raft protocol. Since the shard leaders have already compacted their logs, the node's data can only be recovered via snapshot messages. If the newly started node is shut down at this point, the snapshot transfer for some of its shards may not have completed.
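
To illustrate why an empty follower can only be recovered by snapshot, here is a minimal sketch of the general Raft rule, not the controller's actual code. The class and method names are hypothetical, and leaderSnapshotIndex is assumed to be the index of the last log entry covered by the leader's snapshot (-1 if the leader has no snapshot).

```java
/** Hypothetical helper: how a leader could choose a replication method for a follower. */
final class ReplicationDecision {
    enum Action { APPEND_ENTRIES, INSTALL_SNAPSHOT }

    /**
     * @param followerLastIndex   last log index the follower holds (-1 when it holds nothing)
     * @param leaderSnapshotIndex last index covered by the leader's snapshot (-1 when none)
     */
    static Action choose(long followerLastIndex, long leaderSnapshotIndex) {
        long nextIndex = followerLastIndex + 1;
        // Entries at or below the snapshot index were compacted away on the leader,
        // so they can no longer be replayed from the log.
        return nextIndex <= leaderSnapshotIndex
                ? Action.INSTALL_SNAPSHOT
                : Action.APPEND_ENTRIES;
    }
}
```

With a follower at logIndex -1 and any leader that has compacted its log, this sketch always selects INSTALL_SNAPSHOT, which matches the situation described above.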

Comment by Guan Zhenhang [ 14/Sep/17 ]

Solution:
According to the above analysis, the leader's status record for a snapshot transfer to a follower has no expiry, which leads the leader to misjudge the state of snapshot synchronization with that follower. The purpose of this status record is to keep the leader from sending snapshot messages repeatedly. The record is meaningless once the follower has shut down, so it cannot always be used as a basis for data synchronization between leader and follower.
To this end, we have added an expiry check to this status record: the record becomes invalid after ten minutes (given the scale of our business data, ten minutes is a very conservative interval), after which the leader continues to synchronize data to the follower.
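
The expiry idea can be sketched as follows. This is only an illustration under assumptions, not the patch itself: SnapshotTransferRecord and its methods are hypothetical names, and the current time is passed in explicitly (as a monotonic nanosecond reading such as System.nanoTime()) so the logic is easy to test.

```java
import java.time.Duration;

/** Hypothetical leader-side record of a snapshot transfer to one follower. */
final class SnapshotTransferRecord {
    private final Duration validity;  // how long the record may suppress a re-send
    private long startedAtNanos;
    private boolean inProgress;

    SnapshotTransferRecord(Duration validity) {
        this.validity = validity;
    }

    /** Called when the leader starts sending a snapshot to the follower. */
    void markStarted(long nowNanos) {
        inProgress = true;
        startedAtNanos = nowNanos;
    }

    /** Called when the follower confirms that the snapshot was installed. */
    void markCompleted() {
        inProgress = false;
    }

    /**
     * True while the record should still stop the leader from sending again.
     * Once the validity period has elapsed the record is treated as stale,
     * so a follower that went away mid-transfer is not ignored forever.
     */
    boolean isActive(long nowNanos) {
        return inProgress && nowNanos - startedAtNanos < validity.toNanos();
    }
}
```

A leader using such a record would create it with Duration.ofMinutes(10) and skip re-sending a snapshot only while isActive(System.nanoTime()) returns true.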

Comment by Guan Zhenhang [ 14/Sep/17 ]

Our bug fix has passed a 3-node test. We will push the code if someone can review it.

Comment by Kit Lou [ 19/Oct/17 ]

Adjust severity to match last status in bugzilla.

Comment by Jamo Luhrsen [ 19/Oct/17 ]

Hi Guan,

Do you have a link to the patch for review? This is a blocker bug for our Nitrogen release, which we are hoping to release in a week.
The contributions are much appreciated.

Comment by Tom Pantelis [ 19/Oct/17 ]

Why is this all of a sudden a blocker for SR1? It wasn't a blocker for Nitrogen, and this is an edge case that has been there all along (no regression).

Comment by Tom Pantelis [ 19/Oct/17 ]

I lowered the severity as I don't believe this needs to be a blocker.

Generated at Wed Feb 07 19:56:25 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.