[CONTROLLER-1439] Increase seed-node-timeout to avoid cluster island Created: 30/Oct/15  Updated: 10/Nov/15  Resolved: 10/Nov/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Lithium
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 4563

 Description   

Gary Wu recently saw a cluster island form intermittently when the first seed node was restarted while the other members were up. I also recently saw this in a production environment.

According to Akka's docs, on startup the first seed node tries every second to contact the other seed nodes, up to the seed-node-timeout. If none respond, it joins itself and declares itself leader. The default seed-node-timeout is 5 sec, which may not be enough in some cases. Akka also has a mechanism that gates a node for 5 sec before allowing it to reconnect (it logs info messages about this). I think that, if the timing is right, the 5 sec gate could cause the seed-node-timeout to expire before the first seed node reaches any other member.

Gary increased the seed-node-timeout and that alleviated the island issue. I think we should increase it by default to 15 sec to be safe.
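
For illustration, the proposed change could look like the following Akka configuration fragment. This is a minimal sketch, not the actual patch: akka.cluster.seed-node-timeout is the standard Akka setting, but the file that holds it and any enclosing section in a given controller deployment are assumptions and may differ.

```
akka {
  cluster {
    # Akka's default is 5s. 15s is the value proposed in this issue;
    # it gives the first seed node more time to reach another member
    # before it joins itself and forms its own cluster.
    seed-node-timeout = 15s
  }
}
```

The value is a trade-off: a longer timeout lowers the chance of an island on restart, but it also delays startup of the first seed node when no other member is actually running.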



 Comments   
Comment by Gary Wu [ 30/Oct/15 ]

In my environment, increasing the seed-node-timeout to 10s seems to completely eliminate the issue.

Also see https://github.com/akka/akka/issues/18757 for reference.

Comment by Tom Pantelis [ 31/Oct/15 ]

I've seen a similar intermittent failure in various LeaderTest test cases. It always happens when the leader sends the first AppendEntries to the follower but the message goes to dead letters, which normally happens when an actor has been killed. This seems to be a timing issue or bug in Akka, possibly with ActorSelection; maybe the actor isn't yet in a state to receive messages.

I came up with a solution in https://git.opendaylight.org/gerrit/#/c/29027 that appears to alleviate the issue: LeaderTest ran successfully over 1000 times.

Comment by Tom Pantelis [ 31/Oct/15 ]

Oops - I commented and closed the wrong bug. Re-opening.

Comment by Tom Pantelis [ 31/Oct/15 ]

Master patch https://git.opendaylight.org/gerrit/#/c/29075

Generated at Wed Feb 07 19:55:33 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.