[CONTROLLER-1083] Clustering: 2 leaders in a cluster
Created: 07/Jan/15  Updated: 19/Oct/17  Resolved: 14/May/17

| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Abhishek Kumar | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Attachments: | Screen Shot 2015-01-06 at 5.58.03 PM.png |
| External issue ID: | 2557 |
| Description |
While adding flows to the config store in a 3-node cluster, the cluster went into a state with 2 leaders. Node 1: leader; Node 3: one switch connected.

When control reached the breakpoint in the debugger, leadership changed. After this I resumed the program in the debugger; I had other breakpoints, but control did not stop at them. Node 2 became leader while the other 2 became followers. The RESTCONF call did not return yet. LastApplied and LastLogIndex incremented on Node 2 and Node 3, while Node 1 still showed 1 uncommitted entry.

After a while the RESTCONF call returned with an Akka timeout, at which point Node 1 became leader and synced up with the other nodes in the cluster. Node 2 still remained a leader. Node 3 initially flip-flopped between node 1 and node 2 as the leader and synced data with the corresponding leader; later it settled on node 2 as the leader. However, after an add-flow on node 1, node 3's LastApplied and LastLogIndex synced with node 1's, but its leadership and data (seen with a RESTCONF GET) stayed synced with node 2. The LastLogTerm stayed constant on all 3 nodes.

I have attached a screenshot displaying the current state of the JMX counters on all 3 nodes.
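For reference, the Raft counters above can also be polled programmatically over JMX instead of read from a console. The following is a minimal Java sketch, assuming Karaf's default JMX service URL and an ODL shard MBean named member-1-shard-inventory-config under DistributedConfigDatastore; the host/port, member name, shard name, and the RaftState/Leader attribute names are assumptions to adjust for a real deployment.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ShardRaftStats {
    public static void main(String[] args) throws Exception {
        // Karaf's default JMX endpoint; host and port are assumptions, adjust per node.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/karaf-root");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Assumed shard MBean name; substitute the real member/shard names.
            ObjectName shard = new ObjectName(
                    "org.opendaylight.controller:type=DistributedConfigDatastore,"
                    + "Category=Shards,name=member-1-shard-inventory-config");
            // In a healthy cluster there is one leader per term, and
            // LastApplied/LastLogIndex converge on all members.
            for (String attr : new String[] {
                    "RaftState", "Leader", "LastLogIndex", "LastLogTerm", "LastApplied"}) {
                System.out.println(attr + " = " + conn.getAttribute(shard, attr));
            }
        }
    }
}
```

Running this against each of the 3 nodes and diffing the output would show the divergence described above (two members reporting Leader state in the same term).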
| Comments |
| Comment by Abhishek Kumar [ 07/Jan/15 ] |
Attachment Screen Shot 2015-01-06 at 5.58.03 PM.png has been added with description: JMX Counters
| Comment by Robert Varga [ 13/Apr/16 ] |
Is this still reproducible on Beryllium and/or Li-SR4?
| Comment by Tom Pantelis [ 14/May/17 ] |
It looks like what happened was a split-brain scenario: node 1 became isolated and was auto-downed and quarantined by the node 2/node 3 side of the cluster. We have since disabled auto-down, mainly for this reason.
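For context, auto-down is controlled by Akka's akka.cluster.auto-down-unreachable-after setting (in OpenDaylight, shipped in configuration/initial/akka.conf). A minimal before/after sketch of that setting; the 10s timeout shown as the old value is an assumption, not the value ODL actually shipped:

```
akka {
  cluster {
    # Before: unreachable members were automatically downed after a timeout.
    # Under a network partition, each side can down the other side's members
    # and continue independently, producing the two-leader state reported here.
    # auto-down-unreachable-after = 10s

    # After: auto-down disabled (Akka's default). An unreachable member stays
    # unreachable until downed explicitly by an operator or a proper downing
    # provider, so a partition cannot silently split into two clusters.
    auto-down-unreachable-after = off
  }
}
```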