[CONTROLLER-1083] Clustering: 2 leaders in a cluster Created: 07/Jan/15  Updated: 19/Oct/17  Resolved: 14/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Abhishek Kumar Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: PNG File Screen Shot 2015-01-06 at 5.58.03 PM.png    
External issue ID: 2557

 Description   

While adding flows to the config store in a 3-node cluster, the cluster entered a state with 2 leaders.

Node 1: leader; Node 3: One switch connected
  • Connect a debugger to Node 1 with a breakpoint in AbstractListeningCommitter.onDataChanged()

  • Use restconf on Node 1 to add a flow

When control reaches the breakpoint in the debugger, leadership changes. After this, I "resume program" in the debugger. I had other breakpoints, but control doesn't stop at them.

Node 2 becomes leader while the other 2 nodes are followers.

The restconf call has not returned yet.

LastApplied and LastLogIndex increment on Node 2 and Node 3. Node 1 still shows 1 uncommitted entry.

After a while the restconf call returns with an akka timeout, at which point Node 1 becomes leader and syncs up with the other nodes in the cluster. Node 2 still remains a leader.

Node 3 initially flip-flopped between Node 1 and Node 2 as the leader and synced data with the corresponding leader. Later it stuck with Node 2 as the leader. However, adding a flow on Node 1 synced Node 3's LastApplied and LastLogIndex with those of Node 1, while its leadership and data (seen with restconf GET) were synced with Node 2.

The LastLogTerm stayed constant on all 3 nodes.
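
The symptoms above can be checked mechanically once the per-node counters are collected. The following is only a minimal sketch: the RaftSnapshot type, its field names, and the sample values are hypothetical and are not taken from the attached screenshot or from the controller's JMX beans. It flags more than one self-declared leader sharing the same last log term, followers that disagree about the leader, and nodes whose LastLogIndex is ahead of LastApplied.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftSnapshot:
    """One node's reported state; the field names are hypothetical."""
    node: str
    role: str            # "Leader", "Follower" or "Candidate"
    leader: str          # the node this member believes is the leader
    last_log_term: int
    last_log_index: int
    last_applied: int


def find_anomalies(snapshots):
    """Return human-readable findings for a set of per-node snapshots."""
    findings = []

    # Several nodes claiming leadership with the same last log term is the
    # symptom described in this report.
    leaders_by_term = defaultdict(list)
    for s in snapshots:
        if s.role == "Leader":
            leaders_by_term[s.last_log_term].append(s.node)
    for term, nodes in sorted(leaders_by_term.items()):
        if len(nodes) > 1:
            findings.append(f"term {term}: multiple leaders {sorted(nodes)}")

    # Followers are expected to agree on who the leader is.
    believed = {s.leader for s in snapshots if s.role == "Follower"}
    if len(believed) > 1:
        findings.append(f"followers disagree on leader: {sorted(believed)}")

    # Entries that were appended to the log but not yet applied.
    for s in snapshots:
        lag = s.last_log_index - s.last_applied
        if lag > 0:
            findings.append(f"{s.node}: {lag} entries not yet applied")
    return findings


# Illustrative values only, loosely modelled on the state described above.
snapshots = [
    RaftSnapshot("node1", "Leader", "node1", 1, 11, 10),
    RaftSnapshot("node2", "Leader", "node2", 1, 12, 12),
    RaftSnapshot("node3", "Follower", "node2", 1, 12, 12),
]
for finding in find_anomalies(snapshots):
    print(finding)
```

Grouping leaders by last log term is a simplification: Raft's guarantee of one leader is stated per current term, which the counters listed in this report do not include.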

I have attached a screenshot displaying the current state of the jmx counters on all 3 nodes.
Node 1 = 148
Node 2 = 151
Node 3 = 150



 Comments   
Comment by Abhishek Kumar [ 07/Jan/15 ]

Attachment Screen Shot 2015-01-06 at 5.58.03 PM.png has been added with description: JMX Counters

Comment by Robert Varga [ 13/Apr/16 ]

Is this still reproducible on Beryllium and/or Li-SR4?

Comment by Tom Pantelis [ 14/May/17 ]

It looks like what happened was a split-brain scenario where node 1 became isolated and was auto-downed and quarantined by the node 2/node 3 cluster. We have since disabled auto-down, mainly for this reason.
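
To illustrate why auto-down can produce this outcome, here is a toy model, not Akka code: the function names and the partition layout are assumptions made for illustration. Each side of a network partition removes the members it cannot reach, so the sides end up with disjoint membership views and each can act as its own cluster.

```python
def auto_down(members, partitions):
    """Return each node's membership view after unreachable peers are downed.

    `partitions` is a list of disjoint sets of nodes that can still reach
    one another. This is a toy model, not the Akka cluster API.
    """
    views = {}
    for side in partitions:
        for node in side:
            # Every member outside this side is unreachable from `node`
            # and is therefore removed from its view.
            views[node] = frozenset(m for m in members if m in side)
    return views


def independent_clusters(views):
    """Distinct membership views; more than one means split brain."""
    return set(views.values())


members = {"node1", "node2", "node3"}
# Node 1 is isolated, as suggested in the comment above.
partitions = [{"node1"}, {"node2", "node3"}]

views = auto_down(members, partitions)
clusters = independent_clusters(views)
for cluster in sorted(clusters, key=len):
    print(sorted(cluster))
```

With auto-down disabled, the unreachable member stays in every view until an operator or another mechanism decides which side survives, so the two sides do not silently diverge.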

Generated at Wed Feb 07 19:54:39 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.