[CONTROLLER-1817] Shard leader does not join back Cluster when it is isolated and rejoined Created: 05/Mar/18  Updated: 07/Dec/18  Resolved: 07/Dec/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Chethana Lakshmanappa Assignee: Tom Pantelis
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive SL_2_isolate_rejoin_failing.zip    

 Description   

The shard leader does not rejoin the cluster after it is isolated and the isolation is then removed.
Steps to recreate:

1. Bring up a 3-node cluster and make Node 2 the shard leader using the "http://{{controller-ip}}:{{restconf-port}}/restconf/operations/cluster-admin:make-leader-local" API.
2. Check that Node 2 is the leader for all shards and for each data-store-type.
3. Apply an ACL on Node 1 and Node 3 to isolate Node 2.
4. Check that Node 1 or Node 3 is elected leader and that Node 2 is "Isolated Leader".
5. Once the active leader marks Node 2 as terminated in order to resume leader capability, remove the ACL applied on Node 1 and Node 3.
Note: for Step 3, applying the ACL only on Node 2 (to cut it off from Node 1 and Node 3) was also tried, with the same behavior.
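
Step 1 could be scripted roughly as follows. This is a minimal sketch that is not taken from the report: the host, port, and shard name are hypothetical, and the input field names `shard-name` and `data-store-type` are an assumption that should be checked against the cluster-admin model of the version under test. The code only builds the request and does not send it.

```python
import json
import urllib.request


def build_make_leader_local(host, port, shard_name, data_store_type):
    """Build (but do not send) a POST for cluster-admin:make-leader-local."""
    url = (
        f"http://{host}:{port}"
        "/restconf/operations/cluster-admin:make-leader-local"
    )
    # Assumed input shape; verify against the cluster-admin YANG model in use.
    body = {
        "input": {
            "shard-name": shard_name,
            "data-store-type": data_store_type,
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values for illustration only.
req = build_make_leader_local("192.0.2.12", 8181, "default", "config")
print(req.get_method(), req.full_url)
```

The request would have to be issued against Node 2 once per shard and data-store type, with whatever credentials the deployment requires, so that Node 2 becomes leader everywhere before the ACL is applied.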

Expected Behavior

Node 2 should join back the cluster as "Follower"

Actual Behavior

Node 2 remains "Isolated Leader", and on all 3 nodes the message "Quarantined address is still unreachable or has not been restarted" is logged for close to 50 minutes. After that, Node 2 joins back the cluster and cluster operations resume.
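
The duration of the quarantine messages could be measured from a node's log with something like the sketch below. It is illustrative only: the timestamp layout at the start of each line is an assumption about the log format, the sample lines are invented, and only the tail of the message is matched in case the logged line carries extra text (such as an address) after "Quarantined address".

```python
from datetime import datetime

# Tail of the message quoted in this report.
MARKER = "is still unreachable or has not been restarted"
# Assumed timestamp layout at the start of each log line.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of e.g. "2018-03-05 10:00:00,000"


def quarantine_window(lines):
    """Return (first, last) timestamps of lines containing MARKER, or None."""
    stamps = []
    for line in lines:
        if MARKER not in line:
            continue
        try:
            stamps.append(datetime.strptime(line[:TS_LENGTH], TS_FORMAT))
        except ValueError:
            # Line does not start with the assumed timestamp; skip it.
            continue
    if not stamps:
        return None
    return min(stamps), max(stamps)


# Invented sample lines, for illustration only.
sample = [
    "2018-03-05 10:00:00,000 | WARN | Quarantined address is still unreachable or has not been restarted",
    "2018-03-05 10:20:00,000 | INFO | unrelated line",
    "2018-03-05 10:50:00,000 | WARN | Quarantined address is still unreachable or has not been restarted",
]
first, last = quarantine_window(sample)
print("messages seen for", last - first)
```

Running the same function over the logs of all three nodes would show whether the messages stop at about the same time that Node 2 rejoins.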

 



 Comments   
Comment by Robert Varga [ 14/Nov/18 ]

Tom, can you take a look at this, please?

Comment by Tom Pantelis [ 07/Dec/18 ]

Node isolation scenarios are tested and working in CSIT upstream. What version was this and is it reproducible with Fluorine?

Looks like akka quarantined the isolated node, in which case it needs to be restarted in order to rejoin. We have a detector for the quarantined event which restarts ODL; however, IIRC, there were a couple of issues with it that have since been fixed, one recently I believe by ajayslele.

Comment by Tom Pantelis [ 07/Dec/18 ]

Looking at the logs, I see version 1.6.3.SNAPSHOT for sal-distributed-datastore which I believe is Nitrogen SR3 and the akka version was 2.4.18 - we're now on 2.5.14. Nitrogen is no longer supported upstream. I'm going to close this - if it reproduces on a supported version then either re-open or open a new issue.

Generated at Wed Feb 07 19:56:31 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.