[CONTROLLER-1817] Shard leader does not join back Cluster when it is isolated and rejoined Created: 05/Mar/18 Updated: 07/Dec/18 Resolved: 07/Dec/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Chethana Lakshmanappa | Assignee: | Tom Pantelis |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
The shard leader does not rejoin the cluster after it is isolated and the isolation is removed.
1. Bring up a 3-node cluster and make Node 2 the shard leader using the "http://{{controller-ip}}:{{restconf-port}}/restconf/operations/cluster-admin:make-leader-local" API.
2. Check if Node 2 is the leader for all shards and data-store-type.
3. Apply ACL on Node 1 and Node 3 to isolate Node 2.
4. Check if Node 1 or Node 3 is elected as Leader and Node 2 is "Isolated Leader"
5. When Active Leader marks Node 2 as terminated to resume Leader capability, remove the ACL applied on Node 1 and Node 3
Note - For Step 3, also tried applying the ACL only on Node 2 to isolate it from Nodes 1 and 3. Same behavior.
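Step 1 can be sketched in Python as below. This is a minimal illustration, not taken from the report: the input field names `shard-name` and `data-store-type` are assumptions about the cluster-admin model, the function name is hypothetical, and authentication is left out (a real controller would normally require credentials).

```python
import json
from urllib import request


def build_make_leader_local_request(controller_ip, restconf_port,
                                    shard_name, data_store_type):
    """Build (but do not send) the POST request for make-leader-local."""
    url = (
        f"http://{controller_ip}:{restconf_port}"
        "/restconf/operations/cluster-admin:make-leader-local"
    )
    # Assumed RPC input: the shard to move and the datastore it belongs to.
    body = {
        "input": {
            "shard-name": shard_name,
            "data-store-type": data_store_type,
        }
    }
    return request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request would be sent to Node 2 once per shard and datastore type, for example with `request.urlopen(req)`, so that Node 2 becomes leader for all of them as step 2 expects.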
Expected Behavior: Node 2 should join back the cluster as "Follower".
Actual Behavior: Node 2 remains "Isolated Leader", and all 3 nodes log the message "Quarantined address is still unreachable or has not been restarted" for close to 50 minutes, after which Node 2 joins back the cluster and cluster operations resume.
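To measure how long the quarantine message persists on a node, the log can be scanned for it. The sketch below is illustrative only: it assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp (a guess at the log layout, not confirmed by the report), matches on a fragment of the message quoted above, and uses a hypothetical function name.

```python
from datetime import datetime

# Fragment of the message quoted in this report.
QUARANTINE_MARKER = "is still unreachable or has not been restarted"


def quarantine_window(log_lines):
    """Return (first, last) timestamps of quarantine messages, or None."""
    stamps = []
    for line in log_lines:
        if QUARANTINE_MARKER in line:
            # Assumed layout: the first 19 characters are the timestamp.
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    if not stamps:
        return None
    return min(stamps), max(stamps)
```

Running this against the log of each of the 3 nodes and subtracting the two timestamps gives the duration to compare with the roughly 50 minutes observed.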
|
| Comments |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
Tom, can you take a look at this, please? |
| Comment by Tom Pantelis [ 07/Dec/18 ] |
|
Node isolation scenarios are tested and working in CSIT upstream. What version was this, and is it reproducible with Fluorine? It looks like akka quarantined the isolated node, in which case it needs to be restarted in order to rejoin. We have a detector for the quarantined event which restarts ODL; however, IIRC, there were a couple of issues with it that were fixed, one recently, I believe, by ajayslele. |
| Comment by Tom Pantelis [ 07/Dec/18 ] |
|
Looking at the logs, I see version 1.6.3.SNAPSHOT for sal-distributed-datastore which I believe is Nitrogen SR3 and the akka version was 2.4.18 - we're now on 2.5.14. Nitrogen is no longer supported upstream. I'm going to close this - if it reproduces on a supported version then either re-open or open a new issue. |