[CONTROLLER-1693] UnreachableMember during remove-shard-replica prevents new leader from being elected Created: 22/May/17  Updated: 25/Jul/23  Resolved: 18/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1706 Large transaction traffic prevents le... Resolved
External issue ID: 8524

 Description   

This manifests as a CSIT failure [0]. The UnreachableMember itself is a separate issue (CONTROLLER-1645, for example). It is possible that the cluster members end up with an inconsistent shard configuration.

The karaf.log on member-1 [1] shows that the replica removal started at 01:55:43,472; then this happened:
2017-05-22 01:56:06,569 | WARN | lt-dispatcher-32 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.0.Carbon | member-1-shard-default-config: Failed to transfer leadership in 10.01 s
2017-05-22 01:56:06,572 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | Stopping Shard member-1-shard-default-config

Finally, the test teardown started adding the replica back at 01:56:28,959.
Thus, even though the test waits 45 seconds, the members only have about 20 seconds to realize the previous leader is gone (we can add more time to the test if needed).
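
For context, the suite drives the replica removal and the teardown re-add through the cluster-admin RPCs over RESTCONF. Below is a minimal sketch of those two calls; the host/port, credentials and exact input leaf names are assumptions based on the Carbon-era cluster-admin model, not copied from the CSIT code.

import requests

BASE = "http://{host}:8181/restconf/operations/cluster-admin:{rpc}"
AUTH = ("admin", "admin")  # default Karaf credentials, assumed

def remove_replica(host, shard="default", member="member-1", store="config"):
    # Remove the shard replica from the given member; if that member is the
    # current leader, this triggers the leadership transfer seen in the log.
    body = {"input": {"shard-name": shard,
                      "member-name": member,
                      "data-store-type": store}}
    return requests.post(BASE.format(host=host, rpc="remove-shard-replica"),
                         json=body, auth=AUTH)

def add_replica(host, shard="default", store="config"):
    # add-shard-replica is sent to the member that should regain the replica,
    # which is what the test teardown does at 01:56:28,959.
    body = {"input": {"shard-name": shard, "data-store-type": store}}
    return requests.post(BASE.format(host=host, rpc="add-shard-replica"),
                         json=body, auth=AUTH)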

As member-3's karaf.log [2] shows no activity between 01:56:03,165 and 01:56:56,244, it looks as if member-1 was perhaps somehow still the leader; however, the "has no leader" response [3] from member-1 when adding the replica back proves there really was no leader, at least from member-1's point of view.
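
The "no leader" view can also be checked directly by asking each member which node it currently sees as the shard leader. A minimal sketch follows, assuming the usual Jolokia shard MBean naming for the config datastore (the MBean name pattern and the "Leader" attribute are assumptions, not verified against this exact build).

import requests

def shard_leader(host, member, shard="default"):
    # Read the shard MBean on the given member; an empty or missing "Leader"
    # value corresponds to the "has no leader" response seen in [3].
    mbean = ("org.opendaylight.controller:"
             "type=DistributedConfigDatastore,Category=Shards,"
             f"name={member}-shard-{shard}-config")
    url = f"http://{host}:8181/jolokia/read/{mbean}"
    resp = requests.get(url, auth=("admin", "admin")).json()
    return resp.get("value", {}).get("Leader")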

Every member shows multiple UnreachableMember messages; it is not clear whether the subsequent ones are the cause or the result of the missing leader.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s36-t1-k2-k13-k1-k3-k1
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/odl1_karaf.log.gz
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/odl3_karaf.log.gz
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s36-t1-k2-k14-k2-k3-k1-k4-k7-k1



 Comments   
Comment by Vratko Polak [ 23/May/17 ]

This seems to be a stable failure for the listener tests when the listener is located on the leader member and the shard replica is removed there. It happened in two runs in a row, for both module-based and prefix-based shard tests.

Example karaf.log [4]: from 09:14:05,199 to 09:14:57,946.

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/720/archives/odl2_karaf.log.gz

Comment by Vratko Polak [ 23/May/17 ]

> when the listener is located on the leader member and shard replica is removed there

It also happened once [5] when the (prefix-based shard) replica was removed from a leader located on a different member than the listener.
Previously, that scenario was running into CONTROLLER-1694 instead.

> Example karaf.log [4]

From 09:29:43,066 to 09:31:27,574.

[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/720/archives/log.html.gz#s1-s38-t3-k2-k12-k1-k3-k1

Comment by Peter Gubka [ 23/May/17 ]

New debug logs https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/10/

Odl used from https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/

Comment by Peter Gubka [ 25/May/17 ]

New debug logs (including akka)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/12/

Odl was built from controller=63/57763/2:99/57699/3 at
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/43

Comment by Peter Gubka [ 25/May/17 ]

New debug logs (including akka)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/13/

Odl built from controller=70/57770/4:99/57699/3 at
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/45/

Shard without a leader:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/13/archives/log.html.gz#s1-s2-t1-k2-k13-k1-k3-k1

Comment by Peter Gubka [ 26/May/17 ]

Testing without akka logs:
----------------------------
Odl built from: controller=22/57822/1
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/49/

Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/21/

Link to problem: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/21/archives/log.html.gz#s1-s2-t1-k2-k13-k1-k3-k1

Testing with akka logs:
----------------------------
Odl built from: controller=22/57822/1:99/57699/5
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/48/

Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/20/

Link to problem: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/20/archives/log.html.gz#s1-s2-t1-k2-k13-k1-k3-k1

Comment by Vratko Polak [ 08/Jun/17 ]

This is still present on Releng [19].

[19] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/736/log.html.gz#s1-s38-t3-k2-k12-k1-k3-k1

Comment by Vratko Polak [ 09/Jun/17 ]

Marking CONTROLLER-1706 as a dependency.

If the new election takes too long, a well-timed UnreachableMember could actually make the election finish sooner.
So if the suite still fails, we should be fixing CONTROLLER-1706 instead of this.

Also, it will save me some time, as now I can assign failures to CONTROLLER-1706 without a karaf.log investigation.

Keeping this open for now, as it is possible that UnreachableMember could mess with elections even after CONTROLLER-1706 is fixed.

Comment by Vratko Polak [ 18/Sep/17 ]

> Keeping this open for now, as it is possible that UnreachableMember
> could mess with elections even after CONTROLLER-1706 is fixed.

No such failures were seen; marking as fixed.
