[CONTROLLER-1693] UnreachableMember during remove-shard-replica prevents new leader to get elected Created: 22/May/17 Updated: 25/Jul/23 Resolved: 18/Sep/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| External issue ID: | 8524 | ||||||||
| Description |
|
This manifests as a CSIT failure [0]. The UnreachableMember itself is a separate issue (see CONTROLLER-1645, for example). It is possible that cluster members end up with an inconsistent shard configuration. Karaf.log on member-1 [1] shows the replica removal started at 01:55:43,472, then this happened: Finally, the test teardown started adding the replica back at 01:56:28,959. Because the member-3 karaf.log [2] shows no activity between 01:56:03,165 and 01:56:56,244, it looks as if member-1 was perhaps still the leader; however, the "has no leader" response [3] from member-1 when adding the replica back proves there really was no leader, at least from member-1's point of view. Every member shows multiple UnreachableMember messages. It is not clear whether the subsequent ones are the cause or the result of the missing leader. [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s36-t1-k2-k13-k1-k3-k1 |
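The "no activity between two timestamps" observation above can be checked mechanically. The following is a minimal sketch, not part of the CSIT suite: the helper name `find_gaps`, the 30-second threshold, and the assumed karaf.log line prefix (`YYYY-MM-DD HH:MM:SS,mmm`) are all illustrative assumptions.

```python
import re
from datetime import datetime, timedelta

# Assumed karaf.log line prefix: "2017-05-22 01:55:43,472 | INFO | ..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, min_gap=timedelta(seconds=30)):
    """Yield (start, end) pairs where consecutive timestamped lines
    are at least min_gap apart (hypothetical helper)."""
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            # Continuation lines (e.g. stack traces) carry no timestamp.
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current - previous >= min_gap:
            yield previous, current
        previous = current
```

Running it over each member's karaf.log (for example `list(find_gaps(open("karaf.log")))`) would list the silent windows, which can then be compared across members to see whether they overlap with the replica removal.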
| Comments |
| Comment by Vratko Polak [ 23/May/17 ] |
|
This seems to be a stable failure for listener tests when the listener is located on the leader member and the shard replica is removed there. It happened in two runs in a row, for both module-based and prefix-based shard tests. Example karaf.log [4]: from 09:14:05,199 to 09:14:57,946. |
| Comment by Vratko Polak [ 23/May/17 ] |
|
> when the listener is located on the leader member and shard replica is removed there
It also happened once [5] when the (prefix-based shard) replica was removed from a leader on a different member than the listener.
> Example karaf.log [4]
From 09:29:43,066 to 09:31:27,574. |
| Comment by Peter Gubka [ 23/May/17 ] |
|
New debug logs: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/10/ ODL was taken from https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/ |
| Comment by Peter Gubka [ 25/May/17 ] |
|
New debug logs (including akka) Odl was built from controller=63/57763/2:99/57699/3 at |
| Comment by Peter Gubka [ 25/May/17 ] |
|
New debug logs (including akka) Odl built from controller=70/57770/4:99/57699/3 at Shard without the leader |
| Comment by Peter Gubka [ 26/May/17 ] |
|
Testing without akka logs: Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/21/
Testing with akka logs: Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-2nd/20/ |
| Comment by Vratko Polak [ 08/Jun/17 ] |
|
This is still present on Releng [19]. |
| Comment by Vratko Polak [ 09/Jun/17 ] |
|
Marking If the new election takes too long, a well-timed UnreachableMember could actually make the election finish sooner. Also, it will save me some time, as now I can assign failures to Keeping this open for now, as it is possible that UnreachableMember could mess with elections even after
| Comment by Vratko Polak [ 18/Sep/17 ] |
|
> Keeping this open for now, as it is possible that UnreachableMember
No such failures were seen, marking as fixed. |