[CONTROLLER-1754] Carbon: Sporadic cluster failure when member is restarted in Netconf cluster test Created: 22/Aug/17  Updated: 19/Oct/17  Resolved: 09/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Nitrogen
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
duplicates CONTROLLER-1751 Sporadic cluster failure when member ... Resolved
External issue ID: 9027

 Description   

This is probably a duplicate of an already existing Bug.

CONTROLLER-1751 reports very similar behavior for Nitrogen. It is possibly the same bug, but the Robot symptom is different: on Nitrogen, Jolokia works and merely reports that the cluster is not in sync, whereas on Carbon, Jolokia does not find the ShardManager [0].

NETCONF-454 is perhaps the same bug, although, judging by its description, it looks like a superset of this Bug. According to a comment [1] there, this Bug can be fixed by [2], which is a fix for CONTROLLER-1627.

This Bug is not easy to reproduce reliably, because the failure frequency of a single restart is small. It significantly affects only the Netconf suite, because that suite performs multiple restarts.
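
To illustrate why a small per-restart failure frequency still hurts a suite with many restarts, here is a minimal sketch. It assumes restarts fail independently with the same probability, which has not been verified for this Bug. The function name and the numbers in the usage note are hypothetical, not measured values.

```python
def prob_any_failure(p_single: float, restarts: int) -> float:
    """Chance that at least one of `restarts` restarts fails.

    Assumption: each restart fails independently with probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be within [0, 1]")
    if restarts < 0:
        raise ValueError("restarts must be non-negative")
    # Complement of "every restart succeeded".
    return 1.0 - (1.0 - p_single) ** restarts
```

For example, with a purely illustrative per-restart rate of 0.02, `prob_any_failure(0.02, 20)` is noticeably larger than `prob_any_failure(0.02, 1)`, which matches the observation that only suites with many restarts are hit visibly.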

[0] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-only-carbon/630/log.html.gz#s1-s5-t13-k2-k2-k8-k1-k2-k1-k1-k3-k1
[1] https://bugs.opendaylight.org/show_bug.cgi?id=8999#c7
[2] https://git.opendaylight.org/gerrit/60485



 Comments   
Comment by Vratko Polak [ 22/Aug/17 ]

Waiting for Sandbox test results to see whether the cherry-picked fix [3] works.

[3] https://git.opendaylight.org/gerrit/62148

Comment by Vratko Polak [ 25/Aug/17 ]

> cherry-picked fix [3]

That does not work.

I have also tried a test change [4] that adds hard resets. It perhaps reduces the frequency, but it does not prevent this failure, as the Sandbox run shows [5].

I have run out of ideas.

[4] https://git.opendaylight.org/gerrit/62194
[5] https://logs.opendaylight.org/sandbox/jenkins091/netconf-csit-3node-clustering-only-carbon/20/log.html.gz#s1-s8-t19-k2-k2-k8-k1-k2-k1-k1-k3-k1

Comment by Robert Varga [ 04/Sep/17 ]

Well, the karaf restarts don't do enough to clear state and end up being the equivalent of a bundle reload.

I think the correct fix is to either auto-detect, or expose as a knob, the mechanism of JVM shutdown:

  • in karaf 4, signal restart as we do now
  • everywhere else, including OSGi, terminate the VM
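
The proposed selection logic could be sketched roughly as follows. This is only an illustration of the "auto-detect, or expose as a knob" idea and not the controller's actual code; `ShutdownMode`, `choose_shutdown_mode` and all parameter names are hypothetical.

```python
from enum import Enum
from typing import Optional


class ShutdownMode(Enum):
    SIGNAL_RESTART = "signal-restart"  # what is done now
    TERMINATE_VM = "terminate-vm"      # stop the whole JVM


def choose_shutdown_mode(
    container: str,
    major_version: Optional[int],
    override: Optional[ShutdownMode] = None,
) -> ShutdownMode:
    """Pick the shutdown mechanism (hypothetical helper).

    `override` plays the role of the knob; when it is not set,
    the mode is auto-detected from the container and its version.
    """
    if override is not None:
        return override
    if container.lower() == "karaf" and major_version == 4:
        return ShutdownMode.SIGNAL_RESTART
    # Everywhere else, including plain OSGi: terminate the VM.
    return ShutdownMode.TERMINATE_VM
```

With these assumptions, `choose_shutdown_mode("karaf", 3)` yields `ShutdownMode.TERMINATE_VM`, while an explicit `override` always wins over auto-detection.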

Comment by Vratko Polak [ 05/Sep/17 ]

It seems this bug is far less frequent on "all" jobs than on "only" jobs. I see only one failure [6], and it is a 404 (as in Jolokia not started) rather than the sync=false symptom seen before.

This is important, because Releng/Builder decided [7] to drop the "only" jobs to speed up the distribution-test cycle.

[6] https://logs.opendaylight.org/releng/jenkins092/netconf-csit-3node-clustering-all-carbon/385/log.html.gz#s1-s10-t13-k2-k2-k8-k1-k2-k1-k1-k2-k1-k4-k1
[7] https://git.opendaylight.org/gerrit/62297

Comment by Vratko Polak [ 05/Sep/17 ]

>> adding hard resets

> karaf restarts don't do enough to clear state

In this segment, the suite does the following:
0. kill the JVM
1. delete data/, snapshot/, journal/ and similar. Basically only etc/ and karaf.log survive (although there may be obscure places left where ODL applications can store their data).
2. start karaf again
3. wait for sync (if not synced within 5 minutes, it is this Bug).
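
The steps above can be sketched as a small driver. This is not the Robot suite's actual code: the four callables are hypothetical stand-ins for whatever the suite uses to kill the JVM, delete the state directories, start karaf and query the sync status, and the poll interval is an assumed value. Only the 5-minute timeout comes from the description above.

```python
import time
from typing import Callable


def restart_and_wait_for_sync(
    kill_jvm: Callable[[], None],
    clean_state: Callable[[], None],
    start_karaf: Callable[[], None],
    is_in_sync: Callable[[], bool],
    timeout_s: float = 300.0,  # 5 minutes, as in step 3
    poll_s: float = 5.0,       # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run steps 0-3; return False if sync is not reached in time."""
    kill_jvm()      # step 0
    clean_state()   # step 1
    start_karaf()   # step 2
    deadline = clock() + timeout_s
    while True:     # step 3
        if is_in_sync():
            return True
        if clock() >= deadline:
            return False  # the symptom reported in this Bug
        sleep(poll_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real waiting; a `False` result corresponds to the failure tracked here.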

Generated at Wed Feb 07 19:56:22 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.