[CONTROLLER-1768] SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up. Created: 08/Sep/17  Updated: 18/Jul/18  Resolved: 18/Jul/18

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: Carbon, Oxygen, Oxygen SR3
Fix Version/s: Fluorine, Oxygen SR3

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 9133

 Description   

This was seen in a netvirt 3node job:

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-upstream-stateful-carbon/118/log.html.gz

The job has the karaf logs, and in the log for ODL2 I saw this:

2017-09-08 12:43:45,101 | ERROR | Event Dispatcher | AbstractDataStore | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up

The system tests were passing OK until this point, after which a lot
of failures were seen.



 Comments   
Comment by Tom Pantelis [ 09/Sep/17 ]

It seems that shard leaders weren't resolved within 90 sec but eventually did settle around 2017-09-08 12:49, e.g.:

INFO | rd-dispatcher-30 | ShardManager | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | shard-manager-config: Received role changed for member-2-shard-inventory-config from Candidate to Leader

I'm not sure what the expectations of the test are but I suspect this is similar to CONTROLLER-1751 where node connectivity occasionally gets delayed in the test environment for some reason (hypothesis is overload in the environment).

Comment by Tom Pantelis [ 09/Sep/17 ]

I see in the bug summary that the test brings 2 out of the 3 nodes down then back up. The way akka works is that if a node becomes unreachable, it must become reachable again or be "downed" before the cluster can allow another node to join. So if 2 nodes are unreachable then both have to come back before it allows either back in. We've seen previously that this can take more than 5 min due to suspected intermittent load in the environment with 2 nodes restarting (and whatever other load there is in the VM environment). So most of the time it's relatively quick but sometimes it can take arbitrarily longer. So I would suggest increasing the 5 min expectation to avoid intermittent timeouts (I'd say at least 15 min to be safe).
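
For illustration only (this is not part of the existing suite), a wait along those lines could look roughly like the sketch below, which polls the shard manager's SyncStatus over Jolokia for up to 15 minutes. The MBean name, port 8181, hostnames and admin/admin credentials are assumptions based on a typical ODL install, not values taken from this job:

    # Minimal sketch: poll SyncStatus via Jolokia until true or a 15-minute
    # deadline passes. Hostnames, port, MBean name and credentials below are
    # illustrative assumptions, not values taken from this CSIT job.
    import time
    import requests

    JOLOKIA_URL = ("http://{host}:8181/jolokia/read/"
                   "org.opendaylight.controller:type=DistributedOperationalDatastore,"
                   "Category=ShardManager,name=shard-manager-operational")

    def wait_for_sync(host, timeout_s=15 * 60, poll_s=10):
        """Return True once the member reports SyncStatus=true, False on timeout."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                resp = requests.get(JOLOKIA_URL.format(host=host),
                                    auth=("admin", "admin"), timeout=5)
                if resp.ok and resp.json().get("value", {}).get("SyncStatus") is True:
                    return True
            except requests.RequestException:
                pass  # member may still be restarting; keep polling
            time.sleep(poll_s)
        return False

    if __name__ == "__main__":
        for member in ("odl-1.example.net", "odl-2.example.net", "odl-3.example.net"):
            print(member, "in sync:", wait_for_sync(member))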

Comment by Robert Varga [ 10/Sep/17 ]

I wonder if the tests should use more aggressive retry timings. Also are ODL2/ODL3 brought up concurrently?

Comment by Robert Varga [ 10/Sep/17 ]

I meant the retry timers at the akka.conf layer – I think they back off by default.
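
For reference, the knobs in question would be along these lines in a member's akka.conf; the keys below exist in the Akka classic remoting shipped with Carbon/Oxygen, but the values shown are purely illustrative and were not something tried in this job:

    # Illustrative values only - not a recommended or tested configuration.
    odl-cluster-data {
      akka {
        remote {
          # Classic remoting: how long a failed outbound connection stays gated
          # before a reconnect is attempted.
          retry-gate-closed-for = 5 s
        }
        cluster {
          # How long a member waits before retrying a failed join to its seed nodes.
          retry-unsuccessful-join-after = 10 s
          failure-detector {
            # Missed-heartbeat slack tolerated before a member is marked unreachable.
            acceptable-heartbeat-pause = 3 s
          }
        }
      }
    }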

Comment by Jamo Luhrsen [ 12/Sep/17 ]

(In reply to Tom Pantelis from comment #3)
> I see in the bug summary that the test brings 2 out of the 3 nodes down then
> back up. The way akka works is that if a node becomes unreachable
> then it must become reachable or "downed" before it can allow another node
> to join. So if 2 nodes are unreachable then both have to come back before it
> allows either back in. We've seen previously that this can take more than 5
> min due to suspected intermittent load in the environment with 2 nodes
> restarting (and whatever other load there is in the VM environment). So most
> of the time it's relatively quick but sometimes it can take arbitrarily
> longer. So I would suggest increasing the 5 min expectation to avoid
> intermittent timeouts (I'd say at least 15 min to be safe).

ok, I can change the test to wait for 15m for cluster sync, but what worries
me is that later in the same job, there are more fundamental failures from
netvirt's perspective (e.g. created networks not ending up in the config
store). Because of that I'm worried that something is just broken under
the hood and waiting might not be the answer.

but, I'll try.

Comment by Jamo Luhrsen [ 12/Sep/17 ]

(In reply to Robert Varga from comment #5)
> I meant retry timers at akka.conf layer – as I think they do back off by
> default.

do you have an example I can look at? It's not totally clear
to me what values I should tweak. This is worth a try too.

Comment by Jamo Luhrsen [ 12/Sep/17 ]

(In reply to Robert Varga from comment #4)
> I wonder if the tests should use more aggressive retry timings. Also are
> ODL2/ODL3 brought up concurrently?

yeah, ODL2/ODL3 are brought up in the same 1-2s window.

Comment by Jamo Luhrsen [ 11/Jul/18 ]

just wanted to note that I have also seen this recently in a local setup with a recent Oxygen build

Comment by Tom Pantelis [ 18/Jul/18 ]

On the kernel call, you mentioned you saw this when stopping/restarting 1 node. This issue is almost a year old and was related to 2 nodes stopping/restarting. From the earlier analysis and comments last year, it was determined that the 2 nodes did actually re-join but it happened after the 5 min test deadline; it was suggested to increase the test timeout. I assume there is still a controller test that does this (I know there was originally)? If so, and it hasn't been failing, can we close this issue?

Comment by Jamo Luhrsen [ 18/Jul/18 ]

so we want to use a new bug then?

Comment by Tom Pantelis [ 18/Jul/18 ]

Well, we already have CONTROLLER-1849 for a 1-node restart not rejoining. I assume the failure you saw that prompted your recent comment here was the same issue. So unless this issue is still current, let's close it.

Keep in mind - the 401s due to the failed AAA reads and SyncStatus remaining false are all symptoms that the node did not rejoin the cluster on restart for whatever reason. At this point, AFAIK, this is only being seen occasionally when the first seed node is killed/restarted (i.e. CONTROLLER-1849).

Comment by Jamo Luhrsen [ 18/Jul/18 ]

ok, I follow that logic. We'll open this back up if there is something specific about 2 nodes going down/up,
and we have CONTROLLER-1849 to track all that we think is left at this point.

Comment by Jamo Luhrsen [ 18/Jul/18 ]

Closing, going with the assumption that this is not seen anymore in this specific scenario
where 2 of 3 nodes go down and come back up. We do have a very similar bug
in CONTROLLER-1849 that only deals with one node (the first seed node) going down
and back up. We'll track that problem there.
