[CONTROLLER-1768] SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up. Created: 08/Sep/17 Updated: 18/Jul/18 Resolved: 18/Jul/18 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon, Oxygen, Oxygen SR3 |
| Fix Version/s: | Fluorine, Oxygen SR3 |
| Type: | Bug | Priority: | High |
| Reporter: | Jamo Luhrsen | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 9133 |
| Description |
|
This was seen in a netvirt 3node job. The job has the karaf logs, and in the log for ODL2 I saw this: 2017-09-08 12:43:45,101 | ERROR | Event Dispatcher | AbstractDataStore | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up. The system tests were passing OK until this point, after which a lot |
| Comments |
| Comment by Tom Pantelis [ 09/Sep/17 ] |
|
It seems that shard leaders weren't resolved after 90 sec but eventually did settle around 2017-09-08 12:49, e.g. INFO | rd-dispatcher-30 | ShardManager | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | shard-manager-config: Received role changed for member-2-shard-inventory-config from Candidate to Leader. I'm not sure what the expectations of the test are, but I suspect this is similar to |
| Comment by Tom Pantelis [ 09/Sep/17 ] |
|
I see in the bug summary that the test brings 2 of the 3 nodes down and then back up. The way akka works is that if a node becomes unreachable, it must become reachable again or be "downed" before the cluster will allow another node to join. So if 2 nodes are unreachable, both have to come back before it allows either back in. We've seen previously that this can take more than 5 min, presumably due to intermittent load in the environment while 2 nodes restart (plus whatever other load there is in the VM environment). Most of the time it's relatively quick, but sometimes it can take arbitrarily longer. So I would suggest increasing the 5 min expectation to avoid intermittent timeouts (I'd say at least 15 min to be safe). |
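For illustration, a minimal sketch of how a test could poll the datastore SyncStatus over Jolokia with a longer deadline; the host, port, credentials, the 15-minute cap, and the shard-manager MBean name are assumptions based on a stock ODL setup, not something confirmed in this thread, and should be checked against the actual environment:

```python
# Hypothetical helper: poll the config datastore SyncStatus via Jolokia
# until it turns true or a deadline (15 min here, per the suggestion above) expires.
import time
import requests

# Assumed MBean name, following the shard-manager naming seen in the log lines above.
SYNC_MBEAN = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
              "Category=ShardManager,name=shard-manager-config")

def wait_for_sync(host, timeout_s=15 * 60, poll_s=10,
                  auth=("admin", "admin"), port=8181):
    """Return True once SyncStatus is reported true, False on timeout."""
    url = f"http://{host}:{port}/jolokia/read/{SYNC_MBEAN}/SyncStatus"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            reply = requests.get(url, auth=auth, timeout=5).json()
            if reply.get("value") is True:
                return True
        except (requests.RequestException, ValueError):
            pass  # the node may still be restarting; keep polling
        time.sleep(poll_s)
    return False
```

A suite could call such a helper for each restarted node after bringing ODL2/ODL3 back up, rather than failing hard at the 5 min mark.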
| Comment by Robert Varga [ 10/Sep/17 ] |
|
I wonder if the tests should use more aggressive retry timings. Also are ODL2/ODL3 brought up concurrently? |
| Comment by Robert Varga [ 10/Sep/17 ] |
|
I meant the retry timers at the akka.conf layer, as I think they back off by default. |
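For illustration, a minimal akka.conf fragment showing the kind of join/retry timers this likely refers to; the key names come from stock Akka classic clustering/remoting, and the values are placeholders rather than recommended settings:

```
# Sketch of akka.conf retry-related knobs (illustrative values only).
akka {
  remote {
    # How long a failed connection stays gated before reconnects are retried.
    retry-gate-closed-for = 5 s
  }
  cluster {
    # How long to wait for a response from a seed node before trying the next one.
    seed-node-timeout = 5 s
    # How often a node retries an unsuccessful join attempt.
    retry-unsuccessful-join-after = 10 s
  }
}
```

Whether tightening these actually speeds up re-join in this environment would need to be measured; the defaults do retry, just with back-off as noted above.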
| Comment by Jamo Luhrsen [ 12/Sep/17 ] |
|
(In reply to Tom Pantelis from comment #3) ok, I can change the test to wait 15m for cluster sync; it worries me a bit, but I'll try it. |
| Comment by Jamo Luhrsen [ 12/Sep/17 ] |
|
(In reply to Robert Varga from comment #5) do you have an example I can look at? It's not totally clear to me what you mean. |
| Comment by Jamo Luhrsen [ 12/Sep/17 ] |
|
(In reply to Robert Varga from comment #4) yeah, ODL2/ODL3 are brought up in the same 1-2s window. |
| Comment by Jamo Luhrsen [ 11/Jul/18 ] |
|
just wanted to note that I have also seen this recently in a local setup with a recent Oxygen build |
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
On the kernel call, you mentioned you saw this when stopping/restarting 1 node. This issue is almost a year old and was related to 2 nodes stopping/restarting. From earlier analysis and comments last year, it was determined that the 2 nodes did actually re-join, but it happened after the 5 min test deadline. It was suggested to increase the test timeout. I assume there is still a controller test that does this (I know there was originally)? If so, and it hasn't been failing, can we close this issue? |
| Comment by Jamo Luhrsen [ 18/Jul/18 ] |
|
so we want to use a new bug then? |
| Comment by Tom Pantelis [ 18/Jul/18 ] |
|
Well, we already have one. Keep in mind: the 401s due to the failed AAA reads and SyncStatus remaining false are all symptoms that the node did not rejoin the cluster on restart, for whatever reason. At this point, AFAIK, this is only being seen occasionally when the first seed node is killed/restarted (ie |
| Comment by Jamo Luhrsen [ 18/Jul/18 ] |
|
ok, I follow that logic. we'll open this back up if there is something specific about 2 nodes going down/up, |
| Comment by Jamo Luhrsen [ 18/Jul/18 ] |
|
going with the assumption that this is not seen any more in this specific scenario |