[CONTROLLER-1849] controller not coming up healthy after being killed and restarted (401 after 5m) Created: 03/Jul/18 Updated: 13/Dec/18 Resolved: 13/Dec/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | Oxygen |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Lowest |
| Reporter: | Jamo Luhrsen | Assignee: | Tom Pantelis |
| Resolution: | Won't Do | Votes: | 1 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Priority: | Low |
| Description |
|
Here is an example of this problem from the robot logs. The karaf log is pretty messy if you look from the restart. The controller was initially killed with 'kill -9'. The controller logs and robot logs can be found here. |
| Comments |
| Comment by Jamo Luhrsen [ 03/Jul/18 ] |
| Comment by Jamo Luhrsen [ 03/Jul/18 ] |
|
and another |
| Comment by Tom Pantelis [ 04/Jul/18 ] |
|
After the restart, here's a grep of all akka output: 2018-06-19T13:31:55,180 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Slf4jLogger | 48 - com.typesafe.akka.slf4j - 2.5.11 | Slf4jLogger started 2018-06-19T13:31:55,254 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Remoting | 48 - com.typesafe.akka.slf4j - 2.5.11 | Starting remoting 2018-06-19T13:31:55,437 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Remoting | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] 2018-06-19T13:31:55,466 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Starting up... 2018-06-19T13:31:55,529 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Registered cluster JMX MBean [akka:type=Cluster] 2018-06-19T13:31:55,530 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Started up successfully 2018-06-19T13:32:07,632 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles [member-1, dc-default] 2018-06-19T13:32:07,648 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up] 2018-06-19T13:36:40,475 | INFO | SystemReadyService-0 | TestBundleDiag | 238 - org.opendaylight.infrautils.ready-impl - 1.3.3.SNAPSHOT | OK com.typesafe.akka.slf4j:2.5.11: OSGi state = Active, Karaf bundleState = Active, due to: Declarative Services
What that tells me is that it wasn't able to connect to another node, so it made itself leader and joined itself, thereby forming a cluster island. This is member-1, which I assume is the first seed node listed, so it has special behavior that allows it to do that (the other seed nodes cannot). This is governed by the seed-node-timeout setting in the akka.conf, which we set to 12s. So from the log, between 13:31:55 and 13:32:07, which is 12s, it tried to connect to either of the other 2 nodes but was unable to do so, thus it thinks no other node is up and forms its own cluster. The shards don't hear from a leader so they go to Candidate and keep starting elections over and over trying to become leader. So either there was an issue in the CSIT environment that prevented connectivity or an issue with akka that prevented it from connecting to either of the other 2 nodes (or maybe a combination of both). Either way, I can't tell anything more from this log. Do we still have the logs from the other 2 nodes? Perhaps there might be akka messages in one or both of those logs indicating connection attempts from member-1. Is this Oxygen? - can you set the affected version? |
| Comment by Tom Pantelis [ 04/Jul/18 ] |
|
Looking at odl2's log, I see a bunch of these repeated: 2018-06-19T13:31:55,865 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] 2018-06-19T13:31:55,865 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]]
Followed by a bunch of these: 2018-06-19T13:32:18,275 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-24 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.115:2550: Unreachable [Unreachable] (3), akka.tcp://opendaylight-cluster-data@10.30.170.158:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.115:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.30.170.123:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.158:2550 Up seen=true]
And we see similar in odl3's log: 2018-06-19T13:31:55,862 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] 2018-06-19T13:31:55,862 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]]
So join messages are getting thru to the other nodes but no indication why odl1 (10.30.170.115) was not joined back in. Another akka mystery. I can try opening a bug in akka and see if we get any support. |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
So odl1 was restarted several times over the course of an hour. The time previous to 06-19T13:31:27 when it didn't rejoin was at 06-19T13:30:06. On odl2, we see: 2018-06-19T13:30:32,044 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#383916203]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] 2018-06-19T13:30:32,045 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#383916203]] 2018-06-19T13:30:32,146 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.115:2550, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join. 2018-06-19T13:30:32,147 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] as [Down] 2018-06-19T13:30:33,282 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] 2018-06-19T13:30:42,777 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#-1096982950]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] 2018-06-19T13:30:42,777 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#-1096982950]] 2018-06-19T13:30:42,825 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles 
[member-1, dc-default]
2018-06-19T13:30:44,278 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up]
On that restart, odl1 was let back in as expected. About 30 sec later it was killed again:
2018-06-19T13:31:15,640 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | NettyTransport | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [/10.30.170.115:50464] failed with java.io.IOException: Connection reset by peer
2018-06-19T13:31:15,642 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | ReliableDeliverySupervisor | 48 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-06-19T13:31:19,275 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | ClusterCoreDaemon | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.115:2550, status = Up)]. Node roles [member-2, dc-default]
So odl1 was killed and restarted twice in < 2 minutes. The second time it wasn't allowed back into the cluster within the 12s seed node timeout period. I wonder if there's some logic or edge case in akka which caused it to delay the rejoin due to the quick kills and restarts. Maybe a longer seed-node-timeout setting like 25s would help. I can try to reproduce manually by doing quick kills/restarts. |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
So looking at odl1 during the prior successful restart at 06-19T13:30:06, we see: 2018-06-19T13:30:42,821 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.123:2550/system/cluster/core/daemon#1793055218]] to [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] 2018-06-19T13:30:42,854 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Welcome from [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] So akka does print out the InitJoinAck response from another node. We don't see this during the 06-19T13:31:27 restart. I wonder if it's possible there was only one-way network communication between 10.30.170.115 and the other nodes during this time - the behavior seems to indicate that... |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
I am assuming (based on newer comments) that you found the other logs. Yes, Oxygen (FYI, you can see which version in the URL of the CSIT job). I have set the affected version. |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
yeah, I noticed your link afterwards. I figured it was Oxygen since I saw config system output, but I purged that stuff from Fluorine |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
I'm trying to better understand how the suite is working (it's new to me). These are the keywords in play:
Kill Switchs Old Owner ${switch_name}
Restart Switchs Old Owner ${switch_name}
Kill Switchs Successor ${switch_name}
Restart Switchs Successor ${switch_name}
The kill keywords first figure out the "owner" or "successor" before killing that node. I'm hoping that infra communication being broken and only one-way is unlikely. I suppose one effort we can take is to write some automation to quickly kill and restart a node to try to reproduce this. |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
yeah, the broken infra one-way communication theory seems unlikely. But the behavior seen in the logs is consistent with there being broken one-way communication somewhere, so that's why I mentioned it. The logs show odl1 sent InitJoin to both odl2 and odl3 and both show they replied with InitJoinAck. But we don't see the InitJoinAck received message in odl1 as is seen in a successful rejoin scenario. We also don't see odl2 or odl3 marking odl1 as REACHABLE even tho both received multiple messages from odl1. According to my experience with akka and reading their docs, there shouldn't be any reason odl2 or odl3 shouldn't allow odl1 back in provided there's full 2-way communication. Unless there's some edge-case bug in akka where it still has stale state for odl1.... akka has strict rules wrt convergence where all surviving nodes have to agree to remove the old node incarnation and let the new node incarnation join. I'm going to try locally to keep killing/restarting the first seed node - hopefully tomorrow. As far as what to look for.... you don't want to see odl1 reporting this after the restart:
2018-06-19T13:32:07,632 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles [member-1, dc-default]
2018-06-19T13:32:07,648 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up]
That means it joined itself and "Up"ed itself. You want to see
2018-06-19T13:30:42,854 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Welcome from [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]
meaning one of the other 2 nodes, 10.30.170.123 in this case, let it back into the surviving cluster. |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
gotcha, so if I just write some script to down/up a single node and check the logs for that, it should reproduce? |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
It has to be a 3-node setup where you down/up the first seed node listed in the akka.conf. As I mentioned earlier, the first seed node has special significance in that it bootstraps the cluster so it must be running in order to initially form a cluster and is the only node that can join itself. All subsequent seed nodes listed must be able to contact another seed and join with it. So you wouldn't see this 2 cluster island issue if a non-first seed node (odl2 or odl3) were restarted. |
| Comment by Jamo Luhrsen [ 05/Jul/18 ] |
|
ok, makes sense. Can you tell me how to figure out the first seed node? I assume there is something in the logs. |
| Comment by Tom Pantelis [ 05/Jul/18 ] |
|
Open configuration/initial/akka.conf - look for "seed-nodes = ..." - the first one in that list. |
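For illustration, here is roughly what that section of configuration/initial/akka.conf looks like - a sketch only, using the member addresses seen in the logs above; the exact wrapper/nesting in a given deployment may differ:
odl-cluster-data {
  akka {
    cluster {
      # The FIRST entry in this list is the "first seed node" that bootstraps
      # the cluster and is the only node allowed to join itself.
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@10.30.170.115:2550",
                    "akka.tcp://opendaylight-cluster-data@10.30.170.123:2550",
                    "akka.tcp://opendaylight-cluster-data@10.30.170.158:2550"]
    }
  }
}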
| Comment by Tom Pantelis [ 08/Jul/18 ] |
|
I think it would be good to determine if this issue also occurs if the node is gracefully shut down. I read somewhere that forcefully killing a process does not cleanly shut down the ports and thus a port may stay blocked for a while depending on the operating system. I'm not sure how such a state would manifest in akka however - we didn't see any error/failure reported by akka. |
| Comment by Jamo Luhrsen [ 09/Jul/18 ] |
|
this is obviously a good idea and will be a good data point, but I can't really do this at this time. I am currently trying to reproduce this locally. |
| Comment by Jamo Luhrsen [ 09/Jul/18 ] |
|
here's a simple bash script I'm using to try to locally reproduce. I'm using the netvirt-ha-docker tooling to make a local cluster.
#!/bin/bash
while true
do
    sudo docker exec -i odl_172.28.5.1 ps aux | grep karaf.main | grep -v grep | awk '{print "kill -9",$2}' | sudo docker exec -i odl_172.28.5.1 sh
    # pick a random value between 1-90 seconds for when we start karaf again
    sleepy_time=$((RANDOM%90))
    echo "Waiting $sleepy_time seconds after killing karaf, before starting"
    sleep $sleepy_time
    sudo docker exec -i odl_172.28.5.1 /odlha/karaf/target/assembly/bin/start
    tries=0
    until [ $tries -ge 60 ]
    do
        ((tries++))
        echo "$tries th check for 200 code"
        RESP=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -u "admin:admin" http://172.28.5.1:8181/jolokia/read/org.opendaylight.controller:Category\=ShardManager,name\=shard-manager-config,type\=DistributedConfigDatastore)
        CODE=$(echo $RESP | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')
        if [ $CODE -eq "200" ]; then
            break
        fi
        sleep 2
    done
    # we'll have looped for 120s (60 tries) above if 200 never came, so maybe things are broken
    if [ $tries -eq 60 ]; then
        echo "Might have caught the bugger"
        exit 1
    fi
done |
| Comment by Jamo Luhrsen [ 09/Jul/18 ] |
|
I basically tried the above script manually all morning (maybe 20-30 times), but once I scripted it and stopped looking I do see this (smoking gun, I think):
2018-07-09T23:11:59,187 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-59 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]
tpantelis, let me try to reproduce this a few more times so we can get a sense of how frequent I can hit it. Assuming I can hit it with regularity, I'll try your idea of doing a "stop" instead of a "kill -9". In the meantime, is there anything else to try/look for while doing this to help get to the root cause. BTW, this is all done on a single laptop, so I hope that can rule out infra networking issues. |
| Comment by Tom Pantelis [ 09/Jul/18 ] |
|
One other thing I can think of is to bump the seed-node-timeout from 12s to say 25s. It's specified in the configuration/factory/akka.conf but that file is overwritten on feature install. You'd want to override in configuration/initial/akka.conf which is preserved. I think running 3 instances on the same box (separate dockers I assume) should rule out networking issues. Given that, it's likely some edge case issue in akka. As mentioned before it may have to do with killing/restarting in relatively quick succession. Let's try to characterize it as much as we can then we can open an issue with akka and see what happens from there. |
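A sketch of the suggested override, assuming the same odl-cluster-data wrapper used in ODL's akka.conf (the surrounding nesting may differ by release) - merge this into configuration/initial/akka.conf so it survives feature installs:
odl-cluster-data {
  akka {
    cluster {
      # Give the first seed node longer to hear back from the other seeds
      # before it gives up and joins itself (the value in use here was 12s).
      seed-node-timeout = 25s
    }
  }
}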
| Comment by Jamo Luhrsen [ 10/Jul/18 ] |
|
I posted my script to gerrit since I kept tweaking it. Over the past 24 hours I was able to reproduce this 6 times. Each reproduction took a number of iterations of the kill/restart loop. Currently, I'm running this script with a seed-node-timeout of 30s. I will report back. |
| Comment by Luis Gomez [ 17/Jul/18 ] |
|
Jamo, did you finally try with seed-node-timeout of 30s? |
| Comment by Jamo Luhrsen [ 17/Jul/18 ] |
|
I was reproducing these kinds of issues locally, and for a week I was able to hit this. During our kernel call today, one idea was that we can do away with dynamicAuthorization. |
| Comment by Tom Pantelis [ 17/Jul/18 ] |
|
Looking at the logs from Luis test, we see similar behavior where odl1 didn't rejoin the cluster and joined itself after 22s. It did get an InitJoinAck response from odl2. odl1: 2018-07-16T11:18:20,251 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.26:2550/system/cluster/core/daemon#-1596937295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] 2018-07-16T11:18:42,585 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] is JOINING, roles [member-1, dc-default] 2018-07-16T11:18:42,614 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] to [Up] odl2: 2018-07-16T11:18:14,848 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | ReliableDeliverySupervisor | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550] 2018-07-16T11:18:20,105 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] 2018-07-16T11:18:20,106 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] 2018-07-16T11:18:20,291 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.94:2550, status = Up)] is trying to join. 
*Existing will be removed from the cluster and then new member will be allowed to join.* 2018-07-16T11:18:20,292 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - *Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] as [Down]* 2018-07-16T11:18:21,226 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - *Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]* 2018-07-16T11:18:30,564 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] 2018-07-16T11:18:30,565 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] several more InitJoin/InitJoinAck sequences... Similar InitJoin/InitJoinAck sequences on odl3. In this sequence, odl2 actually saw that odl1 was restarted with a new incarnation, marked it as Down, and stated it was removing the previous unreachable incarnation. It then is supposed to allow the new incarnation in but never did which led odl1 to join itself after the 12s seed-node-timeout. Strange. I would say this definitely is an issue with akka. |
| Comment by Tom Pantelis [ 17/Jul/18 ] |
| Comment by Luis Gomez [ 19/Jul/18 ] |
|
When you run OF cluster test many times like I did in sandbox this week, I can see the issue of controller not joining the cluster after kill+start also occurs in instances different than 1. For example see this run in sandbox where instance 3 is killed and started and does not join: I think the root cause for this may be the same as with instance 1, so do we want to open a new bug or just track it here? |
| Comment by Tom Pantelis [ 19/Jul/18 ] |
|
It's not exactly the same scenario since only the first seed node can join itself - the others cannot and will try forever to re-join. But I would say the fact that it doesn't rejoin is the same issue. So I think we can track it here. |
| Comment by Luis Gomez [ 19/Jul/18 ] |
|
Also from my test this is the most frequent issue the OF cluster suite has today. It is also very apparent because during the suite we kill and start instances a few times. |
| Comment by Tom Pantelis [ 19/Jul/18 ] |
|
Have you tried w/o killing it, ie shut it down gracefully? It would be interesting to see if it happens in that case. |
| Comment by Luis Gomez [ 19/Jul/18 ] |
|
Yes, that was my point: at the root there is the issue of an instance not joining regardless of what happens later. I really thought this problem was resolved in the recent akka upgrade, but in the end it just happens less often than before. |
| Comment by Tom Pantelis [ 19/Jul/18 ] |
|
right - failure to rejoin is the root issue. |
| Comment by Luis Gomez [ 19/Jul/18 ] |
|
Yes, I think shutdown works or fails less often, let me change most kills to shutdown and see what happens. |
| Comment by Tom Pantelis [ 19/Jul/18 ] |
|
I'd like to see if it doesn't fail at all if shutdown gracefully. If that's the case then that gives us another data point to further characterize the issue and we can report that on the akka issue. |
| Comment by Luis Gomez [ 21/Jul/18 ] |
|
After 1 day running I only see failure in the kill+start scenario which means something in the kill adds to the problem. |
| Comment by Tom Pantelis [ 22/Jul/18 ] |
|
I got a response from akka folks - they've asked us to enable debug. In configuration/initial/akka.conf, merge in these settings: akka {
loglevel = DEBUG
remote {
...
log-received-messages = on
log-sent-messages = on
}
...
}
In etc/org.ops4j.pax.logging.cfg, add:
log4j2.logger.cluster.name=akka.cluster
log4j2.logger.cluster.level=DEBUG
log4j2.logger.remote.name=akka.remote
log4j2.logger.remote.level=DEBUG
This will produce a lot of output so I'd suggest to pare down the tests to only what's needed to repro, eg don't re-run the tests with tell-based and don't run the script that creates 10k cars. Also when restarting the node after kill, delete the previous log so it starts with a clean one. |
| Comment by Luis Gomez [ 23/Jul/18 ] |
|
I started a small suite with all debugs on in sandbox and so far no luck reproducing the problem |
| Comment by Tom Pantelis [ 23/Jul/18 ] |
|
Jamo had a test script running in sandbox that was reproducing it - perhaps get that going again. The akka folks have also suggested we try artery to see if it reproduces - so once we get a test/script that reproduces it. |
| Comment by Jamo Luhrsen [ 23/Jul/18 ] |
|
At this point, I think I'm still able to reproduce this, but the high level symptom is now a 404 on /restconf or /jolokia. I did not try with the debugs enabled yet; I wanted to double check my automation could hit it. I can see this in the karaf.log, which I have been assuming is our smoking gun (1st seed node is on an island):
2018-07-23T20:04:13,072 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-51 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]
I didn't run this with the debugs, but something else is bothersome. Seems like a lot of blueprint errors are showing up. |
| Comment by Tom Pantelis [ 23/Jul/18 ] |
|
If shard leaders don't converge then there will be blueprint timeouts in other apps b/c the DS isn't ready. Most apps depend on the DS. Even if an app does startup, there will be timeout failures when it tries to access the DS. |
| Comment by Jamo Luhrsen [ 23/Jul/18 ] |
|
ok, so to be clear, we think this is the same exact bug right? what about that 404 then? I thought with dynamic auth disabled... oh wait, that was for 401. ok, so 404 is probably expected now right? |
| Comment by Tom Pantelis [ 23/Jul/18 ] |
|
404 is "not found". I assume that's coming from jetty, so AAA and restconf haven't started b/c they depend on the DS. So it all comes back to the root cause, ie akka not rejoining. At this point we need a reproduction with the akka debug enabled so we can send that off to them. A minimal repro would be best, ie bring up the 3 nodes, verify the cluster is OK, then kill/restart the first seed node over and over. So don't create any cars/people etc. Also delete the karaf log before each restart until it fails. |
| Comment by Jamo Luhrsen [ 23/Jul/18 ] |
got it. that's what I was hoping.
that's all I am doing. It's just a slightly improved version of this script. My automation is not going to delete the karaf.log, but I assume I can manually truncate the logs to make it easier to follow.
| Comment by Tom Pantelis [ 23/Jul/18 ] |
|
yeah. It would make it a little easier to delete the log if possible but it's OK. |
| Comment by Luis Gomez [ 24/Jul/18 ] |
|
Lets keep this bug open for the issue we see in OFP of instance failing to join cluster after it is killed and started and shows auth error 401 when jolokia/REST is invoked on it: |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
tpantelis, what do you want here? ecelgp's link above is to a node (NOT the 1st seed node) that was bounced. We don't have that 401 any more when we got rid of dynamic auth for the case when we get an island leader on the first seed node. New bug, or just keep this one to track all of these very similar problems? |
| Comment by Luis Gomez [ 24/Jul/18 ] |
|
This is seen in the starting instance (member-2): 2018-07-24T11:31:56,731 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | JoinSeedNodeProcess | 48 - com.typesafe.akka.slf4j - 2.5.11 | Couldn't join seed nodes after [16] attempts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.30.171.71:2550, akka.tcp://opendaylight-cluster-data@10.30.171.61:2550] |
| Comment by Tom Pantelis [ 24/Jul/18 ] |
|
Looks like that was in Oxygen - not sure of the state there. I've been focusing on Fluorine. That looks like the same issue except it wasn't the first seed node. Again, root cause is a node not rejoining, first seed node or not. I think we see it much more frequently with the first seed node b/c it can join itself with a deadline of only 12s - the others will just keep trying to join the existing cluster indefinitely which I think has a 5 min deadline in CSIT. |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
I've only been using Oxygen in my local reproduction of this. I thought we could not even hit the island node issue if it wasn't the first seed node? |
| Comment by Tom Pantelis [ 24/Jul/18 ] |
|
It won't hit the island node issue but... see my previous note. To our code it doesn't matter - if the node doesn't join, we can't converge shard leaders. |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
ok, so what was the big deal about only doing this with the first seed node? |
| Comment by Tom Pantelis [ 24/Jul/18 ] |
|
Because up until now we've only seen it for the first seed node and, as I explained above, it is much more likely to happen there. I don't know why there would still be a 401. |
| Comment by Jamo Luhrsen [ 24/Jul/18 ] |
|
I think/thought I had this reproduced locally with the akka debugs enabled, but I'm having trouble making sense of my logs. I do see this output repeated 31 times at the end of my karaf log:
Jul 24, 2018 12:52:34 AM org.apache.karaf.main.lock.SimpleFileLock lock INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 12:52:34 AM org.apache.karaf.main.lock.SimpleFileLock lock INFO: Lock acquired
Jul 24, 2018 12:52:34 AM org.apache.karaf.main.Main$KarafLockCallback lockAquired INFO: Lock acquired. Setting startlevel to 100
Jul 24, 2018 12:54:29 AM org.apache.karaf.main.lock.SimpleFileLock lock INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 12:54:29 AM org.apache.karaf.main.lock.SimpleFileLock lock INFO: Lock acquired
which reminds me of another issue. But the timestamps of those messages are off compared to the message right before they started:
2018-07-24T00:51:10,559 | INFO | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker | 199 - org.opendaylight.controller.blueprint - 0.8.3.SNAPSHOT | Blueprint container for bundle org.opendaylight.netvirt.qosservice-impl_0.6.3.SNAPSHOT [359] was successfully created
I think the karaf-made logs are using a different method of creating timestamps. Also, my other two controllers, which I stopped at the same time (all three got a final kill -9 at the same time), have me very confused. I'll try again, with some additional steps in my bouncer scripts to keep track of timestamps better. |
| Comment by Luis Gomez [ 24/Jul/18 ] |
|
Coming back to: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-only-oxygen/336/robot-plugin/log.html.gz#s1-s2 I see the restarting member-2 fails to join: 2018-07-24T11:28:43,869 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Slf4jLogger | 48 - com.typesafe.akka.slf4j - 2.5.11 | Slf4jLogger started 2018-07-24T11:28:43,912 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Remoting | 48 - com.typesafe.akka.slf4j - 2.5.11 | Starting remoting 2018-07-24T11:28:44,143 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Remoting | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] 2018-07-24T11:28:44,168 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Starting up... 2018-07-24T11:28:44,268 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Registered cluster JMX MBean [akka:type=Cluster] 2018-07-24T11:28:44,269 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Started up successfully 2018-07-24T11:29:08,440 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-23 | JoinSeedNodeProcess | 48 - com.typesafe.akka.slf4j - 2.5.11 | Couldn't join seed nodes after [2] attempts, will try again. 
seed-nodes=[akka.tcp://opendaylight-cluster-data@10.30.171.71:2550, akka.tcp://opendaylight-cluster-data@10.30.171.61:2550] while the other 2 members receive and respond to the join requests: 2018-07-24T11:28:44,623 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] to [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] 2018-07-24T11:28:44,624 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] 2018-07-24T11:28:56,431 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] to [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] 2018-07-24T11:28:56,434 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] This really seems an issue in the restarting member not getting or processing InitJoinAck from the other 2. |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
yup - it's the same pattern I've seen in numerous logs. |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
I think it will be easier to make sense out of it if it restarts with a clean log. |
| Comment by Tom Pantelis [ 25/Jul/18 ] |
|
So the 401's occur in Oxygen b/c AAA shiro doesn't initialize due to the DS not ready: 2018-07-24T11:33:26,723 | ERROR | paxweb-extender-1-thread-1 | EnvironmentLoader | 143 - org.apache.shiro.web - 1.3.2 | Shiro environment initialization failed java.lang.RuntimeException: Error obtaining AAAShiroProvider at org.opendaylight.aaa.shiro.web.env.KarafIniWebEnvironment.init(KarafIniWebEnvironment.java:76) ~[189:org.opendaylight.aaa.shiro:0.7.3] Without shiro, we can't authenticate. This code was refactored in Fluorine such that I believe we no longer get 401's but 404's. Again either way, it's all due to shards not converging due to akka not rejoining. |
| Comment by Jamo Luhrsen [ 25/Jul/18 ] |
|
quick note that I let myself get sidetracked by |
| Comment by Jamo Luhrsen [ 27/Jul/18 ] |
|
Here are the logs with these extra debugs from a recent reproduction in my local setup (karaf.1.log and the other two karaf logs). karaf.1.log is the 1st seed node, and the node I am doing the kill -9 and start on repeatedly until we hit this bug. I've trimmed karaf.1.log down to only include the logs since its final restart. You can find this log line (which we think is the smoking gun):
2018-07-27T10:25:04,024 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-55 | Cluster(akka://opendaylight-cluster-data) | 90 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]
you can see 172.28.5.1 is moving itself to Leader [Up]. The other two karaf.log files are from the other controllers, and I have trimmed them to only include logs from the same time frame. |
| Comment by Tom Pantelis [ 27/Jul/18 ] |
|
nice - thanks |
| Comment by Tom Pantelis [ 31/Jul/18 ] |
|
I attached the logs to the akka issue. The bottom line is that node1 was able to send messages (InitJoin) to nodes 2 and 3 but no messages from nodes 2 and 3 were received by node1 (InitJoinAck, Heartbeat). We'll see what they think but I suspect the next step might be to run wireshark or something to see what's actually getting on the wire. I think the issue is on node1's side - otherwise it would mean both nodes 2 and 3 would experience the same issue simultaneously, which is much less likely. Plus nodes 2 and 3 were communicating back and forth fine. |
| Comment by Sam Hague [ 31/Jul/18 ] |
|
jluhrsen look at https://git.opendaylight.org/gerrit/#/c/74585/ for how we added the tcpdump for port 6653 ofp traffic on devstack nodes. That function can be used to start one for akka traffic. We might need a variant https://git.opendaylight.org/gerrit/#/c/74586/ because the tests stop the tcpdump based on the grep of tcpdump in ps, so at the end of each suite it will kill all running tcpdump's. That check should be more selective if we need to keep the akka tcpdump running for the whole job duration. If the job is a single suite then it is not a problem. |
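For reference, a hypothetical tcpdump invocation for the akka remoting traffic (port 2550 in these deployments) - the interface, filter and output path would need to be adjusted for the actual CSIT nodes:
# capture full akka remoting packets on all interfaces into a pcap file
sudo tcpdump -i any -s 0 -w /tmp/akka-2550.pcap tcp port 2550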
| Comment by Tom Pantelis [ 01/Aug/18 ] |
|
Here's the docs for Akka artery : https://doc.akka.io/docs/akka/2.5.11/remoting-artery.html. They have added TCP support since I last looked at it but another article I read suggests it isn't quite ready for production. I would think the default is UDP but it can be explicitly set via transport = aeron-udp as outlined in the Preparing your ActorSystem for Remoting section. Also seed node addresses must start with "akka://opendaylight-cluster-data...", ie drop the ".tcp". |
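A rough akka.conf sketch of what switching to artery might look like, based on the Akka 2.5 docs linked above - the odl-cluster-data wrapper, hostname and seed address are placeholders, not a tested ODL configuration:
odl-cluster-data {
  akka {
    remote {
      artery {
        enabled = on
        transport = aeron-udp                  # artery's default UDP transport
        canonical.hostname = "10.30.170.115"   # placeholder node address
        canonical.port = 2550
      }
    }
    cluster {
      # with artery the ".tcp" is dropped from the seed node URIs
      seed-nodes = ["akka://opendaylight-cluster-data@10.30.170.115:2550"]
    }
  }
}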
| Comment by Jamo Luhrsen [ 14/Aug/18 ] |
|
I was able to reproduce this with a few more debugs happening. I'm doing a netstat to see which ports are open and I did a tcpdump of the akka traffic. I didn't see any difference in the output of the netstat command from a passing run and a failing run. It looks like this:
tcp 0 0 127.0.0.11:39681 0.0.0.0:* LISTEN -
tcp6 0 0 :::44444 :::* LISTEN 12969/java
tcp6 0 0 :::8101 :::* LISTEN 12969/java
tcp6 0 0 127.0.0.1:1099 :::* LISTEN 12969/java
tcp6 0 0 127.0.0.1:35949 :::* LISTEN 12969/java
tcp6 0 0 :::8181 :::* LISTEN 12969/java
tcp6 0 0 172.28.5.1:2550 :::* LISTEN 12969/java
tcp6 0 0 :::34519 :::* LISTEN 12969/java
udp 0 0 127.0.0.11:43904 0.0.0.0:* -
What's weird though, is that this controller 1 (first seed node) doesn't have port 8185 lit up, in the passing or failing case. The other controllers look like this:
tcp 0 0 127.0.0.11:38895 0.0.0.0:* LISTEN -
tcp6 0 0 127.0.0.1:36513 :::* LISTEN 4201/java
tcp6 0 0 :::8101 :::* LISTEN 4201/java
tcp6 0 0 127.0.0.1:1099 :::* LISTEN 4201/java
tcp6 0 0 :::46097 :::* LISTEN 4201/java
tcp6 0 0 :::8181 :::* LISTEN 4201/java
tcp6 0 0 172.28.5.2:2550 :::* LISTEN 4201/java
tcp6 0 0 :::8185 :::* LISTEN 4201/java
tcp6 0 0 :::44444 :::* LISTEN 4201/java
udp 0 0 127.0.0.11:47322 0.0.0.0:* -
Attached are two packet captures that open fine in wireshark. Filename 104.pcap is from |
| Comment by Tom Pantelis [ 15/Aug/18 ] |
|
The packet capture basically shows what we've seen in the logs, ie node1 is sending out InitJoin messages to 2 & 3 and InitJoinAck messages are coming back. Also Heartbeat messages are incoming from 2 & 3 along with our AppendEntries heartbeats. So node1 is sending out messages but incoming messages are not getting processed for some reason. I'll update the akka ticket. I think the next steps are:
|
| Comment by Jamo Luhrsen [ 15/Aug/18 ] |
|
How can we debug this further to root cause it? You noted in the akka bug what we know. The akka folks on that github bug have suggested we try adding extra debugs in akka itself and trying with artery. Everything you've listed above is a workaround idea. I can try all of those and maybe they will give us some more data points. BTW, I've now run more than 800 iterations of this reproduction script locally with a graceful shutdown and have not hit the problem. |
| Comment by Tom Pantelis [ 15/Aug/18 ] |
|
> lower level debugs we can enable in karaf to try to trace the packets being processed?
It's not karaf - it's akka code.
> The akka folks on that github bug have suggested we try adding extra debugs in akka itself and trying with artery
Their code has to be modified to add more debug - I think they mean they would add it and supply a snapshot. In the end we may have to workaround this or at least doc it as a known anomaly. Even if they find the root cause and are able to fix it, that would likely come in a new akka version that we would have to eventually upgrade to (unless they can patch the version we use). artery is not a workaround - we want to switch to it anyway - hopefully in Neon - we'll have to longer term anyway since the netty-based version will be removed at some point. |
| Comment by Jamo Luhrsen [ 15/Aug/18 ] |
I know it's akka code that's not receiving the initJoinAck, but that doesn't mean it's getting
That's not how I took it. Could you clarify that with them? I don't mind to keep digging here and
artery is a workaround. If the bug is not in artery and we move to artery we are not fixing any bug. We are just avoiding it. And, if the move to artery is similar to our move to tell-based, then who knows when that will happen. We've upgraded akka before, so I'm not understanding why that is something to worry about. I will try with artery at some point.
| Comment by Tom Pantelis [ 15/Aug/18 ] |
|
> Maybe akka is doing all this work itself.
Yes it is - all of the code in question is akka.
> That's not how I took it. Could you clarify that with them? I don't mind to keep digging here and
"*Next step would probably be to add some additional logging to Akka*" - there is no more debug logging to enable - code has to be added to add more detailed logging.
> artery is a workaround. If the bug is not in artery and we move to artery we are not fixing any bug.
yes b/c artery is their next-generation remoting layer. The netty-based one is essentially deprecated and will be removed. So we have to go to it anyway. In fact we're probably lucky they're even helping us at all (for free anyway). |
| Comment by Jamo Luhrsen [ 15/Aug/18 ] |
ok, I just must be confused with the netty we are using in ODL for other things and
well, to be fair, I took what you quoted above as them suggesting we make the
ok, fair enough. Let me try artery next. |
| Comment by Tom Pantelis [ 15/Aug/18 ] |
|
> well, to be fair, I took what you quoted above as them suggesting we make the
yeah - I think you're right. I'll look into that when I have some cycles.... |
| Comment by Jamo Luhrsen [ 23/Aug/18 ] |
|
having a really hard time making artery work, but still am trying. |
| Comment by Jamo Luhrsen [ 12/Sep/18 ] |
|
I neglected to update this, but a week or two ago I was finally able to get artery to work and force the same kill/restart scenario; I could not reproduce the problem with it. So, maybe this means our root cause is going to be somewhere in the netty transport for akka. I don't think there is anyone willing to keep digging on a root cause and fix for this. We can change all of our tests to only do graceful restarts which should hopefully prevent this from happening. |
| Comment by Tom Pantelis [ 13/Sep/18 ] |
|
> I don't think there is anyone willing to keep digging on a root cause and fix for this.
It would be either me or Robert and I don't have cycles (nor desire).
> We can change all of our tests to only do graceful restarts...
Sounds good. |
| Comment by Sam Hague [ 17/Sep/18 ] |
|
I converted the NetVirt CSIT to use graceful start and stop. It looks promising and we don't see the OLFE's and other weird exceptions, but it does bring out other issues. openflowplugin and ovsdb have various bundles shutting down, but not all at once, and the listening ports are still open. The switches continue sending to the ODL shutting down. We need to clean those issues up. That being said, it seems like we are adapting the tests to accommodate bugs. I could see if we kept the test as is, then maybe they shouldn't fail or we should have checks to know that there will be issues during this time and not fail. |
| Comment by Tom Pantelis [ 17/Sep/18 ] |
|
Graceful shutdown should be the norm in production so CSIT should be testing it and, if bundles have issues shutting down gracefully, they should be addressed. "kill -9" should be a last resort as it could leave persistence in a corrupted state. So it sounds like introducing graceful shutdown is uncovering issues, which is a good thing. |
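As a concrete illustration of the difference being discussed (paths assumed from the docker-based local setup earlier in this thread, so treat this as a sketch only):
# graceful shutdown - lets bundles, akka and persistence close down cleanly
sudo docker exec -i odl_172.28.5.1 /odlha/karaf/target/assembly/bin/stop
# hard kill - the "last resort" variant the reproduction script uses
sudo docker exec -i odl_172.28.5.1 sh -c 'ps aux | grep karaf.main | grep -v grep | awk "{print \$2}" | xargs kill -9'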
| Comment by Luis Gomez [ 17/Sep/18 ] |
|
FYI I left 1 test in OFP doing the kill so at least I can track how often the kill fails: |
| Comment by Jamo Luhrsen [ 30/Oct/18 ] |
|
This basic problem of a controller not coming up healthy after restarting still happens. Yes, the last example |
| Comment by Tom Pantelis [ 30/Oct/18 ] |
|
It's an issue in akka and it would not be a trivial effort to learn that large code base (not to mention scala) plus it's an old code base that has essentially been replaced by artery, which is why the akka devs aren't anxious to jump on it (at least not for free). Plus it's an extreme test scenario (killing/restarting the process quickly over and over) that realistically wouldn't be seen in production (which I'm sure is why it's never been reported in the 10 years akka's been around) or if it ever did, you got bigger problems than the node not rejoining the cluster after like the 50th time or whatever |
| Comment by Jamo Luhrsen [ 30/Oct/18 ] |
|
fair enough, but just to be clear, this is not happening because of killing/restarting quickly over and over for 50 times. Yes, that is what the script does, but besides the "kill -9" part, it's pretty gentle. Again, I understand nobody wants or plans to work on it. I have lowered the priority. Maybe tell-based will be better, but that work is yet to be done and the jobs we have for tell-based have failures to figure out. |
| Comment by Tom Pantelis [ 30/Oct/18 ] |
|
It's not related to tell-based - that's our stuff. It's in akka or a lib it uses (eg netty) or possibly at the TCP-level where something is stale due to the process being killed. AFAIK, from you and Luis testing, it does not occur if the process is gracefully shutdown nor does it occur after the first kill/restart (ie it takes multiple iterations if it occurs at all). Also it has not been seen to occur with the new remoting code base, artery, (with UDP at least) so IMO, it makes more sense to focus on switching to artery rather than spending the time to try to learn, debug, possibly fix the old akka code base where it appears Lightbend has little to no interest in continuing to support (at least for free). |
| Comment by Jamo Luhrsen [ 30/Oct/18 ] |
|
yeah, I said "tell-based", but confused that with "artery". my bad. but yeah, it's very clear this bug is not getting fixed and that there is no interest from anyone to work on it. |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
tpantelis I noticed https://github.com/akka/akka/issues/25632 in 2.5.17 (which we still cannot have, as things are in-flux and we'll hopefully get 2.5.19). Is that something that can help? |
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
That is related to quarantining but that doesn't happen in this scenario (that (usually) happens with node isolation) plus, even if the killed node was somehow quarantined, restarting clears the quarantined state. No split brain occurs here - it's a simple node kill/restart. |
| Comment by Jamo Luhrsen [ 13/Dec/18 ] |
|
moved this to Low priority and closing, as we aren't seeing failures any more because we moved to graceful restarts in CSIT. |