[CONTROLLER-1849] controller not coming up healthy after being killed and restarted (401 after 5m) Created: 03/Jul/18  Updated: 13/Dec/18  Resolved: 13/Dec/18

Status: Resolved
Project: controller
Component/s: None
Affects Version/s: Oxygen
Fix Version/s: None

Type: Bug Priority: Lowest
Reporter: Jamo Luhrsen Assignee: Tom Pantelis
Resolution: Won't Do Votes: 1
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 104.pcap     File 105.1.log     File 105.pcap     File karaf.1.log     File karaf.2.log     File karaf.3.log     File karaf.log.since_last_restart.xz    
Issue Links:
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress
relates to CONTROLLER-1790 Leader can currently not perform its ... Resolved

 Description   

Here is an example of this problem from the robot logs.

The karaf log is pretty messy from the point when it was restarted until the end. Search for the timestamp "Jun 19, 2018 1:31:27" to find when the restart was initiated.

The controller was initially killed with 'kill -9'.

The controller logs and robot logs can be found here.



 Comments   
Comment by Jamo Luhrsen [ 03/Jul/18 ]

another example

Comment by Jamo Luhrsen [ 03/Jul/18 ]

and another

Comment by Tom Pantelis [ 04/Jul/18 ]

After the restart, here's a grep of all akka output:

2018-06-19T13:31:55,180 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Slf4jLogger                      | 48 - com.typesafe.akka.slf4j - 2.5.11 | Slf4jLogger started

2018-06-19T13:31:55,254 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Remoting                         | 48 - com.typesafe.akka.slf4j - 2.5.11 | Starting remoting

2018-06-19T13:31:55,437 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Remoting                         | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550]

2018-06-19T13:31:55,466 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Starting up...

2018-06-19T13:31:55,529 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Registered cluster JMX MBean [akka:type=Cluster]

2018-06-19T13:31:55,530 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-15 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Started up successfully

2018-06-19T13:32:07,632 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles [member-1, dc-default]

2018-06-19T13:32:07,648 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up]

2018-06-19T13:36:40,475 | INFO  | SystemReadyService-0 | TestBundleDiag                   | 238 - org.opendaylight.infrautils.ready-impl - 1.3.3.SNAPSHOT | OK com.typesafe.akka.slf4j:2.5.11: OSGi state = Active, Karaf bundleState = Active, due to: Declarative Services

 

What that tells me is that it wasn't able to connect to another node, so it made itself leader and joined itself, thereby forming a cluster island. This is member-1, which I assume is the first seed node listed, so it has special behavior that allows it to do that (the other seed nodes cannot). This is governed by the seed-node-timeout setting in akka.conf, which we set to 12s. So from the log, between 13:31:55 and 13:32:07 (12s), it tried to connect to either of the other 2 nodes but was unable to do so, so it concluded no other node was up and formed its own cluster. The shards then don't hear from a leader, so they go to Candidate and keep starting elections over and over trying to become leader.

So either there was an issue in the CSIT environment that prevented connectivity, or an issue with akka that prevented it from connecting to either of the other 2 nodes (or maybe a combination of both). Either way, I can't tell anything more from this log. Do we still have the logs from the other 2 nodes? Perhaps there might be akka messages in one or both of those logs indicating connection attempts from member-1.
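
If we do still have them, a rough grep along these lines (the log paths are placeholders; the 10.30.170.115 address is from this run) should surface any such messages:

# hypothetical sketch: scan the odl2/odl3 karaf logs for akka cluster messages involving member-1 (10.30.170.115)
for log in odl2/karaf.log odl3/karaf.log; do
  echo "== $log =="
  grep -E "InitJoin|Unreachable|REACHABLE" "$log" | grep "10.30.170.115"
done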

I assume this is Oxygen? Can you set the affected version?

Comment by Tom Pantelis [ 04/Jul/18 ]

Looking at odl2's log, I see a bunch of these repeated:

2018-06-19T13:31:55,865 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]

2018-06-19T13:31:55,865 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]]

 

Followed by a bunch of these:

2018-06-19T13:32:18,275 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-24 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.115:2550: Unreachable [Unreachable] (3), akka.tcp://opendaylight-cluster-data@10.30.170.158:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.115:2550: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550 Up seen=false, akka.tcp://opendaylight-cluster-data@10.30.170.123:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.158:2550 Up seen=true]

 

And we see similar in odl3's log:

2018-06-19T13:31:55,862 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550]

2018-06-19T13:31:55,862 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.158:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#915987295]]

 

So join messages are getting through to the other nodes, but there's no indication why odl1 (10.30.170.115) was not joined back in. Another akka mystery. I can try opening a bug with akka and see if we get any support.

Comment by Tom Pantelis [ 05/Jul/18 ]

So odl1 was restarted several times over the course of an hour. The restart prior to the failed one at 06-19T13:31:27 was at 06-19T13:30:06. On odl2, we see:

2018-06-19T13:30:32,044 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#383916203]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]

2018-06-19T13:30:32,045 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#383916203]]

2018-06-19T13:30:32,146 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.115:2550, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

2018-06-19T13:30:32,147 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] as [Down]

2018-06-19T13:30:33,282 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550]

2018-06-19T13:30:42,777 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#-1096982950]] to [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]

2018-06-19T13:30:42,777 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.115:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#-1096982950]]

2018-06-19T13:30:42,825 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles [member-1, dc-default]

2018-06-19T13:30:44,278 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-25 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up]

On that restart, odl1 was let back in as expected. About 30 sec later it was killed again:

2018-06-19T13:31:15,640 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | NettyTransport                   | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remote connection to [/10.30.170.115:50464] failed with java.io.IOException: Connection reset by peer

2018-06-19T13:31:15,642 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | ReliableDeliverySupervisor       | 48 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 

2018-06-19T13:31:19,275 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-9 | ClusterCoreDaemon                | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.115:2550, status = Up)]. Node roles [member-2, dc-default]

So odl1 was killed and restarted twice in < 2 minutes. The second time it wasn't allowed back into the cluster within the 12s seed node timeout period. I wonder if there's some logic or edge case in akka which caused it to delay the rejoin due to the quick kills and restarts. Maybe a longer seed-node-timeout setting like 25s would help. I can try to reproduce manually by doing quick kills/restarts.

Comment by Tom Pantelis [ 05/Jul/18 ]

So looking at odl1 during the prior successful restart at 06-19T13:30:06, we see:

2018-06-19T13:30:42,821 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.123:2550/system/cluster/core/daemon#1793055218]] to [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550]

2018-06-19T13:30:42,854 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Welcome from [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]

So akka does print out the InitJoinAck response from another node. We don't see this during the 06-19T13:31:27 restart. I wonder if it's possible there was only one-way network communication between 10.30.170.115 and the other nodes during this time - the behavior seems to indicate that...

Comment by Jamo Luhrsen [ 05/Jul/18 ]

I am assuming (based on newer comments) that you found the other logs
you wanted.

Yes, Oxygen (FYI, you can see which version in the URL of the CSIT job). I have updated the affected version (sorry for forgetting that).

Comment by Tom Pantelis [ 05/Jul/18 ]

yeah, I noticed your link afterwards. I figured it was Oxygen since I saw config-system output, but I purged that stuff from Fluorine.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

I'm trying to better understand how the suite is working (it's new to me as
well), but here's the high level robot code:

    Kill Switchs Old Owner    ${switch_name}
    Restart Switchs Old Owner    ${switch_name}
    Kill Switchs Successor    ${switch_name}
    Restart Switchs Successor    ${switch_name}

The kill keywords first figure out the "owner" or "successor" before killing the controller. So, in this specific case, I think the first keyword found odl1 to be the owner and killed it. odl2 or odl3 would then become the owner. Then odl1 is restarted and maybe becomes the successor for that switch. If that's the case, the step to kill the successor will again take down odl1 and restart it.

I'm hoping that broken, one-way infra communication is so unlikely that we can ignore it for now, but I have no way to prove that.

I suppose one effort we can take is to write some automation to quickly take a controller down/up and check for whatever the smoking gun is in this case. tpantelis, is there some string or set of strings we can poll on in the karaf.log to know if we hit this "controller on an island" scenario? I can work on something to run repeatedly looking for this problem.

Comment by Tom Pantelis [ 05/Jul/18 ]

yeah, the broken infra one-way communication theory seems unlikely. But the behavior seen in the logs is consistent with there being broken one-way communication somewhere, so that's why I mentioned it. The logs show odl1 sent InitJoin to both odl2 and odl3, and both show they replied with InitJoinAck. But we don't see the InitJoinAck received message in odl1 as is seen in a successful rejoin scenario. We also don't see odl2 or odl3 marking odl1 as REACHABLE even though both received multiple messages from odl1. Based on my experience with akka and reading their docs, there shouldn't be any reason odl2 or odl3 wouldn't allow odl1 back in provided there's full 2-way communication. Unless there's some edge-case bug in akka where it still has stale state for odl1... akka has strict rules wrt convergence where all surviving nodes have to agree to remove the old node incarnation and let the new node incarnation join.

I'm going to try locally to keep killing/restarting the first seed node - hopefully tomorrow.

As far as what to look for.... you don't want to see odl1 reporting this after the restart:

2018-06-19T13:32:07,632 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] is JOINING, roles [member-1, dc-default]

2018-06-19T13:32:07,648 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-14 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] to [Up]

That means it joined itself and "Up"ed itself. You want to see

2018-06-19T13:30:42,854 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.115:2550] - Welcome from [akka.tcp://opendaylight-cluster-data@10.30.170.123:2550]

meaning one of the other 2 nodes, 10.30.170.123 in this case, let it back into the surviving cluster.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

gotcha, so if I just write some script to down/up a single node and check
for "$ip.Leader is moving node.$ip.* to [Up]" and exit if found, I'll catch
this island case. I can also poll for that "Welcome" message to know
when to keep trying.
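
A minimal sketch of that check could look like the following (the member IP and karaf.log path are assumptions for my local setup; it also assumes the log was cleared before the restart so old matches don't count):

#!/bin/bash
# rough sketch: after restarting the 1st seed node, watch its karaf.log for either
# the "island" self-join or the healthy "Welcome from" message
IP=172.28.5.1
LOG=/odlha/karaf/target/assembly/data/log/karaf.log

while true; do
  if grep -q "Leader is moving node \[akka.tcp://opendaylight-cluster-data@$IP:2550\] to \[Up\]" "$LOG"; then
      echo "island case: $IP joined and Up'ed itself"; exit 1
  fi
  if grep -q "Cluster Node \[akka.tcp://opendaylight-cluster-data@$IP:2550\] - Welcome from" "$LOG"; then
      echo "$IP was welcomed back into the cluster"; exit 0
  fi
  sleep 2
done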

Comment by Tom Pantelis [ 05/Jul/18 ]

It has to be a 3-node setup where you down/up the first seed node listed in akka.conf. As I mentioned earlier, the first seed node has special significance in that it bootstraps the cluster, so it must be running in order to initially form a cluster, and it is the only node that can join itself. All subsequent seed nodes listed must be able to contact another seed and join with it. So you wouldn't see this two-cluster island issue if a non-first seed node (odl2 or odl3) were restarted.

Comment by Jamo Luhrsen [ 05/Jul/18 ]

ok, makes sense. Can you tell me how to figure out the first seed node? I assume there is something in the logs.

Comment by Tom Pantelis [ 05/Jul/18 ]

Open configuration/initial/akka.conf - look for "seed-nodes = ..." - the first one in that list.
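
For example, something like this from the karaf directory (assuming the list is on one line) prints it:

# rough sketch: print the first entry in the seed-nodes list
grep seed-nodes configuration/initial/akka.conf | grep -o 'akka\.tcp://[^",]*' | head -1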

Comment by Tom Pantelis [ 08/Jul/18 ]

I think it would be good to determine if this issue also occurs if the node is gracefully shut down. I read somewhere that forcefully killing a process does not cleanly shut down the ports and thus a port may stay blocked for a while depending on the operating system. I'm not sure how such a state would manifest in akka however - we didn't see any error/failure reported by akka.

Comment by Jamo Luhrsen [ 09/Jul/18 ]

this is obviously a good idea and will be a good data point, but I can't really do this at this time. This
problem is only seen infrequently in CSIT and I haven't reproduced it locally. So, without a solid
way to hit it, adding the graceful shutdown to CSIT won't tell us much.

I am currently trying to reproduce locally.

Comment by Jamo Luhrsen [ 09/Jul/18 ]

here's a simple bash script I'm using to try to locally reproduce. I'm using the netvirt-ha-docker tooling to make a local cluster.

#!/bin/bash

while true
do
  # find karaf's PID inside the container and kill -9 it
  sudo docker exec -i odl_172.28.5.1 ps aux |
      grep karaf.main | grep -v grep |
      awk '{print "kill -9",$2}' |
      sudo docker exec -i odl_172.28.5.1 sh

  # pick a random value between 1-90 seconds for when we start karaf again
  sleepy_time=$((RANDOM % 90 + 1))
  echo "Waiting $sleepy_time seconds after killing karaf, before starting"
  sleep $sleepy_time
  sudo docker exec -i odl_172.28.5.1 /odlha/karaf/target/assembly/bin/start

  tries=0
  until [ $tries -ge 60 ]
  do
      ((tries++))
      echo "check $tries for a 200 code"
      RESP=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -u "admin:admin" http://172.28.5.1:8181/jolokia/read/org.opendaylight.controller:Category\=ShardManager,name\=shard-manager-config,type\=DistributedConfigDatastore)
      CODE=$(echo "$RESP" | tr -d '\n' | sed -e 's/.*HTTPSTATUS://')

      if [ "$CODE" -eq "200" ]; then
          break
      fi
      sleep 2
  done

  # we'll have looped for 120s (60 tries) above if 200 never came, so maybe things are broken
  if [ $tries -eq 60 ]; then
      echo "Might have caught the bugger"; exit 1
  fi
done

Comment by Jamo Luhrsen [ 09/Jul/18 ]

I basically tried those steps manually all morning (maybe 20-30 times), but once I scripted it and stopped watching, I hit the bug on the 13th iteration.

I do see this (smoking gun, I think):

2018-07-09T23:11:59,187 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-59 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]

tpantelis, let me try to reproduce this a few more times so we can get a sense of how frequently I can hit it.

Assuming I can hit it with regularity, I'll try your idea of doing a "stop" instead of a "kill -9".

In the meantime, is there anything else to try or look for while doing this that would help get to the root cause?

BTW, this is all done on a single laptop, so I hope that can rule out infra networking issues.

Comment by Tom Pantelis [ 09/Jul/18 ]

One other thing I can think of is to bump the seed-node-timeout from 12s to, say, 25s. It's specified in configuration/factory/akka.conf, but that file is overwritten on feature install. You'd want to override it in configuration/initial/akka.conf, which is preserved.
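
For example, merging something along these lines into configuration/initial/akka.conf (exact nesting depends on how that file is laid out on the node) should override the factory value:

akka {
  cluster {
    # example only - bump the rejoin deadline from the factory 12s
    seed-node-timeout = 25s
  }
}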

I think running 3 instances on the same box (separate docker containers, I assume) should rule out networking issues. Given that, it's likely some edge-case issue in akka. As mentioned before, it may have to do with killing/restarting in relatively quick succession. Let's try to characterize it as much as we can, then we can open an issue with akka and see what happens from there.

Comment by Jamo Luhrsen [ 10/Jul/18 ]

I posted my script to gerrit since I kept tweaking
it and wanted to make it available for others if they were so inclined to try this out locally as well.

Over the past 24 hours I was able to reproduce this 6 times. The reproductions took 71, 13, 20, 25, 205, and 30 iterations, respectively. This was using the default seed-node-timeout of 12s.

Currently, I'm running the script with a seed-node-timeout of 30s. I will report back.

Comment by Luis Gomez [ 17/Jul/18 ]

Jamo, did you finally try with seed-node-timeout of 30s?
I am asking because I saw the issue yesterday:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-only-fluorine/147/robot-plugin/log.html.gz

Comment by Jamo Luhrsen [ 17/Jul/18 ]

I was reproducing these kinds of issues locally, and for a week I was able to hit this kind of issue where killing/restarting a node was giving me a 401 unauthorized, after automating it and letting it try over and over. Another problem that might be related is that the cluster syncstatus gets stuck at 'false'; that is tracked in CONTROLLER-1768. The CONTROLLER-1768 bug was showing up when I moved seed-node-timeout to 30s, but I was no longer able to locally reproduce this bug (CONTROLLER-1849), even after moving back to 12s.

During our kernel call today, one idea was that we can do away with dynamicAuthorization
altogether. tpantelis mentioned he could push a patch for that, but it looked simple
enough so I already made one.

https://git.opendaylight.org/gerrit/#/c/74157/

Comment by Tom Pantelis [ 17/Jul/18 ]

Looking at the logs from Luis' test, we see similar behavior where odl1 didn't rejoin the cluster and joined itself after 22s. It did get an InitJoinAck response from odl2.

odl1:

2018-07-16T11:18:20,251 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.26:2550/system/cluster/core/daemon#-1596937295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]

2018-07-16T11:18:42,585 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] is JOINING, roles [member-1, dc-default]

2018-07-16T11:18:42,614 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] to [Up]

odl2:

 
2018-07-16T11:18:14,848 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]

2018-07-16T11:18:20,105 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:20,106 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]]

2018-07-16T11:18:20,291 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.94:2550, status = Up)] is trying to join. *Existing will be removed from the cluster and then new member will be allowed to join.*

2018-07-16T11:18:20,292 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - *Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] as [Down]*

2018-07-16T11:18:21,226 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - *Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]*

2018-07-16T11:18:30,564 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:30,565 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]]

several more InitJoin/InitJoinAck sequences...

Similar InitJoin/InitJoinAck sequences on odl3.

In this sequence, odl2 actually saw that odl1 was restarted with a new incarnation, marked it as Down, and stated it was removing the previous unreachable incarnation. It is then supposed to allow the new incarnation in but never did, which led odl1 to join itself after the 12s seed-node-timeout. Strange. I would say this definitely is an issue with akka.

Comment by Tom Pantelis [ 17/Jul/18 ]

I opened https://github.com/akka/akka/issues/25361

Comment by Luis Gomez [ 19/Jul/18 ]

When you run the OF cluster test many times, like I did in the sandbox this week, you can see that the issue of a controller not joining the cluster after kill+start also occurs on instances other than 1. For example, see this run in the sandbox where instance 3 is killed and started and does not rejoin:

https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/openflowplugin-csit-3node-clustering-only-fluorine/50/robot-plugin/log.html.gz

I think the root cause for this may be the same as with instance 1, so do we want to open a new bug or just track it here?

Comment by Tom Pantelis [ 19/Jul/18 ]

It's not exactly the same scenario since only the first seed node can join itself - the others cannot and will try forever to re-join. But I would say the fact that it doesn't rejoin is the same issue. So I think we can track it here.

Comment by Luis Gomez [ 19/Jul/18 ]

Also, from my testing, this is the most frequent issue the OF cluster suite has today. It is also very apparent because during the suite we kill and start instances a few times.

Comment by Tom Pantelis [ 19/Jul/18 ]

Have you tried without killing it, i.e. shutting it down gracefully? It would be interesting to see if it happens in that case.

Comment by Luis Gomez [ 19/Jul/18 ]

Yes, that was my point: at the root there is the issue of an instance not joining, regardless of what happens later. I really thought this problem was resolved in the recent akka upgrade, but in the end it just happens less often than before.

Comment by Tom Pantelis [ 19/Jul/18 ]

right - failure to rejoin is the root issue.

Comment by Luis Gomez [ 19/Jul/18 ]

Yes, I think shutdown works, or at least fails less often. Let me change most kills to shutdowns and see what happens.

Comment by Tom Pantelis [ 19/Jul/18 ]

I'd like to see if it doesn't fail at all when shut down gracefully. If that's the case, then that gives us another data point to further characterize the issue, and we can report that on the akka issue.

Comment by Luis Gomez [ 21/Jul/18 ]

After 1 day of running I only see failures in the kill+start scenario, which means something about the kill adds to the problem.

Comment by Tom Pantelis [ 22/Jul/18 ]

I got a response from akka folks - they've asked us to enable debug.

In configuration/initial/akka.conf, merge in these settings:

akka {
  loglevel = DEBUG 
  remote { 
     ...
     log-received-messages = on 
     log-sent-messages = on
  }
 ...
}

In etc/org.ops4j.pax.logging.cfg, add:

log4j2.logger.cluster.name=akka.cluster
log4j2.logger.cluster.level=DEBUG
log4j2.logger.remote.name=akka.remote
log4j2.logger.remote.level=DEBUG

This will produce a lot of output, so I'd suggest paring down the tests to only what's needed to repro, e.g. don't re-run the tests with tell-based and don't run the script that creates 10k cars. Also, when restarting the node after the kill, delete the previous log so it starts with a clean one.
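
In Jamo's docker setup, the kill/restart step could then look roughly like this (container name and karaf path are from the script above; the log location is an assumption):

# kill karaf, clear the old log so the next run starts clean, then restart
sudo docker exec -i odl_172.28.5.1 pkill -9 -f karaf.main
sudo docker exec -i odl_172.28.5.1 rm -f /odlha/karaf/target/assembly/data/log/karaf.log
sudo docker exec -i odl_172.28.5.1 /odlha/karaf/target/assembly/bin/start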

Comment by Luis Gomez [ 23/Jul/18 ]

I started a small suite with all debugs on in the sandbox, and so far no luck reproducing the problem.

Comment by Tom Pantelis [ 23/Jul/18 ]

Jamo had a test script running in the sandbox that was reproducing it - perhaps get that going again. The akka folks have also suggested we try artery to see if it reproduces - we can do that once we have a test/script that reproduces it.

Comment by Jamo Luhrsen [ 23/Jul/18 ]

At this point, I think I'm still able to reproduce this, but the high-level symptom is now a 404 on /restconf or /jolokia.

I did not try with the debugs enabled yet. I wanted to double check my automation could hit it.

I can see this in the karaf.log, which I have been assuming is our smoking gun (1st seed node is on an island):

2018-07-23T20:04:13,072 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-51 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]

I didn't run this with the debugs, but something else is bothersome: it seems like a lot of blueprint errors are showing up. Could this be some other kind of bug that just ends up looking the same in the end? I attached the karaf.log from the 1st seed node (I trimmed out everything except the logs after the final restart, before it hit the 404): karaf.log.since_last_restart.xz

Comment by Tom Pantelis [ 23/Jul/18 ]

If shard leaders don't converge, then there will be blueprint timeouts in other apps because the DS isn't ready. Most apps depend on the DS. Even if an app does start up, there will be timeout failures when it tries to access the DS.

Comment by Jamo Luhrsen [ 23/Jul/18 ]

ok, so to be clear, we think this is the exact same bug, right?

what about that 404 then? I thought with dynamic auth disabled... oh wait, that was for the 401. ok, so the 404 is probably expected now, right? When we hit the bug, I mean.

Comment by Tom Pantelis [ 23/Jul/18 ]

404 is "not found". I assume that's coming from jetty so AAA and restconf haven't started b/c they depend on the DS. So it all comes back to the root cause, ie akka not rejoining.

At this point we need a reproduction with the akka debug enabled so we can send that off to them. A minimal repro would be best, ie bring up the 3 nodes, verify cluster is OK, then kill/restart the first seed node over and over. So don't create any cars/people etc. Also delete the karaf log before each restart until if fails.

Comment by Jamo Luhrsen [ 23/Jul/18 ]

404 is "not found". I assume that's coming from jetty so AAA and restconf haven't started b/c they depend on the DS. So it all comes back to the root cause, ie akka not rejoining.

got it. that's what I was hoping.

> At this point we need a reproduction with the akka debug enabled so we can send that off to them. A minimal repro would be best, i.e. bring up the 3 nodes, verify the cluster is OK, then kill/restart the first seed node over and over. So don't create any cars/people etc. Also delete the karaf log before each restart until it fails.

that's all I am doing. it's just a slightly improved version of this

My automation is not going to delete the karaf.log, but I assume I can manually truncate the logs to make them easier to work with. It's really just a matter of looking for some known entry written on a restart and finding the last one. I think it's as simple as finding the last line with the string "SimpleFileLock lock" to know when the last restart was done.
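
Something like this rough snippet (karaf.log path assumed) would do that trim:

# keep only the portion of the log from the last restart marker onward
last=$(grep -n "SimpleFileLock lock" karaf.log | tail -1 | cut -d: -f1)
tail -n +"$last" karaf.log > karaf.log.since_last_restart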

Comment by Tom Pantelis [ 23/Jul/18 ]

yeah. It would make it a little easier to delete the log if possible but it's OK.

Comment by Luis Gomez [ 24/Jul/18 ]

Let's keep this bug open for the issue we see in OFP where an instance fails to join the cluster after it is killed and started, and then shows auth error 401 when jolokia/REST is invoked on it:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-only-oxygen/336/robot-plugin/log.html.gz#s1-s2

Comment by Jamo Luhrsen [ 24/Jul/18 ]

tpantelis, what do you want here? ecelgp's link above is to a node (NOT the 1st seed node) that was bounced
and came back up giving a 401.

We don't get that 401 any more, now that we got rid of dynamic auth, for the case where we get an island leader on the first seed node. At least I can't hit it with my local reproduction.

New bug, or just keep this one to track all of these very similar problems?

Comment by Luis Gomez [ 24/Jul/18 ]

This is seen in the starting instance (member-2):

2018-07-24T11:31:56,731 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | JoinSeedNodeProcess              | 48 - com.typesafe.akka.slf4j - 2.5.11 | Couldn't join seed nodes after [16] attempts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.30.171.71:2550, akka.tcp://opendaylight-cluster-data@10.30.171.61:2550]
Comment by Tom Pantelis [ 24/Jul/18 ]

Looks like that was in Oxygen - I'm not sure of the state there; I've been focusing on Fluorine. That looks like the same issue except it wasn't the first seed node. Again, the root cause is a node not rejoining, first seed node or not. I think we see it much more frequently with the first seed node because it can join itself with a deadline of only 12s - the others will just keep trying to join the existing cluster indefinitely, and I think CSIT only waits about 5 min for that.

Comment by Jamo Luhrsen [ 24/Jul/18 ]

I've only been using Oxygen in my local reproduction of this. I thought we could not even hit the island node issue if it wasn't the first seed node?

Comment by Tom Pantelis [ 24/Jul/18 ]

It won't hit the island node issue but... see my previous note. To our code it doesn't matter - if the node doesn't join, we can't converge shard leaders.

Comment by Jamo Luhrsen [ 24/Jul/18 ]

ok, so what was the big deal about only doing this with the first seed node?
and how do we explain the 401 now?

Comment by Tom Pantelis [ 24/Jul/18 ]

Because up until now we've only seen it for the first seed node and, as I explained above, it is much more likely to happen there.

I don't know why there would still be a 401.

Comment by Jamo Luhrsen [ 24/Jul/18 ]

I think/thought I had this reproduced locally with the akka debugs enabled, but I'm having trouble making sense of my logs and timestamps.

I do see this output repeated 31 times at the end of my karaf log:

Jul 24, 2018 12:52:34 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 12:52:34 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired
Jul 24, 2018 12:52:34 AM org.apache.karaf.main.Main$KarafLockCallback lockAquired
INFO: Lock acquired. Setting startlevel to 100
Jul 24, 2018 12:54:29 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Trying to lock /odlha/karaf/target/assembly/lock
Jul 24, 2018 12:54:29 AM org.apache.karaf.main.lock.SimpleFileLock lock
INFO: Lock acquired

which reminds me of CONTROLLER-1845

But the timestamps of those messages are off compared to the message right before they started:

2018-07-24T00:51:10,559 | INFO  | Blueprint Event Dispatcher: 1 | BlueprintBundleTracker           | 199 - org.opendaylight.controller.blueprint - 0.8.3.SNAPSHOT | Blueprint container for bundle org.opendaylight.netvirt.qosservice-impl_0.6.3.SNAPSHOT [359] was successfully created

I think the karaf-generated messages are using a different method of creating timestamps.

Also, my other two controllers, which I stopped at the same time (all three got a final kill -9 at the same time), have the same final timestamp in their logs (2018-07-24T05:24:32), but I don't see that timestamp in my first node (the one I'm bouncing).

very confused.

I'll try again, with some additional steps in my bouncer scripts to keep track of timestamps better.

Comment by Luis Gomez [ 24/Jul/18 ]

Coming back to: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-only-oxygen/336/robot-plugin/log.html.gz#s1-s2

I see the restarting member-2 fails to join:

2018-07-24T11:28:43,869 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Slf4jLogger                      | 48 - com.typesafe.akka.slf4j - 2.5.11 | Slf4jLogger started
2018-07-24T11:28:43,912 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Remoting                         | 48 - com.typesafe.akka.slf4j - 2.5.11 | Starting remoting
2018-07-24T11:28:44,143 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Remoting                         | 48 - com.typesafe.akka.slf4j - 2.5.11 | Remoting started; listening on addresses :[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550]
2018-07-24T11:28:44,168 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Starting up...
2018-07-24T11:28:44,268 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Registered cluster JMX MBean [akka:type=Cluster]
2018-07-24T11:28:44,269 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.67:2550] - Started up successfully

2018-07-24T11:29:08,440 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-23 | JoinSeedNodeProcess              | 48 - com.typesafe.akka.slf4j - 2.5.11 | Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka.tcp://opendaylight-cluster-data@10.30.171.71:2550, akka.tcp://opendaylight-cluster-data@10.30.171.61:2550]

while the other 2 members receive and respond to the join requests:

2018-07-24T11:28:44,623 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] to [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550]
2018-07-24T11:28:44,624 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]]

2018-07-24T11:28:56,431 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]] to [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550]
2018-07-24T11:28:56,434 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | Cluster(akka://opendaylight-cluster-data) | 48 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.171.71:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.171.67:2550/system/cluster/core/daemon/joinSeedNodeProcess-1#1711052280]]

This really seems like an issue of the restarting member not getting or not processing the InitJoinAck from the other 2.

Comment by Tom Pantelis [ 25/Jul/18 ]

yup - it's the same pattern I've seen in numerous logs.

Comment by Tom Pantelis [ 25/Jul/18 ]

I think it will be easier to make sense out of it if it restarts with a clean log.

Comment by Tom Pantelis [ 25/Jul/18 ]

So the 401's occur in Oxygen because AAA shiro doesn't initialize due to the DS not being ready:

2018-07-24T11:33:26,723 | ERROR | paxweb-extender-1-thread-1 | EnvironmentLoader                | 143 - org.apache.shiro.web - 1.3.2 | Shiro environment initialization failed
java.lang.RuntimeException: Error obtaining AAAShiroProvider
	at org.opendaylight.aaa.shiro.web.env.KarafIniWebEnvironment.init(KarafIniWebEnvironment.java:76) ~[189:org.opendaylight.aaa.shiro:0.7.3]

Without shiro, we can't authenticate. This code was refactored in Fluorine such that I believe we no longer get 401's but 404's.

Again, either way, it's all due to shards not converging because akka does not rejoin.

Comment by Jamo Luhrsen [ 25/Jul/18 ]

Quick note that I let myself get sidetracked by CONTROLLER-1845 since it bit me pretty irritatingly when trying to repro this one with the extra akka debugs. When 1845 hits, it makes the repro efforts for this one worthless.

Comment by Jamo Luhrsen [ 27/Jul/18 ]

Here are the logs, with these extra debugs, from a recent reproduction in my local setup: karaf.1.log karaf.2.log karaf.3.log

karaf.1.log is from the 1st seed node, the node I am killing (kill -9) and starting repeatedly until we hit this bug.

I've trimmed karaf.1.log down to only include the logs since its final restart. You can find this log entry (which we think is our smoking gun) in it:

2018-07-27T10:25:04,024 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-55 | Cluster(akka://opendaylight-cluster-data) | 90 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@172.28.5.1:2550] to [Up]

You can see 172.28.5.1 moving itself, as leader, to [Up].

The other two karaf.log files are from the other controllers; I have trimmed them to only include logs from the timestamp of the 1st seed node's last startup onward.

Comment by Tom Pantelis [ 27/Jul/18 ]

nice - thanks

Comment by Tom Pantelis [ 31/Jul/18 ]

I attached the logs to the akka issue. The bottom line is that node1 was able to send messages (InitJoin) to nodes 2 and 3, but no messages from nodes 2 and 3 were received by node1 (InitJoinAck, Heartbeat). We'll see what they think, but I suspect the next step might be to run wireshark or something to see what's actually getting on the wire. I think the issue is on node1's side - otherwise it would mean nodes 2 and 3 both experienced the same issue simultaneously, which is much less likely. Plus nodes 2 and 3 were communicating back and forth fine.

Comment by Sam Hague [ 31/Jul/18 ]

jluhrsen, look at https://git.opendaylight.org/gerrit/#/c/74585/ for how we added the tcpdump for port 6653 OFP traffic on devstack nodes. That function can be used to start one for akka traffic. We might need a variant, https://git.opendaylight.org/gerrit/#/c/74586/, because the tests stop the tcpdump based on a grep for tcpdump in ps, so at the end of each suite it will kill all running tcpdumps. That check should be more selective if we need to keep the akka tcpdump running for the whole job duration. If the job is a single suite then it is not a problem.
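
For a manual run, the capture itself is just a tcpdump on the akka remoting port, roughly (interface name is an assumption):

# capture akka remoting traffic to a pcap that can be opened in wireshark
sudo tcpdump -i eth0 -w akka-2550.pcap port 2550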

Comment by Tom Pantelis [ 01/Aug/18 ]

Here are the docs for Akka artery: https://doc.akka.io/docs/akka/2.5.11/remoting-artery.html. They have added TCP support since I last looked at it, but another article I read suggests it isn't quite ready for production. I would think the default is UDP, but it can be explicitly set via transport = aeron-udp as outlined in the "Preparing your ActorSystem for Remoting" section. Also, seed node addresses must start with "akka://opendaylight-cluster-data...", i.e. drop the ".tcp".
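
As a rough, unverified sketch, the akka.conf changes for artery would look something like this (addresses are examples only):

akka {
  remote {
    artery {
      enabled = on
      transport = aeron-udp                   # explicit, per the docs
      canonical.hostname = "10.30.170.115"    # this node's address
      canonical.port = 2550
    }
  }
  cluster {
    # note the "akka://" scheme - no ".tcp"
    seed-nodes = ["akka://opendaylight-cluster-data@10.30.170.115:2550"]
  }
}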

Comment by Jamo Luhrsen [ 14/Aug/18 ]

I was able to reproduce this with a few more debugs in place. I'm doing a netstat to see which ports are open, and I did a tcpdump of all traffic on port 2550 for the node that comes up and joins itself as leader (island).

I didn't see any difference in the output of the netstat command between a passing run and a failing run. It looks like this:

tcp        0      0 127.0.0.11:39681        0.0.0.0:*               LISTEN      -
tcp6       0      0 :::44444                :::*                    LISTEN      12969/java
tcp6       0      0 :::8101                 :::*                    LISTEN      12969/java
tcp6       0      0 127.0.0.1:1099          :::*                    LISTEN      12969/java
tcp6       0      0 127.0.0.1:35949         :::*                    LISTEN      12969/java
tcp6       0      0 :::8181                 :::*                    LISTEN      12969/java
tcp6       0      0 172.28.5.1:2550         :::*                    LISTEN      12969/java
tcp6       0      0 :::34519                :::*                    LISTEN      12969/java
udp        0      0 127.0.0.11:43904        0.0.0.0:*                           -

What's weird, though, is that controller 1 (the first seed node) doesn't have port 8185 listening, in either the passing or the failing case. I haven't figured that out yet, and it's probably just noise in all of this, but I wanted to point it out. So, for example, the netstat on controller 2 looks like this:

tcp        0      0 127.0.0.11:38895        0.0.0.0:*               LISTEN      -
tcp6       0      0 127.0.0.1:36513         :::*                    LISTEN      4201/java
tcp6       0      0 :::8101                 :::*                    LISTEN      4201/java
tcp6       0      0 127.0.0.1:1099          :::*                    LISTEN      4201/java
tcp6       0      0 :::46097                :::*                    LISTEN      4201/java
tcp6       0      0 :::8181                 :::*                    LISTEN      4201/java
tcp6       0      0 172.28.5.2:2550         :::*                    LISTEN      4201/java
tcp6       0      0 :::8185                 :::*                    LISTEN      4201/java
tcp6       0      0 :::44444                :::*                    LISTEN      4201/java
udp        0      0 127.0.0.11:47322        0.0.0.0:*                           -

Attached are two packet captures that open fine in wireshark. 104.pcap is from a successful run where the controller was killed and restarted and successfully rejoined the cluster. 105.pcap is from the failure case. Also attached is 105.1.log, which is the karaf log from controller 1 in the failure case. I don't have karaf logs from the other controllers in the failure case, nor any from the passing case(s).

104.pcap 105.pcap 105.1.log

Comment by Tom Pantelis [ 15/Aug/18 ]

The packet capture basically shows what we've seen in the logs, i.e. node1 is sending out InitJoin messages to 2 & 3 and InitJoinAck messages are coming back. Also, Heartbeat messages are incoming from 2 & 3, along with our AppendEntries heartbeats. So node1 is sending out messages, but incoming messages are not getting processed for some reason. I'll update the akka ticket.

I think the next steps are:

  • try it with graceful shutdown and see if it repros
  • try akka.remote.use-passive-connections = off as the akka engineer suggested (config sketch below)
  • try artery
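
For the second item, the setting would be merged into configuration/initial/akka.conf roughly like this (sketch only):

akka {
  remote {
    # the akka engineer's suggestion: stop reusing inbound connections for outbound messages
    use-passive-connections = off
  }
}
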
Comment by Jamo Luhrsen [ 15/Aug/18 ]

How can we debug this further to get to the root cause? You noted in the akka bug that we know the packets are coming from the other nodes, but the akka debugs we tried before do not show them being received. Are there no lower-level debugs we can enable in karaf to try to trace how the packets are processed?

The akka folks on that github bug have suggested we try adding extra debugs in akka itself and trying with
a snapshot of that. Can you help me do that?

Everything you've listed above is a workaround idea. I can try all of those and maybe they will give us some
clues, but we have some debug momentum at this point and I think we should keep fighting.

BTW, I've now run more than 800 iterations of this reproduction script locally with a graceful shutdown and have not seen the issue. I think it's safe to take that as confirmation that it does not occur with a graceful restart. Not proof that it only comes with a kill -9 restart, but at least we have a little extra data.

Comment by Tom Pantelis [ 15/Aug/18 ]

> lower-level debugs we can enable in karaf to try to trace how the packets are processed?

It's not karaf - it's akka code.

>The akka folks on that github bug have suggested we try adding extra debugs in akka itself and trying with
> a snapshot of that. Can you help me do that?

Their code has to be modified to add more debug - I think they mean they would add it and supply a snapshot.

In the end we may have to work around this or at least document it as a known anomaly. Even if they find the root cause and are able to fix it, that would likely come in a new akka version that we would have to eventually upgrade to (unless they can patch the version we use). Artery is not a workaround - we want to switch to it anyway, hopefully in Neon - and we'll have to switch longer term anyway since the netty-based remoting will be removed at some point.

Comment by Jamo Luhrsen [ 15/Aug/18 ]

> lower-level debugs we can enable in karaf to try to trace how the packets are processed?

It's not karaf - it's akka code.

I know it's akka code that's not receiving the InitJoinAck, but that doesn't mean it's getting dropped in akka code, does it? Maybe it's dropped earlier in whatever connection handling we have? There has to be something grabbing these packets to hand up to akka. Forget I said "karaf" - I mean netty or whatever that magic is we are using in ODL. Obviously I don't fully know these details. Maybe akka is doing all this work itself.

>The akka folks on that github bug have suggested we try adding extra debugs in akka itself and trying with
> a snapshot of that. Can you help me do that?

Their code has to be modified to add more debug - I think they mean they would add it and supply a snapshot.

That's not how I took it. Could you clarify that with them? I don't mind to keep digging here and
make further progress on the root cause and possible fix.

In the end we may have to workaround this or at least doc it as a known anomaly. Even if they find the root cause and are able to fix it, that would likely come in a new akka version that we would have to eventually upgrade to (unless they can patch the version we use). artery is not a workaround - we want to switch to it anyway - hopefully in Neon - we'll have to longer term anyway since the netty-based version will be removed at some point.

artery is a workaround. If the bug is not in artery and we move to artery, we are not fixing any bug. We are ignoring it and moving to something that doesn't have it.

And, if the move to artery is similar to our move to tell-based, then who knows when that will happen.

We've upgraded akka before, so I'm not understanding why that is something to worry about.

I will try artery at some point.

Comment by Tom Pantelis [ 15/Aug/18 ]

> Maybe akka is doing all this work itself.

Yes it is - all of the code in question is akka.

> That's not how I took it. Could you clarify that with them? I don't mind to keep digging here and
> make further progress on the root cause and possible fix.

"*Next step would probably be to add some additional logging to Akka *" - there is no more debug logging to enable - code has to be added to add more detailed logging.

> artery is a workaround. If the bug is not in artery and we move to artery, we are not fixing the bug; we
> are ignoring it and moving to something that doesn't have it.

yes, because artery is their next-generation remoting layer. The netty-based one is essentially deprecated and will be removed, so we have to go to it anyway. In fact we're probably lucky they're even helping us at all (for free anyway).

Comment by Jamo Luhrsen [ 15/Aug/18 ]

> Maybe akka is doing all this work itself.

> Yes it is - all of the code in question is akka.

ok, I must just be confused between the netty we use in ODL for other things and
what you mean by netty-based akka.

> That's not how I took it. Could you clarify that with them? I don't mind to keep digging here and
> make further progress on the root cause and possible fix.

"*Next step would probably be to add some additional logging to Akka *" - there is no more debug logging to enable - code has to be added to add more detailed logging.

well, to be fair, I took what you quoted above as them suggesting we make the
changes for additional logging, not that they would.

> artery is a workaround. If the bug is not in artery and we move to artery, we are not fixing the bug; we
> are ignoring it and moving to something that doesn't have it.

> yes, because artery is their next-generation remoting layer. The netty-based one is essentially deprecated and will be removed, so we have to go to it anyway. In fact we're probably lucky they're even helping us at all (for free anyway).

ok, fair enough. Let me try artery next.
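
As a rough sketch (not taken from any actual patch), enabling artery is essentially a change to configuration/initial/akka.conf: turn the artery block on, move the bind address from the classic akka.remote.netty.tcp settings to artery's canonical settings, and drop the ".tcp" part of the seed-node URIs. The hostname/port values below are placeholders:

odl-cluster-data {
  akka {
    remote {
      artery {
        enabled = on
        transport = aeron-udp            # artery's default UDP transport in akka 2.5
        canonical.hostname = "<member-ip>"
        canonical.port = 2550
      }
      # the classic akka.remote.netty.tcp.hostname/port settings are no longer used
    }
    cluster {
      # artery addresses use the akka:// scheme instead of akka.tcp://
      seed-nodes = ["akka://opendaylight-cluster-data@<seed-ip>:2550"]
    }
  }
}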

Comment by Tom Pantelis [ 15/Aug/18 ]

> well, to be fair, I took what you quoted above as them suggesting we make the
> changes for additional logging, not that they would.

yeah - I think you're right. I'll look into that when I have some cycles....

Comment by Jamo Luhrsen [ 23/Aug/18 ]

having a really hard time making artery work, but I'm still trying.

Comment by Jamo Luhrsen [ 12/Sep/18 ]

I neglected to update this, but a week or two ago I was finally able to get artery to work and force
the shared memory disk usage to be cleaned on every kill/restart iteration. I was able to run the
reproduction script for more than 400 iterations without hitting this bug. I think that's good enough
for now.
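
For anyone repeating this, a small sketch of the per-iteration cleanup mentioned above. artery's aeron-udp transport keeps its media-driver buffers under /dev/shm, and after a kill -9 the stale directory can linger and exhaust the shared-memory mount. The aeron-* directory pattern is an assumption based on aeron's defaults, not taken from the actual reproduction script:

#!/usr/bin/env python3
# Remove stale aeron shared-memory directories between kill/restart iterations.
# The /dev/shm/aeron-* pattern is aeron's default location and may differ per setup.
import glob
import shutil

def clean_aeron_shm():
    for path in glob.glob("/dev/shm/aeron-*"):
        shutil.rmtree(path, ignore_errors=True)

if __name__ == "__main__":
    clean_aeron_shm()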

So, maybe this means our root cause is going to be somewhere in the netty transport for akka. I
know the plan is to eventually move everything to tell-based with artery. I don't think there is anyone
willing to keep digging on a root cause and fix for this. This also does not occur if we do graceful
restarts.

We can change all of our tests to only do graceful restarts, which should hopefully prevent this
bug from causing failures in our CSIT. When we get to a legit tell/artery implementation I think
we should try to get back to our kill/restart cases, as those are still legit cases imho.
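
A rough sketch of what a "graceful restart" step looks like compared to the kill -9 path, assuming a standard karaf layout; the install path, timeout, and use of bin/stop, bin/status and bin/start are assumptions, and the real suites drive this over SSH from Robot:

#!/usr/bin/env python3
# Ask karaf to shut down cleanly, wait until the instance is gone, then start it again.
import subprocess
import time

KARAF_HOME = "/opt/opendaylight"   # assumed install location

def graceful_restart(timeout=300):
    subprocess.check_call([KARAF_HOME + "/bin/stop"])      # clean shutdown instead of kill -9
    deadline = time.time() + timeout
    while time.time() < deadline:
        # karaf's status script returns non-zero once the instance has stopped
        if subprocess.call([KARAF_HOME + "/bin/status"]) != 0:
            break
        time.sleep(5)
    subprocess.check_call([KARAF_HOME + "/bin/start"])

if __name__ == "__main__":
    graceful_restart()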

Comment by Tom Pantelis [ 13/Sep/18 ]

> I don't think there is anyone willing to keep digging on a root cause and fix for this. 

It would be either me or Robert, and I don't have cycles (nor desire); I think I can speak for Robert as well. Also, this is a rather obscure and extreme test scenario that is not very likely to be seen in production anyway.

> We can change all of our tests to only do graceful restarts...

Sounds good.

Comment by Sam Hague [ 17/Sep/18 ]

I converted the NetVirt CSIT to use graceful start and stop. It looks promising and we don't see the OLFEs and other weird exceptions, but it does bring out other issues. openflowplugin and ovsdb have various bundles shutting down, but not all at once, and the listening ports are still open. The switches continue sending to the ODL instance that is shutting down. We need to clean those issues up.

That being said, it seems like we are adapting the tests to accommodate bugs. I could see keeping the tests as they are; then either they shouldn't fail, or we should have checks that expect issues during this window so the tests don't fail on them.

Comment by Tom Pantelis [ 17/Sep/18 ]

Graceful shutdown should be the norm in production, so CSIT should be testing it and, if bundles have issues shutting down gracefully, they should be addressed. "kill -9" should be a last resort as it could leave persistence in a corrupted state. So it sounds like introducing graceful shutdown is uncovering issues, which is a good thing.

Comment by Luis Gomez [ 17/Sep/18 ]

FYI I left 1 test in OFP doing the kill so at least I can track how often the kill fails:

https://git.opendaylight.org/gerrit/#/c/74412/

Comment by Jamo Luhrsen [ 30/Oct/18 ]

This basic problem of a controller not coming up healthy after restarting still happens. Yes, the last example
I saw was in the 'kill -9' scenario. So I suppose we want to keep this bug open, but I'll lower the priority of it. I really don't
know who/when/how it will ever be addressed though. There is just nobody to work on it, and 'kill -9' is hopefully
not a common way for people to stop ODL.

Comment by Tom Pantelis [ 30/Oct/18 ]

It's an issue in akka, and it would not be a trivial effort to learn that large code base (not to mention scala). Plus it's an old code base that has essentially been replaced by artery, which is why the akka devs aren't anxious to jump on it (at least not for free). It's also an extreme test scenario (killing/restarting the process quickly over and over) that realistically wouldn't be seen in production (which I'm sure is why it's never been reported in the 10 years akka's been around), and if it ever were, you'd have bigger problems than the node not rejoining the cluster after the 50th kill or whatever.

Comment by Jamo Luhrsen [ 30/Oct/18 ]

fair enough, but just to be clear, this is not happening because of killing/restarting quickly over and over 50 times. Yes, that was the script
I shared that made it easier to reproduce, but the failures are still seen in our automation, where it kills, waits to make sure
the process is dead, then starts and again waits/polls for the controller to come up (i.e. no longer returning 401) and report sync status true.

besides the "kill -9" part, it's pretty gentle.

again, I understand nobody wants or plans to work on it. I have lowered the priority.

maybe tell-based will be better, but that work is yet to be done, and the jobs we have for tell-based have failures to figure
out as well.
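
A sketch of the post-restart health poll described above: keep hitting the controller until it stops returning 401s/connection errors and the shard manager reports SyncStatus true over jolokia. The URL, MBean name and credentials are assumptions modeled on the CSIT cluster keywords, not copied from them:

#!/usr/bin/env python3
# Poll the shard manager's SyncStatus after a restart; 401s and connection
# errors while the controller is still coming up are simply retried.
import base64
import json
import time
import urllib.request

ODL = "http://<member-ip>:8181"
AUTH = "Basic " + base64.b64encode(b"admin:admin").decode()
SYNC_URL = (ODL + "/jolokia/read/org.opendaylight.controller:"
            "type=DistributedConfigDatastore,Category=ShardManager,"
            "name=shard-manager-config")

def wait_for_sync(timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            req = urllib.request.Request(SYNC_URL, headers={"Authorization": AUTH})
            with urllib.request.urlopen(req, timeout=10) as resp:
                body = json.load(resp)
            if body.get("value", {}).get("SyncStatus") is True:
                return True
        except Exception:
            pass  # controller not up yet; retry
        time.sleep(5)
    return False

if __name__ == "__main__":
    print("in sync" if wait_for_sync() else "timed out waiting for sync")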

Comment by Tom Pantelis [ 30/Oct/18 ]

It's not related to tell-based - that's our stuff. It's in akka or a lib it uses (e.g. netty), or possibly at the TCP level where something is stale due to the process being killed. AFAIK, from your and Luis's testing, it does not occur if the process is gracefully shut down, nor does it occur after the first kill/restart (i.e. it takes multiple iterations if it occurs at all). Also, it has not been seen to occur with the new remoting code base, artery (with UDP at least), so IMO it makes more sense to focus on switching to artery rather than spending the time to try to learn, debug, and possibly fix the old akka code base, which Lightbend appears to have little to no interest in continuing to support (at least for free).

Comment by Jamo Luhrsen [ 30/Oct/18 ]

yeah, I said "tell-based" but confused that with "artery". My bad.

but yeah, it's very clear this bug is not getting fixed and that there is no interest from anyone to
look any deeper. Gotta keep it open though, so we can point back to it when our automation
continues to have these failures. We have moved a lot of the tests to use graceful stop/start,
so hopefully it's not as frequent anymore.

Comment by Robert Varga [ 14/Nov/18 ]

tpantelis I noticed https://github.com/akka/akka/issues/25632 in 2.5.17 (which we still cannot have, as things are in flux and we'll hopefully get 2.5.19). Is that something that can help?

Comment by Tom Pantelis [ 14/Nov/18 ]

That is related to quarantining, which doesn't happen in this scenario (it usually happens with node isolation). Plus, even if the killed node were somehow quarantined, restarting clears the quarantined state. No split brain occurs here - it's a simple node kill/restart.

Comment by Jamo Luhrsen [ 13/Dec/18 ]

moved this to Low priority and closing, as we aren't seeing failures any more because we moved to
a graceful stop/start, but I'm sure we would see it if we went back to the kill -9. Also, it
doesn't occur when we use artery (even with kill -9), and hopefully that is going to be
the default in the near future.
