- Bug
- Resolution: Done
- Medium
This happens on all branches and is very easy to reproduce:
1) Bring up a 3-node cluster with any ODL feature installed (e.g. odl-restconf)
2) Isolate 1 instance from the other 2 using iptables:
sudo iptables -A OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -A OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -A INPUT -s 192.168.0.101 -j DROP; sudo iptables -A INPUT -s 192.168.0.103 -j DROP
3) Wait until the isolated instance is quarantined by the other 2 (~3 mins):
2017-10-26 04:08:25,112 | ERROR | ult-dispatcher-4 | Remoting | 84 - com.typesafe.akka.slf4j - 2.4.18 | Association to [akka.tcp://opendaylight-cluster-data@192.168.0.102:2550] with UID [-1659815551] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
    at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:351)[83:com.typesafe.akka.remote:2.4.18]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[78:com.typesafe.akka.actor:2.4.18]
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)[83:com.typesafe.akka.remote:2.4.18]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[78:com.typesafe.akka.actor:2.4.18]
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[78:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
4) Rejoin the instance to the cluster:
sudo iptables -D OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -D OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -D INPUT -s 192.168.0.101 -j DROP; sudo iptables -D INPUT -s 192.168.0.103 -j DROP
5) The isolated instance restarts itself, but after the restart it never rejoins the cluster or boots properly:
2017-10-26 04:18:29,585 | WARN | ult-dispatcher-3 | QuarantinedMonitorActor | 204 - org.opendaylight.controller.sal-clustering-commons - 1.7.0.SNAPSHOT | Got quarantined by akka.tcp://opendaylight-cluster-data@192.168.0.101:2550
2017-10-26 04:18:29,585 | WARN | ult-dispatcher-3 | rantinedMonitorActorPropsFactory | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Restarting karaf container
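The isolate and rejoin steps above could be scripted roughly as follows. This is only a sketch for convenience when reproducing: the function names (iptables_rules, isolate, rejoin) and the PEERS variable are made up for illustration, and the peer addresses are the ones used in the steps above. The functions only print the iptables commands, so nothing changes on the node until the output is piped into a root shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of ODL): print the iptables commands that
# cut this node off from its cluster peers, or that undo the isolation.
# Peer addresses are taken from the reproduction steps; adjust as needed.
PEERS="192.168.0.101 192.168.0.103"

iptables_rules() {
    action="$1"   # -A appends the DROP rules, -D deletes them again
    for peer in $PEERS; do
        echo "iptables $action OUTPUT -d $peer -j DROP"
        echo "iptables $action INPUT -s $peer -j DROP"
    done
}

# Step 2: isolate this node from the other two members.
isolate() { iptables_rules -A; }

# Step 4: remove the rules so the node can reach its peers again.
rejoin() { iptables_rules -D; }
```

Assumed usage on the node to be isolated: run "isolate | sudo sh", wait for the quarantine message from step 3 to appear in the logs of the other two nodes, then run "rejoin | sudo sh".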