CONTROLLER-1783

Member fails to rejoin cluster after it is quarantined


    • Type: Bug
    • Resolution: Done
    • Priority: Medium
    • clustering

      This happens on all branches and is very easy to reproduce:

      1) Bring up a 3-node cluster with any ODL feature (e.g. odl-restconf)
      2) Isolate 1 instance from the other 2 using iptables:

      sudo iptables -A OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -A OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -A INPUT -s 192.168.0.101 -j DROP; sudo iptables -A INPUT -s 192.168.0.103 -j DROP
      

      3) Wait until the isolated instance is quarantined by the other 2 (~3 mins):

      2017-10-26 04:08:25,112 | ERROR | ult-dispatcher-4 | Remoting                         | 84 - com.typesafe.akka.slf4j - 2.4.18 | Association to [akka.tcp://opendaylight-cluster-data@192.168.0.102:2550] with UID [-1659815551] irrecoverably failed. Quarantining address.
      java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
      	at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:351)[83:com.typesafe.akka.remote:2.4.18]
      	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[78:com.typesafe.akka.actor:2.4.18]
      	at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)[83:com.typesafe.akka.remote:2.4.18]
      	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[78:com.typesafe.akka.actor:2.4.18]
      	at akka.actor.ActorCell.invoke(ActorCell.scala:495)[78:com.typesafe.akka.actor:2.4.18]
      	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[78:com.typesafe.akka.actor:2.4.18]
      	at akka.dispatch.Mailbox.run(Mailbox.scala:224)[78:com.typesafe.akka.actor:2.4.18]
      	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[78:com.typesafe.akka.actor:2.4.18]
      	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
      	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
      	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
      	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
      

      4) Restore connectivity so the instance can rejoin the cluster:

      sudo iptables -D OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -D OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -D INPUT -s 192.168.0.101 -j DROP; sudo iptables -D INPUT -s 192.168.0.103 -j DROP
      

      5) The instance restarts itself, but afterwards it never rejoins the cluster or boots properly:

      2017-10-26 04:18:29,585 | WARN  | ult-dispatcher-3 | QuarantinedMonitorActor          | 204 - org.opendaylight.controller.sal-clustering-commons - 1.7.0.SNAPSHOT | Got quarantined by akka.tcp://opendaylight-cluster-data@192.168.0.101:2550
      2017-10-26 04:18:29,585 | WARN  | ult-dispatcher-3 | rantinedMonitorActorPropsFactory | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Restarting karaf container
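
      For anyone repeating the steps above, here is a rough helper sketch. It is not part of the original report. The peer addresses and the two log phrases are taken from the excerpts above; the function names `partition_rules` and `saw_quarantine` are made up for illustration. The helper only prints the iptables commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only. Peer addresses are the ones used in this report;
# adjust them for your own cluster.
PEERS="192.168.0.101 192.168.0.103"

# Print the iptables commands for one action:
#   -A adds the DROP rules (isolate), -D deletes them (restore connectivity).
partition_rules() {
    action="$1"
    for peer in $PEERS; do
        echo "iptables $action OUTPUT -d $peer -j DROP"
        echo "iptables $action INPUT -s $peer -j DROP"
    done
}

# Read log text on stdin and succeed if it contains either of the
# quarantine messages quoted in this report.
saw_quarantine() {
    grep -q -e 'Quarantining address' -e 'Got quarantined by'
}

# Dry run: show the rules that step 2 would add.
partition_rules -A

# Possible use on the instance to be isolated (assumes sudo is available):
#   partition_rules -A | sudo sh    # step 2
#   partition_rules -D | sudo sh    # step 4
# To check a log file for quarantine messages (file name is hypothetical):
#   saw_quarantine < karaf.log && echo "quarantine logged"
```

      Printing the rules instead of applying them directly keeps the -A and -D sets symmetric, so the cleanup in step 4 removes exactly what step 2 added.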
      

            ecelgp Luis Gomez