[CONTROLLER-1783] Member fails to rejoin cluster after it is quarantined
Created: 26/Oct/17  Updated: 24/May/18  Resolved: 24/May/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Medium
Reporter: Luis Gomez
Assignee: Luis Gomez
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File karaf_fail_after_rejoin.log    

 Description   

This is happening in all branches and it is very easy to reproduce:

1) Bring up a 3-node cluster with any ODL feature installed (e.g. odl-restconf)
2) Isolate 1 instance from the other 2 using iptables:

sudo iptables -A OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -A OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -A INPUT -s 192.168.0.101 -j DROP; sudo iptables -A INPUT -s 192.168.0.103 -j DROP

3) Wait until the isolated instance is quarantined by the other 2 (~3 mins):

2017-10-26 04:08:25,112 | ERROR | ult-dispatcher-4 | Remoting                         | 84 - com.typesafe.akka.slf4j - 2.4.18 | Association to [akka.tcp://opendaylight-cluster-data@192.168.0.102:2550] with UID [-1659815551] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
	at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:351)[83:com.typesafe.akka.remote:2.4.18]
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[78:com.typesafe.akka.actor:2.4.18]
	at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)[83:com.typesafe.akka.remote:2.4.18]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[78:com.typesafe.akka.actor:2.4.18]
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)[78:com.typesafe.akka.actor:2.4.18]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[78:com.typesafe.akka.actor:2.4.18]
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)[78:com.typesafe.akka.actor:2.4.18]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[78:com.typesafe.akka.actor:2.4.18]
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]

4) Rejoin the instance to the cluster:

sudo iptables -D OUTPUT -d 192.168.0.101 -j DROP; sudo iptables -D OUTPUT -d 192.168.0.103 -j DROP; sudo iptables -D INPUT -s 192.168.0.101 -j DROP; sudo iptables -D INPUT -s 192.168.0.103 -j DROP

5) The instance restarts, and after that it never rejoins the cluster or boots properly:

2017-10-26 04:18:29,585 | WARN  | ult-dispatcher-3 | QuarantinedMonitorActor          | 204 - org.opendaylight.controller.sal-clustering-commons - 1.7.0.SNAPSHOT | Got quarantined by akka.tcp://opendaylight-cluster-data@192.168.0.101:2550
2017-10-26 04:18:29,585 | WARN  | ult-dispatcher-3 | rantinedMonitorActorPropsFactory | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Restarting karaf container
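
The isolate and rejoin commands in steps 2 and 4 could be wrapped in a small helper so the test is easier to repeat. This is only a sketch: the function names, the PEERS and RUN variables, and the dry-run default are assumptions for illustration and are not part of any ODL tooling. The peer addresses are the ones used in this report.

```shell
#!/bin/sh
# Hypothetical helper for steps 2 and 4; run it on the instance to be isolated.
# PEERS: addresses of the other two cluster members (taken from this report).
# RUN:   command prefix; defaults to "echo" so rules are only printed (dry run).
#        Set RUN to an empty string (RUN= ) to actually apply the rules.
PEERS="${PEERS-192.168.0.101 192.168.0.103}"
RUN="${RUN-echo}"

isolation_rules() {
    # $1 is the iptables action: -A appends the DROP rules, -D deletes them.
    action="$1"
    for peer in $PEERS; do
        $RUN sudo iptables "$action" OUTPUT -d "$peer" -j DROP
        $RUN sudo iptables "$action" INPUT -s "$peer" -j DROP
    done
}

# Step 2: drop all traffic to and from the peers.
isolate() { isolation_rules -A; }

# Step 4: remove the same rules so the instance can reach the peers again.
rejoin() { isolation_rules -D; }
```

With the defaults, calling isolate or rejoin only prints the four iptables commands. Sourcing the file with RUN set to an empty string makes the same calls modify the firewall.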


 Comments   
Comment by Luis Gomez [ 01/Nov/17 ]

As suggested during the Kernel call, I tried the latest nitrogen and it works there. Would it be possible to fix carbon and oxygen?

Comment by Luis Gomez [ 01/Nov/17 ]

When testing carbon, I see that the instance recovers if I install only the odl-restconf feature, but it does not recover if I install the odl-openflowplugin-flow-services-rest feature (see the attached log of the rejoined instance).

Comment by OpenDaylight Release [ 03/May/18 ]

Since the bug is unassigned I'm currently assigning it to you.

Please assign to the relevant person. 

Comment by Luis Gomez [ 03/May/18 ]

I will need to recheck this bug.

Comment by Luis Gomez [ 24/May/18 ]

I do not see this issue anymore so closing it.

Comment by Tom Pantelis [ 24/May/18 ]

Nice. It looks like upgrading to akka 2.5.x has fixed a couple of long-standing issues.

Comment by Luis Gomez [ 24/May/18 ]

Yeah, this one must have been fixed before the Akka update, as I had not checked it for a very long time.

Generated at Wed Feb 07 19:56:26 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.