[CONTROLLER-1783] Member fails to rejoin cluster after it is quarantined Created: 26/Oct/17 Updated: 24/May/18 Resolved: 24/May/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Luis Gomez | Assignee: | Luis Gomez |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
This is happening in all branches and it is very easy to reproduce:

1) Bring up a 3-node cluster with any ODL feature (e.g. odl-restconf).

2) Isolate one instance from the other two:

sudo iptables -A OUTPUT -d 192.168.0.101 -j DROP
sudo iptables -A OUTPUT -d 192.168.0.103 -j DROP
sudo iptables -A INPUT -s 192.168.0.101 -j DROP
sudo iptables -A INPUT -s 192.168.0.103 -j DROP

3) Wait until the isolated instance is quarantined by the other 2 (~3 mins):

2017-10-26 04:08:25,112 | ERROR | ult-dispatcher-4 | Remoting | 84 - com.typesafe.akka.slf4j - 2.4.18 | Association to [akka.tcp://opendaylight-cluster-data@192.168.0.102:2550] with UID [-1659815551] irrecoverably failed. Quarantining address.
java.util.concurrent.TimeoutException: Delivery of system messages timed out and they were dropped.
    at akka.remote.ReliableDeliverySupervisor$$anonfun$gated$1.applyOrElse(Endpoint.scala:351)[83:com.typesafe.akka.remote:2.4.18]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[78:com.typesafe.akka.actor:2.4.18]
    at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)[83:com.typesafe.akka.remote:2.4.18]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[78:com.typesafe.akka.actor:2.4.18]
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)[78:com.typesafe.akka.actor:2.4.18]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[78:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[300:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]

4) Rejoin the instance to the cluster:

sudo iptables -D OUTPUT -d 192.168.0.101 -j DROP
sudo iptables -D OUTPUT -d 192.168.0.103 -j DROP
sudo iptables -D INPUT -s 192.168.0.101 -j DROP
sudo iptables -D INPUT -s 192.168.0.103 -j DROP

5) The instance gets restarted and after that it will never rejoin the cluster or boot properly:
2017-10-26 04:18:29,585 | WARN | ult-dispatcher-3 | QuarantinedMonitorActor | 204 - org.opendaylight.controller.sal-clustering-commons - 1.7.0.SNAPSHOT | Got quarantined by akka.tcp://opendaylight-cluster-data@192.168.0.101:2550
2017-10-26 04:18:29,585 | WARN | ult-dispatcher-3 | rantinedMonitorActorPropsFactory | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Restarting karaf container
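
For anyone repeating the reproduction, the isolate and rejoin commands differ only in the iptables action (-A vs -D), so they could be generated by one small helper. The sketch below is not part of the original report: the function names peer_rules and was_quarantined are hypothetical, the helper only prints the commands rather than running them, and the log file path is left as an argument because the report does not say where the karaf log lives.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the iptables commands that cut off
# or restore traffic between this node and the given peers.
# $1 is the iptables action: -A adds the DROP rules, -D deletes them.
# Remaining arguments are the peer IP addresses.
peer_rules() {
    action=$1
    shift
    # Same order as in the report: OUTPUT rules first, then INPUT rules.
    for chain_flag in "OUTPUT -d" "INPUT -s"; do
        for peer in "$@"; do
            printf 'sudo iptables %s %s %s -j DROP\n' "$action" "$chain_flag" "$peer"
        done
    done
}

# Hypothetical helper: succeed if the given log file contains the
# QuarantinedMonitorActor warning quoted in the report.
was_quarantined() {
    grep -q "Got quarantined by" "$1"
}
```

Running `peer_rules -A 192.168.0.101 192.168.0.103` on the instance to be isolated prints the four DROP rules; piping the output to `sh` would apply them, and the same call with -D prints the commands that remove them.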
|
| Comments |
| Comment by Luis Gomez [ 01/Nov/17 ] |
|
As suggested during the Kernel call, I tried with latest nitrogen and it works. Would it be possible to fix carbon and oxygen? |
| Comment by Luis Gomez [ 01/Nov/17 ] |
|
When testing carbon I see it recovers if I install only the odl-restconf feature, but it does not if I install the odl-openflowplugin-flow-services-rest feature (see attached log of the rejoined instance). |
| Comment by OpenDaylight Release [ 03/May/18 ] |
|
Since the bug is unassigned I'm currently assigning it to you. Please assign to the relevant person. |
| Comment by Luis Gomez [ 03/May/18 ] |
|
I will need to recheck this bug. |
| Comment by Luis Gomez [ 24/May/18 ] |
|
I do not see this issue anymore so closing it. |
| Comment by Tom Pantelis [ 24/May/18 ] |
|
Nice - it looks like upgrading to akka 2.5.x has fixed a couple of long-standing issues. |
| Comment by Luis Gomez [ 24/May/18 ] |
|
Yeah, this one must have been fixed before the AKKA update, as I had not checked it for a very long time.