[CONTROLLER-883] Clustering : Network Seg (> seconds ) between cluster nodes requires restart
Created: 22/Sep/14  Updated: 19/Oct/17  Resolved: 18/Aug/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: James Gregory Hall
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
duplicates CONTROLLER-1396 Clustering: Node does not rejoin afte... Resolved
is duplicated by CONTROLLER-1102 Clustering : Disable auto downing of ... Resolved
External issue ID: 2035

 Description   

Establish a three node odl-mdsal-clustering cluster, then turn off the NIC on one of the nodes for 15 seconds or so... then turn it back on.

The temporarily lost node will not successfully reconnect until its controller process is restarted, apparently by design, based on the INFO log messages.

I'm not sure why the restart is required, but if this remains necessary then we'll need the node to auto-restart itself if it detects that it's being quarantined for lack of restart.

Especially in lab situations where clustering confidence is first established, switches frequently get shut down for more than 15 seconds, and our SDN controller's cluster should auto-recover from this, preferably without orchestration hacks to cover for it.

2014-09-22 14:02:42,521 | WARN | lt-dispatcher-17 | Remoting | 234 - com.typesafe.akka.slf4j - 2.3.4 | Tried to associate with unreachable remote address [akka.tcp://opendaylight-cluster-data@192.168.1.26:2550]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.



 Comments   
Comment by Mark Mozolewski [ 07/Feb/15 ]

James, the Quarantine behavior you are seeing is part of the lifecycle and failure model of the Akka framework. The Clustering team will have to discuss how we want to approach this long term since there are larger questions about network partition behavior, etc. I will drive that and own this bug.

Note that ultimately the restart that is needed is just on the instances of the Akka systems (2 on each controller for data and rpcs) and should not be the controller as a whole. But since there is no mechanism for that now you would have to restart the controller. There is a local workaround:

You can disable this behavior by commenting out the 2 occurrences of the following line in your ${karaf.home}/configuration/initial/akka.conf file (by prepending with "//"). With this change Akka will not Quarantine nodes for your testing/development.

I’ve confirmed on 3 local VMs that this works for disabling/enabling a node's NIC.

Comment by Mark Mozolewski [ 09/Feb/15 ]

(In reply to Mark Mozolewski from comment #1)
> James, the Quarantine behavior you are seeing is part of the lifecycle and
> failure model of the Akka framework. The Clustering team will have to
> discuss how we want to approach this long term since there are larger
> questions about network partition behavior, etc. I will drive that and own
> this bug.
>
> Note that ultimately the restart that is needed is just on the instances of
> the Akka systems (2 on each controller for data and rpcs) and should not be
> the controller as a whole. But since there is no mechanism for that now you
> would have to restart the controller. There is a local workaround:
>
> You can disable this behavior by commenting out the 2 occurrences of the
> following line in your ${karaf.home}/configuration/initial/akka.conf file
> (by prepending with "//").

// auto-down-unreachable-after = 10s

> With this change Akka will not Quarantine nodes
> for your testing/development.
>
> I’ve confirmed on 3 local VMs that this works for disabling/enabling a
> node's NIC.
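
Editorial illustration, not part of the original comment: the workaround above amounts to prefixing every active `auto-down-unreachable-after` line in akka.conf with "//". A minimal Python sketch of that edit follows. The function name and the `SAMPLE` layout are hypothetical and exist only for demonstration; the real akka.conf contains far more settings, and this has not been run against an actual controller install.

```python
import re
from typing import Tuple

# Matches an active (not yet commented) auto-down-unreachable-after setting,
# capturing the leading indentation separately so it can be preserved.
_SETTING = re.compile(r"^(\s*)(auto-down-unreachable-after\s*[=:].*)$")

# Hypothetical, heavily trimmed stand-in for akka.conf: one entry for each of
# the two Akka systems (data and rpcs) mentioned in the comment.
SAMPLE = (
    "cluster {\n"
    "  auto-down-unreachable-after = 10s\n"
    "}\n"
    "cluster {\n"
    "  auto-down-unreachable-after = 10s\n"
    "}\n"
)


def comment_out_auto_down(conf_text: str) -> Tuple[str, int]:
    """Return (new_text, count), prefixing each active setting line with '//'."""
    out = []
    count = 0
    for line in conf_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # keep the original line ending as-is
        match = _SETTING.match(body)
        if match:
            out.append(f"{match.group(1)}// {match.group(2)}{ending}")
            count += 1
        else:
            out.append(line)  # already-commented and unrelated lines pass through
    return "".join(out), count
```

To apply it, one would read ${karaf.home}/configuration/initial/akka.conf, pass its contents through `comment_out_auto_down`, confirm the returned count is 2, and write the result back before restarting the controller. Running it a second time changes nothing, since commented lines no longer match.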

Comment by Mark Mozolewski [ 11/Feb/15 ]

Proposal to increase default auto-down time while we plan overall auto-down behavior for clustering.

(Controller) https://git.opendaylight.org/gerrit/#/c/15117

(*Integration) https://git.opendaylight.org/gerrit/#/c/15175/

  * For cluster deploy scripts to match akka.conf.

Comment by Moiz Raja [ 18/Aug/15 ]

4037 has more details of the problems caused by turning auto-down-unreachable-after

Generated at Wed Feb 07 19:54:07 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.