[CONTROLLER-883] Clustering: Network Seg (> seconds) between cluster nodes requires restart Created: 22/Sep/14 Updated: 19/Oct/17 Resolved: 18/Aug/15 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | James Gregory Hall | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 2035 |
| Description |
|
Establish a three node odl-mdsal-clustering cluster, then turn off the NIC on one of the nodes for 15 seconds or so, then turn it back on. The temporarily lost node will not successfully reconnect until its controller process is restarted, apparently by design based on the INFO log messages. I'm not sure why the restart is required, but if it remains necessary then we'll need the node to auto-restart itself when it detects that it has been quarantined. Especially in lab situations, where clustering confidence is first being established, switches frequently get shut down for more than 15 seconds, and our SDN controller's cluster should auto-recover from this, preferably without orchestration hacks to cover for it.

2014-09-22 14:02:42,521 | WARN | lt-dispatcher-17 | Remoting | 234 - com.typesafe.akka.slf4j - 2.3.4 | Tried to associate with unreachable remote address [akka.tcp://opendaylight-cluster-data@192.168.1.26:2550]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted. |
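For context only: the "gated for 5000 ms" in the log comes from Akka remoting's gate timer, which in stock Akka 2.3 is controlled by a setting along the following lines. This is a sketch of the standard Akka defaults, not something taken from this bug report, and tuning it does not affect the quarantine itself, which still requires restarting the quarantined ActorSystem.

    akka {
      remote {
        # How long an address stays gated after a refused association attempt;
        # the Akka 2.3 default of 5 s matches the "gated for 5000 ms" log line.
        retry-gate-closed-for = 5 s
      }
    }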
| Comments |
| Comment by Mark Mozolewski [ 07/Feb/15 ] |
|
James, the Quarantine behavior you are seeing is part of the lifecycle and failure model of the Akka framework. The Clustering team will have to discuss how we want to approach this long term, since there are larger questions about network partition behavior, etc. I will drive that and own this bug. Note that ultimately the restart that is needed is just of the Akka system instances (2 on each controller, for data and rpcs) and not of the controller as a whole; but since there is no mechanism for that now, you would have to restart the controller. There is a local workaround: you can disable this behavior by commenting out the 2 occurrences of the following line in your ${karaf.home}/configuration/initial/akka.conf file (by prepending "//"). With this change Akka will not quarantine nodes for your testing/development. I've confirmed on 3 local VMs that this works for disabling/enabling a node's NIC. |
| Comment by Mark Mozolewski [ 09/Feb/15 ] |
|
(In reply to Mark Mozolewski from comment #1)
> With this change Akka will not Quarantine nodes

The line referred to in comment #1 (to be commented out) is: // auto-down-unreachable-after = 10s |
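A sketch of what the workaround looks like in ${karaf.home}/configuration/initial/akka.conf, assuming the usual Helium layout where the setting appears once in the odl-cluster-data section and once in the odl-cluster-rpc section (the two Akka systems mentioned above); the exact nesting may differ in your distribution:

    odl-cluster-data {
      akka {
        cluster {
          // auto-down-unreachable-after = 10s  (commented out so unreachable members are never auto-downed and quarantined)
        }
      }
    }

    odl-cluster-rpc {
      akka {
        cluster {
          // auto-down-unreachable-after = 10s  (second occurrence, same change)
        }
      }
    }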
| Comment by Mark Mozolewski [ 11/Feb/15 ] |
|
Proposal to increase the default auto-down time while we plan overall auto-down behavior for clustering:
(Controller) https://git.opendaylight.org/gerrit/#/c/15117
(Integration) https://git.opendaylight.org/gerrit/#/c/15175/
|
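For illustration only: the proposed change amounts to raising the value of the same setting in both akka.conf sections. The value shown below is a placeholder, not the one chosen in the linked patches:

    // before (Helium default)
    auto-down-unreachable-after = 10s
    // after (placeholder value; see the gerrit patches for the actual proposed default)
    auto-down-unreachable-after = 300s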
| Comment by Moiz Raja [ 18/Aug/15 ] |
|
Bug 4037 has more details of the problems caused by turning on auto-down-unreachable-after. |