[CONTROLLER-1396] Clustering: Node does not rejoin after restart Created: 22/Jul/15  Updated: 27/Oct/15  Resolved: 27/Oct/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Lithium
Fix Version/s: None

Type: Bug
Reporter: Shaleen Saxena Assignee: Gary Wu
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive Logs.zip    
Issue Links:
Duplicate
is duplicated by CONTROLLER-883 Clustering : Network Seg (> seconds ... Resolved
is duplicated by CONTROLLER-1385 Make manual-down the default for akka... Resolved
External issue ID: 4037

 Comments   
Comment by Shaleen Saxena [ 22/Jul/15 ]

This issue is seen most commonly during clustering datastore integration tests. 1 out of 5 times, the "140 Recovery Restart Follower" test suite will fail. The first failed test case is "Add cars to the first follower", and the test cases that follow it also fail. From the log.html:

the response of the POST to add car=<Response [500]>

Looking at the logs, the newly restarted node has failed to rejoin the cluster. The logs from all 3 members are attached.

The three cluster nodes are:
member-1 = 10.18.162.168
member-2 = 10.18.162.170
member-3 = 10.18.162.171

The failure is seen in the "140 Recovery Restart Follower" test. The timestamp for the start of the test is 20:41:11.

Comment by Shaleen Saxena [ 22/Jul/15 ]

Attachment Logs.zip has been added with description: Karaf logs from all members.

Comment by Tom Pantelis [ 23/Jul/15 ]

I tested the scenario in the 140 test. The first seed node is node1 which became the akka cluster leader as expected. I stopped node2 and node3.

node1 quickly declared the other nodes unreachable and continued to retry the connection, i.e.:

2015-07-23 02:55:40,271 | WARN | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$2 | 71 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Association with remote system [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552]] Caused by: [Connection refused: /127.0.0.1:2552]

I also got this message which is expected based on the akka docs:

2015-07-23 02:56:16,234 | INFO | ult-dispatcher-2 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -> akka.tcp://opendaylight-cluster-data@127.0.0.1:2552: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -> akka.tcp://opendaylight-cluster-data@127.0.0.1:2554: Unreachable [Unreachable] (2)], member status: [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@127.0.0.1:2552 Up seen=false, akka.tcp://opendaylight-cluster-data@127.0.0.1:2554 Up seen=false]

After restarting node2:

2015-07-23 02:56:38,965 | INFO | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@127.0.0.1:2552, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
2015-07-23 02:56:38,965 | INFO | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] as [Down]

And then this message repeated over and over, every 11 seconds:

2015-07-23 02:56:49,932 | INFO | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@127.0.0.1:2552, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

So node1 wouldn't allow node2 back in until about 5 min, after which node2 and node3 were auto-downed and node2 was allowed to rejoin.

I tried it with auto-down-unreachable-after set to off. This time node2 wasn't allowed to rejoin until node3 was started:

2015-07-23 03:11:17,235 | INFO | lt-dispatcher-16 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552]
2015-07-23 03:11:17,325 | WARN | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$2 | 71 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Association with remote system [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2015-07-23 03:11:25,935 | INFO | ult-dispatcher-2 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] is JOINING, roles [member-2]
2015-07-23 03:11:26,234 | INFO | lt-dispatcher-14 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 | | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2552] to [Up]

Setting auto-down-unreachable-after to the original 10s, nodes 2 & 3 are auto-downed and removed quickly and thus are allowed to rejoin quickly after restart.

So when node(s) are unreachable, the leader can't perform its duties, like allowing nodes to join. I had previously thought this only applied to new nodes that hadn't previously joined, but apparently that's not the case - it also applies to previously joined nodes that are unreachable.

So the behavior seems to be that all unreachable nodes must become reachable or downed before any are allowed back by the leader. This doesn't seem right.

The interesting (or weird) part is that node2 remained a follower, saw node1 as the shard leader, and continued to receive heartbeats from node1 even though node2 wasn't allowed to rejoin the akka cluster and was still seen as unreachable. So things seem OK between the two until you try to initiate a transaction on node2. Since node2 didn't get the MemberUp event for node1, it didn't have node1's actor address, so transactions on node2 failed (from restconf).

So akka remoting had a connection from node1 -> node2 and allowed messages to be sent, while akka clustering deemed node2 unreachable. That seems broken - a major disconnect between the two components.

Based on my testing, I don't see how the 140 test works at all when just one of the followers is restarted and it tries to add cars on that follower.
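
For context, the MemberUp bookkeeping described above comes from subscribing to Akka cluster membership events. Below is a minimal, illustrative sketch of such a subscriber (not ODL's actual shard manager code; the role-to-address map is a simplification):

import java.util.HashMap;
import java.util.Map;

import akka.actor.Address;
import akka.actor.UntypedActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberUp;

public class MemberAddressTracker extends UntypedActor {
    // role -> member address; a node can only resolve remote shard leaders for members
    // it has seen a MemberUp for, which is why transactions on node2 failed here.
    private final Map<String, Address> memberAddresses = new HashMap<>();

    @Override
    public void preStart() {
        // Subscribe to membership events; current members are replayed as initial events.
        Cluster.get(getContext().system()).subscribe(getSelf(),
                ClusterEvent.initialStateAsEvents(), MemberUp.class);
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof MemberUp) {
            MemberUp memberUp = (MemberUp) message;
            for (String role : memberUp.member().getRoles()) {
                memberAddresses.put(role, memberUp.member().address());
            }
        }
    }
}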

Comment by Tom Pantelis [ 23/Jul/15 ]

Actually I do know how the 140 test works. The CDS will block startup waiting for shard leaders to be elected, up to 90s per data store; with both data stores it blocks for 3 min. The test waits for the cluster-test-app to be started, which happens after the 3 min. So the 3 min startup, combined with the time it takes to shut down the nodes and the 1 min of retries for add cars to succeed, gives the 5 min auto-down enough time to kick in.

Comment by Tom Pantelis [ 23/Jul/15 ]

I played around with this a bit more. I stopped node3 and also invoked the "leave" operation manually via the akka JMX. As advertised, the cluster leader immediately transitioned node3 to exiting and removed it from the cluster. Then I stopped and restarted node2 and it quickly re-joined the cluster as node3 was removed.

So "leave" is a graceful and immediate way to tell the leader to remove the node w/o waiting for unreachable and down to occur. This seems to be a reasonable solution to this issue (in lieu of setting auto-down-unreachable-after back to a low value which has other issues). I would think akka would issue a "leave" automatically on graceful shutdown but it doesn't. Maybe there's a setting for that. Otherwise we should be able to issue the leave programmatically on shutdown.

Of course this wouldn't apply in the case of an ungraceful shutdown or network partition involving multiple nodes but those scenarios will be uncommon.

Comment by Colin Dixon [ 23/Jul/15 ]

It appears that this is caused by and/or related to this bug in Akka
https://github.com/akka/akka/issues/13584

Comment by Tom Pantelis [ 24/Jul/15 ]

I opened a new issue https://github.com/akka/akka/issues/18067.

Comment by Colin Dixon [ 13/Aug/15 ]

I wrote up this summary of the problem:

Clustering Autodown Issue

Problem Statement:
===
There is no "right" setting for how and when to down nodes. We are in a catch-22.

If a node is downed, it has to be rebooted (or at least its actor system has to be rebooted) to get a new UUID to join. This means downing nodes too aggressively requires manual intervention with some regularity to maintain fault tolerance.

Unreachable nodes can only be moved to up when the cluster is converged. This requires that all unreachable nodes be declared down, or all become reachable, before any unreachable node can be declared up again. This causes problems if you have 3 nodes and 2 become unreachable: even if one becomes reachable again (in theory returning the cluster to a good state), it can't be moved to up unless the other node is declared down or becomes reachable again. This means that unless we down nodes reasonably aggressively, we can see periods of unnecessary unavailability.

This is tracked in ODL as CONTROLLER-1396 (see below)

Akka Definitions:
===

  • unreachable: the failure detector has marked this node as likely down and quarantined it; if it becomes reachable again it will be allowed back in
  • downed: the node has been marked as dead and will not be allowed to rejoin
  • convergence: there are no unreachable nodes, i.e., all "members" are up or down
    From: http://doc.akka.io/docs/akka/snapshot/common/cluster.html

Solutions:
===

  • First, fix Akka's logic:
      • Possibly, fix it so that nodes are allowed to rejoin without having "convergence" (this is Akka issue 18067 [see below] and they aren't super optimistic about it).
      • Possibly, implement our own auto-down version, which would ???. This would likely require help from Akka/Typesafe and maybe consulting.
  • Second, implement our Akka actors so that they test whether they have been downed and can reboot themselves automatically.
  • Third, have nodes leave on graceful shutdown. This is only a partial fix, as it won't help with non-graceful shutdowns.
  • Fourth, use something other than Akka clustering:
      • Maybe just use Akka remoting, but not clustering.
      • Maybe just use something like AMQP.

https://bugs.opendaylight.org/show_bug.cgi?id=4037
https://github.com/akka/akka/issues/18067#issuecomment-129323444

Comment by Colin Dixon [ 13/Aug/15 ]

Based on some talking with TomP and discussion on the Akka issue, I think the best solution in the short run might be the following:

1.) Set an aggressive, but reasonable auto-down timeout
2.) When we get a MemberDown event for any remote cluster instance, we send a YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10 seconds
3.) If a cluster instance gets a YouAreDown message, then it reboots itself to be able to rejoin the cluster despite being auto-downed

We could optimize step 2 a bit if we knew when the node was reachable despite being down and only send the message then, but the core idea is the same.
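
A rough, hypothetical sketch of what steps 2 and 3 might look like (YouAreDown is a made-up message type, the peer actor path is invented for illustration, and in Akka 2.3 a downed member ultimately surfaces as a MemberRemoved event with previous status Down):

import java.io.Serializable;
import java.util.concurrent.TimeUnit;

import akka.actor.ActorSelection;
import akka.actor.Cancellable;
import akka.actor.UntypedActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent.MemberRemoved;
import akka.cluster.MemberStatus;
import scala.concurrent.duration.Duration;

public class DownedPeerNotifier extends UntypedActor {
    // Hypothetical marker message telling a remote node that it has been downed.
    public static final class YouAreDown implements Serializable {}

    private Cancellable resendTimer;

    @Override
    public void preStart() {
        Cluster.get(getContext().system()).subscribe(getSelf(), MemberRemoved.class);
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof MemberRemoved
                && ((MemberRemoved) message).previousStatus() == MemberStatus.down()) {
            // Step 2: repeatedly tell the downed node (via remoting) that it was downed.
            final ActorSelection peer = getContext().actorSelection(
                    ((MemberRemoved) message).member().address() + "/user/downed-peer-notifier");
            resendTimer = getContext().system().scheduler().schedule(
                    Duration.create(0, TimeUnit.SECONDS), Duration.create(10, TimeUnit.SECONDS),
                    new Runnable() {
                        @Override
                        public void run() {
                            peer.tell(new YouAreDown(), getSelf());
                        }
                    }, getContext().dispatcher());
        } else if (message instanceof YouAreDown) {
            // Step 3: this node has learned it was auto-downed; restart its ActorSystem
            // here (restart mechanics omitted from this sketch).
        }
    }
}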

Comment by Tom Pantelis [ 13/Aug/15 ]

When a MemberUp occurs for the downed node, stop the YouAreDown timer. Also, I think we could ignore YouAreDown until the node receives the first MemberUp for itself. That should really minimize the chance of a "false" redundant restart, as the other nodes would also get the MemberUp within a short period of time.

(In reply to Colin Dixon from comment #10)
> Based on some talking with TomP and discussion on the Akka issue, I think
> the best solution in the short run might be the following:
>
> 1.) Set an aggressive, but reasonable auto-down timeout
> 2.) When we get a MemberDown event any remote cluster instance, we send a
> YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10
> seconds
> 3.) If a cluster instance gets a YouAreDown message, then it reboots itself
> to be able to rejoin the cluster despite being auto-downed
>
> We could optimize step 2 a bit if we knew when the node was reachable
> despite being down and only send the message then, but the core idea is the
> same.

Comment by Phillip Shea [ 13/Aug/15 ]

(In reply to Colin Dixon from comment #10)

Wouldn't setting auto-down to 10 seconds, then rebooting to rejoin, cause performance to fall off a cliff in an environment with intermittent connections? It takes over a minute to restart with a minimum set of modules installed. This would mean a 10 second interruption would result in the controller being unavailable for more than a minute. Am I wrong?

> Based on some talking with TomP and discussion on the Akka issue, I think
> the best solution in the short run might be the following:
>
> 1.) Set an aggressive, but reasonable auto-down timeout
> 2.) When we get a MemberDown event any remote cluster instance, we send a
> YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10
> seconds
> 3.) If a cluster instance gets a YouAreDown message, then it reboots itself
> to be able to rejoin the cluster despite being auto-downed
>
> We could optimize step 2 a bit if we knew when the node was reachable
> despite being down and only send the message then, but the core idea is the
> same.

Comment by Tom Pantelis [ 13/Aug/15 ]

We would just restart the actor systems, not the process. It looks like we have no choice based on the discussions with the akka dev. He did mention a new feature that is not released yet that he couldn't talk about. Even if this new feature helps us here, it will be a while before we could upgrade akka.
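
For reference, restarting just the actor system in-process on Akka 2.3 could look roughly like this sketch (not the actual patch; config handling is simplified):

import akka.actor.ActorSystem;
import com.typesafe.config.Config;

public final class ActorSystemRestarter {
    // Shut down the old system and bring up a fresh one with the same name and config,
    // which gives the node a new incarnation that can rejoin after being downed.
    public static ActorSystem restart(ActorSystem oldSystem, Config config) {
        String name = oldSystem.name();
        oldSystem.shutdown();            // Akka 2.3 API (terminate() in later versions)
        oldSystem.awaitTermination();    // block until the old system is fully stopped
        return ActorSystem.create(name, config);
    }
}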

If Colin's idea works, then with auto-down set lower (e.g. 30s to give it some cushion), the issue with the "140" tests would be alleviated and we would also have automatic recovery of partitioned nodes.

A potential issue could be if both sides of the partition somehow see each other as down and, upon healing, each sends YouAreDown to the other side. Not sure if that could happen. If so, it probably could only happen in clusters larger than 3.

(In reply to Phillip Shea from comment #12)
> (In reply to Colin Dixon from comment #10)
>
> Wouldn't setting auto-down to 10 seconds, then rebooting to rejoin cause
> performance fall of a cliff in a environment with intermittent connections?
> It takes over a minute to restart with a minimum set of modules installed.
> This would mean a 10 second interruption would result in the controller
> being unavailable for more than a minute. Am I wrong?
>
>
>
> > Based on some talking with TomP and discussion on the Akka issue, I think
> > the best solution in the short run might be the following:
> >
> > 1.) Set an aggressive, but reasonable auto-down timeout
> > 2.) When we get a MemberDown event any remote cluster instance, we send a
> > YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10
> > seconds
> > 3.) If a cluster instance gets a YouAreDown message, then it reboots itself
> > to be able to rejoin the cluster despite being auto-downed
> >
> > We could optimize step 2 a bit if we knew when the node was reachable
> > despite being down and only send the message then, but the core idea is the
> > same.

Comment by Phillip Shea [ 13/Aug/15 ]

(In reply to Tom Pantelis from comment #13)

Cool. Thanks for the explanation, Tom.

> We would just restart the actor systems, not the process. It looks like we
> have no choice based on the discussions with the akka dev. He did mention a
> new feature that is not released yet that he couldn't talk about. Even if
> this new feature helps us here, it will be a while before we could upgrade
> akka.
>
> If Colin's idea works and with auto-down set lower (e.g. 30s to give it some
> cushion), the issue with the "140" tests would be alleviated and we would
> also have automatic recovery of partitioned nodes.
>
> A potential issue could be if both sides of the partition somehow see each
> other down and, upon healing, each sends YouAreDown to the other side. Not
> sure if that could happen. If so, it probably could only happen in clusters
> larger than 3.
>
> (In reply to Phillip Shea from comment #12)
> > (In reply to Colin Dixon from comment #10)
> >
> > Wouldn't setting auto-down to 10 seconds, then rebooting to rejoin cause
> > performance fall of a cliff in a environment with intermittent connections?
> > It takes over a minute to restart with a minimum set of modules installed.
> > This would mean a 10 second interruption would result in the controller
> > being unavailable for more than a minute. Am I wrong?
> >
> >
> >
> > > Based on some talking with TomP and discussion on the Akka issue, I think
> > > the best solution in the short run might be the following:
> > >
> > > 1.) Set an aggressive, but reasonable auto-down timeout
> > > 2.) When we get a MemberDown event any remote cluster instance, we send a
> > > YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10
> > > seconds
> > > 3.) If a cluster instance gets a YouAreDown message, then it reboots itself
> > > to be able to rejoin the cluster despite being auto-downed
> > >
> > > We could optimize step 2 a bit if we knew when the node was reachable
> > > despite being down and only send the message then, but the core idea is the
> > > same.

Comment by Colin Dixon [ 13/Aug/15 ]

Short version: if we really have intermittent connections such that we routinely lose nodes for 10+ seconds, yes.

In reality, I'm not sure we can function in such an environment anyway, regardless of this solution.

(In reply to Phillip Shea from comment #12)
> (In reply to Colin Dixon from comment #10)
>
> Wouldn't setting auto-down to 10 seconds, then rebooting to rejoin cause
> performance fall of a cliff in a environment with intermittent connections?
> It takes over a minute to restart with a minimum set of modules installed.
> This would mean a 10 second interruption would result in the controller
> being unavailable for more than a minute. Am I wrong?

Comment by Gary Wu [ 10/Sep/15 ]

I'll be working on this issue per Tom's request.

Comment by Gary Wu [ 18/Sep/15 ]

A quick update on where I'm at with this bug.

I had a prototype implemented using the YouAreDown message mechanism as suggested by Colin and Tom.

While testing this out, I ran into an issue: sometimes the YouAreDown messages would not make it to the destination node because the Akka association (connection) to the auto-downed node had been quarantined. When this happens, there's no way to communicate with the auto-downed node to tell it to restart.

Since the quarantine is what requires the ActorSystem reboot in the first place, I'm now exploring the possibility of relying on Akka's own quarantine detection and using that as a signal to reboot the ActorSystem, instead of sending our own YouAreDown messages.

This is going okay so far, except for two issues:

1. Even though Akka can accurately detect being quarantined by a remote node, it doesn't bubble this up as a named exception. So I'm having to resort to string matching on the exception message, which is fragile.

2. Occasionally, even after the ActorSystem has been restarted, it will fail to rejoin the cluster properly (i.e. the other nodes would fail to get the MemberUp event). In a three node cluster, this would happen on just one of the nodes. I'm still investigating this one.

Since we have a separate Akka cluster for the RPC system, I guess we would need to implement something similar for that as well?
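
A minimal sketch of the string-matching detection from point 1 above (the event type and the exact message text are assumptions against Akka 2.3 and may need adjusting):

import akka.actor.UntypedActor;
import akka.remote.AssociationErrorEvent;

public class QuarantinedMonitorActor extends UntypedActor {
    // Assumed wording of the Akka 2.3 remoting error; verify against the Akka version in use.
    private static final String QUARANTINED_MSG = "The remote system has quarantined this system";

    private final Runnable restartCallback;

    public QuarantinedMonitorActor(Runnable restartCallback) {
        this.restartCallback = restartCallback;
    }

    @Override
    public void preStart() {
        // Remoting lifecycle events are published on the actor system's event stream.
        getContext().system().eventStream().subscribe(getSelf(), AssociationErrorEvent.class);
    }

    @Override
    public void onReceive(Object message) {
        if (message instanceof AssociationErrorEvent) {
            Throwable cause = ((AssociationErrorEvent) message).cause();
            // No dedicated exception type is exposed for this, so match the message text (fragile).
            if (cause != null && cause.getMessage() != null
                    && cause.getMessage().contains(QUARANTINED_MSG)) {
                restartCallback.run();
            }
        }
    }
}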

Comment by Tom Pantelis [ 18/Sep/15 ]

Thanks for exploring this. I was afraid the YouAreDown message might get blocked. It would be ideal if akka could provide some indication to our code on the partitioned node that it's been quarantined - it sounds like you may have achieved that?

I imagine we would have to do the same for the RPC actor system. However, we've talked about using just one actor system. I don't remember if there's a bug for that, but that could be another task to work on if you want.

Comment by Gary Wu [ 18/Sep/15 ]

Quarantine detection works if we're okay with doing string matching on the exception message. Hopefully Akka doesn't change their exception messages often.

Right now the main challenge is figuring out why restarting the ActorSystem can usually, but not always, allow the node to rejoin the cluster.

I did observe, on occasion, that nodes on two sides of a partition mutually quarantine each other. I haven't yet put much thought into how that case should be handled.

If there is an existing bug on consolidating the datastore and RPC cluster systems, go ahead and send it my way.

Comment by Gary Wu [ 30/Sep/15 ]

Quick update on this bug.

I'm still working on the ActorSystem restart issue. Namely, sporadically, after a node restarts its ActorSystem, it is not able to rejoin the cluster: the other two nodes do not get the MemberUp notification, and the node in question becomes a single-node cluster. This happens despite the fact that Akka Remoting seems to be able to re-associate fine with the newly restarted node and drop the quarantine status.

Still investigating.

Comment by Gary Wu [ 01/Oct/15 ]

Looks like the cluster rejoin problem occurs when the restarting node happens to be the first seed node specified in its own akka.conf. Since Akka Clustering will join the seed node that responds first, sometimes the seed node that responds first is the node itself, and it ends up joining only itself and forming a single-node cluster.

To prevent this, during ActorSystem restarts, the list of seed nodes should not contain the self address of the node that is restarting. My plan is to:

For the initial boot, the configuration in akka.conf will be used as is.
For ActorSystem restarts, prepare a separate Akka config in memory that is the same as akka.conf but removes the node's self address from the list of seed nodes.

Let me know if you have any thoughts on this approach.
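
A sketch of the in-memory config adjustment (assuming the Typesafe Config API; the path is the standard akka.cluster.seed-nodes setting, and selfAddress would be the restarting node's own address, e.g. "akka.tcp://opendaylight-cluster-data@10.18.162.170:2550"):

import java.util.ArrayList;
import java.util.List;

import com.typesafe.config.Config;
import com.typesafe.config.ConfigValueFactory;

public final class RestartConfigFactory {
    // Build the config for an ActorSystem restart: identical to the original akka.conf
    // except that the node's own address is dropped from the seed node list, so the
    // node cannot "join itself" and form a single-node cluster.
    public static Config forRestart(Config original, String selfAddress) {
        List<String> seedNodes = new ArrayList<>(
                original.getStringList("akka.cluster.seed-nodes"));
        seedNodes.remove(selfAddress);
        return original.withValue("akka.cluster.seed-nodes",
                ConfigValueFactory.fromIterable(seedNodes));
    }
}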

Comment by Tom Pantelis [ 01/Oct/15 ]

That sounds reasonable. Moiz has a draft patch to use a single ActorSystem for CDS and RPC which will make things easier.

Comment by Gary Wu [ 02/Oct/15 ]

I've made an initial commit here:

https://git.opendaylight.org/gerrit/#/c/27852/

The case of mutual quarantines hasn't been addressed yet. On rare occasions when multiple nodes restart simultaneously (due to mutual quarantines), islands can form. Will need to think of a good way to address this scenario next.

Comment by Gary Wu [ 21/Oct/15 ]

In regard to the seed node issue (initial seed node intermittently forms an island on restart):

I have verified that the initial seed node is unable to rejoin the cluster around 10% of the time, on both Akka 2.3.10 and 2.3.14.

According to Patrik Nordwall from Typesafe, this is supposed to work, so maybe there is a bug in Akka.

I've created the following issues against Akka:

Cluster initial seed node intermittently fails to rejoin cluster on restart
https://github.com/akka/akka/issues/18757

Add named exception to detect when a cluster node has been quarantined by others
https://github.com/akka/akka/issues/18758

Comment by Gary Wu [ 21/Oct/15 ]

I re-did some tests, and found that I can eliminate the initial seed node rejoin error by increasing seed-node-timeout to 10s. This means that the Akka documentation was correct, and it was something specific to my test system that was sporadically preventing nodes 2 and 3 from responding to the first node within the default seed-node-timeout of 5s.

This means that we can proceed with the node restart without having to modify the seed node configurations.

What is the best way to restart the container? Do we want to run an external script? Or is it easy to programmatically restart bundle 0? (I'm not familiar with this.)

Comment by Tom Pantelis [ 22/Oct/15 ]

I've programmatically restarted karaf in the past, but it was a few years ago. Here's a link: http://karaf.922171.n3.nabble.com/karaf-programmatic-restart-td4035671.html. It says to set the "karaf.restart" property, which I vaguely remember. However, I don't know if this is still valid with karaf 3.x.
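
If that property is still honored, the programmatic restart could be as simple as the following sketch (untested against karaf 3.x, per the caveat above):

import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleException;

public final class KarafRestarter {
    // Set karaf.restart so the launcher restarts the JVM, then stop the framework
    // (bundle 0), which shuts the running container down.
    public static void restart(BundleContext bundleContext) throws BundleException {
        System.setProperty("karaf.restart", "true");
        bundleContext.getBundle(0).stop();
    }
}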

Comment by Tom Pantelis [ 23/Oct/15 ]

Patch https://git.opendaylight.org/gerrit/#/c/27852/

Comment by Gary Wu [ 23/Oct/15 ]

In https://github.com/akka/akka/issues/18757 the Akka team is finding strange issues in our logs, so there may be a bug they will need to address. Nonetheless, the intended behavior is that the node can rejoin without having to remove itself from the seed node config, so we can move forward with our patch to restart the karaf container.
