[CONTROLLER-1273] Clustering: Shard actors are terminated when another cluster node is restarted Created: 23/Apr/15  Updated: 25/Jul/23  Resolved: 05/May/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 3049

 Description   

I'm seeing strange behavior when a node in a 3-node cluster is restarted. It somehow causes the Shard and ShardManager actors on the other nodes in the cluster to terminate. I see these messages in the log:

2015-04-22 13:11:08,317 | INFO | lt-dispatcher-19 | Shard | 177 | 227 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | | Stopping Shard member-1-shard-topology-operational

2015-04-22 13:11:08,323 | INFO | lt-dispatcher-18 | ShardManager | 159 | 234 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | | Stopping ShardManager

There are no other messages in the log except the usual akka INFO messages about node addresses being gated and nodes leaving and joining the cluster.

Note that this occurs after the node is started back up, not after it is shut down. Right after the Shard stopping messages above, I see the akka message that the downed node is now re-joining:

2015-04-22 13:14:08,412 | INFO | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$3 | 74 | 220 - com.typesafe.akka.slf4j - 2.3.9 | | Cluster Node [akka.tcp://odl-cluster-rpc@127.0.0.1:2551] - Node [akka.tcp://odl-cluster-rpc@127.0.0.1:2555] is JOINING, roles []

I have all 3 nodes running in the same VM on different ports, so I'm not sure if that's a factor, but I've been running with this setup for a while without seeing this issue.



 Comments   
Comment by Moiz Raja [ 23/Apr/15 ]

Can you attach the full log? Besides the logging, what other behavior are you observing?

Comment by Tom Pantelis [ 23/Apr/15 ]

The Shards and ShardManager disappear from JConsole.

I'm curious if you can reproduce this as well or if there's something fluky going on in my environment. I haven't dug into this yet with all the other patches going on right now.

Comment by Moiz Raja [ 23/Apr/15 ]

I haven't seen this problem yet. Possibly the ActorSystem was shut down; why that happened may be in the logs.

Comment by Tom Pantelis [ 30/Apr/15 ]

I found this link, https://groups.google.com/forum/#!topic/akka-user/jleFC7P66ao, which appears to be recent. It outlines the same issue I'm seeing - it's a bug in akka, related to a node becoming reachable again after being auto-downed.

On my controller instances I still had the auto-down-unreachable-after setting at 10s. I set it really high and the actor system shutdown did not occur. That's good news, but it does seem auto-downing is problematic at this point.
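
For reference, here's the knob in question as a minimal HOCON sketch (the enclosing sections of our actual akka.conf are omitted here):

    akka {
      cluster {
        # what I originally had: auto-down an unreachable member after 10 seconds
        auto-down-unreachable-after = 10s

        # what I effectively did to avoid the shutdown: set it really high
        # ("off", the akka default, disables auto-downing entirely)
        # auto-down-unreachable-after = off
      }
    }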

However, the 3rd node, which I had restarted, didn't join back into the cluster. On the other 2 nodes I see this INFO message every few seconds: "Existing member [address of 3rd node] is trying to join, ignoring". Not sure what's going on there. Interestingly, the restarted node did become a follower, as it appears the leader was able to send heartbeats because we cache the remote actor address in the RaftActorContext. However, the restarted node did not have peer addresses for the other 2, so it did not get ClusterMemberUp messages.

Also, akka's ClusterState mbean reported the stopped node as Unreachable, which makes sense. However, it also listed it in the members list as Up, which doesn't seem right. It continued to report this state even after the node was restarted.
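
(I'm reading that from the cluster mbean akka registers over JMX - in JConsole it shows up under an object name like akka:type=Cluster, with Members, Unreachable and ClusterStatus attributes, though the exact object/attribute names may vary by akka version.)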

So it seems either the cluster leader didn't move the restarted node back to Up or the endpoint layer didn't report reachability, as evidenced by the "ignoring" message. It's unclear what it was ignoring and why.

It seems akka's clustering code is either really funky or really buggy.

Comment by Tom Pantelis [ 30/Apr/15 ]

After reading other posts, e.g. https://groups.google.com/forum/#!topic/akka-user/AdRSv2yuwo4, it seems clear that a node with the same host:port can't re-join until the previous member with the same host:port has been removed from the cluster, which happens after auto-downing it. However, auto-downing seems to trigger the endpoint layer bug that shuts down the actor system.

I really don't get the rationale behind their design.

Comment by Moiz Raja [ 30/Apr/15 ]

Going through the 2.3.10 release notes (http://akka.io/news/2015/04/23/akka-2.3.10-released.html) I spotted this:

  • remove wrong assertion in remoting, which could lead to ActorSystem termination when restarted remote ActorSystem connects after being quarantined

Maybe we should switch to akka 2.3.10; it has a bunch of remoting/clustering-related fixes.
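
The switch itself should just be a version bump in the poms - something like the following, assuming the akka version is managed via a single property (the property name and the Scala binary-version suffix are illustrative; ours may differ):

    <!-- root pom <properties> -->
    <akka.version>2.3.10</akka.version>

    <!-- the akka artifacts then pick it up, e.g. -->
    <dependency>
      <groupId>com.typesafe.akka</groupId>
      <artifactId>akka-cluster_2.10</artifactId>
      <version>${akka.version}</version>
    </dependency>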

Comment by Tom Pantelis [ 30/Apr/15 ]

Yup - that's the issue. We should upgrade.

Comment by Moiz Raja [ 30/Apr/15 ]

More details: https://github.com/akka/akka/issues/17213

Comment by Tom Pantelis [ 01/May/15 ]

I tested upgrading to 2.3.10. I had auto-down-unreachable-after set to 10s and, after stopping and restarting a node after auto-down, the actor system shutdown issue didn't occur and the node successfully rejoined the cluster. So that issue is fixed.

I also tested with auto-down-unreachable-after set high. After restarting a node, it successfully rejoined the cluster. So that issue is fixed as well. Tried it twice.

Akka's ClusterState mbean still reported the member as both Up and Unreachable, but I can live with that.

Looks like we have a stable akka release wrt remoting.
