[CONTROLLER-1670] Rejoining member can make other member forget remotely registered RPC Created: 12/May/17  Updated: 25/Jul/23  Resolved: 24/Aug/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8430

 Description   

This manifests as a failure in controller-csit-3node-drb-partnheal-longevity-only-carbon/8.

It is a longevity test; the failure happened in iteration 4. Member-1 and member-2 had registered RPC implementations, returning constant-1 and constant-2 respectively (a sketch of such a responder follows the references below). Member-3 did not have any implementation registered, and at the start of the iteration it was returning constant-2.
The test then isolates member-2 and waits for member-3 to start returning constant-1.

It is expected that after member-2 rejoins, member-3 keeps responding with constant-1, but the test instead encountered a 501 response [0].

I suspect this recent fix [1] is the cause and needs a follow-up fix.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-drb-partnheal-longevity-only-carbon/8/archives/log.html.gz#s1-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k6-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1
[1] https://git.opendaylight.org/gerrit/56783
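
For context, the per-member responders amount to registering an RPC implementation that always returns a fixed output. Below is a minimal sketch against the Carbon-era MD-SAL DOM RPC API, written from memory (signatures may differ slightly); the QName and class name are hypothetical stand-ins for the test's actual YANG model:

import com.google.common.util.concurrent.Futures;
import org.opendaylight.controller.md.sal.dom.api.DOMRpcIdentifier;
import org.opendaylight.controller.md.sal.dom.api.DOMRpcProviderService;
import org.opendaylight.controller.md.sal.dom.spi.DefaultDOMRpcResult;
import org.opendaylight.yangtools.yang.common.QName;
import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;
import org.opendaylight.yangtools.yang.model.api.SchemaPath;

// Sketch: register an RPC implementation that always answers with a fixed
// payload, like the "constant-1"/"constant-2" responders on member-1/member-2.
public final class ConstantRpcResponder {
    // Hypothetical RPC name; the real test uses its own YANG model.
    private static final QName RPC_QNAME =
        QName.create("urn:example:basic-rpc-test", "2017-05-12", "get-constant");

    public static void register(final DOMRpcProviderService rpcService,
                                final NormalizedNode<?, ?> constantOutput) {
        rpcService.registerRpcImplementation(
            // DOMRpcImplementation lambda: ignore the input, return the constant.
            (rpc, input) -> Futures.immediateCheckedFuture(
                new DefaultDOMRpcResult(constantOutput)),
            // A global (routed-to-any-node) RPC identifier.
            DOMRpcIdentifier.create(SchemaPath.create(true, RPC_QNAME)));
    }
}

The remote RPC connector advertises such a registration to the other members; member-3, having no local implementation, answers by routing to whichever remote registration it currently knows about, which is exactly the state this bug corrupts.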



 Comments   
Comment by Tomas Cere [ 12/May/17 ]

It seems member-1 was briefly unreachable on member-3, which would explain the forgotten RPC registration.

Comment by Vratko Polak [ 23/May/17 ]

This is still happening occasionally.
Robot symptom: [2], this time in a functional test.
Corresponding karaf.log [3] segment:
2017-05-22 19:01:59,364 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Invoke_Rpc_On_Each_Node_Again
2017-05-22 19:01:59,447 | INFO | lt-dispatcher-32 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://opendaylight-cluster-data@10.29.12.72:2550

Note that there were messages like the one below. It is not clear whether the 501 can happen even without UnreachableMember, if Robot hits the window after shard stabilization but before the first reachable gossip.
2017-05-22 19:01:58,442 | INFO | lt-dispatcher-22 | kka://opendaylight-cluster-data) | 174 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.13.34:2550] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.12.94:2550,1537119417)]

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/719/archives/log.html.gz#s1-s4-t6-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/719/archives/odl3_karaf.log.gz

Comment by Vratko Polak [ 26/May/17 ]

Happened again on Sandbox. UnreachableMember happened briefly, just after the member rejoined. Karaf.log segment from member-3, a surviving member reacting to member-2 rejoining (a minimal listener sketch follows the log segment):
2017-05-25 21:04:49,363 | INFO | ult-dispatcher-4 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational: Received role changed for member-3-shard-toaster-operational from Leader to Follower
2017-05-25 21:04:49,739 | INFO | ult-dispatcher-4 | kka://opendaylight-cluster-data) | 174 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.14.193:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.13.208:2550, status = Up)]. Node roles [member-3]
2017-05-25 21:04:51,563 | INFO | lt-dispatcher-26 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
2017-05-25 21:04:51,563 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
2017-05-25 21:04:51,563 | INFO | lt-dispatcher-26 | EntityOwnershipShard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-entity-ownership-operational: onPeerDown: PeerDown [memberName=member-2, peerId=member-2-shard-entity-ownership-operational]
2017-05-25 21:04:52,549 | INFO | ult-dispatcher-2 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received ReachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
2017-05-25 21:04:52,549 | INFO | ult-dispatcher-2 | ShardInformation | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | updatePeerAddress for peer member-2-shard-default-config with address akka.tcp://opendaylight-cluster-data@10.29.13.49:2550/user/shardmanager-config/member-2-shard-default-config
2017-05-25 21:04:52,549 | INFO | lt-dispatcher-29 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received ReachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
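
Components like ShardManager see these events by subscribing to Akka's cluster event stream. A minimal sketch of such a listener using the Akka classic Java API (the class name and log strings are illustrative, not the actual ShardManager code; this assumes the Akka 2.5+ AbstractActor style, while the logs above are from 2.4.17):

import akka.actor.AbstractActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;

// Illustrative listener for the UnreachableMember/ReachableMember events
// seen in the log segment above.
public class ReachabilityListener extends AbstractActor {
    private final Cluster cluster = Cluster.get(getContext().getSystem());

    @Override
    public void preStart() {
        // Replay current cluster state as events, then receive reachability updates.
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
            ClusterEvent.UnreachableMember.class, ClusterEvent.ReachableMember.class);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(ClusterEvent.UnreachableMember.class,
                e -> System.out.println("Received UnreachableMember: " + e.member()))
            .match(ClusterEvent.ReachableMember.class,
                e -> System.out.println("Received ReachableMember: " + e.member()))
            .build();
    }
}

Note how the UnreachableMember and the compensating ReachableMember for member-2 arrive within about a second of each other; that short churn window is exactly what this issue is about.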

Comment by Vratko Polak [ 26/May/17 ]

Member-2 was isolated for around 41 seconds.

Comment by Vratko Polak [ 29/May/17 ]

Happened [4] on current stable/carbon.

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-drb-partnheal-longevity-only-carbon/12/archives/log.html.gz#s1-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k6-k1-k1-k1-k1-k1-k1-k2-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1

Comment by Vratko Polak [ 05/Jun/17 ]

Also this week: https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-drb-partnheal-longevity-only-carbon/13/console.log.gz

Comment by Vratko Polak [ 25/Jul/17 ]

> UnreachableMember happened briefly, just after the member rejoined.

Using a logging change [6] I have identified a few lines in a huge log [7] related to the robot failure [8]. This is around two seconds after the test rejoined member-2 (10.29.13.236), when member-3 (10.29.15.201, without a local RPC registration) obtains gossip from the rejoining member-2, which has not yet detected that member-1 (10.29.15.65) is reachable. It seems member-3 considers member-2's information to be more up-to-date and marks member-1 as unreachable, thus losing access to the remote RPC registration. It is, however, possible that member-3 already had wrong local reachability information for some other reason.

On the Akka code side, the long "merge" line is reported here [9], referring to the computation done here [10]; a toy model of that computation follows the identified lines below.
The identified lines:

2017-07-25 12:05:27,768 | DEBUG | lt-dispatcher-39 | EndpointWriter | 174 - com.typesafe.akka.slf4j - 2.4.18 | received local message RemoteMessage: [ActorSelectionMessage(akka.cluster.GossipEnvelope@384a4579,Vector(system, cluster, core, daemon),false)] to [Actor[akka://opendaylight-cluster-data/]]<+[akka://opendaylight-cluster-data/] from [Actor[akka.tcp://opendaylight-cluster-data@10.29.13.236:2550/system/cluster/core/daemon#975906890]()]
2017-07-25 12:05:27,775 | DEBUG | lt-dispatcher-39 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.15.201:2550] - Receiving gossip from [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.13.236:2550,1060309017)]
2017-07-25 12:05:27,778 | INFO | rd-dispatcher-41 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://opendaylight-cluster-data@10.29.15.65:2550
2017-07-25 12:05:27,778 | INFO | rd-dispatcher-42 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://opendaylight-cluster-data@10.29.15.65:2550
2017-07-25 12:05:27,778 | DEBUG | lt-dispatcher-20 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.18 | Couldn't establish a causal relationship between "remote" gossip and "local" gossip - Remote[Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@10.29.13.236:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.201:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.65:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.13.236:2550 -> akka.tcp://opendaylight-cluster-data@10.29.15.201:2550: Reachable [Reachable] (5), akka.tcp://opendaylight-cluster-data@10.29.13.236:2550 -> akka.tcp://opendaylight-cluster-data@10.29.15.65:2550: Unreachable [Unreachable] (3)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.13.236:2550,1060309017)]), version = VectorClock(003f54fb4960bf1c7f7b77ad60d9c1eb -> 5, 08e95a0c1ccb68238e08b7e282dd04df -> 3, e2ecf54bf1a5bb7f8e94008a98ac3366 -> 7))] - Local[Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@10.29.13.236:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.201:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.65:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.15.65:2550 -> akka.tcp://opendaylight-cluster-data@10.29.13.236:2550: Unreachable [Unreachable] (5)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.15.201:2550,-3420732)]), version = VectorClock(003f54fb4960bf1c7f7b77ad60d9c1eb -> 2, 08e95a0c1ccb68238e08b7e282dd04df -> 5, e2ecf54bf1a5bb7f8e94008a98ac3366 -> 8))] - merged them into [Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@10.29.13.236:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.201:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.65:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.13.236:2550 -> akka.tcp://opendaylight-cluster-data@10.29.15.201:2550: Reachable [Reachable] (5), akka.tcp://opendaylight-cluster-data@10.29.13.236:2550 -> akka.tcp://opendaylight-cluster-data@10.29.15.65:2550: Unreachable [Unreachable] (3), akka.tcp://opendaylight-cluster-data@10.29.15.65:2550 -> akka.tcp://opendaylight-cluster-data@10.29.13.236:2550: Unreachable [Unreachable] (5)], seen = []), version = VectorClock(003f54fb4960bf1c7f7b77ad60d9c1eb -> 5, 08e95a0c1ccb68238e08b7e282dd04df -> 5, e2ecf54bf1a5bb7f8e94008a98ac3366 -> 8))]
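
To decode the merge line above: each gossip carries a vector clock, and when neither clock dominates the other, Akka treats the two gossips as concurrent and merges them, taking the union of the reachability records [9][10]. A toy model of the clock comparison in plain Java (9+), using the clock values from the log, not Akka's actual code:

import java.util.HashMap;
import java.util.Map;

public final class GossipMergeDemo {
    // True iff every entry of a is at least the matching entry of b,
    // i.e. clock a is causally at-or-after clock b.
    static boolean dominates(final Map<String, Long> a, final Map<String, Long> b) {
        return b.entrySet().stream()
            .allMatch(e -> a.getOrDefault(e.getKey(), 0L) >= e.getValue());
    }

    public static void main(final String[] args) {
        // Vector clocks copied from the merge line above (keys abbreviated).
        final Map<String, Long> remote = Map.of("003f..", 5L, "08e9..", 3L, "e2ec..", 7L);
        final Map<String, Long> local  = Map.of("003f..", 2L, "08e9..", 5L, "e2ec..", 8L);

        // Neither clock dominates the other: the gossips are concurrent,
        // so Akka cannot pick one side and must merge.
        System.out.println(dominates(remote, local)); // false
        System.out.println(dominates(local, remote)); // false

        // The merged clock is the entry-wise maximum, matching the log:
        // 003f.. -> 5, 08e9.. -> 5, e2ec.. -> 8.
        final Map<String, Long> merged = new HashMap<>(remote);
        local.forEach((k, v) -> merged.merge(k, v, Math::max));
        System.out.println(merged);
    }
}

Because the merged reachability table is a union, member-3 ends up holding member-2's stale "member-1 unreachable" observation even though its own gossip never claimed that, which matches the UnreachableMember events above.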

[6] https://git.opendaylight.org/gerrit/#/c/60727/1
[7] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-drb-partnheal-longevity-only-carbon/3/odl3_karaf.log.gz
[8] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-drb-partnheal-longevity-only-carbon/3/log.html.gz#s1-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k6-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1
[9] https://github.com/akka/akka/blob/v2.4.17/akka-cluster/src/main/scala/akka/cluster/ClusterDaemon.scala#L736-L742
[10] https://github.com/akka/akka/blob/v2.4.17/akka-cluster/src/main/scala/akka/cluster/ClusterDaemon.scala#L704-L726

Comment by Robert Varga [ 24/Aug/17 ]

Intermediate churn is expected, as Akka needs a few attempts to restore its vector clocks, during which time it may report incorrect reachability status.
