[CONTROLLER-1670] Rejoining member can make other member forget remotely registered RPC Created: 12/May/17 Updated: 25/Jul/23 Resolved: 24/Aug/17
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| External issue ID: | 8430 |
| Description |
|
This manifests as a failure in controller-csit-3node-drb-partnheal-longevity-only-carbon/8. It is a longevity test; the failure happened in iteration 4. Member-1 and member-2 had registered RPC implementations (returning constant-1 and constant-2, respectively). Member-3 did not have any implementation registered, and at the start of the iteration it was responding with constant-2. After member-2 is isolated, member-3 switches to constant-1 (member-1's implementation); it is expected to keep responding with constant-1 after member-2 rejoins, but the test encountered a 501 response instead [0]. I suspect this recent fix [1] is the cause and should itself be fixed.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-drb-partnheal-longevity-only-carbon/8/archives/log.html.gz#s1-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k6-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1
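For context, here is a minimal sketch of how a member might register such a constant-returning RPC implementation with the Carbon-era Binding-Aware RPC broker. BasicRpcService, GetConstantInput/GetConstantOutput and their builders are hypothetical stand-ins for generated YANG bindings; this is not the actual CSIT test-app code.

    // Hedged sketch only: BasicRpcService and its input/output classes are
    // hypothetical stand-ins for a generated RpcService interface; the broker
    // API (RpcProviderRegistry.addRpcImplementation) is the Carbon-era one.
    import java.util.concurrent.Future;

    import org.opendaylight.controller.sal.binding.api.BindingAwareBroker.RpcRegistration;
    import org.opendaylight.controller.sal.binding.api.RpcProviderRegistry;
    import org.opendaylight.yangtools.yang.common.RpcResult;
    import org.opendaylight.yangtools.yang.common.RpcResultBuilder;

    public final class ConstantRpcProvider implements BasicRpcService {
        private final String constant;

        private ConstantRpcProvider(final String constant) {
            this.constant = constant;
        }

        @Override
        public Future<RpcResult<GetConstantOutput>> getConstant(final GetConstantInput input) {
            // Always answer with this member's constant. A member without a
            // local registration (member-3 here) routes the call remotely.
            return RpcResultBuilder.success(
                new GetConstantOutputBuilder().setConstant(constant).build()).buildFuture();
        }

        // member-1 registers "constant-1", member-2 registers "constant-2";
        // member-3 registers nothing and depends on the remote registrations.
        public static RpcRegistration<BasicRpcService> register(
                final RpcProviderRegistry registry, final String constant) {
            return registry.addRpcImplementation(BasicRpcService.class,
                new ConstantRpcProvider(constant));
        }
    }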
| Comments |
| Comment by Tomas Cere [ 12/May/17 ] |
|
Seems like member-1 was briefly unreachable on member-3, which would explain the forgotten RPC registration.
| Comment by Vratko Polak [ 23/May/17 ] |
|
This is still happening occasionally [2]. Note that there were messages like the following (truncated in this comment):

, address: akka.tcp://opendaylight-cluster-data@10.29.12.72:2550

It is not clear whether 501 can happen even without UnreachableMember, if Robot hits the window after shard stabilization but before the first reachability gossip.

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/719/archives/log.html.gz#s1-s4-t6-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k3-k1-k1-k1-k2-k1-k4-k7-k1
| Comment by Vratko Polak [ 26/May/17 ] |
|
Happened again on Sandbox. UnreachableMember happened briefly, just after rejoining the member. A karaf.log segment from member-3, a surviving member reacting to member-2 rejoining:

2017-05-25 21:04:51,563 | INFO | ult-dispatcher-6 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
2017-05-25 21:04:52,549 | INFO | ult-dispatcher-2 | ShardInformation | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | updatePeerAddress for peer member-2-shard-default-config with address akka.tcp://opendaylight-cluster-data@10.29.13.49:2550/user/shardmanager-config/member-2-shard-default-config
2017-05-25 21:04:52,549 | INFO | lt-dispatcher-29 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Received ReachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.13.49:2550
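The UnreachableMember/ReachableMember notifications in the log above come from akka cluster event subscriptions. A minimal standalone sketch of that mechanism follows, using the plain akka 2.4 Java API; this is not the actual ShardManager code.

    // Sketch of subscribing to akka cluster reachability events, the
    // mechanism behind the ShardManager log lines above (akka 2.4 Java API).
    import akka.actor.UntypedActor;
    import akka.cluster.Cluster;
    import akka.cluster.ClusterEvent;
    import akka.cluster.ClusterEvent.ReachableMember;
    import akka.cluster.ClusterEvent.UnreachableMember;

    public class ReachabilityListener extends UntypedActor {
        private final Cluster cluster = Cluster.get(getContext().system());

        @Override
        public void preStart() {
            // Replay the current cluster state as events, then deliver
            // reachability changes as the failure detector and gossip see them.
            cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
                UnreachableMember.class, ReachableMember.class);
        }

        @Override
        public void postStop() {
            cluster.unsubscribe(getSelf());
        }

        @Override
        public void onReceive(final Object message) {
            if (message instanceof UnreachableMember) {
                // A listener reacting here is where remote RPC registrations
                // of the member may be dropped, as suspected in this issue.
                System.out.println("Unreachable: " + ((UnreachableMember) message).member());
            } else if (message instanceof ReachableMember) {
                System.out.println("Reachable: " + ((ReachableMember) message).member());
            } else {
                unhandled(message);
            }
        }
    }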
| Comment by Vratko Polak [ 26/May/17 ] |
|
Member-2 was isolated for around 41 seconds. |
| Comment by Vratko Polak [ 29/May/17 ] |
|
Happened [4] on current stable/carbon. |
| Comment by Vratko Polak [ 05/Jun/17 ] |
|
Also this week: https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-drb-partnheal-longevity-only-carbon/13/console.log.gz |
| Comment by Vratko Polak [ 25/Jul/17 ] |
|
> UnreachableMember happened briefly, just after rejoining the member.

Using a logging change [6] I have identified a few lines in a huge log [7] related to the Robot failure [8]. This is around two seconds after the test rejoined member-2 (10.29.15.65), when member-3 (10.29.15.201, without a local RPC registration) obtains gossip from the rejoining member-2, which has not yet detected that member-1 (10.29.15.65) is reachable. It seems member-3 considers member-2 to be more up-to-date and marks member-1 as unreachable, thus losing access to the remote RPC registration. It is also possible, though, that member-3 already had wrong local reachability information for some reason. On the akka code side, the long "merge" line is reported here [9], referring to the computation done here [10]; a toy illustration of such a merge follows after the log excerpt.

2017-07-25 12:05:27,768 | DEBUG | lt-dispatcher-39 | EndpointWriter | 174 - com.typesafe.akka.slf4j - 2.4.18 | received local message RemoteMessage: [ActorSelectionMessage(akka.cluster.GossipEnvelope@384a4579,Vector(system, cluster, cor pendaylight-cluster-data@10.29.15.65:2550
2017-07-25 12:05:27,778 | INFO | rd-dispatcher-42 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-1}, address: akka.tcp://o

[6] https://git.opendaylight.org/gerrit/#/c/60727/1
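To illustrate the merge referenced above, here is a toy vector clock assuming the standard pairwise-maximum technique for concurrent versions. This is emphatically not akka's actual VectorClock/Gossip code, only a sketch of why a receiver that cannot establish dominance must merge, and may thereby transiently adopt the sender's stale reachability view.

    // Toy vector clock: when neither clock dominates, the two gossips are
    // concurrent and must be merged; the merged state may carry the sender's
    // stale reachability records, matching the transient UnreachableMember
    // for member-1 seen above. NOT akka's implementation.
    import java.util.HashMap;
    import java.util.Map;

    public final class VectorClock {
        private final Map<String, Long> versions = new HashMap<>();

        /** Bump this node's counter when it produces a new gossip version. */
        public void increment(final String node) {
            versions.merge(node, 1L, Long::sum);
        }

        /** True if this clock is at or ahead of the other for every node. */
        public boolean dominates(final VectorClock other) {
            return other.versions.entrySet().stream().allMatch(
                e -> versions.getOrDefault(e.getKey(), 0L) >= e.getValue());
        }

        /** Pairwise maximum of counters: the clock of the merged gossip. */
        public VectorClock merge(final VectorClock other) {
            final VectorClock merged = new VectorClock();
            merged.versions.putAll(versions);
            other.versions.forEach((node, v) -> merged.versions.merge(node, v, Math::max));
            return merged;
        }
    }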
| Comment by Robert Varga [ 24/Aug/17 ] |
|
Interim churn is expected, as akka needs a few attempts to restore its vector clocks, during which time it may report an incorrect reachability status.