|
With additional debugs this seems like a very weird behavior in akka.
Rejoin of node1 seems to trigger an unreachable from node 2.
2017-05-31 12:57:31,375 | INFO | h for user karaf | command | 46 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Rejoining node1.
We get a couple of heartbeats first we dont get a response form member-1
2017-05-31 12:57:31,920 | DEBUG | lt-dispatcher-37 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:31,920 | DEBUG | lt-dispatcher-37 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:31,921 | DEBUG | lt-dispatcher-38 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:32,919 | DEBUG | lt-dispatcher-60 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:32,920 | DEBUG | lt-dispatcher-60 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:32,921 | DEBUG | lt-dispatcher-30 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:33,919 | DEBUG | lt-dispatcher-57 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:33,920 | DEBUG | lt-dispatcher-57 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:33,921 | DEBUG | lt-dispatcher-60 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:34,920 | DEBUG | ult-dispatcher-3 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:34,920 | DEBUG | ult-dispatcher-3 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:34,922 | DEBUG | lt-dispatcher-37 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
Then we finally get a heartbeat from member-1
2017-05-31 12:57:35,919 | DEBUG | ult-dispatcher-3 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:35,920 | DEBUG | ult-dispatcher-3 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat to [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:35,921 | DEBUG | lt-dispatcher-30 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550]
2017-05-31 12:57:35,922 | DEBUG | lt-dispatcher-30 | ClusterHeartbeatSender | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Heartbeat response from [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550]
2017-05-31 12:57:36,550 | INFO | lt-dispatcher-59 | kka://opendaylight-cluster-data) | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.4:2550, status = Up)]. Node roles [member-3]
And then we receive a gossip from member-1 which seems to trigger the following message and an unreachable on member-2
2017-05-31 12:57:36,967 | DEBUG | ult-dispatcher-3 | ClusterCoreDaemon | 201 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@172.17.0.6:2550] - Receiving gossip from [UniqueAddress(akka.tcp://opendaylight-cluster-data@172.17.0.4:2550,20791887)]
2017-05-31 12:57:36,969 | DEBUG | ult-dispatcher-3 | ClusterCoreDaemon | 201 - com.typesafe.akka.slf4j - 2.4.17 | Couldn't establish a causal relationship between "remote" gossip and "local" gossip - Remote[Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.4:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.5:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.6:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.5:2550: Unreachable [Unreachable] (7), akka.tcp://opendaylight-cluster-data@172.17.0.4:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.6:2550: Reachable [Reachable] (9)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@172.17.0.4:2550,20791887)]), version = VectorClock(950de35185ba8f3de15d6e342fd7dbf5 -> 11, ed3be0f6d0835e2fa024f22583639b6c -> 6, f0c01f773a4236a03ed6299dcab410f6 -> 4))] - Local[Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.4:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.5:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.6:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@172.17.0.5:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.4:2550: Unreachable [Unreachable] (7)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@172.17.0.6:2550,-602086493), UniqueAddress(akka.tcp://opendaylight-cluster-data@172.17.0.5:2550,-1976714323)]), version = VectorClock(950de35185ba8f3de15d6e342fd7dbf5 -> 9, ed3be0f6d0835e2fa024f22583639b6c -> 7, f0c01f773a4236a03ed6299dcab410f6 -> 6))] - merged them into [Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.4:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.5:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@172.17.0.6:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@172.17.0.4:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.5:2550: Unreachable [Unreachable] (7), akka.tcp://opendaylight-cluster-data@172.17.0.4:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.6:2550: Reachable [Reachable] (9), akka.tcp://opendaylight-cluster-data@172.17.0.5:2550 -> akka.tcp://opendaylight-cluster-data@172.17.0.4:2550: Unreachable [Unreachable] (7)], seen = []), version = VectorClock(950de35185ba8f3de15d6e342fd7dbf5 -> 11, ed3be0f6d0835e2fa024f22583639b6c -> 7, f0c01f773a4236a03ed6299dcab410f6 -> 6))]
2017-05-31 12:57:36,969 | INFO | ult-dispatcher-3 | ShardManager | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Received UnreachableMember: memberName MemberName
{name=member-2}
, address: akka.tcp://opendaylight-cluster-data@172.17.0.5:2550
2017-05-31 12:57:36,969 | INFO | lt-dispatcher-60 | ShardManager
Overall this seems very strange and possibly a bug on akka side since there are no missed heartbeats from either member in this case.
|
|
> And then we receive a gossip from member-1 which seems to trigger the
> following message and an unreachable on member-2
This is the same phenomenon as in CONTROLLER-1670 [10].
> We get a couple of heartbeats first
This is additional suspicious behavior, it does not happen in CONTROLLER-1670.
Thus it might be easier to focus on CONTROLLER-1670 hoping fixing that will also fix this.
Here is the critical part of huge karaf log [11] which made the suite detect leader movement [12] on Sandox:
2017-07-25 12:24:39,103 | DEBUG | lt-dispatcher-20 | EndpointWriter | 174 - com.typesafe.akka.slf4j - 2.4.18 | received local message RemoteMessage: [ActorSelectionMessage(akka.cluster.GossipEnvelope@350706c1,Vector(system, cluster, cor
e, daemon),false)] to [Actor[akka://opendaylight-cluster-data/]]<+[akka://opendaylight-cluster-data/] from [Actor[akka.tcp://opendaylight-cluster-data@10.29.15.173:2550/system/cluster/core/daemon#-465045840]()]
2017-07-25 12:24:39,104 | DEBUG | lt-dispatcher-20 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.156:2550] - Receiving gossip from [UniqueAddress(akka.tcp://opend
aylight-cluster-data@10.29.15.173:2550,-1961660878)]
2017-07-25 12:24:39,105 | DEBUG | lt-dispatcher-19 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.18 | Couldn't establish a causal relationship between "remote" gossip and "local" gossip - Remote[Gossip(members = [Member(address
= akka.tcp://opendaylight-cluster-data@10.29.12.150:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.156:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.173:2550, status = Up)], overvi
ew = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.15.173:2550 -> akka.tcp://opendaylight-cluster-data@10.29.12.150:2550: Unreachable [Unreachable] (37), akka.tcp://opendaylight-cluster-data@10.29.15.173:2550 -> akka.tcp://ope
ndaylight-cluster-data@10.29.12.156:2550: Reachable [Reachable] (39)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.15.173:2550,-1961660878)]), version = VectorClock(0509e936b03b785faf066e69a04e621a -> 38, 5b6c3568af1c83fa01ae443deac1
7614 -> 29, c50fc393e7d7dd3654a21a7880f96d0a -> 25))] - Local[Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.150:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.156:2550, status = Up),
Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.173:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.12.150:2550 -> akka.tcp://opendaylight-cluster-data@10.29.15.173:2550: Unreachabl
e [Unreachable] (37)], seen = [UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.12.156:2550,615998638), UniqueAddress(akka.tcp://opendaylight-cluster-data@10.29.12.150:2550,-504912273)]), version = VectorClock(0509e936b03b785faf066e69a04e621a ->
36, 5b6c3568af1c83fa01ae443deac17614 -> 30, c50fc393e7d7dd3654a21a7880f96d0a -> 27))] - merged them into [Gossip(members = [Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.150:2550, status = Up), Member(address = akka.tcp://opendaylight-cl
uster-data@10.29.12.156:2550, status = Up), Member(address = akka.tcp://opendaylight-cluster-data@10.29.15.173:2550, status = Up)], overview = GossipOverview(reachability = [akka.tcp://opendaylight-cluster-data@10.29.12.150:2550 -> akka.tcp://opendayligh
t-cluster-data@10.29.15.173:2550: Unreachable [Unreachable] (37), akka.tcp://opendaylight-cluster-data@10.29.15.173:2550 -> akka.tcp://opendaylight-cluster-data@10.29.12.150:2550: Unreachable [Unreachable] (37), akka.tcp://opendaylight-cluster-data@10.29
.15.173:2550 -> akka.tcp://opendaylight-cluster-data@10.29.12.156:2550: Reachable [Reachable] (39)], seen = []), version = VectorClock(0509e936b03b785faf066e69a04e621a -> 38, 5b6c3568af1c83fa01ae443deac17614 -> 30, c50fc393e7d7dd3654a21a7880f96d0a -> 27)
)]
2017-07-25 12:24:39,105 | INFO | rd-dispatcher-34 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName
{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.29.12.150:2550
2017-07-25 12:24:39,105 | DEBUG | lt-dispatcher-19 | ClusterRemoteWatcher | 174 - com.typesafe.akka.slf4j - 2.4.18 | Unwatching: [akka://opendaylight-cluster-data/user/rpc/registry -> akka.tcp://opendaylight-cluster-data@10.29.12.150:2550/user/rpc/broker]
2017-07-25 12:24:39,105 | DEBUG | lt-dispatcher-19 | ClusterRemoteWatcher | 174 - com.typesafe.akka.slf4j - 2.4.18 | Cleanup self watch of [akka.tcp://opendaylight-cluster-data@10.29.12.150:2550/user/rpc/broker]
2017-07-25 12:24:39,105 | INFO | rd-dispatcher-34 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}
, address: akka.tcp://opendaylight-cluster-data@10.29.12.150:2550
2017-07-25 12:24:39,105 | INFO | rd-dispatcher-30 | EntityOwnershipShard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-3-shard-entity-ownership-operational: onPeerDown: PeerDown [memberName=member-2, peerId=member-2-shard-entity-ownership-operational]
2017-07-25 12:24:39,105 | DEBUG | lt-dispatcher-21 | EndpointWriter | 174 - com.typesafe.akka.slf4j - 2.4.18 | sending message RemoteMessage: [Unwatch(Actor[akka.tcp://opendaylight-cluster-data@10.29.12.150:2550/user/rpc/broker#-944022860],Actorakka://opendaylight-cluster-data/system/remote-watcher#1339314764)] to [Actor[akka.tcp://opendaylight-cluster-data@10.29.12.150:2550/user/rpc/broker#-944022860]]<+[akka.tcp://opendaylight-cluster-data@10.29.12.150:2550/user/rpc/broker] from [Actor[akka://opendaylight-cluster-data/deadLetters]]
[10] https://bugs.opendaylight.org/show_bug.cgi?id=8430#c8
[11] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/2/odl3_karaf.log.gz
[12] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-cs-partnheal-longevity-only-carbon/2/log.html.gz#s1-s2-t1-k3-k1-k1-k1-k1-k1-k1-k2-k1-k1-k7-k2-k1-k2-k1-k2-k1-k6
|