[CONTROLLER-1713] RequestTimeoutException after remove-shard-replica with "transferred leadership to null" Created: 08/Jun/17  Updated: 25/Jul/23  Resolved: 13/Sep/17

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8639

 Description   

Seen [0] on RelEng for the first time, with a module-based shard (tell-based protocol).
Member-3 was the old leader; member-2 became the new leader.

This is similar to CONTROLLER-1693 in that member-2 has seen UnreachableMember after remove-shard-replica.
This is also similar to CONTROLLER-1705 in that the client (at member-3) has not properly reconnected to the new leader.

The quoted part of the title is seen in member-3's karaf.log [1]:
2017-06-08 07:26:38,108 | INFO | lt-dispatcher-29 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.Carbon | member-3-shard-default-config: Successfully transferred leadership to null in 3.074 s

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/736/log.html.gz#s1-s20-t1-k2-k9
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/736/odl3_karaf.log.gz



 Comments   
Comment by Vratko Polak [ 12/Jun/17 ]

Something similar happened [2] again. The karaf.log [3] contains the quoted part, but there are other suspicious lines, so I am not really sure this is the same bug.

2017-06-11 11:53:06,783 | INFO | rd-dispatcher-38 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.319 s

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/739/log.html.gz#s1-s36-t3-k2-k14
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/739/odl1_karaf.log.gz

Comment by Vratko Polak [ 07/Sep/17 ]

Examining the history of all jobs, I have found a more recent occurrence (2017-08-24) [4].

After comparing karaf logs, I think this has to do with a frontend reconnect happening while there is a member in the PreLeader state.

First, here are the INFO logs (copied from [5]) from a run where the test passed. The log belongs to member-1 (the old leader); the new leader is member-2. It is not clear from the INFO level which member-1 component first became aware that member-2 is the leader, allowing the successful reconnect (09:14:33,018), because the member-1 Shard switched from Leader to Follower only after seeing (09:14:34,006) a higher term in an AppendEntriesReply from the other follower (member-3); see the sketch after the log excerpt. This probably implies there was a time interval when both member-2 and member-1 claimed to be the Leader.

2017-09-05 09:14:32,878 | INFO | qtp481975108-122 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Removing replica for shard default memberName member-1, datastoreType Config
2017-09-05 09:14:32,896 | INFO | ult-dispatcher-2 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Successfully removed replica for shard default
2017-09-05 09:14:32,906 | INFO | rd-dispatcher-40 | tActorServerConfigurationSupport | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: ServerConfigurationPayload [serverConfig=[ServerInfo [id=member-3-shard-default-config, isVoting=true], ServerInfo [id=member-2-shard-default-config, isVoting=true]]] has been successfully replicated to a majority of followers
2017-09-05 09:14:32,917 | INFO | rd-dispatcher-39 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-09-05 09:14:32,943 | INFO | lt-dispatcher-21 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated no leadership, reconnecting it
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374] is not the current leader
2017-09-05 09:14:32,960 | INFO | lt-dispatcher-21 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}
2017-09-05 09:14:32,960 | INFO | lt-dispatcher-21 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: refreshing backend for shard 0
2017-09-05 09:14:32,996 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: resolved shard 0 to ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}
2017-09-05 09:14:32,996 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: resolving connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} to ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}}
2017-09-05 09:14:33,018 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} in 21.22 ms
2017-09-05 09:14:34,006 | INFO | rd-dispatcher-39 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-1-shard-default-config (Leader): Term 2 in "AppendEntriesReply [term=2, success=false, followerId=member-3-shard-default-config, logLastIndex=12193, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 1 - switching to Follower
2017-09-05 09:14:34,007 | INFO | rd-dispatcher-39 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-1-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 2
2017-09-05 09:14:34,008 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | RoleChangeNotifier for member-1-shard-default-config , received role change from Leader to Follower
2017-09-05 09:14:34,008 | INFO | rd-dispatcher-54 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-09-05 09:14:34,008 | INFO | rd-dispatcher-54 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-1-shard-default-config from Leader to Follower
2017-09-05 09:14:34,969 | INFO | rd-dispatcher-37 | aftActorLeadershipTransferCohort | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.052 s
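
For reference, the 09:14:34,006 step-down above is the standard Raft rule that a leader must yield as soon as it sees a higher term in any message. A minimal sketch of that rule, with hypothetical names rather than the actual sal-akka-raft classes:

final class LeaderTermCheck {
    // Sketch only (hypothetical names, not the actual sal-akka-raft code):
    // a reply carrying a term greater than the leader's current term makes
    // the leader adopt that term and switch to Follower, which closes the
    // window in which both the old and the new leader claim leadership.
    private long currentTerm = 1;
    private String behavior = "Leader";

    String onReplyTerm(final long replyTerm) {
        if (replyTerm > currentTerm) {
            currentTerm = replyTerm;   // adopt the newer term
            behavior = "Follower";     // step down immediately
        }
        return behavior;
    }
}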


Now, when this bug happens, the old leader's (member-1) INFO messages (copied from karaf log [6]) are nearly the same, except that the unknown component never learns who the new leader (member-3) is; the frontend keeps trying to connect to member-1 and never succeeds (a sketch of the rejection guard follows the excerpt).

2017-08-24 15:24:13,620 | INFO | qtp412422320-101 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Removing replica for shard default memberName member-1, datastoreType Config
2017-08-24 15:24:13,654 | INFO | lt-dispatcher-26 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Successfully removed replica for shard default
2017-08-24 15:24:13,670 | INFO | rd-dispatcher-53 | tActorServerConfigurationSupport | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: ServerConfigurationPayload [serverConfig=[ServerInfo [id=member-3-shard-default-config, isVoting=true], ServerInfo [id=member-2-shard-default-config, isVoting=true]]] has been successfully replicated to a majority of followers
2017-08-24 15:24:13,681 | INFO | rd-dispatcher-37 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-08-24 15:24:13,684 | INFO | lt-dispatcher-25 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated no leadership, reconnecting it
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858] is not the current leader
2017-08-24 15:24:13,691 | INFO | lt-dispatcher-25 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}
2017-08-24 15:24:13,691 | INFO | lt-dispatcher-25 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: refreshing backend for shard 0
2017-08-24 15:24:13,744 | INFO | rd-dispatcher-53 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-1-shard-default-config: not currently leader, rejecting request ConnectClientRequest{target=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/temp/$q], minVersion=BORON, maxVersion=BORON}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.
2017-08-24 15:24:13,748 | WARN | monPool-worker-3 | AbstractShardBackendResolver | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
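
The 15:24:13,744 rejection above shows why the reconnect loops: the old leader still answers as isLeader=true, but it is not an "active" leader while the leadership transfer is in progress, so it refuses the ConnectClientRequest. A minimal sketch of that guard, with hypothetical names rather than the actual Shard code:

final class ConnectClientGuard {
    // Sketch only: a shard accepts a frontend ConnectClientRequest only while
    // it is the *active* leader; during a leadership transfer the request is
    // rejected and the frontend must re-resolve the shard and retry.
    private boolean isLeader = true;
    private boolean leadershipTransferInProgress = true;

    boolean acceptConnectClientRequest() {
        final boolean isLeaderActive = isLeader && !leadershipTransferInProgress;
        return isLeader && isLeaderActive;
    }
}

With the values logged above (isLeader: true, isLeaderActive: false) every attempt is rejected; the retries can only end once the frontend resolves the shard to the new leader, which never happens in the failing run.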

Here is a karaf log [7] segment from the new leader (member-3), showing the transient PreLeader state; a sketch of the Candidate-to-PreLeader decision follows the excerpt.

2017-08-24 15:24:13,680 | INFO | rd-dispatcher-31 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-08-24 15:24:13,700 | INFO | rd-dispatcher-29 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (Candidate): Starting new election term 2
2017-08-24 15:24:13,700 | INFO | rd-dispatcher-29 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (Follower) :- Switching from behavior Follower to Candidate, election term: 2
2017-08-24 15:24:13,701 | INFO | lt-dispatcher-21 | RoleChangeNotifier | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | RoleChangeNotifier for member-3-shard-default-config , received role change from Follower to Candidate
2017-08-24 15:24:13,701 | INFO | rd-dispatcher-33 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-08-24 15:24:13,701 | INFO | rd-dispatcher-33 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-3-shard-default-config from Follower to Candidate
2017-08-24 15:24:13,789 | INFO | rd-dispatcher-29 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (Candidate): LastApplied index 10366 is behind last index 10376 - switching to PreLeader
2017-08-24 15:24:13,795 | INFO | rd-dispatcher-29 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (Candidate) :- Switching from behavior Candidate to PreLeader, election term: 2
2017-08-24 15:24:13,813 | INFO | rd-dispatcher-32 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=member-3-shard-default-config, leaderPayloadVersion=5]
2017-08-24 15:24:13,816 | INFO | lt-dispatcher-20 | RoleChangeNotifier | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | RoleChangeNotifier for member-3-shard-default-config , received role change from Candidate to PreLeader
2017-08-24 15:24:13,817 | INFO | rd-dispatcher-32 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-3-shard-default-config from Candidate to PreLeader
2017-08-24 15:24:13,838 | INFO | rd-dispatcher-31 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (PreLeader) :- Switching from behavior PreLeader to Leader, election term: 2
2017-08-24 15:24:13,839 | INFO | lt-dispatcher-43 | RoleChangeNotifier | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | RoleChangeNotifier for member-3-shard-default-config , received role change from PreLeader to Leader
2017-08-24 15:24:13,839 | INFO | rd-dispatcher-32 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-3-shard-default-config from PreLeader to Leader
2017-08-24 15:24:13,840 | INFO | rd-dispatcher-32 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: All Shards are ready - data store config is ready, available count is 0
2017-08-24 15:24:14,714 | INFO | rd-dispatcher-32 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-3-shard-default-config (Leader): Cannot append entries because sender's term 1 is less than 2
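
The 15:24:13,789 line above shows why member-3 does not become a serving leader immediately: its last applied index is behind its last log index, so it first goes through PreLeader to commit the outstanding entries. A rough sketch of that decision, with hypothetical names rather than the actual sal-akka-raft code:

final class ElectionOutcome {
    // Sketch only: a candidate that wins the election but still has unapplied
    // log entries from a previous term becomes a PreLeader, which has to
    // commit those entries before acting as a fully active Leader.
    String behaviorAfterElectionWin(final long lastApplied, final long lastIndex) {
        return lastApplied < lastIndex ? "PreLeader" : "Leader";
    }
}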

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/418/log.html.gz#s1-s20-t1-k2-k10-k2-k1
[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/432/odl1_karaf.log.gz
[6] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/418/odl1_karaf.log.gz
[7] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/418/odl3_karaf.log.gz

Comment by Robert Varga [ 11/Sep/17 ]

This looks like we are failing to invalidate the cache after we have shut down the local member.

Relevant logs:

2017-08-24 15:24:08,268 | INFO | h for user karaf | command | 498 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Local_Leader_Shutdown
2017-08-24 15:24:13,620 | INFO | qtp412422320-101 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Removing replica for shard default memberName member-1, datastoreType Config
2017-08-24 15:24:13,654 | INFO | lt-dispatcher-26 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Successfully removed replica for shard default
2017-08-24 15:24:13,670 | INFO | rd-dispatcher-53 | tActorServerConfigurationSupport | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: ServerConfigurationPayload [serverConfig=[ServerInfo [id=member-3-shard-default-config, isVoting=true], ServerInfo [id=member-2-shard-default-config, isVoting=true]]] has been successfully replicated to a majority of followers
2017-08-24 15:24:13,681 | INFO | rd-dispatcher-37 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-08-24 15:24:13,684 | INFO | lt-dispatcher-25 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated no leadership, reconnecting it

2017-08-24 15:24:14,715 | INFO | rd-dispatcher-35 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-08-24 15:24:14,715 | INFO | rd-dispatcher-35 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-1-shard-default-config from Leader to Follower

2017-08-24 15:24:15,708 | INFO | rd-dispatcher-36 | aftActorLeadershipTransferCohort | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.034 s
2017-08-24 15:24:15,719 | INFO | rd-dispatcher-34 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | Stopping Shard member-1-shard-default-config

After which we are continuing to get:

2017-08-24 15:24:20,799 | WARN | monPool-worker-3 | AbstractShardBackendResolver | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129)[180:org.opendaylight.controller.sal-distributed-datastore:1.5.2.Carbon]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(AbstractShardBackendResolver.java:142)[180:org.opendaylight.controller.sal-distributed-datastore:1.5.2.Carbon]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)[:1.8.0_141]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)[:1.8.0_141]
at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443)[:1.8.0_141]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)[:1.8.0_141]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)[:1.8.0_141]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)[:1.8.0_141]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)[:1.8.0_141]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.access.commands.ConnectClientRequest".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[154:com.typesafe.akka.actor:2.4.18]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[154:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[150:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[150:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[150:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[154:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[154:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[154:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[154:com.typesafe.akka.actor:2.4.18]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]

This seems to be pointing to the local shard actor, which, obviously, is not responding. I do not believe this is a problem in AbstractShardBackendResolver, but rather in ShardManager.

Comment by Robert Varga [ 11/Sep/17 ]

To clarify: the error resolution process (via actorContext.findPrimaryShardAsync(String)) is returning the local actor ref, even though it was already shut down.

This means that either primaryShardInfoCache is not invalidated, or ShardManager is returning the local actor when queried via FindPrimary.

In both cases this looks like a problem with ShardManager.
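
A rough sketch of the first possibility, invalidating the cached primary-shard entry when the local replica goes away; hypothetical names throughout, as a guess at the shape of the fix rather than the actual gerrit change:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PrimaryShardInfoCacheSketch {
    // Cached answers to "who is the primary for shard X?", keyed by shard name.
    private final Map<String, Object> primaryShardInfoCache = new ConcurrentHashMap<>();

    // Sketch only: once the local replica of a shard has been removed and its
    // actor stopped, any cached entry still pointing at that actor must be
    // dropped, so the next FindPrimary lookup resolves a remote leader instead
    // of the dead local actor.
    void onLocalShardReplicaRemoved(final String shardName) {
        primaryShardInfoCache.remove(shardName);
    }
}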

Comment by Robert Varga [ 11/Sep/17 ]

Guesstimate fix: https://git.opendaylight.org/gerrit/62973

Comment by Vratko Polak [ 12/Sep/17 ]

> Guesstimate fix: https://git.opendaylight.org/gerrit/62973

Merged. The ongoing Sandbox test [9] suggests that this fixes it.

[9] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-ddb-clean-lead-shutdown-longevity-only-carbon/3/console

Comment by Vratko Polak [ 13/Sep/17 ]

> Sandbox test [9] suggests that this fixes it.

Run #2 (without this fix) failed after 35 iterations, which took around 75 minutes.
Run #3 (with this fix) passed after running for 23 hours. Setting to VERIFIED.
