[CONTROLLER-1713] RequestTimeoutException after remove-shard-replica with "transferred leadership to null" Created: 08/Jun/17 Updated: 25/Jul/23 Resolved: 13/Sep/17 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8639 |
| Description |
|
Seen [0] on RelEng for the first time with a module-based shard (tell-based protocol). This is similar to an existing Bug. The quoted part of the title is seen in the member-3 karaf.log [1]:

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/736/log.html.gz#s1-s20-t1-k2-k9 |
| Comments |
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
Something similar happened [2] again. The karaf.log [3] contains the quoted part, but there are other suspicious lines, so I am not really sure this is the same Bug.

2017-06-11 11:53:06,783 | INFO | rd-dispatcher-38 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.319 s

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/739/log.html.gz#s1-s36-t3-k2-k14 |
| Comment by Vratko Polak [ 07/Sep/17 ] |
|
Examining the history of [4]. After comparing karaf logs, I think this has to do with a frontend reconnect happening while there is a member in the PreLeader state.

First, here are the INFO logs (copied from [5]) from a run where the test passed. The log belongs to member-1 (the old leader); the new leader is member-2. It is not clear from the INFO level which member-1 component first became aware that member-2 is the leader, allowing the successful reconnect (09:14:33,018), as the member-1 Shard switched from Leader to Follower only after seeing (09:14:34,006) a higher term in the AppendEntriesReply from the other follower (member-3). This probably implies there was a time interval when both member-2 and member-1 claimed to be a Leader.

2017-09-05 09:14:32,878 | INFO | qtp481975108-122 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Removing replica for shard default memberName member-1, datastoreType Config

org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374] is not the current leader

2017-09-05 09:14:32,960 | INFO | lt-dispatcher-21 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}

2017-09-05 09:14:32,960 | INFO | lt-dispatcher-21 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: refreshing backend for shard 0

2017-09-05 09:14:32,996 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: resolved shard 0 to ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}

2017-09-05 09:14:32,996 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: resolving connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} to ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}}

2017-09-05 09:14:33,018 | INFO | lt-dispatcher-41 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-543839374], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.180:2550/user/shardmanager-config/member-2-shard-default-config#-754756015], sessionId=2, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=absent}} in 21.22 ms

2017-09-05 09:14:34,006 | INFO | rd-dispatcher-39 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-1-shard-default-config (Leader): Term 2 in "AppendEntriesReply [term=2, success=false, followerId=member-3-shard-default-config, logLastIndex=12193, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 1 - switching to Follower

2017-09-05 09:14:34,007 | INFO | rd-dispatcher-39 | Shard | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | member-1-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 2

2017-09-05 09:14:34,008 | INFO | lt-dispatcher-22 | RoleChangeNotifier | 173 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.Carbon | RoleChangeNotifier for member-1-shard-default-config, received role change from Leader to Follower

2017-09-05 09:14:34,008 | INFO | rd-dispatcher-54 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=-1]

2017-09-05 09:14:34,008 | INFO | rd-dispatcher-54 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received role changed for member-1-shard-default-config from Leader to Follower

2017-09-05 09:14:34,969 | INFO | rd-dispatcher-37 | aftActorLeadershipTransferCohort | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.052 s
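The step-down at 09:14:34,006 is the standard Raft rule at work: a leader that sees a higher term in any incoming message (here the AppendEntriesReply from member-3) must adopt that term and revert to Follower. Until such a message arrives, the old leader keeps believing it is the leader, which is the window during which both member-1 and member-2 can claim leadership. Below is a minimal, self-contained sketch of that rule; the class and method names are made up for illustration and are not the sal-akka-raft implementation.

// Illustrative sketch of the Raft "higher term" rule, not the actual
// org.opendaylight.controller sal-akka-raft code: a leader that receives a
// reply carrying a term greater than its own must step down to Follower.
final class RaftTermCheck {

    enum Role { LEADER, FOLLOWER, CANDIDATE }

    static final class ServerState {
        long currentTerm;
        Role role;

        ServerState(final long currentTerm, final Role role) {
            this.currentTerm = currentTerm;
            this.role = role;
        }
    }

    // The comparison a leader performs on every AppendEntriesReply;
    // returns true when the reply's term forced a step-down.
    static boolean onReplyTerm(final ServerState self, final long replyTerm) {
        if (replyTerm > self.currentTerm) {
            // Matches "Term 2 ... is greater than leader's term 1 - switching to Follower".
            self.currentTerm = replyTerm;
            self.role = Role.FOLLOWER;
            return true;
        }
        return false;
    }

    public static void main(final String[] args) {
        ServerState member1 = new ServerState(1, Role.LEADER);
        // member-3 replies with term 2, so member-1 stops acting as a leader.
        boolean steppedDown = onReplyTerm(member1, 2);
        System.out.println("stepped down: " + steppedDown + ", new role: " + member1.role);
    }
}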
Now, when this Bug happens, the old leader (member-1) INFO messages (copied from karaf log [6]) are nearly the same, except that the unknown component never learns who the new leader (member-3) is; the frontend keeps trying to connect to member-1 and never succeeds.

2017-08-24 15:24:13,620 | INFO | qtp412422320-101 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Removing replica for shard default memberName member-1, datastoreType Config

2017-08-24 15:24:13,654 | INFO | lt-dispatcher-26 | ClusterAdminRpcService | 182 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.2.Carbon | Successfully removed replica for shard default

2017-08-24 15:24:13,670 | INFO | rd-dispatcher-53 | tActorServerConfigurationSupport | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: ServerConfigurationPayload [serverConfig=[ServerInfo [id=member-3-shard-default-config, isVoting=true], ServerInfo [id=member-2-shard-default-config, isVoting=true]]] has been successfully replicated to a majority of followers

2017-08-24 15:24:13,681 | INFO | rd-dispatcher-37 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersion=5]

2017-08-24 15:24:13,684 | INFO | lt-dispatcher-25 | ClientActorBehavior | 178 - org.opendaylight.controller.cds-access-client - 1.1.2.Carbon | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated no leadership, reconnecting it

, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/temp/$q], minVersion=BORON, maxVersion=BORON}. isLeader: true, isLeaderActive: false, isLeadershipTransferInProgress: true.

Here is a karaf log [7] segment from the new leader (member-3) showing the transient PreLeader state.

2017-08-24 15:24:13,680 | INFO | rd-dispatcher-31 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-3-shard-default-config, leaderId=null, leaderPayloadVersion=5]

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/418/log.html.gz#s1-s20-t1-k2-k10-k2-k1 |
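The truncated fragment above ending in "isLeader: true, isLeaderActive: false, isLeadershipTransferInProgress: true" suggests that the old leader turns the frontend's connect attempt away while leadership transfer is still in progress. A hedged sketch of the acceptance condition implied by those three flags follows; the predicate and names are assumptions for illustration, not the actual Shard code.

// Hedged illustration of the condition implied by the log fields
// "isLeader", "isLeaderActive" and "isLeadershipTransferInProgress";
// this is not the actual OpenDaylight Shard implementation.
final class ConnectAcceptanceSketch {

    static boolean canAcceptFrontendConnect(final boolean isLeader, final boolean isLeaderActive,
            final boolean isLeadershipTransferInProgress) {
        // A shard can only service a frontend connect request when it is the
        // active leader and is not in the middle of handing leadership away.
        return isLeader && isLeaderActive && !isLeadershipTransferInProgress;
    }

    public static void main(final String[] args) {
        // Values seen in the failing run: the old leader still reports isLeader=true,
        // but the in-progress transfer means the connect attempt is rejected.
        System.out.println(canAcceptFrontendConnect(true, false, true)); // prints false
    }
}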
| Comment by Robert Varga [ 11/Sep/17 ] |
|
This looks like we are failing to invalidate the cache after we have shut down the local member. Relevant logs:

2017-08-24 15:24:08,268 | INFO | h for user karaf | command | 498 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Local_Leader_Shutdown

, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#513397858], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} indicated no leadership, reconnecting it

2017-08-24 15:24:14,715 | INFO | rd-dispatcher-35 | ShardManager | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-config, leaderId=null, leaderPayloadVersi

2017-08-24 15:24:15,708 | INFO | rd-dispatcher-36 | aftActorLeadershipTransferCohort | 174 - org.opendaylight.controller.sal-akka-raft - 1.5.2.Carbon | member-1-shard-default-config: Successfully transferred leadership to null in 2.034 s

After which we are continuing to get:

2017-08-24 15:24:20,799 | WARN | monPool-worker-3 | AbstractShardBackendResolver | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | Failed to resolve shard

which seems to be pointing to the local shard actor, which obviously is not responding. I do not believe this is a problem in AbstractShardBackendResolver, but rather in ShardManager. |
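A self-contained sketch of the loop this hypothesis implies is below: a cached primary-shard entry that still points at the shut-down local shard keeps feeding the resolver a dead target, so each resolution attempt times out and is retried. The class, record and method names are illustrative assumptions, not the actual AbstractShardBackendResolver or ShardManager code.

import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative model of the suspected loop, not actual controller code:
// a cache that still holds the shut-down local shard keeps feeding the
// resolver a dead target, so every resolution attempt times out and retries.
final class StaleShardCacheDemo {

    // Stand-in for a cached primary-shard entry.
    record PrimaryInfo(String actorPath, boolean terminated) { }

    static final ConcurrentMap<String, PrimaryInfo> PRIMARY_SHARD_CACHE = new ConcurrentHashMap<>();

    static Optional<PrimaryInfo> findPrimary(final String shardName) {
        // The suspected bug: nothing removed this entry when the local replica was removed.
        return Optional.ofNullable(PRIMARY_SHARD_CACHE.get(shardName));
    }

    public static void main(final String[] args) {
        PRIMARY_SHARD_CACHE.put("default",
            new PrimaryInfo("akka://.../member-1-shard-default-config", true));

        for (int attempt = 1; attempt <= 3; attempt++) {
            PrimaryInfo info = findPrimary("default").orElseThrow();
            if (info.terminated()) {
                // Messages to a terminated actor go nowhere; the ask times out,
                // the caller logs "Failed to resolve shard" and tries again.
                System.out.println("attempt " + attempt + ": resolution timed out, retrying");
            }
        }
    }
}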
| Comment by Robert Varga [ 11/Sep/17 ] |
|
To clarify: the error resolution process (via actorContext.findPrimaryShardAsync(String)) is returning the local actor ref, even though it was already shut down. This means that either primaryShardInfoCache is not invalidated, or ShardManager is returning the local actor when queried via FindPrimary. In either case this looks like a problem with ShardManager. |
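A minimal sketch of the remedy this suggests, assuming the fix belongs in the ShardManager/primaryShardInfoCache path: when the local replica of a shard is removed (or its actor terminates), the cached entry for that shard should be dropped so the next findPrimaryShardAsync lookup has to re-resolve the remote leader. The class below is illustrative only; it is not the actual ShardManager and not the content of the Gerrit patch.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch only (not the actual ShardManager or the gerrit 62973 change):
// dropping the cached primary-shard entry when the local replica goes away forces
// the next lookup to ask the remaining members for the real leader.
final class PrimaryInfoCacheSketch {

    private final ConcurrentMap<String, String> shardToPrimaryPath = new ConcurrentHashMap<>();

    void cachePrimary(final String shardName, final String primaryActorPath) {
        shardToPrimaryPath.put(shardName, primaryActorPath);
    }

    // Called when the local replica of a shard has been removed or its actor stopped.
    void onLocalReplicaRemoved(final String shardName) {
        // Invalidate so findPrimary() can no longer return the dead local actor.
        shardToPrimaryPath.remove(shardName);
    }

    // Returns the cached primary path, or null to signal that re-resolution is needed.
    String findPrimary(final String shardName) {
        return shardToPrimaryPath.get(shardName);
    }

    public static void main(final String[] args) {
        PrimaryInfoCacheSketch cache = new PrimaryInfoCacheSketch();
        cache.cachePrimary("default", "akka://.../member-1-shard-default-config");
        cache.onLocalReplicaRemoved("default");
        // null here means the caller must go back to the cluster to locate member-3.
        System.out.println(cache.findPrimary("default"));
    }
}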
| Comment by Robert Varga [ 11/Sep/17 ] |
|
Guesstimate fix: https://git.opendaylight.org/gerrit/62973 |
| Comment by Vratko Polak [ 12/Sep/17 ] |
|
> Guesstimate fix: https://git.opendaylight.org/gerrit/62973

Merged. An ongoing Sandbox test [9] suggests that this fixes the issue. |
| Comment by Vratko Polak [ 13/Sep/17 ] |
|
> Sandbox test [9] suggests that this fixes the issue.

Run #2 has failed without this fix, after 35 iterations, which took around 75 minutes. |