[CONTROLLER-1738] RequestTimeoutException due to "Shard has no current leader" after shutdown-shard-replica with ShardLeaderStateChanged not delivered Created: 04/Jul/17  Updated: 25/Jul/23

Status: Confirmed
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
is duplicated by CONTROLLER-1714 RequestTimeoutException after remove-... Resolved
External issue ID: 8794

 Description   

This is similar to CONTROLLER-1717, but a different message has been lost here.
I believe CONTROLLER-1714 was exactly this, but with less evidence in the karaf log.

The Robot failure [0] is the usual 120 s timeout, which we have seen caused by multiple bugs (here from the transaction writer for a module-based shard, using the tell-based protocol):
RequestTimeoutException: Timed out after 120.029805238seconds

Looking at the karaf log [1] of member-1 (the writer and old leader), we can see that leadership was successfully transferred at 04:04:59,250, but the information about the new leader was lost:
2017-07-04 04:04:59,252 | INFO | lt-dispatcher-42 | LocalActorRef | 174 - com.typesafe.akka.slf4j - 2.4.18 | Message [org.opendaylight.controller.cluster.datastore.messages.ShardLeaderStateChanged] from Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#145361760] to Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config/member-1-shard-default-config-notifier#-591265397] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

So the member knew it was a Follower, but it was unable to tell the client who the new leader was.
2017-07-04 04:05:19,248 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader

Perhaps there is a common underlying bug which causes occasional undelivered messages, and we see different symptoms depending on which message gets lost.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/log.html.gz#s1-s20-t1-k2-k8
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/odl1_karaf.log.gz



 Comments   
Comment by Tom Pantelis [ 04/Jul/17 ]

Can you describe the detailed sequence of the test? You mention shutdown-shard-replica - I assume this is the "backdoor" RPC to gracefully shut down a shard actor unbeknownst to the ShardManager. Since the ShardManager does not know the shard actor was shut down, it still has a record of it. I believe the ShardLeaderStateChanged goes to dead letters because the notifier actor goes away with the shard actor, which wouldn't matter if the ShardManager had initiated the shutdown through the normal code paths.

I've warned about using that backdoor for testing - I still think it's better to use normal code paths.
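For illustration, Tom's point boils down to a generic Akka behaviour: a child actor is stopped together with its parent, and any message still addressed to the child afterwards ends up in dead letters. Below is a minimal, self-contained sketch of that behaviour using the Akka classic Java API; the Shard and Notifier classes and the message string are hypothetical stand-ins, not the actual controller actors, and the API level shown is newer than the akka 2.4.18 shipped in Carbon.

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.PoisonPill;
import akka.actor.Props;

public class DeadLetterSketch {

    // Stand-in for the shard's notifier child actor (hypothetical, not the real class).
    static class Notifier extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .matchAny(msg -> System.out.println("notifier got: " + msg))
                    .build();
        }
    }

    // Stand-in for the shard actor; it owns the notifier as a child actor.
    static class Shard extends AbstractActor {
        private final ActorRef notifier =
                getContext().actorOf(Props.create(Notifier.class), "notifier");

        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .matchAny(msg -> notifier.tell(msg, getSelf()))
                    .build();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ActorSystem system = ActorSystem.create("sketch");
        ActorRef shard = system.actorOf(Props.create(Shard.class), "shard");

        // Stopping the shard also stops its notifier child.
        shard.tell(PoisonPill.getInstance(), ActorRef.noSender());
        Thread.sleep(500);

        // A message addressed to the now-dead notifier path is redirected to dead
        // letters and logged, which is the pattern seen in the member-1 karaf log.
        system.actorSelection("/user/shard/notifier")
              .tell("ShardLeaderStateChanged", ActorRef.noSender());

        Thread.sleep(500);
        system.terminate();
    }
}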

Comment by Robert Varga [ 04/Jul/17 ]

Right, since the notifier is a child of the shard actor, it gets reaped along with it. https://git.opendaylight.org/gerrit/59210 adds a deathwatch, which should solve this case, I think.
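For reference, the deathwatch mechanism that change relies on is the standard Akka one: a watcher registers interest in another actor and receives a Terminated message when that actor stops, so it can react instead of keeping a stale reference. A minimal sketch using the Akka classic Java API follows; the Watcher class is hypothetical and is not the actual content of change 59210.

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Terminated;

// Hypothetical watcher actor illustrating Akka deathwatch: it registers interest
// in another actor and receives a Terminated message when that actor stops, so it
// can drop the stale reference instead of silently losing messages to dead letters.
class Watcher extends AbstractActor {
    private final ActorRef watched;

    Watcher(ActorRef watched) {
        this.watched = watched;
        getContext().watch(watched);  // subscribe to the watched actor's termination
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(Terminated.class, t ->
                        System.out.println("watched actor terminated: " + t.getActor()))
                .build();
    }
}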

Comment by Vratko Polak [ 06/Jul/17 ]

> You mention shutdown-shard-replica

Yes, it is an RPC implemented here [2].

The interesting fact is that the test passes most of the time on RelEng.
Here is the failure encountered on a Sandbox run with debug logs [3].

[2] https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/samples/clustering-test-app/provider/src/main/java/org/opendaylight/controller/clustering/it/provider/MdsalLowLevelTestProvider.java;h=e0e8d99d1aab3eac8df847bf7075de6e15b0257e;hb=refs/heads/stable/carbon#l574
[3] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-cls-only-carbon/1/log.html.gz#s1-s2-t1-k2-k8

Comment by Vratko Polak [ 11/Jul/17 ]

As a workaround, we can change the tests [4] to use remove-shard-replica instead.
This bug would stay open, but its importance would be lower.

[4] https://git.opendaylight.org/gerrit/60140

Comment by Vratko Polak [ 12/Jul/17 ]

[4] merged, importance lowered.
