[CONTROLLER-1738] RequestTimeoutException due to "Shard has no current leader" after shutdown-shard-replica with ShardLeaderStateChanged not delivered Created: 04/Jul/17 Updated: 25/Jul/23 |
|
| Status: | Confirmed |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| External issue ID: | 8794 | ||||||||
| Description |
|
This is similar to The Robot failure [0] is the usual 120s timeout, which we can see caused by multiple bugs (from a transaction writer for a module-based shard, tell-based protocol): Looking at the karaf log [1] of member-1 (writer, old leader), we can see that leadership was successfully transferred at 04:04:59,250, but the information about the new leader was lost: So the member knew it was a Follower, but it was unable to tell the client who the new leader is. Perhaps there is a common underlying bug which causes occasional undelivered messages, and we see different symptoms depending on which message gets lost. [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/log.html.gz#s1-s20-t1-k2-k8 |
| Comments |
| Comment by Tom Pantelis [ 04/Jul/17 ] |
|
Can you describe the detailed sequence of the test? You mention shutdown-shard-replica - I assume this is the "backdoor" RPC to gracefully shut down a shard actor unbeknownst to the ShardManager. However, the ShardManager does not know the shard actor was shut down, so it still has a record of it. I believe the ShardLeaderStateChanged goes to dead letters because the notifier actor goes away with the shard actor, which wouldn't matter if the ShardManager had initiated the shutdown through the normal code paths. I've warned about using that backdoor for testing - I still think it's better to use normal code paths. |
| Comment by Robert Varga [ 04/Jul/17 ] |
|
Right, since the notifier is a child of the shard actor, it gets reaped along with it. https://git.opendaylight.org/gerrit/59210 adds a deathwatch, which should solve this case, I think. |
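The failure mode discussed in the two comments above can be illustrated with a toy model. This is a minimal sketch in plain Java, not Akka and not the controller code: the classes `Manager` and `Shard`, their methods, and the `watched` flag are all hypothetical names invented for illustration, and the choice to clear the known leader on termination is an assumption of the sketch, not a description of what the Gerrit change does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: neither Akka nor the controller code; every name is hypothetical.
final class DeathWatchSketch {

    /** Stand-in for the ShardManager: remembers what it last heard about one shard. */
    static final class Manager {
        private final List<String> deadLetters = new ArrayList<>();
        private String knownLeader;        // null means "no leader known"
        private boolean shardAlive = true; // the manager's belief, not ground truth

        void onLeaderChanged(String leaderId) {
            knownLeader = leaderId;
        }

        // Death-watch callback. Clearing the leader here is an assumption of this sketch.
        void onTerminated() {
            shardAlive = false;
            knownLeader = null;
        }

        void recordDeadLetter(String message) {
            deadLetters.add(message);
        }

        String knownLeader() {
            return knownLeader;
        }

        boolean believesShardAlive() {
            return shardAlive;
        }

        int deadLetterCount() {
            return deadLetters.size();
        }
    }

    /** Stand-in for a shard actor that owns a child notifier. */
    static final class Shard {
        private final Manager manager;
        private final boolean watched;
        private boolean notifierAlive = true;

        Shard(Manager manager, boolean watched) {
            this.manager = manager;
            this.watched = watched;
        }

        // Leader changes reach the manager only through the child notifier.
        void publishLeader(String leaderId) {
            if (notifierAlive) {
                manager.onLeaderChanged(leaderId);
            } else {
                manager.recordDeadLetter("ShardLeaderStateChanged(" + leaderId + ")");
            }
        }

        // Shutdown not initiated by the manager: the child notifier stops with its parent.
        void shutdown() {
            notifierAlive = false;
            if (watched) {
                manager.onTerminated();
            }
        }
    }
}
```

With `watched` set to `false`, publishing `member-1`, shutting the shard down, and then publishing `member-2` leaves the manager holding the stale leader `member-1` and still believing the shard is alive, while the second notification ends up in the dead-letter list. With `watched` set to `true`, the same sequence at least tells the manager that the shard terminated, so it no longer trusts its stale record.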
| Comment by Vratko Polak [ 06/Jul/17 ] |
|
> You mention shutdown-shard-replica Yes, it is an RPC implemented here [2]. The interesting fact is that the test passes most of the time on RelEng. [2] https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/samples/clustering-test-app/provider/src/main/java/org/opendaylight/controller/clustering/it/provider/MdsalLowLevelTestProvider.java;h=e0e8d99d1aab3eac8df847bf7075de6e15b0257e;hb=refs/heads/stable/carbon#l574 |
| Comment by Vratko Polak [ 11/Jul/17 ] |
|
As a workaround, we can change the tests [4] to use remove-shard-replica instead. |
| Comment by Vratko Polak [ 12/Jul/17 ] |
|
[4] merged, importance lowered. |