[CONTROLLER-1677] Shard currently has no leader. Try again later. Created: 15/May/17  Updated: 25/Jul/23  Resolved: 19/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Jakub Morvay
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8448

 Description   

This is happening in CSIT [0] for module-based shards with the tell-based protocol, after a long period of isolation. It is possible that suite logic is not correct [1].

After fixing the suite, a real bug might become visible, as karaf.log [2] shows multiple messages such as:
2017-05-14 06:26:44,925 | INFO | ult-dispatcher-3 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | member-3-shard-default-config (Leader): handleAppendEntriesReply - received unsuccessful reply: AppendEntriesReply [term=7, success=false, followerId=member-2-shard-default-config, logLastIndex=3561, logLastTerm=3, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], leader snapshotIndex: 41

That might be preventing the new writer from functioning correctly.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/log.html.gz#s1-s33-t2-k2-k23-k1-k8-k1-k1-k1
[1] https://git.opendaylight.org/gerrit/#/c/56651/3/csit/libraries/controller/DdbCommons.robot@188
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/odl3_karaf.log.gz



 Comments   
Comment by Robert Varga [ 15/May/17 ]

This requires a change to AbstractShardBackendResolver.resolveBackendInfo() so it reports a TimeoutException (which causes another connect attempt) when it sees NoShardLeaderException.

It will also require a change to the UT suite, as the integration tests are actually expecting this exception to be reported.
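
A minimal sketch of the mapping described above, for illustration only. It does not reproduce the actual AbstractShardBackendResolver code: the class BackendResolveSketch, the methods mapFailure and retryableOnNoLeader, and the nested NoShardLeaderException are hypothetical stand-ins, and the real resolveBackendInfo() signature is not assumed. The idea is simply that a "no leader" failure is converted to a TimeoutException so the caller treats it as retryable, while other failures pass through unchanged.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not the controller's real classes.
final class BackendResolveSketch {
    /** Stand-in for the controller's NoShardLeaderException (assumption). */
    static final class NoShardLeaderException extends RuntimeException {
        NoShardLeaderException(String message) {
            super(message);
        }
    }

    /** Maps a "no leader" failure to a TimeoutException; other failures are returned as-is. */
    static Throwable mapFailure(Throwable failure) {
        // Dependent stages wrap the original failure in CompletionException, so unwrap it first.
        Throwable cause = failure instanceof CompletionException && failure.getCause() != null
                ? failure.getCause() : failure;
        if (cause instanceof NoShardLeaderException) {
            TimeoutException timeout = new TimeoutException("Shard has no current leader");
            timeout.initCause(cause); // keep the original exception for diagnostics
            return timeout;
        }
        return cause;
    }

    /** Wraps a resolution attempt so that a missing leader surfaces as a retryable timeout. */
    static <T> CompletionStage<T> retryableOnNoLeader(CompletionStage<T> resolved) {
        CompletableFuture<T> result = new CompletableFuture<>();
        resolved.whenComplete((info, failure) -> {
            if (failure == null) {
                result.complete(info);
            } else {
                result.completeExceptionally(mapFailure(failure));
            }
        });
        return result;
    }
}
```

Under this sketch, a test that previously expected NoShardLeaderException to be reported would instead see a TimeoutException whose cause is the original exception.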

Comment by Vratko Polak [ 15/May/17 ]

> It is possible that suite logic is not correct [1].

I now think the logic in [1] is correct.

Comment by A H [ 17/May/17 ]

We are looking to build Carbon RC2 tomorrow 5/18 at 23:59 UTC time assuming there are no blocker bugs. Is there an ETA for when a fix can be merged and this bug resolved for stable/carbon branch?

Comment by Jakub Morvay [ 18/May/17 ]

(In reply to A H from comment #3)
> We are looking to build Carbon RC2 tomorrow 5/18 at 23:59 UTC time assuming
> there are no blocker bugs. Is there an ETA for when a fix can be merged and
> this bug resolved for stable/carbon branch?

Hi An,

There are a couple of UTs failing because of the changes in this patch. I think fixing them is doable today (5/18).

Comment by Jakub Morvay [ 19/May/17 ]

https://git.opendaylight.org/gerrit/#/c/57074/
