-
Bug
-
Resolution: Unresolved
-
Operating System: All
Platform: All
-
8794
This is similar to CONTROLLER-1717 but a different message has been lost here.
I believe CONTROLLER-1714 was exactly this, but with less evidence in the karaf log.
The Robot failure [0] is the usual 120s timeout, which multiple bugs are known to cause (here seen from the transaction writer for a module-based shard, using the tell-based protocol):
RequestTimeoutException: Timed out after 120.029805238seconds
Looking at the karaf log [1] of member-1 (writer, old leader), we can see that leadership was successfully transferred at 04:04:59,250, but the notification about the new leader was lost:
2017-07-04 04:04:59,252 | INFO | lt-dispatcher-42 | LocalActorRef | 174 - com.typesafe.akka.slf4j - 2.4.18 | Message [org.opendaylight.controller.cluster.datastore.messages.ShardLeaderStateChanged] from Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#145361760] to Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config/member-1-shard-default-config-notifier#-591265397] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
So the member knew it was a Follower, but it was unable to tell the client who the new leader is.
2017-07-04 04:05:19,248 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader
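To check whether other runs show the same pattern, the two log entries above can be located mechanically. The following is only a minimal sketch: it assumes the pipe-separated karaf log layout seen in these excerpts, the function names are made up for illustration, and the sample lines are abbreviated (the actor paths are replaced by `<sender>` and `<recipient>` placeholders).

```python
import re
from datetime import datetime

# Assumed layout: timestamp | level | thread | logger | bundle | message
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) \| (\w+)\s*\| ")
# Akka dead-letter entry; group 1 is the fully qualified message class.
DEAD_LETTER_RE = re.compile(r"Message \[([\w.$]+)\] from .* was not delivered")


def parse_ts(text):
    """Parse a karaf timestamp such as '2017-07-04 04:04:59,252'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def find_events(lines):
    """Return (dead_letters, resolve_failures), each a list of (timestamp, detail)."""
    dead_letters, failures = [], []
    for line in lines:
        head = LINE_RE.match(line)
        if not head:
            continue  # stack trace or continuation line without a timestamp
        ts = parse_ts(head.group(1))
        dl = DEAD_LETTER_RE.search(line)
        if dl:
            # Keep only the simple class name of the lost message.
            dead_letters.append((ts, dl.group(1).rsplit(".", 1)[-1]))
        elif "Failed to resolve shard" in line:
            failures.append((ts, "Failed to resolve shard"))
    return dead_letters, failures


# Abbreviated copies of the two entries quoted in this report.
SAMPLE = [
    "2017-07-04 04:04:59,252 | INFO | lt-dispatcher-42 | LocalActorRef | "
    "174 - com.typesafe.akka.slf4j - 2.4.18 | Message "
    "[org.opendaylight.controller.cluster.datastore.messages.ShardLeaderStateChanged] "
    "from <sender> to <recipient> was not delivered. [5] dead letters encountered.",
    "2017-07-04 04:05:19,248 | WARN | monPool-worker-2 | AbstractShardBackendResolver | "
    "199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | "
    "Failed to resolve shard",
    "java.util.concurrent.TimeoutException: Shard has no current leader",
]

dead, failed = find_events(SAMPLE)
gap_seconds = (failed[0][0] - dead[0][0]).total_seconds()
print(dead[0][1], "lost;", "resolve failure after", gap_seconds, "s")
```

In this run the resolve failure follows the lost ShardLeaderStateChanged by roughly 20 seconds.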
Perhaps there is a common underlying bug which causes occasional undelivered messages, and we see different symptoms depending on which message gets lost.
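One way to probe this hypothesis would be to tally dead letters by message class across the karaf logs of several failing runs and compare the tallies with the observed symptoms. This is a rough sketch only: the helper name is hypothetical, the sample lines are abbreviated, and `org.example.SomeOtherMessage` is a made-up class standing in for whatever else might get lost.

```python
import re
from collections import Counter

# Akka dead-letter entry; group 1 is the fully qualified message class.
DEAD_LETTER_RE = re.compile(r"Message \[([\w.$]+)\] from .* was not delivered")


def tally_dead_letters(lines):
    """Count dead-letter log entries per message class (simple name only)."""
    counts = Counter()
    for line in lines:
        match = DEAD_LETTER_RE.search(line)
        if match:
            counts[match.group(1).rsplit(".", 1)[-1]] += 1
    return counts


# Abbreviated, partly hypothetical sample lines for illustration.
SAMPLE_LINES = [
    "Message [org.opendaylight.controller.cluster.datastore.messages."
    "ShardLeaderStateChanged] from <sender> to <recipient> was not delivered.",
    "Message [org.example.SomeOtherMessage] from <sender> to <recipient> "
    "was not delivered.",
    "Failed to resolve shard",
]

for name, count in tally_dead_letters(SAMPLE_LINES).most_common():
    print(name, count)
```

Feeding it the lines of each member's karaf log would show whether the lost message type varies between runs that fail in different ways.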
[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/log.html.gz#s1-s20-t1-k2-k8
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/771/odl1_karaf.log.gz
- is duplicated by: CONTROLLER-1714 RequestTimeoutException after remove-shard-replica (without apparent cause) [Resolved]