[CONTROLLER-1486] Clustering: Datastore may fail with "Shard XXX has no leader. Try again later" Created: 19/Feb/16 Updated: 19/Oct/17 Resolved: 19/Sep/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jozef Behran | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
| Attachments: |
logs.tgz, 5391 logs.zip |
| Issue Links: |
|
| External issue ID: | 5391 |
| Description |
|
Found by clustering test run: https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-only-beryllium/53/

2016-02-18 22:11:04,009 | WARN | qtp862704672-67 | BrokerFacade | 211 - org.opendaylight.netconf.sal-rest-connector - 1.3.0.SNAPSHOT | Exception by reading OPERATIONAL via Restconf: /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}]
java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
    at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:63)

In odl1_karaf.log (the timestamp is quite weird; according to it, the error below happened 2 minutes BEFORE the error above):

2016-02-18 22:09:12,905 | WARN | lt-dispatcher-50 | ConcurrentDOMDataBroker | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Tx: DOM-CHAIN-0-0 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[128:com.typesafe.akka.actor:2.3.14]
    at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[128:com.typesafe.akka.actor:2.3.14]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[128:com.typesafe.akka.actor:2.3.14]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[128:com.typesafe.akka.actor:2.3.14]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[128:com.typesafe.akka.actor:2.3.14]
    at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[128:com.typesafe.akka.actor:2.3.14]
    at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]

2016-02-18 22:09:12,909 | ERROR | CommitFutures-1 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter@63d9f743: TransactionChain(DOM-CHAIN-0-0) FAILED!
TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]}

2016-02-18 22:09:12,909 | ERROR | CommitFutures-2 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | topology-netconf: Transaction(init topology container) DOM-CHAIN-0-0 FAILED!
TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]}

According to the discussion with the developers, the most likely cause is something like this:
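Independent of that developer discussion, the log shows the two ways a caller observes the problem: a datastore read fails with a ReadFailedException caused by DataStoreUnavailableException, and a commit fails during canCommit. The following is a minimal sketch, not taken from the OpenDaylight code base, of how a caller of the Beryllium-era MD-SAL binding DataBroker could recognize the "no shard leader" case on a read; the ShardAwareReader class is hypothetical, and only the read API and the exception types visible above come from the project.

import com.google.common.base.Optional;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
import org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper, not part of the OpenDaylight code base.
public final class ShardAwareReader {
    private static final Logger LOG = LoggerFactory.getLogger(ShardAwareReader.class);

    private ShardAwareReader() {
    }

    // Reads the given path from the OPERATIONAL datastore and flags the
    // "shard has no leader" condition seen in the Restconf log above.
    public static <T extends DataObject> Optional<T> readOperational(
            final DataBroker broker, final InstanceIdentifier<T> path) throws ReadFailedException {
        final ReadOnlyTransaction tx = broker.newReadOnlyTransaction();
        try {
            // checkedGet() rethrows the ReadFailedException carried by the future.
            return tx.read(LogicalDatastoreType.OPERATIONAL, path).checkedGet();
        } catch (final ReadFailedException e) {
            if (e.getCause() instanceof DataStoreUnavailableException) {
                // "Shard ... currently has no leader. Try again later." - the read
                // may succeed if repeated once leadership is re-established.
                LOG.warn("No shard leader for {}; a later retry may succeed", path, e);
            }
            throw e;
        } finally {
            tx.close();
        }
    }
}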
|
| Comments |
| Comment by Jozef Behran [ 19/Feb/16 ] |
|
Attachment logs.tgz has been added with description: Backup of logs from the test run exhibiting the failure |
| Comment by Tom Pantelis [ 19/Feb/16 ] |
|
The attached tar contains the log files with a .xz extension, which is some binary format. What is the .xz format and how do I "decrypt" it? |
| Comment by Tom Pantelis [ 25/Feb/16 ] |
|
In looking at the logs, shard leaders were elected when the nodes started up. member-3 (10.30.11.66) became the leader of the topology-config shard:

2016-02-18 22:09:06,412 | INFO | lt-dispatcher-23 | ShardManager | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-topology-config from Candidate to Leader

member-1 (10.30.11.144) became the leader of the topology-oper shard. The akka cluster leader was member-1; however, it lost connection with member-3 about 5 sec later:

2016-02-18 22:09:11,817 | WARN | lt-dispatcher-22 | ClusterCoreDaemon | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.144:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.11.66:2550, status = Up)]

It reconnected 2 sec later:

2016-02-18 22:09:13,816 | INFO | lt-dispatcher-27 | kka://opendaylight-cluster-data) | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.144:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.11.66:2550, status = Up)]

Same with member-2. So there was an ~8-10 sec period of disconnect which resulted in the DataStoreUnavailableException and AskTimeoutException errors. What caused the disconnect? Don't know. The nodes were still starting up, so it could have been a long GC pause or an actual network issue causing a temporary outage. For Boron, we intend to look into making the datastore more resilient to short disconnects and timeouts, possibly by retrying transactions. |
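To illustrate the retry idea at the application level (this is only a sketch, not the Boron resiliency work referred to above), the following hypothetical helper re-submits an idempotent put when the commit fails, for example because canCommit timed out while a shard briefly had no leader. RetryingWriter, MAX_ATTEMPTS and RETRY_DELAY_MS are invented names with illustrative values.

import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

// Hypothetical helper, not part of the OpenDaylight code base: re-submits an
// idempotent put when the commit fails, e.g. because canCommit timed out.
public final class RetryingWriter {

    private static final int MAX_ATTEMPTS = 3;      // illustrative values only
    private static final long RETRY_DELAY_MS = 2000;

    private RetryingWriter() {
    }

    public static <T extends DataObject> void putWithRetry(final DataBroker broker,
            final LogicalDatastoreType store, final InstanceIdentifier<T> path, final T data)
            throws TransactionCommitFailedException, InterruptedException {
        TransactionCommitFailedException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            final WriteTransaction tx = broker.newWriteOnlyTransaction();
            tx.put(store, path, data);
            try {
                // submit().checkedGet() surfaces canCommit failures (such as the
                // AskTimeoutException in the logs) as TransactionCommitFailedException.
                tx.submit().checkedGet();
                return;
            } catch (final TransactionCommitFailedException e) {
                lastFailure = e;
                // Give the shard a chance to elect a leader again before retrying.
                Thread.sleep(RETRY_DELAY_MS);
            }
        }
        throw lastFailure;
    }
}

A production version would also inspect the failure cause, retrying only when it is an AskTimeoutException or DataStoreUnavailableException, and would back off exponentially rather than sleeping a fixed interval.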
| Comment by ranjithkumar_t [ 19/Sep/16 ] |
|
Steps to be followed:
Observation:
I have tested this manually many times but the bug was not reproduced. I have attached the karaf log for your reference. |
| Comment by ranjithkumar_t [ 19/Sep/16 ] |
|
Attachment 5391 logs.zip has been added with description: karaf logs for all odl controllers |
| Comment by Tom Pantelis [ 19/Sep/16 ] |
|
It appears in-flight transactions failed due to a temporary disconnect. As mentioned, there is work underway to retry transactions in this case. Marking this as a duplicate of