[CONTROLLER-1486] Clustering: Datastore may fail with "Shard XXX has no leader. Try again later" Created: 19/Feb/16  Updated: 19/Oct/17  Resolved: 19/Sep/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Jozef Behran Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive 5391 logs.zip, File logs.tgz
Issue Links:
Blocks
is blocked by CONTROLLER-1483 akka.pattern.AskTimeoutException on f... Resolved
Duplicate
duplicates CONTROLLER-1483 akka.pattern.AskTimeoutException on f... Resolved
External issue ID: 5391

 Description   

Found by clustering test run: https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-only-beryllium/53/
The relevant report is from odl2_karaf.log (see the test run artifacts or the attachment which contains a copy of the logs):

2016-02-18 22:11:04,009 | WARN | qtp862704672-67 | BrokerFacade | 211 - org.opendaylight.netconf.sal-rest-connector - 1.3.0.SNAPSHOT | Exception by reading OPERATIONAL via Restconf: /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}]
java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
	at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:63)

and in odl1_karaf.log (note the odd timestamp: according to it, the error below happened about 2 minutes BEFORE the error above):

2016-02-18 22:09:12,905 | WARN | lt-dispatcher-50 | ConcurrentDOMDataBroker | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Tx: DOM-CHAIN-0-0 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[128:com.typesafe.akka.actor:2.3.14]
	at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[128:com.typesafe.akka.actor:2.3.14]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[128:com.typesafe.akka.actor:2.3.14]
	at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[128:com.typesafe.akka.actor:2.3.14]
	at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[128:com.typesafe.akka.actor:2.3.14]
	at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[128:com.typesafe.akka.actor:2.3.14]
	at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]

2016-02-18 22:09:12,909 | ERROR | CommitFutures-1 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter@63d9f743: TransactionChain(DOM-CHAIN-0-0) FAILED! TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]}

2016-02-18 22:09:12,909 | ERROR | CommitFutures-2 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | topology-netconf: Transaction(init topology container) DOM-CHAIN-0-0 FAILED! TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]}

According to the discussion with the developers, the most likely cause is something like this:

  • Leader election fails, or something tries to write to the datastore before leader election completes.
  • The netconf topology hits the datastore failure and tries to restart.
  • The netconf topology then crashes because it is already registered in the entity ownership service.

CONTROLLER-1468 might be relevant, as it covers datastore operation failures when the leader is down (here, it appears the leader is not yet known at all).
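As an aside, whether a shard currently has an elected leader can be inspected through the datastore shard MBeans (typically exposed over Jolokia in ODL deployments; the exact MBean naming, e.g. `org.opendaylight.controller:type=DistributedConfigDatastore,Category=Shards,name=member-1-shard-topology-config`, is an assumption here and should be checked against the deployed release). A minimal Python sketch that parses such a Jolokia response; the payload below is a canned illustration, not data captured from this test run:

```python
import json

def shard_raft_state(jolokia_response: str):
    """Extract the RaftState (Leader/Follower/Candidate) and the current
    leader id from a Jolokia read of a shard MBean.  The leader field is
    reported as an empty string while no leader is elected, which we map
    to None -- the situation behind "Shard ... has no leader"."""
    payload = json.loads(jolokia_response)
    value = payload.get("value", {})
    return value.get("RaftState"), (value.get("Leader") or None)

# Canned example response (illustrative only): a follower that does not
# yet know any leader, matching the failure mode seen in this bug.
sample = json.dumps({"value": {"RaftState": "Follower", "Leader": ""}})
state, leader = shard_raft_state(sample)
```

A caller would poll this until `leader` is non-None before issuing writes, rather than writing during the election window.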



 Comments   
Comment by Jozef Behran [ 19/Feb/16 ]

Attachment logs.tgz has been added with description: Backup of logs from the test run exhibiting the failure

Comment by Tom Pantelis [ 19/Feb/16 ]

The attached tar contains the log files with a .xz extension, which is some binary format. What is the .xz format and how does one "decrypt" it?
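(For reference: .xz is LZMA2 compression, not encryption. On the command line, `xz -d file.xz` or `tar -xJf archive.tar.xz` unpacks it; Python's standard `lzma` module handles the same format. A quick round-trip sketch with made-up log text:)

```python
import lzma

# Round-trip: compress some log text into .xz-format bytes, then
# decompress them again -- the in-memory equivalent of `xz -d`.
log_text = b"2016-02-18 22:11:04,009 | WARN | ... has no leader. Try again later."
compressed = lzma.compress(log_text)    # the bytes an .xz file would contain
restored = lzma.decompress(compressed)  # recovers the original text
assert restored == log_text

# For an actual file on disk, lzma.open() reads it transparently:
#   with lzma.open("odl2_karaf.log.xz", "rt") as f:
#       text = f.read()
```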

Comment by Tom Pantelis [ 25/Feb/16 ]

Looking at the logs, shard leaders were elected when the nodes started up; member-3 (10.30.11.66) became the leader of the topology-config shard:

2016-02-18 22:09:06,412 | INFO | lt-dispatcher-23 | ShardManager | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-topology-config from Candidate to Leader

member-1 (10.30.11.144) became the leader of the topology-oper shard.

The akka cluster leader was member-1; however, it lost connection with member-3 about 5 sec later:

2016-02-18 22:09:11,817 | WARN | lt-dispatcher-22 | ClusterCoreDaemon | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.144:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.11.66:2550, status = Up)]

It reconnected 2 sec later:

2016-02-18 22:09:13,816 | INFO | lt-dispatcher-27 | kka://opendaylight-cluster-data) | 129 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.11.144:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.30.11.66:2550, status = Up)]

Same with member-2. So there was an ~8-10 sec period of disconnect which resulted in the DataStoreUnavailableException and AskTimeoutException errors.

What caused the disconnect? Don't know. The nodes were still starting up so it could've been due to a long GC pause or there was an actual issue in the network causing a temporary outage.

For Boron, we intend to look into making it more resilient to short disconnects and timeouts, possibly by retrying transactions.
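The retry approach mentioned above can be sketched generically. This is an illustrative Python sketch, not the ODL implementation (which is Java); `DataStoreUnavailableError`, `submit_with_retry`, and the backoff parameters are hypothetical stand-ins for DataStoreUnavailableException/AskTimeoutException handling:

```python
import time

class DataStoreUnavailableError(Exception):
    """Hypothetical stand-in for DataStoreUnavailableException /
    AskTimeoutException raised during a transient 'no leader' window."""

def submit_with_retry(submit, max_attempts=5, backoff_s=0.5):
    """Retry a transaction submit on transient shard-unavailable failures,
    backing off linearly to give leader election time to complete."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except DataStoreUnavailableError:
            if attempt == max_attempts:
                raise  # still no leader after all attempts: surface the error
            time.sleep(backoff_s * attempt)

# Usage: a submit that fails twice (no leader yet), then succeeds.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DataStoreUnavailableError("Shard has no leader. Try again later.")
    return "COMMITTED"

result = submit_with_retry(flaky_submit, backoff_s=0.01)
```

The point of the backoff is that the ~8-10 sec disconnect observed above is well within what a bounded retry loop can absorb, whereas a single 5000 ms ask timeout is not.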

Comment by ranjithkumar_t [ 19/Sep/16 ]

Steps followed:
=====================

  • Downloaded the latest distribution build "distribution-karaf-0.5.0-20160902.020649-4739.tar.gz".
  • Made the required changes in the akka.conf configuration file.
  • Installed the required features for OpenStack (odl-ovsdb-openstack).
  • Clustering came up and the initial network topology was initialized in the operational datastore.

Observation:
============

  • The cluster nodes are 10.106.138.110, 10.106.138.154 and 10.106.138.155.
  • The leader node is 10.106.138.110; the remaining two nodes are followers.
  • In the 3-node cluster, if we log out or kill the shard leader node, one of the two remaining nodes is elected leader, but I do not see any exception like "DataStoreUnavailableException" for any specific member in the karaf log.

I tested manually many times but could not reproduce the bug.

I have attached the karaf logs for your reference.

Comment by ranjithkumar_t [ 19/Sep/16 ]

Attachment 5391 logs.zip has been added with description: karaf logs for all odl controllers

Comment by Tom Pantelis [ 19/Sep/16 ]

It appears inflight transactions failed due to a temporary disconnect. As mentioned there's work underway to retry transactions in this case. Marking this as a duplicate of CONTROLLER-1483.

Generated at Wed Feb 07 19:55:40 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.