CONTROLLER-1486 (project: controller)

Clustering: Datastore may fail with "Shard XXX has no leader. Try again later"

Details

    • Bug
    • Status: Resolved
    • Resolution: Duplicate
    • Post-Helium
    • None
    • clustering
    • None
    • Operating System: All
      Platform: All

    • 5391

    Description

      Found by a clustering test run: https://jenkins.opendaylight.org/releng/view/netconf/job/netconf-csit-3node-clustering-only-beryllium/53/
      The relevant report is from odl2_karaf.log (see the test run artifacts or the attachment, which contains a copy of the logs):

      2016-02-18 22:11:04,009 | WARN | qtp862704672-67 | BrokerFacade | 211 - org.opendaylight.netconf.sal-rest-connector - 1.3.0.SNAPSHOT | Exception by reading OPERATIONAL via Restconf: /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}]
      java.util.concurrent.ExecutionException: ReadFailedException{message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], errorList=[RpcError [message=Error executeRead ReadData for path /(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=topology-netconf}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=netconf-test-device}], severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard member-2-shard-topology-operational currently has no leader. Try again later.]]}
          at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:63)

      and in odl1_karaf.log (the timestamps look odd: according to them, the error below happened about 2 minutes BEFORE the error above):

      2016-02-18 22:09:12,905 | WARN | lt-dispatcher-50 | ConcurrentDOMDataBroker | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Tx: DOM-CHAIN-0-0 Error during phase CAN_COMMIT, starting Abort
      akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]
          at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[128:com.typesafe.akka.actor:2.3.14]
          at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[128:com.typesafe.akka.actor:2.3.14]
          at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
          at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
          at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
          at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[128:com.typesafe.akka.actor:2.3.14]
          at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[128:com.typesafe.akka.actor:2.3.14]
          at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[128:com.typesafe.akka.actor:2.3.14]
          at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[128:com.typesafe.akka.actor:2.3.14]
          at java.lang.Thread.run(Thread.java:745)[:1.7.0_85]
      2016-02-18 22:09:12,909 | ERROR | CommitFutures-1 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter@63d9f743: TransactionChain(DOM-CHAIN-0-0) TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]} FAILED!
      2016-02-18 22:09:12,909 | ERROR | CommitFutures-2 | TopologyNodeWriter | 240 - org.opendaylight.netconf.topology - 1.0.0.SNAPSHOT | topology-netconf: Transaction(init topology container) DOM-CHAIN-0-0 FAILED! TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.30.11.66:2550/), Path(/user/shardmanager-config/member-3-shard-topology-config)]] after [5000 ms]]]}
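
      For reference, the gap between the odl2 report (22:11:04,009) and the odl1 WARN entry (22:09:12,905) can be checked with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and the comparison is meaningful only if the clocks of the two nodes are synchronized, which the logs do not confirm.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of OpenDaylight.
final class KarafLogGap {
    // Timestamp layout as it appears in the karaf.log lines quoted above.
    private static final DateTimeFormatter KARAF_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds from the first stamp to the second (negative if reversed). */
    static long gapMillis(String earlier, String later) {
        LocalDateTime a = LocalDateTime.parse(earlier, KARAF_TS);
        LocalDateTime b = LocalDateTime.parse(later, KARAF_TS);
        return Duration.between(a, b).toMillis();
    }

    public static void main(String[] args) {
        // odl1 WARN entry first, odl2 WARN entry second.
        long gap = gapMillis("2016-02-18 22:09:12,905", "2016-02-18 22:11:04,009");
        System.out.println(gap + " ms"); // prints "111104 ms"
    }
}
```

      That is roughly 1 minute 51 seconds, assuming no clock skew between odl1 and odl2.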

      According to a discussion with the developers, the most likely sequence of events is:

      • Leader election fails, or something tries to write to the datastore before the leader election is done.
      • Netconf topology hits the datastore failure and tries to restart.
      • Netconf topology crashes because it is already registered in the entity ownership service.

      CONTROLLER-1468 might be relevant, as it is about a datastore operation failure when the leader is down (in this case it appears the leader is not known yet).
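
      The "Try again later" hint in the first report suggests that a caller could retry while the shard has no leader. The sketch below shows what such a caller-side retry with exponential backoff might look like. It is an illustration only, not the fix for this issue: NoLeaderException is a hypothetical stand-in (in the log above the failure surfaces as a ReadFailedException caused by DataStoreUnavailableException), and the attempt count and delays are arbitrary assumptions.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch, not OpenDaylight code.
final class NoLeaderRetry {
    /** Stand-in for the datastore's "no leader" failure; not the ODL class. */
    static final class NoLeaderException extends Exception {
        NoLeaderException(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying only on NoLeaderException. The delay doubles after
     * each failed attempt; the last failure is rethrown once maxAttempts
     * calls have been made. Any other exception propagates immediately.
     */
    static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (NoLeaderException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up: the leader never showed up
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated read that fails twice before a leader is "elected".
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new NoLeaderException("Shard currently has no leader. Try again later.");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 3 attempts"
    }
}
```

      A retry like this would only mask the window before the leader is known; it does not address a leader election that fails outright, nor the netconf topology restart problem described above.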

      Attachments

        1. logs.tgz (47 kB, Jozef Behran)

            People

              Assignee: Unassigned
              Reporter: Jozef Behran (jbehran@cisco.com)