[CONTROLLER-1543] Cluster test regression: No cars can be added after 2 followers are taken down and then up Created: 30/Aug/16  Updated: 19/Oct/17  Resolved: 01/Sep/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: 0.4.0
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 6587
Priority: High

 Description   

As shown in this test for Boron:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-boron/

There is a failure when trying to add cars after 2 instances are taken down and then recovered:

2016-08-30 18:30:07,383 | WARN | lt-dispatcher-17 | ConcurrentDOMDataBroker | 197 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron-RC1 | Tx: DOM-18 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-car-config#1713972084)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[180:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_72-internal]
2016-08-30 18:30:07,382 | ERROR | ult-dispatcher-5 | LocalThreePhaseCommitCohort | 197 - org.opendaylight.controller.sal-distributed-datastore - 1.4.0.Boron-RC1 | Failed to prepare transaction member-1-datastore-config-fe-0-txn-18 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-car-config#1713972084)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[180:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_72-internal]
2016-08-30 18:30:07,385 | WARN | CommitFutures-18 | RestconfImpl | 265 - org.opendaylight.netconf.sal-rest-connector - 1.4.0.Boron-RC1 | Error creating data config/car:cars
TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-car-config#1713972084)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)[166:org.opendaylight.controller.sal-broker-impl:1.4.0.Boron-RC1]
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)[166:org.opendaylight.controller.sal-broker-impl:1.4.0.Boron-RC1]
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:103)[68:org.opendaylight.yangtools.util:1.0.0.Boron-RC1]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:220)[197:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron-RC1]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:49)[197:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron-RC1]
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:128)[197:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron-RC1]
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)[65:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[65:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[65:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[65:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[65:com.google.guava:18.0.0]
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)[65:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.datastore.SingleCommitCohortProxy$1.onComplete(SingleCommitCohortProxy.java:58)[197:org.opendaylight.controller.sal-distributed-datastore:1.4.0.Boron-RC1]
at akka.dispatch.OnComplete.internal(Future.scala:258)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.OnComplete.internal(Future.scala:256)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)[180:com.typesafe.akka.actor:2.4.7]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:409)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-car-config#1713972084)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[180:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[176:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[180:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[180:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_72-internal]



 Comments   
Comment by Luis Gomez [ 30/Aug/16 ]

Note that recovered instances normally come back fine because they are marked as Down in the remaining node.

Comment by Tom Pantelis [ 31/Aug/16 ]

After stopping the followers, the test tries to add a car, which is expected to fail since the leader can't replicate. However, this step succeeds, and when the followers are restarted, subsequent add-car attempts time out and fail. When this step is removed, the test succeeds.

The add car attempt occurs quickly, before the Leader switches to IsolatedLeader. Thus the Leader persists and replicates, and the transaction is transitioned to the COMMIT_PENDING state in the ShardDataTree. The transaction timer triggers every 15 sec and tries to remove and abort the transaction, but since it's COMMIT_PENDING, the ShardDataTree leaves it in the queue. When the follower is restarted, replication is achieved, and the transaction is applied to state, the ShardDataTree should remove it from pending and transition it to COMMITTED. However, this does not occur because the ClientRequestTracker state in the Leader is lost when it switches to IsolatedLeader. Thus the transaction ID isn't available to the ShardDataTree and the transaction is applied as a foreign candidate, leaving the transaction in the pending queue. This prevents subsequent transactions from making progress.
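The blocking behavior described above can be modeled with a small sketch (hypothetical class and method names, not the actual ShardDataTree code): a pending-transaction queue whose COMMIT_PENDING head can only be drained when the applied commit's transaction ID matches; a commit applied as a foreign candidate (unknown ID) leaves the head in place and stalls everything queued behind it.

```python
from collections import deque

class PendingQueueSketch:
    """Toy model of a shard's pending-transaction queue (hypothetical,
    illustrating the failure mode, not the real ShardDataTree)."""

    def __init__(self):
        self.queue = deque()  # entries: [txn_id, state]

    def ready(self, txn_id):
        # Transaction persisted/replicated and awaiting commit confirmation.
        self.queue.append([txn_id, "COMMIT_PENDING"])

    def apply_commit(self, txn_id):
        # Commit applied to state: drain the head only if the ID matches.
        if self.queue and self.queue[0][0] == txn_id:
            self.queue.popleft()  # head transitions to COMMITTED
            return "COMMITTED"
        # Unknown ID (e.g. tracker state lost on IsolatedLeader switch):
        # applied as a foreign candidate; the pending head stays queued.
        return "FOREIGN"

    def can_make_progress(self):
        return not self.queue

q = PendingQueueSketch()
q.ready("DOM-18")
# The leader lost its ClientRequestTracker state, so the ID it resolves
# for the applied commit no longer matches the pending entry:
result = q.apply_commit("<unknown>")
print(result, q.can_make_progress())
```

With the ID mismatch, the entry never leaves the queue, so every later transaction times out behind it, matching the repeated AskTimeoutException in the log above.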

To fix this, I submitted https://git.opendaylight.org/gerrit/#/c/44899 to retain/copy the internal leader state when transitioning between Leader and IsolatedLeader.
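The idea behind the fix can be sketched as follows (hypothetical names, not the actual RaftActorBehavior code): when switching behaviors, the new IsolatedLeader is constructed from the outgoing Leader and copies its in-flight tracking state instead of starting empty, so transaction IDs remain resolvable once replication resumes.

```python
class LeaderSketch:
    """Toy leader behavior holding in-flight request trackers
    (hypothetical; illustrates the approach of the gerrit change above)."""

    def __init__(self, trackers=None):
        # Maps replicated log index -> transaction identifier.
        self.trackers = dict(trackers or {})

class IsolatedLeaderSketch(LeaderSketch):
    @classmethod
    def from_leader(cls, leader):
        # Retain/copy the internal leader state on transition rather
        # than discarding it, so a later apply-to-state can still match
        # the pending transaction's ID.
        return cls(trackers=leader.trackers)

leader = LeaderSketch({42: "member-1-datastore-config-fe-0-txn-18"})
isolated = IsolatedLeaderSketch.from_leader(leader)
print(isolated.trackers[42])
```

Because the tracker survives the Leader-to-IsolatedLeader transition, the commit is applied under its known transaction ID instead of as a foreign candidate.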

Comment by Luis Gomez [ 01/Sep/16 ]

Now we are talking

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-carbon/

Generated at Wed Feb 07 19:55:48 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.