[CONTROLLER-1218] Clustering : No cohort entry found for transaction exception occurs often during Netconf scale test Created: 17/Mar/15  Updated: 11/Jun/15  Resolved: 11/Jun/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Moiz Raja Assignee: Harman Singh
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2860

 Description   

When running the netconf scale test with 10,000 devices, the following exception is seen. This may be one of the root causes of the test sometimes failing.

2015-03-17 13:32:04,097 | WARN | WriteTxCommit-0 | DOMDataCommitCoordinatorImpl | 144 - org.opendaylight.controller.sal-broker-impl - 1.1.1.Helium-SR1-00004_1-SNAPSHOT | Tx: DOM-2450 Error during phase CAN_COMMIT, starting Abort
TransactionCommitFailedException{message=canCommit execution failed, errorList=[RpcError [message=canCommit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.IllegalStateException: member-1-shard-inventory-operational: No cohort entry found for transaction member-1-txn-4914]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)[144:org.opendaylight.controller.sal-broker-impl:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)[144:org.opendaylight.controller.sal-broker-impl:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:80)[65:org.opendaylight.yangtools.util:0.6.3.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMDataCommitCoordinatorImpl$CommitCoordinationTask.canCommitBlocking(DOMDataCommitCoordinatorImpl.java:186)[144:org.opendaylight.controller.sal-broker-impl:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMDataCommitCoordinatorImpl$CommitCoordinationTask.call(DOMDataCommitCoordinatorImpl.java:150)[144:org.opendaylight.controller.sal-broker-impl:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMDataCommitCoordinatorImpl$CommitCoordinationTask.call(DOMDataCommitCoordinatorImpl.java:127)[144:org.opendaylight.controller.sal-broker-impl:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.yangtools.util.concurrent.DeadlockDetectingListeningExecutorService$2.call(DeadlockDetectingListeningExecutorService.java:192)[65:org.opendaylight.yangtools.util:0.6.3.Helium-SR1-00004_1-SNAPSHOT]
at java.util.concurrent.FutureTask.run(Unknown Source)[:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)[:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)[:1.7.0_67]
at java.lang.Thread.run(Unknown Source)[:1.7.0_67]
Caused by: java.lang.IllegalStateException: member-1-shard-inventory-operational: No cohort entry found for transaction member-1-txn-4914
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:96)[329:org.opendaylight.controller.sal-distributed-datastore:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleCanCommitTransaction(Shard.java:450)[329:org.opendaylight.controller.sal-distributed-datastore:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:278)[329:org.opendaylight.controller.sal-distributed-datastore:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:96)[321:org.opendaylight.controller.sal-clustering-commons:1.1.1.Helium-SR1-00004_1-SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[314:com.typesafe.akka.actor:2.3.4]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:256)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[319:com.typesafe.akka.persistence.experimental:2.3.4]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[314:com.typesafe.akka.actor:2.3.4]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[314:com.typesafe.akka.actor:2.3.4]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)[314:com.typesafe.akka.actor:2.3.4]
at akka.dispatch.Mailbox.run(Mailbox.scala:220)[314:com.typesafe.akka.actor:2.3.4]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)[314:com.typesafe.akka.actor:2.3.4]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[311:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[311:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[311:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[311:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]



 Comments   
Comment by Tom Pantelis [ 31/Mar/15 ]

We could also bump up the CohortEntry cache removal timeout to 5 or even 10 minutes. The timeout was put in to clean up in case a remote client goes away in the middle of committing a transaction, which should be pretty rare.
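To illustrate how a cache removal timeout can produce the "No cohort entry found" error under load, here is a minimal sketch. It is not the actual ShardCommitCoordinator code: the class name CohortCacheSketch, its methods, and the explicit nowMillis parameter are all hypothetical, chosen only to make the timing deterministic. The assumption is that an entry is cached when a transaction becomes ready and is evicted once it is older than the timeout, so a canCommit that arrives late finds nothing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a cohort-entry cache with a removal timeout. */
public class CohortCacheSketch {
    private final long timeoutMillis;
    // Transaction ID -> time (ms) at which the entry was cached.
    private final Map<String, Long> entries = new ConcurrentHashMap<>();

    public CohortCacheSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Caches an entry when a transaction becomes ready to commit. */
    public void transactionReady(String txId, long nowMillis) {
        entries.put(txId, nowMillis);
    }

    /** Drops every entry that has been cached for at least the timeout. */
    public void evictExpired(long nowMillis) {
        entries.entrySet().removeIf(e -> nowMillis - e.getValue() >= timeoutMillis);
    }

    /** Fails if the entry was already evicted before canCommit arrived. */
    public void handleCanCommit(String txId, long nowMillis) {
        evictExpired(nowMillis);
        if (!entries.containsKey(txId)) {
            throw new IllegalStateException(
                "No cohort entry found for transaction " + txId);
        }
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        String txId = "member-1-txn-4914";

        // Assumed scenario: canCommit is delayed by 6 minutes under load.
        CohortCacheSketch fiveMinutes = new CohortCacheSketch(5 * oneMinute);
        fiveMinutes.transactionReady(txId, 0L);
        try {
            fiveMinutes.handleCanCommit(txId, 6 * oneMinute);
        } catch (IllegalStateException e) {
            System.out.println("5 min timeout: " + e.getMessage());
        }

        // With a 10 minute timeout the same delayed canCommit still succeeds.
        CohortCacheSketch tenMinutes = new CohortCacheSketch(10 * oneMinute);
        tenMinutes.transactionReady(txId, 0L);
        tenMinutes.handleCanCommit(txId, 6 * oneMinute);
        System.out.println("10 min timeout: entry still present");
    }
}
```

In this sketch a longer timeout only widens the window in which a delayed canCommit can still find its entry; it does not address why the canCommit was delayed in the first place.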

Comment by Moiz Raja [ 16/Apr/15 ]

After the fixes that restrict datastore operations until the datastore is ready, we now see operation and transaction commit timeouts at the default timeout levels.

With the operation timeout set to 6100s and the transaction timeout set to 7000s, we are able to get 10,000 devices through.

It appears that the rate limiting is not kicking in, because the transaction timeouts had to be adjusted.

There are 2 transactions done by the netconf connector:

1. Put the device in the config/operational store
2. Update the state of the device in the operational store

Not sure which transaction is failing.

Comment by Moiz Raja [ 10/Jun/15 ]

Andrej, I know you run this test on a regular basis. Do you see this issue in Lithium?

Comment by Andrej Marcinek [ 11/Jun/15 ]

I have not seen this error for a long time; all my netconf scale tests work well with 10k devices in Lithium.
