[CONTROLLER-1311] Clustering : Exception in BGP scale test Created: 13/May/15  Updated: 30/Jun/15  Resolved: 30/Jun/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Dana Kutenicsova Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 3195
Priority: High

 Description   

The BGP scale test with 100k routes failed with the following exception:

2015-05-11 12:57:40,915 | ERROR | ult-dispatcher-2 | Shard | 205 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-default-operational: Error handling BatchedModifications for Tx member-1-txn-313834
java.lang.IllegalStateException: Transaction ReadWriteShardDataTreeTransaction{id=member-1-txn-313834, closed=false, snapshot=MutableDataTree [modification=NodeModification [identifier=(urn:ietf:params:xml:ns:netconf:base:1.0)data, modificationType=TOUCH, childModification={....very long body...}} is open
at com.google.common.base.Preconditions.checkState(Preconditions.java:197)[94:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.datastore.ShardDataTreeTransactionChain.getSnapshot(ShardDataTreeTransactionChain.java:37)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardDataTreeTransactionChain.newReadWriteTransaction(ShardDataTreeTransactionChain.java:54)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardDataTree.newReadWriteTransaction(ShardDataTree.java:92)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleBatchedModifications(ShardCommitCoordinator.java:163)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleBatchedModifications(Shard.java:423)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:228)[212:org.opendaylight.controller.sal-distributed-datastore:1.2.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)[204:org.opendaylight.controller.sal-clustering-commons:1.2.0.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[197:com.typesafe.akka.actor:2.3.10]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[202:com.typesafe.akka.persistence.experimental:2.3.10]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[197:com.typesafe.akka.actor:2.3.10]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)[197:com.typesafe.akka.actor:2.3.10]
at akka.dispatch.Mailbox.run(Mailbox.scala:221)[197:com.typesafe.akka.actor:2.3.10]
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)[197:com.typesafe.akka.actor:2.3.10]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
2015-05-11 12:57:42,092 | WARN | lt-dispatcher-16 | Shard | 205 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | member-1-shard-default-operational: Transaction member-1-txn-313834 was timed out of the cache
2015-05-11 12:58:14,430 | WARN | lt-dispatcher-14 | TransactionProxy | 212 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Failed to acquire operation permit for transaction member-1-txn-313834
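
A minimal sketch (hypothetical class and field names, not the actual controller code) of the kind of guard that produces the "Transaction ... is open" IllegalStateException above: a transaction chain refuses to hand out a new read/write transaction while the previously allocated one has not been readied.

import com.google.common.base.Preconditions;

final class TransactionChainSketch {
    private String openTxId;                       // tx handed out last; null once readied

    String newReadWriteTransaction(String txId) {
        // Mirrors the checkState() seen in ShardDataTreeTransactionChain.getSnapshot()
        Preconditions.checkState(openTxId == null, "Transaction %s is open", openTxId);
        openTxId = txId;                           // stand-in for a ReadWriteShardDataTreeTransaction
        return openTxId;
    }

    void ready(String txId) {
        if (txId.equals(openTxId)) {
            openTxId = null;                       // chain may now allocate the next tx
        }
    }
}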



 Comments   
Comment by Jozef Behran [ 13/May/15 ]

Currently masked by BGPCEP-221, but a possible fix/workaround was suggested for that bug. Once that workaround is confirmed to work, this bug can be exposed and hunted down.

Comment by Moiz Raja [ 02/Jun/15 ]

Does this exception still happen? I have not seen it in the BGP scale test.

Comment by Moiz Raja [ 09/Jun/15 ]

Because of changes in the CDS code, this particular situation with BatchedModifications being sent to the Shard can only happen in the remote case; the local case is handled differently.

Comment by Tom Pantelis [ 12/Jun/15 ]

This exception means that it attempted to create a new read/write tx on the chain before the previous one was readied. However, notice that the tx ID (member-1-txn-313834) associated with the BatchedModifications message is the same as the tx ID reported as still being open. Ruling out a tx ID being erroneously reused, that means there was a previous BatchedModifications message from the front-end for the same tx ID that created a read/write tx on the chain and put an entry in the cohortCache. A subsequent BatchedModifications message should find an entry already in the cohortCache and not create a new read/write tx. But in this case it found no entry in the cache and tried to create a new chained tx, which indicates there was enough time between the BatchedModifications messages for the entry to be timed out of the cache. Unfortunately we don't have the whole log to confirm this, but it seems like that's what happened.
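
A hedged illustration of that timing (the names and the one-second timeout are invented for the sketch, not the actual ShardCommitCoordinator fields): if the cohort entry expires between two BatchedModifications messages carrying the same tx ID, the second message re-enters the "create a new chained tx" path while the chain still has the first tx open.

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

final class CohortCacheRaceSketch {
    private final Cache<String, Object> cohortCache = CacheBuilder.newBuilder()
            .expireAfterAccess(1, TimeUnit.SECONDS)   // aggressively short, for illustration
            .build();
    private boolean chainHasOpenTx;                   // state of the shard's tx chain

    void handleBatchedModifications(String txId) {
        if (cohortCache.getIfPresent(txId) == null) {
            // No cache entry, so this looks like the first message for txId.
            // If an earlier message for the same txId already opened a chained tx
            // and its entry has since expired, the chain is still occupied and
            // this is where the IllegalStateException would be raised.
            if (chainHasOpenTx) {
                throw new IllegalStateException("Transaction " + txId + " is open");
            }
            chainHasOpenTx = true;
            cohortCache.put(txId, new Object());      // stand-in for the cohort entry
        }
        // else: apply the batched modifications to the already-open transaction
    }

    void ready(String txId) {
        cohortCache.invalidate(txId);
        chainHasOpenTx = false;                       // previous tx readied; chain is free
    }
}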

The code has since been refactored a bit so entries no longer age out of the cache between BatchedModifications messages; they can only be aged out once they're enqueued on ready. So it doesn't appear this issue can occur now.
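
A rough sketch of that refactored flow as described (the map and queue names are invented for the example): in-flight cohorts live in a structure with no expiry, and only readied cohorts sit in a queue from which they may later be aged out.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class ReadyQueueSketch {
    private final Map<String, Object> inFlight = new HashMap<>();  // no expiry while active
    private final Queue<Object> readyQueue = new ArrayDeque<>();   // aging applies only here

    void handleBatchedModifications(String txId, boolean ready) {
        Object cohort = inFlight.computeIfAbsent(txId, id -> new Object());
        // ... apply the carried modifications to the cohort's transaction ...
        if (ready) {
            inFlight.remove(txId);
            readyQueue.add(cohort);   // from here on the entry can be timed out
        }
    }
}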

However, looking at the code more, I think there are some places with incomplete cleanup in some error code paths.
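
For context, the kind of cleanup meant here, sketched with invented names: on a failure while applying modifications, the handler should drop its cache entry and close the chained transaction so the chain is not left blocked.

import java.util.HashMap;
import java.util.Map;

final class ErrorPathCleanupSketch {
    private final Map<String, AutoCloseable> cohortCache = new HashMap<>();

    void handleBatchedModifications(String txId, Runnable applyModifications) {
        AutoCloseable tx = cohortCache.computeIfAbsent(txId, id -> () -> { /* abort/close tx */ });
        try {
            applyModifications.run();
        } catch (RuntimeException e) {
            cohortCache.remove(txId);      // without this the stale entry lingers
            try {
                tx.close();                // without this the chain stays blocked
            } catch (Exception closeFailure) {
                e.addSuppressed(closeFailure);
            }
            throw e;
        }
    }
}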

Comment by Tom Pantelis [ 16/Jun/15 ]

https://git.opendaylight.org/gerrit/#/c/22656/

Comment by Jozef Behran [ 30/Jun/15 ]

Not observed during the last 2 weeks of testing.
