[CONTROLLER-996] Clustering : Exception thrown in Shard because a transaction was created on a chain when previous transaction was not yet ready Created: 05/Nov/14  Updated: 25/Jul/23  Resolved: 15/Nov/14

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Moiz Raja Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1016 Clustering : BGP - Linkstate topology... Verified
is blocked by CONTROLLER-1006 Clustering : TransactionChain id crea... Resolved
is blocked by CONTROLLER-1007 Clustering : CDS sometimes creates tr... Resolved
External issue ID: 2318

 Description   

2014-10-31 11:29:44,814 | WARN | lt-dispatcher-42 | ShardManager | 248 - com.typesafe.akka.slf4j - 2.3.4 | Supervisor Strategy of resume applied
at com.google.common.base.Preconditions.checkState(Preconditions.java:176)
at org.opendaylight.controller.md.sal.dom.store.impl.DOMStoreTransactionChainImpl$Allocated.getSnapshot(DOMStoreTransactionChainImpl.java:68)
at org.opendaylight.controller.md.sal.dom.store.impl.DOMStoreTransactionChainImpl.getSnapshot(DOMStoreTransactionChainImpl.java:111)
at org.opendaylight.controller.md.sal.dom.store.impl.DOMStoreTransactionChainImpl.newReadWriteTransaction(DOMStoreTransactionChainImpl.java:131)
at org.opendaylight.controller.cluster.datastore.Shard.createTypedTransactionActor(Shard.java:508)
at org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:543)
at org.opendaylight.controller.cluster.datastore.Shard.createTransaction(Shard.java:530)
at org.opendaylight.controller.cluster.datastore.Shard.handleCreateTransaction(Shard.java:441)
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:240)
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:96)
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:256)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2014-10-31 11:29:44,815 | WARN | lt-dispatcher-42 | OneForOneStrategy | 248 - com.typesafe.akka.slf4j - 2.3.4 | Previous transaction member-1-shard-inventory-operational-40 is not ready yet



 Comments   
Comment by Tom Pantelis [ 05/Nov/14 ]

I think the Tx chain is coming from the FlowCapableInventoryProvider and this error is related to CONTROLLER-997.

The FlowCapableInventoryProvider uses a Tx chain to continuously batch and submit modification operations on a separate thread. After it submits a Tx batch, it creates a new read-write Tx from the chain and starts a new batch. It does not wait for the Future from the previous Tx to complete. This is valid. The semantics of a Tx chain are such that the modifications from a previously submitted Tx in the chain are visible to the next Tx without the client having to wait for the previous Tx to be committed. So as soon as the previous Tx is readied, the next Tx can be created and its snapshot will contain the modifications made by the previous Tx.

With the IMDS, i.e. w/o clustering, Tx's are readied synchronously and then submitted on a thread to be committed. However with the CDS, Tx's are readied async, i.e. the TransactionProxy sends a message to the ShardTransaction actor. So I think it's this timing difference that can cause issues with Tx chains and break the semantics.

Here's a scenario that can break with the CDS:

ReadWriteTransaction tx1 = txChain.newReadWriteTransaction();
tx1.write(id1, path1);
tx1.submit();

ReadWriteTransaction tx2 = txChain.newReadWriteTransaction();
tx2.write(id2, path2);

On tx1.submit(), the ready operation is done async and may not complete before tx2 is created. If so, tx2 creation fails. With the IMDS, tx1 is readied when submit returns to the caller so this issue does not occur.
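
The race can be modeled with a toy chain. This is only an illustrative sketch: ToyTxChain and ReadyRaceDemo are hypothetical names, not OpenDaylight classes, and the ready operation is represented by a CompletableFuture that the caller completes either before submit (IMDS-like) or not at all (CDS-like, ready still in flight).

```java
import java.util.concurrent.CompletableFuture;

// Toy model (not the OpenDaylight classes): a chain that refuses to allocate
// a new transaction while the previous one has not been readied.
final class ToyTxChain {
    private CompletableFuture<Void> previousReady = CompletableFuture.completedFuture(null);
    private int counter;

    // Performs a state check comparable to the one that fails in the trace.
    String newTransaction() {
        if (!previousReady.isDone()) {
            throw new IllegalStateException(
                "Previous transaction tx-" + counter + " is not ready yet");
        }
        counter++;
        return "tx-" + counter;
    }

    // The caller passes the future that completes when the ready operation finishes.
    void submit(CompletableFuture<Void> ready) {
        previousReady = ready;
    }
}

final class ReadyRaceDemo {
    // Returns the outcome of creating tx2 immediately after submitting tx1.
    static String createSecondTx(boolean readySynchronously) {
        ToyTxChain chain = new ToyTxChain();
        chain.newTransaction(); // tx-1
        CompletableFuture<Void> ready = new CompletableFuture<>();
        if (readySynchronously) {
            ready.complete(null); // IMDS-like: readied before submit returns
        }
        chain.submit(ready); // otherwise CDS-like: ready is still pending
        try {
            return chain.newTransaction();
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(createSecondTx(true));  // tx-2
        System.out.println(createSecondTx(false)); // failed: Previous transaction tx-1 is not ready yet
    }
}
```

With the ready completed up front the second transaction is allocated; with it still pending the state check throws, which corresponds to the failure mode described here.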

In the CDS, we need to ensure the previous Tx in a chain completes its ready operation before it attempts to create the next Tx.
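
One possible way to meet this requirement is to defer the creation of the next Tx until the previous Tx's ready Future has completed. The sketch below is an assumption for illustration only and is not taken from the patches linked on this issue; SerializedTxChain and SerializedChainDemo are hypothetical names, and the example is single-threaded, so it ignores the synchronization a real proxy would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical client-side wrapper; names are illustrative, not the CDS API.
final class SerializedTxChain {
    private CompletableFuture<Void> previousReady = CompletableFuture.completedFuture(null);
    private int counter;

    // Returns a future for the new transaction id. The create step runs only
    // after the previous transaction's ready operation has completed.
    CompletableFuture<String> newTransaction() {
        final String id = "tx-" + (++counter);
        // The lambda stands in for sending the create message to the shard.
        return previousReady.thenApply(ignored -> id);
    }

    // Records the ready future of the transaction that was just submitted.
    void submit(CompletableFuture<Void> ready) {
        previousReady = ready;
    }
}

final class SerializedChainDemo {
    static List<String> run() {
        SerializedTxChain chain = new SerializedTxChain();
        List<String> events = new ArrayList<>();

        CompletableFuture<String> tx1 = chain.newTransaction();
        events.add("tx1 created: " + tx1.isDone());

        CompletableFuture<Void> ready1 = new CompletableFuture<>();
        chain.submit(ready1); // ready for tx1 is still in flight

        CompletableFuture<String> tx2 = chain.newTransaction();
        events.add("tx2 created before ready: " + tx2.isDone());

        ready1.complete(null); // ready for tx1 finishes
        events.add("tx2 created after ready: " + tx2.isDone());
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The caller can still request tx2 right after submitting tx1, which keeps the Tx chain semantics, but the create is held back until the ready completes instead of failing.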

I think this is the root cause of CONTROLLER-997. In the ex trace in 2319, the Tx create timed out. This ex appears to be the root cause: it was thrown back to akka, which doesn't propagate it. Also, this ex occurred at 11:29:44 and the ex in 2319 occurred at 11:29:49, i.e. 5 seconds later, which matches the ask timeout.

This issue may also be the cause of CONTROLLER-998.

Comment by Tom Pantelis [ 06/Nov/14 ]

https://git.opendaylight.org/gerrit/#/c/12535/ for master

https://git.opendaylight.org/gerrit/#/c/12537/ for Helium.

Comment by Tom Pantelis [ 08/Nov/14 ]

Submitted follow-up patch https://git.opendaylight.org/gerrit/#/c/12582/ to master.

Need to cherry pick to stable/helium so keeping this bug open for now.

Comment by Tom Pantelis [ 15/Nov/14 ]

Merged follow-up patch https://git.opendaylight.org/gerrit/#/c/12878/ to helium
