[CONTROLLER-884] Clustering: Store tree and candidate base differ Created: 22/Sep/14  Updated: 25/Jul/23  Resolved: 17/Oct/14

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Moiz Raja Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2038

 Description   

This exception is seen during L2Switch testing. Per Robert Varga, this can happen when "the datatree has been changed between the time the transaction was prepared (pre-commit) and actually committed".

I do not see how this is possible, because both the Distributed Data Store and the In-Memory Data Store prevent modification once a transaction is in the ready state. Preliminarily, this appears to be a bug in InMemoryDataTree.

The one known way to reproduce this issue is to run the integration distribution with L2Switch and do a ping-all with a 6-switch, 2-host configuration.

2014-09-23 00:22:43,820 | WARN | ult-dispatcher-3 | OneForOneStrategy | 168 - com.typesafe.akka.slf4j - 2.3.4 | Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@5b70715b and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@33223c43 differ.

2014-09-23 00:22:43,820 | WARN | ult-dispatcher-3 | ShardManager | 168 - com.typesafe.akka.slf4j - 2.3.4 | Supervisor Strategy of resume applied
at com.google.common.base.Preconditions.checkState(Preconditions.java:176)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree.commit(InMemoryDataTree.java:119)
at org.opendaylight.controller.md.sal.dom.store.impl.InMemoryDOMDataStore$ThreePhaseCommitImpl.commit(InMemoryDOMDataStore.java:434)
at org.opendaylight.controller.md.sal.dom.store.impl.InMemoryDOMDataStore$ChainedTransactionCommitImpl.commit(InMemoryDOMDataStore.java:355)
at org.opendaylight.controller.cluster.datastore.Shard.commit(Shard.java:349)
at org.opendaylight.controller.cluster.datastore.Shard.applyState(Shard.java:534)
at org.opendaylight.controller.cluster.raft.RaftActor.onReceiveCommand(RaftActor.java:267)
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:232)
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:256)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
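
For context, the check that produces this message lives in InMemoryDataTree.commit: the candidate being committed must have been prepared against the tree's current root, and if another transaction committed in between, the two references differ. Below is a simplified sketch of that kind of check; the class and field names are illustrative, not the actual yangtools source.

// Illustrative sketch only -- not the real InMemoryDataTree code.
final class DataTreeSketch {
    private Object currentRoot = new Object();

    // 'baseRoot' is the root the candidate was prepared (pre-committed) against;
    // 'newRoot' is the root produced by applying the transaction's modifications.
    synchronized void commit(final Object baseRoot, final Object newRoot) {
        if (baseRoot != currentRoot) {
            throw new IllegalStateException(String.format(
                    "Store tree %s and candidate base %s differ.", currentRoot, baseRoot));
        }
        currentRoot = newRoot;
    }
}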



 Comments   
Comment by Tony Tkacik [ 23/Sep/14 ]

To debug this we need logs from the clustered datastore. This looks like a reordering of canCommit, preCommit and the other phases, so to debug it we need to know how the clustered datastore was actually processing transactions and commit phases.

This exception is raised if and only if somebody did something along the lines of:

a.preCommit();
b.commit();
a.commit();
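
To illustrate that interleaving concretely, here is a self-contained sketch; the Cohort interface below is a simplified stand-in for a transaction's three-phase commit cohort, not the real API:

final class InterleavingSketch {
    // Simplified stand-in for a transaction's three-phase commit cohort.
    interface Cohort {
        boolean canCommit();
        void preCommit();  // prepares a candidate against the store root visible at this point
        void commit();     // applies the candidate; fails if the root has moved since preCommit
    }

    static void interleave(final Cohort a, final Cohort b) {
        a.canCommit();
        a.preCommit();   // a's candidate is now prepared against root R1

        b.canCommit();
        b.preCommit();
        b.commit();      // the store root advances from R1 to R2

        a.commit();      // fails: a's base (R1) no longer matches the current root (R2)
    }
}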

Comment by Moiz Raja [ 23/Sep/14 ]

The component that ensures a preCommit for a given transaction is immediately followed by its commit is the DataBroker. My understanding is that the DataBroker has a single submit thread, and that is how it ensures the ordering on a single instance.

This ordering mechanism breaks down in a cluster because you now have one submit thread per controller instance, so the chances of this exception occurring are high.
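
A minimal sketch of the kind of serialization a single submit thread provides (illustrative only, not the actual DataBroker implementation; the class and method names are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: because there is exactly one executor thread, one transaction's
// phases finish before the next transaction's phases start, so preCommit/commit pairs
// from different transactions can never interleave.
final class SingleSubmitThreadSketch {
    private final ExecutorService submitThread = Executors.newSingleThreadExecutor();

    Future<?> submit(final Runnable canCommit, final Runnable preCommit, final Runnable commit) {
        return submitThread.submit(new Runnable() {
            @Override
            public void run() {
                canCommit.run();
                preCommit.run();
                commit.run();
            }
        });
    }
}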

It seems to me that the problem is with the API. We have prematurely broken the API down into a preCommit and a commit when in reality these are supposed to be one atomic operation. As it stands today, it is hard to make this operation atomic in a cluster and still maintain proper error reporting when data conflicts arise. This is further complicated by the fact that in a cluster a "commit" actually happens only when consensus is achieved - that is, the data in the transaction needs to be persisted on a majority of the cluster nodes before it is committed.

One possible solution for this problem
---------------------------------------

Since a commit in a cluster is based on consensus, it looks like we may have to consider doing every commit in the cluster on a new transaction. The problem with this approach is that it may not detect conflicts, but it would ensure that every node in the cluster sees exactly the same data.

Comment by Tom Pantelis [ 23/Sep/14 ]

I'm trying to understand this. I can see this occurring when the commit is forwarded to the shard and, after it persists, it replicates it asynchronously to the other nodes for consensus. At that point the Tx is complete from the DataBroker's perspective, so its thread moves on to another Tx. Meanwhile the responses come back from the replicated nodes and the shard receives the ApplyState message, which creates a local Tx and goes thru preCommit and commit directly on the IMDS. Since this doesn't go thru the DataBroker, it could happen concurrently with another Tx, resulting in the exception.

Is that correct?

Comment by Moiz Raja [ 23/Sep/14 ]

We send the CommitTransactionReply only when the commit is done.

ListenableFuture<Void> future = cohort.commit();

Futures.addCallback(future, new FutureCallback<Void>() {
    @Override
    public void onSuccess(Void v) {
        sender.tell(new CommitTransactionReply().toSerializable(), getSelf());
        shardMBean.incrementCommittedTransactionCount();
        shardMBean.setLastCommittedTransactionTime(System.currentTimeMillis());
    }

    @Override
    public void onFailure(Throwable t) {
        LOG.error(t, "An exception happened during commit");
        shardMBean.incrementFailedTransactionsCount();
        sender.tell(new akka.actor.Status.Failure(t), getSelf());
    }
});

So the DataBroker on the same instance cannot move forward till consensus is achieved. This problem should not be seen on a single instance.

Comment by Tom Pantelis [ 23/Sep/14 ]

OK - I see that now. On ForwardedCommitTransaction, the shard propagates the client actor (i.e. the ThreePhaseCommitCohortProxy) all the way thru persistData to the Leader and finally back to the Shard via ApplyState when consensus is achieved. Throughout all this, the ThreePhaseCommitCohortProxy is waiting for the CommitTransactionReply. So that sequence shouldn't be a problem b/c the DataBroker's commit thread is blocked waiting on the ThreePhaseCommitCohortProxy's commit Future.

So what is the sequence/scenario that causes concurrent preCommit/commit?

One possible scenario I can see is when a transaction is created on a follower. In that case the shard forwards the CreateTransaction request to the leader, so the ShardTransaction actor that gets returned exists on the leader, as does the subsequent ThreePhaseCommitCohortProxy actor. The transaction will therefore be committed remotely by the follower (on its DataBroker thread). However, this bypasses the leader's DataBroker, and that's how we can end up with two ThreePhaseCommitCohortProxy actors preCommitting and committing concurrently.

Does this sound correct?

Comment by Moiz Raja [ 24/Sep/14 ]

Yes. That is correct.

Comment by Tony Tkacik [ 24/Sep/14 ]

Judging from the discussion, this is a controller MD-SAL bug, not a yangtools one: InMemoryDataTree correctly detected an invalid scenario and raised the exception.

Moving to controller - mdsal.

Comment by Tom Pantelis [ 24/Sep/14 ]

Ideally transactions initiated remotely would go thru the leader's DataBroker but that doesn't look feasible (at least not in the short term with the current design).

Long term, I think we need to move the three-phase coordination control from the DataBroker into the InMemoryDOMDataStore and make it atomic either via locking or a single-threaded executor. This would also provide parallelism between the shards to gain better performance.

If we need a short-term fix (hack) for RC2, it seems to me we would need to lock the shard when the three-phase commit is started (or maybe on preCommit) and unlock when the commit is complete. This isn't ideal and would be very risky - we'd have to handle remote node failures in the middle of the three-phase commit to avoid deadlock.
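
A rough sketch of that short-term idea (illustrative only; the names are made up, and a Semaphore is used instead of a lock on the assumption that acquire and release would happen on different actor dispatcher threads). The deadlock risk mentioned above is visible here: if a remote node fails mid-commit and the release path never runs, the permit is held forever.

import java.util.concurrent.Semaphore;

// Illustrative only: the shard acquires the single permit when a three-phase commit
// starts and releases it when the commit (or an abort) finishes, so only one
// transaction at a time can be between preCommit and commit on this shard.
final class ShardCommitGateSketch {
    private final Semaphore commitGate = new Semaphore(1);

    void beginThreePhaseCommit() throws InterruptedException {
        commitGate.acquire();   // blocks any other transaction's preCommit/commit on this shard
    }

    void finishCommitOrAbort() {
        commitGate.release();   // must always run, or the shard deadlocks
    }
}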

What do you think, Moiz?

Comment by Moiz Raja [ 24/Sep/14 ]

Tom, locking the Shard is too risky for RC2. Plus, doing this would make the performance so bad that it wouldn't be worth shipping the clustering feature. To me it seems we have a major limitation in the InMemoryStore right now - we need to overcome that first.

For Helium this failure is not going to be disastrous because the Shard does resume and the failed transaction does get committed (as a separate transaction). This is likely to be ok.

Comment by Tom Pantelis [ 25/Sep/14 ]

Yeah - locking is risky, although I don't think it would have a significant impact on performance. Transactions thru the local broker are already serial and can't run concurrently, so the lock would be uncontended with negligible overhead. It would only be contended with remote transactions. Anyway, I'd rather not have to do it - it was just a suggestion if we needed a short-term fix for RC2.

We'll have time to discuss this issue next week.

Comment by Robert Varga [ 25/Sep/14 ]

This is not really a limitation, as there needs to be an order in which the transactions are applied. canCommit/preCommit work on top of the last committed state, so performing two preCommits concurrently and then trying to apply the second one will fail. You either have a single coordinator or you coordinate the two coordinators' actions.

I do not believe we should be pulling the coordinator thread into the datastore itself – it will force a thread-per-datastore model, plus add queueing overhead (which we know is huge) to any use case where we interact with multiple datastore instances.

Comment by Tom Pantelis [ 25/Sep/14 ]

I disagree - I think we need to move coordination into the data store. A thread-per-datastore is what will give us parallelism between shards which is an advantage of the CDS that we cannot utilize right now. I believe this is what will provide the most performance gains. And as Moiz pointed out, trying to coordinate preCommit and commit across a cluster is very difficult and problematic. We're probably going to have to end up making it atomic in the IMDS anyway.

Re: queueing overhead, all Tx's are queued now, so I assume you're referring to a single Tx accessing multiple shards. Currently that would entail a single level of queueing. However, I don't think the added per-shard queueing overhead would outweigh the performance gains of parallelism, specifically parallelism between multiple Tx's.

Comment by Robert Varga [ 25/Sep/14 ]

Tom, you are assuming only your particular use case. It is a completely valid scenario to coordinate multiple IMDS instances using a single thread. Moving coordination into the IMDS will force multiple levels of queueing and coordination.

Please fix the problem at its root, not by shifting design of components which work precisely as designed.

Comment by Tom Pantelis [ 25/Sep/14 ]

I don't quite get your standpoint on this. The CDS sub-allocates IMDS instances (i.e. shards) that can potentially be committed to concurrently. Do you not see that you lose this potential concurrency by coordinating at a higher level that has no notion of the underlying data store's sub-allocation? I don't understand why you see a problem with letting the underlying data store implementation coordinate commits as it sees fit.

IMO, shard concurrency is eventually what's going to yield performance approaching and maybe exceeding the straight in-memory data store implementation.

What do you think the problem is at its root? From my standpoint (and I believe Moiz's), the root problem is that commit coordination is currently done at the high level of the broker.

Anyway - we can talk about this next week in person.

Comment by Moiz Raja [ 25/Sep/14 ]

Robert, from an API perspective the IMDS can only be used correctly if canCommit/preCommit/commit are called together, so IMO it is meaningless to expose canCommit and preCommit. I happen to think that is at the root of this problem.

The only other approach we could take is to create a completely different DataBroker to be used for the CDS. But this would require duplication of code, which I would prefer to avoid.

Comment by Robert Varga [ 26/Sep/14 ]

I am not saying that parallelism across shards is not necessary. What I am saying is that parallelism inside a shard, as represented by a single instance of IMDS, does not make sense, because proving that two transactions do not conflict requires an amount of work comparable to actually committing them.

The fact that the CDS does not preserve sequencing within a single shard is the root of the problem. The observation that "canCommit/preCommit/commit need to be called together" is not correct: they need to be called in sequence, which is not the same thing. While it may seem that you can just say 'commit', that ignores the distributed nature of the world and prevents coordination of multiple entities which need to agree on whether a transaction is in or out. For details please see the expensive literature explaining the different commit protocols (2PC and (E)3PC). This is precisely why the datastore exposes a cohort and the DataBroker acts as the coordinator. Please note that the broker coordinates two instances of IMDS.

To give you a concrete use case which would be set back by the changes you propose: I have a netty-driven TCP server. For each client I instantiate a pair of datastores: one persistent (backed by, say, a filesystem) and one IMDS. I have the netty threadpool and receive requests from the client: writes I want to go to both, reads I satisfy from the IMDS. Netty makes sure a channel (and thus a pair of PDS and IMDS) is touched only by a single thread, which means I can read the request and service it inside this thread, with no need for the additional queueing implied by what Tom requests. Furthermore, I have the 3PC interface available, which allows me to drive the two stores asynchronously towards a consistent result, which I would not be able to do if Moiz's proposal were implemented.
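
A minimal sketch of that coordinator pattern (shown synchronously for brevity): one thread drives the cohorts of two independent stores through the same phases, so both either commit or abort together. The Cohort interface here is a simplified stand-in, not the real DOMStoreThreePhaseCommitCohort API.

// Illustrative only: a single coordinator sequences two stores' cohorts.
final class TwoStoreCoordinatorSketch {
    interface Cohort {
        boolean canCommit();
        void preCommit();
        void commit();
        void abort();
    }

    static void commitBoth(final Cohort persistent, final Cohort inMemory) {
        if (persistent.canCommit() && inMemory.canCommit()) {
            persistent.preCommit();
            inMemory.preCommit();
            persistent.commit();
            inMemory.commit();
        } else {
            persistent.abort();
            inMemory.abort();
        }
    }
}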

The IMDS gives you MVCC behavior, but requires that you provide a proper sequence of transactions. The precondition triggered because you do not sequence transactions and have two of them running concurrently. This seems to indicate that the CDS does not provide ordering guarantees within a shard, which would be very bad news (for example, if two applications can interact between themselves and the CDS at the same time, counter-intuitive things can happen).

Comment by Tom Pantelis [ 26/Sep/14 ]

OK - if you want to keep the current DataBroker behavior, then we'll have to implement our own DataBroker for the CDS, which is fine. To address this issue, transactions initiated remotely would have to go thru the DataBroker, and thus a DataBroker actor component will be needed.

Comment by Tom Pantelis [ 09/Oct/14 ]

Submitted https://git.opendaylight.org/gerrit/#/c/11795/

Comment by Tom Pantelis [ 15/Oct/14 ]

Submitted https://git.opendaylight.org/gerrit/#/c/11966/

Comment by Tom Pantelis [ 15/Oct/14 ]

Oops wrong bug...

