[CONTROLLER-1578] Multi-threaded AbstractDataBrokerTest/s fail with TransactionCommitFailedException Caused by: java.lang.IllegalStateException: Store tree ... and candidate base ... differ. Created: 13/Jan/17  Updated: 25/Jul/23  Resolved: 16/Feb/17

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Michael Vorburger Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File bug7538_stacktrace.txt    
Issue Links:
Blocks
blocks GENIUS-51 idmanager fails with TransactionCommi... Resolved
Duplicate
is duplicated by GENIUS-51 idmanager fails with TransactionCommi... Resolved
External issue ID: 7538
Priority: High

 Description   

While working on the new style end2end component ElanServiceTest in (ongoing, WIP) https://git.opendaylight.org/gerrit/#/c/44000/, I've just hit the exception shown below.

The test (currently local code, will push to c/44000 shortly) is really simple, it does a createElanInstance() on an IElanService in first @Test public void createElanInstance(), which works but the test then fails in the assert comparison due to another completely unrelated problem, so it quickly moves on to the next test, which does createElanInstance() again and then addElanInterface(), and that's what fails with below.

This code itself is simple single threaded - but the elanmanager datatree listeners are async, of course...

While I don't fully understand what's going on here yet, I suspect that perhaps one of the background threads processing from aysnc data store change listeners modified the data store concurrently and just before the Tx submit() in the addElanInterface() ?

Does the error below mean some sort of like optimistic lock violation? Are we missing some sort of retry logic? Do we have utilities for this? Is this a problem particular to 1 line in elanmanager, or a more general issue that we need to properly retry data store write actions everywhere?

java.lang.RuntimeException: TransactionCommitFailedException

{message=commit execution failed, errorList=[RpcError [message=commit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@5243d730 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@45b2d17b differ.]]}

at org.opendaylight.genius.mdsalutil.MDSALUtil.syncWrite(MDSALUtil.java:536)
at org.opendaylight.netvirt.elan.internal.ElanServiceProvider.addElanInterface(ElanServiceProvider.java:277)
at org.opendaylight.netvirt.elanmanager.tests.ElanServiceTest.addElanInterface(ElanServiceTest.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.opendaylight.infrautils.inject.guice.testutils.GuiceRule$1.evaluate(GuiceRule.java:83)
at org.opendaylight.infrautils.testutils.RunUntilFailureRule$RunUntilFailureStatement.evaluate(RunUntilFailureRule.java:59)
at org.opendaylight.infrautils.testutils.LogRule$1.evaluate(LogRule.java:45)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.opendaylight.infrautils.testutils.RunUntilFailureClassRule$RunUntilFailureStatement.evaluate(RunUntilFailureClassRule.java:46)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:678)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
Caused by: TransactionCommitFailedException

{message=commit execution failed, errorList=[RpcError [message=commit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@5243d730 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@45b2d17b differ.]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:87)
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:37)
at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.mapException(MappingCheckedFuture.java:60)
at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:64)
at org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.get(MappingCheckedFuture.java:77)
at com.google.common.util.concurrent.AbstractCheckedFuture.checkedGet(AbstractCheckedFuture.java:78)
at org.opendaylight.genius.datastoreutils.SingleTransactionDataBroker.syncWrite(SingleTransactionDataBroker.java:178)
at org.opendaylight.genius.mdsalutil.MDSALUtil.syncWrite(MDSALUtil.java:533)
... 32 more
Caused by: java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@5243d730 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@45b2d17b differ.
at org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree.commit(InMemoryDataTree.java:119)
at org.opendaylight.controller.md.sal.dom.store.impl.InMemoryDOMDataStore.commit(InMemoryDOMDataStore.java:248)
at org.opendaylight.controller.md.sal.dom.store.impl.InMemoryDOMStoreThreePhaseCommitCohort.commit(InMemoryDOMStoreThreePhaseCommitCohort.java:105)
at org.opendaylight.controller.md.sal.dom.broker.impl.CommitCoordinationTask.commitAll(CommitCoordinationTask.java:211)
at org.opendaylight.controller.md.sal.dom.broker.impl.CommitCoordinationTask.commitBlocking(CommitCoordinationTask.java:186)
at org.opendaylight.controller.md.sal.dom.broker.impl.CommitCoordinationTask.call(CommitCoordinationTask.java:65)
at org.opendaylight.controller.md.sal.dom.broker.impl.CommitCoordinationTask.call(CommitCoordinationTask.java:29)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:299)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:58)
at org.opendaylight.controller.md.sal.dom.broker.impl.SerializedDOMDataBroker.submit(SerializedDOMDataBroker.java:70)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMForwardedWriteTransaction.submit(DOMForwardedWriteTransaction.java:144)
at org.opendaylight.controller.md.sal.binding.impl.AbstractWriteTransaction.doSubmit(AbstractWriteTransaction.java:134)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMWriteTransactionAdapter.submit(BindingDOMWriteTransactionAdapter.java:83)
... 34 more



 Comments   
Comment by Michael Vorburger [ 13/Jan/17 ]

Attachment bug7538_stacktrace.txt has been added with description: Stack trace, same as in bug's comment, but more readable in an attachment without line wrapping

Comment by Michael Vorburger [ 13/Jan/17 ]

I half suspected that perhaps this could be an issue more to do with the test setup, because the RunUntilFailureRule re-creates all services N times, which in real production would never happen.

So I've further simplified it, and removed much of the test infra (incl. e.g. TestableDataTreeChangeListenerModule) now reproduce this in a test which sets up things up only once, and then simply by doing createElanInstance() in a loop - it works N (13, 5, ..) times and the N+1'th time you get this exception.

With this simplified ElanServiceCreate100InstancesTest, we can demonstrate that this is some concurrency issue - if we comment out the body of ElanInstanceManager.add(), which is an AsyncDataTreeChangeListenerBase, then this "TransactionCommitFailedException Caused by: java.lang.IllegalStateException: Store tree ... and candidate base ... differ." does not occur anymore, even if we loop 1000 times to do createElanInstance().

See https://git.opendaylight.org/gerrit/#/c/50460/ for the reproducer (it will only consist of ElanServiceCreate100InstancesTest; all other classes of that change are about to go into master as part of https://git.opendaylight.org/gerrit/#/c/44000/).

So what are we doing wrong that this is happening?

Comment by Tom Pantelis [ 02/Feb/17 ]

The error means transactions are being committed concurrently. This is controlled by the ListeningExecutorService passed to the SerializedDOMDataBroker. The submitted transactions must be presented to the SerializedDOMDataBroker serially and not interleaved. The DataBrokerTestCustomizer passes in a calling thread executor via MoreExecutors.newDirectExecutorService(). This is fine as long as transactions are submitted transactions from one thread only as is normally done in tests and as is done by createElanInstance.

The ElanServiceCreate100InstancesListener submits transactions from the notification callback. The notification of listeners is controlled by the ExecutorService passed to the InMemoryDOMDataStore. The test setup also uses MoreExecutors.newDirectExecutorService() which would seem to be OK since listeners are notified after commit completes.

The issue is related to the choice of executors passed to the SerializedDOMDataBroker and InMemoryDOMDataStore. The bottom line is that using direct executors is problematic. The executor passed to the SerializedDOMDataBroker must be single-threaded unless transactions are only submitted serially by the test method. InMemoryDOMDataStore executor should also not be a direct executor (could be single-threaded or multi-threaded).

Therefore I would suggest modifying the test setup to pass single-threaded executors and then see if the error occurs.

Comment by Michael Vorburger [ 02/Feb/17 ]

Copy/paste of private email exchange for future reference, this is from before Tom's comment above:

I've further "dumbed down" the situation into an example test that does very little now, I've stripped all "functional" code related to genius elan, and can still reproduce it: So we have one main thread doing what you can see in https://git.opendaylight.org/gerrit/#/c/50460/3/vpnservice/elanmanager/elanmanager-impl/src/test/java/org/opendaylight/netvirt/elanmanager/tests/ElanServiceCreate100InstancesTest.java - it basically creates an instance of some date.

Then, in another thread in the background, see https://git.opendaylight.org/gerrit/#/c/50460/3/vpnservice/elanmanager/elanmanager-impl/src/test/java/org/opendaylight/netvirt/elanmanager/tests/ElanServiceCreate100InstancesListener.java, which is a AsyncDataTreeChangeListenerBase (some genius thing I'm not sure you're familiar with but basically some infra for async change listeners), it basically does a merge() on the data that was write()'n above; but note that the instance merge is NOT the one passed to the listener, but a new one created.

Comment by Michael Vorburger [ 02/Feb/17 ]

> The executor passed to the SerializedDOMDataBroker must be single-threaded (...)
> InMemoryDOMDataStore executor should also not be a direct executor (could be
> single-threaded or multi-threaded).
> Therefore I would suggest modifying the test setup to pass
> single-threaded executors and then see if the error occurs.

Yes, this did the trick; confirming! Yu da man - once again; thank you so much for this tip.

Actually I was already (surprisingly) close, I had started toying with those Executors under https://git.opendaylight.org/gerrit/#/c/51307/ yesterday, before getting your comment above, based on the hint you gave me verbally in the Kernel call 2 weeks ago, but initially (patch set 1) made the made mistake of using a newCachedThreadPool instead of a newSingleThreadExecutor - close, but the wrong choice!

I'll clean up https://git.opendaylight.org/gerrit/#/c/51307/ to have it become a a fix for this ... let's communicate via code review on that Gerrit to try to wrap this up.

Comment by Michael Vorburger [ 02/Feb/17 ]

Moved bug from product netvirt to controller, as this turned out to really be an issue in what people refer to as AbstractDataBrokerTest infrastructure (which is in controller), and not something done wrong in the elanmanager related code in netvirt.

Comment by Tom Pantelis [ 02/Feb/17 ]

(In reply to Michael Vorburger from comment #6)
> > The executor passed to the SerializedDOMDataBroker must be single-threaded (...)
> > InMemoryDOMDataStore executor should also not be a direct executor (could be
> > single-threaded or multi-threaded).
> > Therefore I would suggest modifying the test setup to pass
> > single-threaded executors and then see if the error occurs.
>
> Yes, this did the trick; confirming! Yu da man - once again; thank you so
> much for this tip.
>
> Actually I was already (surprisingly) close, I had started toying with those
> Executors under https://git.opendaylight.org/gerrit/#/c/51307/ yesterday,
> before getting your comment above, based on the hint you gave me verbally in
> the Kernel call 2 weeks ago, but initially (patch set 1) made the made
> mistake of using a newCachedThreadPool instead of a newSingleThreadExecutor
> - close, but the wrong choice!
>
> I'll clean up https://git.opendaylight.org/gerrit/#/c/51307/ to have it
> become a a fix for this ... let's communicate via code review on that Gerrit
> to try to wrap this up.

Cool!

Comment by Michael Vorburger [ 02/Feb/17 ]

> The executor passed to the SerializedDOMDataBroker must be single-threaded

This part is what really did the trick, and what https://git.opendaylight.org/gerrit/#/c/51307/ fixes.

> InMemoryDOMDataStore executor should also not be a direct executor
> (could be single-threaded or multi-threaded).

This part is less trivial, and seems not to be directly related to fixing this (the original problem above gets fixed with changing only the executor passed to the SerializedDOMDataBroker and not doing anything about the InMemoryDOMDataStore executor). Changing this causes 3 tests in controller to fail (and who knows where else?!), but I've nevertheless opened 2-3 other new Gerrits to explore this (separately from above's c/51307, so that can go in independently ASAP):

  • https://git.opendaylight.org/gerrit/#/c/51369/ - also failing the same tests as above, switched to InMemoryDOMDataStoreFactory which instead of the current MoreExecutors.newDirectExecutorService uses SpecialExecutors.newBlockingBoundedFastThreadPool, which appears to be a home made (wow) concurrent util that's a hybrid between (normally) multi-threaded but (sometimes, quote: "if the queue is full ... running them in the same thread as the caller") what looks like the equivalent of a direct executor [is this a good idea?]
Comment by Tom Pantelis [ 03/Feb/17 ]

It might make sense to make the executors configurable. So by default use direct executors and override appropriately per test.

Comment by Michael Vorburger [ 03/Feb/17 ]

> make the executors configurable.

FYI The executor passed to the SerializedDOMDataBroker is already configurable, by DataBrokerTestCustomizer's getCommitCoordinatorExecutor().

https://git.opendaylight.org/gerrit/#/c/51307/ proposes to change DataBrokerTestCustomizer's getCommitCoordinatorExecutor() default from newDirectExecutorService() to newSingleThreadExecutor()

This is because it seems to me that we can consider the default newDirectExecutorService() a "bug" - it means that one cannot directly use the AbstractDataBrokerTest in a multi-threaded test.

As I said, changing this one (but not the InMemoryDOMDataStore executor; see below) seems to have no negative effect, all tests still pass; at least do in controller do, and those I wrote in netvirt and genius pass BETTER (without the problem shown above) with this change. Therefore I would think this change of default is a Good Thing (TM) to do.

> So by default use direct executors and override appropriately per test.

Of course, if one KNOWS about all of this, one can already @Override getCommitCoordinatorExecutor() in a test, via a custom subclass of DataBrokerTestCustomizer, not AbstractDataBrokerTest. The problem is that I doubt mere mortals know about these internal details - usually, one just wants to write functional tests against DataBroker, not spend days investigating the internal threading model of the test infrastructure (as I have here).

If you object to the change of the default, would you like to -1
https://git.opendaylight.org/gerrit/#/c/51307/ ? If you agree to the change of this default, would you like to +1 and merge c/51307 ?

For the InMemoryDOMDataStore executor choice, we could make DataBrokerTestCustomizer, or even InMemoryDOMDataStoreFactory (thoughts where best?), more configurable. But that's almost a peripheral orthogonal parallel discussion at this point - that part seems to have nothing to do with fixing this bug.

Comment by Michael Vorburger [ 06/Feb/17 ]

https://git.opendaylight.org/gerrit/#/c/51307/ broke existing green test(s) in at least genius (org.opendaylight.genius.test.MdSalUtilTest) and netvirt , possibly more widely.

Thus reverting in https://git.opendaylight.org/gerrit/#/c/51467/ ...

I'll propose a new change to controller to make this an option instead.

Comment by Michael Vorburger [ 06/Feb/17 ]

> https://git.opendaylight.org/gerrit/#/c/51307/ broke existing green test(s)

FTR: Any such tests which this broke were technically strictly speaking wrong, as they made an assumption which the DataBroker API never guaranteed ... see e.g. https://git.opendaylight.org/gerrit/#/c/51487/ for an example how to fix such tests.

> I'll propose a new change to controller to make this an option instead.

==> https://git.opendaylight.org/gerrit/#/c/51486/

Comment by Michael Vorburger [ 07/Feb/17 ]

> tests which this broke were technically strictly speaking wrong
> see e.g. https://git.opendaylight.org/gerrit/#/c/51487/
> for an example how to fix such tests.

as well as https://git.opendaylight.org/gerrit/#/c/51490/

Comment by Michael Vorburger [ 16/Feb/17 ]

Done

Generated at Wed Feb 07 19:55:54 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.