[CONTROLLER-1578] Multi-threaded AbstractDataBrokerTest/s fail with TransactionCommitFailedException Caused by: java.lang.IllegalStateException: Store tree ... and candidate base ... differ. Created: 13/Jan/17 Updated: 25/Jul/23 Resolved: 16/Feb/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Michael Vorburger | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| External issue ID: | 7538 | ||||||||||||||||
| Priority: | High | ||||||||||||||||
| Description |
|
While working on the new style end2end component ElanServiceTest in (ongoing, WIP) https://git.opendaylight.org/gerrit/#/c/44000/, I've just hit the exception shown below. The test (currently local code, will push to c/44000 shortly) is really simple, it does a createElanInstance() on an IElanService in first @Test public void createElanInstance(), which works but the test then fails in the assert comparison due to another completely unrelated problem, so it quickly moves on to the next test, which does createElanInstance() again and then addElanInterface(), and that's what fails with below. This code itself is simple single threaded - but the elanmanager datatree listeners are async, of course... While I don't fully understand what's going on here yet, I suspect that perhaps one of the background threads processing from aysnc data store change listeners modified the data store concurrently and just before the Tx submit() in the addElanInterface() ? Does the error below mean some sort of like optimistic lock violation? Are we missing some sort of retry logic? Do we have utilities for this? Is this a problem particular to 1 line in elanmanager, or a more general issue that we need to properly retry data store write actions everywhere? java.lang.RuntimeException: TransactionCommitFailedException {message=commit execution failed, errorList=[RpcError [message=commit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@5243d730 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@45b2d17b differ.]]} at org.opendaylight.genius.mdsalutil.MDSALUtil.syncWrite(MDSALUtil.java:536) at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37) |
| Comments |
| Comment by Michael Vorburger [ 13/Jan/17 ] |
|
Attachment bug7538_stacktrace.txt has been added with description: Stack trace, same as in bug's comment, but more readable in an attachment without line wrapping |
| Comment by Michael Vorburger [ 13/Jan/17 ] |
|
I half suspected that perhaps this could be an issue more to do with the test setup, because the RunUntilFailureRule re-creates all services N times, which in real production would never happen. So I've further simplified it, and removed much of the test infra (incl. e.g. TestableDataTreeChangeListenerModule) now reproduce this in a test which sets up things up only once, and then simply by doing createElanInstance() in a loop - it works N (13, 5, ..) times and the N+1'th time you get this exception. With this simplified ElanServiceCreate100InstancesTest, we can demonstrate that this is some concurrency issue - if we comment out the body of ElanInstanceManager.add(), which is an AsyncDataTreeChangeListenerBase, then this "TransactionCommitFailedException Caused by: java.lang.IllegalStateException: Store tree ... and candidate base ... differ." does not occur anymore, even if we loop 1000 times to do createElanInstance(). See https://git.opendaylight.org/gerrit/#/c/50460/ for the reproducer (it will only consist of ElanServiceCreate100InstancesTest; all other classes of that change are about to go into master as part of https://git.opendaylight.org/gerrit/#/c/44000/). So what are we doing wrong that this is happening? |
| Comment by Tom Pantelis [ 02/Feb/17 ] |
|
The error means transactions are being committed concurrently. This is controlled by the ListeningExecutorService passed to the SerializedDOMDataBroker. The submitted transactions must be presented to the SerializedDOMDataBroker serially and not interleaved. The DataBrokerTestCustomizer passes in a calling thread executor via MoreExecutors.newDirectExecutorService(). This is fine as long as transactions are submitted transactions from one thread only as is normally done in tests and as is done by createElanInstance. The ElanServiceCreate100InstancesListener submits transactions from the notification callback. The notification of listeners is controlled by the ExecutorService passed to the InMemoryDOMDataStore. The test setup also uses MoreExecutors.newDirectExecutorService() which would seem to be OK since listeners are notified after commit completes. The issue is related to the choice of executors passed to the SerializedDOMDataBroker and InMemoryDOMDataStore. The bottom line is that using direct executors is problematic. The executor passed to the SerializedDOMDataBroker must be single-threaded unless transactions are only submitted serially by the test method. InMemoryDOMDataStore executor should also not be a direct executor (could be single-threaded or multi-threaded). Therefore I would suggest modifying the test setup to pass single-threaded executors and then see if the error occurs. |
| Comment by Michael Vorburger [ 02/Feb/17 ] |
|
Copy/paste of private email exchange for future reference, this is from before Tom's comment above: I've further "dumbed down" the situation into an example test that does very little now, I've stripped all "functional" code related to genius elan, and can still reproduce it: So we have one main thread doing what you can see in https://git.opendaylight.org/gerrit/#/c/50460/3/vpnservice/elanmanager/elanmanager-impl/src/test/java/org/opendaylight/netvirt/elanmanager/tests/ElanServiceCreate100InstancesTest.java - it basically creates an instance of some date. Then, in another thread in the background, see https://git.opendaylight.org/gerrit/#/c/50460/3/vpnservice/elanmanager/elanmanager-impl/src/test/java/org/opendaylight/netvirt/elanmanager/tests/ElanServiceCreate100InstancesListener.java, which is a AsyncDataTreeChangeListenerBase (some genius thing I'm not sure you're familiar with but basically some infra for async change listeners), it basically does a merge() on the data that was write()'n above; but note that the instance merge is NOT the one passed to the listener, but a new one created. |
| Comment by Michael Vorburger [ 02/Feb/17 ] |
|
> The executor passed to the SerializedDOMDataBroker must be single-threaded (...) Yes, this did the trick; confirming! Yu da man - once again; thank you so much for this tip. Actually I was already (surprisingly) close, I had started toying with those Executors under https://git.opendaylight.org/gerrit/#/c/51307/ yesterday, before getting your comment above, based on the hint you gave me verbally in the Kernel call 2 weeks ago, but initially (patch set 1) made the made mistake of using a newCachedThreadPool instead of a newSingleThreadExecutor - close, but the wrong choice! I'll clean up https://git.opendaylight.org/gerrit/#/c/51307/ to have it become a a fix for this ... let's communicate via code review on that Gerrit to try to wrap this up. |
| Comment by Michael Vorburger [ 02/Feb/17 ] |
|
Moved bug from product netvirt to controller, as this turned out to really be an issue in what people refer to as AbstractDataBrokerTest infrastructure (which is in controller), and not something done wrong in the elanmanager related code in netvirt. |
| Comment by Tom Pantelis [ 02/Feb/17 ] |
|
(In reply to Michael Vorburger from comment #6) Cool! |
| Comment by Michael Vorburger [ 02/Feb/17 ] |
|
> The executor passed to the SerializedDOMDataBroker must be single-threaded This part is what really did the trick, and what https://git.opendaylight.org/gerrit/#/c/51307/ fixes. > InMemoryDOMDataStore executor should also not be a direct executor This part is less trivial, and seems not to be directly related to fixing this (the original problem above gets fixed with changing only the executor passed to the SerializedDOMDataBroker and not doing anything about the InMemoryDOMDataStore executor). Changing this causes 3 tests in controller to fail (and who knows where else?!), but I've nevertheless opened 2-3 other new Gerrits to explore this (separately from above's c/51307, so that can go in independently ASAP):
|
| Comment by Tom Pantelis [ 03/Feb/17 ] |
|
It might make sense to make the executors configurable. So by default use direct executors and override appropriately per test. |
| Comment by Michael Vorburger [ 03/Feb/17 ] |
|
> make the executors configurable. FYI The executor passed to the SerializedDOMDataBroker is already configurable, by DataBrokerTestCustomizer's getCommitCoordinatorExecutor(). https://git.opendaylight.org/gerrit/#/c/51307/ proposes to change DataBrokerTestCustomizer's getCommitCoordinatorExecutor() default from newDirectExecutorService() to newSingleThreadExecutor() This is because it seems to me that we can consider the default newDirectExecutorService() a "bug" - it means that one cannot directly use the AbstractDataBrokerTest in a multi-threaded test. As I said, changing this one (but not the InMemoryDOMDataStore executor; see below) seems to have no negative effect, all tests still pass; at least do in controller do, and those I wrote in netvirt and genius pass BETTER (without the problem shown above) with this change. Therefore I would think this change of default is a Good Thing (TM) to do. > So by default use direct executors and override appropriately per test. Of course, if one KNOWS about all of this, one can already @Override getCommitCoordinatorExecutor() in a test, via a custom subclass of DataBrokerTestCustomizer, not AbstractDataBrokerTest. The problem is that I doubt mere mortals know about these internal details - usually, one just wants to write functional tests against DataBroker, not spend days investigating the internal threading model of the test infrastructure (as I have here). If you object to the change of the default, would you like to -1 For the InMemoryDOMDataStore executor choice, we could make DataBrokerTestCustomizer, or even InMemoryDOMDataStoreFactory (thoughts where best?), more configurable. But that's almost a peripheral orthogonal parallel discussion at this point - that part seems to have nothing to do with fixing this bug. |
| Comment by Michael Vorburger [ 06/Feb/17 ] |
|
https://git.opendaylight.org/gerrit/#/c/51307/ broke existing green test(s) in at least genius (org.opendaylight.genius.test.MdSalUtilTest) and netvirt Thus reverting in https://git.opendaylight.org/gerrit/#/c/51467/ ... I'll propose a new change to controller to make this an option instead. |
| Comment by Michael Vorburger [ 06/Feb/17 ] |
|
> https://git.opendaylight.org/gerrit/#/c/51307/ broke existing green test(s) FTR: Any such tests which this broke were technically strictly speaking wrong, as they made an assumption which the DataBroker API never guaranteed ... see e.g. https://git.opendaylight.org/gerrit/#/c/51487/ for an example how to fix such tests. > I'll propose a new change to controller to make this an option instead. |
| Comment by Michael Vorburger [ 07/Feb/17 ] |
|
> tests which this broke were technically strictly speaking wrong |
| Comment by Michael Vorburger [ 16/Feb/17 ] |
|
Done |