[NETVIRT-937] DataStoreJobCoordinator java.lang.NullPointerException: at index 0 Created: 04/Oct/17 Updated: 12/Dec/18 Resolved: 28/Nov/17 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Sam Hague | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 9238 |
| Priority: | High |
| Description |
|
2017-10-03 20:15:53,092 | INFO | nPool-1-worker-2 | ElanInstanceManager | 341 - org.opendaylight.netvirt.elanmanager-impl - 0.4.2.SNAPSHOT | Deleting the elanInterface present under ConfigDS:97fe230c-b83b-4bb4-bfc8-af085a471e92 |
| Comments |
| Comment by Sam Hague [ 04/Oct/17 ] |
|
Attachment odl1_karaf.log.tar.xz has been added with description: odl1_karaf.log |
| Comment by Sam Hague [ 04/Oct/17 ] |
|
older job with issue: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-gate-stateful-oxygen/89/odl1_karaf.log.gz |
| Comment by Michael Vorburger [ 04/Oct/17 ] |
|
Let me see if I can figure out what is going on in this one here... |
| Comment by Michael Vorburger [ 04/Oct/17 ] |
|
+Cc Faseela, Vivek & David, who were on private email thread re. this: https://git.opendaylight.org/gerrit/#/c/63965/ illustrates via new @Test bug9238CallableListWithNull() how this can happen in JobCoordinator.. that test fails with the exact same stack trace as the one Sam provided above, from a CSIT. Basically, we see that some code somewhere is returning a (List containining a) null ListenableFuture. I suspect its a bug where someone wanted to return a completed ListenableFuture of value null, but instead directly returned a null... The trouble is that we cannot yet know what the culprit for this is... but c/63965 also adds additional logging, which should help us find out. https://git.opendaylight.org/gerrit/#/c/63966/ then does the same in genius' DataStoreJobCoordinator as c/63965 does in infrautils' JobCoordinatorImpl. (And having had to do this twice is further motivation for finally wrapping up https://git.opendaylight.org/gerrit/#/q/topic:genius.DJC-to-infra.JC!) Once c/63965 and c/63966 will have been merged, you'll get another exception in logs (""Runnnable likely about to terminate thread due to uncaught exception; but here is useful debugging context:"), please post that here when seen and that will then tell us where the root cause for this is where a null List element is returned by mistake, which should then let us fix this for good (easily). BTW: https://git.opendaylight.org/gerrit/#/c/63961/ has a more medium term solution which will help detect the root cause of this problem in the future at code/build instead of runtime!! But there's a lot more work in all of genius & netvirt TBD until we'll reach that Nirvana.. but it's still a step. PS: https://git.opendaylight.org/gerrit/#/c/63963/ was first stab at making it possible to find it, but that was wrong and won't help in this specific case (because it's async; otherwise it would have, and the NullSafeCollections utility created there has value for other future problems like that IMHO). |
| Comment by Michael Vorburger [ 06/Oct/17 ] |
|
> Once c/63965 and c/63966 will have been merged, you'll get another all merged now, so you should soon start seeing the new messsage in CSIT... |
| Comment by Michael Vorburger [ 26/Oct/17 ] |
|
I'd like to wrap up issuses that are currently assigned to me in JIRA.. normally recent CSIT should include new error messages like "Runnnable likely about to terminate thread due to uncaught exception;" (NB Runnnable is misspelt with 3 instead of 2 n!) - have any of you spotted these meanwhile, or could look out for them? Without hearing back from you within 1 week, I'd like to close this issue. |
| Comment by Sam Hague [ 19/Nov/17 ] |
|
Here is an example that has the log: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-gate-stateful-oxygen/346/odl_1/odl1_karaf.log.gz What do we do now? I guess the bug is still there but somehow this extra debug points us to the issue? 2017-11-17 16:42:45,420 | ERROR | nPool-1-worker-3 | JobCoordinatorImpl | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Runnnable likely about to terminate thread due to uncaught exception; but here is useful debugging context: JobEntry{key='elaninterface-ff5d93d3-cade-4164-9599-9431c4caa71b', mainWorker=org.opendaylight.netvirt.elan.internal.ElanInstanceManager$$Lambda$768/1055682102@4fc14129, rollbackWorker=null, retryCount=6, futures=[null, com.google.common.util.concurrent.ImmediateFuture$ImmediateSuccessfulCheckedFuture@4bfecc09]}
java.lang.NullPointerException: at index 0
2017-11-17 16:42:45,421 | ERROR | nPool-1-worker-3 | JobCoordinatorImpl | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Thread terminated due to uncaught exception: ForkJoinPool-1-worker-3
java.lang.NullPointerException: at index 0
2017-11-17 16:42:45,421 | ERROR | nPool-1-worker-1 | JobCoordinatorImpl | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Runnnable likely about to terminate thread due to uncaught exception; but here is useful debugging context: JobEntry{key='elaninterface-c5edf810-4ec7-43a9-9f6b-11a5ff3747b1', mainWorker=org.opendaylight.netvirt.elan.internal.ElanInstanceManager$$Lambda$768/1055682102@12d144e3, rollbackWorker=null, retryCount=6, futures=[null, com.google.common.util.concurrent.ImmediateFuture$ImmediateSuccessfulCheckedFuture@4bfecc09]}
java.lang.NullPointerException: at index 0
2017-11-17 16:42:45,421 | ERROR | nPool-1-worker-1 | JobCoordinatorImpl | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Thread terminated due to uncaught exception: ForkJoinPool-1-worker-1
java.lang.NullPointerException: at index 0
|
| Comment by Michael Vorburger [ 20/Nov/17 ] |
|
> What do we do now? I guess the bug is still there but somehow this extra debug points us to the issue? yeah, just like I said in the previous comment - did you have a chance to actually read that? The new output now shows that something in ElanInstanceManager submits a DJC job with null future. I'm moving this issue back from In Progress to In Review and will propose a fix when I have time. |
| Comment by Michael Vorburger [ 21/Nov/17 ] |
|
I've just analysed this one, but I'm not yet getting what's happening here: So the log above shows us that we must have an enqueueJob which return a List<ListenableFuture<Void> where the 1st of 2 such Future is null, note the clear futures=null, ... - and we can see that it's one from a lambda in ElanInstanceManager. However, in ElanInstanceManager there are only 2 such lambdas, first one in line 110 in remove which returns 2 futures, first one from elanInterfaceManager.removeElanInterface() in line 90 and second one one directly from WriteTransaction.submit in line 118. If we trust that WriteTransaction.submit will never return null (or could that happen?), then it would have to come from ElanInterfaceManager.removeElanInterface(), however that in lines 275 & 276 uses ElanUtils.waitForTransactionToComplete which is again just a WriteTransaction.submit ? (We'll ignore the fact that this synchronous blocking completely defeats the purpose of using the JC in the first place.) The other such lambda is in line 155 in update; that looks wrong, and proposed fix is in https://git.opendaylight.org/gerrit/#/c/65802/ - but that cannot lead to this NPE, because it return a null List (which jobcoordinator handles..) and the List with a null element - which we clearly see now in the log. What am I missing? |
| Comment by Michael Vorburger [ 21/Nov/17 ] |
|
https://git.opendaylight.org/gerrit/65803 adds explicit null checks to all returned WriteTransaction's submit() JUST in case there is some obscure CDS bug which makes that return null sometimes - but I doubt it. Best I can think of though, so let us retry with that (and c/65802 - JUST in case), and then see again. |
| Comment by Tom Pantelis [ 21/Nov/17 ] |
|
The futures List in remove is reused across the submitted jobs in the forEach loop: DataStoreJobCoordinator dataStoreJobCoordinator = DataStoreJobCoordinator.getInstance();
ElanUtils.removeAndGetElanInterfaces(elanName).forEach(elanInterfaceName -> {
dataStoreJobCoordinator.enqueueJob(ElanUtils.getElanInterfaceJobKey(elanInterfaceName), () -> {
WriteTransaction writeConfigTxn = broker.newWriteOnlyTransaction();
...
futures.add(writeConfigTxn.submit());
return futures;
}, ElanConstants.JOB_MAX_RETRIES);
});
This looks very suspicious and unsafe as the List is accessed by multiple threads. I think this is what causes the strange behavior with a null element in the list - tx.submit() never returns null. Each job should return it's own local futures List. I'll submit a patch to address this ... |
| Comment by Michael Vorburger [ 22/Nov/17 ] |
|
> I'll submit a patch to address this ... ==> https://git.opendaylight.org/gerrit/#/c/65821/ FTR to anyone else re-reviewing this in the future: The important fix is the ditching of the outside List<ListenableFuture<Void>> futures usage inside the forEach in remove(), because that multi-threaded use was badly wrong: Replacing of the futures.add(writeConfigTxn.submit()); by return Collections.singletonList(writeConfigTxn.submit()); which is perfectly fine, as the goal is for the JC to handle that list of futures, not to have such a list spanning accross the forEach. However the removal of the futures.addAll() of what elanInterfaceManager.removeElanInterface() returns is a kind of unrelated minor clean up fall out - that part is not what actually fixes the concurrency issue here. I'm just writing this down here because it initially confused me while reviewing it. shague can you confirm that you're good with tpantelis fixing the problem originally raised in this issue - OK to close? |
| Comment by Michael Vorburger [ 22/Nov/17 ] |
|
PS: Not directly related to this issue, so to be followed-up in separately raised future changes, on staring more at code around this I have to agree with this comment from tpantelis in the Commit Message of c/65821, quote:
It looks like there is some mix up which makes no sense to me either; I'll propose something about this in 1-2 follow-up changes. |
| Comment by Michael Vorburger [ 22/Nov/17 ] |
|
> It looks like there is some mix up which makes no sense to me either; I'll propose something about this in 1-2 follow-up changes. |
| Comment by Michael Vorburger [ 12/Dec/18 ] |
|
FTR: Something related to this has just come up again in https://git.opendaylight.org/gerrit/#/c/78678/5/jobcoordinator/jobcoordinator-impl/src/main/java/org/opendaylight/infrautils/jobcoordinator/internal/JobCoordinatorImpl.java@416 ... |