[CONTROLLER-1756] OOM due to huge Map in ShardDataTree Created: 23/Aug/17 Updated: 13/Nov/17 Resolved: 19/Sep/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Michael Vorburger | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 9034 |
| Description |
|
We're seeing an OOM in Red Hat internal scale testing. Our scenario is a cluster of 3 nodes with odl-netvirt-openstack being stress tested by OpenStack's rally benchmarking tool. The ODL version this is being seen with is a Carbon build from last Thursday; specifically https://nexus.opendaylight.org/content/repositories/opendaylight-carbon-epel-7-x86_64-devel/org/opendaylight/integration-packaging/opendaylight/6.2.0-0.1.20170817rel1931.el7.noarch/opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch.rpm

We've started testing by giving all 3 ODL node VMs just 2 GB, in an effort to better understand ODL memory requirements. If it's just "normal" that we cannot run with 2 GB in such a "real world" scenario, we'll gradually increase Xmx in this environment - but we wanted to get community feedback on this OOM with 2 GB already. I'll attach, or provide links to, the usual HPROF, plus a "Leak Suspects" report produced by https://www.eclipse.org/mat/, plus the Karaf log.

Basically what we're seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's its Map<LocalHistoryIdentifier, ShardDataTreeTransactionChain> transactionChains or its Map<Payload, Runnable> replicationCallbacks). As far as I can tell from my limited understanding, this is not the same as

The Karaf log, among other errors which are less relevant in this context AFAIK, shows:

(1) a lot of these: "ERROR ShardDataTree org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | member-0-shard-default-operational: Failed to commit transaction ... java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@78fe0203 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi

(2) at the very end before it blows up, something that seems to indicate genius' lockmanager perhaps not being too happy due to "Waiting for the lock ... is timed out. retrying again" - probably just an impact of this OOM? Or could that (lockmanager) somehow be related and actually be the cause rather than the effect - could "bad application code" (like not closing a DataBroker transaction correctly, or something like that) cause this OOM? |
| Comments |
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> links, to the usual HPROF, plus a "Leak Suspects" report produced by https://www.eclipse.org/mat/, plus the Karaf log.

PLUS a thread stack dump: https://drive.google.com/open?id=0B7gTXYDlI5sLOVByTFNZN0JFNHc |
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> could "bad application code" (like not closing DataBroker I don't suppose https://git.opendaylight.org/gerrit/#/c/62196/ could fix this kind of problem - is it important that all transactions are always either submitted or canceled? |
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> don't suppose https://git.opendaylight.org/gerrit/#/c/62196/ could fix

According to https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001311.html and a subsequent chat I've had with Robert on #opendaylight IRC, that fixes ANOTHER (possible) OOM, related to a ReadWriteTransaction - which does need fixing, thus https://git.opendaylight.org/gerrit/#/c/62205/ MUST go into stable/carbon. However, according to Robert, the OOM seen here is due to a TransactionChain (not a ReadWriteTransaction) which is not closed somewhere - and we don't know where, yet - I'll see if I can find it... |
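For context, a minimal sketch of the TransactionChain lifecycle being discussed, assuming the Carbon-era controller md.sal binding API; the class name and wiring are made up for illustration. Unlike an individual transaction, a chain also has to be close()d by the application once it is finished with it:

```java
// Sketch only: the close() calls are the part that, if forgotten, leaves chain state
// behind on the shard side as described in this bug.
import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.AsyncTransaction;
import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;

public final class ChainLifecycle implements TransactionChainListener, AutoCloseable {
    private final BindingTransactionChain chain;

    ChainLifecycle(DataBroker broker) {
        this.chain = broker.createTransactionChain(this);
    }

    void writeSomething() {
        WriteTransaction tx = chain.newWriteOnlyTransaction();
        // tx.put(...) with real data would go here
        tx.submit();
    }

    @Override
    public void close() {
        chain.close(); // must be called when the component is done with the chain
    }

    @Override
    public void onTransactionChainFailed(TransactionChain<?, ?> failedChain, AsyncTransaction<?, ?> tx, Throwable cause) {
        // a failed chain is unusable; close it and create a fresh one if more work is pending
        failedChain.close();
    }

    @Override
    public void onTransactionChainSuccessful(TransactionChain<?, ?> successfulChain) {
        // nothing to do
    }
}
```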
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> a TransactionChain (not a ReadWriteTransaction) which is not closed somewhere

https://git.opendaylight.org/gerrit/#/c/62216/ fixes something I stumbled upon in netvirt bgpmanager BgpUtil which could be related to that; not sure, but let's get it in as well (together with c/62205).

> rovarga: try to reproduce first with the fix for 9028

I'm not very clear as to the relationship of this to

Meanwhile I've also seen a number of errors in the (huge) Karaf log from this which I missed earlier and which may be related (e.g. openflowplugin IllegalStateException: Attempted to close chain with outstanding transaction PingPongTransaction), and will now be opening more fine-grained individual sub-bugs blocking this one for those points. |
| Comment by A H [ 23/Aug/17 ] |
|
A patch was submitted to revert the changes and fix this bug in Carbon SR2: https://git.opendaylight.org/gerrit/#/c/62205/

To better assess the impact of this bug and fix, could someone from your team please help us identify the following: |
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?

Dunno, but most likely this memory leak is present in Carbon as well.

> Severity: Could you elaborate on the severity of this bug?

Yes; without this fixed, ODL is not usable at scale.

> Workaround: Is there a workaround such that we can write a release note instead?

Nope.

> Testing: Could you also elaborate on the testing of this patch?

Only reproducible during real scale testing using e.g. OpenStack's rally benchmarking tool. Will be re-tested internally after the patches (which are low risk) have been merged to the stable branch.

> Impact: Does this fix impact any dependent projects?

I'm not sure what this means. |
| Comment by Michael Vorburger [ 23/Aug/17 ] |
|
> seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's

A closer look with MAT reveals it's (obviously) transactionChains. The real difficulty is finding what application code created the TransactionChain that led to this in the first place... Other than "stabbing around in the dark" (which is basically what I've been doing above so far), I can't yet see how to trace the ShardDataTreeTransactionChain instances in that Map back to what originally caused them to not be closed.

This reminds me vaguely of work I did on a similar problem, in JDBC, some 15 years ago. At the time, we wrote a wrapping JDBC driver which would do the throw new Exception() / catch (Exception e) kind of thing when you opened a connection, keep that exception around somewhere, and when we saw lots of opened-and-never-closed connections, we could look at those stack traces to find the culprit. – I wonder if it would be possible to write something like this for ODL mdsal... at the ShardDataTreeTransactionChain level I cannot see how, because from the little I understand that's after Akka, so I don't see how to trace it back up to the application. But maybe at the other end, at the DataBroker API, we could do such tracking... similar to what I helped Josh come up with at the time for that mdsal-trace thing we did. |
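A rough sketch of that "remember where it was created" idea in plain Java - all class and method names here are hypothetical, this is not an existing ODL or mdsal API:

```java
// Sketch only: a wrapper records a Throwable at creation time; dumping the still-open
// instances later points back at the application code that never closed them.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class LeakTracker {
    private static final Map<Object, Throwable> OPEN = new ConcurrentHashMap<>();

    private LeakTracker() {
    }

    /** Call when a chain/transaction/connection is created; the Throwable captures the caller's stack. */
    public static void opened(Object resource) {
        OPEN.put(resource, new Throwable("allocation site"));
    }

    /** Call when the resource is properly closed. */
    public static void closed(Object resource) {
        OPEN.remove(resource);
    }

    /** Dump the creation stack traces of everything still open, e.g. from a shell command or JMX. */
    public static void dumpOpen() {
        OPEN.forEach((resource, allocationSite) -> {
            System.err.println("Still open: " + resource);
            allocationSite.printStackTrace();
        });
    }
}
```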
| Comment by Michael Vorburger [ 24/Aug/17 ] |
|
> look at that to find the culprit. – I wonder if it would be possible to

According to https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001316.html, mdsal already has such a feature, via debug-transactions=true in a datastore.cfg! None of the documentation available for OpenDaylight on Amazon.com |
| Comment by Michael Vorburger [ 24/Aug/17 ] |
|
> mdsal already has such a feature

On https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001323.html Tom has clarified that while there is a feature vaguely related to this (transaction-debug-context-enabled=true in a datastore.cfg, not "debug-transactions"), it will not help to find the real culprit of this OOM. Therefore, I'd like to suggest https://git.opendaylight.org/gerrit/#/c/62294/ as a new tool which can be used to find the root cause of this (and, in the future, similar) OOMs. |
| Comment by Robert Varga [ 25/Aug/17 ] |
|
Can we re-create this in CSIT?

I really think an audit of how the components in the netvirt stack are using TransactionChains is in order. It's not as if there are hundreds of classes:

./netvirt/vpnservice/bgpmanager/bgpmanager-impl/src/main/java/org/opendaylight/netvirt/bgpmanager/BgpUtil.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain; |
| Comment by Robert Varga [ 25/Aug/17 ] |
|
Also, can we get that internal test hooked up with jenkins and/or get logs from it? |
| Comment by Michael Vorburger [ 28/Aug/17 ] |
|
> I really think an audit of how the components in the

Yeah... I'm planning to go at this not via blind code audits, but by coming up with something which will properly track ANY non-closed TransactionChain (and in turn its Txs) - watch

> Can we re-create this in CSIT?

As far as I know, this error has never been hit in CSIT - probably because those tests either do not run long enough, or do not "stress test" resources as much?

> get that internal test hooked up with jenkins and/or get logs from it?

Unfortunately not, because the Red Hat internal scale testing shared infrastructure environment where this was found is currently exercised manually per release; it is not a permanently available CI/CD type environment with fully automated jobs which would hit this. It would be cool to have that, and I'll raise it internally, but this is more of a medium-term idea, not feasible for tomorrow, for various reasons. |
| Comment by Michael Vorburger [ 05/Sep/17 ] |
|
Therefore, I've removed a number of unrelated bugs that this issue previously "depended on". I expect (hope!) that we'll be able to re-verify and then close this issue as a duplicate of openflowplugin |
| Comment by Kit Lou [ 06/Sep/17 ] |
|
Should we elevate |
| Comment by Michael Vorburger [ 06/Sep/17 ] |
|
> Should we elevate

Yes, good idea. I've done it, and done some cleanup of dupes, to KISS.

> Is openflowplugin the only area of work left related to this bug?

Repeat copy/paste from above: But I'd like to keep this controller bug open separately from openflowplugin's until we have confirmed that we are not seeing this OOM anymore (whatever its cause ultimately turns out to be). |
| Comment by Michael Vorburger [ 07/Sep/17 ] |
|
Hereby confirming that on a re-test of a custom distribution of the latest (yesterday's) carbon/stable with c/62674/5 (NB Patch Set 5; it's still WIP) pulled in, we do not see this anymore! It may have been c/62674 or anything else that went in recently (anywhere incl.

This has been verified both by "wow, now it doesn't blow up with OOM anymore" as well as via HPROF analysis through MAT. I'll therefore close this issue as soon as a final version of c/62674 has been merged into carbon, nitrogen and master. |
| Comment by Jamo Luhrsen [ 12/Sep/17 ] |
|
Michael, Sai, can one of you close this bug if it's fixed? |
| Comment by Michael Vorburger [ 13/Sep/17 ] |
|
> Michael, Sai, can one of you close this bug if it's fixed?

Nope, because as per #c16 above: "I'll therefore close this issue as soon as a final version of c/62674 has been merged into carbon, nitrogen and master."

AFAIK Sai/Sridhar have only confirmed not seeing this anymore in a custom carbon/stable distribution built by Daniel with c/62674/5 (NB /5!) pulled in. So strictly speaking, we should not close this issue and confirm that it is fixed until a final version of c/62674 (which is still WIP right now) has been merged and re-tested. Note also that, strictly speaking, we are not 100% sure whether it was c/62674/5 or anything else that went in recently (anywhere incl. |
| Comment by Robert Varga [ 13/Sep/17 ] |
|
|
| Comment by Kit Lou [ 14/Sep/17 ] |
|
Michael & Team, I have an action from TSC call today to follow up with you on this issue. Given all the work that has been done, is 9034 still a blocker? Thanks! |
| Comment by Michael Vorburger [ 14/Sep/17 ] |
|
Kit Lou, waiting for update on |