[CONTROLLER-1756] OOM due to huge Map in ShardDataTree Created: 23/Aug/17  Updated: 13/Nov/17  Resolved: 19/Sep/17

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Michael Vorburger Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1763 On restarting ODL on one node, ODL on... Resolved
blocks NETVIRT-883 Umbrella parent issue for grouping al... Resolved
is blocked by CONTROLLER-1755 RaftActor lastApplied index moves bac... Resolved
is blocked by CONTROLLER-1760 Tooling to find the real root cause c... Resolved
is blocked by NETCONF-462 TransactionChain created in RestConne... Resolved
is blocked by OPNFLWPLUG-933 IllegalStateException: Attempted to c... Resolved
is blocked by OPNFLWPLUG-935 TransactionChain created in Operation... Resolved
is blocked by OVSDB-423 TransactionChain created in Transacti... Resolved
is blocked by OVSDB-424 TransactionChain created in hwvtepsou... Resolved
Duplicate
is duplicated by CONTROLLER-1762 ODL is up and ports are listening but... Resolved
Relates
relates to NETVIRT-985 java.lang.OutOfMemoryError: Java heap... Resolved
External issue ID: 9034

 Description   

We're seeing an OOM in Red Hat internal scale testing:

Our scenario is a cluster of 3 nodes with odl-netvirt-openstack being stress tested by OpenStack's rally benchmarking tool.

The ODL version that this is being seen with is a a Carbon built last Thursday; specifically https://nexus.opendaylight.org/content/repositories/opendaylight-carbon-epel-7-x86_64-devel/org/opendaylight/integration-packaging/opendaylight/6.2.0-0.1.20170817rel1931.el7.noarch/opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch.rpm

We've started testing with giving all 3 ODL node VMs just 2 GB, in an effort to understand ODL memory requirements better. If it's just "normal" that we cannot run with 2 GB in such a "real world" scenario, we'll be gradually increasing Xmx in this environment - but we wanted to get community feedback for this OOM with 2 GB already.

I'll attach, or provide links, to the usual HPROF, plus a "Leak Suspects" report produced by https://www.eclipse.org/mat/, plus the Karaf log.

Basically what we're seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's its Map<LocalHistoryIdentifier, ShardDataTreeTransactionChain> transactionChains or its Map<Payload, Runnable> replicationCallbacks).

As far as I can tell from my limited understanding, this is not the same as CONTROLLER-1746 (there's nothing about "closedTransactions" anywhere..), and not CONTROLLER-1755 either?

The Karaf log, among other errors which are less relevant in this context AFAIK, shows:

(1) a lot of those "ERROR ShardDataTree org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | member-0-shard-default-operational: Failed to commit transaction ... java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@78fe0203 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi
.MaterializedContainerNode@686861e8 differ" errors - seems vaguely familiar from recent list posts, someone remind me what were those that all about again?

(2) at the very end before it blows up, seems to indicate genius' lockmanager perhaps not being too happy due to " Waiting for the lock ... is timed out. retrying again" - probably just an impact of this OOM? Or could that (lockmanager) somehow be related and actually be the cause not the effect - could "bad application code" (like not closing DataBroker transaction correctly, or something like that?) cause this OOM?



 Comments   
Comment by Michael Vorburger [ 23/Aug/17 ]

> links, to the usual HPROF, plus a "Leak Suspects" report produced by https://www.eclipse.org/mat/, plus the Karaf log. PLUS a thread stack dump

https://drive.google.com/open?id=0B7gTXYDlI5sLOVByTFNZN0JFNHc

Comment by Michael Vorburger [ 23/Aug/17 ]

> could "bad application code" (like not closing DataBroker
> transaction correctly, or something like that?) cause this OOM?

I don't suppose https://git.opendaylight.org/gerrit/#/c/62196/ could fix this kind of problem - is it important that all transactions are always either submitted or canceled?

Comment by Michael Vorburger [ 23/Aug/17 ]

> don't suppose https://git.opendaylight.org/gerrit/#/c/62196/ could fix

According to https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001311.html and a subsequent chat I've had with Robert on #opendaylight IRC, that fixes ANOTHER (possible) OOM, related to a ReadWriteTransaction - which does need fixing, thus https://git.opendaylight.org/gerrit/#/c/62205/ MUST go into stable/carbon.

However, according to Robert, the OOM seen here is due to a TransactionChain (not a ReadWriteTransaction) which is not closed somewhere - and we don't know where, yet - I'll see if I can find it...

Comment by Michael Vorburger [ 23/Aug/17 ]

> a TransactionChain (not a ReadWriteTransaction) which is not closed somewhere
> rovarga: vorburger: i.e. it is open, probably stored in a field of some object

https://git.opendaylight.org/gerrit/#/c/62216/ fixes something I stumbled upon in netvirt bgpmanager BgpUtil which could be related to that; not sure, but let's get it in as well (together c/62205).

> rovarga: try to reproduce first with the fix for 9028

I'm not very clear as to the relationship of this to CONTROLLER-1755 but based on above from Robert on IRC, making this bug dependant on CONTROLLER-1755; we'll wait for either its https://git.opendaylight.org/gerrit/#/c/62170/ and/or https://git.opendaylight.org/gerrit/#/c/62169/ to be merged, and then we'll retry with that in our scale lab.

Meanwhile I've also seen a number of errors in the (huge) Karaf log from this which I missed earlier and which may be related to this (e.g. openflowplugin IllegalStateException: Attempted to close chain with outstanding transaction PingPongTransaction), and will now be opening more fine grained individual sub-bugs blocking this one for those points.

Comment by A H [ 23/Aug/17 ]

A patch was submitted to revert the changes and fix this bug in Carbon SR2:

https://git.opendaylight.org/gerrit/#/c/62205/
https://git.opendaylight.org/gerrit/#/c/62216/

To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Carbon SR2 without it?
Workaround: Is there a workaround such that we can write a release note instead?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Michael Vorburger [ 23/Aug/17 ]

> Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?

Dunno, but most likely this memory leak is present in Carbon as well.

> Severity: Could you elaborate on the severity of this bug?
> Is this a BLOCKER such that we cannot release Carbon SR2 without it?

Yes; without this fixed, ODL is not usable at scale.

> Workaround: Is there a workaround such that we can write a release note instead?

Nope.

> Testing: Could you also elaborate on the testing of this patch?
> How extensively has this patch been tested? Is it covered by any unit tests or system tests?

Only reproducible during real scale testing using e.g. OpenStack's rally benchmarking tool.

Will be re-tested internally after patches (which are low risk) have been merged to stable branch.

> Impact: Does this fix impact any dependent projects?

I'm not sure that this means.

Comment by Michael Vorburger [ 23/Aug/17 ]

> seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's
> Map<LocalHistoryIdentifier, ShardDataTreeTransactionChain> transactionChains
> or its Map<Payload, Runnable> replicationCallbacks)

closer look with MAT reveals it's (obviously) transactionChains. The real difficulty is finding what application code created the TransactionChain that lead to this in the first place... Other than "stabbing around in the dark" (which is basically what I've been doing above so far), I can't see yet how to trace back the ShardDataTreeTransactionChain instances in that Map back to what originally caused them to not be closed.

Reminds me vaguely of work I did on a similar problem, in JDBC, like 15 years ago. At the time, we wrote this kind of wrapping JDBC driver which would do the throw

{ new Exception(); }

catch (Exception e) kinda thing when you opened a connection, keep that around somewhere, and when we saw lots of open and never closed connections, could look at that to find the culprit. – I wonder if it would be possible to write something like this for ODL mdsal... at the ShardDataTreeTransactionChain level I cannot see how, because from the little I understand that's after Akka, so I don't see how to trace that back up to the application. But maybe on the other end at the DataBroker API we could do such tracking... similar to what I helped Josh came up with at the time for that mdsal-trace thing we did.

Comment by Michael Vorburger [ 24/Aug/17 ]

> look at that to find the culprit. – I wonder if it would be possible to
> write something like this for ODL mdsal... [...] maybe on the other end at
> the DataBroker API we could do such tracking...

according to https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001316.html, mdsal already has such a feature, via debug-transactions=true in a datastore.cfg! None of the documentation available for OpenDaylight on Amazon.com has any further documentation about how to use it, so I'll try to https://www.google.ch/search?q=read+the+source+luke, and try to learn how using this it is possible to find the real cuplrit of this OOM...

Comment by Michael Vorburger [ 24/Aug/17 ]

> mdsal already has such a feature

On https://lists.opendaylight.org/pipermail/mdsal-dev/2017-August/001323.html Tom has clarified that while there is a feature vaguely related to this (transaction-debug-context-enabled=true in a datastore.cfg, not "debug-transactions"), it will not help to find out the real root cause culprit of this OOM.

Therefore, I'd like to suggest https://git.opendaylight.org/gerrit/#/c/62294/ as a new tool which can be used to find the root cause of this (and in the future similar) OOMs.

Comment by Robert Varga [ 25/Aug/17 ]

Can we re-create this in CSIT?

I really think an audit of how the components in the netvirt stack are using TransactionChains is in order. It's not that it is hundreds of classes:

./netvirt/vpnservice/bgpmanager/bgpmanager-impl/src/main/java/org/opendaylight/netvirt/bgpmanager/BgpUtil.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./netvirt/vpnservice/bgpmanager/bgpmanager-impl/src/main/java/org/opendaylight/netvirt/bgpmanager/BgpUtil.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./netvirt/vpnservice/bgpmanager/bgpmanager-impl/src/main/java/org/opendaylight/netvirt/bgpmanager/BgpUtil.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./ocpplugin/ocpplugin-impl/src/main/java/org/opendaylight/ocpplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./ocpplugin/ocpplugin-impl/src/main/java/org/opendaylight/ocpplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./ocpplugin/ocpplugin-impl/src/main/java/org/opendaylight/ocpplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./of-config/southbound/southbound-impl/src/main/java/org/opendaylight/ofconfig/southbound/impl/topology/OfconfigTopoHandler.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./of-config/southbound/southbound-impl/src/main/java/org/opendaylight/ofconfig/southbound/impl/topology/OfconfigTopoHandler.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./of-config/southbound/southbound-impl/src/main/java/org/opendaylight/ofconfig/southbound/impl/topology/OfconfigTopoHandler.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./openflowplugin/applications/bulk-o-matic/src/main/java/org/opendaylight/openflowplugin/applications/bulk/o/matic/FlowWriterTxChain.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./openflowplugin/applications/bulk-o-matic/src/main/java/org/opendaylight/openflowplugin/applications/bulk/o/matic/FlowWriterTxChain.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./openflowplugin/applications/bulk-o-matic/src/main/java/org/opendaylight/openflowplugin/applications/bulk/o/matic/FlowWriterTxChain.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./openflowplugin/applications/topology-manager/src/main/java/org/opendaylight/openflowplugin/applications/topology/manager/OperationProcessor.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./openflowplugin/applications/topology-manager/src/main/java/org/opendaylight/openflowplugin/applications/topology/manager/OperationProcessor.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./openflowplugin/applications/topology-manager/src/main/java/org/opendaylight/openflowplugin/applications/topology/manager/OperationProcessor.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainClosedException;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/device/TransactionChainManager.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/statistics/StatisticsContextImpl.java:import org.opendaylight.mdsal.common.api.TransactionChainClosedException;
./openflowplugin/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/statistics/StatisticsGatheringUtils.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainClosedException;
./ovsdb/hwvtepsouthbound/hwvtepsouthbound-impl/src/main/java/org/opendaylight/ovsdb/hwvtepsouthbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./ovsdb/hwvtepsouthbound/hwvtepsouthbound-impl/src/main/java/org/opendaylight/ovsdb/hwvtepsouthbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./ovsdb/hwvtepsouthbound/hwvtepsouthbound-impl/src/main/java/org/opendaylight/ovsdb/hwvtepsouthbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
./ovsdb/southbound/southbound-impl/src/main/java/org/opendaylight/ovsdb/southbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
./ovsdb/southbound/southbound-impl/src/main/java/org/opendaylight/ovsdb/southbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
./ovsdb/southbound/southbound-impl/src/main/java/org/opendaylight/ovsdb/southbound/transactions/md/TransactionInvokerImpl.java:import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;

Comment by Robert Varga [ 25/Aug/17 ]

Also, can we get that internal test hooked up with jenkins and/or get logs from it?

Comment by Michael Vorburger [ 28/Aug/17 ]

> I really think an audit of how the components in the
> netvirt stack are using TransactionChains is in order

Yeah.. I'm planning to go at this not via blind code audits, but by coming up with something which will properly track ANY non closed TransactionChain (and in turn its Txs) - watch CONTROLLER-1760... once I have that done, I expect to open more bugs / Gerrits for what will likely identify (like OPNFLWPLUG-933 but in other code paths).

> Can we re-create this in CSIT?

As far as I know, this error has never been hit in CSIT - probably because those either do not run long enough, or do not "stress test" resources as much?

> get that internal test hooked up with jenkins and/or get logs from it?

Unfortunately not, because the Red Hat internal scale testing shared infrastructure environment where this was found is currently manually exercised per release, not a permanently available CD/CI type environment with fully automated jobs which hits this. It could be cool to have that, and I'll raise it internally, but this is more of a medium term idea, not feasible for tomorrow, for various reasons.

Comment by Michael Vorburger [ 05/Sep/17 ]

NETVIRT-883 attachment meanwhile has what seems to confirm that this OOM is due to issues in openflowplugin (and that there are no other TransactionChain related problems elsewhere - which until NETVIRT-883 we could not yet "prove").

Therefore, I've removed a number of unrelated bugs that this issue previously used to "depend on". I expect (hope!) that we'll be able to re-verify and then close this issue as a duplicate of openflowplugin OPNFLWPLUG-933 / OPNFLWPLUG-935 / OVSDB-423 soon.

Comment by Kit Lou [ 06/Sep/17 ]

Should we elevate OPNFLWPLUG-933 and OPNFLWPLUG-935 as blockers? Is openflowplugin the only area of work left related to this bug?

Comment by Michael Vorburger [ 06/Sep/17 ]

> Should we elevate OPNFLWPLUG-933 and OPNFLWPLUG-935 as blockers?

Yes, good idea. I've done it, and done some clean up with dupes, to KISS.

> Is openflowplugin the only area of work left related to this bug?

Repeat copy/paste from above: NETVIRT-883 attachment meanwhile has what seems to confirm that this OOM is due to issues in openflowplugin (and that there are no other TransactionChain related problems elsewhere - which until NETVIRT-883 we could not yet "prove").

But I'd like to keep this controller bug open separately from openflowplugin's until we have confirmed that we are not seeing this OOM anymore (whatever its cause ultimately is found to be).

Comment by Michael Vorburger [ 07/Sep/17 ]

Hereby confirming that on a re-test of a custom distribution of latest (yesterday's) carbon/stable together with c/62674/5 (NB Patch Set 5; it’s still WIP) pulled in, we do not see this anymore!

It may have been c/62674 or anything else that went in recently (anywhere incl. CONTROLLER-1755 dunno if that could have helped for this?) which did the trick, but it works now, so great!

This has been verified by both "wow, now it didn't blew up anymore with OOM" as well as confirmed via HPROF analysis through MAT.

I'll therefore close this issue as soon as a final version of c/62674 has been merged it into carbon, nitrogen and master.

Comment by Jamo Luhrsen [ 12/Sep/17 ]

Michael, Sai, can one of you close this bug if it's fixed?

Comment by Michael Vorburger [ 13/Sep/17 ]

> Michael, Sai, can one of you close this bug if it's fixed?

Nope, because as per #c16 above: "I'll therefore close this issue as soon as a final version of c/62674 has been merged it into carbon, nitrogen and master."

AFAIK Sai/Sridhar only did not see this anymore in a custom carbon/stable distribution built by Daniel with c/62674/5 (NB /5!) pulled in. So strictly speaking, we should not close this issue and confirm that this is fixed until a final version of c/62674 (which is still WIP right now) has been merged and re-tested, theoretically.

Note however that strictly speaking we are not 100% sure if it was c/62674/5 or anything else that went in recently (anywhere incl. CONTROLLER-1755 dunno if that could have helped for this?) which did the trick here; confirming that would require re-testing a new "pure" carbon/stable (without c/62674).

Comment by Robert Varga [ 13/Sep/17 ]

CONTROLLER-1755 definitely did not fix any memory leaks, but it may have caused an error path memory leak to become masked again.

Comment by Kit Lou [ 14/Sep/17 ]

Michael & Team, I have an action from TSC call today to follow up with you on this issue. Given all the work that has been done, is 9034 still a blocker? Thanks!

Comment by Michael Vorburger [ 14/Sep/17 ]

Kit Lou, waiting for update on OPNFLWPLUG-933 and merge of c/62674.

Generated at Wed Feb 07 19:56:22 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.