Type: Bug
Resolution: Done
Version: Carbon
Operating System: All
Platform: All
Bug ID: 9034
We're seeing an OOM in Red Hat internal scale testing:
Our scenario is a cluster of 3 nodes with odl-netvirt-openstack being stress tested by OpenStack's rally benchmarking tool.
The ODL version this is being seen with is a Carbon build from last Thursday; specifically https://nexus.opendaylight.org/content/repositories/opendaylight-carbon-epel-7-x86_64-devel/org/opendaylight/integration-packaging/opendaylight/6.2.0-0.1.20170817rel1931.el7.noarch/opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch.rpm
We've started testing by giving all 3 ODL node VMs just 2 GB, in an effort to better understand ODL's memory requirements. If it's simply "normal" that we cannot run with 2 GB in such a "real world" scenario, we'll gradually increase Xmx in this environment - but we wanted community feedback on this OOM with 2 GB first.
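For anyone reproducing this, a sketch of how the heap limit can be raised and an HPROF captured automatically on the next OOM. JAVA_MAX_MEM and EXTRA_JAVA_OPTS are stock Apache Karaf bin/setenv hooks; the 4096m value and the dump path are illustrative assumptions, not what we actually ran with:

```shell
# Sketch: fragment for $KARAF_HOME/bin/setenv (hooks are stock Apache Karaf;
# the value and dump path below are assumptions for illustration).
cat > /tmp/setenv.fragment <<'EOF'
export JAVA_MAX_MEM=4096m
export EXTRA_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"
EOF

# Source it the way bin/karaf would, and show what the JVM will get:
. /tmp/setenv.fragment
echo "Xmx: ${JAVA_MAX_MEM}"
echo "Extra opts: ${EXTRA_JAVA_OPTS}"
```

With -XX:+HeapDumpOnOutOfMemoryError in place, the JVM writes the HPROF itself at the moment of failure, which is more reliable than trying to attach jmap to an already-thrashing process.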
I'll attach, or provide links to, the usual HPROF heap dump, plus a "Leak Suspects" report produced by Eclipse MAT (https://www.eclipse.org/mat/), plus the Karaf log.
Basically what we're seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's its Map<LocalHistoryIdentifier, ShardDataTreeTransactionChain> transactionChains or its Map<Payload, Runnable> replicationCallbacks).
As far as I can tell from my limited understanding, this is not the same as CONTROLLER-1746 (there's nothing about "closedTransactions" anywhere), and not CONTROLLER-1755 either?
The Karaf log, among other errors which are less relevant in this context AFAIK, shows:
(1) a lot of "ERROR ShardDataTree org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | member-0-shard-default-operational: Failed to commit transaction ... java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@78fe0203 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@686861e8 differ" errors - this seems vaguely familiar from recent list posts; can someone remind me what those were about?
(2) at the very end, just before it blows up, genius' lockmanager seems unhappy: "Waiting for the lock ... is timed out. retrying again" - probably just an effect of this OOM? Or could lockmanager somehow be related and actually be the cause rather than the effect - could "bad application code" (e.g. not closing a DataBroker transaction correctly, or something like that) cause this OOM?
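Several of the linked bugs are of the form "TransactionChain created in ... is never closed". A minimal, self-contained model of why that bloats a per-shard map like the one seen in the heap dump (all class names below are hypothetical stand-ins for illustration, not the real ShardDataTree/controller code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: a per-shard registry keeps one entry per open
// transaction chain; the entry is only removed when the chain is closed.
class ShardChainRegistry {
    private final Map<Long, Chain> chains = new HashMap<>();
    private long nextId = 0;

    Chain newChain() {
        long id = nextId++;
        Chain c = new Chain(id, this);
        chains.put(id, c);          // entry lives until close()
        return c;
    }

    void remove(long id) { chains.remove(id); }
    int openChains() { return chains.size(); }

    static final class Chain implements AutoCloseable {
        private final long id;
        private final ShardChainRegistry owner;
        Chain(long id, ShardChainRegistry owner) { this.id = id; this.owner = owner; }
        @Override public void close() { owner.remove(id); }
    }
}

public class LeakDemo {
    public static void main(String[] args) {
        // Leaky pattern: chains created but never closed accumulate,
        // so the shard-side map grows without bound under stress.
        ShardChainRegistry shard = new ShardChainRegistry();
        for (int i = 0; i < 1000; i++) {
            shard.newChain();
        }
        System.out.println("leaked=" + shard.openChains());   // leaked=1000

        // Correct pattern: close each chain when done with it.
        ShardChainRegistry shard2 = new ShardChainRegistry();
        for (int i = 0; i < 1000; i++) {
            try (ShardChainRegistry.Chain c = shard2.newChain()) {
                // ... submit transactions on c ...
            }
        }
        System.out.println("leaked=" + shard2.openChains());  // leaked=0
    }
}
```

If this is what's happening, it would also fit the suspicion above that "bad application code" is the cause and the shard map growth merely the symptom.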
blocks:
- CONTROLLER-1763 On restarting ODL on one node, ODL on another node dies in a clustered setup (Resolved)
- NETVIRT-883 Umbrella parent issue for grouping all suspected transaction leaks (Resolved)

is blocked by:
- CONTROLLER-1755 RaftActor lastApplied index moves backwards (Resolved)
- CONTROLLER-1760 Tooling to find the real root cause culprit of memory leaks related to non-closed transactions (and tx chains) (Resolved)
- NETCONF-462 TransactionChain created in RestConnectorProvider.start line 87 is never closed (Resolved)
- OPNFLWPLUG-933 IllegalStateException: Attempted to close chain with outstanding transaction PingPongTransaction at org.opendaylight.openflowplugin.impl.device.TransactionChainManager.createTxChain (Resolved)
- OPNFLWPLUG-935 TransactionChain created in OperationProcessor.<init> line 36 is never closed (Resolved)
- OVSDB-423 TransactionChain created in TransactionInvokerImpl.<init> line 53 is never closed (Resolved)
- OVSDB-424 TransactionChain created in hwvtepsouthbound TransactionInvokerImpl.<init> line 61 is never closed (Resolved)

is duplicated by:
- CONTROLLER-1762 ODL is up and ports are listening but not functional (Resolved)

relates to:
- NETVIRT-985 java.lang.OutOfMemoryError: Java heap space (Resolved)