[NETVIRT-878] CSIT should help to detect possible memory leaks leading to OOM related to non-closed transactions (and tx chains) early
Created: 31/Aug/17  Updated: 22/Jan/20

Status: In Progress
Project: netvirt
Component/s: General
Affects Version/s: Nitrogen
Fix Version/s: Fluorine-SR2, Neon

Type: Improvement
Priority: Medium
Reporter: Michael Vorburger
Assignee: Srinivas Rachakonda
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks NETVIRT-1010 OOM and other memory issues Resolved
is blocked by NETVIRT-1089 Add trace:transactions to suite teard... In Progress
is blocked by CONTROLLER-1760 Tooling to find the real root cause c... Resolved
is blocked by CONTROLLER-1764 Karaf 4: odl-mdsal-trace cannot "just... Resolved
is blocked by GENIUS-102 New OOM due to more TX leaks seen in ... Resolved
is blocked by OPNFLWPLUG-961 New OOM due to more TX leaks seen in ... Resolved
is blocked by OVSDB-435 New OOM due to more TX leaks seen in ... Resolved
is blocked by OPNFLWPLUG-982 Suspected TransactionChain leak in Tr... Resolved
is blocked by NETVIRT-883 Umbrella parent issue for grouping al... Resolved
is blocked by NETVIRT-985 java.lang.OutOfMemoryError: Java heap... Resolved
is blocked by NEUTRON-153 Neutron NB API feature fails to insta... Verified

 Description   

CONTROLLER-1760's new "tooling to find the real root cause culprit of memory leaks related to non-closed transactions" should ideally be run during CSIT already, something like this:

At the start, before installing odl-netvirt-openstack, run "feature:install odl-mdsal-trace", and once that has completed, run "feature:install --no-auto-refresh odl-netvirt-openstack" (because of CONTROLLER-1764). Then do loads of interesting stuff, and at the end run "trace:transactions". If that returns anything other than the one-line "all good" message, then start crying like a baby and fail CSIT (and show the output from trace:transactions, which will contain long details about which transactions were not closed) ...
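The pass/fail decision at the end of that procedure could be sketched roughly as follows. This is a minimal Python sketch, not the actual CSIT keyword: the function names are hypothetical, and the marker string is simply copied from the TracingBroker output quoted in the comments of this issue. Because the exact wording of the "all good" message is not given here, the sketch looks for the leak marker rather than matching the success message.

```python
import re

# Marker copied from the TracingBroker output quoted in this issue's comments.
LEAK_MARKER = "TracingBroker found some not yet (or never..) closed transaction[chain]s!"

# Terminal control sequences (e.g. ESC[?2004l, ESC>) that an interactive
# Karaf shell session may wrap around the command output.
TERMINAL_CODES = re.compile(r"\x1b\[[0-9;?]*[a-zA-Z]|\x1b[=>]")


def strip_terminal_codes(text):
    """Remove terminal control sequences from captured shell output."""
    return TERMINAL_CODES.sub("", text)


def reports_open_transactions(trace_output):
    """Return True if the trace:transactions output contains the leak marker."""
    return LEAK_MARKER in strip_terminal_codes(trace_output)
```

A teardown step could call reports_open_transactions() on the captured command output and, when it returns True, fail the run and attach the full output.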



 Comments   
Comment by Michael Vorburger [ 06/Dec/17 ]

I'm not 100% sure whether, on master, once we're fully through with NETVIRT-985, you'll truly see no output from the trace:transactions CLI command. But I think the best is to "just do it", and then, before merging the CSIT extension that does this, shout and let me know. At that point I'll fix or whitelist / exclude whatever you still see.

Comment by Michael Vorburger [ 08/Jan/18 ]

jluhrsen has now started work related to this on Gerrit topic feature-install.

Comment by Jamo Luhrsen [ 08/Jan/18 ]

It looks like the changes to install this mdsal trace and then netvirt afterward are causing some
CSIT failures. Here is a sandbox [link|https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-ocata-jamo-upstream-stateful-nitrogen/1/] (it will be invalid in one week).

Essentially, the high-level problem is that OpenStack instances are not getting IPs. Digging a little deeper, it doesn't look like
the OVS instances are getting programmed with flows.

any ideas?

 

Comment by Michael Vorburger [ 08/Jan/18 ]

jluhrsen as per private IRC chat, more as a FTR for myself: That job (above) is for nitrogen, but the trace:transactions is not available for Nitrogen anyway (I've never back patched it). I therefore propose to focus this only on Oxygen - even if your first phase isn't even about trace:transactions just yet, it will later have to be, so might as well focus only on one release.

The only thing in the Karaf log of your Nitrogen CSIT job for this is this error:

2018-01-08 19:00:27,797 | WARN | ender-2-thread-1 | AbstractLifeCycle | 282 - org.eclipse.jetty.util - 9.2.21.v20170120 | FAILED HttpServiceContext ... httpContext=WebAppHttpContext ... org.opendaylight.neutron.northbound-api - 374 ... java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader java.lang.NoClassDefFoundError: javax/ws/rs/ext/MessageBodyReader

perhaps that is causing the 404 you wrote on IRC you saw in neutron log:

DEBUG networking_odl.common.client [-] Exception from ODL: 404 Client Error: Not Found for url: http://10.30.170.91:8080/controller/nb/v2/neutron/ports

If we are seeing this NoClassDefFoundError on Oxygen master as well, and if we are only seeing it when you first feature:install the odl-mdsal-trace then we would have to dig into that... but the easiest then is probably just to wait for skitt's fix for CONTROLLER-1764 to land and then just add odl-mdsal-trace as the first in the list of boot features...

Comment by Jamo Luhrsen [ 09/Jan/18 ]

if using featuresBoot, I don't see the problem loading neutron NB API, but it does show up for me even locally doing the
steps with feature:install. I know we can just change our scripts to simply use featuresBoot, but it's probably a good excuse
for us to figure out why this is happening when we feature:install. Maybe there is some bug we can fix.

I do like having the option of doing feature:install vs featuresBoot in our infra, so no need to stop that work. It's pretty much
ready at this point, but I don't want to merge the netvirt part of it since we know it will break things.

btw, just to be clear, we did run this job with oxygen and saw the same 14 failures:
https://jenkins.opendaylight.org/sandbox/job/netvirt-csit-1node-openstack-ocata-jamo-upstream-stateful-oxygen/2/

Comment by Michael Vorburger [ 09/Jan/18 ]

NEUTRON-153 will look into the problem loading neutron NB API with feature:install.

I suggest that in this issue we move forward using featuresBoot instead of feature:install.
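As a rough illustration of the featuresBoot approach, the sketch below puts odl-mdsal-trace first in the boot feature list of a Karaf features configuration. It is a hypothetical helper, not the actual builder change: it assumes the configuration text (commonly etc/org.apache.karaf.features.cfg in a Karaf distribution) holds featuresBoot as a single comma-separated line, which may not match a real distribution.

```python
def prepend_boot_feature(cfg_text, feature):
    """Return cfg_text with `feature` first in the featuresBoot list.

    Assumes featuresBoot is a single-line, comma-separated property.
    """
    out = []
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "featuresBoot":
            features = [f.strip() for f in value.split(",") if f.strip()]
            if feature not in features:
                # Put the tracing feature ahead of everything else.
                features.insert(0, feature)
            line = "featuresBoot = " + ",".join(features)
        out.append(line)
    return "\n".join(out)
```

The helper leaves the list unchanged when the feature is already present, so applying it twice gives the same result as applying it once.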

Comment by Jamo Luhrsen [ 21/Feb/18 ]

with this builder patch to install odl-mdsal-trace as an additional feature, and running the l2 connectivity suite with this int/test patch we get the following output:

TracingBroker found some not yet (or never..) closed transaction[chain]s!
[NB: If no stack traces are shown below, then enable transaction-debug-context-enabled in mdsaltrace_config.xml]
DataBroker : createTransactionChain()
3x TransactionChains opened but not closed here:
    (...)
    org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.<init>(PingPongTransactionChain.java:104)
    org.opendaylight.controller.md.sal.dom.broker.impl.PingPongDataBroker.createTransactionChain(PingPongDataBroker.java:49)
    org.opendaylight.controller.md.sal.dom.broker.impl.PingPongDataBroker.createTransactionChain(PingPongDataBroker.java:28)
    (...)
    org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter.<init>(BindingDOMTransactionChainAdapter.java:45)
    org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataBrokerAdapter.createTransactionChain(BindingDOMDataBrokerAdapter.java:74)
    (...)
    org.opendaylight.openflowplugin.common.txchain.TransactionChainManager.createTxChain(TransactionChainManager.java:81)
    org.opendaylight.openflowplugin.common.txchain.TransactionChainManager.activateTransactionManager(TransactionChainManager.java:109)
    org.opendaylight.openflowplugin.impl.device.DeviceContextImpl.lazyTransactionManagerInitialization(DeviceContextImpl.java:649)
    org.opendaylight.openflowplugin.impl.device.DeviceContextImpl.instantiateServiceInstance(DeviceContextImpl.java:592)
    org.opendaylight.openflowplugin.impl.lifecycle.GuardedContextImpl.instantiateServiceInstance(GuardedContextImpl.java:86)
    java.util.concurrent.CopyOnWriteArrayList.forEach(CopyOnWriteArrayList.java:891)
    org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl.instantiateServiceInstance(ContextChainImpl.java:74)
    org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648)
    java.util.ArrayList.forEach(ArrayList.java:1257)
    org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645)
    org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506)
    org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453)
    org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433)
    org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234)
    org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)
    org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44)
    org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33)
    org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
    (...)
opendaylight-user@root>

 

what can we make of this so far? I think we can probably merge the jjb and test
patches as a start and then figure out if we want to start failing things if we
get certain outputs. I think the idea was that we should never see any output,
unless there is a bug being uncovered.
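To decide later which outputs should fail a run, the counts in that output could be extracted mechanically. The following is a minimal Python sketch with hypothetical names; it only assumes that each leak entry contains a phrase of the form "Nx <Kind> opened but not closed here", as in the "3x TransactionChains opened but not closed here" line quoted above. Other phrasings the tool may print are not handled.

```python
import re
from collections import Counter

# Pattern based on the single example line seen in this issue.
OPENED = re.compile(r"(\d+)x (\w+) opened but not closed here")


def count_unclosed(trace_output):
    """Sum the reported open counts per kind (e.g. 'TransactionChains')."""
    totals = Counter()
    for count, kind in OPENED.findall(trace_output):
        totals[kind] += int(count)
    return dict(totals)
```

For the output above this would report three open TransactionChains, all attributed to the openflowplugin TransactionChainManager stack.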

Comment by Michael Vorburger [ 26/Feb/18 ]

> what can we make of this so far?

A real possible transaction leak in openflowplugin, which could lead to OOM on longevity, or a false positive of the TracingBroker (trace:transactions). Watch OPNFLWPLUG-982 for more on that front.

> I think we can probably merge the jjb and test
> patches as a start and then figure out if we want to start failing things if we
> get certain outputs. I think the idea was that we should never see any output,
> unless there is a bug being uncovered.

Confirming this; that is the goal here (and we should keep this issue open until we have achieved it). But I would start merging things gradually rather than wait for the full solution, with the goal of enforcing a CSIT failure whenever we see any output from trace:transactions after a test run, within a reasonable time frame: a couple of weeks, perhaps?

But a question: Is OPNFLWPLUG-982 just the first or the only output, after all of / a full netvirt CSIT? If it's the latter, then things re. Tx leaks (=OOMs) currently aren't so bad actually... that would mean that no OOM related regressions have been introduced since our last scale test (or they are in scenarios which the CSIT does not cover).

Comment by Sam Hague [ 06/Apr/18 ]

vorburger jluhrsen is there anything else for this issue?

Comment by Jamo Luhrsen [ 06/Apr/18 ]

> Michael Vorburger Jamo Luhrsen is there anything else for this issue?

yeah, I need to finish this. We had it in for a bit, but it caused the logging
issues, so we reverted it. I'll take a look at getting it back soon.

Comment by Michael Vorburger [ 06/Apr/18 ]

shague yeah please don't close this one, this is NOT done - and we really have to finally do this!!

Comment by Jamo Luhrsen [ 07/Apr/18 ]

Thanks for confirming what I said, vorburger. Just FYI, this is NOT at the top of my to-do list. I DID get
it finished once before, but we hit a bug with a gazillion log messages ruining the deployment, and that quickly
scared us into reverting it. I know that logging bug was resolved. "I'll take a look at getting it back soon."

Comment by Jamo Luhrsen [ 07/Apr/18 ]

actually, I forgot about this patch. Running it in the sandbox
now and depending on what that looks like, we might be able to get this in without much effort.

Comment by Jamo Luhrsen [ 09/Apr/18 ]

the recent two sandbox jobs I ran with this CSIT patch and odl-mdsal-trace feature have failures which we
do not expect to see. I have not had time to dig in to the failures, but until we get those figured out, we
can't move forward.

two sandbox failures:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/35/jamo-netvirt-csit-1node-openstack-queens-upstream-stateful-oxygen/2/robot-plugin/log_full.html.gz
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/34/jamo-netvirt-csit-1node-openstack-queens-upstream-stateful-oxygen/3/robot-plugin/log_full.html.gz

I'll run it a few more times, as that's not really any work; just a few clicks.

Comment by Michael Vorburger [ 25/Jul/18 ]

We really should pick this old and long overdue idea up and finish it, somehow.

In parallel (and with a higher short-term priority for me) we will also finish NETVIRT-1318 and start GENIUS-176.

Comment by Sam Hague [ 02/Oct/18 ]

jluhrsen did we want to get this in?

Comment by Jamo Luhrsen [ 02/Oct/18 ]

yes, but we need two things.

1) run csit with this trace:transactions stuff going to verify no weird failures. In the past, it was causing
extra failures. The last time I had time to check, it was not causing trouble.

2) modify the step to fail when too many open transactions are there.

Comment by Abhinav Gupta [ 25/Nov/19 ]

any update here?

Comment by Jamo Luhrsen [ 25/Nov/19 ]

The work went in to add "trace:transactions" to each test case teardown, but it looks like it's not working now; maybe the feature is no longer installed properly? We'd need to find someone to own this jira now and take it forward.

Once that command is working again, the next step would be to understand whether extra transactions are not being closed and, if so, mark the
test case as failed. That will take some thinking and understanding to get right, though, because there will always be some transactions open
when you call the command. What we are looking for is the case where the transaction count increases over time without going back down, which
would be a memory leak scenario and could blow up as an OOM in a long-running production environment.
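One very simple way to express "increases over time without going back down" is sketched below in Python. It is only a starting heuristic under stated assumptions: the function name and the min_samples default are hypothetical, and the input is assumed to be a list of open-transaction counts sampled at successive test case teardowns. A real check would probably need to tolerate noise and a baseline of legitimately open transactions.

```python
def looks_like_leak(counts, min_samples=3):
    """Flag a series of open-transaction counts that only ever grows.

    Returns True when there are at least min_samples samples, the count
    never drops between consecutive samples, and it ends higher than it
    started. A constant, non-zero count is not treated as a leak.
    """
    if len(counts) < min_samples:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]
```

With this heuristic a stable count such as [3, 3, 3] passes, while [3, 5, 8] is flagged; a series that rises and falls again, such as [3, 9, 3], is not flagged.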

Comment by Abhinav Gupta [ 27/Nov/19 ]

Nishchya, can you please look into this?

Comment by Nishchya Gupta [ 22/Jan/20 ]

Hi Srini, please check if it is still valid.

Comment by Jamo Luhrsen [ 22/Jan/20 ]

this jira is not about fixing a bug that you need to check if it's still valid or not. It's about doing the work to add functionality to existing CSIT code
to find memory leaks coming from non-closed transactions. That work never really got finished.

Generated at Wed Feb 07 20:22:42 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.