[OPNFLWPLUG-833] OOM due to suspected memory leak in akka.dispatch.Dispatcher found by MAT in hprof Created: 14/Dec/16  Updated: 27/Sep/21  Resolved: 21/Jun/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tim Rozet Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive Bug7370_Threads.zip     Zip Archive java_pid19570_Leak_Suspects.zip     Zip Archive karaf.zip    
External issue ID: 7370
Priority: High

 Description   

OOM error seen about 4 hours after ODL bring up:

2016-12-14 12:34:35,306 | WARN | entLoopGroup-6-1 | NioServerSocketChannel | 136 - io.netty.common - 4.0.37.Final | Failed to create a new channel from an accepted socket.
java.lang.OutOfMemoryError: Java heap space

2016-12-14 12:34:27,756 | ERROR | ntLoopGroup-7-28 | ExceptionHandler | 285 - org.opendaylight.ovsdb.library - 1.3.2.Boron-SR2 | Exception occurred while processing connection pipeline
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: Java heap space

karaf logs attached



 Comments   
Comment by Tim Rozet [ 14/Dec/16 ]

Attachment karaf.zip has been added with description: karaf logs

Comment by Anil Vishnoi [ 14/Dec/16 ]

Hi Tim,

Do you have a heap dump from this OOM? It will help us figure out where the leak is actually happening. Also, do you have some detail about what operations were happening during these 4 hours of uptime?

Comment by Tim Rozet [ 14/Dec/16 ]

Due to the size of the heap dump, I uploaded it here:
http://artifacts.opnfv.org/apex/random/java_pid19570.hprof.zip

All that happened on this setup was an external network/subnet create plus a basic healthcheck, which consists of creating a tenant network, then bringing up a few instances and checking whether they got a DHCP IP (which they didn't). The deployment then sat idle until the crash.

Comment by Anil Vishnoi [ 15/Dec/16 ]

Hi Tim,

Quick question: are you running it in a clustered setup?

Comment by Tim Rozet [ 15/Dec/16 ]

There are multiple OpenStack neutron nodes, but ODL is only running as a single instance and not clustered.

Comment by Michael Vorburger [ 29/May/17 ]

Attachment java_pid19570_Leak_Suspects.zip has been added with description: MAT akka.dispatch.Dispatcher memory leak suspect HTML report

Comment by Michael Vorburger [ 29/May/17 ]

Just had a look at this one using MAT, see report in attachment.

This most likely isn't actually related to OVSDB or io.netty.handler.codec.DecoderException; that's just where the OOM happened to hit, so I'm moving the project and editing the summary.

I'll now email mdsal-dev & controller-dev to get first reactions about this.

Comment by Michael Vorburger [ 29/May/17 ]

Attachment Bug7370_Threads.zip has been added with description: MAT list of 604 threads with stack traces HTML report

Comment by Michael Vorburger [ 29/May/17 ]

https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001218.html

Comment by Michael Vorburger [ 19/Jun/17 ]

Moving project again, from mdsal to openflowplugin, because https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001219.html clarified that a listener (in openflowplugin) blocked an Akka Dispatcher thread instead of handling events asynchronously (and that led to an OOM there).

I took a closer look at org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation on master as of 3 weeks ago, and found that the code there has changed considerably in the meantime: the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore. https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001220.html seems to say the same.
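For illustration only, here is a minimal sketch of why a "blocking get on a Future" inside a listener can pile up memory. It is NOT the openflowplugin code and does not use Akka: a single-thread java.util.concurrent.ThreadPoolExecutor stands in for a dispatcher thread and its mailbox, CompletableFuture stands in for whatever Future the listener was waiting on, and all class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DispatcherBlockingSketch {

    /** One-thread pool with an unbounded queue: a stand-in for a dispatcher thread plus mailbox. */
    static ThreadPoolExecutor newDispatcher() {
        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    /** The listener blocks on a reply that never arrives; returns how many events are stuck. */
    static int backlogWithBlockingListener(int events) {
        ThreadPoolExecutor dispatcher = newDispatcher();
        CompletableFuture<String> reply = new CompletableFuture<>(); // assumption: never completed
        dispatcher.execute(() -> {
            try {
                reply.get(); // blocking get: parks the only dispatcher thread
            } catch (Exception e) {
                Thread.currentThread().interrupt();
            }
        });
        // Later events can only be queued, because the single thread is parked above.
        for (int i = 0; i < events; i++) {
            dispatcher.execute(() -> { });
        }
        int backlog = dispatcher.getQueue().size();
        dispatcher.shutdownNow(); // interrupt the parked thread so the JVM can exit
        return backlog;
    }

    /** The listener registers a callback and returns at once; returns how many events ran. */
    static int processedWithAsyncListener(int events) {
        ThreadPoolExecutor dispatcher = newDispatcher();
        CompletableFuture<String> reply = new CompletableFuture<>(); // still never completed
        AtomicInteger processed = new AtomicInteger();
        dispatcher.execute(() -> reply.thenAccept(value -> { })); // non-blocking: thread is freed
        for (int i = 0; i < events; i++) {
            dispatcher.execute(processed::incrementAndGet);
        }
        dispatcher.shutdown();
        try {
            dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println("stuck behind blocking listener: " + backlogWithBlockingListener(1000));
        System.out.println("processed with async listener:  " + processedWithAsyncListener(1000));
    }
}
```

In the blocking variant every queued event stays reachable from the unbounded queue for as long as the thread is parked, which is the general shape of retention a heap analyzer would attribute to the dispatcher. Whether this matches the exact retention path in the attached hprof is an assumption, not something verified here.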

I'm thus closing this particular OOM as "WORKSFORME" (in master; no interest in chasing this in stable maintenance). Of course, this does not mean that there could not be other OOM issues elsewhere; we need people to reproduce them and provide more HPROF dumps we can analyze (just as Tim first did here; thanks again).

Comment by Michael Vorburger [ 21/Jun/17 ]

> closer look at the org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation on master as of 3 weeks ago, and found that the code there meanwhile considerably changed - the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore.

I've also had a (very quick) look at the stable/boron branch, and it's the same there already: the code has changed considerably in the meantime, and the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore.

Generated at Wed Feb 07 20:33:30 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.