[OPNFLWPLUG-833] OOM due to suspected memory leak in akka.dispatch.Dispatcher found by MAT in hprof Created: 14/Dec/16 Updated: 27/Sep/21 Resolved: 21/Jun/17 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tim Rozet | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 7370 |
| Priority: | High |
| Description |
|
OOM error seen about 4 hours after ODL bring-up:
2016-12-14 12:34:35,306 | WARN | entLoopGroup-6-1 | NioServerSocketChannel | 136 - io.netty.common - 4.0.37.Final | Failed to create a new channel from an accepted socket.
2016-12-14 12:34:27,756 | ERROR | ntLoopGroup-7-28 | ExceptionHandler | 285 - org.opendaylight.ovsdb.library - 1.3.2.Boron-SR2 | Exception occurred while processing connection pipeline
Karaf logs attached. |
| Comments |
| Comment by Tim Rozet [ 14/Dec/16 ] |
|
Attachment karaf.zip has been added with description: karaf logs |
| Comment by Anil Vishnoi [ 14/Dec/16 ] |
|
Hi Tim, do you have a heap dump from this OOM? It will help us figure out where the leak is actually happening. Also, do you have some details about what operations were happening during these 4 hours of uptime? |
| Comment by Tim Rozet [ 14/Dec/16 ] |
|
Due to the size of the heap, I uploaded it here: All that happened on this setup was an external network/subnet create plus a basic healthcheck, which consists of creating a tenant network, then bringing up a few instances and checking whether they got a DHCP IP (which they didn't). The deployment then sat idle until the crash. |
| Comment by Anil Vishnoi [ 15/Dec/16 ] |
|
Hi Tim, quick question: are you running it in a clustered setup? |
| Comment by Tim Rozet [ 15/Dec/16 ] |
|
There are multiple OpenStack neutron nodes, but ODL is only running as a single instance and not clustered. |
| Comment by Michael Vorburger [ 29/May/17 ] |
|
Attachment java_pid19570_Leak_Suspects.zip has been added with description: MAT akka.dispatch.Dispatcher memory leak suspect HTML report |
| Comment by Michael Vorburger [ 29/May/17 ] |
|
Just had a look at this one using MAT; see the report in the attachment. This most likely isn't actually related to OVSDB or io.netty.handler.codec.DecoderException; that's just where it hit, so I'm moving the project and editing the summary. I'll now email mdsal-dev & controller-dev to get first reactions about this. |
| Comment by Michael Vorburger [ 29/May/17 ] |
|
Attachment Bug7370_Threads.zip has been added with description: MAT list of 604 threads with stack traces HTML report |
| Comment by Michael Vorburger [ 29/May/17 ] |
|
https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001218.html |
| Comment by Michael Vorburger [ 19/Jun/17 ] |
|
Moving project again, from mdsal to openflowplugin, because https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001219.html clarified that a listener (in openflowplugin), instead of dealing with events asynchronously, blocked an Akka Dispatcher thread (and led to an OOM there). I had a closer look at org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation on master as of 3 weeks ago, and found that the code there has since changed considerably: the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore. https://lists.opendaylight.org/pipermail/mdsal-dev/2017-May/001220.html seems to say the same. I'm thus closing this particular OOM as "WORKSFORME" (in master; no interest in chasing this in stable maintenance). Of course, this does not mean that there could not be other OOM issues elsewhere... we need people to reproduce them and provide more HPROF dumps we can analyze (just as Tim first did here; thanks again). |
| Comment by Michael Vorburger [ 21/Jun/17 ] |
|
> closer look at the org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation on master as of 3 weeks ago, and found that the code there meanwhile considerably changed - the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore.
I've also had a (very quick) look at the stable/boron branch, and it's the same there already: the code has changed considerably, and the "blocking get on a Future" which Tom picked up on in the trace doesn't seem to be there anymore. |
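For readers unfamiliar with the "blocking get on a Future" pattern discussed above, here is a minimal sketch of why it is a problem on a shared dispatcher thread. It is not the openflowplugin or Akka code: ListenerSketch, blockingListener, asyncListener and followUpRuns are hypothetical names, and a single-threaded ExecutorService merely stands in for a dispatcher. The assumption illustrated is that while a callback is parked in get(), events queued behind it cannot run and keep accumulating, which is one way such a backlog could grow.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names come from openflowplugin or Akka.
final class ListenerSketch {

    // Anti-pattern: the callback parks the dispatcher thread until the reply arrives.
    static Runnable blockingListener(CompletableFuture<String> reply, AtomicInteger handled) {
        return () -> {
            try {
                reply.get(); // blocks the thread that runs this callback
                handled.incrementAndGet();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        };
    }

    // Non-blocking alternative: register a continuation and return immediately.
    static Runnable asyncListener(CompletableFuture<String> reply, AtomicInteger handled) {
        return () -> reply.thenAccept(r -> handled.incrementAndGet());
    }

    // Runs one listener on a single-threaded stand-in dispatcher, then queues a
    // follow-up event. Returns true if the follow-up ran while the reply was
    // still pending, false if it was stuck behind the listener.
    static boolean followUpRuns(boolean blocking) {
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        CompletableFuture<String> reply = new CompletableFuture<>();
        AtomicInteger handled = new AtomicInteger();
        try {
            dispatcher.submit(blocking
                    ? blockingListener(reply, handled)
                    : asyncListener(reply, handled));
            CompletableFuture<Void> followUp = CompletableFuture.runAsync(() -> { }, dispatcher);
            try {
                followUp.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            reply.complete("done"); // releases a blocked listener, if any
            dispatcher.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("follow-up ran with blocking listener: " + followUpRuns(true));
        System.out.println("follow-up ran with async listener: " + followUpRuns(false));
    }
}
```

In this sketch the blocking variant holds the only dispatcher thread, so the follow-up event waits in the queue until the reply completes; the callback variant frees the thread straight away. A real dispatcher has more threads, but the same reasoning would apply once enough of them are blocked.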