[GENIUS-97] IdPoolListener OOM Created: 03/Nov/17  Updated: 19/Apr/18  Resolved: 19/Apr/18

Status: Verified
Project: genius
Component/s: None
Affects Version/s: Nitrogen-SR1
Fix Version/s: Nitrogen-SR1

Type: Bug Priority: Highest
Reporter: Michael Vorburger Assignee: Kency Kurian
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: java_pid65530_Leak_Suspects.zip (Zip Archive), jstack_info.log.xz (File)
Issue Links:
Blocks: is blocked by INFRAUTILS-23 Remove Executors.newSingleThreadExecu... (Confirmed, Priority: Highest)

 Description   

Internal downstream testing reports an OOM with the latest stable/nitrogen builds.

HPROF analysis by MAT points to something going badly wrong in IdPoolListener.

See the attached ZIP (ignore DJC; that's GENIUS-96).



 Comments   
Comment by Michael Vorburger [ 03/Nov/17 ]

tpantelis points out that:

Looks like an Executor queue is backed up - something's probably stuck - we need to get thread dumps.

Does anyone know an automatic way to get the JVM to do that, just like -XX:+HeapDumpOnOutOfMemoryError makes it create the HPROF? Or do we ask folks to keep running jstack until it blows up - is that the best we can offer? (Perhaps I could come up with some nifty thingie in infrautils that does this - start logging at WARN level if the thread count is above a (configurable) threshold; that would be quite useful in this kind of situation, wouldn't it?)
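For illustration only, a minimal sketch of what such a monitor could look like using plain java.lang.management APIs; the class name, threshold, and check interval are made up here and this is not an existing infrautils component:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical monitor: periodically logs a WARN if the live thread count
    // exceeds a configurable threshold, so a backlog is noticed before the OOM.
    public class ThreadCountWarner {
        private static final Logger LOG = LoggerFactory.getLogger(ThreadCountWarner.class);
        private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start(int maxThreads) {
            scheduler.scheduleAtFixedRate(() -> {
                int count = threads.getThreadCount();
                if (count > maxThreads) {
                    LOG.warn("Live thread count {} exceeds threshold {}", count, maxThreads);
                }
            }, 1, 1, TimeUnit.MINUTES);
        }

        public void stop() {
            scheduler.shutdownNow();
        }
    }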

Comment by Tom Pantelis [ 03/Nov/17 ]

It's probably not the thread count - it's a particular thread that gets stuck. Maybe the app code gets deadlocked or blocks on something (there's a lot of that). Or it's the cumulative effect of many queued tasks that are slow, i.e. the classic producer-faster-than-consumer problem. All of the listeners and the JC use unbounded queues, so if something gets stuck or runs slowly, it is inevitable that a queue will back up and hit OOM.

What are the testers doing? Scale testing? Longevity? Is it reproducible? Are they tracking the memory usage? If so, start taking thread dumps as it climbs.
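To illustrate the bounded-queue point above, a minimal sketch using only plain JDK classes; the queue capacity and rejection policy chosen here are illustrative assumptions, not what the listeners or JC actually use:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedExecutorExample {
        public static void main(String[] args) {
            // Hypothetical alternative to Executors.newSingleThreadExecutor():
            // the queue is capped, so a stuck or slow consumer causes back-pressure
            // instead of unbounded heap growth.
            ThreadPoolExecutor boundedExecutor = new ThreadPoolExecutor(
                    1, 1,                            // single worker thread
                    0L, TimeUnit.MILLISECONDS,
                    new LinkedBlockingQueue<>(1000), // bounded queue instead of unbounded
                    new ThreadPoolExecutor.CallerRunsPolicy());

            // When the queue is full, CallerRunsPolicy makes the submitting thread
            // run the task itself, which throttles the producer.
            boundedExecutor.submit(() -> System.out.println("task ran"));
            boundedExecutor.shutdown();
        }
    }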

Comment by Faseela K [ 04/Nov/17 ]

Kency, will you be able to analyze this? I don't think it is IdPoolListener; probably too many other things have slowed down the system, and this is just an after-effect.

Comment by Michael Vorburger [ 04/Nov/17 ]

INFRAUTILS-23 proposes switching to bounded queues in listeners, and INFRAUTILS-24 proposes the same for the JC.

INFRAUTILS-21 and maybe INFRAUTILS-22 will do something about an automatic way to dump deadlocked threads (there's no readily available JVM option for that, AFAIK); see the sketch below for the general idea.

In this JIRA issue we need to fix the bug hit during testing; I'll get us a jstack.
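As an illustration of what such an automatic deadlock dump could look like with the plain JDK ThreadMXBean; this is an assumed sketch, not the actual INFRAUTILS-21/22 implementation:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical periodic check: if the JVM reports deadlocked threads,
    // print their stack traces so they end up in the log.
    public class DeadlockDetector {
        private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

        public static void logDeadlocksIfAny() {
            long[] ids = THREADS.findDeadlockedThreads();
            if (ids != null) {
                for (ThreadInfo info : THREADS.getThreadInfo(ids, Integer.MAX_VALUE)) {
                    System.err.println("Deadlocked: " + info);
                }
            }
        }
    }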

Comment by Periyasamy Palanisamy [ 04/Nov/17 ]

The task (in the update method) executed under IdPoolListener is a short-lived one, which I don't think causes thread starvation. [1] is invoked only in a corner case: updating the local id pool when reclamation happens from another local id pool due to id exhaustion in the node's local and parent id pools. Does that happen in your tests?

It's better to look at jstack and jmap to see what is actually going on.

[1] https://git.opendaylight.org/gerrit/gitweb?p=genius.git;a=blob;f=idmanager/idmanager-impl/src/main/java/org/opendaylight/genius/idmanager/IdPoolListener.java;h=adc77023c371853543745fe50ac4f649c7c5bab6;hb=refs/heads/master#l67

Comment by Kit Lou [ 06/Nov/17 ]

Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?

Is this truly a blocker for Nitrogen-SR1? Thanks!

Comment by Michael Vorburger [ 06/Nov/17 ]

> Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?

No, not at all. But it could turn out to have one and the same root cause as GENIUS-96 - we don't know yet.

> Is this truly a blocker for Nitrogen-SR1? Thanks!

Yup.

Comment by Michael Vorburger [ 07/Nov/17 ]

The attached jstack_info.log.xz has 67 jstack dumps taken once per minute leading up to the OOM. Does anyone spot anything useful here?

Comment by Michael Vorburger [ 07/Nov/17 ]

This can only be reproduced with the latest stable/nitrogen HEAD (which will become SR1), NOT with the original September 26 Nitrogen release, so it broke recently.

Comment by Kit Lou [ 07/Nov/17 ]

Do we have an ETA on resolution? We need that input to assess how far we have to push out Nitrogen-SR1. Thanks!

Comment by Michael Vorburger [ 08/Nov/17 ]

klou: the ETA is when it's Fixed. We could try to do it earlier than when it's Done, but we would need a time machine.

Comment by Michael Vorburger [ 08/Nov/17 ]

klou & vivekshiv FYI: I'm hoping to update and possibly close this one tomorrow; currently waiting for feedback from ltomasbo@redhat.com ...

Comment by Michael Vorburger [ 08/Nov/17 ]

Closing as Cannot Reproduce, because ltomasbo has clarified that he only hits an OOM with 512 MB instead of 2 GB due to devstack; see GENIUS-96.

What was particularly confusing here was that the attached Leak Suspects MAT report I produced from his HPROF does show 2 GB, BUT this was only because (quote): "probably the previous one left the env (DB) unstable, with existing ports and flows... I then removed the env, recreated it with 0.7.1 (= SR1 to-be) with 2 GB on a completely fresh install. I'm re-testing it and am not seeing the OOM anymore."
