[GENIUS-97] IdPoolListener OOM Created: 03/Nov/17 Updated: 19/Apr/18 Resolved: 19/Apr/18 |
|
| Status: | Verified |
| Project: | genius |
| Component/s: | None |
| Affects Version/s: | Nitrogen-SR1 |
| Fix Version/s: | Nitrogen-SR1 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Michael Vorburger | Assignee: | Kency Kurian |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Internal downstream testing reports an OOM with the latest stable/nitrogen builds. HPROF analysis with MAT points to something really badly wrong in IdPoolListener. See the attached ZIP (ignore DJC, that's …). |
| Comments |
| Comment by Michael Vorburger [ 03/Nov/17 ] |
|
tpantelis points out that:
Does anyone know of any automatic way to get the JVM to do that, just like -XX:+HeapDumpOnOutOfMemoryError makes it create the HPROF? Or do we ask folks to keep running jstack until it blows up - is that the best we can offer? (Perhaps I could come up with some nifty thingie in infrautils that does this - start logging at WARN level if the thread count is above a (configurable) threshold; that would be quite useful in this kind of situation, wouldn't it?) |
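A minimal sketch of the kind of watchdog being floated here, purely illustrative (this is not an existing infrautils API; the class name, threshold, and polling interval are invented): poll ThreadMXBean periodically and, whenever the live thread count crosses a configurable threshold, log at WARN level with a full thread dump, so the evidence ends up in the log even if nobody is running jstack by hand.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical thread-count watchdog (illustrative only, not an existing infrautils class).
public class ThreadCountWatchdog implements AutoCloseable {
    private static final Logger LOG = LoggerFactory.getLogger(ThreadCountWatchdog.class);

    private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final int threshold;

    public ThreadCountWatchdog(int threshold) {
        this.threshold = threshold;
    }

    public void start() {
        // Check once per minute (the interval would be configurable in a real implementation).
        scheduler.scheduleAtFixedRate(this::check, 1, 1, TimeUnit.MINUTES);
    }

    private void check() {
        int count = threadMXBean.getThreadCount();
        if (count > threshold) {
            LOG.warn("Live thread count {} exceeds threshold {}; dumping all threads", count, threshold);
            for (ThreadInfo info : threadMXBean.dumpAllThreads(true, true)) {
                LOG.warn("{}", info);
            }
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```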
| Comment by Tom Pantelis [ 03/Nov/17 ] |
|
It's probably not the thread count - it's a particular thread that gets stuck. Maybe the app code gets deadlocked or blocks on something (there's a lot of that). Or it's the cumulative effect of many queued tasks that are slow, i.e. the classic producer running faster than the consumer. All of the listeners and the JC use unbounded queues, so if something gets stuck or runs slowly, it is inevitable that a queue will back up and hit OOM. What are the testers doing? Scale testing? Longevity? Is it reproducible? Are they tracking the memory usage? If so, as it climbs, start taking thread dumps. |
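To make the "producer faster than the consumer" failure mode concrete, here is a minimal, self-contained illustration (toy code, not ODL code): a single-threaded executor backed by the default unbounded queue is fed tasks faster than it can drain them, so the queue - and with it the heap - grows until an OutOfMemoryError.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy demonstration of an unbounded work queue backing up; not ODL code.
public class UnboundedBacklogDemo {
    public static void main(String[] args) throws InterruptedException {
        // newSingleThreadExecutor() uses an unbounded LinkedBlockingQueue internally.
        ExecutorService listenerExecutor = Executors.newSingleThreadExecutor();
        while (true) {
            // Producer: one "notification" roughly every millisecond.
            listenerExecutor.submit(() -> {
                try {
                    // Consumer: each task takes ~10 ms, so most tasks pile up in the queue.
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread.sleep(1);
        }
    }
}
```

Run with a small heap (for example -Xmx64m) and the queued tasks eventually exhaust it, which is the same shape of failure a stuck or slow listener produces against an unbounded queue.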
| Comment by Faseela K [ 04/Nov/17 ] |
|
Kency, will you be able to analyze this? I don't think it is IdPoolListener; probably too many other things have slowed down the system, and this is just an after-effect. |
| Comment by Michael Vorburger [ 04/Nov/17 ] |
|
INFRAUTILS-23 proposes to switch to using bounded queues in listeners, and …
We need to fix the bug hit during testing in this JIRA issue; I'll get us a jstack. |
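For reference, a hypothetical sketch of the direction INFRAUTILS-23 points in (illustrative only, not the actual infrautils change; names and sizes are invented): give listener executors a bounded work queue plus an explicit saturation policy, so a stuck or slow consumer causes back-pressure or visible rejections instead of a silently growing queue that ends in OOM.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical bounded listener executor factory (illustrative only).
public final class BoundedListenerExecutors {

    private BoundedListenerExecutors() {
    }

    public static ThreadPoolExecutor newBoundedExecutor(int poolSize, int queueCapacity) {
        return new ThreadPoolExecutor(
                poolSize, poolSize,
                0L, TimeUnit.MILLISECONDS,
                // Bounded queue: at most queueCapacity tasks can be waiting at any time.
                new ArrayBlockingQueue<>(queueCapacity),
                // CallerRunsPolicy applies back-pressure: when the queue is full, the submitting
                // thread runs the task itself instead of piling up yet more work.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

Whether to block, reject, or run-in-caller on saturation is a policy choice; the point is simply that the queue has a ceiling, so memory stays bounded and the backlog becomes observable.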
| Comment by Periyasamy Palanisamy [ 04/Nov/17 ] |
|
The task (in the update method) executed under IdPoolListener is a short-lived one, which I don't think causes thread starvation. [1] is invoked only in a corner case: the local id pool is updated when reclamation happens from another local id pool due to id exhaustion in the node's local and parent id pools. Does that happen in your tests? It's better to look at jstack and jmap output to see what is actually going on. |
| Comment by Kit Lou [ 06/Nov/17 ] |
|
Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ? Is this truly a blocker for Nitrogen-SR1? Thanks! |
| Comment by Michael Vorburger [ 06/Nov/17 ] |
|
> Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?

No, not at all. But it could turn out to have one and the same single cause as …

> Is this truly a blocker for Nitrogen-SR1? Thanks!

Yup. |
| Comment by Michael Vorburger [ 07/Nov/17 ] |
|
The attached jstack_info.log.xz contains 67 jstack dumps taken once per minute leading up to the OOM. Anyone spotting anything useful in there? |
| Comment by Michael Vorburger [ 07/Nov/17 ] |
|
This can only be reproduced with the latest stable/nitrogen HEAD (which will become SR1), NOT with the original September 26 Nitrogen release, so it broke recently. |
| Comment by Kit Lou [ 07/Nov/17 ] |
|
Do we have an ETA on resolution? I need input to assess how far we have to push out Nitrogen-SR1. Thanks! |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
klou: the ETA is when it's Fixed. We could try to do it earlier than when it's Done, but we would need a time machine. |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
klou & vivekshiv: FYI, I'm hoping to update and possibly close this one tomorrow; currently waiting for feedback from ltomasbo@redhat.com ... |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
Closing as CANNOT REPRO, because ltomasbo has clarified that he only hits the OOM with a 512 MB heap (instead of 2 GB) due to devstack; see …

What particularly confused me here was that the attached Leak Suspects MAT report I produced from his HPROF does show 2 GB, BUT this was only because (quote): "probably the previous one left the env (DB) unstable, with existing ports and flows... I then removed the env, recreated it with 0.7.1 (= SR1 to-be) with 2 GB on a completely fresh install. I'm re-testing it and am not seeing the OOM anymore." |