[GENIUS-96] DataStoreJobCoordinator OOM Created: 03/Nov/17  Updated: 19/Apr/18  Resolved: 19/Apr/18

Status: Verified
Project: genius
Component/s: General
Affects Version/s: Nitrogen-SR1
Fix Version/s: Nitrogen-SR1

Type: Bug Priority: Highest
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive java_pid22161_Leak_Suspects.zip    
Issue Links:
Blocks
is blocked by INFRAUTILS-24 JobCoordinator should use bounded Exe... Resolved
Relates
relates to INFRAUTILS-19 Coda Hale Dropwizard Metrics integration Resolved

 Description   

Internal downstream testing reports an OOM with the latest stable/nitrogen builds.

HPROF analysis with MAT points to something badly wrong in DataStoreJobCoordinator.

See the attached ZIP.



 Comments   
Comment by Michael Vorburger [ 03/Nov/17 ]

tpantelis dixit:

maybe some job got stuck or took a long time and backed up a queue

This somehow sounds familiar - didn't you mention something like this somewhere recently, k.faseela ?

BTW this reminds me that during the code reviews of the move of the genius DJC to the infrautils JC, we added this to the JavaDoc:

Enqueued jobs are stored in unbounded queues until they are run, this should be kept in mind as it might lead to an OOM.

and now that ^^^ is exactly what happened here... It would probably be better if we made the JobCoordinator reject jobs beyond a certain (configurable) capacity, and start dumping whatever is stuck in the queue at that point.
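The rejection behaviour proposed above, a capacity-bounded queue that refuses new jobs rather than growing without limit, can be sketched with a plain java.util.concurrent ThreadPoolExecutor. The class name and the capacity value are illustrative only; this is not the INFRAUTILS-24 implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedJobQueue {

    /** Returns true if the executor rejected a job once its bounded queue was full. */
    static boolean demoRejection() {
        // Hypothetical capacity; a real coordinator would make this configurable.
        int capacity = 2;
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(capacity),     // bounded, unlike an unbounded LinkedBlockingQueue
                new ThreadPoolExecutor.AbortPolicy());  // throw instead of queueing without bound

        CountDownLatch stuck = new CountDownLatch(1);
        // Simulate a stuck job occupying the single worker thread.
        executor.execute(() -> {
            try { stuck.await(); } catch (InterruptedException ignored) { }
        });
        // Fill the bounded queue behind the stuck job.
        for (int i = 0; i < capacity; i++) {
            executor.execute(() -> { });
        }

        boolean rejected = false;
        try {
            executor.execute(() -> { });  // queue is full -> RejectedExecutionException
        } catch (RejectedExecutionException e) {
            rejected = true;
        }

        stuck.countDown();     // unblock so the executor can drain and shut down
        executor.shutdown();
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected=" + demoRejection());
    }
}
```

With an unbounded queue the last execute() would have been silently enqueued instead, which is exactly how heap fills up when jobs stop completing.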

Comment by Faseela K [ 04/Nov/17 ]

Yeah.. this is what I was proposing that day as well.
In one of our tests, we saw that two jobs were in deadlock, and enough such jobs were queued up to consume all available threads, so none of the new jobs had resources to get executed.
Taking a jstack output at that time helped us identify the culprit.
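The failure mode described above, deadlocked jobs pinning the worker threads and being diagnosed via jstack, can also be reproduced and detected in-process. A minimal sketch (all names hypothetical, not genius code) using ThreadMXBean, which exposes the same monitor information a jstack dump shows:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CyclicBarrier;

public class DeadlockDemo {

    /** Creates two threads that deadlock on a pair of monitors, then returns
     *  the deadlocked thread IDs as reported by the JVM (or null on timeout). */
    static long[] createDeadlockAndDetect() throws InterruptedException {
        final Object lockA = new Object();
        final Object lockB = new Object();
        // Barrier guarantees both threads hold their first lock before trying the second.
        final CyclicBarrier barrier = new CyclicBarrier(2);

        Thread t1 = new Thread(() -> {
            synchronized (lockA) {
                await(barrier);
                synchronized (lockB) { }  // blocks forever: t2 holds lockB
            }
        });
        Thread t2 = new Thread(() -> {
            synchronized (lockB) {
                await(barrier);
                synchronized (lockA) { }  // blocks forever: t1 holds lockA
            }
        });
        t1.setDaemon(true);  // daemon, so the deadlocked threads don't prevent JVM exit
        t2.setDaemon(true);
        t1.start();
        t2.start();

        // Poll the JVM's built-in deadlock detector (same data jstack reports).
        ThreadMXBean mbean = ManagementFactory.getThreadMXBean();
        long[] ids = null;
        for (int i = 0; i < 100 && ids == null; i++) {
            Thread.sleep(50);
            ids = mbean.findDeadlockedThreads();
        }
        return ids;
    }

    private static void await(CyclicBarrier b) {
        try { b.await(); } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) throws InterruptedException {
        long[] deadlocked = createDeadlockAndDetect();
        System.out.println("deadlocked threads: "
                + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```

A coordinator could run such a check periodically and log or dump the stuck queue, rather than relying on someone taking a jstack by hand at the right moment.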

Comment by Michael Vorburger [ 04/Nov/17 ]

INFRAUTILS-24 will make the JobCoordinator reject jobs to avoid such OOMs in the future. The bug hit during testing still needs to be fixed in this JIRA issue; I'll get a jstack.

Comment by Michael Vorburger [ 04/Nov/17 ]

k.faseela said on IRC that earlier patch sets of https://git.opendaylight.org/gerrit/#/c/63884/ caused this kind of problem, but that the final one (which also went into nitrogen) "should" have fixed the deadlock. So when we get the jstack, we should watch for whether it looks like it could have anything to do with that change, just to be sure. (She also mentioned that there will be a follow-up patch which will fix something re. a "slowing down of the DJC", but that's not related to the deadlock / OOM, AFAIK.)

Comment by Faseela K [ 05/Nov/17 ]

https://git.opendaylight.org/gerrit/65146

Could this have some impact?

Comment by Faseela K [ 06/Nov/17 ]

Kency indicated that there were some lockmanager-related issues which used to leave jobs stuck in the DJC; these are fixed in the review below:

https://git.opendaylight.org/gerrit/#/c/61977/

Could you please review and merge?

Comment by Kit Lou [ 06/Nov/17 ]

Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?

Is this truly a blocker for Nitrogen-SR1? Thanks!

Comment by Michael Vorburger [ 06/Nov/17 ]

> Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?

no, not at all. But it could turn out to have the same root cause as GENIUS-97 though - we don't know yet.

> Is this truly a blocker for Nitrogen-SR1? Thanks!

Yup.

Comment by Michael Vorburger [ 07/Nov/17 ]

Stack trace just attached to GENIUS-97 shows a lot of DataStoreJobCoordinator ...

Comment by Michael Vorburger [ 07/Nov/17 ]

This can only be reproduced with the latest stable/nitrogen HEAD (which will become SR1), NOT with the original September 26 Nitrogen release, so it broke recently.

Comment by Kit Lou [ 07/Nov/17 ]

Do we have ETA on resolution? Need input to assess how far we have to push out Nitrogen-SR1. Thanks!

Comment by Michael Vorburger [ 08/Nov/17 ]

klou ETA is when it's Fixed. We could try to do it earlier than when it's Done, but we would need a time machine.

Comment by Michael Vorburger [ 08/Nov/17 ]

Closing as CANNOT REPRODUCE, because ltomasbo has clarified that he only hits an OOM (reproducibly) on Nitrogen 0.7.0 and 0.7.1 (=SR1) with an Xmx of 512 MB heap instead of the default 2 GB, with which it works for him. The 512 MB is the default when deploying ODL with devstack without specifying a different Xmx. We're proposing to fix that in https://review.openstack.org/#/c/518540/ to avoid future confusion.

Comment by Michael Vorburger [ 08/Nov/17 ]

PS: We'll be adding proper JobCoordinator monitor-ability via INFRAUTILS-19.

Generated at Wed Feb 07 19:59:54 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.