[GENIUS-96] DataStoreJobCoordinator OOM Created: 03/Nov/17 Updated: 19/Apr/18 Resolved: 19/Apr/18 |
|
| Status: | Verified |
| Project: | genius |
| Component/s: | General |
| Affects Version/s: | Nitrogen-SR1 |
| Fix Version/s: | Nitrogen-SR1 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Michael Vorburger | Assignee: | Michael Vorburger |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Internal downstream testing reports an OOM with the latest stable/nitrogen builds. HPROF heap-dump analysis with MAT points to something really badly wrong in DataStoreJobCoordinator; see the attached ZIP. |
| Comments |
| Comment by Michael Vorburger [ 03/Nov/17 ] |
|
tpantelis dixit:
This somehow sounds familiar - didn't you mention something like this somewhere recently, k.faseela? BTW, this reminds me that during the code reviews for the move of the genius DJC to the infrautils JobCoordinator we added this to the JavaDoc:
and now that ^^^ is exactly what has happened here... it would probably be better if we made the JobCoordinator reject jobs beyond a certain (configurable) capacity, and start dumping whatever is stuck in the queue at that point? |
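For context, here is a minimal sketch of the "reject jobs beyond a configurable capacity" idea floated above, assuming a plain bounded queue; this is not the actual genius/infrautils JobCoordinator API, and the names (BoundedJobQueue, enqueue) are invented purely for illustration:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Hypothetical sketch only: a job queue that rejects new jobs once a configurable
 * capacity is reached, instead of growing without bound until the JVM OOMs.
 */
public class BoundedJobQueue {

    private final BlockingQueue<Runnable> queue;

    public BoundedJobQueue(int capacity) {
        // Bounded queue: once 'capacity' jobs are waiting, offer() starts failing.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    public void enqueue(Runnable job) {
        if (!queue.offer(job)) {
            // Reject loudly (this would also be the point to dump whatever is stuck
            // in the queue) rather than silently accepting an ever-growing backlog.
            throw new IllegalStateException(
                "Job queue full (" + queue.size() + " pending jobs); rejecting new job");
        }
    }

    public int size() {
        return queue.size();
    }
}
```

The point of the sketch is only the failure mode: a full queue becomes an immediate, diagnosable rejection instead of a slow heap exhaustion.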
| Comment by Faseela K [ 04/Nov/17 ] |
|
Yeah, this is what I was proposing that day as well. |
| Comment by Michael Vorburger [ 04/Nov/17 ] |
|
|
| Comment by Michael Vorburger [ 04/Nov/17 ] |
|
k.faseela said on IRC that earlier patch sets of https://git.opendaylight.org/gerrit/#/c/63884/ caused this kind of problem, but that the final one (which also went into Nitrogen) "should" have fixed the deadlock. So when we get the jstack, we should check whether it looks like it could have anything to do with that change, just to be sure. (She also mentioned that there will be a follow-up patch which will fix something re. a "slowing down of the DJC", but that's not related to the deadlock / OOM, AFAIK.) |
| Comment by Faseela K [ 05/Nov/17 ] |
|
Could https://git.opendaylight.org/gerrit/65146 have some impact on this? |
| Comment by Faseela K [ 06/Nov/17 ] |
|
Kency indicated that there were some lockmanager-related issues which used to leave some jobs stuck in the DJC; this is fixed in the review below: https://git.opendaylight.org/gerrit/#/c/61977/ Could you please review and merge? |
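For context, a minimal illustration of the failure mode being described, assuming a single worker draining an unbounded per-key queue; none of this is actual genius or lockmanager code, and all names here are hypothetical:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical demo only: one job blocked forever on a lock that is never granted
 * stalls the single worker, and every later job piles up in the unbounded queue --
 * the kind of backlog that then shows up as an OOM in a heap dump.
 */
public class StuckJobDemo {
    public static void main(String[] args) {
        // Unbounded work queue, standing in for a per-key job queue.
        LinkedBlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        // Single worker thread, standing in for the serialized per-key executor.
        ThreadPoolExecutor worker = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS, queue);
        // Stands in for a lock that is never granted.
        CountDownLatch neverGrantedLock = new CountDownLatch(1);

        // First job blocks indefinitely waiting for the "lock".
        worker.execute(() -> {
            try {
                neverGrantedLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // All subsequent jobs accumulate behind it.
        for (int i = 0; i < 1_000; i++) {
            worker.execute(() -> { /* real work would never get to run */ });
        }
        System.out.println("Jobs stuck behind the blocked one: " + queue.size());
        worker.shutdownNow();
    }
}
```

If the lockmanager fix referenced above prevents the first job from blocking forever, that backlog (and the resulting OOM) should not build up in the first place.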
| Comment by Kit Lou [ 06/Nov/17 ] |
|
Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ? Is this truly a blocker for Nitrogen-SR1? Thanks! |
| Comment by Michael Vorburger [ 06/Nov/17 ] |
|
> Is this issue related to https://jira.opendaylight.org/browse/NETVIRT-974 ?
No, not at all. But it could turn out to have one and the same single cause as
> Is this truly a blocker for Nitrogen-SR1? Thanks!
Yup. |
| Comment by Michael Vorburger [ 07/Nov/17 ] |
|
Stack trace just attached to |
| Comment by Michael Vorburger [ 07/Nov/17 ] |
|
This can only be reproduced with the latest stable/nitrogen HEAD (which will become SR1), NOT with the original September 26 Nitrogen release, so it broke recently. |
| Comment by Kit Lou [ 07/Nov/17 ] |
|
Do we have an ETA on resolution? We need input to assess how far we have to push out Nitrogen-SR1. Thanks! |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
klou: the ETA is when it's Fixed. We could try to do it earlier than when it's Done, but we would need a time machine. |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
Closing as CANNOT REPRODUCE, because ltomasbo has clarified that he only hits an OOM (reproducibly) on Nitrogen 0.7.0 and 0.7.1 (= SR1) with an Xmx of 512 MB heap, instead of the default 2 GB, with which it works for him. The 512 MB is the default when deploying ODL with devstack without specifying a different Xmx. We're proposing to fix that in https://review.openstack.org/#/c/518540/ to avoid future confusion. |
| Comment by Michael Vorburger [ 08/Nov/17 ] |
|
PS: We'll be adding proper JobCoordinator monitor-ability via |