[NETVIRT-974] karaf process killed by OS due to OOM condition Created: 02/Nov/17 Updated: 07/Nov/17 |
|
| Status: | Verified |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Nitrogen-SR1 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Jamo Luhrsen | Assignee: | Michael Vorburger |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | csit:failures | ||
| Σ Remaining Estimate: | Not Specified | Remaining Estimate: | Not Specified |
| Σ Time Spent: | Not Specified | Time Spent: | Not Specified |
| Σ Original Estimate: | Not Specified | Original Estimate: | Not Specified |
| Attachments: |
|
| Sub-Tasks: |
|
| Description |
|
Since approx. 10/20/2017 we have seen our Nitrogen and Oxygen CSIT jobs fail; the karaf console gets a message like this: /tmp/karaf-0.8.0-SNAPSHOT/bin/karaf: line 422: 11539 Killed
${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG" -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}" -Dkaraf.instances="${KARAF_HOME}/instances" -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}" -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}" -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir="${KARAF_DATA}/tmp" -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties" ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}
During a live debug session, we witnessed the java process consuming approx. 1.6G. |
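Since the process disappears with a plain "Killed" (the kernel OOM killer sends SIGKILL, so the JVM cannot write anything on the way out), one cheap way to gather evidence during a CSIT run is to log the karaf java process's resident set size over time. A minimal sketch, assuming a Linux /proc filesystem; the class name and output format are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative only: given a PID (e.g. the karaf java process), print that
// process's resident set size (VmRSS) from /proc/<pid>/status every five
// seconds, so RSS growth can be correlated with test activity before the
// kernel's "Killed" message appears.
public class RssLogger {
    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args[0];
        while (true) {
            List<String> status = Files.readAllLines(Paths.get("/proc/" + pid + "/status"));
            status.stream()
                  .filter(line -> line.startsWith("VmRSS:"))
                  .forEach(line -> System.out.println(System.currentTimeMillis() + " " + line));
            Thread.sleep(5000);
        }
    }
}
```

Run alongside the suite as something like `java RssLogger $(pgrep -f karaf)` (the pgrep pattern is a guess; any way of finding the karaf PID works) and keep its output with the job logs to see whether the roughly 1.6G figure keeps climbing until the kill.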
| Comments |
| Comment by Michael Vorburger [ 02/Nov/17 ] |
|
We should have an hs_err_pid*.log file, and we really need an *.hprof file, to know where the OOM happens and find a fix... have you been able to double-check whether these files aren't already produced somewhere? How about just doing a dumb: sudo find / -name "hs_err_pid*.log" The hprof should be produced by the JVM itself; I can see that in $ODL/bin/karaf we already have "-XX:+HeapDumpOnOutOfMemoryError" in DEFAULT_JAVA_OPTS... to fix the folder where it would write the HPROF into, you could add: -XX:HeapDumpPath=/a/folder/you/can/recover |
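For reference, a minimal throwaway sketch (not from this ticket; the class name, heap size and dump path below are made up) that can be used to confirm those two flags really do leave an .hprof behind on the test VM when the Java heap is exhausted:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: exhausts the Java heap so the JVM throws
// java.lang.OutOfMemoryError; when started with
// -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<dir>,
// the JVM writes a java_pid<pid>.hprof into <dir> before exiting.
public class HeapExhaustion {
    public static void main(String[] args) {
        List<byte[]> hog = new ArrayList<>();
        while (true) {
            hog.add(new byte[1024 * 1024]); // 1 MiB per iteration, never released
        }
    }
}
```

Running it with e.g. java -Xmx256m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps HeapExhaustion (sizes and path are illustrative) should leave a java_pid*.hprof in /tmp/dumps.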
| Comment by Michael Vorburger [ 02/Nov/17 ] |
|
jluhrsen clarified that there is no hs_err_pid*.log or *.hprof - the OS kills the JVM before those can be produced. https://lists.opendaylight.org/pipermail/controller-dev/2017-November/014034.html lists some ideas how that is possible. Attached ThreadExplosion.java. My best bet currently is on some JNI crap or "off heap allocation" (ByteBuffers) - this is going to be fun to figure out. The best next step I can think of is to figure out how to get an OS-level (not JVM) coredump kind of file when the process gets killed, and then get help analyzing that. |
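To make the distinction concrete: the sketch below (a guess at the kind of reproducer the ThreadExplosion.java name suggests, not the actual attachment) grows native memory through thread stacks rather than the Java heap, so -XX:+HeapDumpOnOutOfMemoryError has nothing to trigger on; if the kernel's OOM killer strikes first, the process is SIGKILLed and neither an hs_err_pid*.log nor an *.hprof is left behind.

```java
import java.util.concurrent.CountDownLatch;

// Illustrative only: every started thread reserves native stack memory
// (roughly one -Xss worth, outside the Java heap), so the process footprint
// grows while the heap stays small. Depending on OS limits, this ends in
// "java.lang.OutOfMemoryError: unable to create new native thread" or,
// if the machine runs out of memory first, a SIGKILL from the kernel.
public class ThreadGrowth {
    public static void main(String[] args) {
        CountDownLatch blockForever = new CountDownLatch(1);
        for (int i = 0; ; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    blockForever.await(); // park the thread so its stack is never released
                } catch (InterruptedException ignored) {
                    // illustrative sketch: nothing to clean up
                }
            }, "leaked-thread-" + id).start();
            if (id % 500 == 0) {
                System.out.println("started " + id + " threads");
            }
        }
    }
}
```

Direct ByteBuffers (ByteBuffer.allocateDirect) behave similarly in that the memory lives outside the Java heap, so a heap dump would not show where it went.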
| Comment by Jamo Luhrsen [ 06/Nov/17 ] |
|
well, it's looking more and more like something to do with the host OS and how it interacts with whatever is running on it. I have some tests in the sandbox that show the crash is coming with the host image as is. Another job does the same thing but first does a "yum update -y" before running the CSIT suites, and the crash is not seen. For completeness, the image label with the problem is: ... and the older image without the problem has a label of: ... Further, there were some very odd SFT failures we saw in the past that seemed like they could be explained by this. What should we do next? |
| Comment by Michael Vorburger [ 06/Nov/17 ] |
|
> but first doing a "yum update -y" before running the CSIT suites and the crash is not seen. Can you detail which packages get updated, from what (bad) version to which (good) versions? If only for a future "ah, that was...". > what should we do next? In an ideal world with unlimited resources, it would be interesting to understand what was wrong here. Back in the real world, if I read this right (please confirm that I did get this right..), the summary of what you're saying is basically that with the current "latest everything" we're good? If so, then my vote is fuhgeddaboudit ... PS: If we ever hit this kind of kernel OOM killer hitting ODL again, the sub-tasks on this issue now have steps which may be useful the next time something like this has to be investigated. |
| Comment by Kit Lou [ 06/Nov/17 ] |
|
This is no longer a blocker for Nitrogen-SR1 - correct? Should we downgrade severity and move Fix Version/s value to Nitrogen-SR2? |
| Comment by Jamo Luhrsen [ 06/Nov/17 ] |
|
I've attached the full output of "yum list installed" from before and after, as well as the output of the "yum update -y" itself. Below is the diff of the packages listed in "yum list installed" before and after the "yum update -y"; maybe there is a clue in it. 12:53 $ diff /tmp/beforeyumupdate /tmp/afteryumupdate |
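For anyone repeating this comparison, a minimal sketch of the same before/after check done programmatically (an illustration only; it just reuses the two capture paths from the diff command above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a plain set difference of the two "yum list installed"
// captures, listing the old (pre-update) package lines and the new ones that
// replaced them.
public class YumListDiff {
    public static void main(String[] args) throws IOException {
        List<String> before = Files.readAllLines(Paths.get("/tmp/beforeyumupdate"));
        List<String> after = Files.readAllLines(Paths.get("/tmp/afteryumupdate"));

        Set<String> beforeSet = new HashSet<>(before);
        Set<String> afterSet = new HashSet<>(after);

        // Present before but gone after: the versions that were replaced.
        before.stream()
              .filter(line -> !afterSet.contains(line))
              .forEach(line -> System.out.println("- " + line));

        // Present after but not before: the versions pulled in by the update.
        after.stream()
              .filter(line -> !beforeSet.contains(line))
              .forEach(line -> System.out.println("+ " + line));
    }
}
```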
| Comment by Jamo Luhrsen [ 06/Nov/17 ] |
|
Kit, I moved this to Medium and Major instead of Critical and Blocker.
| Comment by Anil Belur [ 06/Nov/17 ] |
|
jluhrsen Could you check with the recently updated java-builder image below whether the issue is still reproducible? The OOM seen in the SFT test with the earlier builds is no longer seen with this image: "88569cec-33d3-4fae-a749-da64fd9c9c5c" |
| Comment by Jamo Luhrsen [ 06/Nov/17 ] |
|
hopefully this fixes us: |
| Comment by Stephen Kitt [ 07/Nov/17 ] |
|
Funny that, I’d mentioned to Michael that the 151 Java 8 update listed a few OOM fixes in its changelog; perhaps that’s what’s helping us here... (I do suspect however that we’ve just moved the goalposts a little.) |