[NETVIRT-974] karaf process killed by OS due to OOM condition Created: 02/Nov/17  Updated: 07/Nov/17

Status: Verified
Project: netvirt
Component/s: General
Affects Version/s: None
Fix Version/s: Nitrogen-SR1

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Michael Vorburger
Resolution: Unresolved Votes: 0
Labels: csit:failures
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Attachments: Java Source File ThreadExplosion.java     Text File dmesg.log.txt     Text File journalctl.log.txt     Text File odl1_karaf.log.txt     Text File odl1_karaf_console.log.txt     HTML File yumupdate    
Sub-Tasks:
Key
Summary
Type
Status
Assignee
NETVIRT-975 Confirm ODL terminated by Kernel OOM ... Sub-task Verified Jamo Luhrsen  
NETVIRT-979 Try disabling Kernel OOM killer to se... Sub-task Resolved Jamo Luhrsen  
NETVIRT-978 Track native memory usage via jcmd Sub-task Resolved Jamo Luhrsen  
NETVIRT-976 Try out LevelDB Java impl instead of JNI Sub-task Verified Stephen Kitt  
NETVIRT-977 Track heap & native (!) memory usage ... Sub-task Resolved  

 Description   

since aprox 10/20/2017 we have seen our nitrogen and oxygen CSIT jobs fail
because the karaf process is using so much memory that the OS has to kill it:

the karaf console gets a message like this:

/tmp/karaf-0.8.0-SNAPSHOT/bin/karaf: line 422: 11539 Killed                  
${KARAF_EXEC} "${JAVA}" ${JAVA_OPTS} "$NON_BLOCKING_PRNG" -Djava.endorsed.dirs="${JAVA_ENDORSED_DIRS}" -Djava.ext.dirs="${JAVA_EXT_DIRS}" -Dkaraf.instances="${KARAF_HOME}/instances" -Dkaraf.home="${KARAF_HOME}" -Dkaraf.base="${KARAF_BASE}" -Dkaraf.data="${KARAF_DATA}" -Dkaraf.etc="${KARAF_ETC}" -Dkaraf.restart.jvm.supported=true -Djava.io.tmpdir="${KARAF_DATA}/tmp" -Djava.util.logging.config.file="${KARAF_BASE}/etc/java.util.logging.properties" ${KARAF_SYSTEM_OPTS} ${KARAF_OPTS} ${OPTS} "$@" -classpath "${CLASSPATH}" ${MAIN}

During a live debug session, we witnessed the java process consuming arpox 1.6G
of memory. (xmx is set to 2G and the host system has 4G). At some point, in a manner
of 5-10s the mem usage jumped up near 3.5G and the process was killed. It was very
sudden.



 Comments   
Comment by Michael Vorburger [ 02/Nov/17 ]

we should have a hs_err_pid*.log file and have to have an *.hprof file to know where and find a fix for an OOM...

have you been able to re-double-check if these files are't already produced somewhere? How about just doing a dumb:

sudo find / -name "hs_err_pid*.log"
sudo find / -name "*.hprof"

the hprof should be produced by I can see that in $ODL/bin/karaf we already have "-XX:+HeapDumpOnOutOfMemoryError" on DEFAULT_JAVA_OPTS...

to fix the folder where it would write the HPROF into, you could add: -XX:HeapDumpPath=/a/folder/you/can/recover

Comment by Michael Vorburger [ 02/Nov/17 ]

jluhrsen clarified that there is no hs_err_pid*.log or *.hprof - OS kills JVM before that's produced.

https://lists.opendaylight.org/pipermail/controller-dev/2017-November/014034.html lists some ideas how that possible.

Attached ThreadExplosion.java is a quick test I wrote to convince myself that a big Thread bug should not lead to this behaviour.

So then my best bet currently then is on some JNI crap or "off heap allocation" (ByteBuffers) - this is going to be fun to figure out.

Best next step I can think of is to figure out how to get an OS (not JVM) level coredump kind of file when it gets killed, and then get help analyzing that.

Comment by Jamo Luhrsen [ 06/Nov/17 ]

well, it's looking more and more like something to do with the host os and how it interacts with whatever is
different in nitro/oxy vs carbon.

I have some tests in the sandbox that show the crash is coming with the host image as is. Another job that
uses a host image from the past (these are updated periodically. think "yum update -y") that is not seeing
the crash. Also, another job using the troublesome image but first doing a "yum update -y" before running
the CSIT suites and the crash is not seen.

for completeness, the image label with the problem is:
java-builder - 20171020-0130 (Oct. 20, 2017)

the older image without the problem has a label of:
java-builder - 20170807-2145 (Aug 7, 2017)

Further, there was some very odd SFT failures we saw in the past and seemed like they could be explained
by crashing VMs or java processes just dying. I think the java-builder images used in those jobs was also
updated recently and the problem went away. If needed, we can dig in to that and see if the same image
label was used there. Maybe we just hit some corner case bug related to java and some host os software/lib
which is now fixed.

what should we do next?

Comment by Michael Vorburger [ 06/Nov/17 ]

> but first doing a "yum update -y" before running the CSIT suites and the crash is not seen.

can you detail which packages get updated from what (bad) version to which (good) versions? If only for future "ah that was...".

> what should we do next?

In an ideal world with unlimited resources, it would be interesting to understand what was wrong here. Back in the real world, if I read this right (please confirm if I did get this right..), then basically the summary of what you're saying is that with current "latest everything", we're good? If so, then my vote is fuhgeddaboudit ...

PS: If we ever hit this kind of Kernel OOM killer hitting ODL again, the sub tasks on this issue now have steps which may be useful if something like this ever has to be investigated again another time.

Comment by Kit Lou [ 06/Nov/17 ]

This is no longer a blocker for Nitrogen-SR1 - correct?

Should we downgrade severity and move Fix Version/s value to Nitrogen-SR2?

Comment by Jamo Luhrsen [ 06/Nov/17 ]

I've attached the full output of "yum list installed" before and after and including the "yum update -y"

below is the diff of the packages that are listed in "yum list installed" before and after the "yum update -y". Maybe there
is some difference in packages that will give a clue. maybe not. Maybe it's the java upgrade itself (from 144-0.b01 to 151-1.b12)
BTW, the older image (Aug 20) has java version 141-b16

12:53 $ diff /tmp/beforeyumupdate /tmp/afteryumupdate
4,9c4,9
< NetworkManager.x86_64 1:1.8.0-9.el7 @base
< NetworkManager-libnm.x86_64 1:1.8.0-9.el7 @base
< NetworkManager-ppp.x86_64 1:1.8.0-9.el7 @base
< NetworkManager-team.x86_64 1:1.8.0-9.el7 @base
< NetworkManager-tui.x86_64 1:1.8.0-9.el7 @base
< NetworkManager-wifi.x86_64 1:1.8.0-9.el7 @base

> NetworkManager.x86_64 1:1.8.0-11.el7_4 @updates
> NetworkManager-libnm.x86_64 1:1.8.0-11.el7_4 @updates
> NetworkManager-ppp.x86_64 1:1.8.0-11.el7_4 @updates
> NetworkManager-team.x86_64 1:1.8.0-11.el7_4 @updates
> NetworkManager-tui.x86_64 1:1.8.0-11.el7_4 @updates
> NetworkManager-wifi.x86_64 1:1.8.0-11.el7_4 @updates
40c40
< augeas-libs.x86_64 1.4.0-2.el7_4.1 @updates

> augeas-libs.x86_64 1.4.0-2.el7_4.2 @updates
214c214
< epel-release.noarch 7-10 @epel

> epel-release.noarch 7-11 @epel
304,311c304,310
< grub2.x86_64 1:2.02-0.64.el7.centos @base
< grub2-common.noarch 1:2.02-0.64.el7.centos @base
< grub2-pc.x86_64 1:2.02-0.64.el7.centos @base
< grub2-pc-modules.noarch 1:2.02-0.64.el7.centos @base
< grub2-tools.x86_64 1:2.02-0.64.el7.centos @base
< grub2-tools-efi.x86_64 1:2.02-0.64.el7.centos @base
< grub2-tools-extra.x86_64 1:2.02-0.64.el7.centos @base
< grub2-tools-minimal.x86_64 1:2.02-0.64.el7.centos @base

> grub2.x86_64 1:2.02-0.65.el7.centos.2 @updates
> grub2-common.noarch 1:2.02-0.65.el7.centos.2 @updates
> grub2-pc.x86_64 1:2.02-0.65.el7.centos.2 @updates
> grub2-pc-modules.noarch 1:2.02-0.65.el7.centos.2 @updates
> grub2-tools.x86_64 1:2.02-0.65.el7.centos.2 @updates
> grub2-tools-extra.x86_64 1:2.02-0.65.el7.centos.2 @updates
> grub2-tools-minimal.x86_64 1:2.02-0.65.el7.centos.2 @updates
341,342c340,341
< iptables.x86_64 1.4.21-18.0.1.el7.centos @base
< iptables-services.x86_64 1.4.21-18.0.1.el7.centos @base

> iptables.x86_64 1.4.21-18.2.el7_4 @updates
> iptables-services.x86_64 1.4.21-18.2.el7_4 @updates
371,373c370,372
< java-1.8.0-openjdk.x86_64 1:1.8.0.144-0.b01.el7_4 @updates
< java-1.8.0-openjdk-devel.x86_64 1:1.8.0.144-0.b01.el7_4 @updates
< java-1.8.0-openjdk-headless.x86_64 1:1.8.0.144-0.b01.el7_4 @updates

> java-1.8.0-openjdk.x86_64 1:1.8.0.151-1.b12.el7_4 @updates
> java-1.8.0-openjdk-devel.x86_64 1:1.8.0.151-1.b12.el7_4 @updates
> java-1.8.0-openjdk-headless.x86_64 1:1.8.0.151-1.b12.el7_4 @updates
385,389c384,389
< kernel-debug-devel.x86_64 3.10.0-693.2.2.el7 @updates
< kernel-headers.x86_64 3.10.0-693.2.2.el7 @updates
< kernel-tools.x86_64 3.10.0-693.2.2.el7 @updates
< kernel-tools-libs.x86_64 3.10.0-693.2.2.el7 @updates
< kexec-tools.x86_64 2.0.14-17.el7 @base

> kernel.x86_64 3.10.0-693.5.2.el7 @updates
> kernel-debug-devel.x86_64 3.10.0-693.5.2.el7 @updates
> kernel-headers.x86_64 3.10.0-693.5.2.el7 @updates
> kernel-tools.x86_64 3.10.0-693.5.2.el7 @updates
> kernel-tools-libs.x86_64 3.10.0-693.5.2.el7 @updates
> kexec-tools.x86_64 2.0.14-17.2.el7 @updates
392,394c392,394
< kmod.x86_64 20-15.el7_4.2 @updates
< kmod-libs.x86_64 20-15.el7_4.2 @updates
< koji.noarch 1.13.0-3.el7 @epel

> kmod.x86_64 20-15.el7_4.4 @updates
> kmod-libs.x86_64 20-15.el7_4.4 @updates
> koji.noarch 1.14.0-1.el7 @epel
479c479
< libgudev1.x86_64 219-42.el7_4.1 @updates

> libgudev1.x86_64 219-42.el7_4.4 @updates
542,543c542,543
< libsss_idmap.x86_64 1.15.2-50.el7_4.2 @updates
< libsss_nss_idmap.x86_64 1.15.2-50.el7_4.2 @updates

> libsss_idmap.x86_64 1.15.2-50.el7_4.6 @updates
> libsss_nss_idmap.x86_64 1.15.2-50.el7_4.6 @updates
645c645
< nss.x86_64 3.28.4-12.el7_4 @updates

> nss.x86_64 3.28.4-15.el7_4 @updates
650,651c650,651
< nss-sysinit.x86_64 3.28.4-12.el7_4 @updates
< nss-tools.x86_64 3.28.4-12.el7_4 @updates

> nss-sysinit.x86_64 3.28.4-15.el7_4 @updates
> nss-tools.x86_64 3.28.4-15.el7_4 @updates
661,663c661,663
< openssh.x86_64 7.4p1-12.el7_4 @updates
< openssh-clients.x86_64 7.4p1-12.el7_4 @updates
< openssh-server.x86_64 7.4p1-12.el7_4 @updates

> openssh.x86_64 7.4p1-13.el7_4 @updates
> openssh-clients.x86_64 7.4p1-13.el7_4 @updates
> openssh-server.x86_64 7.4p1-13.el7_4 @updates
675c675
< parallel.noarch 20150522-1.el7.cern installed

> parallel.noarch 20160222-1.el7 @epel
800c800
< python-perf.x86_64 3.10.0-693.2.2.el7 @updates

> python-perf.x86_64 3.10.0-693.5.2.el7 @updates
829c829
< python2-koji.noarch 1.13.0-3.el7 @epel

> python2-koji.noarch 1.14.0-1.el7 @epel
884,885c884,885
< selinux-policy.noarch 3.13.1-166.el7_4.4 @updates
< selinux-policy-targeted.noarch 3.13.1-166.el7_4.4 @updates

> selinux-policy.noarch 3.13.1-166.el7_4.5 @updates
> selinux-policy-targeted.noarch 3.13.1-166.el7_4.5 @updates
900c900
< sssd-client.x86_64 1.15.2-50.el7_4.2 @updates

> sssd-client.x86_64 1.15.2-50.el7_4.6 @updates
907,911c907,911
< systemd.x86_64 219-42.el7_4.1 @updates
< systemd-libs.i686 219-42.el7_4.1 @updates
< systemd-libs.x86_64 219-42.el7_4.1 @updates
< systemd-python.x86_64 219-42.el7_4.1 @updates
< systemd-sysv.x86_64 219-42.el7_4.1 @updates

> systemd.x86_64 219-42.el7_4.4 @updates
> systemd-libs.i686 219-42.el7_4.4 @updates
> systemd-libs.x86_64 219-42.el7_4.4 @updates
> systemd-python.x86_64 219-42.el7_4.4 @updates
> systemd-sysv.x86_64 219-42.el7_4.4 @updates
930,931c930,931
< tzdata.noarch 2017b-1.el7 @base
< tzdata-java.noarch 2017b-1.el7 @base

> tzdata.noarch 2017c-1.el7 @updates
> tzdata-java.noarch 2017c-1.el7 @updates
947c947
< wget.x86_64 1.14-15.el7 @base

> wget.x86_64 1.14-15.el7_4.1 @updates
980,983d979

Comment by Jamo Luhrsen [ 06/Nov/17 ]

Kit, I moved this to Medium and Major instead of Critical and Blocker

Comment by Anil Belur [ 06/Nov/17 ]

jluhrsen Could you check with the below recently updated java-builder image, if the issue is still reproducible.

The SFT test which has OOM with the earlier builds is no longer seen with this image.

"88569cec-33d3-4fae-a749-da64fd9c9c5c"
"CentOS 7 - java-builder - 20171030-1039"

Comment by Jamo Luhrsen [ 06/Nov/17 ]

hopefully this fixes us:
https://git.opendaylight.org/gerrit/#/c/65231

Comment by Stephen Kitt [ 07/Nov/17 ]

Funny that, I’d mentioned to Michael that the 151 Java 8 update listed a few OOM fixes in its changelog; perhaps that’s what’s helping us here...

(I do suspect however that we’ve just moved the goalposts a little.)

Generated at Wed Feb 07 20:22:55 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.