[NETCONF-655] Unable to mount devices, OOM crash Created: 05/Feb/20  Updated: 07/Feb/20  Resolved: 07/Feb/20

Status: Resolved
Project: netconf
Component/s: None
Affects Version/s: Sodium SR1, Sodium SR2
Fix Version/s: None

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Using a device with a large YANG schema and the netconf testtool simulator, mounting
devices fails. I have also observed OOM crashes. This does not happen with the original/first
Sodium release, but it does happen with both SR1 and the most recent Sodium build, which is
presumably the SR2 release.

The basic test is to create 50 devices of the exact same type and then poll the
operational topology (every 2s) until all are marked as "connected". The simulated device
takes up approximately 375 MB of heap after YANG compilation, and all 50 devices are compiled and
mounted in approximately 3 minutes with the original Sodium release. (NOTE: I am not sure I can publicly share
the YANG models for this device at this time.) With SR1 and the SR2 candidate, none of the devices are
mounted.
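
For reference, the polling half of the test is roughly the sketch below (Python). It assumes the
default RESTCONF endpoint at localhost:8181 with admin/admin credentials and the Sodium-era
bierman02 operational URL; those defaults are assumptions, adjust them to your setup.

import time
import requests

# Assumed defaults: bierman02 operational RESTCONF URL and admin/admin credentials.
URL = ("http://localhost:8181/restconf/operational/"
       "network-topology:network-topology/topology/topology-netconf")
AUTH = ("admin", "admin")
EXPECTED = 50        # number of simulated testtool devices
POLL_INTERVAL = 2    # seconds, matching the 2s poll described above

def connected_count():
    """Return how many netconf nodes report connection-status 'connected'."""
    resp = requests.get(URL, auth=AUTH, headers={"Accept": "application/json"})
    resp.raise_for_status()
    nodes = resp.json().get("topology", [{}])[0].get("node", [])
    return sum(1 for n in nodes
               if n.get("netconf-node-topology:connection-status") == "connected")

while connected_count() < EXPECTED:
    time.sleep(POLL_INTERVAL)
print("all %d devices connected" % EXPECTED)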

The Karaf log shows many messages like:

2020-02-05T19:02:14,675 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-185 | Cluster(akka://opendaylight-cluster-data) | 39 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [18924] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.
2020-02-05T19:02:51,833 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-185 | Cluster(akka://opendaylight-cluster-data) | 39 - com.typesafe.akka.slf4j - 2.5.26 | Cluster Node [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [17136] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, e.g. by running blocking tasks on the default dispatcher, CPU overload, or GC.

and eventually OOM crash messages like these:

2020-02-05T19:03:58,779 | ERROR | Framework Event Dispatcher: org.eclipse.osgi.internal.framework.EquinoxEventPublisher@4450d156 | actor                            | 33 - com.typesafe.akka.actor - 2.5.26 | FrameworkEvent ERROR - com.typesafe.akka.actor
java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-02-05T19:03:58,848 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ActorSystemImpl                  | 39 - com.typesafe.akka.slf4j - 2.5.26 | Uncaught error from thread [opendaylight-cluster-data-shard-dispatcher-168]: GC overhead limit exceeded, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.OutOfMemoryError: GC overhead limit exceeded
2020-02-05T19:03:58,859 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ActorSystemImpl                  | 39 - com.typesafe.akka.slf4j - 2.5.26 | Uncaught error from thread [opendaylight-cluster-data-scheduler-1]: Unexpected error from weaving hook., shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.ClassFormatError: Unexpected error from weaving hook.
        at org.eclipse.osgi.internal.weaving.WeavingHookConfigurator.processClass(WeavingHookConfigurator.java:73) ~[?:?]
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.processClass(ClasspathManager.java:674) ~[?:?]
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.defineClass(ClasspathManager.java:645) ~[?:?]
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findClassImpl(ClasspathManager.java:578) ~[?:?]
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClassImpl(ClasspathManager.java:538) ~[?:?]
        at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClass(ClasspathManager.java:525) ~[?:?]
        at org.eclipse.osgi.internal.loader.ModuleClassLoader.findLocalClass(ModuleClassLoader.java:328) ~[?:?]
        at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:368) ~[?:?]
        at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:446) ~[?:?]
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:395) ~[?:?]
        at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:387) ~[?:?]
        at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:150) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352) ~[?:?]
        at akka.event.BusLogging.notifyError(Logging.scala:1932) ~[33:com.typesafe.akka.actor:2.5.26]
        at akka.event.LoggingAdapter.error(Logging.scala:1207) ~[33:com.typesafe.akka.actor:2.5.26]
        at akka.event.LoggingAdapter.error$(Logging.scala:1207) ~[33:com.typesafe.akka.actor:2.5.26]
        at akka.event.BusLogging.error(Logging.scala:1914) ~[33:com.typesafe.akka.actor:2.5.26]
        at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:244) ~[33:com.typesafe.akka.actor:2.5.26]
        at java.lang.Thread.run(Thread.java:748) ~[?:?]
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded


 Comments   
Comment by Jamo Luhrsen [ 05/Feb/20 ]

If I increase the maximum heap size to 7G, I can get around this issue. Note that the original Sodium release's
default of 2G works fine. I tried 2-6G with the SR2 candidate build and it still fails. At 7G my devices mount
and the CPU finally calms down.
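
For anyone reproducing this, a minimal sketch of the heap bump, assuming the stock OpenDaylight Karaf
layout where bin/setenv supplies the JVM heap limits (the path and variable name are the usual
defaults, not verified against every distribution):

# <karaf-distribution>/bin/setenv -- assumed location of the heap settings
export JAVA_MAX_MEM="7168m"    # raise the maximum heap from the 2G default to 7G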

Comment by Robert Varga [ 07/Feb/20 ]

Hmm, that's weird; I do not see anything in yangtools between GA and SR1 that would cause this.

Comment by Jamo Luhrsen [ 07/Feb/20 ]

Please see this 3G heap dump, created using a recent Sodium distribution (so after SR1) while trying to mount 50 devices (to make the crash happen sooner). This time the OOM crash took approximately 1 hour to occur.

Comment by Jamo Luhrsen [ 07/Feb/20 ]

Sorry for the false alarm here. I was mistakenly using a modified Sodium distribution when I
thought it was a new/default version. It had been modified to use a full 24G maximum heap size,
so I never saw this issue with that distribution. It turns out the same symptoms are present
for all releases and are expected.
