[OPNFLWPLUG-382] [Lithium redesign] OOM errors, CPU 100% with 128 connected switches Created: 18/Mar/15  Updated: 27/Sep/21  Resolved: 08/May/15

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: SANDEEP GANGADHARAN Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: Other


Attachments: PNG File dump.png     PNG File dump2.png     Zip Archive log.zip     JPEG File openflow_multipartRequest_memoryLeak.jpg    
Issue Links:
Blocks
blocks OPNFLWPLUG-357 Milestone: OpenFlow Plugin Lithium Re... Resolved
External issue ID: 2869

 Description   

1) Start Karaf.
2) Install the following features:
feature:install odl-openflowplugin-flow-services
feature:install odl-dlux-all
feature:install odl-l2switch-all
feature:install odl-l2switch-switch-ui
feature:install odl-l2switch-switch-rest

3) Using mininet, connect 300 switches in a linear topology.
4) The controller crashes.
5) Switch connections move to the CLOSE_WAIT state.
6) The Karaf console shows the exception below:
java.io.IOException: Exception in opening zip file: /root/odl/distribution-karaf-0.3.0-SNAPSHOT/data/cache/org.eclipse.osgi/bundles/26/1/bundlefile
at org.eclipse.osgi.framework.util.SecureAction.getZipFile(SecureAction.java:291)
at org.eclipse.osgi.baseadaptor.bundlefile.ZipBundleFile.basicOpen(ZipBundleFile.java:87)
at org.eclipse.osgi.baseadaptor.bundlefile.ZipBundleFile.getZipFile(ZipBundleFile.java:100)
at org.eclipse.osgi.baseadaptor.bundlefile.ZipBundleFile.checkedOpen(ZipBundleFile.java:73)
at org.eclipse.osgi.baseadaptor.bundlefile.ZipBundleFile.getEntry(ZipBundleFile.java:245)
at org.eclipse.osgi.baseadaptor.loader.ClasspathManager.findClassImpl(ClasspathManager.java:542)
at org.eclipse.osgi.baseadaptor.loader.ClasspathManager.findLocalClassImpl(ClasspathManager.java:492)
at org.eclipse.osgi.baseadaptor.loader.ClasspathManager.findLocalClass(ClasspathManager.java:465)
at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.findLocalClass(DefaultClassLoader.java:216)
at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:395)
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:464)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:421)
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:412)
at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at jline.internal.Log.warn(Log.java:114)
at jline.internal.TerminalLineSettings.getProperty(TerminalLineSettings.java:97)
at jline.UnixTerminal.getWidth(UnixTerminal.java:75)
at jline.console.ConsoleReader.drawBuffer(ConsoleReader.java:834)
at jline.console.ConsoleReader.drawBuffer(ConsoleReader.java:853)
at jline.console.ConsoleReader.putString(ConsoleReader.java:793)
at jline.console.ConsoleReader.readLine(ConsoleReader.java:2533)
at jline.console.ConsoleReader.readLine(ConsoleReader.java:2162)
at org.apache.karaf.shell.console.impl.jline.ConsoleImpl.readAndParseCommand(ConsoleImpl.java:280)
at org.apache.karaf.shell.console.impl.jline.ConsoleImpl.run(ConsoleImpl.java:207)
at java.lang.Thread.run(Thread.java:744)
at org.apache.karaf.shell.console.impl.jline.ConsoleFactoryService$3.doRun(ConsoleFactoryService.java:126)
at org.apache.karaf.shell.console.impl.jline.ConsoleFactoryService$3$1.run(ConsoleFactoryService.java:117)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.karaf.jaas.modules.JaasHelper.doAs(JaasHelper.java:47)
at org.apache.karaf.shell.console.impl.jline.ConsoleFactoryService$3.run(ConsoleFactoryService.java:115)
Caused by: java.io.FileNotFoundException: /root/odl/distribution-karaf-0.3.0-SNAPSHOT/data/cache/org.eclipse.osgi/bundles/26/1/bundlefile (Too many open files)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:215)
at java.util.zip.ZipFile.<init>(ZipFile.java:145)
at java.util.zip.ZipFile.<init>(ZipFile.java:159)
at org.eclipse.osgi.framework.util.SecureAction.getZipFile(SecureAction.java:274)
... 30 more
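
The "Too many open files" cause above means the JVM process ran out of file descriptors. As a minimal sketch (not part of the original report), the descriptor count and limit of a HotSpot JVM on Linux can be read through the platform MXBean. The class name FdProbe is hypothetical, and the code only reports on the JVM it runs in, so to observe the controller it would have to run inside the Karaf process or read the same attributes over JMX.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of OpenDaylight: reports file-descriptor
// usage of the JVM it is running in.
public class FdProbe {

    // Returns {openCount, maxCount}, or null when the JVM does not expose
    // the Unix-specific MXBean (for example on Windows).
    public static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null;
    }

    public static void main(String[] args) {
        long[] usage = fdUsage();
        if (usage == null) {
            System.out.println("File descriptor counts are not available on this JVM");
        } else {
            System.out.println("open=" + usage[0] + " max=" + usage[1]);
        }
    }
}
```

If the open count approaches the maximum while switches are connecting, the per-process descriptor limit is the likely bottleneck rather than heap size.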

===============================================================
Note:
With Helium SR2, the same test case passes.
Connecting 300 switches without any links between them also works fine.

===============================================================



 Comments   
Comment by SANDEEP GANGADHARAN [ 18/Mar/15 ]

Attachment log.zip has been added with description: Attaching the logs

Comment by Jamo Luhrsen [ 17/Apr/15 ]

Repeated this type of test with the Lithium redesign (i.e., the odl-openflowplugin-app-new-lldp-speaker feature).

With a linear mininet topology using 128 switches, the controller eventually (after approximately 30 minutes) hits
an OOM, and all four CPUs run at 100%, making the system unusable. Eventually, the JVM is killed.

Not sure how relevant these messages are, but here is a snippet from the Karaf console:

Exception in thread "Thread-3042" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "CommitFutures-9" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-3127" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-3152" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Thread-3153" java.lang.OutOfMemoryError: GC overhead limit exceeded
Uncaught error from thread [odl-cluster-rpc-scheduler-1] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[odl-cluster-rpcException in thread "fileinstall-/home/sdn/distribution-karaf-0.3.0-SNAPSHOT/deploy" java.lang.OutOfMemoryError: GC overhead limit exceeded

Comment by Robert Varga [ 20/Apr/15 ]

Can you take a heap dump (for example, with -XX:+HeapDumpOnOutOfMemoryError)? This looks like an unbounded queue or a memory leak somewhere.
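
The flag mentioned above makes a HotSpot JVM write a dump automatically when an OutOfMemoryError occurs. As a complementary sketch, a dump can also be requested on demand through HotSpotDiagnosticMXBean. The class name HeapDumpSketch and the file name odl-heap.hprof are assumptions for illustration, and the code dumps the JVM it runs in, so for the controller it would need to be invoked inside the Karaf process or through a JMX connection.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of OpenDaylight: writes a heap dump of the
// current HotSpot JVM to the given directory.
public class HeapDumpSketch {

    public static Path dumpHeap(Path directory) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // The file must not exist yet; the name is an arbitrary choice.
        Path out = directory.resolve("odl-heap.hprof");
        // true = dump only live (reachable) objects, which keeps the file
        // smaller and focuses on what is actually being retained.
        bean.dumpHeap(out.toString(), true);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump");
        System.out.println("Heap dump written to " + dumpHeap(dir));
    }
}
```

The resulting .hprof file can be opened in VisualVM or a similar analyzer to look for the largest retained object types.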

Comment by Evan Zeller [ 20/Apr/15 ]

My JVM was crippled long before I saw an OutOfMemoryError, so I ran dev:dump-create as fast as I could. Here's what I see in VisualVM.

Comment by Evan Zeller [ 20/Apr/15 ]

Attachment dump.png has been added with description: heapdump

Comment by Evan Zeller [ 20/Apr/15 ]

Here's another dump. This time I made sure to run mininet with --switch ovsk,protocols=OpenFlow13

Comment by Evan Zeller [ 20/Apr/15 ]

Attachment dump2.png has been added with description: heapdump_of13

Comment by Jamo Luhrsen [ 21/Apr/15 ]

Attachment openflow_multipartRequest_memoryLeak.jpg has been added with description: jvisual screenshot

Comment by Jamo Luhrsen [ 21/Apr/15 ]

As Evan has already pointed out, and I have confirmed in my setup, it looks like the memory leak is around this:

org.opendaylight.yang.gen.v1.urn.opendaylight.openflow.protocol.rev130731.multipart.reply.multipart.reply.body.multipart.reply.table._case.multipart.reply.table.TableStatsBuilder$TableStatsImpl

The overall symptom is that memory is slowly consumed in my 100-switch topology until it hits the limit (3G in my case). Multipart request/reply
is a periodic event, so if that is the activity leaking memory, it would
all fit.
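
To illustrate why a periodic event would match the slow, steady growth described above, here is a minimal sketch. It is not the plugin's actual code: the class PeriodicStatsLeakSketch, the TableStats stand-in, and both retention strategies are hypothetical. It only shows that keeping every periodic reply grows with the number of polling rounds, while keeping the latest reply per table stays constant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in the plugin.
public class PeriodicStatsLeakSketch {

    // Stand-in for one table-stats entry carried in a multipart reply.
    static final class TableStats {
        final int switchId;
        final int tableId;
        final int round;

        TableStats(int switchId, int tableId, int round) {
            this.switchId = switchId;
            this.tableId = tableId;
            this.round = round;
        }
    }

    // Simulates `rounds` polling cycles and returns
    // {entries retained when every reply is kept,
    //  entries retained when only the latest reply per table is kept}.
    public static int[] simulate(int switches, int tables, int rounds) {
        List<TableStats> keepEverything = new ArrayList<>();
        Map<String, TableStats> keepLatest = new HashMap<>();
        for (int round = 0; round < rounds; round++) {
            for (int sw = 0; sw < switches; sw++) {
                for (int table = 0; table < tables; table++) {
                    TableStats stats = new TableStats(sw, table, round);
                    keepEverything.add(stats);                 // grows every round
                    keepLatest.put(sw + ":" + table, stats);   // overwrites old entry
                }
            }
        }
        return new int[] { keepEverything.size(), keepLatest.size() };
    }

    public static void main(String[] args) {
        int[] retained = simulate(100, 4, 10);
        System.out.println("keep everything: " + retained[0]
                + ", keep latest only: " + retained[1]);
    }
}
```

With 100 switches, 4 tables, and 10 rounds, the first strategy retains 4000 entries and the second retains 400. A heap histogram dominated by a single reply type, as in the screenshots, is consistent with the first pattern, though the dump alone does not prove where the references are held.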

Comment by Abhijit Kumbhare [ 21/Apr/15 ]

Added Hema in case she has any knowledge of the table features code that Jamo identified in the previous comment as the source of the memory leak.

Comment by Jamo Luhrsen [ 07/May/15 ]

This does not appear to be resolved, at least as far as CI is concerned:
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-periodic-1node-cds-scalability-daily-only-master/9/robot/report/log.html#s1-s1-t1

That run used this specific distribution:
distribution-karaf-0.3.0-20150507.123410-1486.zip

Comment by Jamo Luhrsen [ 08/May/15 ]

I left this open because a CI scale test was turning up an OOM exception. It has now been determined that the root cause in that scale test is OPNFLWPLUG-322, and that it only happens because the OVS version used
in CI is 2.0.2. The same test with newer OVS versions does not
have trouble.

scale test:
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-periodic-1node-cds-scalability-daily-only-master/

Generated at Wed Feb 07 20:32:19 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.