[L2SWITCH-38] basic openflow network breaks after short cbench test with l2switch feature installed Created: 25/Mar/15  Updated: 30/Oct/17  Resolved: 03/May/16

Status: Verified
Project: l2switch
Component/s: General
Affects Version/s: unspecified
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Ajay L
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File 1395_Lithium_data.ods     File mnBroken.pcap     File mnWorks.pcap    
External issue ID: 2897

 Description   

TL;DR: using a larger number of simulated switches in a cbench test breaks OpenFlow operation
such that a simple mininet network will no longer work.

Test Details:

Setup: install these features:

  • odl-l2switch-all
  • odl-l2switch-switch-rest
  • odl-l2switch-switch-ui
  • odl-dlux-all

Steps:
1) start a small mininet topo (tree,1,3), verify a successful pingall, then quit
2) run a brief cbench test (-m 5000 -M 10000 -s 64 -l 2 -t)
3) repeat step 1; this time pingall fails and the switch is not learned

Notes:

In step 1, the switch is learned and then removed when mininet exits.

After step 2, there are leftover switches in the operational store.

Once this state is reached, logging out (exiting the karaf console) takes a very long time
(or never completes), so I just kill the java process rather than wait.
See the end of the logs here: http://pastebin.com/Kj4hdves

The cbench test simulates X OpenFlow 1.0 switches all connecting. With X == 32 the problem
does not always happen on the first pass, though it may on the second iteration of steps
1-3. With 64 it appears to happen every time.

I took a packet capture of the mininet network connecting while in the problem state, and the
handshake looks OK. After the controller's final ACK of the MULTIPART REPLY, the only
traffic originates from the switch (e.g. port-status messages, or packet-ins triggered by
data-plane traffic from pingall). In a working environment we would see LLDP packet-outs
for learning, default flows being pushed, and so on. None of that is happening here.



 Comments   
Comment by Jamo Luhrsen [ 25/Mar/15 ]

Attachment mnWorks.pcap has been added with description: what mininet-to-controller traffic looks like in a working setup

Comment by Jamo Luhrsen [ 25/Mar/15 ]

Attachment mnBroken.pcap has been added with description: what mininet-to-controller traffic looks like in a broken setup

Comment by Jamo Luhrsen [ 14/Apr/15 ]

With a Lithium build from 4/14/2015, this issue appears in throughput
mode (-t) but not in latency mode (the default).

Additionally, running this test twice resulted in an OOM:

cbench -c 172.17.7.23 -m 30000 -M 10000 -s 64 -l 10 -t

The OOM trace:

Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.nio.cs.UTF_8.newEncoder(UTF_8.java:72)
at java.lang.StringCoding$StringEncoder.<init>(StringCoding.java:282)
at java.lang.StringCoding$StringEncoder.<init>(StringCoding.java:273)
at java.lang.StringCoding.encode(StringCoding.java:338)
at java.lang.String.getBytes(String.java:916)
at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
at java.io.File.isDirectory(File.java:843)
at org.apache.karaf.main.Main.doMonitor(Main.java:283)
at org.apache.karaf.main.Main.access$100(Main.java:69)
at org.apache.karaf.main.Main$1.run(Main.java:271)
Exception in thread "Thread-104" java.util.concurrent.RejectedExecutionException: Task org.opendaylight.openflowplugin.openflow.md.core.HandshakeStepWrapper@29295cff rejected from org.opendaylight.openflowplugin.openflow.md.core.ThreadPoolLoggingExecutor@5a987e75[Shutting down, pool size = 0, active threads = 0, queued tasks = 1, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.opendaylight.openflowplugin.openflow.md.core.ConnectionConductorImpl.onConnectionReady(ConnectionConductorImpl.java:450)
at org.opendaylight.openflowjava.protocol.impl.core.connection.ConnectionAdapterImpl$3.run(ConnectionAdapterImpl.java:449)
at java.lang.Thread.run(Thread.java:744)
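
For context on the second trace: a ThreadPoolExecutor using the default AbortPolicy throws
RejectedExecutionException for any task submitted once the pool has begun shutting down,
which matches the "Shutting down, pool size = 0" state in the message above. A minimal
standalone illustration of that behavior (plain JDK code, not OpenDaylight's):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectAfterShutdownDemo {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.shutdown(); // pool enters the "Shutting down" state

        try {
            // Any submission after shutdown() is refused.
            pool.execute(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy throws, as in the trace above.
            System.out.println("rejected: " + e);
        }
    }
}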

Comment by Jamo Luhrsen [ 14/Apr/15 ]

Also, with regard to the last comment: all four cores on the test system are running at 100%, and the java processes running the controller needed to be forcefully killed.

Comment by Abhijit Kumbhare [ 01/Jun/15 ]

Jamo,

Can you update this bug after a retest? Martin says this should not be an issue anymore.

Abhijit

Comment by Jamo Luhrsen [ 01/Jun/15 ]

I am still seeing this, as described, in a distribution built today (06/01/2015):

distribution-karaf-0.3.0-20150601.211237-2105.zip

It looks like the cbench switches are not removed after that step completes.

Let me know what deeper debug info I can provide, if any.

Comment by Abhijit Kumbhare [ 02/Jun/15 ]

Assigning it to Martin - but feel free to reassign.

Comment by Jamo Luhrsen [ 03/Jun/15 ]

More details, as we decide whether this needs to be fixed for Lithium:

  • this is not an issue with Helium SR3
  • this is happening with the Helium plugin codebase in a distribution built 06/02/2015
  • some other symptoms:
    • the mock cbench switches are not removed from the operational store after the short
      cbench test is run
    • it does not appear that any new switch can connect at this point; I have tried with
      switches whose ids will not conflict
    • one of the system's 4 CPUs is constantly pegged at 100%
    • the problem is not there if I install only odl-openflowplugin-flow-services-ui;
      reproducing it requires installing odl-l2switch-switch-ui. I suppose that means we
      need to involve the l2switch team, but clearly something OpenFlow-related is broken
    • because of the last point, I am not able to try this test with the Lithium redesign,
      since enabling l2switch depends on, and installs, the Helium codebase

Comment by Jamo Luhrsen [ 20/Jul/15 ]

Attachment 1395_Lithium_data.ods has been added with description: re-test with Lithium

Comment by Jamo Luhrsen [ 20/Jul/15 ]

(In reply to Jamo Luhrsen from comment #8)
> Created attachment 573 [details]
> re-test with Lithium

Ignore this comment; it was meant for another bug.

Comment by Colin Dixon [ 19/Jan/16 ]

Do we know if this is still an issue in Beryllium?

Comment by Jamo Luhrsen [ 20/Jan/16 ]

(In reply to Colin Dixon from comment #10)
> Do we know if this is still an issue in Beryllium?

Yes, the high-level issue does still seem to be there in Beryllium (using a
distro built 01/19/2016). I used a larger number of cbench switches than in
the initial description.

I also saw the same high-level issue with Lithium SR3.

I say "high-level" because I am not confident that the root cause is the same.

Comment by Ajay L [ 24/Feb/16 ]

I am able to reproduce the high-CPU issue in Beryllium using the cbench commands mentioned above. A thread dump shows that after the cbench test has run, the thread below hogs the CPU:

"pool-28-thread-2" prio=10 tid=0x00007f6c9c566000 nid=0x2b38 runnable [0x00007f6c7c464000]
java.lang.Thread.State: RUNNABLE
at com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:56)
at com.lmax.disruptor.PhasedBackoffWaitStrategy.waitFor(PhasedBackoffWaitStrategy.java:102)
at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:55)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

ODL uses the LMAX disruptor library to implement DOM notifications. The packet-handler decoder classes (e.g. ArpDecoder.java) consume packet notifications, decode the packet, and publish the result as a new notification to be consumed by the next decoder. All of this happens on a single thread. Under high load this can deadlock: the disruptor ring buffer cannot advance until the initial notification handler returns, and the handler cannot return until it gets a slot to post its own event on the ring buffer (see the analogy sketch below).
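
To make the failure mode concrete, here is a minimal JDK-only analogy (not actual disruptor
code; the class and event names below are invented for illustration). A single consumer
thread must re-publish derived events into the same bounded queue it drains; once the queue
fills, put() blocks inside the only consumer, so the queue can never drain again:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SelfPublishDeadlockDemo {
    public static void main(String[] args) throws InterruptedException {
        // Small bounded queue standing in for the disruptor ring buffer.
        BlockingQueue<String> ring = new ArrayBlockingQueue<>(4);

        // Single "decoder" thread, like the one packet-handler thread: it
        // consumes an event and publishes follow-up events onto the SAME
        // queue for the next decoder stage.
        Thread decoder = new Thread(() -> {
            try {
                while (true) {
                    String event = ring.take();
                    // Fan out: each consumed event yields two published
                    // events, so the queue grows until a put() blocks for
                    // good inside the only consumer.
                    ring.put("eth:" + event);
                    ring.put("arp:" + event);
                }
            } catch (InterruptedException ignored) {
            }
        });
        decoder.start();

        // Producer side (think cbench packet-in flood); it wedges too.
        for (int i = 0; i < 100; i++) {
            ring.put("packet-" + i);
        }
        System.out.println("not reached once the queue wedges");
    }
}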

The proposed fix is to perform the decode-and-publish on a separate thread so that the initial notification handler is never blocked. I unit-tested it and it seems to work fine; a sketch of the idea follows.
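
A sketch of that approach (illustrative only; the real change is in the gerrit patch below,
and the class and method names here are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical decoder showing the shape of the fix: the notification
// handler only hands the raw packet to a worker thread, so the disruptor
// thread is never blocked waiting for a publish slot.
public class DecodeOffloadSketch {

    // A single worker preserves per-decoder ordering while freeing the
    // disruptor thread (a real fix would manage this executor's lifecycle).
    private final ExecutorService decodeExecutor =
            Executors.newSingleThreadExecutor();

    // Called on the disruptor's notification thread; returns immediately.
    public void onPacketReceived(byte[] payload) {
        decodeExecutor.submit(() -> {
            Object decoded = decode(payload);
            // May block waiting for a ring-buffer slot, but not on the
            // disruptor thread, so the buffer can keep draining.
            publishNotification(decoded);
        });
    }

    private Object decode(byte[] payload) {
        // parse Ethernet/ARP headers, etc.
        return payload;
    }

    private void publishNotification(Object decoded) {
        // post the decoded event to the notification service
    }
}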

Jamo - can you please try this in your setup?

https://git.opendaylight.org/gerrit/35300

Comment by Jamo Luhrsen [ 03/Mar/16 ]

(In reply to Ajay L from comment #12)
> Jamo - can you please try this in your setup?
>
> https://git.opendaylight.org/gerrit/35300

Ajay, I do intend to look at this, but I am going to need more time to
catch up on other things. I will report back here when I am able to
check.

Comment by Ajay L [ 08/Mar/16 ]

Thanks Jamo. Let us know once you get a chance to try it.

Comment by Jamo Luhrsen [ 22/Mar/16 ]

(In reply to Ajay L from comment #14)
> Thanks Jamo. Let us know once you get a chance to try it.

Ajay, sorry for being slow on this one. I was able to take the Boron
distribution created by your patch's verify job and compare it to the
latest Boron distribution (which does not have your changes).

Good news! The problem is fixed.

I could reproduce the problem with the current Boron distro, and saw it
fixed using the distro from your patch.

I did not test with the stable/beryllium patch, as I had trouble tracking
down the distro created by that patch.

Good work.

Comment by Jamo Luhrsen [ 03/May/16 ]

I've also verified the fix on the Beryllium SR candidate distro:
https://wiki.opendaylight.org/view/Simultaneous_Release:Beryllium_Release_Plan#Beryllium_SR2_Download
