[L2SWITCH-38] basic openflow network breaks after short cbench test with l2switch feature installed Created: 25/Mar/15 Updated: 30/Oct/17 Resolved: 03/May/16 |
|
| Status: | Verified |
| Project: | l2switch |
| Component/s: | General |
| Affects Version/s: | unspecified |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jamo Luhrsen | Assignee: | Ajay L |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 2897 |
| Description |
|
TL;DR: using a larger number of switches in a cbench test breaks the openflow-ness such Test Details: Setup: install these features:
Steps: Notes: In step 1, the switch is learned, and it is removed when mininet is done. After step 2, there are leftover switches in the operational store. Once this state is seen, the logout process (exiting the karaf console) takes a very long time. The cbench test simulates X OpenFlow 1.0 switches all connecting. When X == 32 this I took a packet capture of the mininet network connecting in the problem state and the |
| Comments |
| Comment by Jamo Luhrsen [ 25/Mar/15 ] |
|
Attachment mnWorks.pcap has been added with description: what mininet-to-controller traffic looks like in a working setup |
| Comment by Jamo Luhrsen [ 25/Mar/15 ] |
|
Attachment mnBroken.pcap has been added with description: what mininet-to-controller traffic looks like in a broken setup |
| Comment by Jamo Luhrsen [ 14/Apr/15 ] |
|
With the Lithium build from 4/14/2015, this issue is there with throughput
Additionally, running this test twice resulted in an OOM:
cbench -c 172.17.7.23 -m 30000 -M 10000 -s 64 -l 10 -t
The OOM trace:
Exception in thread "Thread-2" java.lang.OutOfMemoryError: GC overhead limit exceeded |
| Comment by Jamo Luhrsen [ 14/Apr/15 ] |
|
Also, with regard to the last comment, all four cores on the test system were running at 100% and the Java processes running the controller had to be forcefully killed. |
| Comment by Abhijit Kumbhare [ 01/Jun/15 ] |
|
Jamo, can you update this bug after a retest? Martin says this should no longer be an issue. Abhijit |
| Comment by Jamo Luhrsen [ 01/Jun/15 ] |
|
I am still seeing this as described in a distribution built today (06/01/2015): distribution-karaf-0.3.0-20150601.211237-2105.zip. It looks like the cbench switches are not removed after that step is complete. Let me know what deeper debugs I can provide, if any. |
| Comment by Abhijit Kumbhare [ 02/Jun/15 ] |
|
Assigning it to Martin - but feel free to reassign. |
| Comment by Jamo Luhrsen [ 03/Jun/15 ] |
|
more details as we are deciding if this needs to be fixed for Lithium.
|
| Comment by Jamo Luhrsen [ 20/Jul/15 ] |
|
Attachment 1395_Lithium_data.ods has been added with description: re-test with Lithium |
| Comment by Jamo Luhrsen [ 20/Jul/15 ] |
|
(In reply to Jamo Luhrsen from comment #8) Ignore this comment; it was meant for another bug. |
| Comment by Colin Dixon [ 19/Jan/16 ] |
|
Do we know if this is still an issue in Beryllium? |
| Comment by Jamo Luhrsen [ 20/Jan/16 ] |
|
(In reply to Colin Dixon from comment #10) Yes, the high-level issue does still seem to be there in Beryllium (used a I also saw the same high-level issue with Lithium SR3. I say "high level" because I am not confident that the root cause is |
| Comment by Ajay L [ 24/Feb/16 ] |
|
I am able to repro the high CPU issue in Beryllium using the cbench commands mentioned. The thread dump shows that after the cbench test has run, the thread below hogs the CPU:
"pool-28-thread-2" prio=10 tid=0x00007f6c9c566000 nid=0x2b38 runnable [0x00007f6c7c464000]
ODL uses the LMAX Disruptor library to implement DOM notifications. The packet-handler decoder classes (e.g. ArpDecoder.java) consume the packet notifications, decode the packet, and publish the event as a notification to be consumed by the next decoder. All of this happens in a single thread. Under high load, this may cause a deadlock: the disruptor ring buffer will not advance until the initial notification handler returns, and the handler cannot return until it gets a slot to post its event on the ring buffer.
The proposed fix is to perform decode-and-publish on a separate thread so that the initial notification handler is not blocked. I unit-tested it and it seems to work fine.
Jamo, can you please try it in your setup? |
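The deadlock described in the comment above can be reproduced in miniature. The Java sketch below is an illustration only: it is not the l2switch patch and does not use the LMAX Disruptor API. It assumes a ring modelled as a `Semaphore` of free slots plus a queue, where a slot is reclaimed only after the handler returns. All names (`DecodeOffloadSketch`, `Notification`, `run`) are hypothetical. With the ring pre-filled to simulate load, an inline decode-and-publish stalls the single dispatch thread, while handing the same work to a separate thread lets the handler return and the ring drain.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class DecodeOffloadSketch {

    /** Hypothetical notification: decoded == false means a raw packet. */
    static final class Notification {
        final boolean decoded;
        final int id;
        Notification(boolean decoded, int id) { this.decoded = decoded; this.id = id; }
    }

    /**
     * Pre-fills a ring of `capacity` slots with raw packets, then starts the
     * single dispatch thread. Returns true if every packet was decoded and
     * delivered within timeoutMs, false if the pipeline stalled.
     */
    static boolean run(int capacity, boolean offload, long timeoutMs) throws InterruptedException {
        Semaphore freeSlots = new Semaphore(capacity);                   // stand-in for ring-buffer slots
        BlockingQueue<Notification> ring = new LinkedBlockingQueue<>();
        CountDownLatch delivered = new CountDownLatch(capacity);
        ExecutorService decodePool = Executors.newSingleThreadExecutor();

        // Simulate high load: every slot holds a raw packet before dispatch begins.
        for (int i = 0; i < capacity; i++) {
            freeSlots.acquire();
            ring.add(new Notification(false, i));
        }

        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    Notification n = ring.take();
                    if (n.decoded) {
                        delivered.countDown();                           // final consumer of decoded events
                    } else {
                        Runnable decodeAndPublish = () -> {
                            try {
                                freeSlots.acquire();                     // wait for a free slot
                                ring.add(new Notification(true, n.id));
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                            }
                        };
                        if (offload) {
                            decodePool.execute(decodeAndPublish);        // handler returns immediately
                        } else {
                            decodeAndPublish.run();                      // handler blocks while the ring is full
                        }
                    }
                    freeSlots.release();                                 // slot reclaimed only after the handler returns
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        dispatcher.setDaemon(true);
        dispatcher.start();

        boolean ok = delivered.await(timeoutMs, TimeUnit.MILLISECONDS);
        dispatcher.interrupt();                                          // unblock and stop the dispatch thread
        decodePool.shutdownNow();
        return ok;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("inline handler completed:    " + run(4, false, 500));
        System.out.println("offloaded handler completed: " + run(4, true, 500));
    }
}
```

In this model the inline run times out because the only thread that could free a slot is the one waiting for it; the offloaded run completes. A real fix would also need to bound the work queued on the separate thread, which this sketch does not address.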
| Comment by Jamo Luhrsen [ 03/Mar/16 ] |
|
(In reply to Ajay L from comment #12) Ajay, I do intend to look at this, but I am going to need more time to |
| Comment by Ajay L [ 08/Mar/16 ] |
|
Thanks, Jamo. Let us know once you get a chance to try it. |
| Comment by Jamo Luhrsen [ 22/Mar/16 ] |
|
(In reply to Ajay L from comment #14) Ajay, sorry for being slow on this one. I was able to take the Boron
Good news! The problem is fixed. I could reproduce with the current Boron distro, and saw things fixed using
I did not test with the stable/beryllium patch, as I had trouble tracking
Good work. |
| Comment by Jamo Luhrsen [ 03/May/16 ] |
|
I've also verified on Beryllium SR candidate distro: |