[OPNFLWPLUG-727] cbench throughput test does not work in Boron Created: 08/Jul/16 Updated: 27/Sep/21 Resolved: 21/Nov/16 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Luis Gomez | Assignee: | Tomas Slusny |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 6176 |
| Description |
|
Both plugins (He + Li) show the following:

cbench -c 10.29.8.203 -t -m 12000 -M 10000 -s 8 -l 10
cbench: controller benchmarking tool

My interpretation is that the controller closes the switches' connections and therefore we get the following report:

05:52:56.797 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms |
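For readers new to cbench, the flags in the command above map out as follows. This is an annotated sketch based on cbench's usage text, so double-check against your build:

    # -c: controller address            -t: throughput mode (default is latency mode)
    # -m: milliseconds per test loop    -M: unique source MACs (hosts) per switch
    # -s: number of emulated switches   -l: number of test loops
    cbench -c 10.29.8.203 -t -m 12000 -M 10000 -s 8 -l 10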
| Comments |
| Comment by Jozef Bacigal [ 01/Aug/16 ] |
|
Already solved. |
| Comment by Luis Gomez [ 18/Aug/16 ] |
|
Reopening, the issue shows again and it is tracked here: BR/Luis |
| Comment by Luis Gomez [ 18/Aug/16 ] |
|
I easily reproduced it locally:

mininet@mininet-vm:~> cbench -c 192.168.0.1 -t -m 12000 -M 10000 -s 8 -l 10

Not only the above, but when the test stops the controller stays at 100% CPU with no switch connected, so there is really an issue here. To install cbench on Ubuntu, just look at the cbench section of this script: https://git.opendaylight.org/gerrit/gitweb?p=releng/builder.git;a=blob;f=packer/provision/mininet.sh |
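For convenience, a from-source build of cbench (part of the oflops suite) usually looks roughly like the sketch below. This is not a copy of the referenced script; the GitHub mirrors and package names are assumptions that may differ on your system:

    # build prerequisites on Ubuntu (package names may differ by release)
    sudo apt-get install -y git autoconf automake libtool pkg-config \
        libsnmp-dev libpcap-dev libconfig-dev
    # oflops expects the OpenFlow reference implementation sources alongside it
    git clone https://github.com/mininet/openflow.git
    git clone https://github.com/mininet/oflops.git
    cd oflops
    ./boot.sh
    ./configure --with-openflow-src-dir=$(pwd)/../openflow
    make && sudo make install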
| Comment by Luis Gomez [ 18/Aug/16 ] |
|
Raising to critical as people in the field will try cbench for sure. |
| Comment by Miroslav Macko [ 26/Aug/16 ] |
|
Hi Luis, tested on master. We found some blocked threads caused by logging. I tried turning logging off, but the best I've got locally is this:

cbench: controller benchmarking tool

There could be some issue with that, but it will probably not be the only one. The CPU also runs at 100% for me after cbench ends. I am not sure about the connected switches, how do you check it? That is what we have for now. Thanks, |
| Comment by Luis Gomez [ 29/Aug/16 ] |
|
With more testing I could fix the test by adding some start delay (-D 10000) and reducing the number of MACs (-M 100): https://git.opendaylight.org/gerrit/#/c/44773/ However, even with these changes the controller stays at 100% CPU after the test, which I think we should fix as this could be a security vulnerability. |
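Applied to the invocation from the description, the two changes look like this (a sketch; the values are the ones quoted above):

    # -M 100:   100 unique source MACs per switch instead of 10000
    # -D 10000: wait 10 s after features_reply before test traffic starts
    cbench -c 10.29.8.203 -t -m 12000 -s 8 -l 10 -M 100 -D 10000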
| Comment by Luis Gomez [ 29/Aug/16 ] |
|
After connecting a profiler, this thread seems to be the issue:

AbstractStackedOutboundQueue.java:333
org.opendaylight.openflowjava.protocol.impl.core.connection.StackedSegment.failAll(OutboundQueueException) |
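For anyone wanting to confirm the same hot spot without a profiler, the standard JVM recipe is to match the busiest OS thread against a thread dump. A sketch, assuming the controller runs under Karaf (the PID lookup is an assumption; adjust for your setup):

    # 1. list the busiest native threads of the controller JVM
    top -H -p "$(pgrep -f karaf)"
    # 2. convert the hot thread's PID to hex; jstack reports thread ids as nid=0x...
    printf '0x%x\n' 12345      # 12345 is a placeholder for the hot thread PID
    # 3. dump the Java stacks and find the frame holding that nid
    jstack "$(pgrep -f karaf)" | grep -B 2 -A 20 'nid=0x3039'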
| Comment by Andrej Leitner [ 30/Aug/16 ] |
|
Hi Luis, |
| Comment by Shuva Jyoti Kar [ 30/Aug/16 ] |
|
(In reply to Andrej Leitner from comment #8) Logging the failures as error should be OK, since those are failures. Any trace should be removed, and let's have the debug logs checked before logging. I will also take a look at the places we log and come up with improved logging. |
| Comment by Andrej Leitner [ 31/Aug/16 ] |
|
Hi Shuva, |
| Comment by Luis Gomez [ 01/Sep/16 ] |
|
Upgrading to blocker due to impact on ODL perf reports. |
| Comment by Shuva Jyoti Kar [ 01/Sep/16 ] |
|
(In reply to Luis Gomez from comment #11) Luis, do we still see the controller closing the switch connections? Or is it that the controller CPU stays at 100% even after the test? |
| Comment by Luis Gomez [ 01/Sep/16 ] |
|
I am currently testing the proposed patch; I will update shortly on it. |
| Comment by Luis Gomez [ 01/Sep/16 ] |
|
This patch may improve some perf numbers but it does not help with:

1) Test abort due to the cbench switches disconnect issue. This is mostly addressed in this patch: https://git.opendaylight.org/gerrit/#/c/44773/

2) Very high CPU after the test finishes and switches are disconnected. This is the blocker to me now, as the controller seems inoperable after the test.

BR/Luis |
| Comment by Shuva Jyoti Kar [ 01/Sep/16 ] |
|
(In reply to Luis Gomez from comment #14) Does the CPU usage come down after some time, or does it remain the same indefinitely? |
| Comment by Luis Gomez [ 01/Sep/16 ] |
|
For as long as my patience allows, which is some minutes after the test, the CPU is still at 100%. |
| Comment by Luis Gomez [ 01/Sep/16 ] |
|
FYI, https://git.opendaylight.org/gerrit/#/c/44773/ is already merged, so the test should go green, but the CPU issue is still there and I will probably extend the test to catch this scenario. |
| Comment by A H [ 02/Sep/16 ] |
|
Is there an ETA for this bug and someone assigned to fix? |
| Comment by Luis Gomez [ 02/Sep/16 ] |
|
FYI, I added an extra check in the cbench test to track this bug: https://git.opendaylight.org/gerrit/#/c/45046/ This job will fail until this issue gets fixed: BR/Luis |
| Comment by A H [ 06/Sep/16 ] |
|
To better assess the impact of this bug and fix, could someone from your team please help us identify the following: |
| Comment by Andrej Leitner [ 06/Sep/16 ] |
|
Severity

Testing
There is also an issue in ofjava described in

Impact |
| Comment by Andrej Leitner [ 06/Sep/16 ] |
|
Merged in Boron. |
| Comment by Luis Gomez [ 06/Sep/16 ] |
|
This is fixed now according to: https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-1node-cbench-performance-only-boron/ BR/Luis |
| Comment by Luis Gomez [ 07/Sep/16 ] |
|
Reopening the bug: the CPU issue is fixed, but after running the throughput test a couple of times (cbench -c 192.168.0.1 -t -m 12000 -M 100 -l 10 -s 16 -D 5000) I see memory issues:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000cdc00000, 121634816, 0) failed; error='Cannot allocate memory' (errno=12)
|
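An os::commit_memory failure with errno=12 means the JVM asked the OS for more memory than it could provide, so the first checks are host memory and the configured heap. A sketch, assuming an OpenDaylight Karaf distribution where the heap is set via bin/setenv (verify the variable names in your release):

    # check available RAM and swap on the host
    free -m
    # the JVM failed to commit ~116 MB (121634816 bytes); keep the heap below
    # what the host can actually provide, e.g. in <distribution>/bin/setenv:
    export JAVA_MIN_MEM=512m
    export JAVA_MAX_MEM=2048m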
| Comment by Luis Gomez [ 07/Sep/16 ] |
|
After increasing the physical RAM size, I do not see issues running the test locally. There is still some instability in CI, so downgrading to Major until we understand why Beryllium does not show this instability. |
| Comment by A H [ 08/Sep/16 ] |
|
(In reply to Luis Gomez from comment #24) Based on Luis's comment, can I safely assume that this bug has been verified as fixed in the latest Boron RC 3.1 build? |
| Comment by Luis Gomez [ 08/Sep/16 ] |
|
This bug needs to remain open with lower priority in case we get some questions after the Boron release. The reason is I had to modify the test to sleep 10 secs between cbench runs in order to stabilize it: https://git.opendaylight.org/gerrit/#/c/45294/ In Beryllium we did not have to do this. BR/Luis |
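Locally, the same stabilization is just a pause between consecutive runs; a sketch of the idea (not the CSIT change itself):

    # three back-to-back throughput runs with a 10 s settle time in between
    for i in 1 2 3; do
        cbench -c 192.168.0.1 -t -m 12000 -M 100 -l 10 -s 16 -D 5000
        sleep 10    # let the controller quiesce before the next run
    done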
| Comment by Andrej Leitner [ 08/Sep/16 ] |
|
(In reply to Luis Gomez from comment #25)
|
| Comment by Tomas Slusny [ 08/Sep/16 ] |
|
Regarding the CBench instability: this is a bug in CBench. When we think a switch is IDLE, we send a HELLO message to the switch and expect the switch to reply with a HELLO message. But CBench was not sending this HELLO message and silently ignored all incoming HELLO messages. I added a proper HELLO reply on receiving HELLO to the CBench sources. Here is my fork of the CBench repo with this fix: https://github.com/deathbeam/oflops. I also created a pull request to the official repository, but since it is pretty inactive, I doubt it will ever be merged. |
| Comment by Tomas Slusny [ 08/Sep/16 ] |
|
After some more testing: we are actually sending ECHO and not HELLO on switch IDLE, and CBench is actually trying to send ECHO_REPLY, but I think for some reason we are not receiving this ECHO_REPLY in time. I will investigate this a bit more. |
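One way to see which keepalive actually crosses the wire, and how quickly the ECHO_REPLY comes back, is to capture the OpenFlow session. A sketch, assuming the controller listens on the default port 6633 and a Wireshark build that includes the OpenFlow dissector:

    # capture the OpenFlow TCP session during a cbench run
    sudo tcpdump -i any -nn -w cbench.pcap 'tcp port 6633'
    # decode message types with relative timestamps offline
    tshark -r cbench.pcap -Y openflow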
| Comment by Luis Gomez [ 08/Sep/16 ] |
|
OK, there is a line in the sand of 11:59 PM UTC on Sunday; if you find something and can get a patch in by then, fine; otherwise it will have to wait until SR1. |
| Comment by Andrej Leitner [ 13/Sep/16 ] |
|
We are getting a 100% pass on Jenkins since Sep 8. |
| Comment by Andrej Leitner [ 19/Sep/16 ] |
|
Luis, could we close the bug as resolved? |
| Comment by Luis Gomez [ 22/Sep/16 ] |
|
Yes, this is fixed now. |
| Comment by Andrej Leitner [ 27/Oct/16 ] |
|
Latency rerun is failing occasionally. |
| Comment by Tomas Slusny [ 21/Nov/16 ] |
|
According to Jenkins: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-carbon/ and https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-boron/ the cbench runs are now pretty stable, with only occasional failures (but in those runs all 3 tests failed, so it is probably an environment issue), so closing this again. |
| Comment by Luis Gomez [ 21/Nov/16 ] |
|
Right, this can be closed for now. |