[OPNFLWPLUG-727] cbench throughput test does not work in Boron Created: 08/Jul/16  Updated: 27/Sep/21  Resolved: 21/Nov/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Tomas Slusny
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by OPNFLWJAVA-79 Infinite reschedule of flush Resolved
External issue ID: 6176

 Description   

Both plugins (He + Li) show the following:

cbench -c 10.29.8.203 -t -m 12000 -M 10000 -s 8 -l 10

cbench: controller benchmarking tool
running in mode 'throughput'
connecting to controller at 10.29.8.203:6633
faking 8 switches offset 1 :: 10 tests each; 12000 ms per test
with 10000 unique source MACs per switch
learning destination mac addresses before the test
starting test with 0 ms delay after features_reply
ignoring first 1 "warmup" and last 0 "cooldown" loops
connection delay of 0ms per 1 switch(es)
debugging info is off
controller msgbuf_read() = -1: msgbuf_read: Connection reset by peer
... exiting

My interpretation is that the controller closes the switch connections, and therefore we get the following report:

05:52:56.797 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms
05:53:08.897 8 switches: flows/sec: 0 0 0 0 0 0 20000 0 total = 1.666627 per ms
05:53:20.998 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms
05:53:33.098 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms
05:53:45.222 8 switches: flows/sec: 0 0 0 0 0 776 0 0 total = 0.064542 per ms
05:53:57.323 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms
05:54:09.423 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms



 Comments   
Comment by Jozef Bacigal [ 01/Aug/16 ]

Already solved.

Comment by Luis Gomez [ 18/Aug/16 ]

Reopening, the issue shows up again and is tracked here:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-boron/

BR/Luis

Comment by Luis Gomez [ 18/Aug/16 ]

I easily reproduced it locally:

mininet@mininet-vm:~> cbench -c 192.168.0.1 -t -m 12000 -M 10000 -s 8 -l 10
cbench: controller benchmarking tool
running in mode 'throughput'
connecting to controller at 192.168.0.1:6633
faking 8 switches offset 1 :: 10 tests each; 12000 ms per test
with 10000 unique source MACs per switch
learning destination mac addresses before the test
starting test with 0 ms delay after features_reply
ignoring first 1 "warmup" and last 0 "cooldown" loops
connection delay of 0ms per 1 switch(es)
debugging info is off
14:04:25.711 8 switches: flows/sec: 11006 0 0 16669 0 0 0 0 total = 2.306181 per ms
14:04:37.815 8 switches: flows/sec: 0 0 0 3331 0 0 0 11903 total = 1.269233 per ms
controller msgbuf_read() = -1: msgbuf_read: Connection reset by peer
... exiting

Not only the above, but when the test stops the controller stays at 100% CPU with no switch connected, so there is really an issue here.

To install cbench on Ubuntu, just look at the cbench section of this script:

https://git.opendaylight.org/gerrit/gitweb?p=releng/builder.git;a=blob;f=packer/provision/mininet.sh

Comment by Luis Gomez [ 18/Aug/16 ]

Raising to critical, as people in the field will try cbench for sure.

Comment by Miroslav Macko [ 26/Aug/16 ]

Hi Luis,

Tested on master.

We have found some blocked threads caused by logging.

I tried turning logging off, but the best I've got locally is still this:

cbench: controller benchmarking tool
running in mode 'throughput'
connecting to controller at localhost:6633
faking 8 switches offset 1 :: 9 tests each; 12000 ms per test
with 10000 unique source MACs per switch
learning destination mac addresses before the test
starting test with 1000 ms delay after features_reply
ignoring first 1 "warmup" and last 0 "cooldown" loops
connection delay of 0ms per 1 switch(es)
debugging info is off
13:43:16.118 8 switches: flows/sec: 0 19999 17883 19165 0 0 0 0 total = 4.751830 per ms
13:43:28.220 8 switches: flows/sec: 14254 0 0 0 0 3916 0 14849 total = 2.751527 per ms
13:43:40.320 8 switches: flows/sec: 0 0 0 0 7962 0 0 0 total = 0.663450 per ms
13:43:52.427 8 switches: flows/sec: 0 0 0 0 0 19999 20000 0 total = 3.331661 per ms
13:44:19.436 8 switches: flows/sec: 0 0 0 0 12038 0 0 0 total = 0.447354 per ms
13:44:31.544 8 switches: flows/sec: 0 0 0 0 0 0 0 0 total = 0.000000 per ms
13:44:43.645 8 switches: flows/sec: 20000 51877 56720 29757 0 60000 0 60000 total = 23.195637 per ms
13:44:55.746 8 switches: flows/sec: 78687 56798 71107 60000 0 55125 0 94532 total = 34.687287 per ms
13:45:08.235 8 switches: flows/sec: 45838 80000 53229 60000 0 74013 0 65008 total = 30.518821 per ms
RESULT: 8 switches 8 tests min/max/avg/stdev = 0.00/34687.29/11949.47/13917.78 responses/s

There could be some issue with that, but it is probably not the only one.

The CPU also runs at 100% for me after cbench ends. I am not sure about connected switches; how do you check that?

That is what we have for now.

Thanks,
Miro

Comment by Luis Gomez [ 29/Aug/16 ]

With more testing I could fix the test by adding some start delay (-D 10000) and reducing the number of MACs (-M 100):

https://git.opendaylight.org/gerrit/#/c/44773/

However, even with these changes the controller stays at 100% CPU after the test, which I think we should fix, as this could be a security vulnerability.

Comment by Luis Gomez [ 29/Aug/16 ]

After connecting a profiler, this thread seems to be the issue:

AbstractStackedOutboundQueue.java:333 org.opendaylight.openflowjava.protocol.impl.core.connection.StackedSegment.failAll(OutboundQueueException)

Comment by Andrej Leitner [ 30/Aug/16 ]

Hi Luis,
I think there is still a problem with excessive logging. I prepared a patch where the log level is set to debug and the isDebugEnabled precondition is also used. I ran cbench locally with this patch (plus logging in Karaf set to ERROR) and my CPU after the test and Karaf shutdown returned to normal (2%) and did not get stuck at 100%. Could you please test it in the Jenkins environment?

https://git.opendaylight.org/gerrit/#/c/44868/
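
For illustration, a minimal sketch of the guarded-logging pattern the patch describes, assuming standard SLF4J logging (the class and method names here are hypothetical, not taken from the actual change):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class FlowPushLoggingExample {
    private static final Logger LOG = LoggerFactory.getLogger(FlowPushLoggingExample.class);

    // Hypothetical failure handler: log per-message failures at DEBUG and guard the call
    // with isDebugEnabled() so the netty I/O thread is not stalled by log formatting under load.
    void onRpcFailure(final String deviceId, final Throwable cause) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("Flow push to device {} failed", deviceId, cause);
        }
    }
}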

Comment by Shuva Jyoti Kar [ 30/Aug/16 ]

(In reply to Andrej Leitner from comment #8)
> Hi Luis,
> I think, there is still problem with excessive logging. I prepared the patch
> where log level is set to debug and also precondition isDebugEnabled is
> used. I ran cbench locally on this patch (+logging in karaf set to ERROR)
> and my CPU after test and karaf shutdown get to normal (2%) and didn't get
> stuck at 100%. Could you please test it in jenkins enviroment?
>
> https://git.opendaylight.org/gerrit/#/c/44868/

Logging the failures as error should be OK, since those are failures. Any trace logging should be removed, and let's have the debug logs checked (isDebugEnabled) before logging. I will also take a look at the places where we log and come up with improved logging.

Comment by Andrej Leitner [ 31/Aug/16 ]

Hi Shuva,
I think that even though an unsuccessful RPC result carries errors from the device, there is no reason to log an error, since it is not an error/failure of Karaf or OFP itself. It's "only" a problem of the device and we can go on. I think (I can't find it on the mailing lists now) that we should log ERROR only if there is a real problem and we are not able to recover (e.g. a restart is needed). Hence the log level for this shouldn't be more than info. I chose the debug level because information about the result is useful for/targeted at an admin and not a common user (and in any case there is still the operational datastore, where you can see whether a flow got installed or not).

Comment by Luis Gomez [ 01/Sep/16 ]

Upgrading to blocker due to impact in ODL perf reports.

Comment by Shuva Jyoti Kar [ 01/Sep/16 ]

(In reply to Luis Gomez from comment #11)
> Upgrading to blocker due to impact in ODL perf reports.

Luis, do we still see the controller closing the switch connections? Or is it that the controller CPU stays at 100% even after the test?

Comment by Luis Gomez [ 01/Sep/16 ]

I am currently testing the proposed patch; I will update shortly.

Comment by Luis Gomez [ 01/Sep/16 ]

This patch may improve some perf numbers, but it does not help with:

1) Test aborts due to the cbench switch disconnect issue. This is mostly addressed in this patch: https://git.opendaylight.org/gerrit/#/c/44773/

2) CPU stays very high after the test finishes and the switches are disconnected. This is the blocker to me now, as the controller seems to be inoperable after the test.

BR/Luis

Comment by Shuva Jyoti Kar [ 01/Sep/16 ]

(In reply to Luis Gomez from comment #14)
> This patch may improve some perf number but it does not help with:
>
> 1) Test abort due to cbench switches disconnect issue: This is mostly
> addressed in this patch: https://git.opendaylight.org/gerrit/#/c/44773/
>
> 2) CPU very high after test finishes and switches are disconnected. This is
> to me the blocker now as controller seems to be inoperable after the test.
>
> BR/Luis

Does the CPU usage come down after some time, or does it remain the same indefinitely?

Comment by Luis Gomez [ 01/Sep/16 ]

For as long as my patience allows, which is some minutes after the test, the CPU is still at 100%.

Comment by Luis Gomez [ 01/Sep/16 ]

FYI, https://git.opendaylight.org/gerrit/#/c/44773/ is already merged, so the test should go green, but the CPU issue is still there and I will probably extend the test to catch this scenario.

Comment by A H [ 02/Sep/16 ]

Is there an ETA for this bug, and is someone assigned to fix it?

Comment by Luis Gomez [ 02/Sep/16 ]

FYI, I added check to ver

Comment by Luis Gomez [ 02/Sep/16 ]

FYI, I added an extra check in the cbench test to track this bug:

https://git.opendaylight.org/gerrit/#/c/45046/

This job will fail until this issue gets fixed:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-boron/

BR/Luis

Comment by A H [ 06/Sep/16 ]

To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Boron without it? Is there a workaround such that we can write a release note and fix in future Boron SR1?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Andrej Leitner [ 06/Sep/16 ]

Severity
The bug is a blocker because of the performance regression ->
(In reply to Luis Gomez from comment #11)
> Upgrading to blocker due to impact in ODL perf reports.

Testing
The patch for this bug in ofplugin consists of decreasing the log level to prevent overloading the Netty thread with error logging (seen in the throughput test) and a ConcurrentModificationException at higher load (ibid.). There was no need to update unit tests; we ran the patch-test for ofplugin several times to check the impact.

There is also an issue in ofjava described in OPNFLWJAVA-79.

Impact
In principle, the fix does not impact any dependent projects.

Comment by Andrej Leitner [ 06/Sep/16 ]

Merged in Boron.

Comment by Luis Gomez [ 06/Sep/16 ]

This is fixed now according to:

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-1node-cbench-performance-only-boron/

BR/Luis

Comment by Luis Gomez [ 07/Sep/16 ]

Reopening the bug: the CPU issue is fixed, but after running the throughput test a couple of times (cbench -c 192.168.0.1 -t -m 12000 -M 100 -l 10 -s 16 -D 5000) I see memory issues:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000000cdc00000, 121634816, 0) failed; error='Cannot allocate memory' (errno=12)
#

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 121634816 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/vagrant/controller-test/distribution-karaf-0.5.0-SNAPSHOT/hs_err_pid10272.log
Killed

Comment by Luis Gomez [ 07/Sep/16 ]

After increasing the physical RAM size, I do not see issues running the test locally. There is still some instability in CI, so downgrading to Major until we understand why Beryllium does not show this instability.

Comment by A H [ 08/Sep/16 ]

(In reply to Luis Gomez from comment #24)
> This is fixed now according to:
>
> https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-1node-cbench-
> performance-only-boron/
>
> BR/Luis

Based on Luis's comment, can I safely assume that this bug has been verified as fixed in the latest Boron RC 3.1 build?

Comment by Luis Gomez [ 08/Sep/16 ]

This bug needs to remain open with lower priority in case we get some questions after the Boron release. The reason is that I had to modify the test to sleep 10 seconds between cbench runs in order to stabilize it:

https://git.opendaylight.org/gerrit/#/c/45294/

In Beryllium we did not have to do this.

BR/Luis

Comment by Andrej Leitner [ 08/Sep/16 ]

(In reply to Luis Gomez from comment #25)
> Reopening the bug: the CPU issue is fixed but after running throughput test
> couple of times (cbench -c 192.168.0.1 -t -m 12000 -M 100 -l 10 -s 16 -D
> 5000) I see memory issues:
>
> OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x00000000cdc00000, 121634816, 0) failed; error='Cannot
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 121634816 bytes for
> committing reserved memory.
> # An error report file with more information is saved as:
> #
> /home/vagrant/controller-test/distribution-karaf-0.5.0-SNAPSHOT/
> hs_err_pid10272.log
> Killed

OPNFLWJAVA-78 in openflowjava could be related to the memory issues. OPNFLWJAVA-79, merged as a dependency for our cbench blocker, allows failing unflushed segments; however, failed entries in a segment are not counted as completed (OPNFLWJAVA-78), so the segment is never considered complete and thus never removed. The patch is already merged in Carbon and was tested by Luis as per the comments: https://git.opendaylight.org/gerrit/#/c/45210/ and it got +2 from Michal. Not sure if it is still possible to merge it for RC3.
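
To make the accounting problem concrete, here is an illustrative sketch (not the actual openflowjava StackedSegment code) of why a segment whose failed entries are not counted as completed can never be reclaimed:

// Simplified model of a fixed-size segment of outbound queue entries.
final class SegmentSketch {
    private final int capacity;
    private int completed;

    SegmentSketch(final int capacity) {
        this.capacity = capacity;
    }

    // Successful entries are counted towards completion.
    void entrySucceeded() {
        completed++;
    }

    // The bug described above: if this method does NOT increment 'completed',
    // isComplete() never returns true and the segment is never removed,
    // so memory grows across repeated cbench runs.
    void entryFailed() {
        completed++;
    }

    // The segment can be reclaimed only once every entry is accounted for.
    boolean isComplete() {
        return completed >= capacity;
    }
}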

Comment by Tomas Slusny [ 08/Sep/16 ]

Regarding the cbench instability: this is a bug in cbench.

When we think a switch is idle, we send a HELLO message to the switch and expect the switch to also reply with a HELLO message. But cbench was not sending this HELLO message and silently ignored all incoming HELLO messages.

I added a proper HELLO reply on received HELLO to the cbench sources. Here is my fork of the cbench repo with this fix: https://github.com/deathbeam/oflops.

I also created a pull request to the official repository, but since it is pretty inactive, I doubt it will ever be merged.

Comment by Tomas Slusny [ 08/Sep/16 ]

After some more testing: we are actually sending ECHO and not HELLO on switch idle, and cbench is actually trying to send an ECHO_REPLY, but I think for some reason we are not receiving this ECHO_REPLY in time. I will investigate this a bit more.
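
For context, a rough sketch of the idle/echo keepalive behaviour being discussed, with hypothetical names (this is not the actual plugin code): on idle the controller sends an ECHO request and drops the connection if the ECHO_REPLY does not arrive within the timeout, which matches the disconnects seen when cbench replies too late.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class IdleEchoCheckSketch {

    // Minimal connection abstraction assumed for the sketch.
    interface SwitchConnection {
        CompletableFuture<Void> sendEchoRequest();
        void close();
    }

    void onIdle(final SwitchConnection connection, final long timeoutMillis) {
        try {
            // Wait for the ECHO_REPLY; a late reply ends up here as a timeout.
            connection.sendEchoRequest().get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // No (timely) ECHO_REPLY: treat the switch connection as dead and close it.
            connection.close();
        }
    }
}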

Comment by Luis Gomez [ 08/Sep/16 ]

OK, there is a line in the sand of 11:59 PM UTC on Sunday; if you find something and can get a patch in by then, fine, otherwise it will have to wait until SR1.

Comment by Andrej Leitner [ 13/Sep/16 ]

We are getting a 100% pass rate on Jenkins since Sep 8.

Comment by Andrej Leitner [ 19/Sep/16 ]

Luis, could we close the bug as resolved?

Comment by Luis Gomez [ 22/Sep/16 ]

Yes, this is fixed now.

Comment by Andrej Leitner [ 27/Oct/16 ]

The latency rerun is failing occasionally.

Comment by Tomas Slusny [ 21/Nov/16 ]

According to Jenkins (https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-carbon/ and https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cbench-performance-only-boron/), cbench runs are now pretty stable, with only occasional failures (and in those runs all 3 tests failed, so it is probably an environment issue), so I am closing this again.

Comment by Luis Gomez [ 21/Nov/16 ]

Right, this can be closed for now.
