[OPNFLWPLUG-217] 24hr performance degradation Created: 18/Jul/14  Updated: 27/Sep/21  Resolved: 17/Aug/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Christopher O'Shea Assignee: Marcus G K Williams
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: PC


Attachments: File 1395.odc     File 1395_Lithium_data.ods     File 1395_fix.csv     PNG File Graph of 1395fix CSV file flows.png     PNG File OF_1_Bug1395.PNG     PNG File OpenFlow performance.png     File results-18-6-2014.csv    
External issue ID: 1395

 Description   

Following the steps in the Wiki https://wiki.opendaylight.org/view/CrossProject:Integration_Group:Performance_Test

With "dropAllPacketsRPC on" there is a noticeable performance degradation.

Every 30 minutes cbench is run (cbench -c $controllerip -p 6633 -m 10000 -l 10 -s 256 -M 100000 -w 1).

(Thu Jul 17 07:53:43 PDT 2014) 30 minutes after starting the controller, min/max/avg is 21434.08/22523.66/21841.65
(Thu Jul 17 11:53:43 PDT 2014) 4:30hr after starting the controller, min/max/avg is 19243.97/20008.8/19744.69
(Thu Jul 17 19:53:43 PDT 2014) 12:30hr after starting the controller, min/max/avg is 14169.36/14786.94/14444.8
(Fri Jul 18 07:23:43 PDT 2014) 24:00hr after starting the controller, min/max/avg is 10534.49/10943.67/10767.37

See the CSV file for the detailed report and the PNG file for the visual report.
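For reference, a minimal sketch (my addition, not part of the original report) of the 30-minute cadence described above, assuming cbench is on the PATH; the controller address and log file name are placeholders:

import subprocess
import time

CONTROLLER_IP = "127.0.0.1"  # placeholder controller address
CBENCH_CMD = [
    "cbench", "-c", CONTROLLER_IP, "-p", "6633",
    "-m", "10000", "-l", "10", "-s", "256", "-M", "100000", "-w", "1",
]

with open("cbench_runs.log", "a") as log:
    while True:
        # Run one cbench iteration and record its output with a timestamp.
        result = subprocess.run(CBENCH_CMD, capture_output=True, text=True)
        log.write(time.ctime() + "\n" + result.stdout + "\n")
        log.flush()
        time.sleep(30 * 60)  # wait 30 minutes before the next run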



 Comments   
Comment by Christopher O'Shea [ 18/Jul/14 ]

Attachment OpenFlow performance.png has been added with description: Visual plot of performance.

Comment by Christopher O'Shea [ 18/Jul/14 ]

Attachment results-18-6-2014.csv has been added with description: CSV breakdown of memory and CPU stats

Comment by Daniel Farrell [ 06/Aug/14 ]

Here's another example of this observed performance degradation[1].

Note that (perhaps obviously) restarting ODL between CBench runs mitigates this decrease[2].

[1]: https://github.com/dfarrell07/cbench_regression/issues/13#issuecomment-47366340
[2]: https://github.com/dfarrell07/cbench_regression/issues/13#issuecomment-47538456

Comment by Christopher O'Shea [ 10/Sep/14 ]

Switched to using the Karaf controller (RC0):

        1. Install the needed bundles:
           feature:install odl-flow-services
           feature:install odl-openflowplugin-drop-test
        2. Turn on drop-test:
           drop-test:dropallpacketsrpc on

The first set of test runs is fine. After waiting 30 minutes and starting the next test run, the first iterations are OK but the last few return 0.

Comment by Christopher O'Shea [ 16/Sep/14 ]

I had the controller running for 24 hours, but wasn't running cbench every 30 minutes.

Looks like the first run after the controller stabilized was around 38k, and it didn't change when I ran it again 24 hours later.

Will set it to run every 30 minutes tonight and update the ticket tomorrow.

Comment by Daniel Farrell [ 19/Nov/14 ]

I'm still seeing perf degradation over time in Helium 0.2.1. Lots of details on this WCBench wiki page[1].

[1]: https://github.com/dfarrell07/wcbench/wiki/Helium-0.2.1-Performance-Results#results-0

Comment by Daniel Farrell [ 10/Feb/15 ]

The perf results wiki I linked to previously has changed addresses. The updated URLs (including much more detailed results) are:

https://github.com/dfarrell07/wcbench/wiki/Unusual-Helium-0.2.1-Performance-Results
https://github.com/dfarrell07/wcbench/wiki/Experiment:-Varying-NUM_SWITCHES

Comment by Marcus G K Williams [ 23/Feb/15 ]

Reproducing this issue. I'll work on a fix after reproduction.

Comment by Marcus G K Williams [ 18/Mar/15 ]

Graph of OPNFLWPLUG-217 reproduction showing cbench stats and RAM usage.

Comment by Marcus G K Williams [ 18/Mar/15 ]

Attachment OF_1_Bug1395.PNG has been added with description: Graph of CBench Results showing OPNFLWPLUG-217

Comment by Marcus G K Williams [ 18/Mar/15 ]

I've reproduced this many times. I've attached a graph of the results I'm seeing on Helium SR1, though I've also reproduced it on the latest integration nightly build.

I'm currently profiling the ODL JVM during bug reproduction in an attempt to narrow down likely causes and focus on the correct area of the code. Initial profiling data shows an ever-increasing number of nioEventLoopGroup threads after each cbench run. More analysis is needed before this can be definitively named as the root cause. The source of the nio thread growth is under investigation and is currently being traced back to the code.
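For anyone trying to confirm the thread growth, a minimal sketch (my addition, assuming jstack is available and the controller JVM PID is known) that samples the nioEventLoopGroup thread count over time:

import subprocess
import sys
import time

def nio_thread_count(pid):
    # Count thread entries in a jstack dump whose name mentions nioEventLoopGroup.
    dump = subprocess.run(["jstack", str(pid)], capture_output=True, text=True).stdout
    return sum(1 for line in dump.splitlines()
               if line.startswith('"') and "nioEventLoopGroup" in line)

if __name__ == "__main__":
    pid = int(sys.argv[1])  # controller JVM PID, e.g. python nio_count.py 12345
    while True:
        print(time.ctime(), nio_thread_count(pid))
        time.sleep(60)  # sample once a minute

If the count climbs after every cbench run and never falls back, that supports the nioEventLoopGroup lead.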

Comment by Jamo Luhrsen [ 20/Jul/15 ]

I've retested this with the Lithium release and the issue is still there. With 57 iterations, the average performance value went from 20k to under 5k. Attached is a spreadsheet with this info.

Comment by Jamo Luhrsen [ 20/Jul/15 ]

Attachment 1395_Lithium_data.ods has been added with description: re-test with Lithium

Comment by Sania Zara [ 06/Oct/15 ]

Submitted a patch for review https://git.opendaylight.org/gerrit/#/c/27867/

Comment by Jamo Luhrsen [ 12/Oct/15 ]

Attachment 1395.odc has been added with description: degradation chart with potential fix

Comment by Jamo Luhrsen [ 12/Oct/15 ]

(In reply to Jamo Luhrsen from comment #13)
> Created attachment 656 [details]
> degradation chart with potential fix

The attached chart is the data I obtained when testing with Sania's fix.

Comment by Daniel Farrell [ 13/Oct/15 ]

> The attached chart is the data I obtained when testing with Sania's fix.

Am I reading this correctly - the summary is that perf is still degrading in a similar way to the initial reports?

Comment by Jamo Luhrsen [ 13/Oct/15 ]

(In reply to Daniel Farrell from comment #15)
> > The attached chart is the data I obtained when testing with Sania's fix.
>
> Am I reading this correctly - the summary is that perf is still degrading in
> a similar way to the initial reports?

that's correct.

Comment by Muhammad Umar Hameed [ 26/Oct/15 ]

I have tested Sania's patch and it's working fine with no degradation.

Comment by Muhammad Umar Hameed [ 26/Oct/15 ]

Attachment 1395_fix.csv has been added with description: 1395fix

Comment by Daniel Farrell [ 27/Oct/15 ]

(In reply to Muhammad Umar Hameed from comment #17)
> Created attachment 672 [details]
> 1395fix
>
> I have tested Sania's patch and it's working fine with no degradation.

Your standard deviation is quite high, which could be masking a trend.

Still, interesting result, thanks.

Graph attached.

/wcbench$ ./stats.py -s flows 12:12:53
{'flows': {'max': 62104, 'mean': 44968.913, 'min': 26658, 'relstddev': 15.511, 'stddev': 6975.297},
 'sample_size': 500}
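For context, a small sketch (my addition, assumed rather than taken from WCBench) of how relstddev relates to the raw samples, plus a crude first-half/second-half comparison that can expose a downward trend hidden by a large standard deviation; the sample list is a placeholder:

import statistics

flows = [44968.9, 43120.0, 46210.5, 40555.2]  # placeholder samples, not real data

mean = statistics.mean(flows)
stddev = statistics.pstdev(flows)
relstddev = 100.0 * stddev / mean  # percent, roughly as reported by stats.py
print(mean, stddev, relstddev)

# Crude trend check: a clearly lower mean in the later half suggests
# degradation even when the overall stddev is large.
half = len(flows) // 2
print(statistics.mean(flows[:half]) - statistics.mean(flows[half:]))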

Comment by Daniel Farrell [ 27/Oct/15 ]

Attachment Graph of 1395fix CSV file flows.png has been added with description: Graph of 1395fix CSV file created using WCBench

Comment by Muhammad Umar Hameed [ 31/Oct/15 ]

Anyone else tested with the same patch?

Comment by Daniel Farrell [ 02/Nov/15 ]

(In reply to Muhammad Umar Hameed from comment #17)
> Created attachment 672 [details]
> 1395fix
>
> I have tested Sania's patch and it's working fine with no degradation.

Jamo also verified the fix and the patch has now been merged.

https://git.opendaylight.org/gerrit/#/c/27867/

We may be able to close the longest-standing known perf bug in ODL.

Comment by Abhijit Kumbhare [ 09/Nov/15 ]

Fixed by Sania's patch: https://git.opendaylight.org/gerrit/#/c/27867/

Comment by A H [ 23/Nov/15 ]

Should we reopen this bug since it was reverted for stable/lithium in the patch below?

https://git.opendaylight.org/gerrit/#/c/30085

Comment by Luis Gomez [ 23/Nov/15 ]

I just reopened it.

Comment by Sania Zara [ 03/Dec/15 ]

We employed black-box analysis to locate the source of this performance issue with minimal understanding of the ODL code, so the solution we provided doesn't work well, and I can now see why. Anyway, I want to share our findings regarding the source of the problem.

Wcbench connects a new set of switches after each run, disconnecting the previous set; this can be verified using Wireshark or even netstat.

CPU profiling of ODL shows that a poll method invoked from QueueKeeperHarvester gets called an increasing number of times after each run. According to my understanding, this method polls a collection of QueueKeeper objects that contain queues of OF messages (data + control) coming from the switches. If we check the size of that collection while wcbench is running, it keeps increasing after every run, and the increase is equal to the number of switches. The number of calls to the poll method increases linearly with the size of the collection. This was verified by keeping the same set of switches connected to ODL for a long-running test; this time the performance didn't degrade.

So the problem is that when a switch disconnects, the QueueKeeper object corresponding to that switch is not closed/destroyed/removed from the collection. And since a new set of switches is connected after each run, the size of the aforementioned collection keeps increasing. What (I think) I did wrong in the patch I submitted was that I closed the QueueKeeper prematurely.

It would be great if someone with a better understanding of the ODL code could comment on this.
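To make the pattern concrete, a sketch in Python with invented names (not the actual ODL Java code): each connected switch gets a queue keeper, the harvester polls every keeper it holds, and if keepers are never removed on disconnect the collection, and with it the per-cycle poll count, grows by the number of switches after every run:

import collections

class QueueKeeper:
    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.queue = collections.deque()  # OF messages from this switch

    def poll(self):
        return self.queue.popleft() if self.queue else None

class Harvester:
    def __init__(self):
        self.keepers = []  # grows forever if disconnects are never handled

    def on_switch_connected(self, switch_id):
        self.keepers.append(QueueKeeper(switch_id))

    def on_switch_disconnected(self, switch_id):
        # This is the step described as missing above; the fix has to remove the
        # keeper at the right point in its lifecycle, not prematurely.
        self.keepers = [k for k in self.keepers if k.switch_id != switch_id]

    def harvest_cycle(self):
        # Work per cycle scales with the number of keepers retained,
        # not with the number of live switches.
        return [k.poll() for k in self.keepers]

if __name__ == "__main__":
    h = Harvester()
    for run in range(3):
        for s in range(16):  # each run connects a fresh set of switches
            h.on_switch_connected((run, s))
        h.harvest_cycle()
        # Without calling on_switch_disconnected for the previous set,
        # len(h.keepers) grows by 16 per run: 16, 32, 48, ...
        print(run, len(h.keepers))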

Comment by Colin Dixon [ 19/Jan/16 ]

Is there a plan for how to incorporate the fact that we know the root cause of the bug into an actual fix? Also, could we please explain here why the fix was reverted?

Comment by Jamo Luhrsen [ 19/Jan/16 ]

(In reply to Colin Dixon from comment #26)
> Is there a plan for how to incorporate the fact that we know the root cause
> of the bug into an actual fix? Also, could we please explain here why the
> fix was reverted?

Once this fix went in, a new bug was uncovered:
https://bugs.opendaylight.org/show_bug.cgi?id=4649

It was decided to fix bug 4649 by reverting the patch that fixed this bug (1395).

Comment by Jozef Bacigal [ 17/Aug/16 ]

Closing this bug. It concerned performance degradation in the Helium design, which is no longer used in Boron/Carbon.

If any project or anyone else needs this problem fixed for the Helium design in Beryllium SR4, they should raise a new bug. But I doubt there is time for such a thing; we have a lot of work with the new releases.
