[OPNFLWPLUG-630] Li plugin: Scalability issues with OVS 2.4 Created: 04/Mar/16  Updated: 27/Sep/21  Resolved: 30/Jun/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Alexis de Talhouët Assignee: Alexis de Talhouët
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 5464

 Description   

While performing scalability tests against the OpenFlowPlugin codebase (stable/lithium branch, -li plugin) with statistics collection enabled, I encountered the following issue:

  • OvS 2.3.x or earlier:
    Scalability seems fine, or at least goes up to 400+ switches.
  • OvS 2.4.x or newer:
    Scalability is capped at around 45-50 switches.
    At the tipping point, the switches close the connection and then send a hello message. As ODL is already in bad shape at that point, it goes into a crazy loop where it tries to re-establish the connection for all switches, one by one, but fails. This used to end in an OOM error.
    This crazy loop behaviour was recently fixed with Bug-4957; now, at the tipping point, ODL goes crazy for a bit, then recovers and stabilizes, although all switches are disconnected and no further connections are possible.
    Here are some logs, and a Yourkit Java Profiler snapshot: https://www.dropbox.com/sh/1zz1x6i1bl5uor8/AACHJfML-RqvOk7vFI5U-haJa?dl=0

I will set up tests in the ODL infra to track progress and to get a better picture of scalability performance. Work started here: https://git.opendaylight.org/gerrit/#/c/35813/



 Comments   
Comment by Jozef Bacigal [ 16/Mar/16 ]

Alexis, can you please explain what exactly "ODL goes crazy a bit" means? Can you point me to the exact part of the log? Thank you.

Jozef

Comment by Alexis de Talhouët [ 16/Mar/16 ]

Jozef, please read this mail I sent a while ago: https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-February/004631.html

I know the behavior has changed a bit since a lot of code came in to stabilize transaction/role/etc...

Also, it is pretty easy to reproduce: I have created a patch in integration test with all the Dockerfiles and scripts I've been using to create the scale testbed.

I can reproduce it with the current codebase by the end of the week if you want.

Thanks,
Alexis

Comment by Luis Gomez [ 16/Mar/16 ]

This bug is actually very easy to reproduce with OVS 2.4:

1) Download latest ODL Be from Nexus
2) Start He or Li plugin (I think issue is on both)
3) Start mininet tree,4 (this is not even a large topology)
4) CPU goes crazy, and there is also a memory leak

>OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000007b5c00000, 95944704, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 95944704 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/mininet/controller-master/distribution-karaf-0.4.1-SNAPSHOT/hs_err_pid18041.log
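
To put "not even a large topology" in numbers, here is a minimal Python sketch that counts the nodes in a mininet tree topology. It assumes mininet's default fanout of 2 for `tree,4`; the function name is only illustrative.

```python
def tree_topology_size(depth, fanout=2):
    """Return (switches, hosts) for a full tree of the given depth and fanout.

    Assumes every level above the leaves consists of switches and the leaves
    are hosts, which is how mininet lays out its tree topology.
    """
    if depth < 1 or fanout < 1:
        raise ValueError("depth and fanout must be >= 1")
    # One switch at the root, fanout switches below it, and so on.
    switches = sum(fanout ** level for level in range(depth))
    hosts = fanout ** depth
    return switches, hosts


if __name__ == "__main__":
    switches, hosts = tree_topology_size(4)
    print(f"tree,4 -> {switches} switches, {hosts} hosts")
```

Under that assumption, the topology in step 3 has 15 switches and 16 hosts.
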
Comment by Alexis de Talhouët [ 17/Mar/16 ]

So I analyzed what is going on:

OvS has an inactivity probe timeout that defaults to 5 seconds [0]. If, within 2 * the inactivity probe (10 seconds by default), the switch doesn't hear from the controller (an ECHO_REQUEST, or any message in fact), the switch closes the connection. [1]
So, under a certain load, ODL is not able to process all switches within this timeout, and thus receives disconnect events followed by hello messages from the switches.

I am wondering whether this has something to do with the HashedWheelTimer in the DeviceManagerImpl class: https://github.com/opendaylight/openflowplugin/blob/stable/lithium/openflowplugin-impl/src/main/java/org/opendaylight/openflowplugin/impl/device/DeviceManagerImpl.java#l81

[0]
Logic to initialize inactivity timer value for a specific bridge from ovsdb content.
https://github.com/openvswitch/ovs/blob/master/vswitchd/bridge.c#L3415

[1]
Logic to handle no response from inactivity probe
https://github.com/openvswitch/ovs/blob/master/lib/rconn.c#L559-L560
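
The timing argument above can be sketched as follows. This is a deliberately simplified model in Python, not OvS or ODL code: it assumes the controller services switches strictly one after another at a fixed cost per switch, and that a switch gives up once it has heard nothing for 2 * the inactivity probe. The function names and the per-switch cost are hypothetical.

```python
DEFAULT_INACTIVITY_PROBE_MS = 5000  # OvS default of 5 seconds, as stated above


def disconnect_deadline_ms(inactivity_probe_ms=DEFAULT_INACTIVITY_PROBE_MS):
    """Silence (in ms) after which a switch drops the controller connection."""
    return 2 * inactivity_probe_ms


def max_switches_before_timeout(per_switch_cost_ms,
                                inactivity_probe_ms=DEFAULT_INACTIVITY_PROBE_MS):
    """Largest switch count a strictly sequential controller can serve in time.

    per_switch_cost_ms is a made-up figure: the time the controller spends on
    one switch before it can send anything to the next one.
    """
    if per_switch_cost_ms <= 0:
        raise ValueError("per_switch_cost_ms must be positive")
    return disconnect_deadline_ms(inactivity_probe_ms) // per_switch_cost_ms


def round_fits_in_deadline(num_switches, per_switch_cost_ms,
                           inactivity_probe_ms=DEFAULT_INACTIVITY_PROBE_MS):
    """True if one full pass over all switches finishes before the deadline."""
    round_time_ms = num_switches * per_switch_cost_ms
    return round_time_ms <= disconnect_deadline_ms(inactivity_probe_ms)


if __name__ == "__main__":
    cost_ms = 250  # illustrative only, not a measured value
    print(max_switches_before_timeout(cost_ms))
    print(round_fits_in_deadline(127, cost_ms))
```

In this model, anything that raises the per-switch cost lowers the number of switches that can be kept connected, and a longer inactivity probe raises it.
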

Comment by Luis Gomez [ 21/Mar/16 ]

Updating bug title to reflect issue only happens in Li plugin.

Comment by Jozef Bacigal [ 21/Mar/16 ]

We talked with Michal Rehak, and table features really do put a huge amount of data and a huge load on the DS. With the option to switch off table features, everything ran smoothly on my local mininet with 127 switches.

I would appreciate it if someone could test this properly with this patch:

https://git.opendaylight.org/gerrit/#/c/36506/

This is only a workaround anyway, but we can deliver it in SR1.

If I am not mistaken, everyone can still use the RPC for table feature requests.

Thanks

Jozef

Comment by Alexis de Talhouët [ 22/Mar/16 ]

Jozef,

I confirm this increases the scalability of the OFP-lithium. In fact, this is the workaround I used for my app.
One question though: would it make sense to make all the features configurable through the config subsystem?
I sent a mail a couple of weeks ago proposing that, but got few replies: https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-March/004769.html

Would you mind amending this patch so one can define which features they want? This would be great! Otherwise I can do it, if you agree it makes sense.

Thanks,
Alexis

Comment by Jozef Bacigal [ 23/Mar/16 ]

Hi Alexis,

I don't think having "all" features configurable is such a good idea: the more options there are to set or change, the more wrong settings you get, along with more problems and more questions about how to set them properly.

It would be wise to keep configurable features to a minimum.

Jozef

Comment by Michal Rehak [ 05/Apr/16 ]

Hi all,
this is more of a fix than a workaround:
https://git.opendaylight.org/gerrit/#/c/36559

Here, table-features got moved outside of table in order not to put additional heavy load on statistics updates.

The trade-off is that downstream apps would need to adapt their instance-identifiers to table features (I checked NIC and DIDM; both read table-features).

The upside of this fix is that no exclusive flag in config is needed, and projects like NIC/DIDM can still coexist alongside other, "table-features free" projects.

So please test this change. Thank you.
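
The kind of adaptation downstream apps would need can be sketched as a path rewrite. This is a minimal Python illustration, not the actual model: it assumes table-features used to sit inside a table entry and now sits directly under the node, and every path segment and function name below is hypothetical rather than copied from the patch.

```python
def old_table_features_path(node_id, table_id):
    """Hypothetical pre-change location: table-features nested inside table."""
    return ("nodes", f"node={node_id}", f"table={table_id}", "table-features")


def new_table_features_path(node_id, table_id):
    """Hypothetical post-change location: table-features next to table."""
    return ("nodes", f"node={node_id}", f"table-features={table_id}")


def migrate_path(path):
    """Rewrite an old-style table-features path; return other paths unchanged."""
    is_old_style = (
        len(path) == 4
        and path[2].startswith("table=")
        and path[3] == "table-features"
    )
    if not is_old_style:
        return path
    table_id = path[2].split("=", 1)[1]
    return path[:2] + (f"table-features={table_id}",)


if __name__ == "__main__":
    print(migrate_path(old_table_features_path("openflow:1", 0)))
```

Paths that never touched table-features pass through untouched, which is why "table-features free" projects are unaffected in this picture.
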

Comment by Luis Gomez [ 05/Apr/16 ]

I like this solution better than the config workaround. One of the reasons is that we do not want to ask users to set special configuration to run perf/scale tests; perf/scale should be good out-of-the-box.

BR/Luis

Comment by Hideyuki Tai [ 20/May/16 ]

Hi all,

I've found out that the patch for OPNFLWPLUG-630 has already been merged.
https://git.opendaylight.org/gerrit/#/c/36559/
That patch changes the model.

As a result, it has broken the NIC project's build.
The build on the master branch of the NIC project has been failing continuously since May 13.
https://jenkins.opendaylight.org/releng/view/nic/job/nic-integration-boron/

I think we should notify downstream projects of any model changes before we merge the patch for model changes.

Comment by Jozef Bacigal [ 23/May/16 ]

Hideyuki, we talked about it a lot, and besides, it has been a long time and nobody from NIC has noticed that NIC is broken.

Here is the link

https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-May/005123.html

where we talked about it.

If we waited too long for anyone to give us feedback, we would still be waiting for someone.

PEACE

Jozef

P.S.: Sorry to cause you problems, but I really thought we all agreed to merge the changes onto master.

Comment by Hideyuki Tai [ 23/May/16 ]

(In reply to Jozef Bacigal from comment #12)
> Hideyuki, we talked about it a lot, and besides, it has been a long time
> and nobody from NIC has noticed that NIC is broken.
>
> Here is the link
>
> https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-May/005123.html
>
> where we talked about it.
>
> If we waited too long for anyone to give us feedback, we would still be
> waiting for someone.
>
> PEACE
>
> Jozef
>
> P.S.: Sorry to cause you problems, but I really thought we all agreed to merge
> the changes onto master.

Yeah, it's quite weird that nobody in the NIC project was aware of that build failure for such a long time.

And, I'm sorry I was not aware you guys talked about it a lot before the merge.

Thanks!

Generated at Wed Feb 07 20:32:58 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.