[CONTROLLER-661] Statistics Manager performance poor for large number of flows Created: 05/Aug/14  Updated: 05/May/15  Resolved: 05/May/15

Status: Resolved
Project: controller
Component/s: adsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Jan Medved Assignee: Vaclav Demcak
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Mac OS
Platform: Macintosh


Attachments: Call-tree--all-threads-together,-CPU-usage-estimation.zip (Zip Archive), flow_config_perf copy.py (File)
Issue Links:
Blocks
blocks OPNFLWPLUG-228 Continuous ERROR logs in org.opendayl... Resolved
blocks OPNFLWPLUG-250 Continuous WARN logs in o.o.c.m.s.d.s... Resolved
is blocked by OPNFLWPLUG-227 FRM/Statistics Manager: figure out fl... Resolved
External issue ID: 1484

 Description   

The controller was loaded with 10,000 flows, and things went down fairly spectacularly. The controller was running a mininet with 15 nodes, and a script programmed 10k flows randomly (evenly) distributed over the 15 nodes. All 10k flows are written into the config space properly (we have no problem there).

Things start going really bad when we try to program the flows into mininet. CPU utilization goes to almost 100% and we start losing connections to mininet switches. We only manage to program about 3700 flows (as reported back eventually by the table stats collected from the mininet switches). I assume this is because we keep losing the connection when both the config and oper space start being utilized heavily (stats collection -> oper space, plus flows into config space). Since the msgContDump stats (plugin) show 10k enqueued AddFlow messages, I assume the messages got lost either in openflowjava or in the mininet switches themselves (I would suspect openflowjava first, because I do not see any real utilization in the mininet switches).

As the number of flows increases, stats collection starts taking a disproportionate amount of CPU cycles. Interestingly, the bottleneck does not seem to be the MD-SAL data store, but the codecs. I am attaching the profiler run.

Things that need to be fixed:

  • figure out how to do flow indexing (OPNFLWPLUG-227)
  • Codecs (which is happening, I guess)
  • See if we can write into the MD-SAL data store in batches, rather than
    as individual stats (see the sketch after this list).
  • Another thought would be to use the DOM API for stats rather than the
    generated Java APIs.
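
A minimal sketch of what batched stats writes could look like against the Helium-era binding DataBroker API; the class and method names other than the MD-SAL types are illustrative, and error handling on the returned future is omitted:

    import java.util.Map;

    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    public class BatchedStatsWriter {

        private final DataBroker dataBroker;

        public BatchedStatsWriter(DataBroker dataBroker) {
            this.dataBroker = dataBroker;
        }

        /**
         * Write a whole batch of collected statistics in one transaction:
         * one put() per entry, but a single submit(), so the datastore
         * commit (and the codec work on the commit path) runs once per
         * batch instead of once per flow.
         */
        public <T extends DataObject> void writeBatch(Map<InstanceIdentifier<T>, T> batch) {
            WriteTransaction tx = dataBroker.newWriteOnlyTransaction();
            for (Map.Entry<InstanceIdentifier<T>, T> entry : batch.entrySet()) {
                tx.put(LogicalDatastoreType.OPERATIONAL, entry.getKey(), entry.getValue());
            }
            tx.submit(); // CheckedFuture; a real implementation would attach a callback
        }
    }

The same batching idea would apply if the writes went through the DOM broker with NormalizedNode payloads instead of the generated bindings, which could also sidestep the binding codec overhead flagged in the profile.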


 Comments   
Comment by Jan Medved [ 05/Aug/14 ]

Attachment Call-tree--all-threads-together,-CPU-usage-estimation.zip has been added with description: Profiler output for controller with 10k flows

Comment by Jan Medved [ 05/Aug/14 ]

Attachment flow_config_perf copy.py has been added with description: Python script that can be used to add 10k flows

Comment by Anil Vishnoi [ 05/Aug/14 ]

This is something that has been pending for a long time. As of now we do a manual comparison to figure out the exact flow onto which we can augment the stats. This is not going to scale for sure. AFAIK Michal and team are working on using flow cookies as a flow-id; it is work in progress. Once it is done, we can get rid of the custom comparator code we use for flow matching and directly augment stats using the flow-id. That, I believe, will improve the performance significantly.
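
A minimal illustration of that cookie-based lookup (the index class below is hypothetical, not actual Statistics Manager code; it just shows the cookie value recorded at programming time replacing a field-by-field comparator when stats come back):

    import java.math.BigInteger;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Maps the cookie assigned when a flow is written to the config
     * datastore to that flow's id, so that when statistics arrive from
     * the switch the matching config flow is found with one map lookup
     * instead of a comparator scanning every match field of every
     * candidate flow.
     */
    public class CookieFlowIndex {

        // cookie value assigned at programming time -> config flow-id
        private final Map<BigInteger, String> cookieToFlowId = new ConcurrentHashMap<>();

        public void onFlowProgrammed(BigInteger cookie, String flowId) {
            cookieToFlowId.put(cookie, flowId);
        }

        /** O(1) lookup; returns null if the stats entry carries an unknown cookie. */
        public String flowIdForCookie(BigInteger statsCookie) {
            return statsCookie == null ? null : cookieToFlowId.get(statsCookie);
        }
    }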

Comment by Tony Tkacik [ 24/Sep/14 ]

https://git.opendaylight.org/gerrit/#/c/10605/

Comment by Carol Sanders [ 05/May/15 ]

This bug is part of the project to move all ADSAL-associated component bugs to ADSAL.
