[NETCONF-355] Performance regression in controller-rest-cars-perf Created: 28/Feb/17  Updated: 15/Mar/19  Resolved: 21/Apr/17

Status: Verified
Project: netconf
Component/s: restconf-nb
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File Main-2017-04-18-2.snapshot.xz     HTML File compare     File time results.tar.gz    
External issue ID: 7868

 Description   

The regression may be in clustering code, or in Restconf code.
This affects both Boron and Carbon snapshots, in comparison with Beryllium snapshots.

The biggest difference is in Add_Cars test case, which uses a script [0] to send single POST request to create 10000 car items. But also Add_People test case takes longer, compare Boron durations [1] to Beryllium [2].

[0] https://github.com/opendaylight/integration-test/blob/master/tools/odl-mdsal-clustering-tests/scripts/cluster_rest_script.py
[1] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-boron/870/robot/controller-rest-cars-perf.txt/
[2] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-beryllium/116/robot/controller-rest-cars-perf.txt/



 Comments   
Comment by Robert Varga [ 23/Mar/17 ]

Well, Verify_Cars fails, can they be related?

To quantify this a bit, we are seeing about 100% performance degradation observed from test duration.

Comment by Vratko Polak [ 29/Mar/17 ]

> Verify_Cars fails

I do not see that in recent builds. Links without specific build number:
Be [3], B [4], C [5].

Note that those are single node tests, the suite runs also in three node setup [6], but the difference in performance is small.

[3] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-beryllium/lastSuccessfulBuild/robot/controller-rest-cars-perf.txt/
[4] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-boron/lastSuccessfulBuild/robot/controller-rest-cars-perf.txt/
[5] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-carbon/lastSuccessfulBuild/robot/controller-rest-cars-perf.txt/010%20Crud%20Mdsal%20Perf/
[6] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-rest-clust-cars-perf-only-carbon/lastSuccessfulBuild/robot/controller-rest-clust-cars-perf.txt/010%20Crud%20Mdsal%20Perf/

Comment by Vratko Polak [ 11/Apr/17 ]

Numeric values (for times when Beryllium jobs are removed) for Add_Cars:
Beryllium: 0:00:28.534
Boron: 0:01:24.138
Carbon: 0:01:36.001

Comment by Robert Varga [ 18/Apr/17 ]

Testing with distribution-karaf-0.6.0-20170418.121829-4936, it looks like the add case is doing an awful lot of reads.

I will profile this further to get to the bottom of this.

Comment by Ivan Hrasko [ 18/Apr/17 ]

Add must check for existence (by read) but reads were also in beryllium,
by comparison of code there are exactly the same operations on transactions (merge, read, put) in master as were done in beryllium.

There are some modifications in master compare to beryllium (uri parameters) but in this simple case they are not applied.

When profiled beryllium and master and compared there were a lot of changes in Yangtools, maybe added some additional verifications for data.

Comment by Ivan Hrasko [ 18/Apr/17 ]

Did you find more reads in recent versions?

Comment by Robert Varga [ 18/Apr/17 ]

Well, the reason for the regression is in RESTCONF. The regression was introduced in https://git.opendaylight.org/gerrit/#/c/30554/48/restconf/sal-rest-connector/src/main/java/org/opendaylight/netconf/sal/restconf/impl/BrokerFacade.java@320, as now BrokerFacade issues a read request for each individual item in the list.

Comment by Robert Varga [ 18/Apr/17 ]

So, the question is: what exactly is the reason for issuing the massive amounts of reads? I would prefer exact quotes from drafts, because the code sure as hell does not contain any sort of documentation about what is going on.

Comment by Robert Varga [ 18/Apr/17 ]

Attachment Main-2017-04-18-2.snapshot.xz has been added with description: YKP tracing snapshot

Comment by Robert Varga [ 18/Apr/17 ]

Snapshot shows BrokerFacade executing 10K read operations. Each of those is being waited synchronously, which is not something you can do while expecting reasonable performance.

Comment by Robert Varga [ 18/Apr/17 ]

Furthermore, these reads are executed as the target is being modified, which is forcing DataTree to apply operations so it can report accurate results.

Comment by Ivan Hrasko [ 19/Apr/17 ]

easy data - POST empty container
complete data - POST list items into container

Comment by Ivan Hrasko [ 19/Apr/17 ]

Attachment time results.tar.gz has been added with description: result of time measure between Be-4 and master running the same data

Comment by Ivan Hrasko [ 19/Apr/17 ]

(In reply to Robert Varga from comment #7)
> Well, the reason for the regression is in RESTCONF. The regression was
> introduced in
> https://git.opendaylight.org/gerrit/#/c/30554/48/restconf/sal-rest-connector/
> src/main/java/org/opendaylight/netconf/sal/restconf/impl/BrokerFacade.
> java@320, as now BrokerFacade issues a read request for each individual item
> in the list.

-this code is for PACTH operation, we are dealing now with POST (postDataViaTransaction method)

Comment by Ivan Hrasko [ 19/Apr/17 ]

when batch size for command is used wall time difference between master and be-4 arises

command:
/usr/bin/time -v python cluster_rest_script.py --host localhost --port 8181 --threads 1 --itemcount 10000 --ipr 10000 --debug add

Comment by Ivan Hrasko [ 19/Apr/17 ]

Attachment compare has been added with description: compare be-4 and master when batch size is 10000

Comment by Robert Varga [ 19/Apr/17 ]

carbon: https://git.opendaylight.org/gerrit/55205
boron: https://git.opendaylight.org/gerrit/55211

This will need a proper follow-up to address overall code structure like:

  • eliminate blocking checks in loops, so the checks are done in parallel
  • make RestDocumentedException a checked exception
  • audit all code to make sure checks are done before modifications
  • ask controller/mdsal to add a create() method
Comment by Martin Ciglan [ 19/Apr/17 ]

will sit down with Ivan & Jakub to discuss follow-up notes

Comment by Vratko Polak [ 19/Apr/17 ]

Boron looks fixed [7]:
Duration: 0:00:02.544 (-0:01:18.522)

Carbon merged, waiting for test run.
Nitrogen [8] not merged just yet.

[7] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-only-boron/947/robot/controller-rest-cars-perf.txt/Add%20Cars/
[8] https://git.opendaylight.org/gerrit/55231

Comment by Vratko Polak [ 20/Apr/17 ]

> Nitrogen [8]

merged now.

Comment by Vratko Polak [ 21/Apr/17 ]

New Carbon performance [9] number: 0:00:02.314

[9] https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-rest-cars-perf-all-carbon/245/robot/controller-rest-cars-perf.txt/010%20Crud%20Mdsal%20Perf/Add%20Cars/

Comment by Ivan Hrasko [ 21/Apr/17 ]

Follow up bugs/improvements:

https://bugs.opendaylight.org/show_bug.cgi?id=8268
https://bugs.opendaylight.org/show_bug.cgi?id=8269
https://bugs.opendaylight.org/show_bug.cgi?id=8270
https://bugs.opendaylight.org/show_bug.cgi?id=8271
https://bugs.opendaylight.org/show_bug.cgi?id=8272

Comment by Ivan Hrasko [ 21/Apr/17 ]

Also performance issue: https://bugs.opendaylight.org/show_bug.cgi?id=6412

Generated at Wed Feb 07 20:14:48 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.