[CONTROLLER-1569] B and C: ds benchmark unstable for READ and DELETE operations in 3node cluster Created: 15/Dec/16  Updated: 25/Jul/23  Resolved: 08/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File b211_odl3.log, Text File b212_odl1.log
External issue ID: 7390

 Description   

1node ds benchmark jobs do not show instability:

https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-periodic-benchmark-all-boron/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-periodic-benchmark-all-carbon/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-periodic-benchmark-only-boron/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-1node-periodic-benchmark-only-carbon/plot/

and all operations (READ, PUT, MERGE, DELETE) perform stably according to the long-term trends.

3node ds benchmark jobs show that PUT and MERGE perform stably, but READ and DELETE do not:

https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-all-boron/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-all-carbon/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-only-boron/plot/
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-only-carbon/plot/

A closer look at e.g.
https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-only-carbon/plot/getPlot?index=1&width=750&height=1000
shows that the ds benchmark results with the FOL_ prefix (running on a follower node) are not that stable.



 Comments   
Comment by Peter Gubka [ 15/Dec/16 ]

In the past I wrote an email about this problem too:
https://www.mail-archive.com/controller-dev@lists.opendaylight.org/msg00461.html

Comment by Robert Varga [ 02/Feb/17 ]

Interesting fluctuations. It looks like we have a battle between leader and follower, visible from:

https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-3node-periodic-benchmark-only-carbon/plot/getPlot?index=4&width=1750&height=1450

Note the green median (CONFIG) and the wide disconnect between the blue and purple sets of lines. I think their average is slightly above the config line.

Is it possible that 'follower' and 'leader' are really talking to the shard leader?

Also, could we get the Y-axis to show operations per millisecond, i.e. 100K ops in 200ms == 500 ops/ms?

I think this needs some profiling to understand what is going on.

Comment by Peter Gubka [ 07/Mar/17 ]

(In reply to Robert Varga from comment #2)
> Interesting fluctuations. It looks like we have a battle between leader and
> follower, visible from:
>
> https://jenkins.opendaylight.org/releng/view/controller/job/controller-csit-
> 3node-periodic-benchmark-only-carbon/plot/
> getPlot?index=4&width=1750&height=1450
>
> Note green median (CONFIG) and the wide disconnect between blue and purple
> set of lines. I think their average is slightly above the config line.
>
> Is it possible that 'follower' and 'leader' are really talking to the shard
> leader?

From the log.html it can be seen that before the python benchmark script starts, the robot suite iterates over all 3 nodes and gets information about the shard (e.g. from jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore).
It then runs the python script against the leader's or a follower's IP address. There is no check during the test for leader movement.
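
For illustration, a minimal sketch of such a shard-leader check (the MBean name follows the jolokia URL quoted above; the node IPs, port 8181, and admin/admin credentials are assumptions based on a typical 3-node CSIT setup, not the actual job configuration):

    # Hypothetical sketch: ask each node's Jolokia endpoint which member
    # it currently sees as the default config shard leader. The IPs and
    # credentials below are assumptions.
    import requests

    NODES = {1: "10.0.0.1", 2: "10.0.0.2", 3: "10.0.0.3"}  # hypothetical IPs
    MBEAN = ("org.opendaylight.controller:Category=Shards,"
             "name=member-{n}-shard-default-config,"
             "type=DistributedConfigDatastore")

    def shard_leader(node, ip):
        url = "http://%s:8181/jolokia/read/%s" % (ip, MBEAN.format(n=node))
        resp = requests.get(url, auth=("admin", "admin"), timeout=10)
        # a Jolokia read reply carries the MBean attributes under "value"
        return resp.json()["value"]["Leader"]

    for node, ip in sorted(NODES.items()):
        print("member-%d sees leader: %s" % (node, shard_leader(node, ip)))

Running such a check before and after each measurement would show whether the leader moved mid-test.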

>
> Also, could we get the Y-axis show operations per millisecond, i.e. 100K ops
> in 200ms == 500ops/ms?

Is this a request to modify the job? In that case we would either lose the history in the present graphs, or the new lines would be plotted into the present graphs, which can be very ugly.

As csv files are stored for every jenkins run, I can create a post-processing script (if really needed) which would recount the values and create new csv files with the new Y-axis values. It would be run locally, and the graphs would also have to be created locally somehow.
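
A minimal sketch of such a post-processing script, assuming each csv has a header row with an operation count column and an execution time column (the column names "ops" and "execTimeMs" are hypothetical, not the actual job format):

    # Hypothetical sketch: recompute a per-build csv so that the plotted
    # value is operations per millisecond, e.g. 100000 ops in 200 ms
    # becomes 500 ops/ms.
    import csv
    import sys

    def recount(src, dst, ops_col="ops", ms_col="execTimeMs"):
        with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames + ["opsPerMs"])
            writer.writeheader()
            for row in reader:
                row["opsPerMs"] = float(row[ops_col]) / float(row[ms_col])
                writer.writerow(row)

    if __name__ == "__main__":
        recount(sys.argv[1], sys.argv[2])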

>
> I think this needs some profiling to understand what is going on.

Comment by Peter Gubka [ 29/Mar/17 ]

Attachment b211_odl3.log has been added with description: Measure_Leader_Operational_Txchain_Read log from #211

Comment by Peter Gubka [ 29/Mar/17 ]

Attachment b212_odl1.log has been added with description: Measure_Leader_Operational_Txchain_Read log from #212

Comment by Peter Gubka [ 29/Mar/17 ]

Since build #210 the job is back to the blue dot.

Now we have 3 builds with blue dots.
https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-periodic-benchmark-only-carbon/210/
https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-periodic-benchmark-only-carbon/211/
https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-periodic-benchmark-only-carbon/212/

The robot's log.html details show that each test case lasts 3-6 minutes. The exceptions are the READ tests, which either last the usual time or last up to 35 minutes.

It does not affect the follower tests only.

E.g. the test case durations of

Measure_Leader_Operational_Txchain_Read
#210 32:52
#211 2:51
#212 30:11

Measure_Follower_Operational_Simpletx_Read
#211 35:32
#212 5:00

Checking the logs on the leaders for Measure_Leader_Operational_Txchain_Read, I found that the good test case from #211 did not contain any messages like: FrontendClientMetadataBuilder | 206 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for aborted transaction member-1-datastore-config-fe-0-txn-127-0, ignoring

Logs attached.
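
A minimal sketch of how the attached logs could be scanned for that message (the file names match the attachments above; the message text is taken from the previous paragraph, everything else is an assumption):

    # Hypothetical sketch: count "Unknown history for aborted transaction"
    # messages per log to correlate them with the slow READ test cases.
    import re

    PATTERN = re.compile(r"Unknown history for aborted transaction (\S+)")

    for log in ("b211_odl3.log", "b212_odl1.log"):
        txns = []
        with open(log, errors="replace") as f:
            for line in f:
                match = PATTERN.search(line)
                if match:
                    txns.append(match.group(1))
        print("%s: %d occurrences" % (log, len(txns)))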

Comment by Tom Pantelis [ 30/Mar/17 ]

I'm not familiar with the tests you're running, but if the test times are fluctuating across jenkins runs, perhaps it's due to instability in the upstream build environment. I would suggest you run the tests multiple times on local systems and compare.

Comment by Peter Gubka [ 30/Mar/17 ]

(In reply to Tom Pantelis from comment #7)
> I'm not familiar with the tests you're running but if the test times are
> fluctuating on jenkins runs perhaps it's due to instability in the upstream
> build environment. I would suggest you run the tests multiple times on local
> systems and compare.

Only the READ tests have a problem. PUT, MERGE and DELETE look OK. I would prefer some hints on how to debug the READ problem rather than reproducing all the tests locally.

Comment by Tom Pantelis [ 30/Mar/17 ]

Perhaps the reads are being performed remotely when they're intended to be local, or the node that is selected could be either a follower or a leader? Remote reads from a follower will be significantly slower than local reads. Also, the size of the data could be significant for remote reads, as it needs to be serialized/deserialized and transported over the wire. Remote reads are largely dependent on the speed of the network.
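
A minimal sketch of such a comparison, timing the same read against the leader and a follower over RESTCONF (the IPs, credentials, and sample path are assumptions; the real benchmark drives dsbenchmark RPCs instead of plain reads):

    # Hypothetical sketch: issue the same read against the leader and a
    # follower and compare medians; if follower reads are forwarded to
    # the shard leader, their latency should be visibly higher.
    import statistics
    import time
    import requests

    URL = "http://%s:8181/restconf/config/network-topology:network-topology"
    NODES = {"leader": "10.0.0.1", "follower": "10.0.0.2"}  # hypothetical

    for role, ip in NODES.items():
        samples_ms = []
        for _ in range(20):
            start = time.monotonic()
            requests.get(URL % ip, auth=("admin", "admin"), timeout=30)
            samples_ms.append((time.monotonic() - start) * 1000.0)
        print("%s: median %.1f ms" % (role, statistics.median(samples_ms)))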

(In reply to Peter Gubka from comment #8)
> (In reply to Tom Pantelis from comment #7)
> > I'm not familiar with the tests you're running but if the test times are
> > fluctuating on jenkins runs perhaps it's due to instability in the upstream
> > build environment. I would suggest you run the tests multiple times on local
> > systems and compare.
>
> Only READ tests have problem. PUT, MERGE and DELETE look ok. I would prefer
> some hints how to debug the READ problem than reproducing all tests locally.

Comment by Peter Gubka [ 19/Apr/17 ]

Checking https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-periodic-benchmark-only-carbon/229/archives/
there was no leader change during the test. All leader changes happen before the suite starts. The same holds for jobs #225 and #226.

Comment by Robert Varga [ 26/Apr/17 ]

I have audited dsbenchmark, and the fact is that its results with the operational store are irrelevant, due to an inconsistency between the setup (which writes into the operational store) and the actual execution (which is hard-wired to CONFIG).

carbon: https://git.opendaylight.org/gerrit/56108

We'll need to see what the results will look like.
