[CONTROLLER-1569] B and C: ds benchmark unstable for READ and DELETE operation in 3node cluster Created: 15/Dec/16 Updated: 25/Jul/23 Resolved: 08/May/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | b211_odl3.log, b212_odl1.log |
| External issue ID: | 7390 |
| Description |
| Comments |
| Comment by Peter Gubka [ 15/Dec/16 ] |
|
In the past I wrote an email about this problem too. |
| Comment by Robert Varga [ 02/Feb/17 ] |
|
Interesting fluctuations. It looks like we have a battle between leader and follower: note the green median (CONFIG) and the wide disconnect between the blue and purple sets of lines. I think their average is slightly above the config line. Is it possible that both 'follower' and 'leader' are really talking to the shard leader? Also, could we get the Y-axis to show operations per millisecond, i.e. 100K ops in 200 ms == 500 ops/ms? I think this needs some profiling to understand what is going on. |
| Comment by Peter Gubka [ 07/Mar/17 ] |
|
(In reply to Robert Varga from comment #2) From the log.html it can be seen that before the Python benchmark script starts, the robot iterates over all 3 nodes and gets information about the shard (e.g. from jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore). Is this a request to modify the job? In that case we will either lose the history in the present graphs, or the new lines will be plotted into the present graphs, which can be very ugly. As CSV files are stored for every Jenkins run, I can create a post-processing script (if really needed) which will recompute the values and create new CSV files with the new Y-axis units. It would be run locally and the graphs would also have to be created locally. |
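A minimal sketch of such a post-processing step, assuming hypothetical column names 'ops' and 'duration_ms' (the real layout of the archived CSV files may differ), converting each row to the ops/ms figure requested above:

```python
#!/usr/bin/env python
"""Recompute benchmark CSV rates as operations per millisecond (sketch)."""
import csv
import sys


def convert(in_path, out_path):
    # Column names 'ops' and 'duration_ms' are hypothetical; adjust them
    # to the actual per-run CSV layout from the Jenkins archives.
    with open(in_path) as src, open(out_path, "w") as dst:
        reader = csv.DictReader(src)
        fields = list(reader.fieldnames) + ["ops_per_ms"]
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            ops = float(row["ops"])
            duration_ms = float(row["duration_ms"])
            # e.g. 100000 ops in 200 ms -> 500 ops/ms
            row["ops_per_ms"] = ops / duration_ms if duration_ms else 0.0
            writer.writerow(row)


if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])
```

This would be run locally against CSVs downloaded from the job archives, leaving the existing job and its graphs untouched.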
| Comment by Peter Gubka [ 29/Mar/17 ] |
|
Attachment b211_odl3.log has been added with description: Measure_Leader_Operational_Txchain_Read log from #211 |
| Comment by Peter Gubka [ 29/Mar/17 ] |
|
Attachment b212_odl1.log has been added with description: Measure_Leader_Operational_Txchain_Read log from #212 |
| Comment by Peter Gubka [ 29/Mar/17 ] |
|
Since build #210 the job is back to the blue dot. Now we have 3 builds with blue dots. The robot's log.html details show that each test case lasts 3-6 minutes. The exceptions are the READ tests, which either last the usual time or last up to 35 minutes. It does not affect only the follower's tests; see e.g. the test case durations of Measure_Leader_Operational_Txchain_Read and Measure_Follower_Operational_Simpletx_Read. Checking the logs on the leader for Measure_Leader_Operational_Txchain_Read, I found that a good test case from #211 did not contain any messages like: FrontendClientMetadataBuilder | 206 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for aborted transaction member-1-datastore-config-fe-0-txn-127-0, ignoring. Logs attached. |
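To compare a good run against a bad one, a small sketch that counts the "Unknown history for aborted transaction" messages in the attached logs (file names taken from the attachments above) could look like this:

```python
#!/usr/bin/env python
"""Count 'Unknown history for aborted transaction' messages per log file."""
import sys

# Message reported in the leader logs for the problematic READ runs.
PATTERN = "Unknown history for aborted transaction"

for path in sys.argv[1:]:  # e.g. b211_odl3.log b212_odl1.log
    with open(path) as logfile:
        hits = sum(1 for line in logfile if PATTERN in line)
    print("%s: %d occurrence(s)" % (path, hits))
```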
| Comment by Tom Pantelis [ 30/Mar/17 ] |
|
I'm not familiar with the tests you're running, but if the test times are fluctuating across Jenkins runs, perhaps it's due to instability in the upstream build environment. I would suggest you run the tests multiple times on local systems and compare. |
| Comment by Peter Gubka [ 30/Mar/17 ] |
|
(In reply to Tom Pantelis from comment #7) Only the READ tests have the problem; PUT, MERGE and DELETE look OK. I would prefer some hints on how to debug the READ problem rather than reproducing all the tests locally. |
| Comment by Tom Pantelis [ 30/Mar/17 ] |
|
(In reply to Peter Gubka from comment #8) Perhaps the reads are being performed remotely when they're intended to be local, or the node that is selected could be either a follower or a leader? A remote read from a follower will be significantly slower than a local read. The size of the data could also be significant for remote reads, as it needs to be serialized/deserialized and transported over the wire; remote reads are largely dependent on the speed of the network. |
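One rough way to probe that hypothesis is to time the same read against each cluster member over RESTCONF and compare: a follower that has to fetch the data from the shard leader should show noticeably higher latency. A sketch with assumed member IPs, credentials and a placeholder data path (none of these come from the job configuration):

```python
#!/usr/bin/env python
"""Time the same RESTCONF read against each cluster member (sketch)."""
import time

import requests  # third-party; pip install requests

# Placeholder member IPs and credentials -- substitute the cluster's own.
MEMBERS = ["10.30.170.1", "10.30.170.2", "10.30.170.3"]
AUTH = ("admin", "admin")
# Placeholder data path; point it at whatever the benchmark actually writes.
PATH = "restconf/config/network-topology:network-topology"

for ip in MEMBERS:
    url = "http://%s:8181/%s" % (ip, PATH)
    start = time.time()
    resp = requests.get(url, auth=AUTH)
    elapsed_ms = (time.time() - start) * 1000.0
    print("%s -> HTTP %d in %.1f ms (%d bytes)"
          % (ip, resp.status_code, elapsed_ms, len(resp.content)))
```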
| Comment by Peter Gubka [ 19/Apr/17 ] |
|
Checking https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-periodic-benchmark-only-carbon/229/archives/ |
| Comment by Robert Varga [ 26/Apr/17 ] |
|
I have audited dsbenchmark and the fact is that its results against the operational store are irrelevant due to an inconsistency between the setup (which writes into the operational store) and the actual execution (which is hard-wired to CONFIG). carbon: https://git.opendaylight.org/gerrit/56108 We'll need to see what the results look like. |
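A simple way to observe the mismatch described above is to query both datastores for the benchmark data and check which one is actually populated; a sketch with a placeholder host and a hypothetical data path (the real container name depends on the dsbenchmark model):

```python
#!/usr/bin/env python
"""Check whether the benchmark data landed in CONFIG or OPERATIONAL (sketch)."""
import requests  # third-party; pip install requests

HOST = "10.30.170.1"                 # placeholder controller IP
AUTH = ("admin", "admin")            # placeholder credentials
DATA_PATH = "dsbenchmark:test-exec"  # hypothetical container name

for store in ("config", "operational"):
    url = "http://%s:8181/restconf/%s/%s" % (HOST, store, DATA_PATH)
    resp = requests.get(url, auth=AUTH)
    print("%s store: HTTP %d, %d bytes"
          % (store, resp.status_code, len(resp.content)))
```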