[BGPCEP-258] BGP Scale tests with >= 10k prefixes fail following Transaction chain failure Created: 22/Jul/15 Updated: 03/Mar/19 Resolved: 27/Jul/15 |
|
| Status: | Resolved |
| Project: | bgpcep |
| Component/s: | BGP |
| Affects Version/s: | Bugzilla Migration |
| Fix Version/s: | Bugzilla Migration |
| Type: | Bug | ||
| Reporter: | RichardHill | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: |
|
| External issue ID: | 4039 |
| Description |
|
During BGP scale testing of the RC4 Helium release candidate we were unable to use the play.py tool to push more than 10k prefixes into the test artifact and obtain a result.
java -version
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
Xmx 2G
Detailed description of the tests

There are 8 tests in the suite:

1. The suite starts with test case "Waiting for RIB (no idle periods)". This test case watches the CPU load and the "uptodate" attribute. When the CPU load drops, it stops waiting, checks "uptodate", and fails if it is still "false", meaning there is a bottleneck. This test case is marked "auxiliary" because its failure does not mean the functionality is broken; it may only mean there is some lock contention or another bottleneck in the code.

2. The suite then continues with test case "Waiting for RIB (with idle periods)". This one watches only the "uptodate" attribute. It passes when it sees "uptodate:true" or fails on timeout. This test case is marked "critical" because if it fails, it most likely means either the performance is unacceptable or the RIB failed somehow. If the previous test case passed, this one should complete in less than 1 second; otherwise, its duration should be added to that of the previous test case to determine the total processing time.

3. The suite then continues with test case "Count of prefixes in RIB matches". This one downloads the entire RIB and counts the prefixes returned. The count must match what was pushed, otherwise the test fails; as a bonus, on failure it reports the count of prefixes actually found in the RIB. This test case also verifies that the connection survives someone downloading the full RIB.

These tests have been repackaged to inspect the RIB + TOPOLOGY.

The failure occurs during the RIB tests:

2015-07-22 11:39:50,322 | ERROR | oupCloseable-4-3 | DOMDataCommitCoordinatorImpl | 165 - org.opendaylight.controller.sal-broker-impl - 1.1.4.Helium-SR4_0 | The commit executor's queue is full - submit task was rejected. |
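The two critical checks above (test cases 2 and 3) can be sketched in Python as pure functions over a parsed operational-RIB document. This is a minimal illustration only: the JSON layout used here (a top-level "loc-rib" with "uptodate" and a flat "routes" list) is an assumption for the sake of the example, and the real Helium RIB returned by RESTCONF nests its tables differently.

```python
import json

def rib_uptodate(rib):
    """Test case 2: pass as soon as the RIB reports uptodate:true.
    The path to the leaf is assumed, not Helium's exact layout."""
    return bool(rib.get("loc-rib", {}).get("uptodate", False))

def count_prefixes(rib):
    """Test case 3: count the prefixes actually present in the RIB."""
    return len(rib.get("loc-rib", {}).get("routes", []))

def check_rib(rib, expected):
    """Combine both checks; return (ok, found_count) so that a failing
    run can still report how many prefixes were found."""
    found = count_prefixes(rib)
    return rib_uptodate(rib) and found == expected, found

# Toy RIB document standing in for a RESTCONF GET response:
sample = json.loads(
    '{"loc-rib": {"uptodate": true,'
    ' "routes": [{"prefix": "10.0.0.0/24"}, {"prefix": "10.0.1.0/24"}]}}'
)
ok, found = check_rib(sample, expected=2)
```

In the real suite, `check_rib` would be called in a polling loop against the controller's operational datastore until "uptodate" flips to true or a timeout expires.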
| Comments |
| Comment by RichardHill [ 22/Jul/15 ] |
|
Attachment opendaylight.tar.xz has been added with description: ODL artifact, logs and config |
| Comment by RichardHill [ 22/Jul/15 ] |
|
yourkitsnapshot available at https://drive.google.com/file/d/0ByXiyf4iY7RYSHJWTXhNOU0zZ2c/view?usp=sharing |
| Comment by Milos Fabian [ 23/Jul/15 ] |
|
Looks like https://bugs.opendaylight.org/show_bug.cgi?id=2255 |
| Comment by RichardHill [ 23/Jul/15 ] |
|
Yeah, looks a lot like it. Will retest with the ping-pong DB. |
| Comment by Jozef Behran [ 23/Jul/15 ] |
|
Further testing revealed that this bug is newly introduced but indeed very similar to 2255. A new queue was added into the transaction path but the fix as implemented on the |
| Comment by Milos Fabian [ 24/Jul/15 ] |
|
(In reply to Jozef Behran from comment #4) No, there are no new config parameters for the in-memory data store. You need to increase the value of "max-data-store-executor-queue-size" (5000 by default). |
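For reference, raising that parameter would look roughly like the following fragment in the in-memory datastore provider module. This is a sketch only: the namespace and surrounding module layout follow the Helium config subsystem's inmemory-datastore-provider convention as far as known, and may differ in a given distribution.

```xml
<module>
  <name>operational-store-service</name>
  <max-data-store-executor-queue-size
      xmlns="urn:opendaylight:params:xml:ns:yang:controller:inmemory-datastore-provider">
    65000
  </max-data-store-executor-queue-size>
</module>
```

The same element would be set in both the "inmemory-config-datastore-provider" and "inmemory-operational-datastore-provider" modules.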
| Comment by Jozef Behran [ 24/Jul/15 ] |
|
That's exactly what I did, for both "inmemory-config-datastore-provider" and "inmemory-operational-datastore-provider". However, the reported exception suggests it had no effect: I can still see "limit=5000" instead of "limit=65000" in its message. Here is the relevant section of the configuration file: — snip — The .yang file suggests these configuration elements are placed correctly here. Does anybody have any idea what is or might be wrong here (or elsewhere)? |
| Comment by Milos Fabian [ 24/Jul/15 ] |
|
There is another "queue size" configuration parameter in "dom-in-memory-data-broker", named "max-data-broker-commit-queue-size". Snippet example (01-mdsal.xml):

<module>
  <schema-service>
  <config-data-store>
  <operational-data-store>
  <max-data-broker-commit-queue-size xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:impl">50000</max-data-broker-commit-queue-size> |
| Comment by RichardHill [ 27/Jul/15 ] |
|
New tests with the configuration edited as described show that Helium scales to above 1M prefixes when only the RIB is updated, but only to 5k prefixes when both RIB and Topology are updated.

Test Setup
Cluster size: 1
Java version: java version "1.7.0_67"

Results
Total | Failed | Passed | Pass % |
| Comment by RichardHill [ 27/Jul/15 ] |
|
Fixed.

1M routes on IMDS, RIB only, 16G RAM, 6553M heap: 3m 58s
5k routes on IMDS, RIB+Topology, 16G RAM, 6553M heap: 1m 24s + 5m 26s

We can process 1M routes RIB-only, but only 5k routes when using both RIB and topology. To process more than 5k we expect we would need to use the ping-pong data broker, which is no longer available in Helium because of incompatibilities. |