[BGPCEP-258] BGP Scale tests with >= 10k prefixes fail following Transaction chain failure Created: 22/Jul/15  Updated: 03/Mar/19  Resolved: 27/Jul/15

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: RichardHill Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File opendaylight.tar.xz    
External issue ID: 4039

 Description   

During BGP scale testing of the Helium RC4 release candidate, we were unable to use the play.py tool to push 10k or more prefixes into the test artifact and obtain a result.

    1. Test setup
        CPU: 4 vCPUs for Linux, no restriction
        RAM: 16G
        Java version: "1.7.0_67"; Java(TM) SE Runtime Environment (build 1.7.0_67-b01); Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)
        Garbage collection: default
        Xmx: 2G
        permGen: 0.5G
        Xms: 128MB
        Setup: 1 node
        Clustering: No
        Replication: No
        Persistence: No
        DS: IMDS (in-memory data store)
        Installed features: odl's default

    2. Test description
        10k BGP paths were pushed into ODL using the play.py tool; the RIB RESTCONF URL was then polled until all 10k prefixes appeared.
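A minimal sketch of that polling step, assuming a local controller with default RESTCONF credentials and an "example-bgp-rib" rib-id (the URL, credentials and rib-id are assumptions here; the real suite drives this through play.py and its own tooling):

— snip —
#!/usr/bin/env python
# Hypothetical polling loop: fetch the operational RIB over RESTCONF and wait
# until the expected number of prefixes shows up, or give up after a timeout.
import time
import requests

RIB_URL = ("http://127.0.0.1:8181/restconf/operational/"
           "bgp-rib:bgp-rib/rib/example-bgp-rib")   # assumed rib-id and port
EXPECTED_PREFIXES = 10000
TIMEOUT_S = 600

def count_prefixes(node):
    """Count 'prefix' leafs anywhere in the returned RIB document."""
    if isinstance(node, dict):
        return ("prefix" in node) + sum(count_prefixes(v) for v in node.values())
    if isinstance(node, list):
        return sum(count_prefixes(item) for item in node)
    return 0

deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    resp = requests.get(RIB_URL, auth=("admin", "admin"))
    if resp.ok and count_prefixes(resp.json()) >= EXPECTED_PREFIXES:
        print("All %d prefixes are present in the RIB" % EXPECTED_PREFIXES)
        break
    time.sleep(5)
else:
    raise SystemExit("Timed out waiting for the prefixes to appear in the RIB")
— snip —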

Detailed description of the tests

There are 8 tests in the suite:

TEST CASE 1: Waiting for RIB (no idle periods)
TEST CASE 2: Waiting for RIB (with idle periods)
TEST CONTROLLER-1: Count of prefixes in RIB matches
TEST CONTROLLER-2: Waiting for TOPOLOGY (no idle periods)
TEST CONTROLLER-3: Waiting for TOPOLOGY (with idle periods)
TEST CONTROLLER-4: Count of prefixes in RIB matches

TEST CONTROLLER-5: Connection Alive
TEST CONTROLLER-6: Connection Would Stay Alive

1. The test suite starts with the test case “Waiting for RIB (no idle periods)”. This test case watches the CPU load and the “uptodate” attribute. When the CPU load drops, it stops waiting, checks “uptodate” and fails if it is still “false”, which indicates a bottleneck. This test case is marked “auxiliary” because its failure does not mean the functionality is broken; it may just mean there is some lock contention or another bottleneck in the code.

2. The test suite then continues with the test case “Waiting for RIB (with idle periods)”. This one watches only the “uptodate” attribute; it passes when it sees “uptodate:true” or fails on timeout (a sketch of this check follows below). This test case is marked “critical” because if it fails, it most likely means either the performance is unacceptable or the RIB failed somehow. If the previous test case passed, this test case should complete in less than 1 second; otherwise, the time this test case took is added to the time the previous test case took to determine the processing time.

3. The test then continues with the test case “Count of prefixes in RIB matches”. This one downloads the entire RIB and counts the prefixes returned. The count must match what was pushed, otherwise the test FAILs; as a bonus, when it fails it emits the count of prefixes actually found in the RIB. This test case is run at this point to verify that the connection survives even when someone downloads the full RIB.
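A hedged sketch of the “uptodate” check used in test case 2; the table URL and JSON layout below assume the example-bgp-rib loc-rib IPv4 unicast table and may differ from what the real suite addresses:

— snip —
# Hypothetical "uptodate" watcher: pass as soon as the RIB reports
# uptodate:true, fail on timeout. URL and response layout are assumptions.
import time
import requests

TABLE_URL = ("http://127.0.0.1:8181/restconf/operational/bgp-rib:bgp-rib/"
             "rib/example-bgp-rib/loc-rib/tables/bgp-types:ipv4-address-family/"
             "bgp-types:unicast-subsequent-address-family")

def wait_for_uptodate(timeout_s=300, poll_s=5):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(TABLE_URL, auth=("admin", "admin"))
        # String match keeps the sketch independent of the exact JSON nesting.
        if resp.ok and '"uptodate":true' in resp.text.replace(" ", ""):
            return True
        time.sleep(poll_s)
    return False

if not wait_for_uptodate():
    raise SystemExit("FAIL: RIB did not report uptodate:true before the timeout")
— snip —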

These tests are then repackaged to inspect both the RIB and the TOPOLOGY.

The failure occurs during the RIB tests:

2015-07-22 11:39:50,322 | ERROR | oupCloseable-4-3 | DOMDataCommitCoordinatorImpl | 165 - org.opendaylight.controller.sal-broker-impl - 1.1.4.Helium-SR4_0 | The commit executor's queue is full - submit task was rejected.
DeadlockDetectingListeningExecutorService{delegate=FastThreadPoolExecutor{Thread Prefix=WriteTxCommit, Current Thread Pool Size=1, Largest Thread Pool Size=1, Max Thread Pool Size=1, Current Queue Size=4999, Largest Queue Size=5000, Max Queue Size=5000, Active Thread Count=0, Completed Task Count=1367, Total Task Count=6366}}
java.util.concurrent.RejectedExecutionException: Task org.opendaylight.yangtools.util.concurrent.AsyncNotifyingListenableFutureTask$DelegatingAsyncNotifyingListenableFutureTask@51d27c8d rejected from FastThreadPoolExecutor

{Thread Prefix=WriteTxCommit, Current Thread Pool Size=1, Largest Thread Pool Size=1, Max Thread Pool Size=1, Current Queue Size=5000, Largest Queue Size=5000, Max Queue Size=5000, Active Thread Count=1, Completed Task Count=1366, Total Task Count=6367}
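For context, a conceptual sketch (in Python, not the controller's actual Java executor) of why the rejection above happens: a single committer thread drains a bounded queue of 5000 entries, and once producers outrun it, further non-blocking submissions are refused, which is the analogue of the RejectedExecutionException in the log. All names and timings here are illustrative assumptions:

— snip —
import queue
import threading
import time

commit_queue = queue.Queue(maxsize=5000)   # mirrors Max Queue Size=5000

def committer():
    # Single consumer, like the WriteTxCommit pool with Max Thread Pool Size=1.
    while True:
        commit_queue.get()
        time.sleep(0.001)                  # each commit takes some time
        commit_queue.task_done()

threading.Thread(target=committer, daemon=True).start()

rejected = 0
for i in range(20000):                     # far more commits than the queue can absorb
    try:
        commit_queue.put_nowait(i)         # non-blocking submit, like the executor
    except queue.Full:
        rejected += 1                      # analogue of RejectedExecutionException

print("rejected submissions:", rejected)
— snip —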

 Comments   
Comment by RichardHill [ 22/Jul/15 ]

Attachment opendaylight.tar.xz has been added with description: ODL artifact, logs and config

Comment by RichardHill [ 22/Jul/15 ]

yourkitsnapshot available at

https://drive.google.com/file/d/0ByXiyf4iY7RYSHJWTXhNOU0zZ2c/view?usp=sharing

Comment by Milos Fabian [ 23/Jul/15 ]

Looks like https://bugs.opendaylight.org/show_bug.cgi?id=2255

Comment by RichardHill [ 23/Jul/15 ]

yeah looks a lot like it. Will retest with ping pong DB.

Comment by Jozef Behran [ 23/Jul/15 ]

Further testing revealed that this bug is newly introduced but indeed very similar to 2255.

A new queue was added to the transaction path, but the fix implemented for CONTROLLER-957 was not applied to it. I have no idea where that queue is configured or what its name is, so I cannot test increasing its capacity. I do know it is a new queue, though, because when I increased the capacity of the two queues I know about (“inmemory-config-datastore-provider” and “inmemory-operational-datastore-provider”) to 65000, the log messages produced by this bug still showed the capacity of the overflowing queue as 5000.

Comment by Milos Fabian [ 24/Jul/15 ]

(In reply to Jozef Behran from comment #4)

No, there are no new config parameters for the in-memory data store:
https://github.com/opendaylight/controller/blob/stable/helium/opendaylight/md-sal/sal-inmemory-datastore/src/main/yang/opendaylight-inmemory-datastore-provider.yang

You need to increase the value of "max-data-store-executor-queue-size" (5000 by default).

Comment by Jozef Behran [ 24/Jul/15 ]

That's exactly what I did, for both "inmemory-config-datastore-provider" and "inmemory-operational-datastore-provider". However, the reported exception suggests it had no effect; I can still see "limit=5000" instead of "limit=65000" in its message.

Here is the relevant section of the configuration file:

— snip —
<module>
  <type xmlns:prefix="urn:opendaylight:params:xml:ns:yang:controller:inmemory-datastore-provider">prefix:inmemory-config-datastore-provider</type>
  <name>config-store-service</name>
  <inmemory-config-datastore-provider xmlns="urn:opendaylight:params:xml:ns:yang:controller:inmemory-datastore-provider">
    <schema-service>
      <type xmlns:dom="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom">dom:schema-service</type>
      <name>yang-schema-service</name>
    </schema-service>
    <max-data-store-executor-queue-size>65000</max-data-store-executor-queue-size>
  </inmemory-config-datastore-provider>
</module>
<module>
  <type xmlns:prefix="urn:opendaylight:params:xml:ns:yang:controller:inmemory-datastore-provider">prefix:inmemory-operational-datastore-provider</type>
  <name>operational-store-service</name>
  <inmemory-operational-datastore-provider xmlns="urn:opendaylight:params:xml:ns:yang:controller:inmemory-datastore-provider">
    <schema-service>
      <type xmlns:dom="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom">dom:schema-service</type>
      <name>yang-schema-service</name>
    </schema-service>
    <max-data-store-executor-queue-size>65000</max-data-store-executor-queue-size>
  </inmemory-operational-datastore-provider>
</module>
— snip —

The .yang file suggests these configuration elements are placed correctly here. Does anybody have any idea what is or might be wrong here (or elsewhere)?

Comment by Milos Fabian [ 24/Jul/15 ]

There is another "queue size" configuration parameter in "dom-in-memory-data-broker" named "max-data-broker-commit-queue-size"
https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/sal-dom-broker/src/main/yang/opendaylight-dom-broker-impl.yang;h=fa6d4961939b6f5bcbf4f5eb50d042ea9aa86556;hb=refs/heads/stable/helium

Snippet example (01-mdsal.xml):

<module>
  <type xmlns:prefix="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:impl">prefix:dom-inmemory-data-broker</type>
  <name>inmemory-data-broker</name>

  <schema-service>
    <type xmlns:dom="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom">dom:schema-service</type>
    <name>yang-schema-service</name>
  </schema-service>

  <config-data-store>
    <type xmlns:config-dom-store-spi="urn:opendaylight:params:xml:ns:yang:controller:md:sal:core:spi:config-dom-store">config-dom-store-spi:config-dom-datastore</type>
    <name>config-store-service</name>
  </config-data-store>

  <operational-data-store>
    <type xmlns:operational-dom-store-spi="urn:opendaylight:params:xml:ns:yang:controller:md:sal:core:spi:operational-dom-store">operational-dom-store-spi:operational-dom-datastore</type>
    <name>operational-store-service</name>
  </operational-data-store>

  <max-data-broker-commit-queue-size xmlns="urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:impl">50000</max-data-broker-commit-queue-size>
</module>

Comment by RichardHill [ 27/Jul/15 ]

New tests with the configuration edited as described show that Helium scales to above 1M prefixes when only the RIB is updated, but to only 5k prefixes when the RIB and Topology are both updated.

Test Setup

Cluster size 1
RAM 16G
Heap 6553M
permGen 0.5G
CPUs 4
Garbage Collection Default
Karaf Features odl-bgpcep-all, odl-restconf-noauth
ODL build Helium SR4

Java version:
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

Results

                Total  Failed  Passed  Pass %
Critical tests      5       0       5    100%
All tests           6       0       6    100%

Comment by RichardHill [ 27/Jul/15 ]

Fixed

1M routes on IMDS, RIB only, 16G RAM, 6553M HEAP: 3m 58s

5k routes on IMDS, RIB+Topology, 16G RAM, 6553M HEAP: 1m 24s + 5m 26s

We can process 1M routes with RIB only, but only 5k routes when using RIB and topology. To process more than 5k we expect we need to use the ping-pong data broker.

This is no longer available in Helium because of incompatibilities.
