[CONTROLLER-1715] 6 GB heap is not entirely enough for BGP ingest test with 1 million prefixes when tell-based protocol is used Created: 09/Jun/17  Updated: 25/Jul/23  Resolved: 02/Jul/21

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 4.0.0

Type: Bug
Reporter: Vratko Polak Assignee: Oleksii Mozghovyi
Resolution: Cannot Reproduce Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8649

 Description   

This failure started happening consistently in a single-node BGP test, but one that uses the tell-based protocol, so it may be related to cluster testing.

The BGP job runs several scenarios. For this bug I compare 4 of them, differing in which side initiates the BGP connection and in whether the ask-based or tell-based protocol is used.

The bug manifests as ODL being unable to clear the ipv4 topology [0] after the connection has been closed. This does not happen if ODL is not the side initiating the connection, perhaps because that scenario runs first in the suite and memory pressure is lower. It also does not happen with the exact same suite running against ODL with the ask-based protocol (in both cases it is the first suite after an initial start or hard reboot).

The title was chosen to reflect the only difference I have recognized in karaf.log [1]:
2017-06-09 06:33:50,808 | INFO | lt-dispatcher-20 | EmptyLocalActorRef | 155 - com.typesafe.akka.slf4j - 2.4.18 | Message [org.opendaylight.controller.cluster.datastore.messages.DataTreeChanged] from Actor[akka://opendaylight-cluster-data/deadLetters] to Actor[akka://opendaylight-cluster-data/user/$M] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

Though subsequent messages suggest that some progress is still being made somewhere (those messages are also present in the preceding, passing scenario):
2017-06-09 06:33:51,408 | WARN | lt-dispatcher-41 | FrontendClientMetadataBuilder | 180 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.Carbon | member-1-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-31-1, ignoring
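
For reference, the dead-letters message above names two Akka settings that live in akka.conf. A minimal sketch of adjusting them (this only quiets the logging, it does not address the undelivered messages themselves):

akka {
  # illustrative values; turning the logging off does not prevent dead letters
  log-dead-letters = off
  log-dead-letters-during-shutdown = off
}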

[0] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/306/log.html.gz#s1-s9-t17
[1] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/306/odl1_karaf.log.gz

 

 

[0] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-silicon/216/robot-plugin/log.html.gz#s1-s9-t3-k2-k1

[1] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/bgpcep-csit-1node-bgp-ingest-all-silicon/216/robot-plugin/log.html.gz#s1-s9-t5-k2-k1-k1-k1-k2

[2] https://github.com/opendaylight/integration-test/blob/master/csit/suites/bgpcep/bgpingest/singlepeer_prefixcount.robot#L51



 Comments   
Comment by Tom Pantelis [ 09/Jun/17 ]

That indicates the DTCN generator actor's mailbox reached the limit. If it only happens with the tell-based protocol, perhaps it's not throttling enough compared to the ask-based one.

You can try increasing the mailbox capacity in the akka.conf. Default settings are:

bounded-mailbox {
  mailbox-capacity = 5000
  mailbox-push-timeout-time = 10ms
}
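
As an illustration only (the value is not a tested recommendation), the capacity could be raised in the same block:

bounded-mailbox {
  # illustrative value; the right size would need to be measured
  mailbox-capacity = 50000
  mailbox-push-timeout-time = 10ms
}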

Perhaps we shouldn't use a bounded mailbox for DTCN generation, especially since the DTCL notification actor mailboxes are unbounded; using a bounded mailbox for the former doesn't really make sense.
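
A hedged sketch of what an unbounded mailbox definition looks like in Akka configuration (the block name is hypothetical, and wiring it to the DTCN generator actor is a separate code change not shown here):

# hypothetical mailbox definition; an actor would still have to be
# configured or created with this mailbox for it to take effect
dtcn-unbounded-mailbox {
  mailbox-type = "akka.dispatch.UnboundedMailbox"
}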

Comment by Vratko Polak [ 12/Jun/17 ]

This still occurs in every run.

One suspicious thing I have not mentioned is that reads from ipv4-example-topology take a significantly long time, for example 7 minutes [2], while the previous steps are able to reach stability in around two minutes.

[2] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/309/log.html.gz#s1-s9-t17-k2-k3-k7-k5-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k1
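
For context, a minimal Python sketch of the kind of operational-datastore read the suite performs. The URL, topology name, credentials and the crude prefix counting are assumptions for illustration, not taken from the test source:

# Hedged sketch: read the operational ipv4 topology over RESTCONF and
# roughly gauge its size. Endpoint layout and credentials are assumptions.
import requests

resp = requests.get(
    "http://127.0.0.1:8181/restconf/operational/"
    "network-topology:network-topology/topology/example-ipv4-topology",
    auth=("admin", "admin"),
    timeout=600,  # reads were observed to take minutes under GC pressure
)
resp.raise_for_status()
# Rough proxy for the number of announced prefixes; the actual suite logic differs.
print(resp.text.count('"prefix"'))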

Comment by Vratko Polak [ 12/Jun/17 ]

> significantly long time

That is because garbage collection was busy during that time; see [3].
But it is not clear what is eating the memory: controller code or BGP code.

[3] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/309/gclogs-1/gc_1497252896.63.log.gz
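
To quantify "busy", one can tally pause times from the GC log [3]. A rough sketch, assuming the JDK 8 verbose GC format where events end with "[Times: user=... sys=..., real=<secs> secs]" (the exact format depends on the GC flags the job uses):

# Hedged sketch: sum the wall-clock ("real") time reported per GC event.
import re
import sys

real_re = re.compile(r"real=(\d+\.\d+) secs")
total = 0.0
events = 0
with open(sys.argv[1]) as log:
    for line in log:
        match = real_re.search(line)
        if match:
            total += float(match.group(1))
            events += 1

print(f"{events} GC events, {total:.1f} s total real time")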

Comment by Vratko Polak [ 21/Jun/17 ]

The test passed [4] today, even though the garbage collection log [5] still shows significant slowdown during its run.

[4] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/317/log.html.gz#s1-s9
[5] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/317/gclogs-1/gc_1498031207.07.log.gz

Comment by Vratko Polak [ 06/Jul/17 ]

Current runs around Carbon SR1 branch freeze show the suite does not fail [6], but garbage collection [7] is still slowing the test down.

[6] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/334/robot/bgpcep-bgp-ingest.txt/Singlepeer%20Prefixcount_1/
[7] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/334/gclogs-1/gc_1499325254.08.log.gz

Comment by Vratko Polak [ 12/Jul/17 ]

> Current runs around Carbon SR1 branch freeze show the suite does not fail

But not 100% of the time; sometimes the test still fails [8].

[8] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-1node-periodic-bgp-ingest-only-carbon/339/log.html.gz#s1-s11-t17-k2-k3-k7-k5-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k1

Comment by Vratko Polak [ 12/Sep/17 ]

Changed title to "6 GB heap is not entirely enough for BGP ingest test with 1 million prefixes when tell-based protocol is used".

Comment by Oleksii Mozghovyi [ 08/Apr/21 ]

I've increased the number of prefixes to 1,000,000 for the following tests:
manypeers_prefixcount.robot
singlepeer_changecount.robot
singlepeer_prefixcount.robot
Everything was fine, except for some unstable behavior of the play.py script on the tools VM. The problem might be related to environment issues in the sandbox.
