[CONTROLLER-1672] Restconf slow to respond when the member is under load Created: 12/May/17  Updated: 25/Jul/23  Resolved: 02/Jul/21

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 4.0.0

Type: Bug Priority: Highest
Reporter: Vratko Polak Assignee: Oleksii Mozghovyi
Resolution: Cannot Reproduce Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by AAA-99 Upgrade Jersey from 1.X to 2.X Resolved
is blocked by BGPCEP-948 BgpPeer instantiated twice in cluster... Resolved
External issue ID: 8434

 Description   

Marking this as a clustering bug, because it is not seen in 1-node tests.

Seen in the BGP suite (if CONTROLLER-1645 does not hit first). The URL is currently /restconf/modules, which is to be removed soon, but I think this may also happen when querying Jolokia.

The read currently has a timeout of 5 seconds, which can be exceeded [0], but previous reads [1] were also taking more than 4 seconds to finish. I see nothing critical in karaf.log, considering this run uses the tell-based protocol.

Note that when testing locally, the garbage collector was doing only small collections, so pauses stayed under 1 second.

This might be just a less severe version of 8432 (5 s as opposed to 30 s), or a less severe version of CONTROLLER-1645 where the delay is not long enough to cause leader movement due to transient dissociation.

If this is not deemed critical, we can change suites to allow for slower responses.

[0] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/264/archives/log.html.gz#s1-s9-t8-k2-k3-k7-k2-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k4-k2-k1-k6-k3-k1-k2-k1-k1-k3-k3-k1
[1] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/264/archives/log.html.gz#s1-s9-t8-k2-k3-k7-k2-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k3-k2-k1-k6-k3-k1-k2-k1-k1-k3-k3-k1



 Comments   
Comment by Tom Pantelis [ 12/May/17 ]

/restconf/modules doesn't hit the datastore, so if that is slow then the whole system is likely bogged down. With clustering there's more load on the system due to replication etc. Could it be low on memory? Is CPU usage high? I would suggest tracking memory and CPU usage. If CPU is high, get thread dumps.
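For illustration, a minimal, generic JDK sketch of collecting that kind of data from inside the JVM (heap usage plus a thread dump via JMX); in CSIT this would more likely be done with jcmd or jstack against the karaf process, so treat the class below as an assumption-laden example rather than existing tooling:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;

// Generic JMX snapshot of heap usage and thread states; not ODL-specific code.
public class DumpDiagnostics {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MiB, max=%d MiB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);

        // true/true asks for lock and monitor info, roughly what jstack -l prints.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            System.out.print(info); // includes thread state and a truncated stack trace
        }
    }
}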

Comment by Robert Varga [ 15/May/17 ]

Does this occur with odl-restconf-noauth? I think AAA has interactions with the data store, which would explain why it takes longer than expected.

Comment by Vratko Polak [ 16/May/17 ]

We hit this again [2], but a few runs passed in the meantime. This time we do not access /restconf/modules but query Jolokia [3] directly.

> Does this occur with odl-restconf-noauth

I think on Sandbox it passed. We will switch the job to odl-restconf-noauth now.

[2] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/276/archives/log.html.gz#s1-s7-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k3-k2-k1-k6-k2-k1-k5-k1-k3-k1
[3] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/276/archives/log.html.gz#s1-s7-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k3-k2-k1-k6-k2-k1-k5-k1-k1
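
For reference, a hedged sketch of reproducing such a Jolokia shard read standalone; the host, port and MBean name are assumptions based on a typical ODL deployment (Jolokia on 8181, the usual shard MBean naming), not values taken from the failing job:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical standalone version of the CSIT Jolokia read; host, port and
// MBean name are assumptions, not taken from the failing job.
public class JolokiaShardRead {
    public static void main(String[] args) throws Exception {
        String mbean = "org.opendaylight.controller:"
                + "type=DistributedOperationalDatastore,Category=Shards,"
                + "name=member-2-shard-default-operational";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://10.29.12.236:8181/jolokia/read/" + mbean))
                .timeout(Duration.ofSeconds(5)) // same 5 s budget the suite uses
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.printf("status=%d, elapsed=%d ms%n",
                response.statusCode(), (System.nanoTime() - start) / 1_000_000);
    }
}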

Comment by Vratko Polak [ 17/May/17 ]

> switch the job to odl-restconf-noauth

Done, but the Jolokia read is still [4] exceeding 5 seconds.

This time we use the tell-based protocol; the RIB owner and shard leaders were already on member-1 (no actual leader movement, neither by request nor due to Bug-8318).

Karaf.log [5] shows member-2 was (probably) installing snapshots, and there are warnings such as:
2017-05-17 11:21:20,929 | WARN | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.Carbon | member-2-shard-default-operational (Follower): Missing index 3062 from log. Cannot apply state. Ignoring 3062 to 3062
2017-05-17 11:21:25,620 | WARN | lt-dispatcher-29 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | member-2-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-7-1, ignoring

I am not sure whether that explains why Jolokia is so slow. UnreachableMember happened some time later, after the test had already failed:
2017-05-17 11:21:57,351 | INFO | lt-dispatcher-72 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | Received UnreachableMember: memberName MemberName{name=member-3}, address: akka.tcp://opendaylight-cluster-data@10.29.12.21:2550

[4] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/279/archives/log.html.gz#s1-s2-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k1-k2-k1-k6-k2-k1-k5-k1-k3-k1
[5] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/279/archives/odl2_karaf.log.gz

Comment by Vratko Polak [ 26/May/17 ]

> jolokia read

This can still happen. We were testing BGP longevity (tell-based protocol) on Sandbox, on a build which contained a chain [0] for Carbon post-release with increased akka timers [1], and Jolokia did not respond within 30 seconds [2].

The request was asking member-2 about the state of the default-operational shard (member-2 was a follower and the RIB owner).
Here is the relevant segment of karaf.log [3]; the request was open between 18:12:42.608 and 18:13:12.641:

2017-05-25 18:11:59,944 | WARN | lt-dispatcher-33 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 242 from log. Cannot apply state. Ignoring 242 to 256
2017-05-25 18:12:11,601 | INFO | lt-dispatcher-77 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-2-shard-default-operational status sync done true
2017-05-25 18:12:11,606 | INFO | lt-dispatcher-77 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-operational Received follower initial sync status for member-2-shard-default-operational status sync done false
2017-05-25 18:12:25,496 | WARN | lt-dispatcher-33 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-topology-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-6-2, ignoring
2017-05-25 18:12:25,523 | WARN | lt-dispatcher-33 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-inventory-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-inventory-config, logLastIndex=-1, logLastTerm=-1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 16517, lastApplied : -1, commitIndex : -1
2017-05-25 18:12:25,523 | WARN | lt-dispatcher-27 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-toaster-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-toaster-config, logLastIndex=-1, logLastTerm=-1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 16517, lastApplied : -1, commitIndex : -1
2017-05-25 18:12:25,523 | WARN | lt-dispatcher-82 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-default-config, logLastIndex=68, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 16517, lastApplied : 68, commitIndex : 68
2017-05-25 18:12:25,523 | WARN | lt-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-topology-operational (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-topology-operational, logLastIndex=58, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 22653, lastApplied : 59, commitIndex : 59
2017-05-25 18:12:25,524 | WARN | lt-dispatcher-79 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-topology-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-topology-config, logLastIndex=5, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 16517, lastApplied : 5, commitIndex : 5
2017-05-25 18:12:25,525 | WARN | lt-dispatcher-33 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-toaster-operational (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-toaster-operational, logLastIndex=-1, logLastTerm=-1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 16519, lastApplied : -1, commitIndex : -1
2017-05-25 18:12:25,595 | WARN | lt-dispatcher-33 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-topology-operational: Unknown history for purged transaction member-1-datastore-operational-fe-0-txn-6-2, ignoring
2017-05-25 18:13:05,067 | WARN | lt-dispatcher-27 | ClusterCoreDaemon | 174 - com.typesafe.akka.slf4j - 2.4.17 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.12.69:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.236:2550, status = Up)]. Node roles [member-2]

So even with increased timers [1], CONTROLLER-1645 can still happen.
The question is whether UnreachableMember can cause Jolokia to respond slowly. One possibility is AAA still storing something in the datastore, even though this run was using odl-restconf-noauth. Another possibility is shard access being blocked under load for some reason.

[0] https://git.opendaylight.org/gerrit/#/c/57822/1
[1] https://git.opendaylight.org/gerrit/#/c/57699/5
[2] https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-3node-bgpclustering-longevity-only-carbon/13/archives/log.html.gz#s1-s2-t1-k9-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k1-k3-k7-k5-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k2-k2-k1-k6-k2-k1-k5-k1-k3-k1
[3] https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-3node-bgpclustering-longevity-only-carbon/13/archives/odl1_karaf.log.gz

Comment by Tom Pantelis [ 26/May/17 ]

The request to get the Shard stats sends a message to the Shard actor, so it waits in the mailbox queue and can get delayed. That's probably what's happening.

Comment by Robert Varga [ 05/Jun/17 ]

Tom, I think we can improve the situation now that we have AAA-99 addressed, by making the shard stats message a control message as well. Do you agree?
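
A minimal sketch of that idea, assuming Akka's control-aware mailbox support; the class and mailbox names below are illustrative, not the actual controller code:

import akka.dispatch.ControlMessage;
import java.io.Serializable;

// Illustrative only: a stats request marked as an Akka control message so that
// a control-aware mailbox delivers it ahead of ordinary queued traffic.
public final class GetShardStats implements ControlMessage, Serializable {
    private static final long serialVersionUID = 1L;
}

// In application.conf (illustrative):
//   shard-control-aware-mailbox {
//     mailbox-type = "akka.dispatch.UnboundedControlAwareMailbox"
//   }
// and the Shard actor would be created with
//   Props.create(Shard.class, ...).withMailbox("shard-control-aware-mailbox")

With an UnboundedControlAwareMailbox, messages implementing ControlMessage are dequeued before ordinary messages, which is what would let the stats request bypass a backed-up Shard queue.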

Comment by Tom Pantelis [ 14/Nov/18 ]

sure

Comment by Oleksii Mozghovyi [ 26/Mar/21 ]

This looks like a pretty old issue, so I had to fix several other problems before taking care of RESTCONF performance:
1. The 'odl-restconf-noauth' feature was removed in 2019; as a temporary solution, it can be replaced with the 'odl-restconf' feature. In the long term we would need to replace it with 'odl-restconf-nb-rfc8040' and not install unnecessary dependencies for the test.
https://git.opendaylight.org/gerrit/c/releng/builder/+/95526
2. After running the test scenario multiple times in the clustered configuration, I noticed that periodically the BGPCEP component tries to instantiate the BgpPeer instances twice. Such behavior leads to a failure of the example-bgp-rib-service-group service. I have not found the root cause of that behavior yet, but I could work around it by adding a small check to the BgpPeer class (see the sketch after this list).
https://git.opendaylight.org/gerrit/c/bgpcep/+/95561
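
Purely as an illustration of the kind of "small check" mentioned in point 2 (the real fix is the gerrit change linked above, and the actual BgpPeer API differs from this sketch), an idempotent-start guard could look like this:

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration of an idempotent-start guard; not the real BgpPeer code.
final class PeerInstantiationGuard {
    private final AtomicBoolean started = new AtomicBoolean(false);

    void start() {
        // A second instantiation attempt (e.g. a duplicated activation in the
        // cluster) becomes a harmless no-op instead of failing the service group.
        if (!started.compareAndSet(false, true)) {
            return;
        }
        // ... actual peer start-up would go here ...
    }

    void stop() {
        started.set(false);
    }
}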

Comment by Oleksii Mozghovyi [ 30/Mar/21 ]

3. On the sandbox I was able to run the stability test for 3 h with 6 GB of RAM allocated to the controller. I've scheduled another run for 23 h (the original value for this particular test) and will check the logs once it's done.

Comment by Oleksii Mozghovyi [ 31/Mar/21 ]

The 23 h run on the sandbox with 6 GB of RAM allocated finished successfully; I will try to reduce the memory to 4 GB and see what happens (on my local environment it failed).

Comment by Oleksii Mozghovyi [ 08/Apr/21 ]

The 23 h run on the sandbox with 4 GB of RAM allocated finished successfully; the only problem is that the size of the generated log files doesn't allow pushing them into Nexus. However, this can't be closed until the 95561 change is merged into the bgpcep project.

Comment by Robert Varga [ 02/Jul/21 ]

We cannot reproduce this issue with 4G heap, hence resolved.
