[CONTROLLER-1672] Restconf slow to respond when the member is under load

| Created: | 12/May/17 |
| Updated: | 25/Jul/23 |
| Resolved: | 02/Jul/21 |
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.0 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Vratko Polak | Assignee: | Oleksii Mozghovyi |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | pt |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 8434 |
| Description |
|
Marking as a clustering bug, because this is not seen in 1-node tests. Seen in the BGP suite (if CONTROLLER-1645 does not come first). The URL is currently /restconf/modules, to be removed soon, but I think this may also happen when querying Jolokia. The read currently has a timeout of 5 seconds, which can be exceeded [0], but previous reads [1] were also taking more than 4 seconds to finish. I see nothing critical in karaf.log, considering this is the tell-based protocol. Note that when testing locally, the garbage collector was doing small collections only, so pauses were under 1 second.

This might be just a less severe version of Bug 8432 (5 s as opposed to 30 s), or a less severe version of CONTROLLER-1645 where the delay is not long enough to cause leader movement due to transient dissociation. If this is not deemed critical, we can change the suites to allow for slower responses.

[0] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/264/archives/log.html.gz#s1-s9-t8-k2-k3-k7-k2-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k4-k2-k1-k6-k3-k1-k2-k1-k1-k3-k3-k1 |
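For orientation, here is a minimal sketch of the kind of timed read the suite performs, assuming Java 11+ `java.net.http`, the default RESTCONF port 8181, and admin:admin credentials; the URL, port and credentials are illustrative assumptions, not values taken from the suite itself:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Base64;

public class TimedRestconfRead {
    public static void main(String[] args) throws Exception {
        // Hypothetical member address; the real suite targets the cluster members directly.
        String url = "http://127.0.0.1:8181/restconf/modules";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(5))   // the 5-second budget mentioned above
                .header("Authorization", "Basic " + auth)
                .GET()
                .build();

        long start = System.nanoTime();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("status=" + response.statusCode() + " elapsed=" + elapsedMs + " ms");
    }
}
```

If the 5-second budget is exceeded, `send` throws an `HttpTimeoutException`, which corresponds to the failure mode seen in [0].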
| Comments |
| Comment by Tom Pantelis [ 12/May/17 ] |
|
/restconf/modules doesn't hit the datastore so if that is slow then the whole system is likely bogged down. With clustering, there's more load on the system due to replication etc. Could it be low on memory? CPU usage high? I would suggest tracking memory and CPU usage. If CPU is high, get thread dumps. |
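A rough sketch of what such tracking could look like from inside the JVM, using only the standard `java.lang.management` MXBeans; this is an illustration of the suggestion, not code from the controller:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadInfo;

public class LoadProbe {
    public static void main(String[] args) {
        // Heap usage: a heap that is nearly full hints at GC pressure.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MiB max=%d MiB%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);

        // System load average as a crude CPU indicator (-1 if unavailable on the platform).
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        System.out.println("system load average: " + load);

        // If CPU looks high, capture a thread dump (equivalent in spirit to jstack).
        if (load > Runtime.getRuntime().availableProcessors()) {
            for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                System.out.print(info);   // ThreadInfo.toString() includes a (truncated) stack trace
            }
        }
    }
}
```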
| Comment by Robert Varga [ 15/May/17 ] |
|
Does this occur with odl-restconf-noauth? I think AAA has interactions with the data store, which would explain why it takes longer than expected. |
| Comment by Vratko Polak [ 16/May/17 ] |
|
We hit this again [2], but a few runs passed in the meantime. This time we do not access /restconf/modules but Jolokia [3] directly.

> Does this occur with odl-restconf-noauth

I think on Sandbox it passed. We will switch the job to odl-restconf-noauth now.

[2] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/276/archives/log.html.gz#s1-s7-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k3-k2-k1-k6-k2-k1-k5-k1-k3-k1 |
| Comment by Vratko Polak [ 17/May/17 ] |
|
> switch the job to odl-restconf-noauth

Done, but the Jolokia read is still [4] overstepping 5 seconds. This time we use the tell-based protocol; the RIB owner and shard leaders were already on member-1 (no actual leader movement, neither by request nor via Bug 8318). Karaf.log [5] shows member-2 was (probably) installing snapshots, and there are warnings such as:

I am not sure whether that explains why Jolokia is so slow. UnreachableMember happened some time later, after the test had already failed:

, address: akka.tcp://opendaylight-cluster-data@10.29.12.21:2550

[4] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/279/archives/log.html.gz#s1-s2-t8-k2-k3-k7-k4-k1-k6-k1-k1-k1-k1-k1-k2-k1-k1-k2-k1-k2-k1-k6-k2-k1-k5-k1-k3-k1 |
| Comment by Vratko Polak [ 26/May/17 ] |
|
> jolokia read

This can still happen. We were testing BGP longevity (tell-based protocol) on Sandbox, on a build which contained a chain [0] for the Carbon post-release with increased Akka timers [1], and Jolokia did not respond within 30 seconds [2]. The request was asking member-2 about the state of the default-operational shard (member-2 was a follower and the RIB owner).

2017-05-25 18:11:59,944 | WARN | lt-dispatcher-33 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-operational (Follower): Missing index 242 from log. Cannot apply state. Ignoring 242 to 256

So even with the increased timers [1], CONTROLLER-1645 can still happen.

[0] https://git.opendaylight.org/gerrit/#/c/57822/1 |
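For context, the Jolokia read in question is a plain HTTP GET of the shard's MBean; below is a sketch of roughly what that request looks like, where the member address, port and the exact MBean name are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ShardStateRead {
    public static void main(String[] args) throws Exception {
        // Assumed MBean name for member-2's default operational shard; the real suites
        // build this string from the member name and shard name.
        String mbean = "org.opendaylight.controller:type=DistributedOperationalDatastore,"
                + "Category=Shards,name=member-2-shard-default-operational";
        // Replace with member-2's actual address in a real run.
        String url = "http://127.0.0.1:8181/jolokia/read/" + mbean;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(30))   // the 30-second budget this run exceeded
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());       // JSON with the shard's raft state and statistics
    }
}
```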
| Comment by Tom Pantelis [ 26/May/17 ] |
|
The request to get the Shard stats sends a message to the Shard so it will wait in the queue and can get delayed. That's probably what's happening. |
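To make the queueing effect concrete, here is a small self-contained analogy in plain Java (not controller code): a single-threaded executor plays the role of the Shard's mailbox, and a cheap stats request times out simply because it is queued behind slower work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class QueueDelayDemo {
    public static void main(String[] args) throws Exception {
        // A single-threaded executor stands in for the Shard actor's mailbox:
        // messages are handled strictly one at a time, in arrival order.
        ExecutorService shard = Executors.newSingleThreadExecutor();

        // Backlog of "expensive" work already queued (think replication, snapshot install).
        for (int i = 0; i < 3; i++) {
            shard.submit(() -> {
                try {
                    Thread.sleep(2_000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // The stats request itself is cheap, but it must wait its turn in the queue.
        Future<String> stats = shard.submit(() -> "shard stats");
        try {
            System.out.println(stats.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("stats request timed out while queued behind other work");
        } finally {
            shard.shutdownNow();
        }
    }
}
```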
| Comment by Robert Varga [ 05/Jun/17 ] |
|
Tom, I think we can improve the situation now that we have |
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
sure |
| Comment by Oleksii Mozghovyi [ 26/Mar/21 ] |
|
This looks like a pretty old issue, so I had to fix several issues before taking care of RESTconf performance: |
| Comment by Oleksii Mozghovyi [ 30/Mar/21 ] |
|
3. On the sandbox I was able to run a stability test for 3 h with 6 GB of RAM allocated for the controller. I've scheduled another run for 23 h (the original value for this particular test) and will check the logs once it's done. |
| Comment by Oleksii Mozghovyi [ 31/Mar/21 ] |
|
The 23 h run on a sandbox with 6 GB of RAM allocated finished successfully; I will try to reduce the memory to 4 GB and see what happens (in my local environment it failed). |
| Comment by Oleksii Mozghovyi [ 08/Apr/21 ] |
|
The 23 h run on the sandbox with 4 GB of RAM allocated finished successfully; the only problem is that the size of the generated log files doesn't allow pushing them to Nexus. However, this can't be closed until the #95561 change is merged into the bgpcep project. |
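As a side note, since the outcome here tracks the heap actually given to the controller, a trivial sanity-check sketch (plain Java, not part of the test) for confirming the effective limit from inside the JVM:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Reports the -Xmx the JVM ended up with, e.g. roughly 4096 MiB for the 4 GB runs above.
        long maxMiB = Runtime.getRuntime().maxMemory() >> 20;
        System.out.println("max heap: " + maxMiB + " MiB");
    }
}
```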
| Comment by Robert Varga [ 02/Jul/21 ] |
|
We cannot reproduce this issue with a 4 GB heap, hence resolved. |