[CONTROLLER-1703] Tweak Akka and Java timeouts to a reasonable compromise between stability and failure detection Created: 06/Jun/17  Updated: 22/Jan/24

Status: In Review
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 8.0.5, 9.0.1

Type: Improvement
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Unresolved Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1706 Large transaction traffic prevents le... Resolved
is blocked by NETCONF-582 become-prefix-leader can run into Fut... Resolved

 Description   

Several bugs (such as CONTROLLER-1645) track failures caused by an unexpected UnreachableMember.
One hypothesis is that these happen when the cluster is under load: members have many (or large) messages to process, so they are late to read heartbeats from their peers.

To test functional bug fixes, we frequently run with increased Akka timeouts, for example with [0].

However, large Akka timeouts also seem to have downsides. This Improvement is to make sure the various (default) timeouts within ODL are consistent and suitable for performance tests.
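
As a rough illustration of the trade-off, the sketch below models a simple fixed-deadline heartbeat check. This is an assumption made for illustration only: it is not Akka's actual failure detector, and the class name, method names, and all numbers are hypothetical. It shows that a short acceptable pause flags a loaded (but alive) member as unreachable, while a long pause avoids that at the cost of slower detection of a real crash.

```java
public class HeartbeatTimeoutSketch {

    /**
     * Returns true if any gap between consecutive heartbeat arrivals exceeds
     * the acceptable pause, i.e. a live member would be marked unreachable.
     */
    public static boolean falselyUnreachable(long[] arrivalMillis, long acceptablePauseMillis) {
        for (int i = 1; i < arrivalMillis.length; i++) {
            if (arrivalMillis[i] - arrivalMillis[i - 1] > acceptablePauseMillis) {
                return true;
            }
        }
        return false;
    }

    /**
     * Time from a real crash until the deadline expires, given the arrival
     * time of the last heartbeat seen before the crash.
     */
    public static long detectionDelayMillis(long lastHeartbeatMillis, long crashMillis,
                                            long acceptablePauseMillis) {
        return lastHeartbeatMillis + acceptablePauseMillis - crashMillis;
    }

    public static void main(String[] args) {
        // Hypothetical arrivals: the fourth heartbeat is read late because the member is busy.
        long[] arrivals = {0, 1000, 2000, 7500, 8500};

        for (long pause : new long[] {3000, 10000}) {
            System.out.println("pause=" + pause + "ms"
                    + " falseUnreachable=" + falselyUnreachable(arrivals, pause)
                    + " crashDetectionDelay=" + detectionDelayMillis(8500, 8600, pause) + "ms");
        }
    }
}
```

In this model the acceptable pause is the single knob that trades false UnreachableMember events against failure-detection latency, which is the compromise this issue asks for.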

This is an umbrella bug; specific symptoms will be described in child bugs.

[0] https://git.opendaylight.org/gerrit/#/c/57699/5



 Comments   
Comment by Tom Pantelis [ 06/Jun/17 ]

Hopefully this gets addressed by message chunking and converting to Artery.

Generated at Wed Feb 07 19:56:14 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.