[NETCONF-1193] Support maximum value and random selection of wait time between connection attempts Created: 30/Oct/23  Updated: 25/Jan/24  Resolved: 25/Jan/24

Status: Resolved
Project: netconf
Component/s: netconf-client-mdsal
Affects Version/s: None
Fix Version/s: 7.0.0, 5.0.10, 6.0.7

Type: New Feature
Priority: Medium
Reporter: Sangwook Ha
Assignee: Ľuboš Čičut
Resolution: Done
Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Currently, the following two parameters control the wait time between NETCONF connection attempts:

  • between-attempts-timeout-millis
  • sleep-factor

Because the wait time is multiplied by sleep-factor after each connection failure, the gap between connection attempts grows very quickly. It also means that the wait time grows as fast as the total elapsed time of the connectivity loss or connection attempts.

For example, assuming the following default configuration values:

  • between-attempts-timeout-millis: 2 seconds
  • sleep-factor: 1.5
  • connection-timeout-millis: 20 seconds

The total elapsed time after n reconnections is {{ 20 x n + (1.5^(n - 1) - 1) / (1.5 - 1) x 2 }} seconds, and the next wait time is {{ 1.5^n x 2 }} seconds.

The total elapsed times (in seconds) are:

  • n=17: 2963
  • n=18: 4297
  • n=19: 6288
  • n=20: 9263

The next wait times (in seconds) are:

  • n=17: 1970
  • n=18: 2955
  • n=19: 4433
  • n=20: 6650

This means that after more than 2.5 hours of connectivity loss (n=20, 9263 seconds), the NETCONF session may not recover for roughly another 1 hour and 50 minutes (6650 seconds), even if the issue was resolved right after the 20th connection attempt, because of the long wait time.
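The figures above can be reproduced with a short script. This is only a sketch of the arithmetic in this description, not of the client code. The constant and function names are made up for illustration. The listed elapsed times appear to be rounded, while the listed wait times appear to be truncated.

```python
import math

# Default values quoted in this description, converted to seconds.
BETWEEN_ATTEMPTS_S = 2.0     # between-attempts-timeout-millis
SLEEP_FACTOR = 1.5           # sleep-factor
CONNECTION_TIMEOUT_S = 20.0  # connection-timeout-millis


def total_elapsed(n: int) -> float:
    """Total elapsed seconds after n reconnections, per the formula above."""
    # Geometric series term: (factor^(n-1) - 1) / (factor - 1).
    series = (SLEEP_FACTOR ** (n - 1) - 1) / (SLEEP_FACTOR - 1)
    return CONNECTION_TIMEOUT_S * n + series * BETWEEN_ATTEMPTS_S


def next_wait(n: int) -> float:
    """Next wait in seconds after n reconnections, per the formula above."""
    return SLEEP_FACTOR ** n * BETWEEN_ATTEMPTS_S


for n in range(17, 21):
    elapsed = round(total_elapsed(n))
    wait = math.floor(next_wait(n))
    print(f"n={n}: elapsed={elapsed}s, next wait={wait}s")
```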

This is most likely not the expected behavior. When exponential backoff is used, it is common to cap the gap between connection attempts at a maximum value. It is also common to introduce randomness to avoid synchronization between connection attempts: for example, if multiple devices are disconnected by the same network connectivity issue, the controller may try to connect to all of them at almost the same time.
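A minimal sketch of what is being requested, capped exponential backoff with random jitter, might look like the following. It is not the NETCONF client implementation. The names base_s, factor, max_wait_s and jitter are hypothetical, and the 60-second cap and 10% jitter are arbitrary values chosen for illustration.

```python
import itertools
import random


def backoff_waits(base_s=2.0, factor=1.5, max_wait_s=60.0, jitter=0.1, rng=None):
    """Yield successive wait times in seconds: exponential, capped, jittered.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = rng or random.Random()
    wait = base_s
    while True:
        # Randomize within +/- jitter so that devices which failed together
        # are not all retried at the same instant.
        yield wait * (1.0 + rng.uniform(-jitter, jitter))
        # Grow exponentially, but never beyond the cap.
        wait = min(wait * factor, max_wait_s)


for attempt, wait in enumerate(itertools.islice(backoff_waits(), 12), start=1):
    print(f"attempt {attempt}: wait {wait:.1f}s")
```

With a cap like this, the gap between attempts stops growing once it reaches max_wait_s, so the time to recover after a long outage is bounded by the cap (plus jitter), not by the length of the outage.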



 Comments   
Comment by Ivan Hrasko [ 10/Jan/24 ]

IMO the next wait time is {{ 1.5^(n-1) x 2 }}, for example:

  • n=17: 1313
  • n=18: 1970
  • n=19: 2955
  • n=20: 4433
Generated at Wed Feb 07 20:16:54 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.