[NETCONF-359] restconf response taking 20s when sent to isolated Candidate node. Created: 03/Mar/17  Updated: 15/Mar/19  Resolved: 31/Aug/18

Status: Verified
Project: netconf
Component/s: restconf-nb
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 7892

 Description   

This is affecting some CSIT jobs and makes them waste a lot of time.

This email thread sort of explains it:

https://lists.opendaylight.org/pipermail/integration-dev/2017-January/008967.html

To reproduce, bring up a three-node cluster, then kill the leader node
and one of the followers. The remaining node will be in Candidate state.
Next, send a simple REST request (e.g. GET /restconf/streams) to this
node. The request will stay open and pending for 20s before it's
answered.
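
For illustration, here is a minimal sketch that times such a request against the
surviving node. The host, port 8181 and the admin:admin credentials are assumptions
for a default local deployment, not details from this report:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.Base64;

    public class RestconfDelayCheck {
        public static void main(String[] args) throws Exception {
            // Basic auth for a default ODL install; adjust for your deployment.
            String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://127.0.0.1:8181/restconf/streams"))
                    .header("Authorization", "Basic " + auth)
                    .header("Accept", "application/json")
                    .timeout(Duration.ofSeconds(60))  // well above the observed 20s delay
                    .GET()
                    .build();
            long start = System.nanoTime();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("status=" + response.statusCode()
                    + " elapsedMs=" + (System.nanoTime() - start) / 1_000_000);
        }
    }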

In CSIT, where there tend to be a lot of these requests, this can
severely add to the runtime of a test.

If possible, the response should come back right away, even if it needs
to be an error. It seems there is some internal timer/timeout getting in
the way here, since the delay is always 20s.



 Comments   
Comment by Tom Pantelis [ 05/Mar/17 ]

This is by design. It tries to fulfill the request by waiting for a new leader to be elected for a period of time (2 X the election timeout, which is 10 sec by default).

This is the way it currently works. However, Robert is changing the front-end implementation with the CONTROLLER-1483 patch series, so I'm not sure what the exact behavior will be.

Comment by Jamo Luhrsen [ 05/Mar/17 ]

(In reply to Tom Pantelis from comment #1)
> This is by design. It tries to fulfill the request by waiting for a new
> leader to be elected for a period of time (2 X the election timeout, which
> is 10 sec by default).
>
> This is the way it currently works however Robert is changing the front-end
> implementation with the CONTROLLER-1483 patch series so not sure what the exact
> behavior will be.

what would be the downside to giving an "unavailable" response in such a
case instead of waiting for a new leader?

Comment by Tom Pantelis [ 05/Mar/17 ]

As I mentioned, it makes every reasonable effort to fulfill the request before returning an error. I don't think that's wrong and I don't think we should eliminate retries just to make tests run shorter.

The election-timeout is shard-heartbeat-interval-in-millis (500) x shard-election-timeout-factor (20), which is 10 sec by default. For the test you can reduce shard-election-timeout-factor in the .cfg file to, say, 10 to cut the time in half.
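
For reference, a minimal sketch of that tuning, assuming the clustered datastore
settings live in etc/org.opendaylight.controller.cluster.datastore.cfg of the Karaf
distribution (the exact file and path can vary between releases):

    # Defaults: 500 ms heartbeat x factor 20 = 10 s election timeout; a request
    # then waits up to 2 x the election timeout = 20 s before giving up.
    shard-heartbeat-interval-in-millis=500
    # Lowering the factor to 10 gives a 5 s election timeout, so the worst-case
    # wait drops to roughly 10 s.
    shard-election-timeout-factor=10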

Comment by Tom Pantelis [ 06/Mar/17 ]

The reason for retries is for resiliency. So if the leader was transitioning when a transaction was submitted, retries enable the transaction to succeed rather than failing immediately.

Comment by Robert Varga [ 06/Mar/17 ]

This is a question of the programming model (as exposed to the application) versus implementation (as implemented by a particular data store).

Exposing all transient states of the implementation (i.e. when elections are happening) would clutter every application with boilerplate copy&pasted code, which would need to stay consistent with the internal implementation state machine, and that state machine can and will change over time.

The CONTROLLER-1483 rework does not change things in this area (except perhaps having misaligned timers, which will need revision).

The only way to address this is to add an alternative read() method, where the user can specify an upper bound on the request response time (and have an equivalent of TimeoutException reported). This in turn will need figuring out how to specify this in the RESTCONF request (i.e. define an ODL extension query parameter) and having the RESTCONF implementation use the constrained read method.

We cannot reasonably support this for modifications, simply because we cannot guarantee convergence within an arbitrary time interval. Hence if an application were to specify an upper bound on write-type operations, the results reported to the application would not necessarily be consistent with what actually happened.

In any case, we are past offset-0 M5, which means we cannot really deliver this in Carbon.

Comment by Robert Varga [ 06/Mar/17 ]

One more thing: 'upper bound' in this context is obviously not a hard deadline, but rather a wish. For a variety of reasons, implementations can only guarantee a response "soon after" the deadline.

Comment by Robert Varga [ 06/Mar/17 ]

Actually, no change is needed in the APIs themselves: read() already returns a Future, hence all that is required is the extension and making RESTCONF pass an appropriate time constraint to get().
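
A minimal sketch of that idea from the caller's side, assuming the datastore read is
exposed as a plain java.util.concurrent.Future; the helper name, the 5-second-style
bound and the error mapping are illustrative only, not the actual RESTCONF change:

    import java.util.Optional;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public final class BoundedRead {
        private BoundedRead() {
        }

        // Waits on an already-issued datastore read, but only up to boundSeconds.
        static <T> Optional<T> awaitRead(Future<Optional<T>> readFuture, long boundSeconds) throws Exception {
            try {
                return readFuture.get(boundSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Surface an "unavailable"-style failure right away instead of waiting out
                // the 2 x election-timeout retry window (20s by default).
                throw new IllegalStateException(
                        "Datastore read did not complete within " + boundSeconds + "s", e);
            }
        }
    }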

Comment by Robert Varga [ 28/Aug/18 ]

jluhrsen this reads like the AAA issue we fixed in Fluorine. Is this still happening?

Comment by Tom Pantelis [ 28/Aug/18 ]

That fix just pushes the delay to the restconf read, which will still take 20s w/o a leader. This will get worse with the tell-based protocol, as I believe the timeout after retries is 2 min.

Comment by Jamo Luhrsen [ 31/Aug/18 ]

Well, I'm not sure we are on the same page then. The last time I looked at this was, I guess, 1.5 years ago. Anyway,
going by my steps in the description, I am seeing that the response issue is gone.

1. kill the leader
2. kill another node
3. GET on /restconf/streams to the remaining node

the result is a proper response (200) and data ( {"streams":{}} )

Does that jibe with what you guys expect?

Comment by Tom Pantelis [ 31/Aug/18 ]

That's b/c /restconf/streams doesn't hit the DS. You saw this before due to the AAA issue as Robert alluded to. However it would still happen for a request that does actually access the DS - that's expected.

If there's no further action here then please close it.

Comment by Jamo Luhrsen [ 31/Aug/18 ]

deal. nothing else here for me to worry about. closing.

Comment by Jamo Luhrsen [ 31/Aug/18 ]

resolved with https://git.opendaylight.org/gerrit/#/c/73991/
