[NETCONF-359] restconf response taking 20s when sent to isolated Candidate node. Created: 03/Mar/17 Updated: 15/Mar/19 Resolved: 31/Aug/18 |
|
| Status: | Verified |
| Project: | netconf |
| Component/s: | restconf-nb |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jamo Luhrsen | Assignee: | Jamo Luhrsen |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 7892 |
| Description |
|
This affects some CSIT jobs and makes them waste a lot of time. This email thread sort of explains it: https://lists.opendaylight.org/pipermail/integration-dev/2017-January/008967.html To reproduce, you can bring up a three node cluster and then kill the leader. In CSIT there tend to be a lot of these requests, and when this happens each one takes around 20s. If possible, even if the response needs to be an error, the response should come back quickly. |
| Comments |
| Comment by Tom Pantelis [ 05/Mar/17 ] |
|
This is by design. It tries to fulfill the request by waiting for a new leader to be elected for a period of time (2 x the election timeout, which is 10 sec by default). This is the way it currently works; however, Robert is changing the front-end implementation with the tell-based protocol. |
| Comment by Jamo Luhrsen [ 05/Mar/17 ] |
|
(In reply to Tom Pantelis from comment #1) What would be the downside to giving an "unavailable" response in such a case? |
| Comment by Tom Pantelis [ 05/Mar/17 ] |
|
As I mentioned, it makes every reasonable effort to fulfill the request before returning an error. I don't think that's wrong, and I don't think we should eliminate retries just to make tests run shorter. The election timeout is shard-heartbeat-interval-in-millis (500) x shard-election-timeout-factor (20), which is 10 sec by default. For the test you can reduce shard-election-timeout-factor in the .cfg file to, say, 10 to cut the time in half. |
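A minimal sketch of that tuning, assuming the defaults quoted above and that the settings live in the distribution's datastore .cfg file (the path shown is an assumption and may differ between releases):

    # assumed path: etc/org.opendaylight.controller.cluster.datastore.cfg
    # election timeout = shard-heartbeat-interval-in-millis x shard-election-timeout-factor
    shard-heartbeat-interval-in-millis=500
    # default factor is 20 (500 ms x 20 = 10 s); 10 gives a 5 s election timeout,
    # halving the ~20 s (2 x election timeout) worst-case wait described above
    shard-election-timeout-factor=10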
| Comment by Tom Pantelis [ 06/Mar/17 ] |
|
The reason for retries is for resiliency. So if the leader was transitioning when a transaction was submitted, retries enable the transaction to succeed rather than failing immediately. |
| Comment by Robert Varga [ 06/Mar/17 ] |
|
This is a question of the programming model (as exposed to the application) versus the implementation (as provided by a particular data store). Exposing all transient states of the implementation (i.e. when elections are happening) would clutter all applications with boiler-plate copy&pasted code, which would need to be consistent with the internal implementation state machine – which can and will change over time.

The only way to address this is to add an alternative read() method where the user can specify an upper bound on the request's response time (and have it report an equivalent of a TimeoutException). This in turn requires figuring out how to specify this in the RESTCONF request (i.e. define an ODL extension query parameter) and having the RESTCONF implementation use the constrained read method.

We cannot reasonably support this for modifications, simply because we cannot guarantee convergence in an arbitrary time interval. Hence, if an application were to specify an upper bound on write-type operations, the results reported to the application would not necessarily be consistent with what actually happened. In any case, we are past offset-0 M5, which means we cannot really deliver this in Carbon. |
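To make that concrete, here is a hypothetical shape such an alternative, bounded read() could take; the interface and type names are illustrative only and not an actual MD-SAL API:

    import java.time.Duration;
    import java.util.Optional;
    import java.util.concurrent.CompletionStage;

    // Illustrative only: a read overload that accepts an upper bound on the response time.
    interface BoundedReadOperations<P, D> {
        // Existing behaviour: completes whenever the datastore manages to serve the read.
        CompletionStage<Optional<D>> read(P path);

        // Proposed behaviour: completes exceptionally (with the equivalent of a
        // TimeoutException) if no result is available within the requested bound.
        CompletionStage<Optional<D>> read(P path, Duration upperBound);
    }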
| Comment by Robert Varga [ 06/Mar/17 ] |
|
One more thing: 'upper bound' in this context is obviously not a hard deadline, but rather a wish. For a variety of reasons, implementations can only guarantee a response "soon after" the deadline. |
| Comment by Robert Varga [ 06/Mar/17 ] |
|
Actually, no change is needed in the APIs themselves: read() already returns a Future, hence all that is required is the extension and making RESTCONF pass an appropriate time constraint to get(). |
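As a minimal sketch of that approach, assuming only that the datastore read returns a plain java.util.concurrent.Future of an optional result (the concrete MD-SAL types may differ) and that the bound comes from the hypothetical extension query parameter discussed above:

    import java.util.Optional;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    final class ConstrainedRead {
        // readFuture is the Future returned by the transaction's read() call;
        // timeoutSeconds would be taken from the (hypothetical) RESTCONF query parameter.
        static <T> Optional<T> readWithBound(Future<Optional<T>> readFuture, long timeoutSeconds)
                throws InterruptedException, ExecutionException, TimeoutException {
            // Passing the bound to get() is the only change needed: a TimeoutException here
            // is what RESTCONF could map to an error response instead of waiting ~20 s.
            return readFuture.get(timeoutSeconds, TimeUnit.SECONDS);
        }
    }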
| Comment by Robert Varga [ 28/Aug/18 ] |
|
jluhrsen this reads like the AAA issue we fixed in Fluorine. Is this still happening? |
| Comment by Tom Pantelis [ 28/Aug/18 ] |
|
That fix just pushes the delay to the restconf read, which will still take 20s w/o a leader. This will get worse with tell-based as I believe the timeout after retries is 2 min. |
| Comment by Jamo Luhrsen [ 31/Aug/18 ] |
|
well, I'm not sure we are on the same page then. The last time I looked at this was, I guess, 1.5 years ago? Anyway, to reproduce: 1. kill the leader, 2. send a restconf request to /restconf/streams. The result is a proper response (200) and data ( {"streams":{}} ). Does that jive with what you guys expect? |
| Comment by Tom Pantelis [ 31/Aug/18 ] |
|
That's b/c /restconf/streams doesn't hit the DS. You saw this before due to the AAA issue as Robert alluded to. However it would still happen for a request that does actually access the DS - that's expected. If there's no further action here then please close it. |
| Comment by Jamo Luhrsen [ 31/Aug/18 ] |
|
deal. nothing else here for me to worry about. closing. |
| Comment by Jamo Luhrsen [ 31/Aug/18 ] |
|
resolved with https://git.opendaylight.org/gerrit/#/c/73991/ |