[CONTROLLER-1668] Tell-based protocol doesn't time out transactions after hard timeout. Created: 11/May/17  Updated: 25/Jul/23  Resolved: 17/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tomas Cere Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1665 C: write-transactions does not return... Resolved
blocks CONTROLLER-1674 Timeout waiting for task in producer ... Resolved
External issue ID: 8422

 Description   

When running with the tell-based protocol, transactions do not seem to abort and fail with Transaction timeout after reaching the hard timeout, which seems to be 60 seconds.

https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/13/archives/log.html.gz#s1-s1-t1-k2-k18-k3

This suite tries to run write-transactions on an isolated node and expects
that the transactions are aborted after reaching the hard timeout. Instead,
the test fails after reaching its own timeout of 150 seconds on the Robot side.



 Comments   
Comment by Tomas Cere [ 11/May/17 ]

Upon investigating the code paths, it seems we are losing track of the original enqueue times once requests are replayed. The timer is reset for each replayed request, so the timeout is never reached.
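
A minimal sketch of the behaviour described above, not the actual controller code. The names Entry, RequestQueue, replayResettingTimers and replayPreservingTimers are hypothetical and exist only for illustration; the 60-second value is the hard timeout mentioned in the description. It contrasts a replay that re-stamps every entry with one that carries the original enqueue time over:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical queue entry: a request plus the time it was first enqueued. */
final class Entry {
    final String request;
    final long enqueuedNanos;

    Entry(String request, long enqueuedNanos) {
        this.request = request;
        this.enqueuedNanos = enqueuedNanos;
    }
}

/** Hypothetical stand-in for a per-connection request queue. */
final class RequestQueue {
    // Assumed hard timeout of 60 seconds, as reported in the description.
    static final long HARD_TIMEOUT_NANOS = 60_000_000_000L;

    private final Deque<Entry> pending = new ArrayDeque<>();

    void enqueue(String request, long nowNanos) {
        pending.add(new Entry(request, nowNanos));
    }

    /** Problematic replay: every entry gets a fresh timestamp, so the clock restarts. */
    RequestQueue replayResettingTimers(long nowNanos) {
        RequestQueue next = new RequestQueue();
        for (Entry e : pending) {
            next.pending.add(new Entry(e.request, nowNanos));
        }
        return next;
    }

    /** Replay that moves entries over unchanged, keeping the original enqueue time. */
    RequestQueue replayPreservingTimers() {
        RequestQueue next = new RequestQueue();
        next.pending.addAll(pending);
        return next;
    }

    /** True if the oldest pending entry has been queued for at least the hard timeout. */
    boolean hasTimedOut(long nowNanos) {
        Entry head = pending.peek();
        return head != null && nowNanos - head.enqueuedNanos >= HARD_TIMEOUT_NANOS;
    }
}
```

With a request enqueued at t = 0 and replayed at t = 50 s, the re-stamping variant still reports no timeout at t = 70 s, while the preserving variant does. If replays keep happening more often than once per timeout period, the re-stamping variant never times out at all.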

Comment by Robert Varga [ 11/May/17 ]

This is an artifact of the code structure change with the request queue. The expectation was that the original submitter would retry the request on timeout, but that is not feasible and we really need to separate the timeouts for the request (i.e. its forward progress) and reconnect-and-retry logic.

As a first step, https://git.opendaylight.org/gerrit/56747 cleans up the replay logic so that we do not lose the enqueue timers. A follow-up patch will separate out the retry logic.

Comment by Tom Pantelis [ 11/May/17 ]

Lowering this to major as it's not a blocking issue for the Carbon release since tell-based won't be enabled by default.

Comment by Robert Varga [ 14/May/17 ]

This actually affects prefix-based shards, rendering them vulnerable.

The timer part:
https://git.opendaylight.org/gerrit/56874

Comment by A H [ 17/May/17 ]

Can your team take a look and update the status of this bug? Now that the patch https://git.opendaylight.org/gerrit/#/c/56874/ has been successfully merged, can we mark this bug as fixed and resolved?
