[CONTROLLER-1635] TimeoutException: Futures timed out after [5 seconds] Created: 13/Apr/17 Updated: 25/Jul/23 Resolved: 24/Apr/17
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Claudio David Gasparini | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8219 |
| Description |
|
2017-04-12 23:30:08,853 | ERROR | lt-dispatcher-15 | SimpleShardDataTreeCohort | 182 - org.opendaylight.controller.sal-distributed-datastore - 1.4.4.SNAPSHOT | Transaction member-1-datastore-operational-fe-0-chn-7-txn-11752 failed to prepare
| Comments |
| Comment by Claudio David Gasparini [ 13/Apr/17 ] |
|
Intermittent issue seen in Boron tests.
| Comment by Tom Pantelis [ 13/Apr/17 ] |
|
The fact that this code is executing indicates there is a DOMDataTreeCommitCohort present somewhere in BGP or in the test suite. The TimeoutException indicates the DOMDataTreeCommitCohort implementation didn't respond within 5 seconds, so this doesn't appear to be a clustering issue.
| Comment by Robert Varga [ 13/Apr/17 ] |
|
Actually, BGPCEP is not using commit cohorts at all; the logs show:

    2017-04-12 23:28:38,665 | INFO | entLoopGroup-5-1 | BGPPeer | 195 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.4.SNAPSHOT | Session with peer 10.29.13.222 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.UnicastSubsequentAddressFamily]] and Add Path tables []

I would assume this was a huge transaction and we simply missed the deadline to respond. Tom, I think we should sync up on the various timers we have here and make sure they make sense. We should also include the transaction ID in the DTCL-too-long warning.
| Comment by Tom Pantelis [ 13/Apr/17 ] |
|
Hmm... It failed here:

    org.opendaylight.controller.cluster.datastore.CompositeDataTreeCohort.processResponses(CompositeDataTreeCohort.java:162)[182:org.opendaylight.controller.sal-distributed-datastore:1.4.4.SNAPSHOT]

    private void processResponses(final Future<Iterable<Object>> resultsFuture, final State currentState, final State afterState) {
        try {
            ...
        } catch (Exception e) {
            successfulFromPrevious = null;
            Throwables.propagateIfInstanceOf(e, TimeoutException.class);
            throw Throwables.propagate(e);
        }
    }

If there were no commit cohorts then resultsFuture should be empty and thus complete immediately. Or am I missing something here? I agree we need to revisit the various timers. The timeout above is hard-coded at 5 seconds.
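The catch block quoted above follows Guava's rethrow idiom (propagateIfInstanceOf followed by propagate). A plain-Java sketch of the same idea, with a hypothetical class name and no Guava dependency:

```java
import java.util.concurrent.TimeoutException;

class Propagation {
    // Rethrow a checked TimeoutException as-is so callers can treat a
    // cohort timeout specially; wrap anything else unchecked. The caller
    // uses it as `throw propagate(e);`, mirroring the Guava pattern.
    static RuntimeException propagate(Exception e) throws TimeoutException {
        if (e instanceof TimeoutException) {
            throw (TimeoutException) e;
        }
        return new RuntimeException(e);
    }
}
```

Here a TimeoutException escapes with its original type, while any other failure surfaces as a RuntimeException whose cause is the original exception.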
| Comment by Robert Varga [ 21/Apr/17 ] |
|
We have just reproduced this in a single-node scenario with a BGP load test. The stack trace is slightly different here, reflecting the tell-based protocol:

    java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]

I have grepped the entire autorelease for cohort registrations in Carbon and there is only SFC, which is not installed in this case. It looks rather weird, and the code structure makes it really hard to extract any sort of useful information. I have refactored the code so it retains actor identities, hence we can actually reason about what went down (if anything). I have also addressed the FIXME by adding a bypass for empty cohorts, which may actually address this issue, since in that case we do not go through the global execution context but return directly.
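The empty-cohort bypass described above can be sketched as follows. This is a minimal illustration with hypothetical names (the real logic lives in CompositeDataTreeCohort), not the controller code:

```java
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class CohortDispatchSketch {
    // With no cohorts registered there is nothing to ask, so return an
    // already-completed future directly. Skipping the executor dispatch
    // means a saturated thread pool cannot turn "nothing to do" into a
    // five-second timeout.
    static CompletableFuture<Collection<Object>> canCommit(Collection<Object> cohorts) {
        if (cohorts.isEmpty()) {
            return CompletableFuture.completedFuture(List.of());
        }
        // Non-empty case: fan out to the cohort actors and aggregate
        // their replies (elided in this sketch).
        return new CompletableFuture<>();
    }
}
```

The caller sees an immediately-done future in the common no-cohorts case, so no timeout can apply to it.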
| Comment by Tom Pantelis [ 21/Apr/17 ] |
|
(In reply to Robert Varga from comment #5)

> I have refactored the code so it retains actor identities, hence we can

Yeah, I think that's what's happening: Akka/Scala futures still go through the global executor even when they are already complete, so if all the executor threads are in use it could time out. In another part of the code I first check whether the future is done before adding a callback. I plan on removing the actor; it really isn't necessary, plus we need to track the cohort state per transaction now that we have pipelining. The current code assumes only one transaction at a time is undergoing 3PC, since that was the case before pipelining.
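The "check whether the future is done before adding a callback" workaround can be sketched with java.util.concurrent. This is a hypothetical helper illustrating the pattern, not the controller code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.function.Consumer;

class DoneCheckSketch {
    // Consume an already-completed future synchronously on the calling
    // thread; only a genuinely pending future registers a callback. A
    // busy executor therefore cannot delay a value that is already
    // available. Returns true when the synchronous path was taken.
    static <T> boolean consume(CompletableFuture<T> f, Consumer<T> sink)
            throws ExecutionException, InterruptedException {
        if (f.isDone()) {
            sink.accept(f.get());   // immediate path, no executor hop
            return true;
        }
        f.thenAccept(sink);         // async path, runs via the executor
        return false;
    }
}
```

Note that Scala's `Future.onComplete` always schedules through its ExecutionContext, which is the behavior described above; checking `isCompleted`/`isDone` first sidesteps that scheduling for completed futures.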
| Comment by Robert Varga [ 21/Apr/17 ] |
|
Boron shortcuts: https://git.opendaylight.org/gerrit/55821
| Comment by Robert Varga [ 24/Apr/17 ] |
|
Agreed on killing DataTreeCohortActor. I have filed |