[CONTROLLER-1737] Current transaction in state READY leads to RequestTimeoutException (120s) after make-leader-local Created: 03/Jul/17 Updated: 25/Jul/23 Resolved: 02/Jul/21 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.0 |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Oleksii Mozghovyi |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | pt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Issue Links: |
|
| External issue ID: | 8792 |
| Description |
|
The Robot symptom is basically the same as in

{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Submit failed","error-info":"TransactionCommitFailedException {message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds]]}\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat

Contrary to

Very many "rejecting request" messages right after "Received LeaderStateChanged":

2017-07-02 15:57:40,058 | INFO | rd-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-1-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=3115, message=ModifyTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-4416-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.2:2550/user/$a#-603757263], modifications=1, protocol=READY}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.

The NotLeaderException has been received at 15:57:41,474. The new connection has been established at 15:57:46,839. The next reconnect (without a visible reason) has been initiated at 15:59:16,939 and finished at 15:59:16,945. Another one was initiated at 15:59:46,958 and finished at 15:59:46,962. Finally, the request timeout happened:

2017-07-02 15:59:51,878 | WARN | lt-dispatcher-43 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Tx: DOM-CHAIN-1-5382 Error during phase CAN_COMMIT, starting Abort

In the whole segment, there were multiple (arguably harmless) snapshot persist events. On the new leader [3], there is only one suspicious message:

2017-07-02 15:58:06,430 | WARN | rd-dispatcher-31 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-5382-0 has timed out after 15000 ms in state READY

Update:

2017-07-03 11:29:50,834 | WARN | rd-dispatcher-44 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-11757-0 has timed out after 15000 ms in state READY

[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/log.html.gz#s1-s36-t1-k2-k12 |
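The timeline above involves two independent timers: the backend shard ages out a transaction that reached READY but never received its commit request (15000 ms), while the frontend only gives up on the outstanding request after 120 s. As a purely illustrative, self-contained sketch (none of the class or method names below exist in OpenDaylight; only the two timeout values are taken from the logs above), the interplay could be modeled roughly like this:

```java
// Hypothetical sketch -- not OpenDaylight code. All names are invented for
// illustration; only the 15 000 ms backend state timeout and the 120 s frontend
// request timeout come from the log excerpts quoted in the description.
import java.time.Duration;
import java.time.Instant;

public class ReadyStateTimeoutSketch {

    enum TxState { OPEN, READY, COMMIT_PENDING, COMMITTED, ABORTED }

    // Values taken from the logs in the description.
    static final Duration BACKEND_STATE_TIMEOUT = Duration.ofMillis(15_000);
    static final Duration FRONTEND_REQUEST_TIMEOUT = Duration.ofSeconds(120);

    static final class Tx {
        final String id;
        TxState state = TxState.OPEN;
        Instant lastTransition;

        Tx(String id, Instant now) {
            this.id = id;
            this.lastTransition = now;
        }

        void moveTo(TxState next, Instant now) {
            state = next;
            lastTransition = now;
        }
    }

    /** Backend-side check: the transaction was sealed (READY) but no commit request arrived in time. */
    static boolean isStuckInReady(Tx tx, Instant now) {
        return tx.state == TxState.READY
                && Duration.between(tx.lastTransition, now).compareTo(BACKEND_STATE_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        Tx tx = new Tx("member-1-datastore-config-fe-0-chn-2-txn-5382-0", start);
        tx.moveTo(TxState.READY, start);

        // Simulate the commit request being lost during the leader movement: 16 s later
        // the backend flags the transaction, while the frontend keeps waiting and only
        // fails with RequestTimeoutException after 120 s.
        Instant later = start.plusSeconds(16);
        if (isStuckInReady(tx, later)) {
            System.out.println(tx.id + " has timed out in state READY; frontend gives up after "
                    + FRONTEND_REQUEST_TIMEOUT.getSeconds() + " s");
        }
    }
}
```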
| Comments |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
> In the latter test, the failing writer was on the old leader

The releng run from today has also seen the former [7] and the latter [8] type of failure.

> Current transaction ... in state READY

This type of message is quite frequent in the logs (together with state COMMIT_PENDING). I think this is expected in isolation scenarios, just not for make-leader-local scenarios.

[7] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/770/log.html.gz#s1-s36-t1-k2-k12 |
| Comment by Robert Varga [ 03/Jul/17 ] |
|
Unfortunately there is quite a bit of difference between timeouts in READY and COMMIT_PENDING. In the case of READY, we have seen the transaction as sealed, but have not gotten the appropriate commit request. This happens with reconnects and a remote leader, if the frontend takes too long to actually forward the commit request for whatever reason. In the case of COMMIT_PENDING, the timeout should be a pure-backend transient, generated when achieving consensus in the final stage of committing is taking too long. Let's deal with |
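To restate that distinction as a small, self-contained sketch (the enum and method are invented for illustration and do not exist in the codebase): a READY timeout points at the frontend failing to forward the commit request in time, while a COMMIT_PENDING timeout is a backend-side transient while consensus is being reached.

```java
// Hypothetical classifier, for illustration only -- not OpenDaylight code.
public class TimeoutStateSketch {

    enum TimedOutState { READY, COMMIT_PENDING }

    /** Summarizes what each backend timeout state suggests, per the comment above. */
    static String classify(TimedOutState state) {
        switch (state) {
            case READY:
                // Backend saw the transaction sealed but never got the commit request,
                // typically because the frontend was slow to forward it (e.g. during a reconnect).
                return "frontend did not forward the commit request in time";
            case COMMIT_PENDING:
                // Pure backend transient: reaching consensus in the final commit stage is slow.
                return "backend consensus during commit is taking too long";
            default:
                return "unexpected state";
        }
    }

    public static void main(String[] args) {
        for (TimedOutState s : TimedOutState.values()) {
            System.out.println(s + ": " + classify(s));
        }
    }
}
```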
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
After SR1 and branch unfreeze (

Robot failure: [9], karaf.log: [10].

2017-07-16 12:47:57,411 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | member-1-shard-default-config: Current transaction member-2-datastore-config-fe-0-chn-345-txn-6634-0 has timed out after 15000 ms in state READY

[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/15/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10 |
| Comment by Robert Varga [ 17/Jul/17 ] |
|
https://git.opendaylight.org/gerrit/60490 to give some more cushion during reconnect. |
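The linked change is described here only as adding "some more cushion during reconnect". The sketch below is a hypothetical illustration of that general idea (extending an outstanding request's deadline when it is replayed after a leader movement), not the content of the Gerrit patch itself; all names are invented.

```java
// Hypothetical sketch of the "cushion during reconnect" idea -- it does not reflect
// the actual Gerrit change. All names are invented for illustration.
import java.time.Duration;
import java.time.Instant;

public class ReconnectCushionSketch {

    static final class PendingRequest {
        private Instant deadline;

        PendingRequest(Instant sentAt, Duration requestTimeout) {
            this.deadline = sentAt.plus(requestTimeout);
        }

        /** When the request is replayed to a new leader, push the deadline out by a cushion. */
        void extendForReconnect(Duration cushion) {
            deadline = deadline.plus(cushion);
        }

        boolean isExpired(Instant now) {
            return now.isAfter(deadline);
        }
    }

    public static void main(String[] args) {
        Instant sentAt = Instant.now();
        PendingRequest request = new PendingRequest(sentAt, Duration.ofSeconds(120));

        // A leader movement forces a reconnect; without the cushion the request would
        // expire at sentAt + 120 s even though the delay was not the client's fault.
        request.extendForReconnect(Duration.ofSeconds(30));

        System.out.println("expired at +121 s? " + request.isExpired(sentAt.plusSeconds(121)));
        System.out.println("expired at +161 s? " + request.isExpired(sentAt.plusSeconds(161)));
    }
}
```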
| Comment by Vratko Polak [ 20/Jul/17 ] |
|
> https://git.opendaylight.org/gerrit/60490

That worked on the sandbox and is now merged. |
| Comment by Jamo Luhrsen [ 21/May/18 ] |
|
I think we still have these, or maybe they have just come back and I am only now noticing.
|
| Comment by Vratko Polak [ 22/May/18 ] |
|
> #s1-s16-t7-k8-k1-k1-k1

That is the Ddb-Sanity-Prefix-Based suite, which was never entirely stable. I mean, the failure means there IS a bug, ending with RequestTimeoutException. I think it is a different bug, not reported so far just because none of the official test scenarios hit it. |
| Comment by Oleksii Mozghovyi [ 09/Apr/21 ] |
|
So, currently, the ddb-sanity-module-based suite is stable; I don't see any failures from it on the master branch (job: controller-csit-3node-clustering-tell-all-phosphorus). The issues related to the prefix-based suite are still there (I have checked on the sandbox), but that suite was commented out in 2018. If for some reason we want to bring the prefix-based suite back, that should be tracked as a separate issue. |
| Comment by Robert Varga [ 02/Jul/21 ] |
|
We have removed prefix-based shards in version 4.0.0, closing as cannot reproduce. |