[CONTROLLER-1737] Current transaction in state READY leads to RequestTimeoutException (120s) after make-leader-local Created: 03/Jul/17  Updated: 25/Jul/23  Resolved: 02/Jul/21

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 4.0.0

Type: Bug
Reporter: Vratko Polak Assignee: Oleksii Mozghovyi
Resolution: Cannot Reproduce Votes: 0
Labels: pt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1707 Client not reconnecting successfully ... Resolved
External issue ID: 8792

 Description   

The Robot symptom is basically the same as in CONTROLLER-1707.
There is a test which started failing [1] quite reliably. The difference from the original CONTROLLER-1707 description is that there is a transaction writer on each member and a data tree change listener on the old leader member (a sketch of such a writer follows the error excerpt below). Once again, the only failing writer is the one on the old leader member:

{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Submit failed","error-info":"TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds]]}

\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat
...
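
For context, a minimal sketch of the kind of chained writer the test exercises, using the Carbon-era controller MD-SAL DOM API (the class name, the listener wiring and the path/data arguments are hypothetical; the suite itself drives the writes through RESTCONF). The "chn" fragment in the failing transaction IDs shows that the writes go through a transaction chain:

import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
import org.opendaylight.controller.md.sal.dom.api.DOMDataBroker;
import org.opendaylight.controller.md.sal.dom.api.DOMDataWriteTransaction;
import org.opendaylight.controller.md.sal.dom.api.DOMTransactionChain;
import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;
import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;

// Hypothetical writer; transaction IDs such as
// member-2-datastore-config-fe-0-chn-2-txn-4416-0 number the transactions
// created within one such chain.
final class ChainedWriter {
    void writeOnce(DOMDataBroker broker, TransactionChainListener listener,
            YangInstanceIdentifier path, NormalizedNode<?, ?> data) throws Exception {
        DOMTransactionChain chain = broker.createTransactionChain(listener);
        DOMDataWriteTransaction tx = chain.newWriteOnlyTransaction();
        tx.put(LogicalDatastoreType.CONFIGURATION, path, data);
        // submit() is where the RequestTimeoutException eventually surfaces,
        // mapped to the "Submit failed" RESTCONF error quoted above.
        tx.submit().checkedGet();
        chain.close();
    }
}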

Contrary to CONTROLLER-1707, "No transactions enqueued while attempting to start canCommit" is not seen on the new leader.
Instead, the following is seen in karaf.log [2] of the writer (member-1, the old leader, which also hosts the listener). Not sure which log message is the critical one (if any).

Many "rejecting request" messages appear right after "Received LeaderStateChanged":

2017-07-02 15:57:40,058 | INFO | rd-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-1-shard-default-config: not currently leader, rejecting request Envelope{sessionId=0, txSequence=3115, message=ModifyTransactionRequest{target=member-2-datastore-config-fe-0-chn-2-txn-4416-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.2:2550/user/$a#-603757263], modifications=1, protocol=READY}}. isLeader: true, isLeaderActive: false,isLeadershipTransferInProgress: true.
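
The three flags at the end of that message suggest a gate along the following lines (a hypothetical sketch, not the actual Shard code): the shard is still nominally the Raft leader, but it is no longer active because a leadership transfer is in progress, so it rejects frontend requests.

// Hypothetical sketch of the predicate implied by the log message; the
// real Shard logic is more involved.
final class ShardRequestGate {
    static boolean canAcceptFrontendRequest(boolean isLeader, boolean isLeaderActive,
            boolean isLeadershipTransferInProgress) {
        // isLeader=true, isLeaderActive=false, isLeadershipTransferInProgress=true
        // is exactly the window opened by make-leader-local: requests arriving
        // during the transfer are rejected, and the frontend later sees a
        // NotLeaderException.
        return isLeader && isLeaderActive && !isLeadershipTransferInProgress;
    }
}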

The NotLeaderException was received at 15:57:41,474, and the new connection was established at 15:57:46,839. The next reconnect (with no visible reason) was initiated at 15:59:16,939 and finished at 15:59:16,945; another was initiated at 15:59:46,958 and finished at 15:59:46,962. Finally, the request timeout happened:

2017-07-02 15:59:51,878 | WARN | lt-dispatcher-43 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Tx: DOM-CHAIN-1-5382 Error during phase CAN_COMMIT, starting Abort
org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023957045seconds
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:372)[197:org.opendaylight.controller.cds-access-client:1.1.1.SNAPSHOT]
...

In the whole segment, there were multiple (arguably harmless) snapshot persist events.

On the new leader [3], there is only one suspicious message:

2017-07-02 15:58:06,430 | WARN | rd-dispatcher-31 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-5382-0 has timed out after 15000 ms in state READY

Update:
While trying to replicate on the Sandbox, two tests ([4] and [5]) failed. In the latter test [5], the failing writer was on the old leader (without a listener), but in the former one [4], the failing writer was actually on the member which stayed a follower (also without a listener).
In both cases, the leader was member-2. The karaf log [6] is huge, but I have confirmed that in both tests there was a "Current transaction ... has timed out after 15000 ms in state READY" message during the time the test was waiting for the writer's response:

2017-07-03 11:29:50,834 | WARN | rd-dispatcher-44 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-2-shard-default-config: Current transaction member-1-datastore-config-fe-0-chn-2-txn-11757-0 has timed out after 15000 ms in state READY

[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/log.html.gz#s1-s36-t1-k2-k12
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl1_karaf.log.gz
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/769/odl2_karaf.log.gz
[4] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t1-k2-k12
[5] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/log.html.gz#s1-s2-t3-k2-k12
[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/3/odl2_karaf.log.gz



 Comments   
Comment by Vratko Polak [ 03/Jul/17 ]

> In the latter test [5], the failing writer was on the old leader
> (without a listener), but in the former one [4],
> the failing writer was actually on the member which stayed a follower

The releng run from today has also seen both the former [7] and the latter [8] types of failure.

> Current transaction ... in state READY

This type of message is quite frequent in the logs (together with state COMMIT_PENDING). I think this is expected in isolation scenarios, just not for make-leader-local scenarios.

[7] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/770/log.html.gz#s1-s36-t1-k2-k12
[8] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/770/log.html.gz#s1-s36-t5-k2-k12

Comment by Robert Varga [ 03/Jul/17 ]

Unfortunately there is quite a bit of difference between timeouts in READY and COMMIT_PENDING.

In the case of READY, the backend has seen the transaction as sealed, but has not received the corresponding commit request. This happens with reconnects and a remote leader, if the frontend takes too long to actually forward the commit request for whatever reason.

In the case of COMMIT_PENDING, the timeout should be a pure backend transient, generated when achieving consensus in the final stage of committing takes too long.
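
To illustrate the distinction, here is a toy model (hypothetical names, not the ShardDataTree implementation) of how the backend interprets the same 15000 ms state timeout differently depending on the state the transaction is stuck in:

import java.util.concurrent.TimeUnit;

// Toy model: READY waits on the frontend to forward the commit request,
// COMMIT_PENDING waits on backend consensus.
final class BackendTransaction {
    enum State { OPEN, READY, COMMIT_PENDING, COMMITTED }

    private static final long STATE_TIMEOUT_NANOS = TimeUnit.MILLISECONDS.toNanos(15_000);

    State state = State.OPEN;
    long enteredStateAtNanos = System.nanoTime();

    void checkTimeout(String txId) {
        if (System.nanoTime() - enteredStateAtNanos < STATE_TIMEOUT_NANOS) {
            return;
        }
        switch (state) {
            case READY:
                // Frontend sealed the transaction but the commit request never
                // arrived: points at a reconnect/forwarding problem.
                System.out.printf("Current transaction %s has timed out in state READY%n", txId);
                break;
            case COMMIT_PENDING:
                // Pure backend transient: consensus in the final commit stage
                // is taking too long.
                System.out.printf("Current transaction %s has timed out in state COMMIT_PENDING%n", txId);
                break;
            default:
                break;
        }
    }
}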

Let's deal with CONTROLLER-1707 first and see its effects on this one.

Comment by Vratko Polak [ 17/Jul/17 ]

After SR1 and the branch unfreeze (with CONTROLLER-1707 no longer present), this happened again, albeit in a longevity job after 13 hours.

Robot failure: [9], karaf.log: [10].

2017-07-16 12:47:57,411 | WARN | rd-dispatcher-36 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | member-1-shard-default-config: Current transaction member-2-datastore-config-fe-0-chn-345-txn-6634-0 has timed out after 15000 ms in state READY

[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/15/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[10] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/15/odl1_karaf.log.gz

Comment by Robert Varga [ 17/Jul/17 ]

https://git.opendaylight.org/gerrit/60490 to give some more cushion during reconnect.
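
Roughly, the idea of the cushion (a hypothetical sketch, not the actual cds-access-client change): while a connection is reconnecting and replaying requests, push each pending request's deadline out so that time spent reconnecting does not count against the 120 s request timeout.

import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the "cushion" idea; the real timeout
// accounting in AbstractClientConnection is tick-based and more involved.
final class PendingRequest {
    long deadlineNanos;

    void addReconnectCushion(long cushionMillis) {
        // Extend the deadline so a slow reconnect does not immediately trip
        // RequestTimeoutException once the request is replayed.
        deadlineNanos += TimeUnit.MILLISECONDS.toNanos(cushionMillis);
    }
}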

Comment by Vratko Polak [ 20/Jul/17 ]

> https://git.opendaylight.org/gerrit/60490

That worked on sandbox and is now merged.

Comment by Jamo Luhrsen [ 21/May/18 ]

I think we still have these, or maybe they have just come back and I am only now noticing them. Example:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-all-fluorine/91/robot-plugin/log.html.gz#s1-s16-t7-k8-k1-k1-k1

Comment by Vratko Polak [ 22/May/18 ]

> #s1-s16-t7-k8-k1-k1-k1

That is the Ddb-Sanity-Prefix-Based suite, which was never entirely stable.
The test Produce_Transactions_One_Node_Leader is failing on every run and stream I look at.

I mean, the failure means there IS a bug, ending with a RequestTimeoutException.
But it does not happen in the test scenario described here, and it probably does not involve transactions in the READY state (I have not seen debug logs).

I think it is a different bug, not reported so far just because none of the official test scenarios hit it.

Comment by Oleksii Mozghovyi [ 09/Apr/21 ]

So, currently, the ddb-sanity-module-based suite is stable; I don't see any failures from it on the master branch (job: controller-csit-3node-clustering-tell-all-phosphorus). The issues related to the prefix-based suite are still there (I've checked on the sandbox), but that suite was commented out in 2018. If for some reason we want to bring the prefix-based suite back, that should be a separate issue.

Comment by Robert Varga [ 02/Jul/21 ]

We have removed prefix-based shards in version 4.0.0, closing as cannot reproduce.
