[CONTROLLER-1723] RequestTimeoutException after ~250s after removing module shard replica Created: 16/Jun/17  Updated: 25/Jul/23  Resolved: 12/Jul/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8704

 Description   

This appeared in a test case which was previously failing on CONTROLLER-1707, but there are not enough runs to estimate how frequent this new bug is.

The scenario uses module-based shards (tell-based protocol) with one transaction writer running on each member and a DTCL present on the shard leader node. (If the listener is on the follower node, shard replica removal has not led to failures; only CONTROLLER-1711 is detected by the listener.)
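
For context, a minimal sketch of how such a setup is typically wired in a Carbon-era distribution (the file paths and property name are the standard ones; the module name, namespace and replica list below are placeholders, not taken from this test):

# etc/org.opendaylight.controller.cluster.datastore.cfg -- switch the frontend to the tell-based protocol
use-tell-based-protocol=true

# configuration/initial/modules.conf -- map a YANG module to its own shard
modules = [
    {
        name = "example"
        namespace = "urn:example:placeholder"
        shard-strategy = "module"
    }
]

# configuration/initial/module-shards.conf -- place replicas of that shard on the members
module-shards = [
    {
        name = "example"
        shards = [
            {
                name = "example"
                replicas = ["member-1", "member-2", "member-3"]
            }
        ]
    }
]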

The writer response [0] is:
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 253.973720644seconds]]}

\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)\n\tat org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:102)\n\tat org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:222)\n\tat org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:46)\n\tat org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:126)\n\tat com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)\n\tat com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)\n\tat com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)\n\tat com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)\n\tat com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)\n\tat com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)\n\tat org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$4(AbstractProxyTransaction.java:473)\n\tat org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction.lambda$handleReplayedRemoteRequest$12(RemoteProxyTransaction.java:442)\n\tat org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)\n\tat org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:372)\n\tat org.opendaylight.controller.cluster.access.client.AbstractClientConnection.runTimer(AbstractClientConnection.java:303)\n\tat org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:121)\n\tat org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:45)\n\tat org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:59)\n\tat akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)\n\tat akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)\n\tat akka.actor.Actor$class.aroundReceive(Actor.scala:502)\n\tat akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)\n\tat akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)\n\tat akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)\n\tat akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)\n\tat akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)\n\tat akka.actor.ActorCell.invoke(ActorCell.scala:495)\n\tat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)\n\tat akka.dispatch.Mailbox.run(Mailbox.scala:224)\n\tat akka.dispatch.Mailbox.exec(Mailbox.scala:234)\n\tat scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\nCaused by: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 253.973720644seconds\n\t... 21 more\n"}]}}

This bug shares similarities with multiple known bugs:

The writer response is almost identical to CONTROLLER-1707, but that one was timing out after 120s (as expected for an unsuccessful client reconnect) and is supposedly fixed. Also, CONTROLLER-1707 was not present when moving the leader onto the writer node.

It is not known whether ~250 seconds is a stable or a fluctuating value. If it is fluctuating, this may be a less severe case of CONTROLLER-1708, even though that bug is observed in isolation scenarios.

Also, if RequestTimeoutException can happen with a duration significantly longer than 120s, it may explain CONTROLLER-1687.

Update: A Sandbox run with verbose logs encountered [1] this with a duration of 133 seconds.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/750/log.html.gz#s1-s36-t1-k2-k15-k3-k1-k1
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/35/log.html.gz#s1-s36-t1-k2-k15-k3-k1-k1



 Comments   
Comment by Vratko Polak [ 16/Jun/17 ]

Also, similarly to CONTROLLER-1708, the only writer which fails is the one on the member which stayed a follower.

Comment by Vratko Polak [ 21/Jun/17 ]

"Timed out" value fluctuates. Recently seen: 370s [2] and 122s [3].

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/log.html.gz#s1-s36-t1-k2-k15-k3
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/log.html.gz#s1-s36-t3-k2-k15-k3-k1-k1

Comment by Vratko Polak [ 22/Jun/17 ]

More comments on the fluctuating values. Looking at karaf.log [4] of the faulty node, these might be two different bugs.
In both cases, member-1 was the old leader, member-3 was the new leader and member-2 was the faulty follower.

> 370s [2]

Listener was on member-1. The member-2 client reconnected only once (at 09:33:29,624), but it took 6 minutes to do so.
"Tx: DOM-CHAIN-1-3480 Error during phase CAN_COMMIT, starting Abort" happened at 09:33:29,624, so immediately after the reconnect.
The member-3 writer finished 15506 transactions, the member-1 writer 6705.

> 122s [3].

Listener was on member-2. Here the member-2 client reconnected at 09:36:06,681, after 25 seconds, but at 09:37:11,241 it decided to reconnect again (without an explicit reason), in less than 2 milliseconds.
"Tx: DOM-CHAIN-1-3672 Error during phase CAN_COMMIT, starting Abort" happened at 09:37:41,264 (30 seconds after the second reconnect).
Member-3 finished 7145 transactions, member-1 7141.

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/odl2_karaf.log.gz

Comment by Robert Varga [ 27/Jun/17 ]

Right, this is the log message:

2017-06-21 09:33:29,624 | INFO | ult-dispatcher-2 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.98:2550/user/shardmanager-config/member-1-shard-default-config#253026416], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.80:2550/user/shardmanager-config/member-3-shard-default-config#-675495156], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}} in 6.006 min

6 minutes is a lot of time, indicating we either have a lot of transactions or we are hitting throttling. Can we have debug logs to go with this one?

Comment by Robert Varga [ 27/Jun/17 ]

I think debugs on org.opendaylight.controller.cluster.access.client should be sufficient.
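
For example (assuming a standard Karaf-based distribution), that logger can be raised from the Karaf console:

log:set DEBUG org.opendaylight.controller.cluster.access.client

or, equivalently, by adding the following line to etc/org.ops4j.pax.logging.cfg:

log4j.logger.org.opendaylight.controller.cluster.access.client = DEBUG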

Comment by Robert Varga [ 28/Jun/17 ]

Based on the reported backend info, we are dealing with a remote->remote reconnect.

Comment by Robert Varga [ 28/Jun/17 ]

I think this boils down to AbstractProxyTransaction.ensureSealed() being called from the replay context, which calls down to RemoteProxyTransaction.doSeal(), which, in the case of chained transactions, will call sendRequest(), and that call blocks.
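
A minimal self-contained sketch of that failure pattern (hypothetical names, not the actual cds-access-client code): if a single-threaded client actor issues a blocking request from its own replay path, the reply can only be processed by the very thread that is blocked, so the caller just sits there until its request timer fires.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class BlockingReplaySketch {
    // Stand-in for the single-threaded client actor (illustration only).
    private static final ExecutorService CLIENT_ACTOR = Executors.newSingleThreadExecutor();

    // Stand-in for the replay path that seals a chained transaction by
    // sending a request and blocking for the reply.
    private static String replayAndSeal() throws Exception {
        CompletableFuture<String> reply = new CompletableFuture<>();

        // The backend reply would be dispatched back onto the client actor,
        // but that single thread is the one about to block below.
        CLIENT_ACTOR.execute(() -> reply.complete("sealed"));

        // Blocking wait on the actor thread: the completion task queued above
        // cannot run until this method returns, so we hit the timeout instead.
        return reply.get(2, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        try {
            System.out.println(CLIENT_ACTOR.submit(BlockingReplaySketch::replayAndSeal).get());
        } catch (Exception e) {
            // Expected: an ExecutionException wrapping a TimeoutException.
            System.out.println("replay blocked until the timer fired: " + e.getCause());
        } finally {
            CLIENT_ACTOR.shutdownNow();
        }
    }
}

Under that reading, the fluctuating "Timed out after ..." values would simply reflect whenever the request timer happened to give up, rather than a fixed 120s reconnect budget.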

Comment by Vratko Polak [ 28/Jun/17 ]

In releng, this usually happens in 3 out of 4 test cases of the Leader Stability suite. On Sandbox I have seen it fail only 1 test case out of 4. That is for codebase [5] (with cherry-pick [6]).

The test case runs transaction writers (duration 40s) on each member; member-1 is the original leader, member-2 is the new leader (by make-leader-local), member-3 contains the listener, and the writer fails [7] there ~220s after the leader movement (which itself took 10s).
The karaf.log [8] of member-3 is huge, as it uses debug verbosity for several loggers; the failed test case executes between 12:17:33,015 and 12:21:35,274.

[5] https://git.opendaylight.org/gerrit/#/c/59604/1
[6] https://git.opendaylight.org/gerrit/#/c/59575/1
[7] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/9/log.html.gz#s1-s2-t3-k2-k12
[8] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ls-only-carbon/9/odl3_karaf.log.gz

Comment by Vratko Polak [ 29/Jun/17 ]

When trying to reproduce CONTROLLER-1729 on Sandbox, I encountered this instead in a longevity job. The timeout value was 121.21s [9]. The log [10] is less verbose than in other Sandbox runs, as the test passes several iterations (16 in this case), making the compressed logs too large for upload.

This test uses one (randomly placed) writer and make-leader-local to move the leader. The failure happened when the writer was on the member which stayed a follower.

[9] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/2/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10
[10] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/2/odl3_karaf.log.gz

Comment by Robert Varga [ 29/Jun/17 ]

https://git.opendaylight.org/gerrit/59654
