[CONTROLLER-1723] RequestTimeoutException after ~250s after removing module shard replica Created: 16/Jun/17 Updated: 25/Jul/23 Resolved: 12/Jul/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8704 |
| Description |
|
This appeared in a test case which was previously failing on 

The scenario uses module-based shards (tell-based protocol) with one transaction writer running on each member and a DTCL present on the shard leader node. (If the listener is on the follower node, shard replica removal has not led to failures, only 

The writer response [0] is:

	at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)
	at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
	at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:102)
	at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:222)
	at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:46)
	at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:126)
	at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
	at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
	at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
	at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)
	at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$4(AbstractProxyTransaction.java:473)
	at org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction.lambda$handleReplayedRemoteRequest$12(RemoteProxyTransaction.java:442)
	at org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:372)
	at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.runTimer(AbstractClientConnection.java:303)
	at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:121)
	at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:45)
	at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:59)
	at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
	at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
	at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
	at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
	at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
	at akka.actor.ActorCell.invoke(ActorCell.scala:495)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 253.973720644seconds
	... 21 more

This Bug shares similarities with multiple known bugs:
- The writer response is almost identical to 
- It is not known whether ~250 seconds is a stable or a fluctuating value. If it is fluctuating, this may be a less severe case of 
- Also, if RequestTimeoutException can happen with a duration significantly longer than 120s, it may explain 

Update: A Sandbox run with verbose logs encountered [1] this with a duration of 133 seconds.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/750/log.html.gz#s1-s36-t1-k2-k15-k3-k1-k1 |
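For context, a minimal sketch of what such a transaction writer does, assuming the Carbon-era MD-SAL DOM broker API; the class, wiring and arguments are hypothetical placeholders, not the actual test code:

    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
    import org.opendaylight.controller.md.sal.dom.api.DOMDataBroker;
    import org.opendaylight.controller.md.sal.dom.api.DOMDataWriteTransaction;
    import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;
    import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;

    // Hypothetical writer: submits one write and waits for the commit outcome.
    class SimpleWriter {
        private final DOMDataBroker broker;

        SimpleWriter(final DOMDataBroker broker) {
            this.broker = broker;
        }

        void writeOnce(final YangInstanceIdentifier path, final NormalizedNode<?, ?> data) {
            final DOMDataWriteTransaction tx = broker.newWriteOnlyTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, path, data);
            try {
                // Blocks until the backend commits or fails the transaction.
                tx.submit().checkedGet();
            } catch (TransactionCommitFailedException e) {
                // When RequestTimeoutException fires in the client actor, it
                // surfaces here as the cause, producing the stack trace quoted
                // above as the writer response [0].
            }
        }
    }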
| Comments |
| Comment by Vratko Polak [ 16/Jun/17 ] |
|
Also similarly to |
| Comment by Vratko Polak [ 21/Jun/17 ] |
|
"Timed out" value fluctuates. Recently seen: 370s [2] and 122s [3]. [2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/log.html.gz#s1-s36-t1-k2-k15-k3 |
| Comment by Vratko Polak [ 22/Jun/17 ] |
|
More comments on the fluctuating values. Looking at karaf.log [4] of the faulty node, these might be two different bugs.

> 370s [2]

Listener was on member-1. The member-2 client reconnected only once (at 09:33:29,624), but it took 6 minutes to do so.

> 122s [3]

Listener was on member-2. Here the member-2 client reconnected at 09:36:06,681 after 25 seconds, but at 09:37:11,241 it decided to reconnect again (without explicit reason) in less than 2 milliseconds. |
| Comment by Robert Varga [ 27/Jun/17 ] |
|
Right, this is the log message:

2017-06-21 09:33:29,624 | INFO | ult-dispatcher-2 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.98:2550/user/shardmanager-config/member-1-shard-default-config#253026416], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.80:2550/user/shardmanager-config/member-3-shard-default-config#-675495156], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}} in 6.006 min

6 minutes is a lot of time, indicating we have either a lot of transactions or we are hitting throttling. Can we have debug logs to go with this one? |
| Comment by Robert Varga [ 27/Jun/17 ] |
|
I think debugs on org.opendaylight.controller.cluster.access.client should be sufficient. |
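For reference, that can be switched on at runtime from the Karaf console (a standard Karaf logging command, nothing ODL-specific):

    log:set DEBUG org.opendaylight.controller.cluster.access.client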
| Comment by Robert Varga [ 28/Jun/17 ] |
|
Based on the reported backend info, we are dealing with remote->remote reconnect. |
| Comment by Robert Varga [ 28/Jun/17 ] |
|
I think this boils down to AbstractProxyTransaction.ensureSealed() being called from the replay context, which calls down to RemoteProxyTransaction.doSeal(), which, in the case of chained transactions, will call sendRequest(), and that call blocks. |
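A simplified sketch of the call chain being described; the names mirror the real classes, but the bodies are illustrative only, not the actual implementation:

    // Illustrative only: a blocking send inside doSeal() stalls the very
    // thread that must process the reply.
    abstract class AbstractProxyTransaction {
        final void ensureSealed() {
            // Invoked from the connection's replay/timer context, i.e. on
            // the client actor thread that is re-sending queued requests
            // to the new backend after a reconnect.
            doSeal();
        }

        abstract void doSeal();
    }

    class RemoteProxyTransaction extends AbstractProxyTransaction {
        @Override
        void doSeal() {
            // For chained transactions this issues a request and waits for
            // the reply; since the reply can only be delivered by the
            // thread blocked here, the request eventually times out.
            sendRequest();
        }

        private void sendRequest() {
            // Placeholder for the blocking send described above.
        }
    }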
| Comment by Vratko Polak [ 28/Jun/17 ] |
|
In releng, this usually happens in 3 out of 4 test cases of the Leader Stability suite. On Sandbox I have seen it fail only 1 test case out of 4. That is for codebase [5] (with cherry-pick [6]). The test case runs transaction writers (duration 40s) on each member; member-1 is the original leader, member-2 is the new leader (by make-leader-local), member-3 contains the listener, and the writer fails [7] there ~220s after the leader movement (which itself took 10s).

[5] https://git.opendaylight.org/gerrit/#/c/59604/1 |
| Comment by Vratko Polak [ 29/Jun/17 ] |
|
When trying to reproduce 

This test uses one (randomly placed) writer and make-leader-local to move the leader. The failure happened when the writer was on the member which stayed a follower.

[9] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/2/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10 |
| Comment by Robert Varga [ 29/Jun/17 ] |