[CONTROLLER-1674] Timeout waiting for task in producer when moving shard leader Created: 15/May/17  Updated: 25/Jul/23  Resolved: 03/Jul/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak
Assignee: Robert Varga
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1668 Tell based protocol doesnt time out t... Resolved
External issue ID: 8445

 Description   

This happened [0] in a suite for module-based shards with tell-based protocol.

Karaf.log [1] shows an unexpected UnreachableMember at 06:17:00,808, followed by many warnings like this one:
2017-05-14 06:17:16,519 | WARN | ult-dispatcher-2 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | member-2-shard-default-config: Unknown history for commited transaction member-2-datastore-config-fe-0-chn-2-txn-2-0, ignoring

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/log.html.gz#s1-s29-t1-k2-k9-k1
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/684/archives/odl2_karaf.log.gz



 Comments   
Comment by Vratko Polak [ 17/May/17 ]

Today, the writer timed out [2] at the restconf level, without returning a timeout error of its own. Karaf.log [3] shows no suspicious messages within the test case duration (between 03:01:22,254 and 03:02:52,753).

Note that this run still did not include the potential fix [4].

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/692/archives/log.html.gz#s1-s24-t1-k2-k8
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/692/archives/odl2_karaf.log.gz
[4] https://git.opendaylight.org/gerrit/56874

Comment by Vratko Polak [ 18/May/17 ]

This is still happening. The good news is that now only one writer reports this error [5]; the others pass (albeit with lower performance, as they wait for submits).

[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/694/archives/log.html.gz#s1-s36-t3-k2-k13

Comment by Vratko Polak [ 18/May/17 ]

Something similar [6] happened in another scenario. The previous scenario has a listener and moves the leader gracefully; this one has no listener but isolates the leader for a brief time. Both scenarios use module-based shards with the tell-based protocol and run 3 transaction writers, 2 of which now finish correctly.

But the exceptions are slightly different. The graceful movement response starts with:
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"java.util.concurrent.TimeoutException: Timeout waiting for task.\n\tat com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269)\n\tat com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)\n\tat com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69)\n\tat org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.get(MappingCheckedFuture.java:85)\n\tat

The brief isolation response starts with (note the RequestTimeoutException after 144970935433 ns, roughly 145 seconds):
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"java.util.concurrent.ExecutionException: TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 144970935433ns]]}

\n\tat com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)\n\tat com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:272)\n\tat com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.maybeFinish(WriteTransactionsHandler.java:238)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.run(WriteTransactionsHandler.java:112)\n\tat

[6] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/694/archives/log.html.gz#s1-s28-t1-k2-k25-k1-k2-k3-k1-k1

Comment by Viera Zelcamova [ 29/May/17 ]

Vratko, please add a comment when you have a new run, thank you

Comment by Vratko Polak [ 29/May/17 ]

This is still happening. In a recent Sandbox test [7] (which failed to upload the archive with logs), one writer failed with "TimeoutException: Timeout waiting for task" (as opposed to RequestTimeoutException), one finished successfully, and one overstepped the generous (440 s) timeout for the Restconf response.

This was the brief isolation scenario (module-based shards, tell-based protocol), with [8] codebase.

[7] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon/6/robot/report/log.html#s1-s10-t1-k2-k25-k1-k1
[8] https://git.opendaylight.org/gerrit/#/c/57698/3

Comment by Vratko Polak [ 02/Jun/17 ]

2017-06-01 23:37:21,958 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Local_Leader_Shutdown
2017-06-01 23:37:27,553 | INFO | qtp10706451-296 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Removing replica for shard default memberName member-2, datastoreType Config
2017-06-01 23:37:27,569 | INFO | lt-dispatcher-19 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Successfully removed replica for shard default
2017-06-01 23:37:36,439 | INFO | ult-dispatcher-4 | tActorServerConfigurationSupport | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-2-shard-default-config: ServerConfigurationPayload [serverConfig=[ServerInfo [id=member-1-shard-default-config, isVoting=true], ServerInfo [id=member-3-shard-default-config, isVoting=true]]] has been successfully replicated to a majority of followers
2017-06-01 23:37:36,445 | INFO | ult-dispatcher-4 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-default-config, leaderId=null, leaderPayloadVersion=5]
2017-06-01 23:37:38,659 | INFO | lt-dispatcher-32 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} indicated no leadership, reconnecting it
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557] is not the current leader
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:455)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:293)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-06-01 23:37:38,662 | INFO | lt-dispatcher-32 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} reconnecting as ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}}
2017-06-01 23:37:38,662 | INFO | lt-dispatcher-32 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: refreshing backend for shard 0
2017-06-01 23:37:38,703 | WARN | monPool-worker-2 | AbstractShardBackendResolver | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Connection attempt failed
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$connectShard$2(AbstractShardBackendResolver.java:142)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)[:1.8.0_121]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)[:1.8.0_121]
at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:443)[:1.8.0_121]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)[:1.8.0_121]
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)[:1.8.0_121]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)[:1.8.0_121]
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)[:1.8.0_121]
Caused by: org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557] is not the current leader
at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:436)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:306)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-06-01 23:37:39,688 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-2-shard-default-config (Leader): Term 3 in "AppendEntriesReply [term=3, success=false, followerId=member-3-shard-default-config, logLastIndex=1047, logLastTerm=2, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3]" message is greater than leader's term 2 - switching to Follower
2017-06-01 23:37:39,689 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-2-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3
2017-06-01 23:37:39,689 | INFO | lt-dispatcher-20 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-2-shard-default-config, leaderId=null, leaderPayloadVersion=-1]
2017-06-01 23:37:39,689 | INFO | ult-dispatcher-4 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | RoleChangeNotifier for member-2-shard-default-config , received role change from Leader to Follower
2017-06-01 23:37:39,689 | INFO | ult-dispatcher-4 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | shard-manager-config: Received role changed for member-2-shard-default-config from Leader to Follower
2017-06-01 23:37:40,765 | INFO | lt-dispatcher-22 | aftActorLeadershipTransferCohort | 193 - org.opendaylight.controller.sal-akka-raft - 1.5.1.SNAPSHOT | member-2-shard-default-config: Successfully transferred leadership to null in 4.320 s
2017-06-01 23:37:40,768 | INFO | ult-dispatcher-3 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | Stopping Shard member-2-shard-default-config
2017-06-01 23:37:50,032 | INFO | lt-dispatcher-32 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Capping ReconnectingClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#-579249557], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} throttle delay from 6 to 5 seconds
2017-06-01 23:40:15,167 | ERROR | pool-30-thread-1 | WriteTransactionsHandler | 256 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Write transactions failed.
java.util.concurrent.TimeoutException: Timeout waiting for task.
at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)[57:com.google.guava:18.0.0]
at org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.maybeFinish(WriteTransactionsHandler.java:238)[256:org.opendaylight.controller.samples.clustering-it-provider:1.5.1.SNAPSHOT]
at org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.run(WriteTransactionsHandler.java:112)[256:org.opendaylight.controller.samples.clustering-it-provider:1.5.1.SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_121]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)[:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)[:1.8.0_121]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)[:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_121]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_121]
2017-06-01 23:40:15,367 | INFO | qtp10706451-308 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Adding replica for shard default

Comment by Vratko Polak [ 05/Jun/17 ]

The karaf log segment above is from a Sandbox run against 57996/5 with cherry-picked 57699/5.

The scenario uses a module-based shard (default-config) with the tell-based protocol. The transaction writer is started on the leader member, then remove-shard-replica is called. A similar test case (writer on a follower, still calling remove-shard-replica on the leader) passed: the writer finished without reporting any failure.
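
For reference, the RPC call the test performs looks roughly like this; the input values match those logged above ("Removing replica for shard default memberName member-2, datastoreType Config"), while the exact URL is an assumption based on the Carbon-era restconf operations layout:

POST /restconf/operations/cluster-admin:remove-shard-replica
Content-Type: application/json

{"input": {"shard-name": "default", "member-name": "member-2", "data-store-type": "config"}}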

A similar failure happened on a recent RelEng run [9], but the karaf.log [10] segment between 13:53:54,938 and 13:56:47,140 shows that an UnreachableMember event did occur.

[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/log.html.gz#s1-s20-t1-k2-k10-k1
[10] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/odl1_karaf.log.gz

Comment by Vratko Polak [ 16/Jun/17 ]

This symptom re-appeared; it now happens [11] consistently, but only in one test case. The test uses module-based shards (tell-based protocol) with a single writer started on a follower; cluster-admin:make-leader-local is then called to move the leader to the writer's member. The test expects the writer to finish writing successfully.
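
For context, make-leader-local is invoked roughly like this (input fields per the cluster-admin model; the exact URL is an assumption, as above):

POST /restconf/operations/cluster-admin:make-leader-local
Content-Type: application/json

{"input": {"shard-name": "default", "data-store-type": "config"}}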

The full writer response:
{"errors":{"error":[

{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"java.util.concurrent.TimeoutException: Timeout waiting for task.\n\tat com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269)\n\tat com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.maybeFinish(WriteTransactionsHandler.java:234)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.run(WriteTransactionsHandler.java:112)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:748)\n"}

]}}

[11] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/750/log.html.gz#s1-s24-t5-k2-k10

Comment by Vratko Polak [ 21/Jun/17 ]

> consistently but only in one test case

Now I would say it happens 50% of the time, still in only one [12] test case.
The response message is now more specific: "Final submit was timed out by the test provider or was interrupted".

The writer took 160 seconds to respond.

A few suspicious lines from the writer member's karaf.log [13], starting at 07:45:22,131, follow. The first one is especially confusing: the connection is being reconnected to the new (local, member-2) leader, yet it still reports a NotLeaderException from the previous (remote, member-1) leader.

2017-06-21 07:45:38,519 | INFO | lt-dispatcher-20 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-2-shard-default-config#494504499], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} indicated no leadership, reconnecting it
org.opendaylight.controller.cluster.access.commands.NotLeaderException: Actor Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-1667137708] is not the current leader

2017-06-21 07:45:38,624 | ERROR | lt-dispatcher-22 | OneForOneStrategy | 174 - com.typesafe.akka.slf4j - 2.4.18 | Transaction member-2-datastore-config-fe-0-chn-4-txn-5365-0 has already been sealed
java.lang.IllegalStateException: Transaction member-2-datastore-config-fe-0-chn-4-txn-5365-0 has already been sealed
at com.google.common.base.Preconditions.checkState(Preconditions.java:197)[57:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.checkNotSealed(AbstractProxyTransaction.java:363)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]

2017-06-21 07:45:38,657 | INFO | ult-dispatcher-5 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-2-frontend-datastore-config: Ignoring unknown response Envelope{sessionId=1, txSequence=1, message=TransactionPurgeResponse{target=member-2-datastore-config-fe-0-chn-4-txn-5346-0, sequence=0}}

2017-06-21 07:48:18,136 | WARN | pool-30-thread-1 | WriteTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Future #5362/7971 not completed yet

[12] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/log.html.gz#s1-s24-t5-k2-k10
[13] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/756/odl2_karaf.log.gz

Comment by Vratko Polak [ 26/Jun/17 ]

> the connection is now reconnected to the new (local, member-2) leader,
> but it is still reporting NotLeaderException from the previous (remote, member-1) leader.

Comparing the log messages to the code, I think this is a case of multiple FailureEnvelopes arriving from the same old backend.

Specifically, this [14] line confirms that the connection to the new backend is established, but then this [15] line shows a repeated message from the old backend causing a reconnect on the new connection.

In karaf.log [16] both log messages occur at 09:14:37,757.

In theory, spurious reconnects should not cause system test failures, so there is something to fix in other parts of the code as well.
But limiting reconnects (for example, by carrying a sessionId in FailureEnvelope and comparing it to the current value) would be helpful; see the sketch below.
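
A minimal sketch of the suggested guard, assuming hypothetical method and field names (onFailureEnvelope, currentConnection, reconnectConnection; the real cds-access-client API may differ):

// Hypothetical guard inside the client's failure handling: compare the
// envelope's sessionId against the session of the currently connected
// backend, and drop stale failures instead of reconnecting again.
private void onFailureEnvelope(final FailureEnvelope envelope) {
    final long currentSessionId = currentConnection.getBackendInfo().getSessionId();
    if (envelope.getSessionId() != currentSessionId) {
        // A FailureEnvelope from an already-replaced backend: log and drop,
        // so it cannot trigger a reconnect of the new connection.
        LOG.debug("{}: ignoring failure from stale session {}", persistenceId(),
            envelope.getSessionId());
        return;
    }
    // Genuine failure from the current backend: proceed with reconnect.
    reconnectConnection(currentConnection);
}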

A cosmetic remark: ReconnectingClientConnection probably only needs to know the maxMessages value; printing the whole ShardBackendInfo pointing to the old backend is slightly confusing, as that is not the backend ODL is likely to connect to.

[14] https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/cds-access-client/src/main/java/org/opendaylight/controller/cluster/access/client/ClientActorBehavior.java;h=554ffe97c77ae7c3011daca2a28c22f660a20e78;hb=refs/heads/stable/carbon#l307
[15] https://git.opendaylight.org/gerrit/gitweb?p=controller.git;a=blob;f=opendaylight/md-sal/cds-access-client/src/main/java/org/opendaylight/controller/cluster/access/client/ClientActorBehavior.java;h=554ffe97c77ae7c3011daca2a28c22f660a20e78;hb=refs/heads/stable/carbon#l177
[16] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/761/odl2_karaf.log.gz

Comment by Robert Varga [ 26/Jun/17 ]

https://git.opendaylight.org/gerrit/59527

Comment by Vratko Polak [ 29/Jun/17 ]

After the rework of the transaction writer, the error message [17] is different, but I believe it is the same bug. This time it happened on RelEng when the old leader received remove-shard-replica and the single writer was on a member which stayed a follower.

The new writer response:
{"errors":{"error":[

{"error-type":"application","error-tag":"operation-failed","error-message":"Final submit was timed out by the test provider or was interrupted","error-info":"java.util.concurrent.TimeoutException: Collection did not finish in 300 seconds\n\tat org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler.checkComplete(AbstractTransactionHandler.java:187)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:748)\n"}

]}}

The karaf.log [18] reports 2011 futures as not completed; nothing else is visible at INFO level.

[17] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/765/log.html.gz#s1-s20-t3-k2-k8
[18] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/765/odl1_karaf.log.gz
