[CONTROLLER-1700] Timeout waiting for task from writer started after heal after long isolation Created: 29/May/17  Updated: 25/Jul/23  Resolved: 26/Jun/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1665 C: write-transactions does not return... Resolved
is blocked by CONTROLLER-1687 Transaction writer does not fail expe... Resolved
External issue ID: 8562

 Description   

This symptom is very similar to CONTROLLER-1674 but happens in a different scenario.

The setup is again three writers (module-based shards, tell-based protocol), and the leader is isolated for longer than the request timeout. The writer on the isolated member fails (with TimeoutException instead of RequestTimeoutException; that might be a separate bug or a cause of this one). Then the member is rejoined and it is verified that each shard has one Leader and two Followers.

At the end of the scenario, we start a writer on the rejoined node and expect it to finish writing without errors. Instead, it finishes after 67 seconds with a TimeoutException. The other two writers finish correctly.

In a recent Sandbox test [0] (which failed to upload the archive with logs), the response starts with:
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"java.util.concurrent.TimeoutException: Timeout waiting for task.\n\tat com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:269)\n\tat com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)\n\tat com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69)\n\tat org.opendaylight.yangtools.util.concurrent.MappingCheckedFuture.get(MappingCheckedFuture.java:85)\n\tat com.google.common.util.concurrent.AbstractCheckedFuture.checkedGet(AbstractCheckedFuture.java:107)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.ensureListExists(WriteTransactionsHandler.java:146)\n\tat org.opendaylight.controller.clustering.it.provider.impl.WriteTransactionsHandler.start(WriteTransactionsHandler.java:126)\n\tat org.opendaylight.controller.clustering.it.provider.MdsalLowLevelTestProvider.writeTransactions(MdsalLowLevelTestProvider.java:250)\n\tat org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)\n\tat

[0] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon/6/robot/report/log.html#s1-s10-t3-k2-k25-k1-k8
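For context, the stack trace above shows WriteTransactionsHandler.ensureListExists blocking on a Guava future that never completes on the rejoined member. Below is a minimal, self-contained sketch (not the controller code; the class name is made up and the 67-second bound is only taken from the observed duration, not from any configured timeout) of how a bounded wait on such a future surfaces a TimeoutException that RESTCONF then reports as "operation-failed":

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.google.common.util.concurrent.SettableFuture;

public class WriterTimeoutSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the commit future the writer waits on; on the
        // rejoined member it never completes, so the wait times out.
        SettableFuture<Void> commitFuture = SettableFuture.create();
        try {
            // Bounded wait; 67 seconds mirrors the duration observed in the
            // test, not a known controller setting.
            commitFuture.get(67, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Surfaces to the caller and ends up in the RESTCONF response as
            // error-tag "operation-failed" with the exception in error-info.
            System.err.println("Timeout waiting for task: " + e);
        }
    }
}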



 Comments   
Comment by Vratko Polak [ 05/Jun/17 ]

The same behavior is now seen [1] in production Jenkins.

Karaf.log [2] from the isolated member-1 shows two suspicious messages.
First, there was an additional UnreachableMember at 14:39:41,438 even though the test started isolation only at 14:39:41,475 (but the difference is small enough that it might be explained by imperfect clock synchronization and asynchronous logging).

Second, there was a failure from the second writer (started after rejoin). If it is an error in the writer, we can change the suite to use a different list item id.

2017-06-04 14:44:12,349 | WARN | qtp1030155268-77 | WriteTransactionsHandler | 256 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Unable to ensure IdInts list for id: prefix-1 exists.
java.util.concurrent.TimeoutException: Timeout waiting for task.

[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/log.html.gz#s1-s28-t3-k2-k25-k1-k1
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/odl1_karaf.log.gz

Comment by Robert Varga [ 05/Jun/17 ]

The test failed because the previous test case ended up not cleaning up properly:

2017-06-04 13:40:57,384 | ERROR | ult-dispatcher-7 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Failed to remove replica for shard /(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-ints
org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-1-shard-id-ints!-config currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.createNoShardLeaderException(ShardManager.java:944)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.onShardNotInitializedTimeout(ShardManager.java:772)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.shardmanager.ShardManager.handleCommand(ShardManager.java:251)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.1.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]

Which means the frontend was instantiated multiple times:

2017-06-04 13:40:57,486 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Remove_Follower_Prefix_Shard_Replica_And_Add_It_Back
2017-06-04 13:41:17,996 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Produce_Transactions_One_Node_Leader
2017-06-04 13:41:19,125 | ERROR | lt-dispatcher-26 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-1-frontend-datastore-Shard-id-ints!: current generation ClientIdentifier{frontend=member-1-frontend-datastore-Shard-id-ints!, generation=0} has been superseded
org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1
at org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:402)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:466)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:293)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.1.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.1.SNAPSHOT]
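RetiredGenerationException indicates that the shard's per-frontend state already records a newer client generation than the one the incoming request carries. A rough, hypothetical sketch of that kind of guard follows (names and the exception type are illustrative; the real check is in Shard.getFrontend(), per the trace above, and throws RetiredGenerationException):

// Hypothetical illustration of a per-frontend generation guard.
final class FrontendGenerationGuardSketch {
    private long lastSeenGeneration = -1;

    void checkGeneration(final long requestGeneration) {
        if (requestGeneration < lastSeenGeneration) {
            // A request from an older client instance is rejected once a
            // newer generation has been observed for the same frontend.
            throw new IllegalStateException("Originating generation "
                + requestGeneration + " was superseded by " + lastSeenGeneration);
        }
        lastSeenGeneration = requestGeneration;
    }
}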

Further down, it led to a failure to bind the producer:

2017-06-04 13:43:20,754 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Produce_Transactions
2017-06-04 13:43:22,169 | ERROR | tp1878068729-321 | ContainerResponse | 68 - com.sun.jersey.jersey-server - 1.17.0 | The RuntimeException could not be mapped to a response, re-throwing to the HTTP container
java.lang.IllegalArgumentException: Subtree DOMDataTreeIdentifier{datastore=CONFIGURATION, root=/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-ints/id-int/id-int[{(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id=prefix-1}]} is attached to producer org.opendaylight.mdsal.dom.broker.ShardedDOMDataTree​Producer@517147f9
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145)[57:com.google.guava:18.0.0]
at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTree.createProducer(ShardedDOMDataTree.java:144)
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree.createProducer(DistributedShardedDOMDataTree.java:302)
at Proxy6a0a520a_b713_4500_a5c7_f00e5cc1496b.createProducer(Unknown Source)
at Proxycb991ad6_d042_4d13_a04c_2afd2767f9f4.createProducer(Unknown Source)
at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.fillInitialList(ProduceTransactionsHandler.java:134)
at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.start(ProduceTransactionsHandler.java:115)
at org.opendaylight.controller.clustering.it.provider.MdsalLowLevelTestProvider.produceTransactions(MdsalLowLevelTestProvider.java:562)
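The IllegalArgumentException comes from a Preconditions.checkArgument guard in ShardedDOMDataTree.createProducer that refuses a subtree still attached to an earlier producer. A minimal hypothetical sketch of such a guard (not the mdsal implementation; class and field names are made up) shows why a producer left over from the failed cleanup blocks the next Produce_Transactions test case:

import java.util.HashSet;
import java.util.Set;

import static com.google.common.base.Preconditions.checkArgument;

// Hypothetical sketch only: a registry that rejects creating a second
// producer for a subtree that is still bound to an earlier one.
final class ProducerRegistrySketch {
    private final Set<String> attachedSubtrees = new HashSet<>();

    synchronized void createProducer(final String subtree) {
        checkArgument(!attachedSubtrees.contains(subtree),
            "Subtree %s is attached to producer", subtree);
        attachedSubtrees.add(subtree);
    }

    synchronized void closeProducer(final String subtree) {
        attachedSubtrees.remove(subtree);
    }
}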

Comment by Tomas Cere [ 06/Jun/17 ]

(In reply to Robert Varga from comment #2)

> Which means the frontend was instantiated multiple times:
>

I checked the logs, and I don't see how the frontend can be instantiated twice.
The only place where the datastoreClient is created is when we plug the shard frontend into the ShardingService, and this happens only in the create-shard RPC.
When the replica is removed, the frontend still remains and should reconnect, but both of the removals fail because there is no shard leader present.

Once the removal of the replica fails, the test doesn't attempt to add the replica back. But even if it did, adding a replica of a shard does not create a new frontend, since that is completely separate from the layer that handles the interaction with the ShardingService.
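To restate the lifecycle point, here is a deliberately simplified, hypothetical sketch (these are not the actual controller classes): the client/frontend is created once when the shard is plugged into the sharding service, while replica add/remove operates on a separate layer and never recreates it.

// Hypothetical, simplified model of the lifecycle described above; the real
// classes (DistributedShardedDOMDataTree, ClusterAdminRpcService, ...) are
// considerably more involved.
final class ShardLifecycleSketch {
    static final class DataStoreClient { }

    private DataStoreClient client;

    // Corresponds to the create-shard RPC: the only place the client
    // (frontend) is instantiated.
    void createShard() {
        client = new DataStoreClient();
    }

    // Replica membership changes happen on a separate layer; the client
    // created above is left untouched and is expected to reconnect.
    void removeReplica() { /* no client change */ }

    void addReplica() { /* no client change */ }
}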

Comment by Vratko Polak [ 12/Jun/17 ]

A similar symptom (except that the restconf call did not finish at all, instead of returning a TimeoutException) happened together with CONTROLLER-1708 and CONTROLLER-1718 in a recent RelEng test [3].

[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/739/log.html.gz#s1-s28-t3-k2-k25-k1-k8
