[CONTROLLER-1664] C: OutOfOrderRequestException: Expecting request 0 (while create-prefix-shard) Created: 09/May/17  Updated: 25/Jul/23  Resolved: 26/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1653 C: UnknownHistoryException: Last know... Resolved
blocks CONTROLLER-1682 write-transactions: LocalReadWritePro... Resolved
blocks CONTROLLER-1686 Shards fail to settle after brief iso... Resolved
is blocked by CONTROLLER-1695 tell-based-protocol: ConcurrentModifi... Resolved
Duplicate
is duplicated by CONTROLLER-1682 write-transactions: LocalReadWritePro... Resolved
is duplicated by CONTROLLER-1684 OutOfOrderRequestException when remov... Resolved
External issue ID: 8402

 Description   

https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-4th/5/archives/odl1_karaf.log.gz

At the bottom of the log file:
2017-05-09 05:02:42,423 | WARN | monPool-worker-0 | PrefixShardHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.0.SNAPSHOT | Shard[/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-ints] creation failed:
org.opendaylight.controller.cluster.sharding.DOMDataTreeShardCreationFailedException: Unable to create a cds shard.
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree$2.onFailure(DistributedShardedDOMDataTree.java:366)
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:902)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.Futures$ChainingListenableFuture.run(Futures.java:902)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[57:com.google.guava:18.0.0]
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)[57:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$1(AbstractProxyTransaction.java:353)
at org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.receiveResponse(AbstractClientConnection.java:300)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onResponse(ClientActorBehavior.java:143)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onRequestFailure(ClientActorBehavior.java:155)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.internalOnRequestFailure(ClientActorBehavior.java:179)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:124)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:42)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:59)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.17]
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.Actor$class.aroundReceive(Actor.scala:497)[173:com.typesafe.akka.actor:2.4.17]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
Caused by: org.opendaylight.controller.cluster.access.commands.OutOfOrderRequestException: Expecting request 0
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.checkRequestSequence(LeaderFrontendState.java:96)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:182)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:450)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:283)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.17]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.0.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.17]
... 14 more



 Comments   
Comment by Peter Gubka [ 09/May/17 ]

The same exception happened for the write-transactions RPC in the remote-to-remote explicit leader movement scenario:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/12/archives/log.html.gz#s1-s1-t2-k2-k8

The transaction producer was on node3 and the default shard leader moved from node1 to node2.

Node1 logs:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/12/archives/odl1_karaf.log.gz
2017-05-09 04:57:22,379 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Remote_To_Remote_Movement

Comment by Peter Gubka [ 09/May/17 ]

The same HTTP status 500 with "OutOfOrderRequestException: Expecting request 0" is returned in the clean leader shutdown scenario when remove-shard-replica is used.
The replica is removed from the leader node, and it does not matter where the producer runs (same node or a different one).

Robot and logs:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/8

See logs after
2017-05-09 04:53:22,244 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Local_Leader_Shutdown
or
2017-05-09 04:54:45,023 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Remote_Leader_Shutdown

Comment by Robert Varga [ 09/May/17 ]

I'll need debug logs for org.opendaylight.controller.cluster.databroker.actors and org.opendaylight.controller.cluster.datastore.

Comment by Peter Gubka [ 10/May/17 ]

(In reply to Robert Varga from comment #3)
> I'll need debug logs for
> org.opendaylight.controller.cluster.databroker.actors and
> org.opendaylight.controller.cluster.datastore.

For simplicity I tried to reproduce the explicit leader movement scenario with module shards. I reduced the suite to just one test case.

Without debug logging
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/10/
the exception is received as expected.

With debug logging enabled I have not succeeded in reproducing the problem.
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/11/
write-transactions just timed out on the Robot side, and it generated huge logs.

Locally, when I switch on debug logging, I am not able to reproduce it either.

Comment by Robert Varga [ 12/May/17 ]

Aside from the usual replay issues, there seems to be a problem with concurrent access:

java.util.ConcurrentModificationException
at java.util.ArrayDeque$DeqIterator.next(ArrayDeque.java:643)[:1.8.0_121]
at com.google.common.collect.Iterators$5.next(Iterators.java:558)[57:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.replayMessages(AbstractProxyTransaction.java:507)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory$ReconnectCohort.replayRequests(ProxyHistory.java:232)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractClientHistory$1.replayRequests(AbstractClientHistory.java:310)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractDataStoreClientBehavior.lambda$connectionUp$0(AbstractDataStoreClientBehavior.java:147)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.backendConnectFinished(ClientActorBehavior.java:280)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.lambda$null$3(ClientActorBehavior.java:329)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:118)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:42)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:59)[197:org.opendaylight.controller.cds-access-client:1.1.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.17]
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.Actor$class.aroundReceive(Actor.scala:497)[173:com.typesafe.akka.actor:2.4.17]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.17]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.17]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.17]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.17]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.17]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]

This wrecks the replay, hence we are probably missing some messages. A concurrent modification should not be occurring, as the queue should be protected by locks in this code path.
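
For context, ArrayDeque's iterator is fail-fast, so any structural modification made while the replay loop is walking the queue blows up immediately. A minimal standalone snippet (plain Java, not controller code) that reproduces the same failure mode:

import java.util.ArrayDeque;
import java.util.Queue;

public class DequeIterationDemo {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();
        queue.add("request-0");
        queue.add("request-1");

        // Adding to the deque while iterating it is a structural modification;
        // the fail-fast iterator detects it and throws
        // ConcurrentModificationException on the next next() call.
        for (String entry : queue) {
            queue.add("replayed-" + entry);
        }
    }
}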

Comment by Robert Varga [ 14/May/17 ]

This affects prefix-based shards with RESTCONF's POST operation and requires the introduction of a new message in the protocol, hence I am raising the priority.

Note this affects both creation and normal operations on those shards, as well as the data store with tell-based protocol.

Comment by Robert Varga [ 14/May/17 ]

https://git.opendaylight.org/gerrit/57028

Comment by Robert Varga [ 16/May/17 ]

Transmit queue fix: https://git.opendaylight.org/gerrit/57152

Comment by A H [ 17/May/17 ]

We are looking to build Carbon RC2 tomorrow, 5/18 at 23:59 UTC, assuming there are no blocker bugs. Is there an ETA for when a fix can be merged and this bug resolved for the stable/carbon branch?

Comment by A H [ 17/May/17 ]

Robert, do we need to merge this patch for Carbon Blocker CONTROLLER-1664? https://git.opendaylight.org/gerrit/#/c/57289/

Also, it appears multiple patches have been assigned to this Bug ID. Do you have a list of all relevant patches that must be merged to fix CONTROLLER-1664?

Comment by Robert Varga [ 18/May/17 ]

An, I have split what was formerly the second patch into three for easier review:

https://git.opendaylight.org/gerrit/57289
https://git.opendaylight.org/gerrit/57293
https://git.opendaylight.org/gerrit/57028

I have updated the spreadsheet with the patch IDs and added their sequence to the notes for easy reference.

Comment by Peter Gubka [ 19/May/17 ]

The exception is present in the logs, e.g. https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/698/archives/odl2_karaf.log.gz

2017-05-19 09:41:13,386 | ERROR | lt-dispatcher-25 | DistributedShardedDOMDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | Removal of shard DOMDataTreeIdentifier{datastore=CONFIGURATION, root=/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-ints} from configuration failed.
org.opendaylight.controller.cluster.access.commands.OutOfOrderRequestException: Expecting request 0

But it was not propagated to Robot via RESTCONF responses; I haven't found any in the Robot log.
https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/698/archives/log.html.gz

Comment by Peter Gubka [ 23/May/17 ]

The presence of the OutOfOrderRequestException: Expecting request 0 is here:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/6/archives/odl2_karaf.log.gz

Now it is not happening during the prefix shard creation, but during the prefix shard removal.
Test details:
1) Prefix shard created and produce-transactions started on shard leader node
2) Shard leader node isolated
3) The produce-transactions RPC should have returned an error, but it did not
4) Isolated node rejoined
5) Produce-transactions finished with TimeoutException
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-2nd/6/archives/log.html.gz#s1-s2-t1-k2-k21-k4
6) Prefix shard removal was invoked (and the OutOfOrderRequestException appeared on node2)

The logs have debug logging enabled for the loggers org.opendaylight.controller.cluster.access.client and org.opendaylight.controller.cluster.databroker.actors.dds.

Comment by Robert Varga [ 23/May/17 ]

Okay, the logs now shed some more light on this.

There are two issues here:

1) PrefixedShardConfigWriter is creating a snapshot, invoking an exists check and not closing it. This causes the snapshot to live for the duration of the transaction chain, across all reconnects, during which it is sending only IncrementTransactionSequenceRequest.

2) IncrementTransactionSequenceRequest does not have a record of who is originating it, i.e. whether it is a read-only snapshot or a transaction.

Hence when the shard is being deleted, we are opening a read-write transaction and sending a request with the correct sequence, but the transaction chain state is incorrect, because it saw only IncrementTransactionSequenceRequest, which allocated a transaction (not a snapshot) that was never closed, leading to a splat:

Caused by: java.lang.IllegalStateException: Transaction ReadWriteShardDataTreeTransaction{id=member-2-datastore-Shard-prefix-configuration-shard-fe-0-chn-1-txn-1-0, closed=false, snapshot=MutableDataTree [modification=NodeModification [identifier=(urn:ietf:params:xml:ns:netconf:base:1.0)data, modificationType=NONE, childModification={}]]} is open

which leaves the backend still expecting request 0 when it receives the second request, which triggers the OutOfOrderRequestException.
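
To make the failing check concrete: conceptually it is a per-transaction expected-sequence counter maintained on the shard leader, and a request arriving with any other sequence number is rejected. The sketch below is only a simplified illustration of that contract; the names and structure are assumptions, not the actual LeaderFrontendState code.

// Simplified illustration only; the real check lives in
// LeaderFrontendState.checkRequestSequence() and uses the controller's own
// exception types.
final class RequestSequenceTracker {
    private long expectedSequence = 0;

    void checkRequestSequence(final long requestSequence) {
        if (requestSequence != expectedSequence) {
            // Surfaces to the client as
            // "OutOfOrderRequestException: Expecting request <expectedSequence>".
            throw new IllegalStateException("Expecting request " + expectedSequence);
        }
        expectedSequence++;
    }
}

In the scenario described above the backend is still expecting request 0 when the second request arrives, so this is the check that fires.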

https://git.opendaylight.org/gerrit/57681 fixes PrefixedShardConfigWriter to clean up the read-only transaction, which will cause the first transaction to be closed/purged and hide the issue.

The second fix involves fixing up IncrementTransactionSequenceRequest and will be posted soon.
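
As a rough illustration of the discipline the first fix enforces (not the actual patch, and not the real cds API; ReadSnapshot, exists() and close() below are assumed names): the snapshot opened for an exists check should be released once the answer is in, so the backend can purge the associated state instead of keeping the transaction chain half-open across reconnects.

import java.util.concurrent.CompletableFuture;

// Hypothetical types standing in for the snapshot API.
interface ReadSnapshot extends AutoCloseable {
    CompletableFuture<Boolean> exists(String path);
    @Override void close();
}

final class ShardConfigReader {
    static boolean shardConfigExists(final ReadSnapshot snapshot, final String path)
            throws Exception {
        try {
            return snapshot.exists(path).get();
        } finally {
            // Releasing the snapshot lets the first transaction be closed/purged.
            snapshot.close();
        }
    }
}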

Comment by Robert Varga [ 23/May/17 ]

https://git.opendaylight.org/gerrit/57692
