[CONTROLLER-1664] C: OutOfOrderRequestException: Expecting request 0 (while create-prefix-shard) Created: 09/May/17 Updated: 25/Jul/23 Resolved: 26/May/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||
| External issue ID: | 8402 |
| Description |
|
At the bottom of the log file: |
| Comments |
| Comment by Peter Gubka [ 09/May/17 ] |
|
The same exception happened for the write-transactions RPC in the remote-to-remote explicit leader movement scenario. The transaction producer was on node3 and the default shard leader moved from node1 to node2. Node1 logs: |
| Comment by Peter Gubka [ 09/May/17 ] |
|
The same HTTP status 500 with "OutOfOrderRequestException: Expecting request 0" is returned for the clean leader shutdown scenario when remove-shard-replica is used. Robot and logs: See logs after |
| Comment by Robert Varga [ 09/May/17 ] |
|
I'll need debug logs for org.opendaylight.controller.cluster.databroker.actors and org.opendaylight.controller.cluster.datastore. |
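For reference, a sketch of how debug logging for those loggers is typically enabled on an OpenDaylight Karaf node, either at runtime from the Karaf console or persistently in etc/org.ops4j.pax.logging.cfg; the property syntax below assumes the log4j 1.x-style configuration shipped with Carbon-era distributions, so treat the exact keys as an assumption and check the file shipped with your release:

    # Runtime, from the Karaf console (not persisted across restarts):
    log:set DEBUG org.opendaylight.controller.cluster.databroker.actors
    log:set DEBUG org.opendaylight.controller.cluster.datastore

    # Persistent, in etc/org.ops4j.pax.logging.cfg:
    log4j.logger.org.opendaylight.controller.cluster.databroker.actors = DEBUG
    log4j.logger.org.opendaylight.controller.cluster.datastore = DEBUG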
| Comment by Peter Gubka [ 10/May/17 ] |
|
(In reply to Robert Varga from comment #3) For simplicity I tried to replicate the explicit leader movement scenario with module shards, and reduced the suite to just one test case, running it both without and with debug logging. With debug logging I have not succeeded in replicating the problem. Locally, when I switch on debug logging, I am not able to replicate it either. |
| Comment by Robert Varga [ 12/May/17 ] |
|
Aside from the usual replay issues, there seems to be a problem with concurrent access: java.util.ConcurrentModificationException. This wrecks the replay, hence we are probably missing some messages. A concurrent modification should not be occurring, as the queue should be protected by locks in this code path. |
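For illustration only, a minimal Java sketch (not the controller's actual queue code; the class and method names are hypothetical) of how this class of failure arises: a fail-fast iterator replaying queued requests can throw ConcurrentModificationException when another thread mutates the queue without taking the same lock.

    import java.util.ArrayDeque;
    import java.util.Iterator;
    import java.util.Queue;

    // Hypothetical illustration of the race: replay() iterates the queue under a
    // lock, but enqueueUnsafely() mutates it without that lock, so the fail-fast
    // iterator may throw ConcurrentModificationException mid-replay.
    public class QueueReplayRace {
        private final Queue<String> queue = new ArrayDeque<>();
        private final Object lock = new Object();

        void enqueueUnsafely(String request) {
            queue.add(request);           // BUG: bypasses the lock
        }

        void enqueueSafely(String request) {
            synchronized (lock) {
                queue.add(request);       // correct: same lock as replay()
            }
        }

        void replay() {
            synchronized (lock) {
                for (Iterator<String> it = queue.iterator(); it.hasNext(); ) {
                    String request = it.next();  // may throw ConcurrentModificationException
                                                 // if enqueueUnsafely() ran concurrently
                    // ... re-send 'request' to the new backend leader ...
                }
            }
        }
    }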
| Comment by Robert Varga [ 14/May/17 ] |
|
This affects prefix-based shards with RESTCONF's POST operation and requires the introduction of a new message into the protocol, hence raising the priority. Note this affects both creation and normal operations on those shards, as well as the data store with the tell-based protocol. |
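For context, the tell-based protocol mentioned here is an opt-in datastore setting; to the best of my knowledge it is toggled in etc/org.opendaylight.controller.cluster.datastore.cfg, but treat the exact key as an assumption and verify against the file shipped with your distribution:

    # etc/org.opendaylight.controller.cluster.datastore.cfg
    # Switches the distributed datastore frontend/backend to the tell-based
    # protocol; Carbon defaults to the ask-based protocol.
    use-tell-based-protocol=true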
| Comment by Robert Varga [ 14/May/17 ] |
| Comment by Robert Varga [ 16/May/17 ] |
|
Transmit queue fix: https://git.opendaylight.org/gerrit/57152 |
| Comment by A H [ 17/May/17 ] |
|
We are looking to build Carbon RC2 tomorrow, 5/18, at 23:59 UTC, assuming there are no blocker bugs. Is there an ETA for when a fix can be merged and this bug resolved for the stable/carbon branch? |
| Comment by A H [ 17/May/17 ] |
|
Robert, do we need to merge this patch for the Carbon blocker? Also, it appears multiple patches have been assigned to this Bug ID. Do you have a list of all relevant patches that must be merged to fix it? |
| Comment by Robert Varga [ 18/May/17 ] |
|
An, I have split what was formerly the second patch into three for easier review: https://git.opendaylight.org/gerrit/57289 I have updated the spreadsheet with the patch IDs and added their sequences into the notes for easy reference. |
| Comment by Peter Gubka [ 19/May/17 ] |
|
The exception is present in the logs, e.g. https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/698/archives/odl2_karaf.log.gz 2017-05-19 09:41:13,386 | ERROR | lt-dispatcher-25 | DistributedShardedDOMDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.Carbon | Removal of shard DOMDataTreeIdentifier {datastore=CONFIGURATION, root=/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-ints} from configuration failed. But it was not propagated to Robot via RESTCONF responses. I haven't found any in the Robot log. |
| Comment by Peter Gubka [ 23/May/17 ] |
|
The OutOfOrderRequestException: Expecting request 0 is present here: Now it is not happening during prefix shard creation, but during prefix shard removal. The logs have debug logging enabled for the org.opendaylight.controller.cluster.access.client loggers. |
| Comment by Robert Varga [ 23/May/17 ] |
|
Okay, the logs now shed some more light on this. There are two issues here:

1) PrefixedShardConfigWriter is creating a snapshot, invoking an exists check and not closing it. This causes the snapshot to exist for the duration of the transaction chain, across all reconnects, when it is sending only IncrementTransactionSequenceRequest.

2) IncrementTransactionSequenceRequest does not have a record of who is originating it, i.e. whether it is a read-only snapshot or a transaction. Hence when the shard is being deleted, we are opening a read-write transaction and sending a request with the correct sequence, but the transaction chain state is incorrect, because it saw only IncrementTransactionSequenceRequest, which allocated a transaction (not a snapshot) that was never closed, leading to a splat:

Caused by: java.lang.IllegalStateException: Transaction ReadWriteShardDataTreeTransaction{id=member-2-datastore-Shard-prefix-configuration-shard-fe-0-chn-1-txn-1-0, closed=false, snapshot=MutableDataTree [modification=NodeModification [identifier=(urn:ietf:params:xml:ns:netconf:base:1.0)data, modificationType=NONE, childModification={}]]} is open

This leaves the backend still expecting request 0 when it receives the second request, which triggers the OutOfOrderRequestException.

https://git.opendaylight.org/gerrit/57681 fixes PrefixedShardConfigWriter to clean up the read-only transaction, which will cause the first transaction to be closed/purged and hide the issue. The second fix involves fixing up IncrementTransactionSequenceRequest and will be posted soon. |
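In other words, the first issue is a leaked read-only handle: a snapshot opened just to answer an exists check has to be closed, otherwise the chain drags it across every reconnect. Below is a hypothetical Java sketch of the buggy versus fixed pattern; the ClientChain and ClientSnapshot interfaces are illustrative stand-ins, not the actual cds-access-client API, and the "fixed" method only approximates the shape of the gerrit/57681 change described above.

    // Illustrative stand-ins for the client-side API; not the real interfaces.
    interface ClientSnapshot extends AutoCloseable {
        boolean exists(String path);
        @Override
        void close();                     // lets the backend purge the snapshot
    }

    interface ClientChain {
        ClientSnapshot newSnapshot();
    }

    final class PrefixConfigCheck {
        // Buggy pattern: the snapshot is never closed, so it lives for the whole
        // transaction chain and is replayed on every reconnect.
        static boolean existsLeaky(ClientChain chain, String path) {
            return chain.newSnapshot().exists(path);
        }

        // Fixed pattern: close the snapshot as soon as the exists check completes,
        // so the backend sees it purged and the chain state stays consistent.
        static boolean existsClosed(ClientChain chain, String path) {
            try (ClientSnapshot snapshot = chain.newSnapshot()) {
                return snapshot.exists(path);
            }
        }
    }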
| Comment by Robert Varga [ 23/May/17 ] |