[CONTROLLER-1814] Datastore transactions fail to converge during partitioning Created: 26/Feb/18  Updated: 12/Apr/18  Resolved: 04/Apr/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Boron, Carbon, Nitrogen, Oxygen
Fix Version/s: Carbon, Nitrogen, Oxygen, Fluorine

Type: Bug Priority: Highest
Reporter: Robert Varga Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to CONTROLLER-1825 TransactionContextWrapper acting as b... Verified

 Description   

An issue has been reported from the field: an OpenFlow transaction never closes and continuously hits OperationLimiter.



 Comments   
Comment by Robert Varga [ 26/Feb/18 ]

Relevant log snippets on leader:

2018-01-26 18:21:05,339 | DEBUG | lt-dispatcher-17 | ShardReadWriteTransaction        | 185 - org.opendaylight.controller.sal-clustering-commons - 1.4.3.Boron-SR3 | Got ReceiveTimeout for inactivity - closing transaction member-3-datastore-operational-fe-1-chn-2373-txn-0
2018-01-26 18:21:05,339 | DEBUG | lt-dispatcher-17 | ShardDataTreeTransactionChain    | 191 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.Boron-SR3 | Aborted transaction ReadWriteShardDataTreeTransaction{id=member-3-datastore-operational-fe-1-chn-2373-txn-0, closed=true

Log messages on frontend:

2018-01-26 18:11:41,191 | DEBUG | ntLoopGroup-11-1 | RemoteTransactionContext         | 191 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.Boron-SR3 | Tx member-3-datastore-operational-fe-1-chn-2373-txn-0 sending 1000 batched modifications, ready: false
2018-01-26 18:11:46,199 | WARN  | ntLoopGroup-11-1 | RemoteTransactionContext         | 191 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.Boron-SR3 | Failed to acquire execute operation permit for transaction member-3-datastore-operational-fe-1-chn-2373-txn-0 on actor ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.18.130.52:2550/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational/shard-inventory-member-3:datastore-operational@1:2373-0#-587302156)]

2018-01-26 18:32:21,289 | WARN  | ntLoopGroup-11-1 | RemoteTransactionContext         | 191 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.Boron-SR3 | Failed to acquire execute operation permit for transaction member-3-datastore-operational-fe-1-chn-2373-txn-0 on actor ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.18.130.52:2550/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational/shard-inventory-member-3:datastore-operational@1:2373-0#-587302156)]

 

Comment by Robert Varga [ 26/Feb/18 ]

As it turns out, OperationLimiter accounting is wrong when faced with a combination of:

  • Transaction spanning multiple full BatchedModifications
  • AskTimeout of a BatchedModifications request

It relies on the content of the BatchedModificationsReply to release permits. When the request times out, only a single permit is released instead of 1000, so the application stays throttled and cannot proceed to submit the transaction.
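
To make the accounting problem concrete, here is a minimal, self-contained sketch. It is not the OpenDaylight code: the Limiter class, the method names and the limit of 1000 permits are all assumptions chosen to mirror the numbers in this comment. It contrasts releasing permits based on the reply (which is absent on timeout) with releasing the count the sender recorded itself. The second variant is only one possible approach and is not claimed to be what the Gerrit change does.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical illustration only; not the actual OperationLimiter implementation. */
public class PermitAccountingSketch {

    /** Simplified stand-in for a per-transaction operation limiter. */
    static final class Limiter {
        private final Semaphore permits;

        Limiter(int maxPermits) {
            permits = new Semaphore(maxPermits);
        }

        boolean tryAcquire() {
            return permits.tryAcquire();
        }

        void release(int count) {
            permits.release(count);
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /**
     * Reply-driven release: the permit count comes from the reply.
     * On an ask timeout there is no reply (null here), so only one permit is returned.
     */
    static void releaseFromReply(Limiter limiter, Integer numBatchedInReply) {
        int toRelease = (numBatchedInReply != null) ? numBatchedInReply : 1;
        limiter.release(toRelease);
    }

    /** Sender-driven release (assumed alternative): use the count recorded when the batch was sent. */
    static void releaseFromSentCount(Limiter limiter, int sentCount) {
        limiter.release(sentCount);
    }

    /** Fills one batch, simulates a timed-out request, and returns the permits available afterwards. */
    public static int simulate(boolean useSentCount) {
        Limiter limiter = new Limiter(1000); // assumed limit, matching the batch size in the logs
        int sent = 0;
        while (limiter.tryAcquire()) {
            sent++; // each modification in the batch holds one permit
        }
        if (useSentCount) {
            releaseFromSentCount(limiter, sent);
        } else {
            releaseFromReply(limiter, null); // timeout: no reply content to count from
        }
        return limiter.available();
    }

    public static void main(String[] args) {
        System.out.println("reply-based release after timeout: " + simulate(false)); // prints 1
        System.out.println("sent-count release after timeout: " + simulate(true));   // prints 1000
    }
}
```

In the reply-based case, 999 permits remain held after the timeout, so the next full batch cannot acquire permits. That matches the repeated "Failed to acquire execute operation permit" warnings on the frontend.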

Comment by Robert Varga [ 26/Feb/18 ]

Fluorine: https://git.opendaylight.org/gerrit/68757
