[CONTROLLER-1823] Regression in OF cluster test Created: 05/Apr/18  Updated: 12/Apr/18  Resolved: 07/Apr/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: Carbon, Nitrogen, Oxygen, Fluorine

Type: Bug Priority: High
Reporter: Luis Gomez Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File karaf1.log.gz     File karaf2.log.gz     File karaf3.log.gz    
Issue Links:
Relates
relates to CONTROLLER-1825 TransactionContextWrapper acting as b... Verified

 Description   

There is a regression in the OF cluster test: stats collection freezes after disconnecting the switch from its owner, or when reconnecting the switch to additional members. This does not happen immediately, but only after a few retries.

After some investigation, I traced the regression back to this patch and all of its cherry-picks:

https://git.opendaylight.org/gerrit/#/c/68900/



 Comments   
Comment by Luis Gomez [ 05/Apr/18 ]

Attached are full traces with cluster debug enabled. In this case the problem started when the switch, until then connected only to member-1, initiated connections to the additional members member-2 and member-3 at 01:59:32. The WARNs appear on member-1 a few seconds later:

2018-04-02 01:59:51,328 | WARN  | ofppool-2        | TransactionContextWrapper        | 215 - org.opendaylight.controller.sal-distributed-datastore - 1.6.3.SNAPSHOT | Failed to acquire enqueue operation permit for transaction member-1-datastore-operational-fe-0-chn-6-txn-96-0 on shard inventory
2018-04-02 01:59:56,330 | WARN  | ofppool-2        | RemoteTransactionContext         | 215 - org.opendaylight.controller.sal-distributed-datastore - 1.6.3.SNAPSHOT | Failed to acquire execute operation permit for transaction member-1-datastore-operational-fe-0-chn-6-txn-96-0 on actor ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@192.168.0.102:2550/), Path(/user/shardmanager-operational/member-2-shard-inventory-operational/shard-inventory-member-1:datastore-operational@0:6-96_797#643731958)]
Comment by Robert Varga [ 05/Apr/18 ]

The problem is the stateful semaphore handoff between TransactionContextWrapper and RemoteTransactionContext. RemoteTransactionContext needs to know whether an incoming operation has already tried to acquire a permit and whether that attempt succeeded.

With the CONTROLLER-1814 change, the backend-controlled feedback that provided a semi-recovery in this scenario went away. As a result, OperationLimiter ends up under-released, because the permits acquired by TransactionContextWrapper are never released by RemoteTransactionContext.
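
The under-release described above can be illustrated with a small, self-contained sketch. This is not the sal-distributed-datastore code: the class PermitHandoffSketch, its Limiter helper, and the methods enqueue, executeLeaky, and executeTracked are hypothetical names invented for illustration, and the 10 ms timeout is an arbitrary assumption. It only models a two-stage handoff over a plain java.util.concurrent.Semaphore, where the second stage either ignores or honours the outcome of the first stage's acquire attempt.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from the real code base.
class PermitHandoffSketch {

    /** Stand-in for an operation limiter: a bounded pool of permits. */
    static final class Limiter {
        private final Semaphore permits;

        Limiter(int maxPermits) {
            permits = new Semaphore(maxPermits);
        }

        /** Tries to take one permit, giving up after a short (assumed) timeout. */
        boolean tryAcquire() {
            try {
                return permits.tryAcquire(10, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }

        void release() {
            permits.release();
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /** First stage: attempts to acquire a permit and reports whether it succeeded. */
    static boolean enqueue(Limiter limiter) {
        return limiter.tryAcquire();
    }

    /**
     * Second stage, leaky variant: ignores what the first stage did, takes its
     * own permit and releases only that one. A permit held by the first stage
     * is never returned to the pool.
     */
    static void executeLeaky(Limiter limiter, boolean permitHeldByFirstStage) {
        boolean own = limiter.tryAcquire();
        // ... the operation would run here ...
        if (own) {
            limiter.release();
        }
    }

    /**
     * Second stage, tracked variant: reuses the permit handed over by the first
     * stage, acquires one only if none is held, and releases exactly what is held.
     */
    static void executeTracked(Limiter limiter, boolean permitHeldByFirstStage) {
        boolean held = permitHeldByFirstStage || limiter.tryAcquire();
        // ... the operation would run here ...
        if (held) {
            limiter.release();
        }
    }

    /** Runs a number of sequential operations and returns the permits left afterwards. */
    static int run(int maxPermits, int operations, boolean tracked) {
        Limiter limiter = new Limiter(maxPermits);
        for (int i = 0; i < operations; i++) {
            boolean held = enqueue(limiter);
            if (tracked) {
                executeTracked(limiter, held);
            } else {
                executeLeaky(limiter, held);
            }
        }
        return limiter.available();
    }
}
```

In this model, run(4, 10, true) leaves the pool intact, while run(4, 10, false) loses one permit per operation until none remain and every later acquire attempt times out. That gradual exhaustion is consistent with the freeze appearing only after a few retries, although the real interaction between the two classes is more involved than this sketch.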

Comment by Robert Varga [ 05/Apr/18 ]

Nitrogen: https://git.opendaylight.org/gerrit/70382

Comment by Robert Varga [ 06/Apr/18 ]

Oxygen: https://git.opendaylight.org/gerrit/70384

Fluorine: https://git.opendaylight.org/gerrit/70439

Carbon: https://git.opendaylight.org/gerrit/70440

 

Generated at Wed Feb 07 19:56:32 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.