[OVSDB-428] Suspected memory leak in TransactionInvokerImpl Created: 07/Sep/17  Updated: 24/Feb/21

Status: Open
Project: ovsdb
Component/s: Southbound.Open_vSwitch
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Michael Vorburger Assignee: Anil Vishnoi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 9114
Priority: High

 Description   

HPROF analysis with MAT in the context of CONTROLLER-1756 shows 50 MB used up in org.opendaylight.ovsdb.southbound.transactions.md.TransactionInvokerImpl, at a stage when (according to Sridhar/Sai) the system should be "at rest" - which looks suspicious, so I'm guessing something is not quite right there.

See also OVSDB-423 in related code, but that only cleaned up shutdown; it does not fix this.



 Comments   
Comment by Michael Vorburger [ 08/Sep/17 ]

https://drive.google.com/open?id=0B7gTXYDlI5sLcDdaQTFZUDlyX2s has the HPROF where I've seen this 50 MB in TransactionInvokerImpl. (Please just completely IGNORE the md.sal.trace memory consumption you'll also see in this HPROF; that's just a debug thing, not real production memory consumption.)

Comment by Vishal Thapar [ 23/May/18 ]

vorburger Is this still an issue or has it been fixed by subsequent patches?

Comment by Michael Vorburger [ 23/May/18 ]

I am not aware of any subsequent patches which could have fixed this, so unless proven otherwise, most likely still an issue.

Comment by Vishal Thapar [ 23/May/18 ]

Was this a leak or actual usage? OVS sends status updates every 5 seconds unless disabled, and each update from OVS results in an Operational DS read for the data on that switch. Stephen fixed transaction leaks in OVSDB, so I am wondering whether we still have actual leaks or just high usage.

Would it be possible to re-run this analysis with more recent code, including some of the configuration tweaks we made recently to reduce the data needed at 'rest' by cutting down on unnecessary status updates?

Comment by Michael Vorburger [ 23/May/18 ]

Suspected leak, not actual usage (50 MB is A LOT even for actual usage), because, as noted above, this was "seen at a stage when (according to Sridhar/Sai) the system should be 'at rest' - so this looks suspicious". skitt fixed transaction leaks in OVSDB: are there references to the JIRAs/Gerrits? Master only, or stable/oxygen as well? Re. a re-run, we need SaiMarapaReddy and/or JankiChhatbar to produce a new HPROF heap dump whenever the next scale lab runs, so that we can check there whether this is still visible.

Comment by Robert Varga [ 05/Dec/19 ]

Looking at the dump, it is a single TransactionInvokerImpl instance, which breaks down as follows:

  • transactionToCommand -> two OvsdbOperationalCommandAggregators, one weighing 9 MB, the other 10 KB
  • inputQueue, which has 4248 elements, for a total of 41 MB retained size; again, these are OvsdbOperationalCommandAggregators of less than 10 KB each

Comment by Robert Varga [ 05/Dec/19 ]

Looking at the retained objects, https://git.opendaylight.org/gerrit/c/ovsdb/+/86105 should help a bit (by making RowUpdates smaller). Other than that, I am not sure – the queue certainly should be drained at some point.

Comment by Robert Varga [ 05/Dec/19 ]

Okay, so looking at TransactionInvokerImpl, it should be left bereft of life (i.e. removed entirely).

What it does is provide MPSC queueing to make things asynchronous: all OVSDB nodes submit all of their commands through this single queue, which then ends up dancing with a transaction chain and a transaction-per-command pattern.
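
To make the shape of that concrete, here is a heavily simplified, hypothetical sketch of the many-producer/single-consumer queueing pattern described above; the names and types are illustrative only and do not match the real TransactionInvokerImpl code:

    // Hypothetical sketch only: many producers (one per OVSDB connection's netty
    // thread), one consumer thread draining an unbounded queue.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    final class QueueingInvokerSketch implements Runnable {
        private final BlockingQueue<Runnable> inputQueue = new LinkedBlockingQueue<>();

        // Called from any netty thread servicing an OVSDB channel.
        void invoke(Runnable command) {
            // Unbounded add: under load, commands simply pile up here.
            inputQueue.add(command);
        }

        // Single consumer: takes each command and runs it, conceptually in its own
        // transaction on a shared transaction chain (elided in this sketch).
        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    inputQueue.take().run();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }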

Now ... each OvsdbConnectionInstance is inherently bound to a particular InstanceIdentifier<Node>, which (I think; this needs to be confirmed) essentially determines which subtrees each instance actually updates through its commands. If we have an overlap, we need to understand where and why (and what we can do about it).

The invoker is always invoked from the context of a particular netty thread, which is also bound to an instance (via the channel; each channel is serviced by at most one thread).

Hence we really want a per-OvsdbConnectionInstance invoker, which is synchronous (in terms of interaction with other OVSDB code) and handles the per-command transaction work itself – no command queue, just talk directly to the transaction chain. Note that access needs to be synchronized to guard against concurrent failures (i.e. the chain failing while we are touching it).
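
As a rough illustration of that direction – a sketch only, using simplified stand-in types for the MD-SAL transaction chain; none of these names exist in the OVSDB code base:

    // Hypothetical per-connection invoker: one instance per OvsdbConnectionInstance,
    // no queue, synchronized access to its own transaction chain.
    final class PerConnectionInvokerSketch {
        interface WriteTransaction { void commit(); }
        interface TransactionChain { WriteTransaction newWriteTransaction(); }
        interface Command { void execute(WriteTransaction tx); }

        private final TransactionChain chain;

        PerConnectionInvokerSketch(TransactionChain chain) {
            this.chain = chain;
        }

        // Invoked directly from the netty thread servicing this connection's channel;
        // synchronized only to guard against concurrent chain-failure handling.
        synchronized void invoke(Command command) {
            WriteTransaction tx = chain.newWriteTransaction();
            command.execute(tx);
            // The netty thread does the work itself, so slow datastore submission
            // naturally backs up towards the OVS connection (no hidden queue).
            tx.commit();
        }
    }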

The transaction chain should then be switched to ping-pong – we do not care if transactions fail in bunches; that is completely fine, as we will just retry them.

Finally, a command's access to the transaction should be mediated by a memoizing supplier – we pass that to the invoker, and if a transaction is actually asked for, we know we need to allocate one, and so on.
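
A minimal sketch of that last point, assuming a hand-rolled memoizing supplier (rather than e.g. Guava's Suppliers.memoize, so we can also tell afterwards whether a transaction was ever allocated); the types are again hypothetical stand-ins:

    import java.util.function.Supplier;

    final class LazyTransactionSketch {
        interface WriteTransaction { void commit(); }
        interface TransactionChain { WriteTransaction newWriteTransaction(); }
        interface Command { void execute(Supplier<WriteTransaction> tx); }

        // Memoizing supplier: the chain is only touched on the first get(),
        // and repeated get() calls return the same transaction.
        private static final class MemoizedTx implements Supplier<WriteTransaction> {
            private final TransactionChain chain;
            private WriteTransaction tx;

            MemoizedTx(TransactionChain chain) { this.chain = chain; }

            @Override
            public WriteTransaction get() {
                if (tx == null) {
                    tx = chain.newWriteTransaction();
                }
                return tx;
            }

            WriteTransaction allocatedOrNull() { return tx; }
        }

        static void invoke(TransactionChain chain, Command command) {
            MemoizedTx lazyTx = new MemoizedTx(chain);
            command.execute(lazyTx);

            WriteTransaction tx = lazyTx.allocatedOrNull();
            if (tx != null) {
                // Commit only if the command actually asked for a transaction.
                tx.commit();
            }
            // Otherwise no transaction was allocated and there is nothing to commit.
        }
    }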

With that, we:

  • will eliminate 41MiB out of the 50MiB overhead reported here
  • will exert natural backpressure on incoming channels
  • will eliminate unnecessary synchronization on the MPSC queue
  • will elide transaction allocation when not needed
  • will be submitting things into DS in parallel, while also taking advantage of its batching capabilities


Comment by Robert Varga [ 05/Dec/19 ]

Based on the analysis above, I think this is not a leak, but the tail end of a workload: the system was under pressure, which has ceased – but we have a queued backlog we are still processing. Dark buffers, essentially.

Eliminating the queue will improve latency, and while it may impact test case time negatively (by keeping backpressure towards the OVSes), the tail end will essentially be non-existent, or it will be pushed down to the DS processing queue (which I think is pretty efficient).
