[OVSDB-428] Suspected memory leak in TransactionInvokerImpl Created: 07/Sep/17 Updated: 24/Feb/21 |
|
| Status: | Open |
| Project: | ovsdb |
| Component/s: | Southbound.Open_vSwitch |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Michael Vorburger | Assignee: | Anil Vishnoi |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 9114 |
| Priority: | High |
| Description |
|
HPROF analysis with MAT, done in the context of a related scale-lab run (see also), suggests a memory leak in TransactionInvokerImpl; the heap dump is linked in the comments below. |
| Comments |
| Comment by Michael Vorburger [ 08/Sep/17 ] |
|
https://drive.google.com/open?id=0B7gTXYDlI5sLcDdaQTFZUDlyX2s has the HPROF where I've seen this 50 MB in TransactionInvokerImpl. (Please just completely IGNORE the md.sal.trace memory consumption you'll also see in this HPROF; that's just a debug thing, not real production memory consumption.) |
| Comment by Vishal Thapar [ 23/May/18 ] |
|
vorburger Is this still an issue or has it been fixed by subsequent patches? |
| Comment by Michael Vorburger [ 23/May/18 ] |
|
I am not aware of any subsequent patches which could have fixed this, so unless proven otherwise, most likely still an issue. |
| Comment by Vishal Thapar [ 23/May/18 ] |
|
Was this a leak or actual usage? OVS sends status updates every 5 seconds unless disabled. Each update from OVS will result in an Operational DS read for data on that switch. Stephen fixed transaction leaks in OVSDB, so I am wondering whether we still have actual leaks or just high usage. Would it be possible to run this analysis with more recent code, with some of the tweaks we have made recently in configuration to reduce the data needed at 'rest' by cutting down on unnecessary status updates? |
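For context, a rough sketch of what "each update results in an Operational DS read" amounts to. The mdsal binding APIs used are the standard ones, but the class and method names are illustrative only, not the actual southbound handler:

{code:java}
// Illustrative only: per status update (roughly every 5s per switch), the current
// operational state of that switch's Node is read back before the new state is merged.
import java.util.Optional;
import java.util.concurrent.ExecutionException;

import org.opendaylight.mdsal.binding.api.DataBroker;
import org.opendaylight.mdsal.binding.api.ReadTransaction;
import org.opendaylight.mdsal.common.api.LogicalDatastoreType;
import org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

final class StatusUpdateHandler {
    private final DataBroker broker;

    StatusUpdateHandler(final DataBroker broker) {
        this.broker = broker;
    }

    // Called for every periodic status update from a connected OVS
    // (transaction lifecycle handling omitted for brevity).
    Optional<Node> readCurrentNode(final InstanceIdentifier<Node> nodeIid)
            throws InterruptedException, ExecutionException {
        final ReadTransaction tx = broker.newReadOnlyTransaction();
        return tx.read(LogicalDatastoreType.OPERATIONAL, nodeIid).get();
    }
}
{code}

At scale this is steady-state work repeated every few seconds per switch, which is why the question of actual usage versus leak matters.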
| Comment by Michael Vorburger [ 23/May/18 ] |
|
Suspected leak, not actual usage (50 MB is A LOT even for actual usage), because, see above, it was "seen at a stage when (according to Sridhar/Sai) the system should be 'at rest' - so this looks suspicious". Re. "skitt fixed transaction leaks in OVSDB": are there references to the JIRAs/Gerrits? Master only, or stable/oxygen as well? Re. a re-run, we need SaiMarapaReddy and/or JankiChhatbar to produce a new HPROF heap dump whenever the next scale lab runs so that we can check there if this is still visible. |
| Comment by Robert Varga [ 05/Dec/19 ] |
|
Looking at the dump, it is a single TransactionInvokerImpl instance, which breaks down as follows:
|
| Comment by Robert Varga [ 05/Dec/19 ] |
|
Looking at the retained objects, https://git.opendaylight.org/gerrit/c/ovsdb/+/86105 should help a bit (by making RowUpdates smaller). Other than that, I am not sure – the queue certainly should be drained at some point. |
| Comment by Robert Varga [ 05/Dec/19 ] |
|
Okay, so looking at TransactionInvokerImpl, it should be left bereft of life. What it does is provide MPSC queueing to make things asynchronous, hence all OVSDB nodes are submitting all commands through this queue, which ends up dancing with a transaction chain and a transaction-per-command thing.

Now ... each OvsdbConnectionInstance is inherently bound to a particular InstanceIdentifier<Node>, which (I think, needs to be confirmed) essentially binds what subtrees each instance actually updates through its commands. If we have an overlap, we need to understand where and why (and what we can do about it). The invoker is always invoked from the context of a particular netty thread, which is also bound to an instance (the channel; each is serviced by at most one thread).

Hence we really want a per-OvsdbConnectionInstance invoker which is synchronous (in terms of interaction with other OVSDB code) and does the transaction thing with commands: no command queue, just talk directly to the transaction chain. Note the access needs to be synchronized to guard against concurrent failures (i.e. the chain failing while we are touching it). The transaction chain should then be switched to pingpong: we do not care if transactions fail in bunches, that is completely fine, as we will just retry them. Finally, a command's access to the transaction should be mediated by a memoizing supplier: we pass that to the invoker, and if the command asks for a transaction, we know we need to allocate it, etc. With that, we:
|
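As an illustration only, here is a minimal sketch of the per-OvsdbConnectionInstance invoker described in the comment above. It assumes the mdsal binding TransactionChain/ReadWriteTransaction APIs; the class name ConnectionTransactionInvoker and the Consumer-based command shape are hypothetical, not the existing TransactionInvokerImpl/TransactionCommand contract.

{code:java}
// Hypothetical sketch of a per-connection, queue-less invoker; names and the command
// signature are illustrative, not the current TransactionInvokerImpl/TransactionCommand API.
import java.util.function.Consumer;
import java.util.function.Supplier;

import org.opendaylight.mdsal.binding.api.ReadWriteTransaction;
import org.opendaylight.mdsal.binding.api.TransactionChain;

final class ConnectionTransactionInvoker {
    // One chain per OvsdbConnectionInstance; assumed to be created as a pingpong/merging chain.
    private final TransactionChain chain;

    ConnectionTransactionInvoker(final TransactionChain chain) {
        this.chain = chain;
    }

    // Called synchronously from the connection's netty thread: no command queue at all.
    // synchronized guards against the chain failing concurrently while we are touching it.
    synchronized void invoke(final Consumer<Supplier<ReadWriteTransaction>> command) {
        final ReadWriteTransaction[] allocated = new ReadWriteTransaction[1];

        // Memoizing supplier: a transaction is only allocated if the command asks for one.
        final Supplier<ReadWriteTransaction> lazyTx = () -> {
            if (allocated[0] == null) {
                allocated[0] = chain.newReadWriteTransaction();
            }
            return allocated[0];
        };

        command.accept(lazyTx);

        if (allocated[0] != null) {
            // Fire-and-forget commit; if the chain fails, the affected commands are simply retried.
            allocated[0].commit();
        }
    }
}
{code}

With this shape there is no command queue to retain memory: the only in-flight state is the single transaction being built, and any backpressure lands directly on the connection's netty thread.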
| Comment by Robert Varga [ 05/Dec/19 ] |
|
Based on the analysis above, I think this is not a leak, but the tail end of a workload: the system was under pressure, which has since ceased, but we have a queued backlog we are still processing ("dark buffers"). Eliminating the queue will improve latency, and while it may impact test case time negatively (by keeping backpressure towards the OVSes), the tail end will essentially be non-existent, or it will be pushed down to the DS processing queue (which I think is pretty efficient). |
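To make the "dark buffers" point concrete, here is a toy, self-contained illustration (plain java.util.concurrent, not OVSDB code): an unbounded single-consumer queue keeps accepting work faster than it is drained, so after the producers stop ("system at rest") a large backlog is still retained and slowly processed; calling the consumer directly instead removes the retained backlog and pushes the cost back onto the callers as backpressure.

{code:java}
// Toy illustration of the tail-end/backlog effect, not OVSDB code.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class DarkBufferDemo {
    public static void main(final String[] args) throws InterruptedException {
        // Unbounded single-consumer hand-off, analogous to TransactionInvokerImpl's command queue.
        final ExecutorService queued = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 10_000; i++) {
            queued.execute(DarkBufferDemo::slowCommit); // returns immediately, backlog grows
        }
        // Producers are done, but the queued backlog ("dark buffer") is still being drained.
        queued.shutdown();
        queued.awaitTermination(1, TimeUnit.HOURS);

        // Queue-less alternative: the producer pays the cost inline, so nothing is retained,
        // only backpressure towards the caller (here: the loop simply runs slower).
        for (int i = 0; i < 10_000; i++) {
            slowCommit();
        }
    }

    private static void slowCommit() {
        // Stand-in for committing one transaction to the datastore.
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}

In the OVSDB case the callers are the per-connection netty threads, so the slowdown would show up as backpressure towards the OVSes rather than as retained heap.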