[CONTROLLER-1746] OOM with large number of closed transactions Created: 07/Aug/17  Updated: 11/Oct/18  Resolved: 02/Oct/18

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Carbon
Fix Version/s: Neon, Fluorine SR1, Oxygen SR4

Type: Bug
Reporter: Stephen Kitt Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8941

 Description   

On a single-node Carbon SR1 setup, we got an OOM over the weekend. The heap dump shows 1.1GB occupied by the frontend history's closed transactions map, with 33M entries.

I'll upload the heap dump and add a link to it.



 Comments   
Comment by Stephen Kitt [ 07/Aug/17 ]

The heap dump is on https://www.sk2.org/java_pid2098.hprof.xz (269MB).

Comment by Robert Varga [ 07/Aug/17 ]

Affects both Carbon and Nitrogen. Blocking as this is a regression.

Confirmed by analyzing the dump reported in https://git.opendaylight.org/gerrit/#/c/61034/.

Comment by Robert Varga [ 07/Aug/17 ]

This is the ask-based protocol, and we are looking at transactions which have been closed but not purged. The tell-based protocol issues explicit purges from the frontend; for the ask-based protocol this has to be done on the backend.

Examining the data shows these are free-standing transactions which have been committed, hence we are missing the purge step.
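
For illustration, a minimal sketch of the idea (the class and method names below are hypothetical, not the actual patch): since ask-based clients never send an explicit purge request, the backend has to purge a committed free-standing transaction itself once the commit completes.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: backend-side purge for the ask-based protocol.
final class AskBasedBackendPurge {
    private final Map<String, Boolean> closedTransactions = new HashMap<>();

    void onCommitComplete(final String txId, final boolean askBasedClient) {
        closedTransactions.put(txId, Boolean.TRUE);
        if (askBasedClient) {
            // Tell-based frontends send an explicit purge request later;
            // ask-based ones never do, so the backend triggers the purge here.
            purgeTransaction(txId);
        }
    }

    private void purgeTransaction(final String txId) {
        closedTransactions.remove(txId);
    }
}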

Comment by Robert Varga [ 07/Aug/17 ]

master: https://git.opendaylight.org/gerrit/61284

Comment by Michael Vorburger [ 08/Aug/17 ]

skitt: it’s a Carbon SR1 regression so anything older is something else

skitt: it’s easy to spot, Eclipse MAT tells you the biggest leak culprit is a huge completedTransactions HashMap$Node[]

Comment by Stephen Kitt [ 08/Aug/17 ]

(In reply to Michael Vorburger from comment #5)
> skitt: it’s a Carbon SR1 regression so anything older is something else
>
> skitt: it’s easy to spot, Eclipse MAT tells you the biggest leak culprit is
> a huge completedTransactions HashMap$Node[]

closeTransactions, not completeTransactions

Comment by Stephen Kitt [ 08/Aug/17 ]

(In reply to Stephen Kitt from comment #6)
> (In reply to Michael Vorburger from comment #5)
> > skitt: it’s a Carbon SR1 regression so anything older is something else
> >
> > skitt: it’s easy to spot, Eclipse MAT tells you the biggest leak culprit is
> > a huge completedTransactions HashMap$Node[]
>
> closeTransactions, not completeTransactions

close*d*Transactions, not complete*d*Transactions

Comment by Michael Vorburger [ 08/Aug/17 ]

> close*d*Transactions, not complete*d*Transactions

closedTransactions <= Dear Bugzilla indexer, please dig this.. [We will not find this bug in the future when searching for "closedTransactions" without this.]

Comment by Robert Varga [ 09/Aug/17 ]

carbon: https://git.opendaylight.org/gerrit/61433

Awaiting confirmation the fix works.

Comment by Stephen Kitt [ 11/Aug/17 ]

The fix looks OK; I'm seeing transactions being purged properly.

Comment by A H [ 14/Aug/17 ]

Autorelease (AR) builds are failing on Carbon, and only one patch has been merged recently. Please confirm whether this change could have caused the issue:

https://lists.opendaylight.org/pipermail/release/2017-August/011971.html

Comment by Robert Varga [ 14/Aug/17 ]

Yes, as the change introduces asynchronous variance to production code under normal operation. This opens a time window of ~100ms during which UT code can observe a transient state.

As far as we can tell the issue lies solely with the UT suite, which needs to be fixed to account for the variance in behavior between the autorelease VMs, verify/merge VMs and developer environments.

Of the two issues seen in autorelease, one already has a fix, https://git.opendaylight.org/gerrit/61626, which has been reviewed. The other failure is under investigation.
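
A minimal sketch of the kind of adjustment the UT suite needs (the helper name and usage are hypothetical, not the actual patches): poll for the asynchronously-reached state with a timeout instead of asserting it immediately, so the ~100ms window does not cause spurious failures on slower VMs.

import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical test helper: wait for an asynchronous condition with a timeout.
final class AsyncAssert {
    static void awaitCondition(final BooleanSupplier condition, final Duration timeout)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            Thread.sleep(10);
        }
    }
}

// Illustrative use in a test:
// AsyncAssert.awaitCondition(() -> closedTransactionCount() == 0, Duration.ofSeconds(5));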

Comment by Robert Varga [ 14/Aug/17 ]

By no means can we revert the fix: it fixes a long-term stability regression we introduced between Carbon and Carbon SR1. It has been reported from the field.

Comment by Robert Varga [ 14/Aug/17 ]

https://git.opendaylight.org/gerrit/61720 should fix the second UT failure

Comment by Huafei Zhang [ 27/Aug/18 ]

I apologize for accidentally reopening this bug while reviewing it. I hope it does not cause any trouble.

Comment by Chien-Hsin Chen [ 20/Sep/18 ]

This bug wasn't fixed. We are still seeing a huge number of closedTransactions entries in Oxygen SR3.

The code in FrontendHistoryMetadataBuilder:

void onTransactionCommitted(final TransactionIdentifier txId) {
    closedTransactions.put(UnsignedLong.fromLongBits(txId.getTransactionId()), Boolean.TRUE);
}

void onTransactionPurged(final TransactionIdentifier txId) {
    final UnsignedLong id = UnsignedLong.fromLongBits(txId.getTransactionId());
    closedTransactions.remove(id);
    purgedTransactions.add(Range.closedOpen(id, UnsignedLong.ONE.plus(id)));
}

A closedTransactions entry is added when a transaction is committed and removed when it is purged.

But it seems these two operations are happening in the wrong order.

log:set debug org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder

 

2018-09-20 16:17:50,466 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-58-0
2018-09-20 16:17:50,466 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-58-0
2018-09-20 16:17:50,483 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-59-0
2018-09-20 16:17:50,486 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-59-0
2018-09-20 16:17:50,496 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-6-txn-90-0
2018-09-20 16:17:50,497 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-6-txn-90-0
2018-09-20 16:17:50,498 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-6-txn-91-0
2018-09-20 16:17:50,498 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-6-txn-91-0
2018-09-20 16:17:50,514 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-60-0
2018-09-20 16:17:50,519 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-60-0
2018-09-20 16:17:50,520 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-61-0
2018-09-20 16:17:50,520 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-61-0
2018-09-20 16:17:50,577 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-5-txn-85-0
2018-09-20 16:17:50,577 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-5-txn-85-0

Purged is always executed before Committed for the same transaction, so the closed transactions keep being added back.
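
To make the effect of the mis-ordering concrete, here is a self-contained sketch (plain Java types instead of the real UnsignedLong/Range bookkeeping; not the actual FrontendHistoryMetadataBuilder): when the purge arrives first, the remove is a no-op and the subsequent commit re-adds the entry, which is never removed again.

import java.util.HashMap;
import java.util.Map;

// Illustrative only.
final class OrderingDemo {
    private final Map<Long, Boolean> closedTransactions = new HashMap<>();

    void onTransactionCommitted(final long txId) {
        closedTransactions.put(txId, Boolean.TRUE);
    }

    void onTransactionPurged(final long txId) {
        closedTransactions.remove(txId);
    }

    public static void main(final String[] args) {
        final OrderingDemo demo = new OrderingDemo();

        // Intended order: the entry is added, then removed.
        demo.onTransactionCommitted(58);
        demo.onTransactionPurged(58);

        // Order seen in the log: the purge removes nothing and the commit
        // re-adds the entry, which then stays in the map forever.
        demo.onTransactionPurged(59);
        demo.onTransactionCommitted(59);

        System.out.println(demo.closedTransactions); // prints {59=true}
    }
}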

 

Comment by Robert Varga [ 01/Oct/18 ]

I think we need frontend debug logs as well to understand what is going on. The frontend should not be issuing a purge before it sees the transaction committed.

Comment by Robert Varga [ 02/Oct/18 ]

jamescch, is this reproducible with the configuration data store?

Comment by Robert Varga [ 02/Oct/18 ]

So this is a single node, with the operational store. Unless this is reproducible with three nodes or with the configuration data store, it would seem to indicate that sal-akka-raft is playing tricks by invoking us back re-entrantly.

Comment by Robert Varga [ 02/Oct/18 ]

Actually this is a problem with ShardDataTree.applyReplicatedPayload(): we go down through payloadReplicationComplete(), but invoke allMetadataCommittedTransaction() only once we return.

This combines with the Shard.replicatePayload() shortcut to subvert us: payloadReplicationComplete() applies the state, but also runs through the completion callbacks, which issue a purge payload replication that bounces on the stack as well; hence the purge completes fully before we get to invoking allMetadataCommittedTransaction(), and the mis-ordering happens. Replication/persistence thwarts the replicatePayload() shortcut, which is why we are fine with three nodes or with persistence.
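
A simplified sketch of that re-entrancy (the structure is illustrative, not the actual Shard/ShardDataTree code): with the single-node, non-persistent shortcut, the purge is applied synchronously from within the completion callbacks, so it lands before the commit is recorded unless the metadata update happens first.

import java.util.ArrayList;
import java.util.List;

// Illustrative only.
final class ReentrancySketch {
    private final List<String> events = new ArrayList<>();

    // Buggy shape: completion callbacks run first and, via the single-node
    // shortcut, synchronously apply the purge payload on the same stack,
    // so "Purged" is recorded before "Committed".
    void applyCommitPayloadBuggy(final long txId) {
        runCompletionCallbacks(txId);
        events.add("Committed " + txId);
    }

    // Fixed shape: record the commit in the metadata before any callback can
    // bounce the purge back, restoring the expected ordering.
    void applyCommitPayloadFixed(final long txId) {
        events.add("Committed " + txId);
        runCompletionCallbacks(txId);
    }

    private void runCompletionCallbacks(final long txId) {
        // Single-node, non-persistent shortcut: the purge payload is applied
        // immediately, re-entering the metadata path.
        events.add("Purged " + txId);
    }

    public static void main(final String[] args) {
        final ReentrancySketch sketch = new ReentrancySketch();
        sketch.applyCommitPayloadBuggy(58);  // Purged 58, Committed 58
        sketch.applyCommitPayloadFixed(59);  // Committed 59, Purged 59
        sketch.events.forEach(System.out::println);
    }
}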

Comment by Chien-Hsin Chen [ 11/Oct/18 ]

Great! The fix works well. The order looks correct now.
