[CONTROLLER-1746] OOM with large number of closed transactions Created: 07/Aug/17 Updated: 11/Oct/18 Resolved: 02/Oct/18 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | Carbon |
| Fix Version/s: | Neon, Fluorine SR1, Oxygen SR4 |
| Type: | Bug | ||
| Reporter: | Stephen Kitt | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8941 |
| Description |
|
On a single-node Carbon SR1 setup, we got an OOM over the weekend. Looking at the heap dump shows 1.1GB occupied in the frontend history's closed transactions map, with 33M entries. I'll upload the heap dump and add a link to it. |
| Comments |
| Comment by Stephen Kitt [ 07/Aug/17 ] |
|
The heap dump is on https://www.sk2.org/java_pid2098.hprof.xz (269MB). |
| Comment by Robert Varga [ 07/Aug/17 ] |
|
Affects both Carbon and Nitrogen. Blocking as this is a regression. Confirmed by analyzing the dump reported in https://git.opendaylight.org/gerrit/#/c/61034/. |
| Comment by Robert Varga [ 07/Aug/17 ] |
|
This is the ask-based protocol, and we are looking at transactions which have been closed but not purged. The tell-based protocol issues explicit purges from the frontend; for the ask-based protocol this has to be done on the backend. Examining the data, these are free-standing transactions which have been committed, hence we are missing the purge step. |
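To illustrate the mechanism, here is a minimal sketch with simplified, hypothetical names rather than the actual controller sources: each commit adds an entry to the closed-transactions map, only a purge removes it, so without purges the map grows until the heap is exhausted.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch only; names are simplified stand-ins, not the controller code.
    public class LeakSketch {
        private final Map<Long, Boolean> closedTransactions = new HashMap<>();

        void onTransactionCommitted(final long txId) {
            closedTransactions.put(txId, Boolean.TRUE);
        }

        void onTransactionPurged(final long txId) {
            closedTransactions.remove(txId);
        }

        public static void main(String[] args) {
            final LeakSketch sketch = new LeakSketch();
            // Free-standing transactions get committed but, with the ask-based protocol,
            // no purge ever arrives, so the map only grows (the heap dump showed ~33M entries).
            for (long tx = 0; tx < 1_000_000L; tx++) {
                sketch.onTransactionCommitted(tx);
            }
            System.out.println(sketch.closedTransactions.size()); // 1000000 retained entries
        }
    }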
| Comment by Robert Varga [ 07/Aug/17 ] |
| Comment by Michael Vorburger [ 08/Aug/17 ] |
|
skitt: it’s a Carbon SR1 regression so anything older is something else
skitt: it’s easy to spot, Eclipse MAT tells you the biggest leak culprit is a huge completedTransactions HashMap$Node[] |
| Comment by Stephen Kitt [ 08/Aug/17 ] |
|
(In reply to Michael Vorburger from comment #5) closeTransactions, not completeTransactions |
| Comment by Stephen Kitt [ 08/Aug/17 ] |
|
(In reply to Stephen Kitt from comment #6) close*d*Transactions, not complete*d*Transactions |
| Comment by Michael Vorburger [ 08/Aug/17 ] |
|
> close*d*Transactions, not complete*d*Transactions
closedTransactions <= Dear Bugzilla indexer, please dig this. |
| Comment by Robert Varga [ 09/Aug/17 ] |
|
Carbon: https://git.opendaylight.org/gerrit/61433
Awaiting confirmation that the fix works. |
| Comment by Stephen Kitt [ 11/Aug/17 ] |
|
The fix looks OK; I’m seeing transactions being purged properly. |
| Comment by A H [ 14/Aug/17 ] |
|
Autorelease (AR) builds are failing on Carbon, and only one patch has been merged recently. Please confirm whether this change could have caused the issue: https://lists.opendaylight.org/pipermail/release/2017-August/011971.html |
| Comment by Robert Varga [ 14/Aug/17 ] |
|
Yes, as the change introduces asynchronous variance to production code under normal operation. This opens a time window of ~100ms during which UT code can observe a transient state. As far as we can tell the issue lies solely with the UT suite, which needs to be fixed to account for the variance in behavior between the autorelease VM, the verify/merge VMs and developer environments. Of the two issues seen in autorelease, one already has a fix, https://git.opendaylight.org/gerrit/61626, which has been reviewed. The other failure is under investigation. |
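For illustration, a minimal sketch of how a UT can account for that window by polling for the expected state instead of asserting a transient one immediately; the helper below and the closedTransactionCount() call in the usage comment are hypothetical, not the actual controller test code.

    import java.util.function.BooleanSupplier;

    // Minimal sketch: poll until the asynchronous state settles or a timeout expires.
    public final class AwaitAssert {
        private AwaitAssert() {
        }

        public static void awaitCondition(final BooleanSupplier condition, final long timeoutMillis)
                throws InterruptedException {
            final long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!condition.getAsBoolean()) {
                if (System.currentTimeMillis() > deadline) {
                    throw new AssertionError("Condition not met within " + timeoutMillis + " ms");
                }
                Thread.sleep(10); // re-check until the asynchronous purge becomes visible
            }
        }
    }

    // Usage in a test, polling for up to one second rather than asserting right away:
    //     AwaitAssert.awaitCondition(() -> closedTransactionCount() == 0, 1000);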
| Comment by Robert Varga [ 14/Aug/17 ] |
|
By no means can we revert the fix: it fixes a long-term stability regression we introduced between Carbon and Carbon SR1. It has been reported from the field. |
| Comment by Robert Varga [ 14/Aug/17 ] |
|
https://git.opendaylight.org/gerrit/61720 should fix the second UT failure. |
| Comment by Huafei Zhang [ 27/Aug/18 ] |
|
I apologize for accidentally reopening this bug while reviewing it. I hope it doesn't cause any trouble. |
| Comment by Chien-Hsin Chen [ 20/Sep/18 ] |
|
This bug wasn't fixed. We are still seeing a huge number of closedTransactions entries in Oxygen SR3. The code in FrontendHistoryMetadataBuilder:
    void onTransactionCommitted(final TransactionIdentifier txId) {
        closedTransactions.put(UnsignedLong.fromLongBits(txId.getTransactionId()), Boolean.TRUE);
    }

    void onTransactionPurged(final TransactionIdentifier txId) {
        final UnsignedLong id = UnsignedLong.fromLongBits(txId.getTransactionId());
        closedTransactions.remove(id);
        purgedTransactions.add(Range.closedOpen(id, UnsignedLong.ONE.plus(id)));
    }
A closedTransactions entry is added when a transaction is committed and removed when it is purged, but it seems these two operations have been reordered. With log:set debug org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder:
2018-09-20 16:17:50,466 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-58-0
2018-09-20 16:17:50,466 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-58-0
2018-09-20 16:17:50,483 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-59-0
2018-09-20 16:17:50,486 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-59-0
2018-09-20 16:17:50,496 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-6-txn-90-0
2018-09-20 16:17:50,497 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-6-txn-90-0
2018-09-20 16:17:50,498 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-6-txn-91-0
2018-09-20 16:17:50,498 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-6-txn-91-0
2018-09-20 16:17:50,514 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-60-0
2018-09-20 16:17:50,519 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-60-0
2018-09-20 16:17:50,520 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-7-txn-61-0
2018-09-20 16:17:50,520 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-7-txn-61-0
2018-09-20 16:17:50,577 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Purged transaction member-1-datastore-operational-fe-0-chn-5-txn-85-0
2018-09-20 16:17:50,577 | DEBUG | dispatcher-24 | lientMetadataBuilder | member-1-shard-inventory-operational: Committed transaction member-1-datastore-operational-fe-0-chn-5-txn-85-0

Purged is always executed before Committed for the same transaction, so the closed transactions keep being added back.
|
| Comment by Robert Varga [ 01/Oct/18 ] |
|
I think we need frontend debug logs as well to understand what is going on. The frontend should not be issuing a purge before it sees the transaction committed. |
| Comment by Robert Varga [ 02/Oct/18 ] |
|
jamescch: is this reproducible with the configuration data store? |
| Comment by Robert Varga [ 02/Oct/18 ] |
|
So this is a single node with the operational store. Unless this is reproducible with three nodes or with the configuration data store, it would seem to indicate that sal-akka-raft is playing tricks by invoking us back. |
| Comment by Robert Varga [ 02/Oct/18 ] |
|
Actually this is a problem with ShardDataTree.applyReplicatedPayload(): we go down through payloadReplicationComplete(), but invoke allMetadataCommittedTransaction() only once we return. The Shard.replicatePayload() shortcut subverts us here: payloadReplicationComplete() applies the state, but it also runs through the completion callbacks, which issue a purge payload replication; that bounces on the stack too, hence it executes completely before we get to invoking allMetadataCommittedTransaction(). Thus the mis-ordering happens. Replication/persistence thwarts the replicatePayload() shortcut, hence we are good with three nodes or with persistence. |
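To make the call ordering concrete, here is a minimal sketch with simplified stand-in names rather than the actual ShardDataTree code: the replication-complete step synchronously runs the completion callbacks, which apply the purge on the same stack, and only afterwards does the caller record the commit in metadata, so the purge removes an entry that is re-added immediately afterwards.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative sketch only; method and field names below are simplified stand-ins.
    public class OrderingSketch {
        private final Map<Long, Boolean> closedTransactions = new LinkedHashMap<>();

        // Models applyReplicatedPayload(): payloadReplicationComplete() runs first,
        // allMetadataCommittedTransaction() only runs once it returns.
        void applyCommitPayload(final long txId) {
            payloadReplicationComplete(txId);      // runs callbacks, including the purge
            allMetadataCommittedTransaction(txId); // too late: the purge already happened
        }

        private void payloadReplicationComplete(final long txId) {
            // With the single-node replicatePayload() shortcut the purge payload is
            // applied on the same stack, so the purge metadata update fires here.
            onTransactionPurged(txId);
        }

        void allMetadataCommittedTransaction(final long txId) {
            closedTransactions.put(txId, Boolean.TRUE); // entry re-added after the purge
        }

        void onTransactionPurged(final long txId) {
            closedTransactions.remove(txId);
        }

        public static void main(String[] args) {
            final OrderingSketch sketch = new OrderingSketch();
            sketch.applyCommitPayload(42L);
            // Prints {42=true}: the closedTransactions entry leaks despite the purge.
            System.out.println(sketch.closedTransactions);
        }
    }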
| Comment by Chien-Hsin Chen [ 11/Oct/18 ] |
|
Great! The fix works well. The order looks correct now. |