[CONTROLLER-1209] Clustering: Recovering a large journal runs out of memory Created: 15/Mar/15 Updated: 25/Jul/23 Resolved: 31/Mar/15 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tom Pantelis | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 2847 |
| Description |
|
I created 1 million data objects (cars) with 50 concurrent transactions, each writing 20K entries, in the CDS (single-node). This worked fine: memory usage stayed under 0.5G with a max of 2G, so the in-memory journal log trimming did its job. It didn't snapshot because it never reached 20K journal entries (due to the batched modifications) and didn't trip the 70% memory threshold. However, when I restarted the controller it ran out of memory trying to recover the journal:

Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-18] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-24] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]

I increased the max JVM memory to 4G and still ran out of memory. Looking at the heap dump, the currentLogRecoveryBatch retained over 2G, as did the ReplicatedLog journal (although they share the MutableCompositeModification instances). Both had 126 entries, with each MutableCompositeModification instance retaining ~21M, so recovery didn't get very far.

We keep applying entries to the Shard's currentLogRecoveryBatch until we reach the shardJournalRecoveryLogBatchSize, at which point we queue up a transaction. One thing I noticed is that we don't clear the currentLogRecoveryBatch when we end a batch - this is a bug (see the sketch below). However, we never hit the end of a batch before running out of memory because shardJournalRecoveryLogBatchSize is 5000 by default. I dropped it to 10 and it got farther (136 entries) but still ran out of memory. Now the ShardRecoveryCoordinator retains ~2G as it captures the current currentLogRecoveryBatch. The ReplicatedLog only retains ~467M because it stores serialized ModificationPayload instances, which are much smaller, ~2M each.

It seems we also need to trim the in-memory log as we're recovering it. We trim the log as transactions are committed live; in fact, for single-node we don't even retain the in-memory log.

I think there are several things we need to address here. Here are initial thoughts:
I can look at this more when I get back. |
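A rough sketch of the recovery batching described above, with the missing clear of currentLogRecoveryBatch added at the end of each batch. This is illustrative only: aside from currentLogRecoveryBatch and shardJournalRecoveryLogBatchSize, which are named in the description, the class and method names are hypothetical stand-ins and not the actual Shard/ShardRecoveryCoordinator code.

```java
// Illustrative sketch, not the actual controller code. Only the field
// currentLogRecoveryBatch and the setting shardJournalRecoveryLogBatchSize
// come from the issue; everything else is a hypothetical stand-in.
import java.util.ArrayList;
import java.util.List;

class RecoveryBatcher<T> {
    private final int shardJournalRecoveryLogBatchSize;
    private List<T> currentLogRecoveryBatch = new ArrayList<>();

    RecoveryBatcher(int shardJournalRecoveryLogBatchSize) {
        this.shardJournalRecoveryLogBatchSize = shardJournalRecoveryLogBatchSize;
    }

    // Called for each journal entry replayed during recovery.
    void appendRecoveredEntry(T payload) {
        currentLogRecoveryBatch.add(payload);
        if (currentLogRecoveryBatch.size() >= shardJournalRecoveryLogBatchSize) {
            endCurrentBatch();
        }
    }

    // Ends the current batch: hands the entries off to be committed and,
    // crucially, replaces the list so the recovered payloads can be
    // garbage-collected (the missing clear is the bug noted above).
    private void endCurrentBatch() {
        List<T> batch = currentLogRecoveryBatch;
        currentLogRecoveryBatch = new ArrayList<>();
        commitBatch(batch);
    }

    // Stub: in the real code this would apply the batched modifications in a
    // data store transaction.
    void commitBatch(List<T> batch) {
    }
}
```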
| Comments |
| Comment by Tom Pantelis [ 15/Mar/15 ] |
|
The main issues here are the first and third bullet items above. If we store the serialized ModificationPayload instances, which are the same instances stored in the ReplicatedLog, in the currentLogRecoveryBatch list, then it doesn't use any more memory than the ReplicatedLog. However, I think we'll still run into a similar problem with caching all the prepared DOM write transactions in the ShardRecoveryCoordinator until onRecoveryComplete. I think we need to immediately prepare and commit each batched transaction to free up the memory. |
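A hedged sketch of the idea in this comment: commit each recovered batch immediately instead of caching prepared write transactions until onRecoveryComplete. The WriteTransaction interface below is a hypothetical stand-in for the data store's transaction API, not the real MD-SAL interface.

```java
// Hypothetical stand-in types; not MD-SAL code.
import java.util.List;
import java.util.function.Supplier;

interface WriteTransaction<T> {
    void write(T modification);
    void commit(); // synchronous commit, for illustration only
}

class ImmediateCommitRecovery<T> {
    private final Supplier<WriteTransaction<T>> txFactory;

    ImmediateCommitRecovery(Supplier<WriteTransaction<T>> txFactory) {
        this.txFactory = txFactory;
    }

    // Apply and commit a batch as soon as it is full, rather than keeping the
    // prepared transaction around until recovery completes, so the batch's
    // payloads become eligible for garbage collection right away.
    void applyAndCommit(List<T> batch) {
        WriteTransaction<T> tx = txFactory.get();
        for (T modification : batch) {
            tx.write(modification);
        }
        tx.commit();
        batch.clear();
    }
}
```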
| Comment by Tom Pantelis [ 15/Mar/15 ] |
|
(In reply to Tom Pantelis from comment #1)

I made the first 2 changes below and recovery completed successfully. There were 306 log entries, each with 20K merge mods (one for each car), so there were actually over 6M car objects total in the persisted journal (I had run my stress test multiple times), although only the last 1M were current (the test overwrites the cars container). This equates to about 612M used by the ReplicatedLog. All of the merge mods were committed in one huge transaction, which took about 1.3 minutes to complete. Memory spiked over 2.5G and eventually settled to ~2G.

I then changed the ShardRecoveryCoordinator to commit each log entry batch immediately and lowered the shardJournalRecoveryLogBatchSize to 100 so all entries would be committed in 4 transactions. This took a little over a minute. The memory profile was about the same. So trimming the in-memory log on recovery would eliminate the 612M used by the ReplicatedLog.

It seems like it should've snapshotted, though, when it originally ran with over 600M in the log. With single-node (no followers), we keep track of dataSizeSinceLastSnapshot; however, we divide it by 5 for the memory check. This is a hard-coded heuristic so we don't snapshot too often, as we really don't need snapshots to trim the in-memory log. However, this prevents trimming the persisted log. I can try it without dividing by 5 to see at what point it would take a snapshot. |
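For reference, a rough sketch of the divide-by-5 heuristic described in this comment. Field and method names are illustrative assumptions, not the actual RaftActor code; only dataSizeSinceLastSnapshot and the divide-by-5 behavior come from the comment.

```java
// Illustrative sketch of the snapshot heuristic described above; names other
// than dataSizeSinceLastSnapshot are hypothetical.
class SnapshotHeuristic {
    private long dataSizeSinceLastSnapshot;

    void onEntryPersisted(long entrySizeBytes) {
        dataSizeSinceLastSnapshot += entrySizeBytes;
    }

    // With no followers (single node), the accumulated size is divided by 5
    // before the memory-threshold comparison, so snapshots are taken far less
    // often - which also means the persisted journal is rarely trimmed.
    boolean shouldSnapshot(boolean hasFollowers, long memoryThresholdBytes) {
        long effectiveSize = hasFollowers
                ? dataSizeSinceLastSnapshot
                : dataSizeSinceLastSnapshot / 5;
        return effectiveSize > memoryThresholdBytes;
    }
}
```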
| Comment by Moiz Raja [ 31/Mar/15 ] |