[CONTROLLER-1209] Clustering: Recovering a large journal runs out of memory Created: 15/Mar/15  Updated: 25/Jul/23  Resolved: 31/Mar/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2847

 Description   

I created 1 million data objects (cars) in the CDS (single-node) using 50 concurrent transactions, each writing 20K entries, and this worked fine - memory usage stayed under 0.5G with a 2G max heap, so the in-memory journal log trimming did its job. It didn't snapshot because it never reached 20K journal entries (due to the batched modifications) and never tripped the 70% memory threshold.
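(For context, the snapshot decision during normal operation is effectively a two-part trigger: an entry-count limit on the in-memory journal plus a memory threshold. The sketch below only illustrates that shape; the constant names are placeholders and the assumption that the 70% check is applied against the max heap is mine, not taken from the controller source.)

final class SnapshotTriggerSketch {
    // Placeholder names; the real configuration keys and comparison may differ.
    private static final long SNAPSHOT_JOURNAL_ENTRY_LIMIT = 20_000;
    private static final double MEMORY_THRESHOLD = 0.70;

    static boolean shouldSnapshot(long journalEntryCount, long journalDataSizeBytes, long maxHeapBytes) {
        return journalEntryCount >= SNAPSHOT_JOURNAL_ENTRY_LIMIT
            || journalDataSizeBytes > (long) (maxHeapBytes * MEMORY_THRESHOLD);
    }

    public static void main(String[] args) {
        // In a run like the one above, the entry count stays well under 20K
        // (batched modifications) and usage stays under 0.5G of a 2G heap,
        // so neither trigger fires and no snapshot is taken.
        long twoGigs = 2L * 1024 * 1024 * 1024;
        System.out.println(shouldSnapshot(306, 500L * 1024 * 1024, twoGigs)); // false
    }
}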

However, when I restarted the controller it ran out of memory trying to recover the journal:

Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-18] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-24] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readQName(NormalizedNodeInputStreamReader.java:223)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readPathArgument(NormalizedNodeInputStreamReader.java:350)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readYangInstanceIdentifierInternal(NormalizedNodeInputStreamReader.java:326)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readYangInstanceIdentifier(NormalizedNodeInputStreamReader.java:317)
at org.opendaylight.controller.cluster.datastore.utils.SerializationUtils.deserializePathAndNode(SerializationUtils.java:76)
at org.opendaylight.controller.cluster.datastore.modification.WriteModification.readExternal(WriteModification.java:62)
at org.opendaylight.controller.cluster.datastore.modification.MergeModification.fromStream(MergeModification.java:59)
at org.opendaylight.controller.cluster.datastore.modification.MutableCompositeModification.readExternal(MutableCompositeModification.java:95)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.opendaylight.controller.cluster.datastore.modification.ModificationPayload.getModification(ModificationPayload.java:46)
at org.opendaylight.controller.cluster.datastore.Shard.appendRecoveredLogEntry(Shard.java:726)
at org.opendaylight.controller.cluster.raft.RaftActor.batchRecoveredLogEntry(RaftActor.java:262)
at org.opendaylight.controller.cluster.raft.RaftActor.onRecoveredApplyLogEntries(RaftActor.java:247)
at org.opendaylight.controller.cluster.raft.RaftActor.handleRecover(RaftActor.java:178)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveRecover(AbstractUntypedPersistentActor.java:51)
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveRecover(Shard.java:245)
at akka.persistence.UntypedPersistentActor$$anonfun$receiveRecover$1.applyOrElse(Eventsourced.scala:433)

I increased the max JVM memory to 4G and still ran out of memory.

Looking at the heap dump, the currentLogRecoveryBatch retained over 2G, as did the ReplicatedLog journal (although they share the MutableCompositeModification instances). Both had 126 entries, with each MutableCompositeModification instance retaining ~21M, so recovery didn't get very far.

We keep applying entries to the Shard's currentLogRecoveryBatch until we reach the shardJournalRecoveryLogBatchSize, at which point we queue up a transaction. One thing I noticed is that we don't clear the currentLogRecoveryBatch when we end a batch - this is a bug. However, we never hit the end of a batch before running out of memory because shardJournalRecoveryLogBatchSize is 5000 by default.
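To make the batching flow concrete, here is a minimal sketch of the recovery loop with the missing reset added; the class, method, and field names are illustrative and not lifted from Shard.java:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: batch recovered journal entries and reset the batch
// once it has been handed off, so the Shard does not keep retaining old entries.
final class RecoveryBatchingSketch<T> {
    private final int batchSize; // plays the role of shardJournalRecoveryLogBatchSize
    private List<T> currentLogRecoveryBatch = new ArrayList<>();

    RecoveryBatchingSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    void appendRecoveredLogEntry(T payload) {
        currentLogRecoveryBatch.add(payload);
        if (currentLogRecoveryBatch.size() >= batchSize) {
            applyCurrentBatch();
        }
    }

    private void applyCurrentBatch() {
        List<T> batch = currentLogRecoveryBatch;
        submitToCoordinator(batch);
        // Start a fresh list - this is the reset that was reported missing above.
        currentLogRecoveryBatch = new ArrayList<>();
    }

    private void submitToCoordinator(List<T> batch) {
        // Placeholder for handing the batch to the recovery coordinator.
    }
}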

I dropped it to 10 and it got farther (136 entries) but still ran out of memory. Now the ShardRecoveryCoordinator retains ~2G, as it captures the current currentLogRecoveryBatch. The ReplicatedLog only retains ~467M because it stores serialized ModificationPayload instances, which are much smaller (~2M each).

It seems we also need to trim the in-memory log as we're recovering it. We trim the log as transactions are committed live; in fact, for single-node we don't even retain the in-memory log.

I think there are several things we need to address here. Here are my initial thoughts:

  • Null out currentLogRecoveryBatch in applyCurrentLogRecoveryBatch after submitting to the ShardRecoveryCoordinator.
  • Take memory usage into account when batching recovered journal entries (see the sketch after this list).
  • Store the ModificationPayload instances in currentLogRecoveryBatch instead of the de-serialized instances, and de-serialize when applying to the transaction.
  • The ShardRecoveryCoordinator uses an executor to prepare write transactions in parallel and then commits them all on recovery complete. Do we need to do this? It requires significant memory overhead when recovering a large journal and may not scale well. It's probably OK to commit immediately.
  • Lower the default shardJournalRecoveryLogBatchSize (e.g. to 100).
  • Trim the in-memory log as we're recovering. For single-node this is easy; for a cluster we'd need to take into account the previous replicatedToAllIndex, so it would have to be persisted. ApplyJournalEntries seems like a good candidate.
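As a rough illustration of the memory-aware batching idea from the list above, a batch could be cut on either entry count or accumulated serialized size. Everything here (names, the 16M cap) is hypothetical, not the controller's code:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: flush a recovery batch on entry count or accumulated bytes,
// whichever limit is hit first, keeping only serialized payloads until apply time.
final class MemoryAwareRecoveryBatch {
    private static final int MAX_BATCH_ENTRIES = 100;              // e.g. a lowered default
    private static final long MAX_BATCH_BYTES = 16L * 1024 * 1024; // illustrative cap only

    private final List<byte[]> serializedPayloads = new ArrayList<>();
    private long batchBytes;

    void append(byte[] serializedPayload) {
        serializedPayloads.add(serializedPayload);
        batchBytes += serializedPayload.length;
        if (serializedPayloads.size() >= MAX_BATCH_ENTRIES || batchBytes >= MAX_BATCH_BYTES) {
            flush();
        }
    }

    private void flush() {
        // De-serialize and apply the batch to a write transaction here, then release it.
        serializedPayloads.clear();
        batchBytes = 0;
    }
}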

I can look at this more when I get back.



 Comments   
Comment by Tom Pantelis [ 15/Mar/15 ]

The main issues here are the first and third bullet items above. If we store the serialized ModificationPayload instances, which are the same instances stored in the ReplicatedLog, in the currentLogRecoveryBatch list, then it doesn't use any more memory than the ReplicatedLog.

However, I think we'll still run into a similar problem with caching all the prepared DOM write transactions in the ShardRecoveryCoordinator until onRecoveryComplete. I think we need to immediately prepare and commit each batched transaction to free up the memory.
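A minimal sketch of that shape, using hypothetical stand-ins for the DOM transaction API rather than the actual ShardRecoveryCoordinator code:

import java.util.List;

// Hypothetical stand-ins for the DOM store write-transaction API.
interface WriteTransactionStub {
    void write(Object modification);
    void commitBlocking();
}

interface TransactionFactoryStub {
    WriteTransactionStub newWriteTransaction();
}

// Sketch: apply and commit each recovered batch as it arrives, so no prepared
// transactions (or their de-serialized modifications) pile up until onRecoveryComplete.
final class EagerRecoveryCommitSketch {
    private final TransactionFactoryStub txFactory;

    EagerRecoveryCommitSketch(TransactionFactoryStub txFactory) {
        this.txFactory = txFactory;
    }

    void applyRecoveredBatch(List<Object> deserializedModifications) {
        WriteTransactionStub tx = txFactory.newWriteTransaction();
        for (Object modification : deserializedModifications) {
            tx.write(modification);
        }
        tx.commitBlocking(); // commit now; the batch becomes collectable immediately
    }
}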

Comment by Tom Pantelis [ 15/Mar/15 ]

I made the first 2 changes below and recovery completed successfully. There were 306 log entries, each with 20K merge mods (one for each car), so there were actually over 6M car objects total in the persisted journal (I had run my stress test multiple times), although only the last 1M were current (the test overwrites the cars container). This equates to about 612M used by the ReplicatedLog. All of the merge mods were committed in one huge transaction, which took about 1.3 minutes to complete. Memory spiked over 2.5G and eventually settled to ~2G.

I then changed the ShardRecoveryCoordinator to commit each batch of log entries immediately and lowered the shardJournalRecoveryLogBatchSize to 100, so all entries were committed in 4 transactions. This took a little over a minute, and the memory profile was about the same.

So trimming the in-memory log on recovery would eliminate the 612M used by the ReplicatedLog.

It seems like it should've snapshotted, though, when it originally ran with over 600M in the log. With single-node (no followers), we keep track of dataSizeSinceLastSnapshot, but we divide it by 5 for the memory check. This is a hard-coded heuristic so we don't snapshot too often, since we don't really need snapshots to trim the in-memory log. However, this prevents trimming the persisted log. I can try it without dividing by 5 to see at what point it would take a snapshot.
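As a back-of-envelope check of that heuristic - assuming the memory check compares dataSizeSinceLastSnapshot / 5 against 70% of the max heap (that exact comparison is my assumption) and using the 2G heap and ~612M journal figures from above:

final class SnapshotHeuristicMath {
    public static void main(String[] args) {
        // Assumption: (dataSizeSinceLastSnapshot / 5) is compared against 70% of the max heap.
        long maxHeapBytes = 2L * 1024 * 1024 * 1024;   // 2G max heap from the original run
        long dataSinceSnapshot = 612L * 1024 * 1024;   // ~612M of journal data
        long memoryBar = (long) (maxHeapBytes * 0.70); // ~1.4G

        // With the divide-by-5, the journal would need to grow past roughly
        // 5 * 1.4G ~= 7G before a snapshot fires.
        System.out.println(dataSinceSnapshot / 5 > memoryBar); // false: ~122M vs ~1.4G
        // Even without the divide, ~612M is still short of the ~1.4G bar, hence
        // the idea of measuring at what point it actually snapshots.
        System.out.println(dataSinceSnapshot > memoryBar);     // false: ~612M vs ~1.4G
    }
}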


Comment by Moiz Raja [ 31/Mar/15 ]

https://git.opendaylight.org/gerrit/#/c/16540/
