Details
- Type: Bug
- Status: Resolved
- Resolution: Done
- Operating System: All
- Platform: All
- 2847
Description
I created 1 million data objects (cars) with 50 concurrent transactions, each writing 20K entries, in the CDS (single-node). This worked fine - memory usage stayed under 0.5G with a 2G max heap, so the in-memory journal log trimming did its job. It didn't snapshot because it never reached 20K journal entries (due to the batched modifications) and never tripped the 70% memory threshold.
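For reference, the snapshot decision that the test never triggered roughly boils down to two checks. This is a hypothetical sketch, not the actual code - only the 20K entry count and the 70%-of-heap figure come from the numbers above:

```java
// Illustrative only: snapshot when the journal grows past a fixed entry count
// or when used heap crosses a fraction of the max heap.
class SnapshotTriggerSketch {
    static final long SNAPSHOT_JOURNAL_ENTRY_COUNT = 20_000; // figure from the report
    static final double MEMORY_THRESHOLD = 0.70;             // figure from the report

    static boolean shouldSnapshot(long journalEntryCount) {
        Runtime rt = Runtime.getRuntime();
        long usedHeap = rt.totalMemory() - rt.freeMemory();
        return journalEntryCount >= SNAPSHOT_JOURNAL_ENTRY_COUNT
                || usedHeap > rt.maxMemory() * MEMORY_THRESHOLD;
    }
}
```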
However, when I restarted the controller it ran out of memory trying to recover the journal:
Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-18] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-24] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readQName(NormalizedNodeInputStreamReader.java:223)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readPathArgument(NormalizedNodeInputStreamReader.java:350)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readYangInstanceIdentifierInternal(NormalizedNodeInputStreamReader.java:326)
at org.opendaylight.controller.cluster.datastore.node.utils.stream.NormalizedNodeInputStreamReader.readYangInstanceIdentifier(NormalizedNodeInputStreamReader.java:317)
at org.opendaylight.controller.cluster.datastore.utils.SerializationUtils.deserializePathAndNode(SerializationUtils.java:76)
at org.opendaylight.controller.cluster.datastore.modification.WriteModification.readExternal(WriteModification.java:62)
at org.opendaylight.controller.cluster.datastore.modification.MergeModification.fromStream(MergeModification.java:59)
at org.opendaylight.controller.cluster.datastore.modification.MutableCompositeModification.readExternal(MutableCompositeModification.java:95)
at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.opendaylight.controller.cluster.datastore.modification.ModificationPayload.getModification(ModificationPayload.java:46)
at org.opendaylight.controller.cluster.datastore.Shard.appendRecoveredLogEntry(Shard.java:726)
at org.opendaylight.controller.cluster.raft.RaftActor.batchRecoveredLogEntry(RaftActor.java:262)
at org.opendaylight.controller.cluster.raft.RaftActor.onRecoveredApplyLogEntries(RaftActor.java:247)
at org.opendaylight.controller.cluster.raft.RaftActor.handleRecover(RaftActor.java:178)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveRecover(AbstractUntypedPersistentActor.java:51)
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveRecover(Shard.java:245)
at akka.persistence.UntypedPersistentActor$$anonfun$receiveRecover$1.applyOrElse(Eventsourced.scala:433)
I increased the max JVM memory to 4G and still ran out of memory.
Looking at the heap dump, I saw that the currentLogRecoveryBatch retained over 2G, as did the ReplicatedLog journal (although they share the MutableCompositeModification instances). Both had 126 entries, with each MutableCompositeModification instance retaining ~21M, so recovery didn't get very far.
We keep applying entries to the Shard's currentLogRecoveryBatch until we reach the shardJournalRecoveryLogBatchSize, at which point we queue up a transaction. One thing I noticed is that we don't clear the currentLogRecoveryBatch when we end a batch - this is a bug. However, we never hit the end of a batch before running out of memory because shardJournalRecoveryLogBatchSize is 5000 by default.
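A simplified sketch of that batching flow with the missing clearing step added. The field/method names mirror the ones mentioned above, but the surrounding class and the coordinator interface are stand-ins, not the actual Shard/ShardRecoveryCoordinator code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the recovery batching flow described above.
class RecoveryBatchingSketch {

    interface RecoveryCoordinator {
        void submit(List<Object> batch); // stand-in for the ShardRecoveryCoordinator
    }

    private final RecoveryCoordinator coordinator;
    private final int shardJournalRecoveryLogBatchSize;
    private List<Object> currentLogRecoveryBatch = new ArrayList<>();

    RecoveryBatchingSketch(RecoveryCoordinator coordinator, int batchSize) {
        this.coordinator = coordinator;
        this.shardJournalRecoveryLogBatchSize = batchSize;
    }

    void appendRecoveredLogEntry(Object modification) {
        currentLogRecoveryBatch.add(modification);
        if (currentLogRecoveryBatch.size() >= shardJournalRecoveryLogBatchSize) {
            applyCurrentLogRecoveryBatch();
        }
    }

    void applyCurrentLogRecoveryBatch() {
        coordinator.submit(currentLogRecoveryBatch);
        // The missing step: start a fresh list after submitting so entries already
        // handed to the coordinator are no longer also retained by the shard's batch.
        currentLogRecoveryBatch = new ArrayList<>();
    }
}
```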
I dropped it to 10 and recovery got farther (136 entries) but still ran out of memory. Now the ShardRecoveryCoordinator retains ~2G, as it captures the currentLogRecoveryBatch contents. The ReplicatedLog only retains ~467M because it stores serialized ModificationPayload instances, which are much smaller (~2M each).
It seems we also need to trim the in-memory log as we're recovering it. We already trim the log as transactions are committed live; in fact, for single-node we don't even retain the in-memory log.
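As a purely illustrative sketch of what the trim point during recovery might look like (the helper and its parameters are hypothetical; the cluster caveat is expanded in the last bullet below):

```java
// Hypothetical helper: how far the in-memory log could safely be trimmed while recovering.
final class RecoveryTrimSketch {
    static long recoveryTrimIndex(boolean singleNode, long appliedIndex, long replicatedToAllIndex) {
        // A single node can drop everything it has already applied; in a cluster the trim
        // point must not pass the last index known to be replicated to all followers.
        return singleNode ? appliedIndex : Math.min(appliedIndex, replicatedToAllIndex);
    }
}
```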
I think there are several things we need to address here. Here are my initial thoughts:
- Null out currentLogRecoveryBatch in applyCurrentLogRecoveryBatch after submitting to the ShardRecoveryCoordinator.
- Take memory usage into account when batching recovered journal entries (see the sketch after this list).
- Store the ModificationPayload instances in currentLogRecoveryBatch instead of the de-serialized instances. De-serialize when applying to the transaction.
- The ShardRecoveryCoordinator uses an executor to prepare write transactions in parallel and then commits them all on recovery complete. Do we need to do this? It requires significant memory overhead when recovering a large journal and may not scale well. It's probably OK to commit immediately.
- Lower the default shardJournalRecoveryLogBatchSize (e.g. 100).
- Trim the in-memory log as we're recovering. For single-node this is easy. For a cluster, we'd need to take the previous replicatedToAllIndex into account, so it would have to be persisted - ApplyJournalEntries seems like a good candidate.
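As referenced in the second bullet, here is a hedged sketch of what memory-aware batching with a smaller default batch size could look like - the names, the 100-entry default, and the 0.70 threshold are illustrative assumptions, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical memory-aware recovery batching: flush the batch either when it reaches a
// much smaller size limit or when used heap crosses a fraction of the max heap.
class MemoryAwareRecoveryBatching {
    static final int DEFAULT_BATCH_SIZE = 100;
    static final double HEAP_USAGE_THRESHOLD = 0.70;

    private List<Object> batch = new ArrayList<>();

    void append(Object recoveredEntry) {
        batch.add(recoveredEntry);
        if (batch.size() >= DEFAULT_BATCH_SIZE || heapUsageRatio() > HEAP_USAGE_THRESHOLD) {
            flush();
        }
    }

    private static double heapUsageRatio() {
        Runtime rt = Runtime.getRuntime();
        return (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
    }

    private void flush() {
        // Hand the current batch off (e.g. to the recovery coordinator / a transaction)
        // and start a new one so the entries are not retained twice.
        applyBatch(batch);
        batch = new ArrayList<>();
    }

    private void applyBatch(List<Object> entries) {
        // Placeholder for submitting the batch.
    }
}
```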
I can look at this more when I get back.