[CONTROLLER-1983] Akka artery fails with java.lang.OutOfMemoryError: Direct buffer memory Created: 09/Jun/21 Updated: 21/Apr/22 Resolved: 13/Jan/22 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | None |
| Affects Version/s: | 4.0.0 |
| Fix Version/s: | 5.0.0 |
| Type: | Bug | Priority: | High |
| Reporter: | Ivan Hrasko | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | pt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | Screenshot_2021-06-22_16-22-46.png |
| Issue Links: |
| Description |
|
It is impossible to run a cluster node on a VM with less than 6 GiB RAM. When there is less memory we get:

```
2021-06-08T16:12:16,816 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | ActorSystemImpl | 203 - org.opendaylight.controller.repackaged-akka - 4.0.0.SNAPSHOT | Uncaught error from thread [opendaylight-cluster-data-akka.remote.default-remote-dispatcher-7]: Direct buffer memory, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:175) ~[?:?]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118) ~[?:?]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317) ~[?:?]
	at akka.remote.artery.EnvelopeBufferPool.acquire(EnvelopeBufferPool.scala:34) ~[bundleFile:?]
	at akka.remote.artery.Encoder$$anon$1.onPush(Codecs.scala:117) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:541) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:495) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:390) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:625) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:502) ~[bundleFile:?]
	at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:600) ~[bundleFile:?]
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:773) ~[bundleFile:?]
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$shortCircuitBatch(ActorGraphInterpreter.scala:762) ~[bundleFile:?]
	at akka.stream.impl.fusing.ActorGraphInterpreter.preStart(ActorGraphInterpreter.scala:753) ~[bundleFile:?]
	at akka.actor.Actor.aroundPreStart(Actor.scala:548) ~[bundleFile:?]
	at akka.actor.Actor.aroundPreStart$(Actor.scala:548) ~[bundleFile:?]
	at akka.stream.impl.fusing.ActorGraphInterpreter.aroundPreStart(ActorGraphInterpreter.scala:691) ~[bundleFile:?]
	at akka.actor.ActorCell.create(ActorCell.scala:641) ~[bundleFile:?]
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:513) [bundleFile:?]
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:535) [bundleFile:?]
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:295) [bundleFile:?]
	at akka.dispatch.Mailbox.run(Mailbox.scala:230) [bundleFile:?]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [bundleFile:?]
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
	at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
	at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
```
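The error originates from exhausting the JVM's direct-buffer budget (-XX:MaxDirectMemorySize, which defaults to the maximum heap size). A minimal sketch that reproduces the same failure mode outside of Akka; the class name and the allocation size are illustrative, with the 400 MB allocation mirroring the full-size envelope buffers Artery acquires from its pool:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with e.g. -XX:MaxDirectMemorySize=256m to trigger the same
// "java.lang.OutOfMemoryError: Direct buffer memory" from Bits.reserveMemory.
public final class DirectOom {
    public static void main(final String[] args) {
        final List<ByteBuffer> retained = new ArrayList<>();
        while (true) {
            // Each allocation reserves the full buffer size up front,
            // regardless of how much of it would actually be written.
            retained.add(ByteBuffer.allocateDirect(400 * 1024 * 1024));
        }
    }
}
```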
|
| Comments |
| Comment by Ivan Hrasko [ 09/Jun/21 ] |
|
This problem appeared after switching to Akka Artery and is similar to: https://jira.opendaylight.org/browse/CONTROLLER-1575 |
| Comment by Ivan Hrasko [ 22/Jun/21 ] |
|
I have made a test investigating the size of direct memory used versus direct memory allocated. Test preparation: VM configuration: 2048 MB RAM, 2 CPUs.

1. Deploy ODL; statistics were taken after Karaf starts. This step is without problems.
2. Install the clustering features (odl-netconf-clustered-topology, odl-restconf-nb-rfc8040, odl-restconf-nb-bierman02). During this step the OOM error appears and the cluster is not operable.

Statistics at the time of the crash: in the debugger (https://github.com/openjdk/jdk/commit/24c70fcf8845b447362aa1da9e672ca896c181c3) we can see:

```
2021-06-22T16:20:44.274684+02:00[Europe/Bratislava] :Cannot reserve 419430400 bytes of direct buffer memory (allocated: 839028241, limit: 1073741824)
```

In jconsole we can see the non-heap usage is less than 200 MB; see picture Screenshot_2021-06-22_16-22-46.png.

The result: I think that the application in fact needs less memory than Artery is trying to allocate, and this is caused by the frame size being set too high. Akka is trying to allocate the full size of the buffer (400 MB) regardless of the actual needs. We should use a lower frame size for the most common traffic and a large frame size only for actors whose traffic carries very large data. |
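For reference, the used-vs-allocated figures can also be sampled from inside the JVM via the standard BufferPoolMXBean; the "direct" pool corresponds to the allocated/limit numbers printed by java.nio.Bits. A small sketch:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Prints how much direct buffer memory the JVM has reserved vs. actually
// holds in live buffers, without attaching a debugger or jconsole.
public final class DirectMemoryStats {
    public static void main(final String[] args) {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.printf("direct buffers: count=%d used=%d capacity=%d%n",
                    pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }
}
```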
| Comment by Ivan Hrasko [ 23/Jun/21 ] |
|
IMO: we should decrease maximum-frame-size and use maximum-large-frame-size for the destinations that actually need large frames, as sketched below. When increasing maximum-large-frame-size we have to be aware of its direct-memory cost: Artery pools buffers of the full configured frame size. See also: https://doc.akka.io/docs/akka/current/general/configuration-reference.html |
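A sketch of what such tuning could look like when building the ActorSystem, using the Typesafe Config API that Akka is configured through; the concrete sizes and the destination path are illustrative assumptions, not the eventual fix:

```java
import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

// Illustrative override of the Artery frame sizes discussed above.
public final class ArteryTuning {
    public static void main(final String[] args) {
        final Config tuning = ConfigFactory.parseString(
            "akka.remote.artery.advanced.maximum-frame-size = 2 MiB\n"
            + "akka.remote.artery.advanced.maximum-large-frame-size = 128 MiB\n"
            // Hypothetical path; only destinations listed here use the
            // dedicated large-message lane and its larger buffer pool.
            + "akka.remote.artery.large-message-destinations = "
            + "[\"/user/shardmanager-*\"]");
        final ActorSystem system = ActorSystem.create("opendaylight-cluster-data",
            tuning.withFallback(ConfigFactory.load()));
        system.terminate();
    }
}
```

Only destinations matched by large-message-destinations go through the large-message lane; all remaining traffic is bounded by the (now smaller) maximum-frame-size.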
| Comment by Ivan Hrasko [ 23/Jun/21 ] |
|
rovarga: do we know which destinations (actors) require large frames? |
| Comment by Ivan Hrasko [ 23/Jun/21 ] |
|
Maybe the best would be to use streams: https://doc.akka.io/docs/akka/2.6.15/stream/stream-io.html?language=java |
| Comment by Robert Varga [ 12/Jul/21 ] |
|
Hmm, so these can be any of the AbstractRaftActor subclasses, as well as DataTreeChange actors. Probably others – essentially any actor which sends or receives any message containing a NormalizedNode. |
| Comment by Robert Varga [ 12/Jul/21 ] |
|
Actually, what we'd like to leverage is https://doc.akka.io/docs/akka/current/typed/reliable-delivery.html#chunk-large-messages, but that requires switching to Akka Typed, which is a rather large undertaking. Another option is to ditch Akka persistence entirely (which seems very likely, as it is a very bad fit). Depending on what that does to messaging patterns and replication capabilities, we may end up with something completely different. |
| Comment by Robert Varga [ 12/Jul/21 ] |
|
And yet another option (for FE/BE) is switching to the tell-based protocol, but that requires addressing
|
| Comment by Robert Varga [ 13/Jul/21 ] |
|
So the problem here is actually https://doc.akka.io/docs/akka/current/serialization.html. That API requires us to serialize every message into a byte[], and I think Akka will not fragment that byte[] into Aeron, i.e. it will always send it verbatim.

So here is where we really would like to interface Aeron directly with a Shard's journal. This makes utter sense, as we have immutable DTOs on input (NormalizedNode and such) and, I think, can serialize into any DataOutput. We can therefore transparently fragment each message as we are serializing it; it is just a matter of having the proper integration surface.

tcere this is the Aeron part we talked about today. We actually have the tools needed to do automatic fragmentation (https://github.com/opendaylight/controller/blob/master/opendaylight/md-sal/sal-clustering-commons/src/main/java/org/opendaylight/controller/cluster/io/ChunkedOutputStream.java#L62) and obviously defragmentation as well.
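A self-contained sketch of that idea (much simplified compared to the actual ChunkedOutputStream linked above): serialization writes through an OutputStream that cuts the output into fixed-size chunks as it goes, so no single contiguous byte[] of the full message size is ever materialized.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Splits whatever is written into fixed-size chunks during serialization,
// so fragmentation happens while serializing rather than after the fact.
final class ChunkingOutputStream extends OutputStream {
    private final List<byte[]> chunks = new ArrayList<>();
    private final int chunkSize;
    private byte[] current;
    private int offset;

    ChunkingOutputStream(final int chunkSize) {
        this.chunkSize = chunkSize;
        this.current = new byte[chunkSize];
    }

    @Override
    public void write(final int b) {
        if (offset == chunkSize) {
            // Current chunk is full: seal it and start a new one. A real
            // implementation could hand it to the transport at this point.
            chunks.add(current);
            current = new byte[chunkSize];
            offset = 0;
        }
        current[offset++] = (byte) b;
    }

    List<byte[]> finish() {
        final byte[] last = new byte[offset];
        System.arraycopy(current, 0, last, 0, offset);
        chunks.add(last);
        return chunks;
    }

    public static void main(final String[] args) throws IOException {
        final ChunkingOutputStream out = new ChunkingOutputStream(4);
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF("a message far larger than one chunk");
        }
        // Each chunk can be sent independently and reassembled on receive.
        System.out.println("chunks: " + out.finish().size());
    }
}
```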