[CONTROLLER-1575] Akka artery fails with java.lang.OutOfMemoryError: Direct buffer memory Created: 10/Jan/17  Updated: 19/Oct/17  Resolved: 19/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: 0.4.0
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 7518

 Description   

This is tracked here:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-carbon/

Looking at the console of a failing test:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/controller-csit-3node-clustering-only-carbon/478/console

I can see:

15:30:51 2017-01-10 15:21:02,052 | ERROR | ult-dispatcher-3 | ActorSystemImpl | 203 - com.typesafe.akka.slf4j - 2.4.16 | Uncaught error from thread [opendaylight-cluster-data-akka.remote.default-remote-dispatcher-10] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled
15:30:51 2017-01-10 15:21:02,070 | ERROR | ult-dispatcher-3 | kka://opendaylight-cluster-data) | 203 - com.typesafe.akka.slf4j - 2.4.16 | Aeron error, Direct buffer memory
15:30:51 java.lang.OutOfMemoryError: Direct buffer memory
15:30:51 at java.nio.Bits.reserveMemory(Bits.java:693)[:1.8.0_111]
15:30:51 at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)[:1.8.0_111]
15:30:51 at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)[:1.8.0_111]
15:30:51 at akka.remote.artery.EnvelopeBufferPool.acquire(BufferPool.scala:40)[215:com.typesafe.akka.remote:2.4.16]
15:30:51 at akka.remote.artery.AeronSource$Fragments$$anon$2.onFragment(AeronSource.scala:59)[215:com.typesafe.akka.remote:2.4.16]
15:30:51 at io.aeron.FragmentAssembler.onFragment(FragmentAssembler.java:82)[218:io.aeron.client:1.0.4]
15:30:51 at io.aeron.logbuffer.TermReader.read(TermReader.java:74)[218:io.aeron.client:1.0.4]
15:30:51 at io.aeron.Image.poll(Image.java:211)[218:io.aeron.client:1.0.4]
15:30:51 at io.aeron.Subscription.poll(Subscription.java:132)[218:io.aeron.client:1.0.4]
15:30:51 at akka.remote.artery.AeronSource$$anonfun$akka$remote$artery$AeronSource$$pollTask$1.apply$mcZ$sp(AeronSource.scala:38)[215:com.typesafe.akka.remote:2.4.16]
15:30:51 at akka.remote.artery.TaskRunner.executeTasks(TaskRunner.scala:171)[215:com.typesafe.akka.remote:2.4.16]
15:30:51 at akka.remote.artery.TaskRunner.run(TaskRunner.scala:150)[215:com.typesafe.akka.remote:2.4.16]
15:30:51 at java.lang.Thread.run(Thread.java:745)[:1.8.0_111]
15:30:51 (The same "Aeron error, Direct buffer memory" OutOfMemoryError and stack trace repeat at 15:21:04,161, 15:21:04,909, 15:21:05,638 and 15:21:06,401.)
15:30:51 2017-01-10 15:21:07,461 | WARN | saction-30-31'}} | DeadlockMonitor | 151 - org.opendaylight.controller.config-manager - 0.6.0.SNAPSHOT | ModuleIdentifier{factoryName='binding-broker-impl', instanceName='binding-broker-impl'} did not finish after 9920 ms

We can also see the used config files in the console:

Dump akka.conf

odl-cluster-data {
  akka {
    remote {
      artery {
        enabled = on
        canonical.hostname = "10.29.12.251"
        canonical.port = 2550
      }
    }

    cluster {
      seed-nodes = ["akka://opendaylight-cluster-data@10.29.12.251:2550",
                    "akka://opendaylight-cluster-data@10.29.12.96:2550",
                    "akka://opendaylight-cluster-data@10.29.12.146:2550"]

      roles = ["member-1"]
    }

    persistence {
      # By default the snapshots/journal directories live in KARAF_HOME. You can choose to put it somewhere else by
      # modifying the following two properties. The directory location specified may be a relative or absolute path.
      # The relative path is always relative to KARAF_HOME.

      # snapshot-store.local.dir = "target/snapshots"
      # journal.leveldb.dir = "target/journal"

      journal {
        leveldb {
          # Set native = off to use a Java-only implementation of leveldb.
          # Note that the Java-only version is not currently considered by Akka to be production quality.

          # native = off
        }
      }
    }
  }
}

Dump modules.conf

modules = [
  {
    name = "inventory"
    namespace = "urn:opendaylight:inventory"
    shard-strategy = "module"
  },
  {
    name = "topology"
    namespace = "urn:TBD:params:xml:ns:yang:network-topology"
    shard-strategy = "module"
  },
  {
    name = "toaster"
    namespace = "http://netconfcentral.org/ns/toaster"
    shard-strategy = "module"
  },
  {
    name = "car"
    namespace = "urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car"
    shard-strategy = "module"
  },
  {
    name = "people"
    namespace = "urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:people"
    shard-strategy = "module"
  },
  {
    name = "car-people"
    namespace = "urn:opendaylight:params:xml:ns:yang:controller:config:sal-clustering-it:car-people"
    shard-strategy = "module"
  }
]

Dump module-shards.conf

module-shards = [
  {
    name = "default"
    shards = [
      {
        name = "default"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "inventory"
    shards = [
      {
        name = "inventory"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "topology"
    shards = [
      {
        name = "topology"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "toaster"
    shards = [
      {
        name = "toaster"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "car"
    shards = [
      {
        name = "car"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "people"
    shards = [
      {
        name = "people"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  },
  {
    name = "car-people"
    shards = [
      {
        name = "car-people"
        replicas = ["member-1", "member-2", "member-3"]
      }
    ]
  }
]



 Comments   
Comment by Robert Varga [ 11/Jan/17 ]

What is the RAM/disk situation on the VM running the cluster nodes?

This looks like we are running out of OS memory or disk space for aeron files – more likely the former.

It may also be the case that we are hitting the MaxDirectMemorySize limit; what is the output of 'java -XX:+PrintFlagsFinal -version | fgrep MaxDirect' on the target environment?
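
For reference, a quick way to check the cap and, if needed, raise it (a sketch only; the 512m figure and the EXTRA_JAVA_OPTS mechanism are illustrative assumptions, not values taken from this ticket):

  # Print the JVM's effective direct-memory flag on the target host
  java -XX:+PrintFlagsFinal -version | fgrep MaxDirectMemorySize

  # Raise the cap for the controller JVM, e.g. via Karaf's setenv
  # (assuming the launcher honors EXTRA_JAVA_OPTS)
  export EXTRA_JAVA_OPTS="-XX:MaxDirectMemorySize=512m"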

Comment by Robert Varga [ 11/Jan/17 ]

The stack trace shows that artery is attempting to allocate a direct (i.e. off-heap) buffer and failing, which points to virtual memory exhaustion on the OS side.

We need to understand whether the problem comes from too many buffers (i.e. a leak) or from the OS being unable to satisfy the default pool size of 128 × 256 KiB + 32 × 2 MiB.
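
Taking those numbers at face value (simple arithmetic, not a measurement): 128 × 256 KiB = 32 MiB for the regular envelope pool, plus 32 × 2 MiB = 64 MiB for the large-message pool, i.e. up to 96 MiB of direct memory for artery alone, before any other direct-buffer users are counted.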

Comment by Robert Varga [ 11/Jan/17 ]

As a further monitoring request for CSIT jobs: aside from monitoring the heap, can we also monitor the java.nio:type=BufferPool,name=direct and java.nio:type=BufferPool,name=mapped MBeans?

The former would be useful in identifying this allocation failure.
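
For whoever picks that up, a minimal sketch of reading the same pools in-process via the standard java.lang.management API (the class name is made up for illustration; over JMX the identical data is exposed as the java.nio:type=BufferPool MBeans named above):

  import java.lang.management.BufferPoolMXBean;
  import java.lang.management.ManagementFactory;

  public final class BufferPoolProbe {
      public static void main(String[] args) {
          // Each pool reports its buffer count plus the bytes used and total capacity.
          for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
              System.out.printf("%s: count=%d used=%d bytes, capacity=%d bytes%n",
                  pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
          }
      }
  }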

Comment by Vratko Polak [ 11/Jan/17 ]

Projects were reverted back from artery to netty for now.
Controller: https://git.opendaylight.org/gerrit/50220
Integration/Distribution: https://git.opendaylight.org/gerrit/50221

Comment by Tom Pantelis [ 11/Jan/17 ]

The problem is that I had set maximum-frame-size (i.e. the maximum serialized message size) to 1 GiB to allow for large messages. Little did I know that akka allocates direct ByteBuffers sized by maximum-frame-size from a reused pool (governed by buffer-pool-size). The default settings are

maximum-frame-size = 256 KiB
buffer-pool-size = 128

which works out to ~32 MiB, safely below the default MaxDirectMemorySize of 64 MiB. But with a maximum-frame-size of 1 GiB, the node dies on the first ByteBuffer allocation because MaxDirectMemorySize is exceeded.
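
For orientation, these knobs live under akka.remote.artery.advanced; a sketch of the relevant section of the Akka 2.4 reference.conf defaults (quoted for context, not a recommended configuration):

  akka.remote.artery.advanced {
    # Each pooled envelope buffer is allocated at this size,
    # so direct memory use scales with it
    maximum-frame-size = 256 KiB
    buffer-pool-size = 128
    # A separate pool serves configured large-message destinations
    maximum-large-frame-size = 2 MiB
    large-buffer-pool-size = 32
  }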

The message size limit is problematic for ReadDataReply messages as they can be arbitrarily large and we can't assume any reasonable limit. With netty tcp we set maximum-frame-size to ~400 MiB (I don't know why that value was chosen), which provides a decent cushion (so far I've only seen one case where it was exceeded). But setting it that high with artery is problematic. We could increase MaxDirectMemorySize to 512 MiB or more and fiddle with maximum-frame-size/buffer-pool-size appropriately, but that's still fragile. Moreover, buffer-pool-size "is not a hard upper limit on number of created buffers. Additional buffers will be created if needed", and, if so, they're just GCed and not reused in the pool. So it seems akka has no protection against exceeding MaxDirectMemorySize (e.g. by falling back to heap ByteBuffers). They contend you should not send large messages but should chunk or stream by other means. It seems we'll have to do that for ReadDataReply messages and, in fact, probably any message containing a NormalizedNode (we already do that with InstallSnapshot in raft). Perhaps this can be built into the new FE/BE stuff.

Until we have a solution for this we'll have to hold off on using artery.

Comment by Tom Pantelis [ 11/Jan/17 ]

I created an issue in akka: https://github.com/akka/akka/issues/22133

Comment by Robert Varga [ 11/Jan/17 ]

Agreed, the artery codebase looks ... strange, to say the least. When the buffer is allocated, the message has already been reassembled and its size is known, hence a pool of buffers of differing (power-of-two) sizes would be a more efficient use of memory.

Overall I think we need to dig very deep into akka for alternative ways of interfacing with remoting, as the current mode of operation and the abstraction layers provided are more harmful than helpful.

Comment by Robert Varga [ 11/Jan/17 ]

To elaborate: the entire akka.serialization.Serializer concept revolves around byte[], which means the message has to be reassembled into a contiguous memory area. Direct buffers or not, that is completely wasteful.

Hence it would be preferable if akka could pin the Aeron buffers (hey, they are memory-mapped files), do the minimum of its processing on the envelope and then present the scattered buffers as an InputStream for deserialization.

Comment by Tom Pantelis [ 11/Jan/17 ]

Yup. Their remoting framework is designed for smallish messages. In order to use it we'll have to follow their "rules/guidelines" and chunk larger messages, or find some other way (unknown at this point) to interface with remoting, which will likely take a lot of effort and require significant refactoring of our code.
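
To make the chunking idea concrete, a minimal sketch (the Chunker class is hypothetical, not existing controller code):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  final class Chunker {
      // Splits a serialized message into slices no larger than maxChunkSize,
      // so each remote envelope stays under maximum-frame-size.
      static List<byte[]> chunk(byte[] serialized, int maxChunkSize) {
          List<byte[]> chunks = new ArrayList<>();
          for (int offset = 0; offset < serialized.length; offset += maxChunkSize) {
              chunks.add(Arrays.copyOfRange(serialized, offset,
                  Math.min(serialized.length, offset + maxChunkSize)));
          }
          return chunks;
      }
  }

The receiver would then reassemble (and ideally ack) the slices, much like raft's InstallSnapshot chunking mentioned above.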

Comment by Vladimir Lavor [ 16/Jan/17 ]

Tested with the reverted akka.conf: every node started without an OutOfMemoryError using the default heap memory value.

One note: change https://git.opendaylight.org/gerrit/#/c/50129/ should be reverted as well.

After startup, and sometimes when there is a DS action, this warning can be seen (it was not there before):
2017-01-16 16:11:57,360 | WARN | lt-dispatcher-26 | ShardManager | 213 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Supervisor Strategy caught unexpected exception - resuming
java.lang.IllegalArgumentException: Unhandled node type 2

Log: http://pastebin.com/68NH2Yic

This appears also in single odl instance after restart as described here: https://bugs.opendaylight.org/show_bug.cgi?id=7515

Full logs:
member-1: http://pastebin.com/8Jv1JhSn
member-2: http://pastebin.com/4BwU7d3z
member-3: http://pastebin.com/e6ZEkAN0

Comment by Tom Pantelis [ 16/Jan/17 ]

Looks like it needs to handle ModificationType.UNMODIFIED.
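
For illustration, the fix is of the following shape (a sketch against the yangtools DataTreeCandidateNode API; the surrounding class and method are hypothetical):

  import org.opendaylight.yangtools.yang.data.api.schema.tree.DataTreeCandidateNode;

  final class CandidateNodeHandler {
      void onNode(DataTreeCandidateNode node) {
          switch (node.getModificationType()) {
              case UNMODIFIED:
                  // An untouched subtree: skip it instead of falling into
                  // the "Unhandled node type" error path.
                  break;
              default:
                  // existing handling of WRITE/DELETE/SUBTREE_MODIFIED/... goes here
                  break;
          }
      }
  }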

Comment by Tom Pantelis [ 17/Jan/17 ]

Submitted https://git.opendaylight.org/gerrit/#/c/50535/. This was actually introduced by the pipe-lining changes.

(In reply to Tom Pantelis from comment #11)
> Looks like it needs to handle ModificationType.UNMODIFIED.

Comment by Vratko Polak [ 17/Jan/17 ]

> Unhandled node type 2

That particular symptom is tracked as https://bugs.opendaylight.org/show_bug.cgi?id=7549

This bug should remain about artery and the OutOfMemoryError.

Comment by Ajay L [ 19/Sep/17 ]

If I follow correctly, artery was disabled as a short-term solution for this OOM issue. Is there anything planned to address the OOM and re-enable artery?

Comment by Tom Pantelis [ 19/Sep/17 ]

(In reply to Ajay L from comment #14)
> If I follow correctly, artery was disabled as a short-term solution for this
> OOM issue. Is there anything planned to address the OOM and re-enable artery?

Yeah - we'd like to re-enable artery at some point after the tell-based protocol is made the default. Of course it will require testing. Would you be able to work on this?

Comment by Ajay L [ 19/Sep/17 ]

Yes, we would definitely like to help with that. Let me know.
