[CONTROLLER-1498] Controller crashes with OOM. oldRoot/currentRoot, toString() consumes too much of memory. Created: 11/Mar/16  Updated: 25/Jul/23  Resolved: 14/Jun/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Anil Vishnoi Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive java_pid10114_Leak_Suspects.zip     Zip Archive java_pid10114_Top_Consumers.zip     Zip Archive node_1_odl_logs.zip     Zip Archive node_2_odl_logs.zip     Zip Archive node_3_odl_logs.zip    
Issue Links:
Blocks
is blocked by CONTROLLER-1506 CDS: refactor Shard to use behaviors Confirmed
External issue ID: 5504

 Description   

I am running controller in 3 node cluster mode, and when i shutdown the leader node, and try to do some transaction on data store, it fails because of the precondition check failure in a precondition check InMemoryDataTree.java:116. This is the exception i see at the time of OOM.

opendaylight-cluster-data-akka.actor.default-dispatcher-5
at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
at java.util.Arrays.copyOf([CI)[C (Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(I)V (AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(I)V (AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(Ljava/lang/String;)Ljava/lang/AbstractStringBuilder; (AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(Ljava/lang/String;)Ljava/lang/StringBuilder; (StringBuilder.java:136)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at java.util.AbstractMap.toString()Ljava/lang/String; (AbstractMap.java:536)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at com.google.common.base.MoreObjects$ToStringHelper.toString()Ljava/lang/String; (MoreObjects.java:359)
at org.opendaylight.yangtools.yang.data.api.schema.tree.spi.AbstractTreeNode.toString()Ljava/lang/String; (AbstractTreeNode.java:45)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at java.util.AbstractMap.toString()Ljava/lang/String; (AbstractMap.java:536)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at com.google.common.base.MoreObjects$ToStringHelper.toString()Ljava/lang/String; (MoreObjects.java:359)
at org.opendaylight.yangtools.yang.data.api.schema.tree.spi.AbstractTreeNode.toString()Ljava/lang/String; (AbstractTreeNode.java:45)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at java.util.AbstractMap.toString()Ljava/lang/String; (AbstractMap.java:536)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at com.google.common.base.MoreObjects$ToStringHelper.toString()Ljava/lang/String; (MoreObjects.java:359)
at org.opendaylight.yangtools.yang.data.api.schema.tree.spi.AbstractTreeNode.toString()Ljava/lang/String; (AbstractTreeNode.java:45)
at java.lang.String.valueOf(Ljava/lang/Object;)Ljava/lang/String; (String.java:2994)
at java.lang.StringBuilder.append(Ljava/lang/Object;)Ljava/lang/StringBuilder; (StringBuilder.java:131)
at com.google.common.base.Preconditions.format(Ljava/lang/String;[Ljava/lang/Object;)Ljava/lang/String; (Preconditions.java:422)
at com.google.common.base.Preconditions.checkState(ZLjava/lang/String;[Ljava/lang/Object;)V (Preconditions.java:197)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree.commit(Lorg/opendaylight/yangtools/yang/data/api/schema/tree/DataTreeCandidate;)V (InMemoryDataTree.java:116)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit()Lcom/google/common/util/concurrent/ListenableFuture; (SimpleShardDataTreeCohort.java:102)
at org.opendaylight.controller.cluster.datastore.ChainedCommitCohort.commit()Lcom/google/common/util/concurrent/ListenableFuture; (ChainedCommitCohort.java:32)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$CohortEntry.commit()V (ShardCommitCoordinator.java:670)
at org.opendaylight.controller.cluster.datastore.Shard.finishCommit(Lakka/actor/ActorRef;Ljava/lang/String;Lorg/opendaylight/controller/cluster/datastore/ShardCommitCoordinator$CohortEntry;)V (Shard.java:348)
at org.opendaylight.controller.cluster.datastore.Shard.finishCommit(Lakka/actor/ActorRef;Ljava/lang/String;)V (Shard.java:398)
at org.opendaylight.controller.cluster.datastore.Shard.applyState(Lakka/actor/ActorRef;Ljava/lang/String;Ljava/lang/Object;)V (Shard.java:646)
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(Ljava/lang/Object;)V (RaftActor.java:228)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(Ljava/lang/Object;)V (AbstractUntypedPersistentActor.java:36)
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Ljava/lang/Object;)V (Shard.java:273)
at akka.persistence.UntypedPersistentActor.onReceive(Ljava/lang/Object;)V (Eventsourced.scala:430)
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(Ljava/lang/Object;)V (MeteringBehavior.java:97)
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(Ljava/lang/Object;Lscala/Function1;)Ljava/lang/Object; (ActorCell.scala:534)
at akka.persistence.Recovery$State$class.process(Lakka/persistence/Recovery$State;Lscala/PartialFunction;Ljava/lang/Object;)V (Recovery.scala:30)
at akka.persistence.ProcessorImpl$$anon$2.process(Lscala/PartialFunction;Ljava/lang/Object;)V (Processor.scala:103)
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (Processor.scala:114)
at akka.persistence.Recovery$class.aroundReceive(Lakka/persistence/Recovery;Lscala/PartialFunction;Ljava/lang/Object;)V (Recovery.scala:265)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (Eventsourced.scala:428)
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (Eventsourced.scala:82)
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (Eventsourced.scala:78)
at akka.persistence.Eventsourced$class.aroundReceive(Lakka/persistence/Eventsourced;Lscala/PartialFunction;Ljava/lang/Object;)V (Eventsourced.scala:369)
at akka.persistence.UntypedPersistentActor.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (Eventsourced.scala:428)
at akka.actor.ActorCell.receiveMessage(Ljava/lang/Object;)V (ActorCell.scala:516)
at akka.actor.ActorCell.invoke(Lakka/dispatch/Envelope;)V (ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(IJ)V (Mailbox.scala:238)
at akka.dispatch.Mailbox.run()V (Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec()Z (AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec()I (ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(Lscala/concurrent/forkjoin/ForkJoinTask;)V (ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(Lscala/concurrent/forkjoin/ForkJoinPool$WorkQueue;)V (ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run()V (ForkJoinWorkerThread.java:107)
'

Precondition fails trying to generate the failure message and it tries to build toString() for oldRoot and currentRoot, and looks like it requires too much memory to build the toString of the thole subtree. I tried to patch it by skipping the toString() of oldRoot and currentRoot in Precondition and things looks stable.



 Comments   
Comment by Anil Vishnoi [ 11/Mar/16 ]

Attachment java_pid10114_Leak_Suspects.zip has been added with description: Leak Suspect report of MAT

Comment by Anil Vishnoi [ 11/Mar/16 ]

Attachment java_pid10114_Top_Consumers.zip has been added with description: Tom Consume Report of MAT from Headdump

Comment by Anil Vishnoi [ 11/Mar/16 ]

All the logs files (karaf.log.10) were consumed in printing the ReadFailedException to string.

Comment by Anil Vishnoi [ 11/Mar/16 ]

Attachment node_1_odl_logs.zip has been added with description: Last few minutes logs of cluster node 1

Comment by Anil Vishnoi [ 11/Mar/16 ]

Attachment node_2_odl_logs.zip has been added with description: Logs from Controller node 2

Comment by Anil Vishnoi [ 11/Mar/16 ]

Attachment node_3_odl_logs.zip has been added with description: Controller Node 3

Comment by Anil Vishnoi [ 11/Mar/16 ]

I build a local yangtools-impl jar with this patch and we stop seeing the frequent OOM we were seeing in our environment.

https://git.opendaylight.org/gerrit/36148

I personally feel building subtree in precondition is probably dangerous and can eat up lot of memory and probably performance issue. This fix might not be final, but we need to fix it asap.

Comment by A H [ 14/Mar/16 ]

Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Beryllium without it? Is there a workaround such that we can write a release note and fix in Beryllium SR2?
Testing: Could you also elaborate on the testing or this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Robert Varga [ 15/Mar/16 ]

The root cause is sal-distributed-datastore violating DataTree invocation sequencing. Current solution just hides the problem by relying on implementation behavior. This needs to be fixed in CDS.

Comment by Tom Pantelis [ 16/Mar/16 ]

Submitted https://git.opendaylight.org/gerrit/#/c/36172/ to stable/be as interim fix.

https://git.opendaylight.org/gerrit/#/c/36171/ will implement better solution in Boron.

Comment by A H [ 18/Mar/16 ]

Have we verified that this bug is fixed in the latest build for Beryllium?

Comment by Jamo Luhrsen [ 18/Mar/16 ]

(In reply to A H from comment #9)
> Have we verified that this bug is fixed in the latest build for Beryllium?

I have verified in snapshot distros produced after this yangtools patch
was merged and the problem was not there. Is that good enough?

Generated at Wed Feb 07 19:55:42 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.