[CONTROLLER-1620] StackOverflowError while shard is committing Created: 23/Mar/17  Updated: 25/Jul/23  Resolved: 15/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Guy Sela Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File odl3.tar.gz    
Issue Links:
Duplicate
is duplicated by CONTROLLER-1666 C: JVM down, StackOverflowError; OutO... Resolved
External issue ID: 8056

 Description   

Scale testing on latest Boron code with 3-node cluster as part of an openstack solution.
200 OVSs are connected, and 1 VM per OVS.
ODL3 crashed because of a StackOverflowError:
2017-03-22T21:00:38.704+0200: [GC pause (G1 Evacuation Pause) (young) 21G->8029M(24G), 0.0591195 secs]
Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-49] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.StackOverflowError
at org.opendaylight.yangtools.yang.data.impl.schema.nodes.AbstractImmutableDataContainerNode.getValue(AbstractImmutableDataContainerNode.java:38)
at org.opendaylight.yangtools.yang.data.impl.schema.builder.impl.ImmutableMapEntryNodeBuilder.<init>(ImmutableMapEntryNodeBuilder.java:45)
at org.opendaylight.yangtools.yang.data.impl.schema.builder.impl.ImmutableMapEntryNodeBuilder.create(ImmutableMapEntryNodeBuilder.java:61)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.ListEntryModificationStrategy.createBuilder(ListEntryModificationStrategy.java:64)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.ListEntryModificationStrategy.createBuilder(ListEntryModificationStrategy.java:22)
....

This can be seen in karaf.out



 Comments   
Comment by Guy Sela [ 23/Mar/17 ]

Attachment odl3.tar.gz has been added with description: ODL3 logs

Comment by Robert Varga [ 23/Mar/17 ]

I assume we can ignore the NPEs from neutron for now.

The interesting part of the stack trace is actually:

at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.prepare(AbstractDataTreeTip.java:53)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:577)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)

Which means we have a deep chain of transactions completing with direct commit protocol, which is exhausting stack to the point where there is not enough head room for codecs to run.

Comment by Guy Sela [ 23/Mar/17 ]

If there is no bug there, just a long stack, we can just use: -Xss

Comment by Robert Varga [ 23/Mar/17 ]

Well, stack size is a rather huge hammer which will cover the problem but not solve it, as the number of transactions completing does not really have an upper bound. It also has the downside of each thread costing more VSS, clamping the ability to spawn threads.

Can you enable TRACE logs for org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator, please?

Comment by Guy Sela [ 23/Mar/17 ]

Yeah but reproduction is not trivial (scale setup), so when I'll reach this scenario again I'll try to add the log.

Comment by Robert Varga [ 23/Mar/17 ]

https://git.opendaylight.org/gerrit/53738 helps a tiny bit by eliminating synthetic accessors seen in the stack trace.

Comment by Robert Varga [ 23/Mar/17 ]

So the code structure seems to indicate that we are dealing with operational data store, tons of small transactions which get unblocked after some sort of an up-event (data store being ready or similar).

What we have here is an on-stack chain of futures completing immediately and invoking callbacks. I suspect we need to keep some sort of a counter as to how many transactions we have batched and check it between:

at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)

I.e. at some point we want processNextTransaction() to decide it has committed enough transactions and it is time to sleep for a bit – perhaps by enqueueing a message to re-trigger itself. Then it will process next batch of transactions with an unwound stack.

Comment by Tom Pantelis [ 24/Mar/17 ]

It's interesting that it's taking the short-cut to bypass replication in startCommit:

if (shard.canSkipPayload() || candidate.getRootNode()
.getModificationType() == ModificationType.UNMODIFIED)

{ finishCommit(cohort); return; }

which is what causes the recursion. Since this is a 3-node cluster, canSkipPayload should not evaluate to true as it has followers. Therefore it seems it's getting a lot of successive UNMODIFIED transactions. Not sure if that is normal.

This occurred in Boron. In Carbon we removed the shortcut for UNMODIFIED transactions - they are now persisted/replicated/applied. However it can still recurse if data isn't persisted nor replicated (i.e. operational data in single-node) plus we added the pipe-lining which I think could also lead to a stack overflow.

We need to limit the number of recursive calls in ShardDataTree in the context of processing a single message. Perhaps we can use a thread local var to track the count.

(In reply to Robert Varga from comment #6)
> So the code structure seems to indicate that we are dealing with operational
> data store, tons of small transactions which get unblocked after some sort
> of an up-event (data store being ready or similar).
>
> What we have here is an on-stack chain of futures completing immediately and
> invoking callbacks. I suspect we need to keep some sort of a counter as to
> how many transactions we have batched and check it between:
>
> at
> org.opendaylight.controller.cluster.datastore.ShardDataTree.
> processNextTransaction(ShardDataTree.java:526)
> at
> org.opendaylight.controller.cluster.datastore.ShardDataTree.
> finishCommit(ShardDataTree.java:623)
>
> I.e. at some point we want processNextTransaction() to decide it has
> committed enough transactions and it is time to sleep for a bit – perhaps
> by enqueueing a message to re-trigger itself. Then it will process next
> batch of transactions with an unwound stack.

Comment by Robert Varga [ 24/Mar/17 ]

It does not need to be thread-local. We can keep a simple counter in ShardDataTree and reset it from Shard everytime we start processing a message.

Comment by suneel verma [ 27/Mar/17 ]

We ran into the same error today in our scale setup using stable/boron.
We are using resource batch manager from genius repo to batch the transactions.
The volume of write from batch manager across transactions is 4k.
Not sure about how many transactions batch manager created for this 4k as we did not enable logs.

Comment by suneel verma [ 27/Mar/17 ]

Actually two out of three controllers wend down from the 3 node cluster with the same error.

Comment by Robert Varga [ 12/May/17 ]

carbon: https://git.opendaylight.org/gerrit/56948

Comment by Robert Varga [ 12/May/17 ]

This can actually occur if a prefix-based shard is not replicated and not persisted – hence this is a blocker for Carbon.

Comment by A H [ 15/May/17 ]

Patch https://git.opendaylight.org/gerrit/#/c/56948/ was successfully merged. Can we mark this bug as fixed and resolved?

Generated at Wed Feb 07 19:56:00 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.