|
Attachment odl3.tar.gz has been added with description: ODL3 logs
|
|
I assume we can ignore the NPEs from neutron for now.
The interesting part of the stack trace is actually:
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.prepare(AbstractDataTreeTip.java:53)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:577)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.finishCommit(ShardCommitCoordinator.java:316)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$400(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:299)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$3.onSuccess(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulPreCommit(SimpleShardDataTreeCohort.java:169)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startPreCommit(ShardDataTree.java:591)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.preCommit(SimpleShardDataTreeCohort.java:91)
at org.opendaylight.controller.cluster.datastore.CohortEntry.preCommit(CohortEntry.java:102)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.doCommit(ShardCommitCoordinator.java:296)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.access$200(ShardCommitCoordinator.java:49)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:243)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator$2.onSuccess(ShardCommitCoordinator.java:237)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.successfulCanCommit(SimpleShardDataTreeCohort.java:145)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCommit(ShardDataTree.java:635)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.commit(SimpleShardDataTreeCohort.java:132)
at org.opendaylight.controller.cluster.datastore.CohortEntry.commit(CohortEntry.java:106)
Which means we have a deep chain of transactions completing with direct commit protocol, which is exhausting stack to the point where there is not enough head room for codecs to run.
|
|
If there is no bug there, just a long stack, we can just use: -Xss
|
|
Well, stack size is a rather huge hammer which will cover the problem but not solve it, as the number of transactions completing does not really have an upper bound. It also has the downside of each thread costing more VSS, clamping the ability to spawn threads.
Can you enable TRACE logs for org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator, please?
|
|
Yeah but reproduction is not trivial (scale setup), so when I'll reach this scenario again I'll try to add the log.
|
|
https://git.opendaylight.org/gerrit/53738 helps a tiny bit by eliminating synthetic accessors seen in the stack trace.
|
|
So the code structure seems to indicate that we are dealing with operational data store, tons of small transactions which get unblocked after some sort of an up-event (data store being ready or similar).
What we have here is an on-stack chain of futures completing immediately and invoking callbacks. I suspect we need to keep some sort of a counter as to how many transactions we have batched and check it between:
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:526)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:623)
I.e. at some point we want processNextTransaction() to decide it has committed enough transactions and it is time to sleep for a bit – perhaps by enqueueing a message to re-trigger itself. Then it will process next batch of transactions with an unwound stack.
|
|
It's interesting that it's taking the short-cut to bypass replication in startCommit:
if (shard.canSkipPayload() || candidate.getRootNode()
.getModificationType() == ModificationType.UNMODIFIED)
{
finishCommit(cohort);
return;
}
which is what causes the recursion. Since this is a 3-node cluster, canSkipPayload should not evaluate to true as it has followers. Therefore it seems it's getting a lot of successive UNMODIFIED transactions. Not sure if that is normal.
This occurred in Boron. In Carbon we removed the shortcut for UNMODIFIED transactions - they are now persisted/replicated/applied. However it can still recurse if data isn't persisted nor replicated (i.e. operational data in single-node) plus we added the pipe-lining which I think could also lead to a stack overflow.
We need to limit the number of recursive calls in ShardDataTree in the context of processing a single message. Perhaps we can use a thread local var to track the count.
(In reply to Robert Varga from comment #6)
> So the code structure seems to indicate that we are dealing with operational
> data store, tons of small transactions which get unblocked after some sort
> of an up-event (data store being ready or similar).
>
> What we have here is an on-stack chain of futures completing immediately and
> invoking callbacks. I suspect we need to keep some sort of a counter as to
> how many transactions we have batched and check it between:
>
> at
> org.opendaylight.controller.cluster.datastore.ShardDataTree.
> processNextTransaction(ShardDataTree.java:526)
> at
> org.opendaylight.controller.cluster.datastore.ShardDataTree.
> finishCommit(ShardDataTree.java:623)
>
> I.e. at some point we want processNextTransaction() to decide it has
> committed enough transactions and it is time to sleep for a bit – perhaps
> by enqueueing a message to re-trigger itself. Then it will process next
> batch of transactions with an unwound stack.
|
|
It does not need to be thread-local. We can keep a simple counter in ShardDataTree and reset it from Shard everytime we start processing a message.
|
|
We ran into the same error today in our scale setup using stable/boron.
We are using resource batch manager from genius repo to batch the transactions.
The volume of write from batch manager across transactions is 4k.
Not sure about how many transactions batch manager created for this 4k as we did not enable logs.
|
|
Actually two out of three controllers wend down from the 3 node cluster with the same error.
|
|
carbon: https://git.opendaylight.org/gerrit/56948
|
|
This can actually occur if a prefix-based shard is not replicated and not persisted – hence this is a blocker for Carbon.
|
|
Patch https://git.opendaylight.org/gerrit/#/c/56948/ was successfully merged. Can we mark this bug as fixed and resolved?
|
Generated at Wed Feb 07 19:56:00 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.