[NETVIRT-916] vpnmanager: OptimisticLockFailedException seen during rally scenario Created: 18/Sep/17 Updated: 05/Apr/18 Resolved: 05/Apr/18 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | Carbon |
| Fix Version/s: | Oxygen |
| Type: | Bug | Priority: | Medium |
| Reporter: | Sridhar Gaddam | Assignee: | Sam Hague |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node, ds | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
| External issue ID: | 9176 | ||||||||||||||||
| Description |
|
In a clustered setup with 3 ODL controllers and 3 compute nodes, while running rally under a light load (concurrency of 2, times set to 40), the following exception was seen in karaf.

ODL rpm used: opendaylight-6.2.0-0.1.20170913snap58.el7.noarch.rpm

2017-09-15 03:39:25,801 | WARN | CommitFutures-3 | DataStoreJobCoordinator | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Job: JobEntry
{key='VPNNODE-9681529007732', mainWorker=org.opendaylight.netvirt.vpnmanager.VpnNodeListener$1@7a2b1deb, rollbackWorker=null, retryCount=0, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@74ececde]}
failed
OptimisticLockFailedException
{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:751)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:789)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:736)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:819)
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:90)
at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236)
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleBatchedModifications(ShardCommitCoordinator.java:170)
at org.opendaylight.controller.cluster.datastore.Shard.handleBatchedModificationsLocal(Shard.java:604)
at org.opendaylight.controller.cluster.datastore.Shard.handleBatchedModifications(Shard.java:627)
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:312)
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkWriteApplicable(SchemaAwareApplyOperation.java:174)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:128)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.StructuralContainerModificationStrategy.checkApplicable(StructuralContainerModificationStrategy.java:99)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:288)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:125)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.RootModificationApplyOperation.checkApplicable(RootModificationApplyOperation.java:72)
at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.validate(AbstractDataTreeTip.java:35)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:743)
... 29 more
|
| Comments |
| Comment by Sam Hague [ 13/Nov/17 ] |
|
The next round of scale testing hit this issue: https://gist.github.com/smalleni/c96b97db950f925aa879dac91ad3b635 |
| Comment by Michael Vorburger [ 27/Nov/17 ] |
|
This one MAY be easy to fix, though I'm not sure - worth a try to see if this goes away with retry = 3 instead of the default in the enqueueJob() we see in this class. Here are Gerrits for all branches to try this out: |
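To make the suggestion concrete, here is a minimal sketch (not the actual Gerrit change) assuming the genius DataStoreJobCoordinator enqueueJob overload that takes a maxRetries argument; the worker body and the writeVpnNodeState helper are hypothetical placeholders:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;

import com.google.common.util.concurrent.ListenableFuture;
import org.opendaylight.genius.datastoreutils.DataStoreJobCoordinator;

public class VpnNodeJobExample {
    private static final int MAX_RETRIES = 3; // retry OLFE-failed jobs instead of giving up

    public void enqueueVpnNodeJob(String dpnId) {
        Callable<List<ListenableFuture<Void>>> worker = () -> {
            // The real worker writes per-node VPN state in its own transaction;
            // concurrent merges from other cluster members can make the commit
            // fail with an OptimisticLockFailedException.
            return writeVpnNodeState(dpnId); // hypothetical helper
        };
        DataStoreJobCoordinator.getInstance()
            .enqueueJob("VPNNODE-" + dpnId, worker, MAX_RETRIES);
    }

    private List<ListenableFuture<Void>> writeVpnNodeState(String dpnId) {
        return Collections.emptyList(); // placeholder for the real transaction futures
    }
}
```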
| Comment by Sam Hague [ 08/Dec/17 ] |
|
I see the below exception; it has retryCount=3, so I guess the code is in. Does the exception mean it did not fix the problem?
2017-11-29 23:51:58,501 | WARN | ult-dispatcher-6 | ConcurrentDOMDataBroker | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.3.SNAPSHOT | Tx: DOM-318 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
2017-11-29 23:51:58,503 | WARN | CommitFutures-2 | DataStoreJobCoordinator | 293 - org.opendaylight.genius.mdsalutil-api - 0.2.3.SNAPSHOT | Job: JobEntry{key='VPNNODE-30811640961648', mainWorker=org.opendaylight.netvirt.vpnmanager.VpnNodeListener$$Lambda$368/1576234995@54571922, rollbackWorker=null, retryCount=3, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@5a90edd8]} failed
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
|
| Comment by Michael Vorburger [ 11/Dec/17 ] |
|
shague I wonder if the change really does not fix the problem, or if this is "just" a kind of confusing logging issue... do you have any way of verifying whether the "functionality" actually worked? Can you link to a full log where this was seen, preferably from master (with JobCoordinatorImpl instead of DataStoreJobCoordinator)?

For the 2nd log, I'm expecting that a closer inspection of a full log will show that after the WARN with retryCount=3, it retried and that this then worked (hopefully). To confirm that, you would have to check whether there is also a retryCount=2 and a retryCount=1 (shortly after, for the exact same job), or not. If there isn't, then the fix did work - it's just confusing that it still gives us a WARN on the first attempt, which failed before the retry worked. That could be fixed by the proposed https://git.opendaylight.org/gerrit/66355, if others agree to this.

For the 1st log, that's a WARN from controller. I'm not sure I agree that controller should emit that WARN... it's one of those "where should you log, in the callee or in the caller" kind of discussions; I've just raised this on the thread "How about reducing WARN to DEBUG for ConflictingModificationAppliedException in ConcurrentDOMDataBroker?" on controller-dev, see https://lists.opendaylight.org/pipermail/controller-dev/2017-December/014099.html. |
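For readers trying to interpret those logs, here is an illustrative sketch (hypothetical names, not the real JobCoordinator source) of the retry semantics described above: a WARN is emitted for every failed attempt before a retry, so a single WARN at retryCount=3 with no subsequent retryCount=2/1 lines for the same key means the very next retry succeeded:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class RetrySemanticsSketch {
    private static final Logger LOG = LoggerFactory.getLogger(RetrySemanticsSketch.class);

    static final class JobEntry {
        final String key;
        final Runnable mainWorker;
        int retryCount; // starts at maxRetries, decremented on each failure

        JobEntry(String key, Runnable mainWorker, int maxRetries) {
            this.key = key;
            this.mainWorker = mainWorker;
            this.retryCount = maxRetries;
        }
    }

    void runJob(JobEntry job) {
        try {
            job.mainWorker.run();
        } catch (RuntimeException e) {
            if (job.retryCount > 0) {
                LOG.warn("Job: {} failed, will retry", job.key, e); // the WARN seen in the karaf logs
                job.retryCount--;                                   // 3 -> 2 -> 1 -> 0
                runJob(job);                                        // re-run the same worker
            } else {
                LOG.error("Job: {} failed permanently", job.key, e);
            }
        }
    }
}
```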
| Comment by Sam Hague [ 11/Dec/17 ] |
|
Here is the log from the master patch (https://git.opendaylight.org/gerrit/#/c/65944/) gate job: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/8/odl_1/odl1_karaf.log.gz

In that run, all you see is the retryCount=3 entry and no decrement. There is a preceding OLFE before the retryCount exception. Does that imply it only tried once, as you mention, and worked? If so, then yes, it is very confusing to see the WARN with the OLFE - I would never think that it worked in that case. Agreed on the first log. These things are very difficult to control in the downstream apps, since the errors pop up regardless of what the app thinks. |
| Comment by Michael Vorburger [ 11/Dec/17 ] |
|
That is my understanding, yes. |
| Comment by Sam Hague [ 19/Dec/17 ] |
|
Michael, in this job: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-gate-stateful-oxygen/535/odl_1/odl1_exception.log.gz we only got a retryCount=0 - does that mean the patch did what it was supposed to and only logged the final exception after the last retry? The bug is still there, but the number of logs was reduced? Also, can you confirm that your patches here don't fix this bug, they just reduce the number of logs? Do I have that right?

2017-12-12 04:41:41,597 | WARN | ult-dispatcher-3 | ConcurrentDOMDataBroker | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Tx: DOM-163473 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[417:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.
2017-12-12 04:41:41,601 | WARN | CommitFutures-15 | JobCoordinatorImpl | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Job: JobEntry{key='c7da4167-372c-46c4-b718-9331ac4faff0', mainWorker=ItmTepRemoveWorker { Delete Dpn List : [DPNTEPsInfo [_dPNID=181234940435320, _key=DPNTEPsInfoKey [_dPNID=181234940435320], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=181234940435320:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.15.6]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.15.6]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]], DPNTEPsInfo [_dPNID=88475369668342, _key=DPNTEPsInfoKey [_dPNID=88475369668342], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=88475369668342:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.13.254]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.13.254]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]], DPNTEPsInfo [_dPNID=84424921983078, _key=DPNTEPsInfoKey [_dPNID=84424921983078], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=84424921983078:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.14.8]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.14.8]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]]] }, rollbackWorker=null, retryCount=0, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@62cfb748, com.google.common.util.concurrent.ImmediateFuture$ImmediateSuccessfulCheckedFuture@66f03fc9]} failed
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
|
| Comment by Michael Vorburger [ 19/Dec/17 ] |
|
Now wait a second: your new log above from JobCoordinatorImpl is re. ItmTepRemoveWorker, whereas originally this issue was re. VpnNodeListener... so what is the "scope" of this bug? Now you either have to do the same in ItmTepRemoveWorker, and everywhere else, or we change the default retries in JobCoordinator from 0 to 3 for EVERYTHING... I'm not sure that everyone will agree that is a good idea, but I can propose it, for discussion, in c/66618 ...
Well, no, careful - you are mixing up 2 things here: those 3 retries added e.g. to VpnNodeListener, or maybe by changing the default in JobCoordinator, DO fix this bug. But then the |
| Comment by Sam Hague [ 21/Dec/17 ] |
|
OK, so the fact that we don't see any errors with the VpnNodeListener patch means that somewhere in the 1-3 retries it passed - and failures would only be logged at debug, which is not on by default, so we don't know how many tries it took, but it did pass. Then we need to see if we want to do the same for the new error. And in this one, it says retryCount=0 since that is the default - not that it got through all retries and ended at 0. |
| Comment by Michael Vorburger [ 26/Feb/18 ] |
|
That (above) is correct. FTR, c/66618, which changed the default retries in JobCoordinator from 0 to 3 globally for everything, was discussed, accepted, and merged on Jan 9th (a sketch of that kind of default change follows below). Perhaps we could finally close this bug (with Fix Version: Oxygen) and open new ones for other places? |
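For context, a hedged sketch of the kind of coordinator-wide default that change introduced; the interface name here is hypothetical and the actual c/66618 patch may differ in detail:

```java
import java.util.List;
import java.util.concurrent.Callable;

import com.google.common.util.concurrent.ListenableFuture;

public interface JobCoordinatorSketch {
    // Previously the default was 0: the first OptimisticLockFailedException
    // failed the job outright. Raising it to 3 gives every enqueueJob call
    // without an explicit maxRetries a few chances to win the commit race.
    int DEFAULT_MAX_RETRIES = 3;

    void enqueueJob(String key, Callable<List<ListenableFuture<Void>>> mainWorker);

    void enqueueJob(String key, Callable<List<ListenableFuture<Void>>> mainWorker, int maxRetries);
}
```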
| Comment by Michael Vorburger [ 14/Mar/18 ] |
|
shague SridharG can this one finally be closed now, or is there anything else I can do here? |
| Comment by Sam Hague [ 05/Apr/18 ] |