[NETVIRT-916] vpnmanager: OptimisticLockFailedException seen during rally scenario Created: 18/Sep/17  Updated: 05/Apr/18  Resolved: 05/Apr/18

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Carbon
Fix Version/s: Oxygen

Type: Bug Priority: Medium
Reporter: Sridhar Gaddam Assignee: Sam Hague
Resolution: Done Votes: 0
Labels: csit:3node, ds
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
duplicates NETVIRT-1005 OptimisticLockFailedException... VpnN... Resolved
Relates
relates to NETVIRT-1012 OptimisticLockFailedException: Confli... Resolved
External issue ID: 9176

 Description   

In a clustered setup with 3 ODL controllers and 3 compute nodes, while running Rally with a light load (concurrency of 2, times set to 40), the following exception was seen in karaf.

ODL rpm used: opendaylight-6.2.0-0.1.20170913snap58.el7.noarch.rpm

2017-09-15 03:39:25,801 | WARN | CommitFutures-3 | DataStoreJobCoordinator | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Job: JobEntry{key='VPNNODE-9681529007732', mainWorker=org.opendaylight.netvirt.vpnmanager.VpnNodeListener$1@7a2b1deb, rollbackWorker=null, retryCount=0, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@74ececde]} failed
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:751)
 at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:789)
 at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:736)
 at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:819)
 at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:90)
 at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97)
 at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236)
 at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleBatchedModifications(ShardCommitCoordinator.java:170)
 at org.opendaylight.controller.cluster.datastore.Shard.handleBatchedModificationsLocal(Shard.java:604)
 at org.opendaylight.controller.cluster.datastore.Shard.handleBatchedModifications(Shard.java:627)
 at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:312)
 at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44)
 at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
 at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
 at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
 at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
 at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
 at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
 at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
 at akka.actor.ActorCell.invoke(ActorCell.scala:495)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkWriteApplicable(SchemaAwareApplyOperation.java:174)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:128)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkMergeApplicable(AbstractNodeContainerModificationStrategy.java:313)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:131)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.StructuralContainerModificationStrategy.checkApplicable(StructuralContainerModificationStrategy.java:99)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:305)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:288)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:125)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.RootModificationApplyOperation.checkApplicable(RootModificationApplyOperation.java:72)
 at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.validate(AbstractDataTreeTip.java:35)
 at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:743)
 ... 29 more

 



 Comments   
Comment by Sam Hague [ 13/Nov/17 ]

Next round of scale testing had issue: https://gist.github.com/smalleni/c96b97db950f925aa879dac91ad3b635

Comment by Michael Vorburger [ 27/Nov/17 ]

This one MAY be easy to fix, though I'm not sure - it is worth a try to see whether this goes away with retry = 3 instead of the default in the enqueueJob() we see in this class. Here are Gerrits for all branches to try this out:
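(For illustration only; this is not the content of those Gerrits.) A minimal, hedged sketch of the kind of change being suggested - pass an explicit max-retries argument when the VPN node job is enqueued, so the job coordinator re-runs the worker on an OptimisticLockFailedException. The method body, variable names and exact import path are assumptions about the Carbon-era genius API (an enqueueJob(key, worker, maxRetries) overload), not the actual patch:

import java.math.BigInteger;
import java.util.Collections;
import java.util.List;

import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;

// assumed import path for the Carbon-era coordinator; treat as approximate
import org.opendaylight.genius.datastoreutils.DataStoreJobCoordinator;

public final class VpnNodeRetrySketch {

    void addVpnNodeWithRetries(BigInteger dpnId) {
        DataStoreJobCoordinator coordinator = DataStoreJobCoordinator.getInstance();
        coordinator.enqueueJob("VPNNODE-" + dpnId,
            () -> {
                // the real worker opens a WriteTransaction and writes the per-DPN
                // VPN state; this placeholder just returns an already-completed future
                List<ListenableFuture<Void>> futures =
                        Collections.singletonList(Futures.<Void>immediateFuture(null));
                return futures;
            },
            3 /* maxRetries - retry on OptimisticLockFailedException instead of
                 giving up after the first attempt */);
    }
}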

Comment by Sam Hague [ 08/Dec/17 ]

I see the exception below; it has retryCount=3, so I guess the code is in. Does the exception mean it did not fix the problem?

 

2017-11-29 23:51:58,501 | WARN  | ult-dispatcher-6 | ConcurrentDOMDataBroker          | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.3.SNAPSHOT | Tx: DOM-318 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
2017-11-29 23:51:58,503 | WARN  | CommitFutures-2  | DataStoreJobCoordinator          | 293 - org.opendaylight.genius.mdsalutil-api - 0.2.3.SNAPSHOT | Job: JobEntry{key='VPNNODE-30811640961648', mainWorker=org.opendaylight.netvirt.vpnmanager.VpnNodeListener$$Lambda$368/1576234995@54571922, rollbackWorker=null, retryCount=3, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@5a90edd8]} failed
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.]]}
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was created by other transaction.
Comment by Michael Vorburger [ 11/Dec/17 ]

shague I wonder if the change really does not fix the problem, or if this is "just" a kind of confusing logging issue. Do you have any way of verifying whether the "functionality" actually worked? Can you link to a full log where this was seen, preferably from master (with JobCoordinatorImpl instead of DataStoreJobCoordinator)?

For the 2nd log, I'm expecting that a closer inspection of a full log would show that after the WARN with retryCount=3, it retried and that the retry then worked (hopefully); to confirm that, you would have to check whether there is also a retryCount=2 and a retryCount=1 entry (shortly after, for the exact same Job), or not. If there isn't, then the fix did work - it is just confusing that it still gives us a WARN for the first attempt, which failed, even though the retry then worked. That could be fixed by the proposed https://git.opendaylight.org/gerrit/66355, if others agree to this.

For the 1st log, that's a WARN from controller. I'm not sure I agree that controller should emit that WARN; it's one of those "where should you log, in the callee or in the caller" kind of discussions. I've just raised this on the thread "How about reducing WARN to DEBUG for ConflictingModificationAppliedException in ConcurrentDOMDataBroker?" on controller-dev, see https://lists.opendaylight.org/pipermail/controller-dev/2017-December/014099.html.

Comment by Sam Hague [ 11/Dec/17 ]

Here is the log from the master patch (https://git.opendaylight.org/gerrit/#/c/65944/) gate job: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-oxygen/8/odl_1/odl1_karaf.log.gz

In that run, all you see is the retryCount=3 and no decrement. There is a preceding OLFE before the retryCount exception. Does that imply it only tried once, as you mention, and worked? If so, then yes, it is very confusing to see the WARN with the OLFE - I would never think that it worked in that case.

Agreed on the first log. These are very difficult to control in the downstream apps, since the errors pop up regardless of what the app thinks.

Comment by Michael Vorburger [ 11/Dec/17 ]

Does that imply it only tried once as you mention and worked?

That is my understanding, yes.

Agreed on the first log. These are very difficult to control in the downstream apps, since the errors pop up regardless of what the app thinks.

==> CONTROLLER-1802

Comment by Sam Hague [ 19/Dec/17 ]

Michael,

in this job: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-gate-stateful-oxygen/535/odl_1/odl1_exception.log.gz we only got a retryCount=0 - does that mean the patch did what it was supposed to and only logged the final exception after the last retry? The bug is still there, but the number of logs was reduced?

Also, can you clarify: your patches here don't fix this bug, they just reduce the number of logs? Do I have that right?

2017-12-12 04:41:41,597 | WARN  | ult-dispatcher-3 | ConcurrentDOMDataBroker          | 217 - org.opendaylight.controller.sal-distributed-datastore - 1.7.0.SNAPSHOT | Tx: DOM-163473 Error during phase CAN_COMMIT, starting Abort
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[417:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.
2017-12-12 04:41:41,601 | WARN  | CommitFutures-15 | JobCoordinatorImpl               | 261 - org.opendaylight.infrautils.jobcoordinator-impl - 1.3.0.SNAPSHOT | Job: JobEntry{key='c7da4167-372c-46c4-b718-9331ac4faff0', mainWorker=ItmTepRemoveWorker  { Delete Dpn List : [DPNTEPsInfo [_dPNID=181234940435320, _key=DPNTEPsInfoKey [_dPNID=181234940435320], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=181234940435320:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.15.6]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.15.6]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]], DPNTEPsInfo [_dPNID=88475369668342, _key=DPNTEPsInfoKey [_dPNID=88475369668342], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=88475369668342:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.13.254]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.13.254]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]], DPNTEPsInfo [_dPNID=84424921983078, _key=DPNTEPsInfoKey [_dPNID=84424921983078], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=84424921983078:tunnel_port:0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.14.8]], _key=TunnelEndPointsKey [_portname=tunnel_port, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=10.29.14.8]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=tunnel_port, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=0.0.0.0/0]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=c7da4167-372c-46c4-b718-9331ac4faff0], _zoneName=c7da4167-372c-46c4-b718-9331ac4faff0, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]]] }, rollbackWorker=null, retryCount=0, futures=[org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$AsyncNotifyingSettableFuture@62cfb748, com.google.common.util.concurrent.ImmediateFuture$ImmediateSuccessfulCheckedFuture@66f03fc9]} failed
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was replaced by other transaction.]]}
Comment by Michael Vorburger [ 19/Dec/17 ]

we only got a retryCount=0 - does that mean the patch did what it was supposed to and only log the final exception after the last retry? The bug is still there, but the number of logs was reduced?

Now wait a second, your new log above from JobCoordinatorImpl is re. ItmTepRemoveWorker, whereas originally this issue was re. VpnNodeListener... so what is the "scope" of this bug? The change above only fixed VpnNodeListener, by adding a ",3" parameter for retries, which we proved helped.

Now you either have to do the same in ItmTepRemoveWorker, and everywhere else, or we change the default retries in JobCoordinator from 0 to 3 for EVERYTHING... I'm not sure that everyone will agree that is a good idea, but I can propose it, for discussion, in c/66618 ...

Also can you clarify that your patches here - they don't fix this bug, they just reduce the number of logs? Do I have that right?

Well, no - careful, you are mixing up two things here: the 3 retries added e.g. to VpnNodeListener, or maybe by changing the default in JobCoordinator, DO fix this bug. Separately, CONTROLLER-1802 improves logging so we stop logging where it does not make sense (because there are retries); that's what c/66362 does for controller (merged on Dec 13th, but it still appears as a WARN from ConcurrentDOMDataBroker because your log above is from a Dec 12th run!) and what c/66355 does for JobCoordinator (finally just merged, but likely buggy; follow-up in c/66634).

Comment by Sam Hague [ 21/Dec/17 ]

OK, so the fact that we don't see any errors with the VpnNodeListener patch means that somewhere in the 1-3 retries it passed - failures would only be logged at debug, which is not on by default, so we don't know how many tries it took, but it did pass.

Then we need to see if we want to do the same for the new error. In this one it says retryCount=0 because that is the default - not because it got through all the retries and ended at 0.
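For reference, the retry semantics described above boil down to the pattern below - a self-contained, simplified sketch with hypothetical names, not the actual genius/infrautils JobCoordinator code: the worker is re-run when a commit conflicts, intermediate failures are retried quietly, and only the failure after the last retry surfaces.

import java.util.concurrent.Callable;

// Simplified, hypothetical sketch of the retry behaviour discussed above; the real
// implementation lives in the genius/infrautils JobCoordinator, not here.
public final class RetryOnConflictSketch {

    /** Stand-in for OptimisticLockFailedException; not the real ODL class. */
    static final class ConflictException extends Exception { }

    static <T> T runWithRetries(Callable<T> worker, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return worker.call(); // each attempt would open a fresh transaction
            } catch (ConflictException e) {
                if (attempt >= maxRetries) {
                    throw e; // only this final failure would be logged as a WARN
                }
                // intermediate failures would only be logged at debug and retried;
                // by now the conflicting node exists, so the next attempt usually succeeds
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // toy worker that conflicts twice before succeeding
        String result = runWithRetries(() -> {
            if (calls[0]++ < 2) {
                throw new ConflictException();
            }
            return "committed";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}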

Comment by Michael Vorburger [ 26/Feb/18 ]

That (above) is correct. FTR c/66618 which changed the default retries in JobCoordinator from 0 to 3 globally for everything was discussed, accepted and merged on Jan 9th.

Perhaps we could finally close this bug (with Fix Version: Oxygen), and open new ones for other places?

Comment by Michael Vorburger [ 14/Mar/18 ]

shague SridharG can this one finally be closed now, or is there anything else I can do here?

Comment by Sam Hague [ 05/Apr/18 ]

https://git.opendaylight.org/gerrit/65944
