[NETVIRT-1215] ElanUtils.waitForTransactionToComplete should not LOG.error for OptimisticLockFailedException Created: 16/Apr/18  Updated: 26/Apr/18  Resolved: 26/Apr/18

Status: Resolved
Project: netvirt
Component/s: None
Affects Version/s: None
Fix Version/s: Oxygen-SR2, Fluorine

Type: Bug Priority: Medium
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: csit:exception, ds
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Java Source File FutureGetOnFailureTest.java    
Issue Links:
Relates
relates to NETVIRT-972 Conflicting modification for path...n... Resolved
relates to NETVIRT-1011 OptimisticLockFailedException: Confli... Resolved

 Description   

https://bugzilla.redhat.com/show_bug.cgi?id=1567509 requests that this LOG.error() be suppressed:

2018-04-13T23:40:08,068 | ERROR | ForkJoinPool-1-worker-27 | ElanUtils                        | 351 - org.opendaylight.netvirt.elanmanager-impl - 0.6.0.redhat-5 | Error writing to datastore {}
java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}
        at org.opendaylight.mdsal.common.api.MappingCheckedFuture.wrapInExecutionException(MappingCheckedFuture.java:66) [296:org.opendaylight.mdsal.common-api:2.4.0.redhat-5]
        at org.opendaylight.mdsal.common.api.MappingCheckedFuture.get(MappingCheckedFuture.java:79) [296:org.opendaylight.mdsal.common-api:2.4.0.redhat-5]
        at org.opendaylight.netvirt.elan.utils.ElanUtils.waitForTransactionToComplete(ElanUtils.java:1381) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.ElanInterfaceManager.removeElanInterface(ElanInterfaceManager.java:275) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.InterfaceRemoveWorkerOnElan.call(InterfaceRemoveWorkerOnElan.java:52) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.elan.internal.InterfaceRemoveWorkerOnElan.call(InterfaceRemoveWorkerOnElan.java:21) [351:org.opendaylight.netvirt.elanmanager-impl:0.6.0.redhat-5]
        at org.opendaylight.infrautils.jobcoordinator.internal.JobCoordinatorImpl$MainTask.runWithUncheckedExceptionLogging(JobCoordinatorImpl.java:416) [281:org.opendaylight.infrautils.jobcoordinator-impl:1.3.0.redhat-5]
        at org.opendaylight.infrautils.utils.concurrent.LoggingUncaughtThreadDeathContextRunnable.run(LoggingUncaughtThreadDeathContextRunnable.java:60) [287:org.opendaylight.infrautils.util:1.3.0.redhat-5]
        at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:?]
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:?]
        at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:?]
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:?]
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:?]
Caused by: org.opendaylight.controller.md.sal.common.api.data.OptimisticLockFailedException: Optimistic lock failed.
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:740) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:778) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:725) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:808) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:84) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:200) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:731) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333) ~[?:?]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
        at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:274) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:545) ~[?:?]
        at akka.actor.Actor.aroundReceive(Actor.scala:514) ~[?:?]
        at akka.actor.Actor.aroundReceive$(Actor.scala:512) ~[?:?]
 at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:663) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:183) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:182) ~[?:?]
        at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527) ~[?:?]
        at akka.actor.ActorCell.invoke(ActorCell.scala:496) ~[?:?]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[?:?]
Caused by: org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node was deleted by other transaction.
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:287) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.StructuralContainerModificationStrategy.checkApplicable(StructuralContainerModificationStrategy.java:101) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkChildPreconditions(AbstractNodeContainerModificationStrategy.java:307) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractNodeContainerModificationStrategy.checkTouchApplicable(AbstractNodeContainerModificationStrategy.java:290) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.SchemaAwareApplyOperation.checkApplicable(SchemaAwareApplyOperation.java:132) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.RootModificationApplyOperation.checkApplicable(RootModificationApplyOperation.java:72) ~[?:?]
        at org.opendaylight.yangtools.yang.data.impl.schema.tree.AbstractDataTreeTip.validate(AbstractDataTreeTip.java:35) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.lambda$processNextPendingTransaction$0(ShardDataTree.java:732) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPending(ShardDataTree.java:778) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextPendingTransaction(ShardDataTree.java:725) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:808) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:84) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:97) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:236) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:200) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:731) ~[?:?]
        at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333) ~[?:?]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:44) ~[?:?]
        at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:274) ~[?:?]
        at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104) ~[?:?]
        at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:545) ~[?:?]
        at akka.actor.Actor.aroundReceive(Actor.scala:514) ~[?:?]
        at akka.actor.Actor.aroundReceive$(Actor.scala:512) ~[?:?]
        at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:663) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:183) ~[?:?]
        at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:182) ~[?:?]
        at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:272) ~[?:?]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527) ~[?:?]
        at akka.actor.ActorCell.invoke(ActorCell.scala:496) ~[?:?]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) ~[?:?]
        at akka.dispatch.Mailbox.run(Mailbox.scala:224) ~[?:?]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) ~[?:?]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) ~[?:?]

I'm not entirely convinced yet that it can just be supressed - or if it must be "handled" (retries?) - let's have a closer look, and discuss further here...



 Comments   
Comment by Michael Vorburger [ 16/Apr/18 ]

So InterfaceRemoveWorkerOnElan is a JobCoordinator Worker which is enqueueJob'd in 2 places: org.opendaylight.netvirt.elan.internal.ElanInterfaceManager.remove() and org.opendaylight.netvirt.elan.internal.ElanInterfaceStateChangeListener.remove(). In both of these, it already has 6 (ElanConstants.JOB_MAX_RETRIES) retries.

Therefore, just making that particular LOG.error() a LOG.debug() in the org.opendaylight.netvirt.elan.utils.ElanUtils.waitForTransactionToComplete(WriteTransaction) does seem justified and correct to me, but I would love more eyes and feedback on this...

When we check other usages of ElanUtils.waitForTransactionToComplete, we can see that the return values are always added to futures, except in ElanInterfaceManager.update(InstanceIdentifier<ElanInterface>, ElanInterface, ElanInterface) - I'll add a ListenableFutures.addErrorLogging for it there.

While we are at it, let me also raise a second change to completely remove the catch and LOG.error inside the Worker classes... that is not shown above, but I expect should also haven shown up in the original log, but is equally confusing and wrong - logging errors from the worker call(), if it does not succeed after retries, is the JobCoordinator's role, not the worker's.

BTW: I had actually worked at a more general different solution for this in https://git.opendaylight.org/gerrit/#/c/65839/, but that was -1 and I never picked it up. Let me therefore for now just fix the logging issue here, only.

Overall elanmanager could certainly do with a bigger rewrite re. its use of Future and exception handling, but that's beyond the scope of this issue, and not something I want to jump into right now, due to other priorities.

Comment by Michael Vorburger [ 16/Apr/18 ]

Changes c/70968 and c/70969 prepared, but still Draft due to Broken World - will publish when Peace is back.

Comment by Michael Vorburger [ 16/Apr/18 ]

eupakir I've added you as reviewer on the (still Draft) changes c/70968 and c/70969; please chime in there or here (better than on the private email) with any comments you have, if any, to contribute to this issue... thanks!

Comment by Kiran Upadhyaya [ 16/Apr/18 ]

There was a similar bug that was analysed to be containing conflicting modification between two DJCs with different keys:

https://jira.opendaylight.org/browse/NETVIRT-972.

Analysis:

There are two job-coordinators with different keys trying to add the same flow resulting in conflicting modification exception. One would think that these two jobs should have the same key, but that is not the case here.

One job adds flows onto remote DPNs for a new FIB entry (and has the FIB prefix in the job-key). The other job adds remote flows for all FIB entries for a VPN when that VPN establishes footprint on a new DPN (this cannot have prefix in its job-key). Both the keys are valid, and they will sometimes conflict with each other.

The functionality failure caused due to this was resolved a while back by adding retry for one these jobs, but the exception will still be sent into the logs.

Comment by Kiran Upadhyaya [ 16/Apr/18 ]

Michael,

Thanks for posting the patch. I will review them.

A few observations:

1) I dont think that remove() called from both the ElanInterfaceManager and ElanInterfaceStateChangeListener() is causing this conflicting modification. Assuming 1NODE CSIT run, the JOBS for both of them are enqueued on the same key, and coupled with the synchronous submit from waitForOperationalTransaction(), these two submits should not conflict with each other. (I would have rather expected a ModifiedNodeDoesNotExistException). I guess this is already known and the goal here is to suppress the exception.

2) Now, since there is a clear Conflicting Modification from the stack trace, I suspect 

ElanUtils.waitForTransactionToComplete(flowTx) , such that some other job(with a different key) or a non-DJC task might be trying to remove this flow. This is similar to NETVIRT-972 that is linked to this bug. In this case, both the threads are run with different triggers but both the situations are valid. For this, I agree with changing the trace to debug to suppress the exception.

 I'll post further comments, if any, on the reviews.

Thanks!

Comment by Michael Vorburger [ 26/Apr/18 ]

I suddenly had a doubt whether making ElanUtils.waitForTransactionToComplete() only LOG.debug instead of error (in c/70968) could ever hide any problems, so just to re-assure myself I put together the attached FutureGetOnFailureTes.java, which proves - of course, that IFF all callers of ElanUtils.waitForTransactionToComplete() "handle" the CheckedFuture which that returns, which 3/4 callers already did, and c/70968 adds to the 4th, then this is fine.  I've raised c/71417 which adds a @CheckReturnValue to it.

With this, close this issue as the LOG.error() shown on top is now suppressed. There may be other errors now - let's track that in new JIRAs.

Generated at Wed Feb 07 20:23:31 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.