[MDSAL-362] Thread deadlock between mdsal and bgpcep when running Controller Clustering CSIT job Created: 06/Jul/18 Updated: 20/Aug/18 Resolved: 20/Aug/18 |
|
| Status: | Resolved |
| Project: | mdsal |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | Highest |
| Reporter: | Victor Pickard | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
I've seen this issue in 4 job runs so far. Added other job links in subsequent comments.
Logs for the run can be found here:
For reference, here are the logs from the 2 deadlocked threads:
2018-07-03T18:42:02,391 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-akka.actor.default-dispatcher-18 locked on org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS@3acd01d1 (owned by opendaylight-cluster-data-notification-dispatcher-61): at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.instantiateServiceInstance(PCEPTopologyProviderBean.java:157) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl$$Lambda$1173/320400615.accept(Unknown Source) at java.util.ArrayList.forEach(ArrayList.java:1257) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433) at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234) at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23) at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44) at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33) at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) at akka.actor.ActorCell.invoke(ActorCell.scala:559) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 2018-07-03T18:42:02,392 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-61 locked on java.util.concurrent.locks.ReentrantLock$FairSync@340ae920 (owned by opendaylight-cluster-data-akka.actor.default-dispatcher-18): at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.unregisterService(ClusterSingletonServiceGroupImpl.java:393) at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:151) at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl$1.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:123) at org.opendaylight.yangtools.concepts.AbstractRegistration.close(AbstractRegistration.java:41) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.close(PCEPTopologyProviderBean.java:186) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean.close(PCEPTopologyProviderBean.java:88) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.closeTopology(PCEPTopologyDeployerImpl.java:151) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.updateTopologyProvider(PCEPTopologyDeployerImpl.java:92) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.lambda$onDataTreeChanged$0(PCEPTopologyDeployerImpl.java:75) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl$$Lambda$1179/1398407903.accept(Unknown Source) at java.util.Iterator.forEachRemaining(Iterator.java:116) at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.onDataTreeChanged(PCEPTopologyDeployerImpl.java:71) at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41) at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67) at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41) at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) at akka.actor.Actor.aroundReceive(Actor.scala:517) at akka.actor.Actor.aroundReceive$(Actor.scala:515) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) at akka.actor.ActorCell.invoke(ActorCell.scala:559) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
|
| Comments |
| Comment by Victor Pickard [ 06/Jul/18 ] |
|
I added an attachment with some preliminary analysis. |
| Comment by Victor Pickard [ 06/Jul/18 ] |
|
Here is another job with the same thread deadlock.
|
| Comment by Robert Varga [ 08/Jul/18 ] |
"opendaylight-cluster-data-notification-dispatcher-61" id=246 state=WAITING
- waiting on <0x340ae920> (a java.util.concurrent.locks.ReentrantLock$FairSync)
- locked <0x340ae920> (a java.util.concurrent.locks.ReentrantLock$FairSync)
owned by opendaylight-cluster-data-akka.actor.default-dispatcher-18 id=200
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.unregisterService(ClusterSingletonServiceGroupImpl.java:393)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:151)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl$1.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:123)
at org.opendaylight.yangtools.concepts.AbstractRegistration.close(AbstractRegistration.java:41)
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.close(PCEPTopologyProviderBean.java:186)
- locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS@3acd01d1
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean.close(PCEPTopologyProviderBean.java:88)
- locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean@2ab96127
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.closeTopology(PCEPTopologyDeployerImpl.java:151)
- locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.updateTopologyProvider(PCEPTopologyDeployerImpl.java:92)
- locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.lambda$onDataTreeChanged$0(PCEPTopologyDeployerImpl.java:75)
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl$$Lambda$1179/1398407903.accept(Unknown Source)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.onDataTreeChanged(PCEPTopologyDeployerImpl.java:71)
- locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41)
at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
at akka.actor.ActorCell.invoke(ActorCell.scala:559)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
|
| Comment by Robert Varga [ 08/Jul/18 ] |
"opendaylight-cluster-data-akka.actor.default-dispatcher-18" id=200 state=BLOCKED
- waiting to lock <0x3acd01d1> (a org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS)
owned by opendaylight-cluster-data-notification-dispatcher-61 id=246
at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.instantiateServiceInstance(PCEPTopologyProviderBean.java:157)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl$$Lambda$1173/320400615.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234)
at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)
at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44)
at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
at akka.actor.ActorCell.invoke(ActorCell.scala:559)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Locked synchronizers: count = 1
- java.util.concurrent.locks.ReentrantLock$FairSync@340ae920
|
| Comment by Victor Pickard [ 09/Jul/18 ] |
|
There are 2 more jobs where I saw this deadlock.
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-vpickard-all-oxygen/37
In this job, 3 deadlocked threads. https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-vpickard-all-oxygen/41 |
| Comment by Victor Pickard [ 10/Jul/18 ] |
|
I looked to see if this deadlock was occurring in netvirt csit job, didn't see anything so far. For reference, this is the netvirt job I took a look at: |
| Comment by Victor Pickard [ 17/Jul/18 ] |
|
Saw another deadlock in a job today. This is slightly different than the other deadlocks. Notice the service is BMP, not bgpcep as the other deadlocks we have seen. 2018-07-17T09:04:02,885 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-64 locked on java.util.concurrent.locks.ReentrantLock$FairSync@53cb2354 (owned by opendaylight-cluster-data-akka.actor.default-dispatcher-2): 2018-07-17T09:04:02,885 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-akka.actor.default-dispatcher-2 locked on org.opendaylight.protocol.bmp.impl.app.BmpMonitoringStationImpl@5d0cc75d (owned by opendaylight-cluster-data-notification-dispatcher-64): |
| Comment by Victor Pickard [ 17/Jul/18 ] |
|
Deadlock occurred on odl3, here are the sandbox job logs. I have started the copy-logs job to save these off, will post the link when the job finishes. |
| Comment by Robert Varga [ 19/Jul/18 ] |
|
This will require an internal rework of how registered services are tracked and started, along with splitting the lock into two: one protecting the EOS state and one protecting service state. That implies that services may be starting concurrently to being registered/unregistered – which means we need to track start/stop state on a per-service basis. Furtheremore we need to downgrade the EOS-keeping lock to a non-reentrant one, so we prevent accidental recursion when register would see EOS being up and attempt to start the service, rather than exiting the stack and letting the callback origin take care of that. |
| Comment by Robert Varga [ 17/Aug/18 ] |
|
jluhrsen https://git.opendaylight.org/gerrit/75169 is ready for testing, I think. |
| Comment by Jamo Luhrsen [ 17/Aug/18 ] |
|
on it. running to patch tests (using gerrit keywords) and another openflow job. Links are in the gerrit, but I'll report back here when things are done. |
| Comment by Jamo Luhrsen [ 17/Aug/18 ] |
|
all of our gate csit has passed with the stable/fluorine patch. good to go from my point of view. |