[MDSAL-362] Thread deadlock between mdsal and bgpcep when running Controller Clustering CSIT job Created: 06/Jul/18  Updated: 20/Aug/18  Resolved: 20/Aug/18

Status: Resolved
Project: mdsal
Component/s: None
Affects Version/s: None
Fix Version/s: Fluorine

Type: Bug Priority: Highest
Reporter: Victor Pickard Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File controller-csit-3node-clustering-vpickard-all-oxygen-17.txt    
Issue Links:
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress

 Description   

I've seen this issue in 4 job runs so far. Added other job links in subsequent comments.

 

 

Logs for the run can be found here:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/178/controller-csit-3node-clustering-vpickard-all-oxygen/17/

 

For reference, here are the logs from the 2 deadlocked threads:

 

 

2018-07-03T18:42:02,391 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-akka.actor.default-dispatcher-18 locked on org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS@3acd01d1 (owned by opendaylight-cluster-data-notification-dispatcher-61):
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.instantiateServiceInstance(PCEPTopologyProviderBean.java:157)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl$$Lambda$1173/320400615.accept(Unknown Source)
	 at java.util.ArrayList.forEach(ArrayList.java:1257)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433)
	 at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234)
	 at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)
	 at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44)
	 at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33)
	 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
	 at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	 at akka.actor.Actor.aroundReceive(Actor.scala:517)
	 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
	 at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
	 at akka.actor.ActorCell.invoke(ActorCell.scala:559)
	 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-07-03T18:42:02,392 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-61 locked on java.util.concurrent.locks.ReentrantLock$FairSync@340ae920 (owned by opendaylight-cluster-data-akka.actor.default-dispatcher-18):
	 at sun.misc.Unsafe.park(Native Method)
	 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	 at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	 at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
	 at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	 at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.unregisterService(ClusterSingletonServiceGroupImpl.java:393)
	 at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:151)
	 at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl$1.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:123)
	 at org.opendaylight.yangtools.concepts.AbstractRegistration.close(AbstractRegistration.java:41)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.close(PCEPTopologyProviderBean.java:186)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean.close(PCEPTopologyProviderBean.java:88)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.closeTopology(PCEPTopologyDeployerImpl.java:151)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.updateTopologyProvider(PCEPTopologyDeployerImpl.java:92)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.lambda$onDataTreeChanged$0(PCEPTopologyDeployerImpl.java:75)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl$$Lambda$1179/1398407903.accept(Unknown Source)
	 at java.util.Iterator.forEachRemaining(Iterator.java:116)
	 at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.onDataTreeChanged(PCEPTopologyDeployerImpl.java:71)
	 at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41)
	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
	 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
	 at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	 at akka.actor.Actor.aroundReceive(Actor.scala:517)
	 at akka.actor.Actor.aroundReceive$(Actor.scala:515)
	 at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
	 at akka.actor.ActorCell.invoke(ActorCell.scala:559)
	 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

 



 Comments   
Comment by Victor Pickard [ 06/Jul/18 ]

I added an attachment with some preliminary analysis. 

Comment by Victor Pickard [ 06/Jul/18 ]

Here is another job with the same thread deadlock.

 

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/180/controller-csit-3node-clustering-vpickard-all-oxygen/22/

Comment by Robert Varga [ 08/Jul/18 ]
"opendaylight-cluster-data-notification-dispatcher-61" id=246 state=WAITING
    - waiting on <0x340ae920> (a java.util.concurrent.locks.ReentrantLock$FairSync)
    - locked <0x340ae920> (a java.util.concurrent.locks.ReentrantLock$FairSync)
     owned by opendaylight-cluster-data-akka.actor.default-dispatcher-18 id=200
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.unregisterService(ClusterSingletonServiceGroupImpl.java:393)
    at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:151)
    at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl$1.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:123)
    at org.opendaylight.yangtools.concepts.AbstractRegistration.close(AbstractRegistration.java:41)
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.close(PCEPTopologyProviderBean.java:186)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS@3acd01d1
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean.close(PCEPTopologyProviderBean.java:88)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean@2ab96127
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.closeTopology(PCEPTopologyDeployerImpl.java:151)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.updateTopologyProvider(PCEPTopologyDeployerImpl.java:92)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.lambda$onDataTreeChanged$0(PCEPTopologyDeployerImpl.java:75)
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl$$Lambda$1179/1398407903.accept(Unknown Source)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl.onDataTreeChanged(PCEPTopologyDeployerImpl.java:71)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyDeployerImpl@3e59b301
    at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41)
    at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
    at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
    at akka.actor.ActorCell.invoke(ActorCell.scala:559)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Comment by Robert Varga [ 08/Jul/18 ]
"opendaylight-cluster-data-akka.actor.default-dispatcher-18" id=200 state=BLOCKED
    - waiting to lock <0x3acd01d1> (a org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS)
     owned by opendaylight-cluster-data-notification-dispatcher-61 id=246
    at org.opendaylight.bgpcep.pcep.topology.provider.config.PCEPTopologyProviderBean$PCEPTopologyProviderBeanCSS.instantiateServiceInstance(PCEPTopologyProviderBean.java:157)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl$$Lambda$1173/320400615.accept(Unknown Source)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453)
    at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433)
    at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234)
    at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)
    at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44)
    at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33)
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
    at akka.actor.Actor.aroundReceive(Actor.scala:517)
    at akka.actor.Actor.aroundReceive$(Actor.scala:515)
    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
    at akka.actor.ActorCell.invoke(ActorCell.scala:559)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

    Locked synchronizers: count = 1
      - java.util.concurrent.locks.ReentrantLock$FairSync@340ae920
Comment by Victor Pickard [ 09/Jul/18 ]

There are 2 more jobs where I saw this deadlock. 

 

https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-vpickard-all-oxygen/37

 

In this job, 3 deadlocked threads.

https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-vpickard-all-oxygen/41

Comment by Victor Pickard [ 10/Jul/18 ]

I looked to see if this deadlock was occurring in netvirt csit job, didn't see anything so far. For reference, this is the netvirt job I took a look at:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-openstack-queens-upstream-stateful-oxygen/315/

Comment by Victor Pickard [ 17/Jul/18 ]

Saw another deadlock in a job today. This is slightly different than the other deadlocks. Notice the service is BMP, not bgpcep as the other deadlocks we have seen.

2018-07-17T09:04:02,885 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-64 locked on java.util.concurrent.locks.ReentrantLock$FairSync@53cb2354 (owned by opendaylight-cluster-data-akka.actor.default-dispatcher-2):
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.unregisterService(ClusterSingletonServiceGroupImpl.java:393)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:151)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl$1.removeRegistration(AbstractClusterSingletonServiceProviderImpl.java:123)
at org.opendaylight.yangtools.concepts.AbstractRegistration.close(AbstractRegistration.java:41)
at org.opendaylight.protocol.bmp.impl.app.BmpMonitoringStationImpl.close(BmpMonitoringStationImpl.java:169)
at org.opendaylight.protocol.bmp.impl.config.BmpDeployerImpl.updateBmpMonitor(BmpDeployerImpl.java:112)
at org.opendaylight.protocol.bmp.impl.config.BmpDeployerImpl.handleModification(BmpDeployerImpl.java:98)
at org.opendaylight.protocol.bmp.impl.config.BmpDeployerImpl.lambda$onDataTreeChanged$0(BmpDeployerImpl.java:86)
at org.opendaylight.protocol.bmp.impl.config.BmpDeployerImpl$$Lambda$1230/2084640633.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.opendaylight.protocol.bmp.impl.config.BmpDeployerImpl.onDataTreeChanged(BmpDeployerImpl.java:86)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41)
at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
at akka.actor.ActorCell.invoke(ActorCell.scala:559)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

2018-07-17T09:04:02,885 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 357 - org.opendaylight.infrautils.metrics-impl - 1.3.3.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-akka.actor.default-dispatcher-2 locked on org.opendaylight.protocol.bmp.impl.app.BmpMonitoringStationImpl@5d0cc75d (owned by opendaylight-cluster-data-notification-dispatcher-64):
at org.opendaylight.protocol.bmp.impl.app.BmpMonitoringStationImpl.instantiateServiceInstance(BmpMonitoringStationImpl.java:90)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lambda$startServices$0(ClusterSingletonServiceGroupImpl.java:648)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl$$Lambda$1061/450812025.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.startServices(ClusterSingletonServiceGroupImpl.java:645)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.cleanupCandidateOwnershipChanged(ClusterSingletonServiceGroupImpl.java:506)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.lockedOwnershipChanged(ClusterSingletonServiceGroupImpl.java:453)
at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(ClusterSingletonServiceGroupImpl.java:433)
at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(AbstractClusterSingletonServiceProviderImpl.java:234)
at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(DOMClusterSingletonServiceProviderImpl.java:23)
at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(EntityOwnershipListenerActor.java:44)
at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(EntityOwnershipListenerActor.java:33)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590)
at akka.actor.ActorCell.invoke(ActorCell.scala:559)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Comment by Victor Pickard [ 17/Jul/18 ]

Deadlock occurred on odl3, here are the sandbox job logs.

https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/controller-csit-3node-clustering-vpickard-all-oxygen/10/odl3_karaf.log.gz

I have started the copy-logs job to save these off, will post the link when the job finishes.

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/233/controller-csit-3node-clustering-vpickard-all-oxygen/10/

Comment by Robert Varga [ 19/Jul/18 ]

This will require an internal rework of how registered services are tracked and started, along with splitting the lock into two: one protecting the EOS state and one protecting service state. That implies that services may be starting concurrently to being registered/unregistered – which means we need to track start/stop state on a per-service basis.

Furtheremore we need to downgrade the EOS-keeping lock to a non-reentrant one, so we prevent accidental recursion when register would see EOS being up and attempt to start the service, rather than exiting the stack and letting the callback origin take care of that.

Comment by Robert Varga [ 17/Aug/18 ]

jluhrsen https://git.opendaylight.org/gerrit/75169 is ready for testing, I think.

Comment by Jamo Luhrsen [ 17/Aug/18 ]

on it. running to patch tests (using gerrit keywords) and another openflow job. Links are in the gerrit, but I'll report back here when things are done.

Comment by Jamo Luhrsen [ 17/Aug/18 ]

all of our gate csit has passed with the stable/fluorine patch. good to go from my point of view.

Generated at Wed Feb 07 20:09:37 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.