[BGPCEP-901] Deadlock encountered between PCEP session mgmt and stats collection threads Created: 24/Mar/20  Updated: 03/Aug/20

Status: Open
Project: bgpcep
Component/s: PCEP
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Ajay Lele Assignee: Ajay Lele
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Scenario is that large number of PCEP connections are established between Controller and PCCs (approx. 50) and large number of LSPs are delegated by each PCC (approx. 500). Then PCEP connections are brought down and back up, and this process repeated in quick succession, by restarting PCCs (pcc-mock used for repro). This results in PCEP sessions management and stats collection threads running into deadlock. There are 3 different deadlock scenarios seen. Based on feedback of previous combined patch, I am labelling them instance 1, 2 and 3 and creating separate patches for each (details in comments section below).



 Comments   
Comment by Ajay Lele [ 22/May/20 ]

Instance-1:

Deadlock happens between 2 threads trying to acquire lock on Stateful07TopologySessionListener and TopologyStatsProviderImpl instance in reverse order. In AbstractTopologySessionListener#getDelegatedLspsCount(), locking was added to avoid ConcurrentModificationException for lspData map (ref. BGPCEP-875, BGPCEP-845). Patch achieves same result without having to acquire the lock by using thread-safe ConcurrentHashMap for lspData.

 

"pool-55-thread-1" Id=1637 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
     owned by epollEventLoopGroup-8-15 Id=20525
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.getDelegatedLspsCount(AbstractTopologySessionListener.java:595)
    at org.opendaylight.bgpcep.pcep.topology.provider.session.stats.SessionStateImpl.getDelegatedLspsCount(SessionStateImpl.java:178)
    at org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.pcep.stats.rev171113.pcep.session.state.grouping.PcepSessionStateBuilder.<init>(PcepSessionStateBuilder.java:52)
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.updatePcepStats(TopologyStatsProviderImpl.java:85)
      - locked org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@4ff973fc
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.access$000(TopologyStatsProviderImpl.java:46)
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl$1.run(TopologyStatsProviderImpl.java:70)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)


"epollEventLoopGroup-8-15" Id=20525 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@4ff973fc
     owned by pool-55-thread-1 Id=1637
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.unbind(TopologyStatsProviderImpl.java:148)
    at Proxyd6938441_9b1d_4b95_a278_67e5916242a6.unbind(Unknown Source)
    at Proxy06c2138c_7c3b_4435_acb6_d58a7b2485f3.unbind(Unknown Source)
    at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.unbind(ServerSessionManager.java:272)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@7a762f69
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.clearNodeState(AbstractTopologySessionListener.java:324)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:220)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:261)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@7960122c
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@7960122c
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231) 
Comment by Ajay Lele [ 23/May/20 ]

Instance-2:

Multiple threads are waiting on TopologyStatsProviderImpl instance lock which is held from TopologyStatsProviderImpl#unbind() which in turn is waiting for delete transaction commit future to complete. The transaction chain has failed but the callback TopologyStatsProviderImpl#onTransactionChainFailed() is blocked on lock held by earlier thread, thus creating a deadlock. Patch avoids this by registering callback on delete transaction commit future instead of the blocking call. Node IDs for which the delete transaction is in-progress are held in thread-safe statsPendingDelete set, which is used to ensure that those are not updated concurrently from periodic update stats thread. TopologyStatsProviderImpl#bind() which adds entry to statsMap can get called immediately after unbind() and before next updatePcepStats() is called, so relying on presence of entry in statsMap alone is not sufficient.

 

"pool-51-thread-1" Id=324 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
     owned by epollEventLoopGroup-10-14 Id=1075
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.updatePcepStats(TopologyStatsProviderImpl.java:79)
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.access$000(TopologyStatsProviderImpl.java:46)
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl$1.run(TopologyStatsProviderImpl.java:70)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
    

"CommitFutures-18" Id=1100 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
     owned by epollEventLoopGroup-10-14 Id=1075
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.onTransactionChainFailed(TopologyStatsProviderImpl.java:126)
    at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter.failTransactionChain(BindingDOMTransactionChainAdapter.java:110)
    at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter.access$200(BindingDOMTransactionChainAdapter.java:31)
    at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter$3.onFailure(BindingDOMTransactionChainAdapter.java:91)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:94)
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
    at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransaction.onFailure(PingPongTransaction.java:53)
    at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.transactionFailed(PingPongTransactionChain.java:298)
    at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain$2.onFailure(PingPongTransactionChain.java:248)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
    at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
    at org.opendaylight.controller.sal.core.compat.LegacyDOMDataBrokerAdapter$DOMDataTransactionAdapter$1.onFailure(LegacyDOMDataBrokerAdapter.java:297)
    at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
    at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
    at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
    at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
    at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
    at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:94)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


"epollEventLoopGroup-10-14" Id=1075 in WAITING on lock=com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture@5a42698a
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:499)
    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:85)
    at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.unbind(TopologyStatsProviderImpl.java:152)
      - locked org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
    at Proxy51f8f312_6f7d_4942_8a59_56b11b75844f.unbind(Unknown Source)
    at Proxyea405bc9_703c_4292_a99f_75b618da2fab.unbind(Unknown Source)
    at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.unbind(ServerSessionManager.java:272)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@682b38b5
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.clearNodeState(AbstractTopologySessionListener.java:325)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:221)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:262)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@767b4bd4
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@767b4bd4
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
 
Comment by Ajay Lele [ 23/May/20 ]

Instance-3:

2 threads take lock on ServerSessionManager and Stateful07TopologySessionListener instances in opposite order causing deadlock. Patch fixes this by ensuring same order is followed while acquiring the locks (ServerSessionManager followed by Stateful07TopologySessionListener).

 "epollEventLoopGroup-8-16" Id=1504 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
     owned by epollEventLoopGroup-8-15 Id=1503
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.close(AbstractTopologySessionListener.java:316)
    at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.takeNodeState(ServerSessionManager.java:158)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@66e3973e
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionUp(AbstractTopologySessionListener.java:133)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2bc8da08
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.sessionUp(PCEPSessionImpl.java:373)
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.handlerAdded(PCEPSessionImpl.java:425)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@22fa92d6
    at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:637)
    at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:597)
    at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:534)
    at org.opendaylight.protocol.pcep.impl.AbstractSessionNegotiator.negotiationSuccessful(AbstractSessionNegotiator.java:35)
    at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessageKeepAlive(AbstractPCEPSessionNegotiator.java:218)
    at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessageKeepWait(AbstractPCEPSessionNegotiator.java:207)
    at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessage(AbstractPCEPSessionNegotiator.java:335)
    at org.opendaylight.protocol.pcep.impl.AbstractSessionNegotiator.channelRead(AbstractSessionNegotiator.java:75)


"epollEventLoopGroup-8-15" Id=1503 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@66e3973e
     owned by epollEventLoopGroup-8-16 Id=1504
    at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.releaseNodeState(ServerSessionManager.java:124)
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:220)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:262)
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@60ce7c0d
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@60ce7c0d
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) 
Comment by Ajay Lele [ 23/May/20 ]

Instance-4:

2 threads take lock on PCEPSessionImpl and ServerSessionManager instances in opposite order causing deadlock. Lock on PCEPSessionImpl instance in PCEPSessionImpl#close() is not really required as this.closed which is an AtomicBoolean will prevent concurrent access, so patch gets rid of it to avoid the deadlock.

"epollEventLoopGroup-8-8" Id=362 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
     owned by epollEventLoopGroup-8-5 Id=359\
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:226)\
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:273)\
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)\
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)\
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)\


"epollEventLoopGroup-8-5" Id=359 in BLOCKED on lock=org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
     owned by epollEventLoopGroup-8-8 Id=362\
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.close(PCEPSessionImpl.java:241)\
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.close(AbstractTopologySessionListener.java:332)\
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@16a0df3e\
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
    at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.takeNodeState(ServerSessionManager.java:158)\
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
    at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionUp(AbstractTopologySessionListener.java:135)\
      - locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@604066d3\
      - locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.sessionUp(PCEPSessionImpl.java:373)\
    at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.handlerAdded(PCEPSessionImpl.java:425)\
      - locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@4bb66e20\  
 
Comment by Ajay Lele [ 26/May/20 ]

Patches on master:

Instance-1: https://git.opendaylight.org/gerrit/c/bgpcep/+/89966

Instance-2: https://git.opendaylight.org/gerrit/c/bgpcep/+/89968

Instance-3: https://git.opendaylight.org/gerrit/c/bgpcep/+/89970

Instance-4: https://git.opendaylight.org/gerrit/c/bgpcep/+/89971

Generated at Wed Feb 07 19:14:26 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.