[BGPCEP-901] Deadlock encountered between PCEP session mgmt and stats collection threads Created: 24/Mar/20 Updated: 03/Aug/20 |
|
| Status: | Open |
| Project: | bgpcep |
| Component/s: | PCEP |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Ajay Lele | Assignee: | Ajay Lele |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Scenario is that large number of PCEP connections are established between Controller and PCCs (approx. 50) and large number of LSPs are delegated by each PCC (approx. 500). Then PCEP connections are brought down and back up, and this process repeated in quick succession, by restarting PCCs (pcc-mock used for repro). This results in PCEP sessions management and stats collection threads running into deadlock. There are 3 different deadlock scenarios seen. Based on feedback of previous combined patch, I am labelling them instance 1, 2 and 3 and creating separate patches for each (details in comments section below). |
| Comments |
| Comment by Ajay Lele [ 22/May/20 ] |
|
Instance-1: Deadlock happens between 2 threads trying to acquire lock on Stateful07TopologySessionListener and TopologyStatsProviderImpl instance in reverse order. In AbstractTopologySessionListener#getDelegatedLspsCount(), locking was added to avoid ConcurrentModificationException for lspData map (ref.
"pool-55-thread-1" Id=1637 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
owned by epollEventLoopGroup-8-15 Id=20525
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.getDelegatedLspsCount(AbstractTopologySessionListener.java:595)
at org.opendaylight.bgpcep.pcep.topology.provider.session.stats.SessionStateImpl.getDelegatedLspsCount(SessionStateImpl.java:178)
at org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.pcep.stats.rev171113.pcep.session.state.grouping.PcepSessionStateBuilder.<init>(PcepSessionStateBuilder.java:52)
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.updatePcepStats(TopologyStatsProviderImpl.java:85)
- locked org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@4ff973fc
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.access$000(TopologyStatsProviderImpl.java:46)
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl$1.run(TopologyStatsProviderImpl.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
"epollEventLoopGroup-8-15" Id=20525 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@4ff973fc
owned by pool-55-thread-1 Id=1637
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.unbind(TopologyStatsProviderImpl.java:148)
at Proxyd6938441_9b1d_4b95_a278_67e5916242a6.unbind(Unknown Source)
at Proxy06c2138c_7c3b_4435_acb6_d58a7b2485f3.unbind(Unknown Source)
at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.unbind(ServerSessionManager.java:272)
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@7a762f69
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.clearNodeState(AbstractTopologySessionListener.java:324)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:220)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:261)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2a544f94
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@7960122c
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@7960122c
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
|
| Comment by Ajay Lele [ 23/May/20 ] |
|
Instance-2: Multiple threads are waiting on TopologyStatsProviderImpl instance lock which is held from TopologyStatsProviderImpl#unbind() which in turn is waiting for delete transaction commit future to complete. The transaction chain has failed but the callback TopologyStatsProviderImpl#onTransactionChainFailed() is blocked on lock held by earlier thread, thus creating a deadlock. Patch avoids this by registering callback on delete transaction commit future instead of the blocking call. Node IDs for which the delete transaction is in-progress are held in thread-safe statsPendingDelete set, which is used to ensure that those are not updated concurrently from periodic update stats thread. TopologyStatsProviderImpl#bind() which adds entry to statsMap can get called immediately after unbind() and before next updatePcepStats() is called, so relying on presence of entry in statsMap alone is not sufficient.
"pool-51-thread-1" Id=324 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
owned by epollEventLoopGroup-10-14 Id=1075
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.updatePcepStats(TopologyStatsProviderImpl.java:79)
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.access$000(TopologyStatsProviderImpl.java:46)
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl$1.run(TopologyStatsProviderImpl.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"CommitFutures-18" Id=1100 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
owned by epollEventLoopGroup-10-14 Id=1075
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.onTransactionChainFailed(TopologyStatsProviderImpl.java:126)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter.failTransactionChain(BindingDOMTransactionChainAdapter.java:110)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter.access$200(BindingDOMTransactionChainAdapter.java:31)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMTransactionChainAdapter$3.onFailure(BindingDOMTransactionChainAdapter.java:91)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:94)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransaction.onFailure(PingPongTransaction.java:53)
at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.transactionFailed(PingPongTransactionChain.java:298)
at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain$2.onFailure(PingPongTransactionChain.java:248)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:54)
at org.opendaylight.controller.sal.core.compat.LegacyDOMDataBrokerAdapter$DOMDataTransactionAdapter$1.onFailure(LegacyDOMDataBrokerAdapter.java:297)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1387)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:713)
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:94)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"epollEventLoopGroup-10-14" Id=1075 in WAITING on lock=com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture@5a42698a
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:499)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:85)
at org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl.unbind(TopologyStatsProviderImpl.java:152)
- locked org.opendaylight.bgpcep.pcep.topology.stats.provider.TopologyStatsProviderImpl@5d1ccc60
at Proxy51f8f312_6f7d_4942_8a59_56b11b75844f.unbind(Unknown Source)
at Proxyea405bc9_703c_4292_a99f_75b618da2fab.unbind(Unknown Source)
at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.unbind(ServerSessionManager.java:272)
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@682b38b5
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.clearNodeState(AbstractTopologySessionListener.java:325)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:221)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:262)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@47243d15
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@767b4bd4
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@767b4bd4
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
|
| Comment by Ajay Lele [ 23/May/20 ] |
|
Instance-3: 2 threads take lock on ServerSessionManager and Stateful07TopologySessionListener instances in opposite order causing deadlock. Patch fixes this by ensuring same order is followed while acquiring the locks (ServerSessionManager followed by Stateful07TopologySessionListener). "epollEventLoopGroup-8-16" Id=1504 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
owned by epollEventLoopGroup-8-15 Id=1503
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.close(AbstractTopologySessionListener.java:316)
at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.takeNodeState(ServerSessionManager.java:158)
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@66e3973e
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionUp(AbstractTopologySessionListener.java:133)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@2bc8da08
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.sessionUp(PCEPSessionImpl.java:373)
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.handlerAdded(PCEPSessionImpl.java:425)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@22fa92d6
at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:637)
at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:597)
at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:534)
at org.opendaylight.protocol.pcep.impl.AbstractSessionNegotiator.negotiationSuccessful(AbstractSessionNegotiator.java:35)
at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessageKeepAlive(AbstractPCEPSessionNegotiator.java:218)
at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessageKeepWait(AbstractPCEPSessionNegotiator.java:207)
at org.opendaylight.protocol.pcep.impl.AbstractPCEPSessionNegotiator.handleMessage(AbstractPCEPSessionNegotiator.java:335)
at org.opendaylight.protocol.pcep.impl.AbstractSessionNegotiator.channelRead(AbstractSessionNegotiator.java:75)
"epollEventLoopGroup-8-15" Id=1503 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@66e3973e
owned by epollEventLoopGroup-8-16 Id=1504
at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.releaseNodeState(ServerSessionManager.java:124)
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:220)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:262)
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@17e301b5
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@60ce7c0d
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@60ce7c0d
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:390)
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:355)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
|
| Comment by Ajay Lele [ 23/May/20 ] |
|
Instance-4: 2 threads take lock on PCEPSessionImpl and ServerSessionManager instances in opposite order causing deadlock. Lock on PCEPSessionImpl instance in PCEPSessionImpl#close() is not really required as this.closed which is an AtomicBoolean will prevent concurrent access, so patch gets rid of it to avoid the deadlock. "epollEventLoopGroup-8-8" Id=362 in BLOCKED on lock=org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
owned by epollEventLoopGroup-8-5 Id=359\
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.tearDown(AbstractTopologySessionListener.java:226)\
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionDown(AbstractTopologySessionListener.java:273)\
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.endOfInput(PCEPSessionImpl.java:277)\
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.channelInactive(PCEPSessionImpl.java:408)\
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)\
"epollEventLoopGroup-8-5" Id=359 in BLOCKED on lock=org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@e7af61\
owned by epollEventLoopGroup-8-8 Id=362\
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.close(PCEPSessionImpl.java:241)\
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.close(AbstractTopologySessionListener.java:332)\
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@16a0df3e\
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
at org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager.takeNodeState(ServerSessionManager.java:158)\
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
at org.opendaylight.bgpcep.pcep.topology.provider.AbstractTopologySessionListener.onSessionUp(AbstractTopologySessionListener.java:135)\
- locked org.opendaylight.bgpcep.pcep.topology.provider.Stateful07TopologySessionListener@604066d3\
- locked org.opendaylight.bgpcep.pcep.topology.provider.ServerSessionManager@e2050d5\
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.sessionUp(PCEPSessionImpl.java:373)\
at org.opendaylight.protocol.pcep.impl.PCEPSessionImpl.handlerAdded(PCEPSessionImpl.java:425)\
- locked org.opendaylight.protocol.pcep.impl.PCEPSessionImpl@4bb66e20\
|
| Comment by Ajay Lele [ 26/May/20 ] |
|
Patches on master: Instance-1: https://git.opendaylight.org/gerrit/c/bgpcep/+/89966 Instance-2: https://git.opendaylight.org/gerrit/c/bgpcep/+/89968 Instance-3: https://git.opendaylight.org/gerrit/c/bgpcep/+/89970 Instance-4: https://git.opendaylight.org/gerrit/c/bgpcep/+/89971 |