[BGPCEP-1023] Race during peer shutdown Created: 11/Apr/23  Updated: 27/Oct/23

Status: Confirmed
Project: bgpcep
Component/s: BGP
Affects Version/s: None
Fix Version/s: 0.19.9

Type: Bug Priority: Medium
Reporter: Robert Varga Assignee: Robert Varga
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

We have the following splat in CSIT:

2023-04-11T01:32:00,588 | INFO  | epollEventLoopGroup-7-6 | BGPPeer                          | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Session with peer 127.0.0.3 went down
2023-04-11T01:32:00,589 | INFO  | epollEventLoopGroup-7-6 | BGPPeer                          | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing session with peer
2023-04-11T01:32:00,590 | INFO  | epollEventLoopGroup-7-6 | AbstractPeer                     | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closed per Peer /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.3}] removed
2023-04-11T01:32:00,591 | INFO  | epollEventLoopGroup-7-3 | AbstractPeer                     | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing peer chain Uri{value=bgp://127.0.0.2}
2023-04-11T01:32:00,591 | INFO  | epollEventLoopGroup-7-3 | BGPSessionImpl                   | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing session: BGPSessionImpl{channel=[id: 0x75f3dae8, L:/10.30.170.109:1790 ! R:/127.0.0.2:39923], state=UP}
2023-04-11T01:32:00,592 | WARN  | opendaylight-cluster-data-shard-dispatcher-27 | ShardDataTree                    | 284 - org.opendaylight.controller.sal-distributed-datastore - 7.0.4 | member-1-shard-default-operational: Store Tx member-1-datastore-operational-fe-0-chn-249-txn-2-0: Conflicting modification for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.3}].
2023-04-11T01:32:00,601 | INFO  | epollEventLoopGroup-7-2 | BGPSessionImpl                   | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | End of input detected. Close the session.
2023-04-11T01:32:00,601 | INFO  | epollEventLoopGroup-7-2 | BGPPeer                          | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Session with peer 127.0.0.4 went down
2023-04-11T01:32:00,601 | INFO  | epollEventLoopGroup-7-2 | BGPPeer                          | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing session with peer
2023-04-11T01:32:00,602 | INFO  | epollEventLoopGroup-7-2 | AbstractPeer                     | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closed per Peer /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.4}] removed
2023-04-11T01:32:00,603 | INFO  | epollEventLoopGroup-7-2 | AbstractPeer                     | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing peer chain Uri{value=bgp://127.0.0.4}
2023-04-11T01:32:00,603 | INFO  | epollEventLoopGroup-7-2 | BGPSessionImpl                   | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Closing session: BGPSessionImpl{channel=[id: 0x144dc834, L:/10.30.170.109:1790 ! R:/127.0.0.4:33429], state=UP}
2023-04-11T01:32:00,602 | ERROR | epollEventLoopGroup-7-6 | AbstractPeer                     | 228 - org.opendaylight.bgpcep.bgp-rib-impl - 0.19.4 | Write routes failed
java.util.concurrent.ExecutionException: OptimisticLockFailedException{message=Optimistic lock failed for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.3}], errorList=[RpcError [message=Optimistic lock failed for path /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://127.0.0.3}], severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.tree.api.ConflictingModificationAppliedException: Node was deleted by other transaction.]]}
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:588) ~[?:?]
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:567) ~[?:?]
	at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.AbstractPeer.releaseRibOutChain(AbstractPeer.java:490) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.BGPPeer.releaseConnection(BGPPeer.java:576) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.BGPPeer.releaseConnectionGracefully(BGPPeer.java:666) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.BGPPeer.onSessionDown(BGPPeer.java:531) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.endOfInput(BGPSessionImpl.java:335) ~[?:?]
	at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.channelInactive(BGPSessionImpl.java:548) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:411) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:376) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[?:?]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

This seems to be an ordering issue in BGPPeer.releaseConnection(), in the not isRestartingGracefully() case:

  • we first terminateConnection(), which in turn runs removePeer() with AbstractPeer.domChain
  • we then releaseRibOutChain(), which in turn waits for transactions on AbstractPeer.ribOutChain

Given these two are unsynchronized, the domChain transaction can jump in front of whatever is happening on ribOut.


Generated at Wed Feb 07 19:14:48 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.