Please check the karaf log attached.
The problem seems to be started from a customer router restart, which causes a BGP session reset:
2017-11-10 19:40:55,184 | WARN | ntLoopGroup-11-2 | BGPSessionImpl | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP session encountered error io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) 2017-11-10 19:40:55,476 | WARN | ntLoopGroup-11-2 | BGPSessionImpl | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to send message Notify [_errorCode=6, _errorSubcode=0, augmentation=[]] to socket [id: 0x7fbd98a3, L:/10.64.236.36:50186 ! R:10.64.192.166/10.64.192.166:179] io.netty.channel.unix.Errors$NativeIoException: syscall:write(...)() failed: Broken pipe at io.netty.channel.unix.FileDescriptor.writeAddress(...)(Unknown Source)
Then a OptimisticLockException was captured:
2017-11-10 19:40:56,145 | ERROR | CommitFutures-9 | AdjRibInWriter | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to create Empty Structure OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node children was modified by other transaction]]} at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:532)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:560)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:81)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:98)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:237)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:201)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:437)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:243)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:291)[163:org.opendaylight.controller.sal-akka-raft:1.4.3.Boron-SR3] at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:29)[162:org.opendaylight.controller.sal-clustering-commons:1.4.3.Boron-SR3]
After some time, BGP session re-established, but a netty error occured:
2017-11-10 19:42:57,608 | INFO | ntLoopGroup-11-3 | BGPPeer | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv6AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily], BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily]] and Add Path tables [] 2017-11-10 19:42:57,675 | WARN | ntLoopGroup-11-3 | DefaultChannelPipeline | 111 - io.netty.common - 4.0.44.Final | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception. io.netty.channel.ChannelPipelineException: org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handlerAdded() has thrown an exception; removed. at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:601)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:545)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:486)[112:io.netty.transport:4.0.44.Final] at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.negotiationSuccessful(AbstractBGPSessionNegotiator.java:253)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3] at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.handleMessage(AbstractBGPSessionNegotiator.java:166)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3] at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.channelRead(AbstractBGPSessionNegotiator.java:292)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)[112:io.netty.transport:4.0.44.Final] at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:934)[116:io.netty.transport-native-epoll:4.0.44.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397)[116:io.netty.transport-native-epoll:4.0.44.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302)[116:io.netty.transport-native-epoll:4.0.44.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final] at java.lang.Thread.run(Thread.java:745)[:1.8.0_101] Caused by: java.lang.IllegalStateException: New transaction PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@69fc1ef} raced with transacion PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@423d24f4} at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.slowAllocateTransaction(PingPongTransactionChain.java:176)[136:org.opendaylight.controller.sal-broker-impl:1.4.3.Boron-SR3]
However, BGP module still thinks the session is established successfully.
2017-11-10 19:42:57,676 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Promise BGPProtocolSessionPromise@670fb461(incomplete) completed 2017-11-10 19:42:57,676 | INFO | ntLoopGroup-11-3 | AbstractBGPSessionNegotiator | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP Session with peer [id: 0xd1cc8f4c, L:/10.64.236.36:52530 - R:10.64.192.166/10.64.192.166:179] established successfully.
But the session went stale since that moment. Consequently, after 3 min, the BGP session expired due to HOLD_TIME_EXPIRE
2017-11-10 19:45:57,610 | ERROR | ntLoopGroup-11-3 | BGPDocumentedException | 347 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.3.Boron-SR3 | Error = HOLD_TIMER_EXPIRED org.opendaylight.protocol.bgp.parser.BGPDocumentedException at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handleHoldTimer(BGPSessionImpl.java:365) at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)[111:io.netty.common:4.0.44.Final] at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)[111:io.netty.common:4.0.44.Final] at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)[111:io.netty.common:4.0.44.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)[116:io.netty.transport-native-epoll:4.0.44.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final] at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final] at java.lang.Thread.run(Thread.java:745)[:1.8.0_101] 2017-11-10 19:45:57,612 | INFO | ntLoopGroup-11-3 | BGPPeer | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 terminated: BGPTerminationReason{error=HOLD_TIMER_EXPIRED} 2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPReconnectPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Reconnecting after connection to /10.64.192.166:179 was dropped 2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Callback for session creation with peer IpAddress [_ipv4Address=Ipv4Address [_value=10.64.192.166]] received 2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Connection to /10.64.192.166:179 already exists
However, BGP session was unable to be re-established, as session was stale, and upon reconnect, BGP session thinks the connection already existed.
Expected fix:
1. Fix the OptismisticLockException
2. Fix the PingPong transaction failure.
3. (Bottom line) BGP session should recover after netty failure.