Uploaded image for project: 'bgpcep'
  1. bgpcep
  2. BGPCEP-724

BGP PingPong transaction racing and BGP session not recovered after that

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: High High
    • Nitrogen, Carbon, Oxygen
    • Nitrogen, Carbon, Oxygen
    • BGP
    • None

      Please check the karaf log attached.

      The problem seems to be started from a customer router restart, which causes a BGP session reset:

      2017-11-10 19:40:55,184 | WARN  | ntLoopGroup-11-2 | BGPSessionImpl                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP session encountered error
      io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
              at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)
      2017-11-10 19:40:55,476 | WARN  | ntLoopGroup-11-2 | BGPSessionImpl                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to send message Notify [_errorCode=6, _errorSubcode=0, augmentation=[]] to socket [id: 0x7fbd98a3, L:/10.64.236.36:50186 ! R:10.64.192.166/10.64.192.166:179]
      io.netty.channel.unix.Errors$NativeIoException: syscall:write(...)() failed: Broken pipe
              at io.netty.channel.unix.FileDescriptor.writeAddress(...)(Unknown Source)
      

      Then a OptimisticLockException was captured:

      2017-11-10 19:40:56,145 | ERROR | CommitFutures-9  | AdjRibInWriter                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to create Empty Structure
      OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node children was modified by other transaction]]}
              at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:532)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:560)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:81)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:98)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:237)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:201)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:437)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:243)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:291)[163:org.opendaylight.controller.sal-akka-raft:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:29)[162:org.opendaylight.controller.sal-clustering-commons:1.4.3.Boron-SR3]
      

      After some time, BGP session re-established, but a netty error occured:

      2017-11-10 19:42:57,608 | INFO  | ntLoopGroup-11-3 | BGPPeer                          | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv6AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily], BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily]] and Add Path tables []
      2017-11-10 19:42:57,675 | WARN  | ntLoopGroup-11-3 | DefaultChannelPipeline           | 111 - io.netty.common - 4.0.44.Final | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
      io.netty.channel.ChannelPipelineException: org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handlerAdded() has thrown an exception; removed.
              at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:601)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:545)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:486)[112:io.netty.transport:4.0.44.Final]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.negotiationSuccessful(AbstractBGPSessionNegotiator.java:253)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.handleMessage(AbstractBGPSessionNegotiator.java:166)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.channelRead(AbstractBGPSessionNegotiator.java:292)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:934)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
              at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
      Caused by: java.lang.IllegalStateException: New transaction PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@69fc1ef} raced with transacion PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@423d24f4}
              at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.slowAllocateTransaction(PingPongTransactionChain.java:176)[136:org.opendaylight.controller.sal-broker-impl:1.4.3.Boron-SR3]
      

      However, BGP module still thinks the session is established successfully.

      2017-11-10 19:42:57,676 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Promise BGPProtocolSessionPromise@670fb461(incomplete) completed
      2017-11-10 19:42:57,676 | INFO  | ntLoopGroup-11-3 | AbstractBGPSessionNegotiator     | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP Session with peer [id: 0xd1cc8f4c, L:/10.64.236.36:52530 - R:10.64.192.166/10.64.192.166:179] established successfully.
      

      But the session went stale since that moment. Consequently, after 3 min, the BGP session expired due to HOLD_TIME_EXPIRE

      2017-11-10 19:45:57,610 | ERROR | ntLoopGroup-11-3 | BGPDocumentedException           | 347 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.3.Boron-SR3 | Error = HOLD_TIMER_EXPIRED
      org.opendaylight.protocol.bgp.parser.BGPDocumentedException
              at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handleHoldTimer(BGPSessionImpl.java:365)
              at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)[111:io.netty.common:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
              at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
      2017-11-10 19:45:57,612 | INFO  | ntLoopGroup-11-3 | BGPPeer                          | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 terminated: BGPTerminationReason{error=HOLD_TIMER_EXPIRED}
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPReconnectPromise              | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Reconnecting after connection to /10.64.192.166:179 was dropped
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Callback for session creation with peer IpAddress [_ipv4Address=Ipv4Address [_value=10.64.192.166]] received
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Connection to /10.64.192.166:179 already exists
      

      However, BGP session was unable to be re-established, as session was stale, and upon reconnect, BGP session thinks the connection already existed.

      Expected fix:
      1. Fix the OptismisticLockException
      2. Fix the PingPong transaction failure.
      3. (Bottom line) BGP session should recover after netty failure.

            kevinxw Kevin Wang
            kevinxw Kevin Wang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: