Uploaded image for project: 'bgpcep'
  1. bgpcep
  2. BGPCEP-724

BGP PingPong transaction racing and BGP session not recovered after that

    XMLWordPrintable

Details

    • Bug
    • Status: Verified
    • High
    • Resolution: Done
    • Nitrogen, Carbon, Oxygen
    • Nitrogen, Carbon, Oxygen
    • BGP
    • None

    Description

      Please check the karaf log attached.

      The problem seems to be started from a customer router restart, which causes a BGP session reset:

      2017-11-10 19:40:55,184 | WARN  | ntLoopGroup-11-2 | BGPSessionImpl                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP session encountered error
      io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
              at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)
      2017-11-10 19:40:55,476 | WARN  | ntLoopGroup-11-2 | BGPSessionImpl                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to send message Notify [_errorCode=6, _errorSubcode=0, augmentation=[]] to socket [id: 0x7fbd98a3, L:/10.64.236.36:50186 ! R:10.64.192.166/10.64.192.166:179]
      io.netty.channel.unix.Errors$NativeIoException: syscall:write(...)() failed: Broken pipe
              at io.netty.channel.unix.FileDescriptor.writeAddress(...)(Unknown Source)
      

      Then a OptimisticLockException was captured:

      2017-11-10 19:40:56,145 | ERROR | CommitFutures-9  | AdjRibInWriter                   | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to create Empty Structure
      OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node children was modified by other transaction]]}
              at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:532)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:560)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:81)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:98)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:237)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:201)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:437)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:243)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:291)[163:org.opendaylight.controller.sal-akka-raft:1.4.3.Boron-SR3]
              at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:29)[162:org.opendaylight.controller.sal-clustering-commons:1.4.3.Boron-SR3]
      

      After some time, BGP session re-established, but a netty error occured:

      2017-11-10 19:42:57,608 | INFO  | ntLoopGroup-11-3 | BGPPeer                          | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv6AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily], BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily]] and Add Path tables []
      2017-11-10 19:42:57,675 | WARN  | ntLoopGroup-11-3 | DefaultChannelPipeline           | 111 - io.netty.common - 4.0.44.Final | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
      io.netty.channel.ChannelPipelineException: org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handlerAdded() has thrown an exception; removed.
              at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:601)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:545)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:486)[112:io.netty.transport:4.0.44.Final]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.negotiationSuccessful(AbstractBGPSessionNegotiator.java:253)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.handleMessage(AbstractBGPSessionNegotiator.java:166)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.channelRead(AbstractBGPSessionNegotiator.java:292)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
              at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)[112:io.netty.transport:4.0.44.Final]
              at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:934)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
              at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
      Caused by: java.lang.IllegalStateException: New transaction PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@69fc1ef} raced with transacion PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@423d24f4}
              at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.slowAllocateTransaction(PingPongTransactionChain.java:176)[136:org.opendaylight.controller.sal-broker-impl:1.4.3.Boron-SR3]
      

      However, BGP module still thinks the session is established successfully.

      2017-11-10 19:42:57,676 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Promise BGPProtocolSessionPromise@670fb461(incomplete) completed
      2017-11-10 19:42:57,676 | INFO  | ntLoopGroup-11-3 | AbstractBGPSessionNegotiator     | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP Session with peer [id: 0xd1cc8f4c, L:/10.64.236.36:52530 - R:10.64.192.166/10.64.192.166:179] established successfully.
      

      But the session went stale since that moment. Consequently, after 3 min, the BGP session expired due to HOLD_TIME_EXPIRE

      2017-11-10 19:45:57,610 | ERROR | ntLoopGroup-11-3 | BGPDocumentedException           | 347 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.3.Boron-SR3 | Error = HOLD_TIMER_EXPIRED
      org.opendaylight.protocol.bgp.parser.BGPDocumentedException
              at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handleHoldTimer(BGPSessionImpl.java:365)
              at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)[111:io.netty.common:4.0.44.Final]
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)[116:io.netty.transport-native-epoll:4.0.44.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
              at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
              at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
      2017-11-10 19:45:57,612 | INFO  | ntLoopGroup-11-3 | BGPPeer                          | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 terminated: BGPTerminationReason{error=HOLD_TIMER_EXPIRED}
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPReconnectPromise              | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Reconnecting after connection to /10.64.192.166:179 was dropped
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Callback for session creation with peer IpAddress [_ipv4Address=Ipv4Address [_value=10.64.192.166]] received
      2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise        | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Connection to /10.64.192.166:179 already exists
      

      However, BGP session was unable to be re-established, as session was stale, and upon reconnect, BGP session thinks the connection already existed.

      Expected fix:
      1. Fix the OptismisticLockException
      2. Fix the PingPong transaction failure.
      3. (Bottom line) BGP session should recover after netty failure.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            kevinxw Kevin Wang
            kevinxw Kevin Wang
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: