[BGPCEP-724] BGP PingPong transaction racing and BGP session not recovered after that Created: 21/Nov/17 Updated: 18/Apr/18 Resolved: 21/Mar/18 |
|
| Status: | Verified |
| Project: | bgpcep |
| Component/s: | BGP |
| Affects Version/s: | Nitrogen, Carbon, Oxygen |
| Fix Version/s: | Nitrogen, Carbon, Oxygen |
| Type: | Bug | Priority: | High |
| Reporter: | Kevin Wang | Assignee: | Kevin Wang |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
Please check the karaf log attached. The problem seems to be started from a customer router restart, which causes a BGP session reset: 2017-11-10 19:40:55,184 | WARN | ntLoopGroup-11-2 | BGPSessionImpl | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP session encountered error
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)
2017-11-10 19:40:55,476 | WARN | ntLoopGroup-11-2 | BGPSessionImpl | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to send message Notify [_errorCode=6, _errorSubcode=0, augmentation=[]] to socket [id: 0x7fbd98a3, L:/10.64.236.36:50186 ! R:10.64.192.166/10.64.192.166:179]
io.netty.channel.unix.Errors$NativeIoException: syscall:write(...)() failed: Broken pipe
at io.netty.channel.unix.FileDescriptor.writeAddress(...)(Unknown Source)
Then a OptimisticLockException was captured: 2017-11-10 19:40:56,145 | ERROR | CommitFutures-9 | AdjRibInWriter | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Failed to create Empty Structure
OptimisticLockFailedException{message=Optimistic lock failed., errorList=[RpcError [message=Optimistic lock failed., severity=ERROR, errorType=APPLICATION, tag=resource-denied, applicationTag=null, info=null, cause=org.opendaylight.yangtools.yang.data.api.schema.tree.ConflictingModificationAppliedException: Node children was modified by other transaction]]}
at org.opendaylight.controller.cluster.datastore.ShardDataTree.processNextTransaction(ShardDataTree.java:532)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.ShardDataTree.startCanCommit(ShardDataTree.java:560)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.SimpleShardDataTreeCohort.canCommit(SimpleShardDataTreeCohort.java:81)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.CohortEntry.canCommit(CohortEntry.java:98)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleCanCommit(ShardCommitCoordinator.java:237)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.handleReadyLocalTransaction(ShardCommitCoordinator.java:201)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.Shard.handleReadyLocalTransaction(Shard.java:437)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:243)[168:org.opendaylight.controller.sal-distributed-datastore:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:291)[163:org.opendaylight.controller.sal-akka-raft:1.4.3.Boron-SR3]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:29)[162:org.opendaylight.controller.sal-clustering-commons:1.4.3.Boron-SR3]
After some time, BGP session re-established, but a netty error occured: 2017-11-10 19:42:57,608 | INFO | ntLoopGroup-11-3 | BGPPeer | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 went up with tables [BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv6AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily], BgpTableTypeImpl [getAfi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.types.rev130919.Ipv4AddressFamily, getSafi()=class org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.bgp.flowspec.rev150807.FlowspecSubsequentAddressFamily]] and Add Path tables []
2017-11-10 19:42:57,675 | WARN | ntLoopGroup-11-3 | DefaultChannelPipeline | 111 - io.netty.common - 4.0.44.Final | An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
io.netty.channel.ChannelPipelineException: org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handlerAdded() has thrown an exception; removed.
at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:601)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:545)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.DefaultChannelPipeline.replace(DefaultChannelPipeline.java:486)[112:io.netty.transport:4.0.44.Final]
at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.negotiationSuccessful(AbstractBGPSessionNegotiator.java:253)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.handleMessage(AbstractBGPSessionNegotiator.java:166)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
at org.opendaylight.protocol.bgp.rib.impl.AbstractBGPSessionNegotiator.channelRead(AbstractBGPSessionNegotiator.java:292)[361:org.opendaylight.bgpcep.bgp-rib-impl:0.6.3.Boron-SR3]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)[113:io.netty.codec:4.0.44.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)[113:io.netty.codec:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)[112:io.netty.transport:4.0.44.Final]
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:934)[116:io.netty.transport-native-epoll:4.0.44.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397)[116:io.netty.transport-native-epoll:4.0.44.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302)[116:io.netty.transport-native-epoll:4.0.44.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
Caused by: java.lang.IllegalStateException: New transaction PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@69fc1ef} raced with transacion PingPongTransaction{delegate=org.opendaylight.controller.cluster.databroker.DOMBrokerReadWriteTransaction@423d24f4}
at org.opendaylight.controller.md.sal.dom.broker.impl.PingPongTransactionChain.slowAllocateTransaction(PingPongTransactionChain.java:176)[136:org.opendaylight.controller.sal-broker-impl:1.4.3.Boron-SR3]
However, BGP module still thinks the session is established successfully. 2017-11-10 19:42:57,676 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Promise BGPProtocolSessionPromise@670fb461(incomplete) completed 2017-11-10 19:42:57,676 | INFO | ntLoopGroup-11-3 | AbstractBGPSessionNegotiator | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | BGP Session with peer [id: 0xd1cc8f4c, L:/10.64.236.36:52530 - R:10.64.192.166/10.64.192.166:179] established successfully. But the session went stale since that moment. Consequently, after 3 min, the BGP session expired due to HOLD_TIME_EXPIRE 2017-11-10 19:45:57,610 | ERROR | ntLoopGroup-11-3 | BGPDocumentedException | 347 - org.opendaylight.bgpcep.bgp-parser-api - 0.6.3.Boron-SR3 | Error = HOLD_TIMER_EXPIRED
org.opendaylight.protocol.bgp.parser.BGPDocumentedException
at org.opendaylight.protocol.bgp.rib.impl.BGPSessionImpl.handleHoldTimer(BGPSessionImpl.java:365)
at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)[111:io.netty.common:4.0.44.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)[111:io.netty.common:4.0.44.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)[111:io.netty.common:4.0.44.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:307)[116:io.netty.transport-native-epoll:4.0.44.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)[111:io.netty.common:4.0.44.Final]
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)[111:io.netty.common:4.0.44.Final]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_101]
2017-11-10 19:45:57,612 | INFO | ntLoopGroup-11-3 | BGPPeer | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Session with peer 10.64.192.166 terminated: BGPTerminationReason{error=HOLD_TIMER_EXPIRED}
2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPReconnectPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Reconnecting after connection to /10.64.192.166:179 was dropped
2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Callback for session creation with peer IpAddress [_ipv4Address=Ipv4Address [_value=10.64.192.166]] received
2017-11-10 19:45:57,613 | DEBUG | ntLoopGroup-11-3 | BGPProtocolSessionPromise | 361 - org.opendaylight.bgpcep.bgp-rib-impl - 0.6.3.Boron-SR3 | Connection to /10.64.192.166:179 already exists
However, BGP session was unable to be re-established, as session was stale, and upon reconnect, BGP session thinks the connection already existed. Expected fix: |
| Comments |
| Comment by Claudio David Gasparini [ 21/Nov/17 ] |
|
Could you replicate it with supported versions? Boron is no longer supported. Regards, |
| Comment by Kevin Wang [ 11/Dec/17 ] |
|
cdgasparini I cannot reproduce the error easily with master or Boron. Right now I am just going to make BGP session recoverable after such exception happens. We were already doing that in PCEP but not in BGP. |
| Comment by Claudio David Gasparini [ 12/Dec/17 ] |
|
Ok, I ll review asap. Please also always attach a link to sandbox running the fix(or use test-bgpcep-bgp) . So we can ensure that there is no regression. Regards, |
| Comment by Kevin Wang [ 23/Jan/18 ] |
|
Hi, cdgasparini it seems stable/carbon is broken? https://git.opendaylight.org/gerrit/#/c/67449/ Do you know the status?
|
| Comment by Claudio David Gasparini [ 23/Jan/18 ] |
|
Hi Kevin, I don't know but I but need to check it. The branch still locked anyway and I'm not sure why. Regards, |
| Comment by Claudio David Gasparini [ 21/Mar/18 ] |
|
https://git.opendaylight.org/gerrit/#/q/topic:BGPCEP-724+(status:open+OR+status:merged) |