[BGPCEP-535] Holdtimer expired when ODL BGP is advertising many prefixes Created: 30/Aug/16  Updated: 03/Mar/19  Resolved: 05/Sep/16

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Milos Fabian Assignee: Milos Fabian
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 6585

 Description   

Steps to reproduce:
1. Run the ODL distribution with a configured peer and an application peer.
2. Inject 100k routes into the app-rib.
3. Connect a remote peer (exabgp).
4. Unexpected behavior observed: the holdtimer expired on the remote peer, and not all Update messages were sent by ODL BGP.

Symptoms are very similar to https://bugs.opendaylight.org/show_bug.cgi?id=4689



 Comments   
Comment by Milos Fabian [ 31/Aug/16 ]

After a while (once the output buffer reaches its upper bound), BGP's ChannelOutputLimiter handler gets stuck waiting for the channel to become writable. The writability change never happens, even though a socket flush is invoked, so the peer session dies on holdtimer expiration.

2016-08-31 15:58:05,481 | DEBUG | entLoopGroup-7-6 | ChannelOutputLimiter | 288 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Writes on session BGPSessionImpl{channel=[id: 0xeec1d421, L:/127.0.0.1:1790 - R:/127.0.0.2:56984], state=UP} blocked
2016-08-31 15:58:05,481 | TRACE | n-dispatcher-140 | ChannelOutputLimiter | 288 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Blocked slow path tripped on session BGPSessionImpl{channel=[id: 0xeec1d421, L:/127.0.0.1:1790 - R:/127.0.0.2:56984], state=UP}
2016-08-31 15:58:05,531 | DEBUG | n-dispatcher-140 | ChannelOutputLimiter | 288 - org.opendaylight.bgpcep.bgp-rib-impl - 0.7.0.SNAPSHOT | Waiting for session BGPSessionImpl{channel=[id: 0xeec1d421, L:/127.0.0.1:1790 - R:/127.0.0.2:56984], state=UP} to become writable

Comment by Milos Fabian [ 31/Aug/16 ]

master: https://git.opendaylight.org/gerrit/#/c/44946/

Comment by Milos Fabian [ 01/Sep/16 ]

This problem happens when the Loc-RIB is pre-filled with many routes (100k+) and a remote peer then connects.
The prefix dump results in the AdjRibInListener receiving a huge DTCL notification, and the Update messages built from it quickly fill the output buffer.
As a result, the channel becomes unwritable, so writing of Update messages is blocked until the channel becomes writable again. However, the writability change event is not received, even though the flush is called immediately.
It looks like the flush has no effect in this case. However, while debugging the application, the writability did change after the flush invocation. This might indicate a multithreading/thread-locking problem.

As the proposed solution is not clearly efficient, more investigation is needed to hunt down the true root cause of this problem.
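
To illustrate the blocking pattern described in this comment, here is a minimal sketch of a high/low water-mark limiter written with plain JDK classes only. It is an assumption-laden illustration, not the bgpcep ChannelOutputLimiter code: the class WaterMarkLimiter, its thresholds, and the methods write, drained, isBlocked, and pendingBytes are all hypothetical names. It shows that a writer waiting for a "writable" signal depends entirely on that signal being delivered, and that a bounded wait which re-checks the state returns control to the caller instead of hanging until the holdtimer expires.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for an output limiter; NOT the bgpcep implementation. */
final class WaterMarkLimiter {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition writable = lock.newCondition();
    private final long lowWaterMark;
    private final long highWaterMark;
    private long pendingBytes;
    private boolean blocked;

    WaterMarkLimiter(long lowWaterMark, long highWaterMark) {
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /**
     * Queues a message of the given size. While the limiter is blocked, the
     * caller waits for at most timeoutMillis; returns false if the wait timed
     * out or was interrupted, so a missed wake-up cannot hang the writer.
     */
    boolean write(int size, long timeoutMillis) {
        long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (blocked) {
                if (remaining <= 0) {
                    return false;
                }
                remaining = writable.awaitNanos(remaining);
            }
            pendingBytes += size;
            if (pendingBytes > highWaterMark) {
                blocked = true; // buffer is over the upper bound
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    /** Simulates the flush side: bytes left the buffer, so writers may resume. */
    void drained(long size) {
        lock.lock();
        try {
            pendingBytes = Math.max(0, pendingBytes - size);
            if (blocked && pendingBytes < lowWaterMark) {
                blocked = false;
                writable.signalAll(); // the "writability changed" event
            }
        } finally {
            lock.unlock();
        }
    }

    boolean isBlocked() {
        lock.lock();
        try { return blocked; } finally { lock.unlock(); }
    }

    long pendingBytes() {
        lock.lock();
        try { return pendingBytes; } finally { lock.unlock(); }
    }

    public static void main(String[] args) {
        WaterMarkLimiter limiter = new WaterMarkLimiter(32, 64);
        System.out.println("first write accepted: " + limiter.write(100, 50));
        System.out.println("blocked: " + limiter.isBlocked());
        // Nothing drains the buffer here, so this write times out instead of hanging.
        System.out.println("second write accepted: " + limiter.write(10, 50));
        limiter.drained(100);
        System.out.println("third write accepted: " + limiter.write(10, 50));
    }
}
```

In this sketch the wake-up comes from drained(), which is called by a different thread than the blocked writer. If the thread responsible for draining were itself the one waiting in write(), the signal could never arrive; whether something similar happens in the real handler is only a guess consistent with the thread-locking suspicion above.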

Comment by Milos Fabian [ 05/Sep/16 ]

stable/boron: https://git.opendaylight.org/gerrit/#/c/45170/

Generated at Wed Feb 07 19:13:22 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.