[BGPCEP-872] Controller does not send KEEPALIVE while learning prefixes Created: 06/May/19  Updated: 03/Aug/20

Status: Confirmed
Project: bgpcep
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Luis Gomez Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File bgp-tool-600k.txt     Text File bgp-tool-nok.txt     Text File bgp-tool-ok.txt    

 Description   

The regression is detected here:

https://jenkins.opendaylight.org/releng/job/bgpcep-csit-1node-bgp-ingest-mixed-all-neon/

BGP scale test uses play.py script to setup BGP session and inject 500K prefixes to controller:

python play.py --amount 500000 --myip=10.30.171.84 --myport=17900 --peerip=10.30.170.47 --peerport=1790 --insert=10 --withdraw=9 --prefill 10 --update single --info --results bgp.csv &> play.py.out

After ~3 minutes (hold timer) the BGP script throws this ERROR:

2019-05-06 10:08:44,090 INFO BGP-Dummy-1: Iteration: 296000 - total remaining prefixes: 203991
2019-05-06 10:08:44,726 ERROR BGP-Dummy-1: Peer has overstepped the hold timer.
Unhandled exception in thread started by <function job at 0x7fefba965938>
Traceback (most recent call last):
  File "play.py", line 2066, in job
    state.perform_one_loop_iteration()
  File "play.py", line 1958, in perform_one_loop_iteration
    self.timer.check_peer_hold_time(self.timer.snapshot_time)
  File "play.py", line 1429, in check_peer_hold_time
    raise RuntimeError("Peer has overstepped the hold timer.")
RuntimeError: Peer has overstepped the hold timer.
Traceback (most recent call last):
  File "play.py", line 2168, in <module>
    threaded_job(arguments)
  File "play.py", line 2162, in threaded_job
    rpcserver.serve_forever()
  File "/usr/lib/python2.7/SocketServer.py", line 231, in serve_forever
    poll_interval)
  File "/usr/lib/python2.7/SocketServer.py", line 150, in _eintr_retry
    return func(*args)
KeyboardInterrupt

And the session is disconnected:

2019-05-06T10:08:46,632 | INFO  | epollEventLoopGroup-10-1 | BGPSessionImpl                   | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | End of input detected. Close the session.
2019-05-06T10:08:46,633 | INFO  | epollEventLoopGroup-10-1 | BGPPeer                          | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | Session with peer 10.30.171.99 went down
2019-05-06T10:08:46,633 | INFO  | epollEventLoopGroup-10-1 | BGPPeer                          | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | Closing session with peer
2019-05-06T10:08:46,650 | INFO  | epollEventLoopGroup-10-1 | AbstractPeer                     | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | Closed per Peer /(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)bgp-rib/rib/rib[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)id=example-bgp-rib}]/peer/peer[{(urn:opendaylight:params:xml:ns:yang:bgp-rib?revision=2018-03-29)peer-id=bgp://10.30.171.99}] removed
2019-05-06T10:08:46,653 | INFO  | epollEventLoopGroup-10-1 | AbstractPeer                     | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | Closing peer chain Uri{_value=bgp://10.30.171.99}
2019-05-06T10:08:46,659 | INFO  | epollEventLoopGroup-10-1 | BGPSessionImpl                   | 242 - org.opendaylight.bgpcep.bgp-rib-impl - 0.11.1.SNAPSHOT | Closing session: BGPSessionImpl{channel=[id: 0x79806baf, L:/10.30.171.133:1790 ! R:/10.30.171.99:17900], state=UP}

According to the play.py logs controller does not send any KEEPALIVE message while it is learning the prefixes, this is main reason the test fails. See attached test tool logs, when it works controller sends 1 KEEPALIVE every ~1 min.



 Comments   
Comment by Luis Gomez [ 06/May/19 ]

The issue is visible in all branches but in Neon started to happen very recently (May 1st).

Comment by Luis Gomez [ 08/May/19 ]

FYI, I setup job in sandbox to run scale test for any candidate patch:

https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-1node-bgp-ingest-mixed-all-neon/10

Comment by Luis Gomez [ 08/May/19 ]

I just uploaded BGP log from another test (bgp-tool-600k.txt), and there we can see the controller sends the KEEPALIVE messages after the prefix learning is done. This proves the controller only misses the KEEPALIVE sending while it is learning prefixes at high rate, otherwise it is fine.

Comment by Luis Gomez [ 22/May/19 ]

From my observation the issue of missing KEEPALIVE sending happens very consistent whenever the controller is busy for long time (few minutes). This behavior has been probably there all this time. Before the regression, it is very possible the controller got less busy or took less time for the same test case and therefore it passed. So this bug is exposed by a perf regression and the workaround for the missing KEEPALIVE is to increase the BGP holdtime on the BGP system peering with the ODL.

Comment by Robert Varga [ 29/Jul/19 ]

I think this has been addressed

Comment by Luis Gomez [ 29/Jul/19 ]

FYI I recently adjusted CSIT thresholds so we can effectively check for further regressions:

https://git.opendaylight.org/gerrit/c/releng/builder/+/83178

I actually forgot to update this ticket with this information, so maybe this is why you see the regression is gone?

Comment by Luis Gomez [ 31/Jul/19 ]

Right, by simply running the test with original 500K vs current 250K we can see the regression is still there (use TESTOPTIONS -v COUNT:500000):

https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-gate-bgp-ingest-mixed-all-sodium/5/

Comment by Robert Varga [ 31/Jul/19 ]

Okay, this is kind of weird, as I definitely saw a keepalive being sent in one of the gate job runs. Let's see what https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-gate-bgp-ingest-mixed-all-sodium/6/ does (with BGPSessionImpl debugs)

Comment by Robert Varga [ 01/Aug/19 ]

Okay, this looks like a netty read starvation ... I will need to investigate it a bit more.

Generated at Wed Feb 07 19:14:22 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.