[BGPCEP-174] Keepalive not sent when updates are being processed Created: 04/Dec/14  Updated: 03/Mar/19  Resolved: 22/Jan/15

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Jozef Behran Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2475

 Description   

To hit this bug, configure a BGP speaker to have HoldTimer set to 3 seconds and then point it to ODL. Then let the BGP speaker wait about 0.5 seconds and then send 1750 updates. If you have topology updates turned OFF in the ODL instance, you can ask the BGP speaker to send up to 5000 updates.

The speaker does not see any KeepAlive messages from ODL while it is sending the updates, so its HoldTimer expires and it closes the connection.
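
A minimal sketch of the messages such a test speaker would produce, assuming the RFC 4271 wire format with 2-byte AS numbers and no optional parameters. The function names, the AS number 65000, the addresses and the 10.x.y.0/24 prefix scheme are all hypothetical. The original test tool is not shown in this report, and ODL may require capabilities in the OPEN message that this sketch leaves out. The session handshake and the socket code are omitted too.

```python
import socket
import struct

MARKER = b"\xff" * 16  # RFC 4271 header marker: 16 bytes of all ones


def bgp_message(msg_type: int, payload: bytes = b"") -> bytes:
    """Prepend the 19-byte BGP header (marker, total length, type)."""
    return MARKER + struct.pack("!HB", 19 + len(payload), msg_type) + payload


def open_message(my_as: int, hold_time: int, bgp_id: str) -> bytes:
    """OPEN (type 1): version 4, AS, Hold Time, BGP identifier, no options."""
    payload = struct.pack("!BHH", 4, my_as, hold_time)
    payload += socket.inet_aton(bgp_id) + b"\x00"
    return bgp_message(1, payload)


def keepalive_message() -> bytes:
    """KEEPALIVE (type 4) is just the header."""
    return bgp_message(4)


def update_message(prefix: str, prefix_len: int, next_hop: str, my_as: int) -> bytes:
    """UPDATE (type 2) announcing one IPv4 prefix with the mandatory attributes."""
    origin = struct.pack("!BBBB", 0x40, 1, 1, 0)               # ORIGIN = IGP
    as_path = struct.pack("!BBBBBH", 0x40, 2, 4, 2, 1, my_as)  # one AS_SEQUENCE entry
    nhop = struct.pack("!BBB", 0x40, 3, 4) + socket.inet_aton(next_hop)
    attrs = origin + as_path + nhop
    nlri = bytes([prefix_len]) + socket.inet_aton(prefix)[: (prefix_len + 7) // 8]
    # Withdrawn-routes length (0), path-attribute length, attributes, NLRI
    payload = struct.pack("!HH", 0, len(attrs)) + attrs + nlri
    return bgp_message(2, payload)


def update_burst(count: int, next_hop: str, my_as: int) -> list:
    """Build `count` UPDATE messages, each announcing a distinct /24."""
    return [
        update_message(f"10.{(i >> 8) & 0xFF}.{i & 0xFF}.0", 24, next_hop, my_as)
        for i in range(count)
    ]


# Hypothetical values matching the reproduction steps: Hold Time 3, 1750 updates
open_msg = open_message(65000, 3, "192.0.2.1")
burst = update_burst(1750, "192.0.2.1", 65000)
```

A test speaker would send `open_msg`, complete the OPEN/KEEPALIVE exchange, wait about 0.5 seconds, and then write the messages in `burst` to the socket back to back.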

The problem can happen with other HoldTimer values as well. If the speaker starts sending a lot of updates at the time when the HoldTimer is nearly expired, then ODL is going to miss the HoldTimer deadline.
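
The timing argument can be sketched as simple arithmetic. RFC 4271 suggests sending KEEPALIVE messages at one third of the Hold Time, so a 3-second Hold Time leaves ODL roughly 2 seconds of slack in the worst case. The function names below are hypothetical, and the model assumes ODL sends nothing at all during the stall and ignores network latency.

```python
def keepalive_interval(hold_time: float) -> float:
    """Suggested KEEPALIVE interval: one third of the Hold Time (RFC 4271)."""
    return hold_time / 3.0


def max_tolerable_stall(hold_time: float, since_last_keepalive: float) -> float:
    """Longest time the sender may stay silent before the peer's timer expires."""
    return hold_time - since_last_keepalive


def hold_timer_expires(hold_time: float, since_last_keepalive: float, stall: float) -> bool:
    """True if a stall starting now keeps the sender silent past the Hold Time."""
    return since_last_keepalive + stall > hold_time


# Worst case for a 3-second Hold Time: the stall begins just as the next
# KEEPALIVE was due, i.e. one full interval after the previous one.
interval = keepalive_interval(3.0)           # 1.0 second
slack = max_tolerable_stall(3.0, interval)   # 2.0 seconds
```

Under these assumptions, any stall on the ODL side longer than about 2 seconds can drop a session with a 3-second Hold Time, regardless of what causes the stall.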

This may be related to YANGTOOLS-383 but I am not sure.



 Comments   
Comment by Jozef Behran [ 04/Dec/14 ]

Update: To hit this bug now (while YANGTOOLS-383 is still present), you need to switch topology off and then use a BGP speaker to push 2500 to 5000 updates all at once. With topology updating enabled, YANGTOOLS-383 will hit before you get to see this bug.

Comment by Jozef Behran [ 05/Dec/14 ]

Update: After more investigation, the most likely cause turned out to be a big garbage collection starting while the BGP speaker is establishing the connection with a 3-second Hold Timer. In that case ODL cannot send a KeepAlive message in time.

When I added a step to the test that waits one minute after the BGP socket appears in ODL, this problem did not show up at all (but I hit YANGTOOLS-383 in that case).

Additionally, after I opened the profiler snapshot (together with Dana), we realized that clustering was still running during the test. It was not used at all, but this means we cannot yet rule out clustering as the cause of the problem.

I also (together with Dana) started to suspect that avoiding explicit requests to the GC for big garbage collections and using -XX:+UseG1GC will improve matters a lot. I can run the test with -XX:+UseG1GC, but the "avoiding asking the GC for big garbage collection" part needs to be done by somebody who can find the place where the offending GC call is made.

Comment by Robert Varga [ 09/Dec/14 ]

https://git.opendaylight.org/gerrit/#/c/13479/

Comment by Robert Varga [ 11/Dec/14 ]

Still seems to be reproducible with current master. Analysis is pending.

Comment by Robert Varga [ 22/Jan/15 ]

master: https://git.opendaylight.org/gerrit/14380
helium: https://git.opendaylight.org/gerrit/14381

Generated at Wed Feb 07 19:12:15 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.