[BGPCEP-174] Keepalive not sent when updates are being processed Created: 04/Dec/14 Updated: 03/Mar/19 Resolved: 22/Jan/15 |
|
| Status: | Resolved |
| Project: | bgpcep |
| Component/s: | BGP |
| Affects Version/s: | Bugzilla Migration |
| Fix Version/s: | Bugzilla Migration |
| Type: | Bug | ||
| Reporter: | Jozef Behran | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 2475 |
| Description |
|
To hit this bug, configure a BGP speaker with a HoldTimer of 3 seconds and point it at ODL. Then let the BGP speaker wait about 0.5 seconds and send 1750 updates. If topology updates are turned OFF in the ODL instance, you can have the BGP speaker send up to 5000 updates. While it is sending the updates, the speaker sees no KeepAlive messages from ODL, so it closes the connection. The problem can occur with other HoldTimer values as well: if the speaker starts sending a large number of updates when the HoldTimer is nearly expired, ODL misses the HoldTimer deadline. This may be related to |
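A rough way to see why a large update burst can starve KeepAlives: RFC 4271 suggests sending KEEPALIVE messages at one third of the hold time, so with a 3 second HoldTimer there are only about 2 seconds of slack once a KeepAlive is due. The sketch below is a toy model, not ODL code. It assumes the updates are processed on the same thread that sends KeepAlives and that every update costs a fixed processing time `per_update_s`; both the function names and the per-update costs are hypothetical.

```python
"""Toy model of KeepAlive starvation during an update burst (not ODL code)."""


def keepalive_interval(hold_time_s: float) -> float:
    """RFC 4271 suggests sending KEEPALIVE every HoldTime / 3."""
    return hold_time_s / 3.0


def max_batch_seconds(hold_time_s: float) -> float:
    """Longest burst the peer tolerates if the burst starts exactly when a
    KeepAlive is due (the previous one left one interval earlier)."""
    return hold_time_s - keepalive_interval(hold_time_s)


def worst_case_silence(n_updates: int, per_update_s: float, hold_time_s: float) -> float:
    """Gap between KeepAlives when the whole burst is processed on the
    KeepAlive-sending thread before the due KeepAlive goes out.

    per_update_s is a hypothetical, fixed processing cost per UPDATE.
    """
    return keepalive_interval(hold_time_s) + n_updates * per_update_s


def peer_closes_session(silence_s: float, hold_time_s: float) -> bool:
    """The peer's hold timer expires once it hears nothing for HoldTime."""
    return silence_s >= hold_time_s


if __name__ == "__main__":
    hold = 3.0
    for cost in (0.001, 0.002):  # made-up per-update costs, not measurements
        gap = worst_case_silence(1750, cost, hold)
        print(f"{cost * 1000:.0f} ms/update: silence {gap:.2f} s, "
              f"session dropped: {peer_closes_session(gap, hold)}")
```

With a made-up cost of 2 ms per update, the 1750-update burst from the reproduction steps already exceeds the budget (1 s + 3.5 s = 4.5 s of silence against a 3 s HoldTimer), while at 1 ms per update it just fits (2.75 s). In this model, sending KeepAlives from a timer that does not wait behind update processing removes the `n_updates * per_update_s` term from the gap.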
| Comments |
| Comment by Jozef Behran [ 04/Dec/14 ] |
|
Update: To hit this bug now (when |
| Comment by Jozef Behran [ 05/Dec/14 ] |
|
Update: After more investigation, the most likely cause turned out to be a big garbage collection starting while the BGP speaker is establishing the connection with a 3 second Hold Timer; ODL then fails to send a KeepAlive message in time. When I added a one-minute wait to the test, starting after the BGP socket appeared in ODL, this problem did not show up at all (but I hit …).

Additionally, after opening the profiler snapshot (together with Dana), we realized that clustering was still running during the test. It was not used at all, but this means the possibility that clustering causes the problem remains open. Dana and I also started to suspect that avoiding requests to the GC for big garbage collections and using -XX:+UseG1GC would improve matters a lot. I can run the test with -XX:+UseG1GC, but the "avoid asking the GC for big garbage collections" part needs to be done by somebody who can find where the offending GC call is made. |
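The GC theory can be illustrated with a toy model. A stop-the-world collection pauses every application thread, including whichever thread sends KeepAlives, so a long enough pause looks to the peer exactly like a silent speaker. The sketch below assumes fixed-rate KeepAlives at HoldTime / 3, that a KeepAlive due during a pause goes out as soon as the pause ends, and that the session comes up at t = 0; the function names and pause figures are hypothetical, not measurements from this test.

```python
"""Toy model: fixed-rate KeepAlives delayed by stop-the-world GC pauses."""
from typing import List, Tuple

Pause = Tuple[float, float]  # (start_s, duration_s), relative to session start


def sent_times(interval_s: float, horizon_s: float, pauses: List[Pause]) -> List[float]:
    """When each scheduled KeepAlive actually leaves the box.

    A KeepAlive due during a pause is assumed to go out when the pause ends.
    Pauses must be sorted by start time and non-overlapping.
    """
    times = []
    k = 1
    while k * interval_s <= horizon_s:
        actual = k * interval_s
        for start, duration in pauses:
            if start <= actual < start + duration:
                actual = start + duration  # delayed until the pause is over
        times.append(actual)
        k += 1
    return times


def longest_silence(times: List[float]) -> float:
    """Largest gap between KeepAlives, counting from session start (t = 0)."""
    previous, worst = 0.0, 0.0
    for t in times:
        worst = max(worst, t - previous)
        previous = t
    return worst


if __name__ == "__main__":
    hold = 3.0
    interval = hold / 3.0
    for pauses in ([], [(0.5, 3.0)], [(2.2, 1.5)]):
        gap = longest_silence(sent_times(interval, 5.0, pauses))
        print(pauses, f"longest silence {gap:.2f} s, dropped: {gap >= hold}")
```

With a 3 second HoldTimer, a single 3 second pause starting at 0.5 s leaves 3.5 s without a KeepAlive, while a 1.5 second pause at 2.2 s only stretches one gap to 1.7 s. In this model, a pause only a little longer than 2 seconds is enough if it begins just before a KeepAlive is due. Because all threads stop, moving KeepAlives to a separate thread does not help against such pauses; only shorter pauses (or a longer HoldTimer) do.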
| Comment by Robert Varga [ 09/Dec/14 ] |
| Comment by Robert Varga [ 09/Dec/14 ] |
| Comment by Robert Varga [ 11/Dec/14 ] |
|
Still seems to be reproducible with current master. Analysis is pending. |
| Comment by Robert Varga [ 22/Jan/15 ] |
|
master: https://git.opendaylight.org/gerrit/14380 |