[BGPCEP-493] Topology freezes or slows down due to java.util.concurrent.TimeoutException Created: 19/Jul/16  Updated: 03/Mar/19  Resolved: 10/Aug/16

Status: Resolved
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: Bugzilla Migration

Type: Bug
Reporter: Vratko Polak Assignee: Claudio David Gasparini
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive karaf1.log.zip     File logs_20160719.tar.xz    
Issue Links:
Blocks
is blocked by BGPCEP-486 PingPongTransaction race when handlin... Resolved
External issue ID: 6237

 Description   

Possibly a Bug in distributed datastore or other offset-0 component.

This is a Boron regression in Singlepeer Changecount suite. Symptoms are slightly different in "mixed" job [0] (topology practically freezes) and in "non-mixed" job [1] (topology only slows down).

The TimeoutException occures in karaf.log, followed by several other ERRORs. This does not happen in Prefix Count suite (or anywhere in Beryllium tests).

Segments of logs will be attached, as the whole logs are very big and even the segments take too many lines to paste here. Both segments end with two occurences of "raced with transacion PingPongTransaction" error, in the real logs there were many more of such occurrences.

[0] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-periodic-bgp-ingest-mixed-only-boron/
[1] https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-1node-periodic-bgp-ingest-only-boron/



 Comments   
Comment by Vratko Polak [ 19/Jul/16 ]

Attachment logs_20160719.tar.xz has been added with description: Archive with the promised karaf.log segments

Comment by Milos Fabian [ 29/Jul/16 ]

Routes injection (applciation peer) use-case:
1. Testing latest snapshots - Boron (2016/07/28)
2. using ODL Java BGP benchmak application - 500k routes (plain IPv4)

Error reproduced with Boron:
java.util.concurrent.TimeoutException: Futures timed out after [5 seconds]
The testing app RPC hangs for a long time (implies, that routes are not written to app-rib tree).

With unplugged BGP topology provider, test sometimes pass (no errors/RPC finish).

The same test runs with Beryllium (latest snapshot) without any problems.

More investigation needed.

Comment by Milos Fabian [ 01/Aug/16 ]

Memory footprint analysis and comparison (Beryllium vs Boron) shows Boron's memory consumuption significant rise which results in observed timeout erros, casused by insane GC activity (application inactivity).
The regression was introduced by add-path related changes - namely OffsetMap - where a new instance is created by each route entry (instead of using shared one).

Comment by Peter Gubka [ 04/Aug/16 ]

Attachment karaf1.log.zip has been added with description: NPE karaf log

Comment by Peter Gubka [ 04/Aug/16 ]

I forgot to add larger comment.

The attached log is a part of the whole log, it has NPE as the first exaception. The lof is taken from https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-1node-periodic-bgp-ingest-mixed-only-boron/5/archives/. Unfortunately, the full kig has 11M zipped (1.4G text) and will be deleted during the weekend.

What happened in that suite: tool started to advertised routes and after around 500k routes tho tool side output looked like:
2016-08-03 18:22:54,502 INFO BGP-Dummy-1: Iteration: 432000 - total remaining prefixes: 567991
2016-08-03 18:22:55,651 INFO BGP-Dummy-1: Iteration: 433000 - total remaining prefixes: 566991
2016-08-03 18:22:56,045 INFO BGP-Dummy-1: Iteration: 434000 - total remaining prefixes: 565991
2016-08-03 18:22:56,305 INFO BGP-Dummy-1: Iteration: 435000 - total remaining prefixes: 564991
2016-08-03 18:22:56,936 INFO BGP-Dummy-1: Iteration: 436000 - total remaining prefixes: 563991
2016-08-03 18:22:58,002 WARNING BGP-Dummy-1: Input and output both blocked for 1.0 seconds.
2016-08-03 18:22:59,414 WARNING BGP-Dummy-1: Input and output both blocked for 1.0 seconds.
2016-08-03 18:23:00,662 WARNING BGP-Dummy-1: Input and output both blocked for 1.0 seconds.
2016-08-03 18:23:01,730 INFO BGP-Dummy-1: Iteration: 437000 - total remaining prefixes: 562991
2016-08-03 18:23:01,951 INFO BGP-Dummy-1: Iteration: 438000 - total remaining prefixes: 561991
2016-08-03 18:23:01,994 ERROR BGP-Dummy-1: Peer has overstepped the hold timer.
Unhandled exception in thread started by <function job at 0x7fb17575c2a8>
Traceback (most recent call last):
File "play.py", line 1780, in job
state.perform_one_loop_iteration()
File "play.py", line 1676, in perform_one_loop_iteration
self.timer.check_peer_hold_time(self.timer.snapshot_time)
File "play.py", line 1227, in check_peer_hold_time
raise RuntimeError("Peer has overstepped the hold timer.")
RuntimeError: Peer has overstepped the hold timer.
Traceback (most recent call last):
File "play.py", line 1829, in <module>
threaded_job(arguments)
File "play.py", line 1823, in threaded_job
time.sleep(5)

Hopefully attached log will show what problems occurred in odl.

Comment by Claudio David Gasparini [ 05/Aug/16 ]

https://git.opendaylight.org/gerrit/#/c/43110/

Comment by Claudio David Gasparini [ 05/Aug/16 ]

Boron
<output xmlns="urn:opendaylight:params:xml:ns:yang:odl-bgp-app-peer-benchmark">
<result>
<duration>61184</duration>
<rate>8196</rate>
<count>500000</count>
</result>
</output>

Be
<output xmlns="urn:opendaylight:params:xml:ns:yang:odl-bgp-app-peer-benchmark">
<result>
<rate>9615</rate>
<count>500000</count>
<duration>52295</duration>
</result>
</output>

Comment by Milos Fabian [ 10/Aug/16 ]

stable/boron: https://git.opendaylight.org/gerrit/#/c/43590/

Generated at Wed Feb 07 19:13:14 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.