[INFRAUTILS-29] CSIT hit infrautils.metrics.ThreadsWatcher limit of 1000 threads Created: 28/Feb/18  Updated: 02/Mar/18  Resolved: 02/Mar/18

Status: Resolved
Project: infrautils
Component/s: None
Affects Version/s: Oxygen, Fluorine
Fix Version/s: Oxygen, Fluorine

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: patch_merged
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File infrautils.metrics.ThreadsWatcher.txt    

 Description   

vorburger filing this issue, based on https://lists.opendaylight.org/pipermail/infrautils-dev/2018-February/000595.html, setting jluhrsen as Reporter; initially filing in infrautils, but will move to a suitable project once we figure out where all these threads come from.

Attached is a Thread Dump with >1000 threads in ODL produced by the infrautils.metrics.ThreadsWatcher (new; see INFRAUTILS-22, also FYI INFRAUTILS-21) which I extracted out of the 1.1 GB odl1_karaf.log.1 from https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/19/jamo-netvirt-csit-1node-openstack-pike-upstream-stateful-oxygen/4/odl_1/ ...

Why are there so many threads? What created them?



 Comments   
Comment by Michael Vorburger [ 28/Feb/18 ]

Whoa, hang on a sec, what are all these threads in the attached (essentially jstack-equivalent) dump:

    "Karaf ssh console user karaf" id=... state=TIMED_WAITING

let's count them:

grep "Karaf ssh" infrautils.metrics.ThreadsWatcher.txt | wc -l
430

There are 1000 - 430 = 570 other threads we could look into as well, but this jumps out as the biggest immediate problem... That 430 is a lot of threads for 1 Karaf SSH session, no? jluhrsen perhaps there isn't just 1 SSH session to Karaf in CSIT... could that be? Could there be some sort of issue more on the CSIT script side than in ODL / Karaf here? Like, you aren't by any chance sending hundreds of ODL shell commands to Karaf from the CSIT Robot framework, each over a new ssh connection, and for some reason not closing the previous SSH session - that kind of thing? Even if you are not intentionally doing it like that, maybe there's a bug in Robot, or its SSH library?

Comment by Sam Hague [ 28/Feb/18 ]

Every suite and test is logged when it starts: ROBOT Message: Starting suite blah, Starting test foo... So you would see at least 240 for that. Each test also has an mdsal dump, so another 235 or so. That could be the issue. Those commands were supposed to use a single connection to ODL, but maybe something is busted there.

Any ideas how to debug this - meaning how to tell, on the ODL side, whether the sessions are still open? We have journalctl logs on the ODL nodes and I see many ssh connections opened, but most are followed by a close. There are only 30, so I think the karaf ssh sessions must not be hitting there.

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/965/odl_1/journalctl.log.gz

Comment by Michael Vorburger [ 28/Feb/18 ]

> you would see at least 240 for that. Each test also (...), so another 235 or so

that adds up to 400 something, roughly matching, so yeah that's probably it then...

> Those commands were supposed to use a single connection to ODL

what exactly do those "commands" technically do - do they reuse a single SSH session, or (probably) open a new SSH connection for each one?

> Any ideas how to debug this - meaning to tell if on the odl side the sessions are open?

dunno how to easily check for open ports inside the JVM, but you could jstack it and look for those threads?

> journalctl logs on the odl nodes and I see many ssh connections opened, but most are followed by a close

not sure what journalctl shows here - is that really the connections of the CSIT scripts into Karaf? Isn't that showing OS-level ssh logins? I guess we could somehow check that the TCP sockets for the SSH sessions to Karaf are closed? But maybe the problem isn't at that level - I mean it COULD also be a bug not in the CSIT ssh client connections, but in Karaf's built-in ssh server itself... like perhaps it's dumb enough to keep a thread open for each session even after the SSH client disconnects? That would clearly cause a thread leak over time, and be a "Karaf" bug (in quotes because this would more likely be a problem in whatever library provides the SSH server for Karaf, not Karaf itself).
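
A rough way to check that on the ODL host (a sketch only; it assumes the Karaf SSH console is on its default port 8101 and that a JDK with jstack is available) could be:

# roughly count TCP connections to the Karaf SSH console that are still established
ss -tn | grep -c ':8101'

# and compare with the number of console threads in a fresh thread dump
jstack <karaf-pid> | grep -c "Karaf ssh console"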

Comment by Michael Vorburger [ 28/Feb/18 ]

jluhrsen FYI I was looking a bit more into this with shague ... here are some new conclusions:

1. according to shague, when you wrote on https://lists.opendaylight.org/pipermail/infrautils-dev/2018-February/000604.html that "[b] had the same trouble and that was without odl-mdsal-trace", that really meant "without feature:install odl-mdsal-trace" but still WITH https://git.opendaylight.org/gerrit/#/c/68447/6/csit/libraries/OpenStackOperations.robot, right?

2. shague "ran sandbox with c/68447 removed and it looked good for logs". That confirms that what caused these thread leaks to suddenly pop up, with absolutely no code having changed anywhere otherwise, is c/68447.

3. my gut now tells me this is more likely a problem in the CSIT robot scripts than in Karaf's SSH server (my first suspicion, above). Is it at all possible that the Issue_Command_On_Karaf_Console thing somehow leaves hundreds of SSH sessions hanging open? I know nothing about that stuff, but e.g. I see that in OpenStackOperations.robot there is a SSHLibrary.Close All Connections in line 1043 after one FOR Issue_Command_On_Karaf_Console trace:transaction, but in line 560 there isn't - could that be a problem?

Comment by Jamo Luhrsen [ 28/Feb/18 ]

I sent an email with some more details and observations. This has nothing to do with a bug in CSIT or a dependent library. We can get to the same state with a bash script doing "ssh" in a loop (properly opening/closing, etc.).

Anyway, the gist of it is that running karaf shell commands opens some threads which remain open for 60s. If we issue enough of these at a fast enough rate, we'll cross the 1k threshold. (Very roughly: with threads lingering for ~60s, sustaining 1000 of them would need on the order of 1000/60 ≈ 17 new threads per second.)

I'm not sure what state we are in (thread-count-wise) in our CSIT environment, but I suppose we have to assume we are somewhere near that threshold to start with. I don't think CSIT is doing more than a handful of these things per second (if that; more like 1-2 every 10-300 seconds).

Comment by Michael Vorburger [ 01/Mar/18 ]

> get to the same state with a bash script doing "ssh" in a loop (properly opening/closing etc).

jluhrsen if you have it handy and wouldn't mind attaching that here, it could be useful for future repro.

Comment by Stephen Kitt [ 01/Mar/18 ]

BTW Karaf has a built-in "threads" command which outputs the thread dump, or a formatted list of threads with the "--list" option; you can combine that with "| wc -l" to get a quick measure of the number of running threads.
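
For example, something like this should give a rough count (a sketch, assuming the same default Karaf console credentials and port that the repro loop below also uses):

sshpass -p karaf ssh -p 8101 -o StrictHostKeyChecking=no karaf@localhost "threads --list" | wc -l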

Comment by Kit Lou [ 01/Mar/18 ]

This is the patch ready to be merged: https://git.opendaylight.org/gerrit/#/c/68937/

Comment by Jamo Luhrsen [ 01/Mar/18 ]

> get to the same state with a bash script doing "ssh" in a loop (properly opening/closing etc).

> Jamo Luhrsen if you have it handy and wouldn't mind attaching that here, it could be useful for future repro.

for i in {1..5000}; do sshpass -p karaf ssh -p 8101 karaf@localhost -o StrictHostKeyChecking=no feature:list -i ; done

Comment by Jamo Luhrsen [ 02/Mar/18 ]

> BTW Karaf has a built-in "threads" command which outputs the thread dump, or a formatted list of threads with the "--list" option; you can combine that with "| wc -l" to get a quick measure of the number of running threads.

cool, this can be added to CSIT at some point if we think it's valuable. The fix to calm down the logging is merged, so I think we are safe to add something like this. Here's something along those lines.
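
(a sketch of the kind of periodic check meant here, not a merged change; it reuses the console credentials/port from the repro loop above)

# log the Karaf thread count once a minute while a job runs
while true; do
    echo "$(date -u) threads: $(sshpass -p karaf ssh -p 8101 -o StrictHostKeyChecking=no karaf@localhost 'threads --list' | wc -l)"
    sleep 60
done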
