[INFRAUTILS-29] CSIT hit infrautils.metrics.ThreadsWatcher limit of 1000 threads Created: 28/Feb/18 Updated: 02/Mar/18 Resolved: 02/Mar/18 |
|
| Status: | Resolved |
| Project: | infrautils |
| Component/s: | None |
| Affects Version/s: | Oxygen, Fluorine |
| Fix Version/s: | Oxygen, Fluorine |
| Type: | Bug | Priority: | Highest |
| Reporter: | Jamo Luhrsen | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | patch_merged | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
vorburger filing this issue, based on https://lists.opendaylight.org/pipermail/infrautils-dev/2018-February/000595.html, setting jluhrsen as Reporter; initially filing in infrautils, but will move to suitable project once we figure where these many threads come from. Attached is a Thread Dump with >1000 threads in ODL produced by the infrautils.metrics.ThreadsWatcher (new; see Why are there some many threads? What created them? |
| Comments |
| Comment by Michael Vorburger [ 28/Feb/18 ] |
|
Whoa, hang on a sec, what are all these threads in the attached essentially jstack equivalent: "Karaf ssh console user karaf" id=... state=TIMED_WAITING let's count them: grep "Karaf ssh" infrautils.metrics.ThreadsWatcher.txt | wc -l 430 There are 1000 - 430 = 570 other threads we could look more into as well, but this jumps as as the biggest immediate problem... That 430 is a lot of threads for 1 Karaf SSH session, no? |
| Comment by Sam Hague [ 28/Feb/18 ] |
|
Every suite and test is logged when they start. ROBOT Message: Starting suite blah, Starting test foo... So you would see that at last 240 for that. Each test also has a mdsal dump for each test, so another 235 or so. That could be the issue. Those commands were supposed to use a single connection to ODL but maybe something is busted there. Any ideas how to debug this - meaning to tell if on the odl side the sessions are open? We have journalctl logs on the odl nodes and I see many ssh connections opened, but most are followed with a a close. Thee are only 30 so I think the karaf ssh sessions must not be hitting there. |
| Comment by Michael Vorburger [ 28/Feb/18 ] |
|
> you would see that at last 240 for that. Each test also (...), so another 235 or so that adds up to 400 something, roughly matching, so yeah that's probably it then... > Those commands were supposed to use a single connection to ODL what exactly do those "commands" technically do? separate SSH or (probably) new SSH for each? > Any ideas how to debug this - meaning to tell if on the odl side the sessions are open? dunno how to easily check for open ports inside the JVM, but jstack and see those threads? > journalctl logs on the odl nodes and I see many ssh connections opened, but most are followed with a a close not sure what journalctl shows here - is that really the connection of the CSIT scripts into Karaf? Isn't that showing OS system ssh logins? I guess we could somehow check the TCP socket for the SSH sessions to Karaf is closed? But maybe the problem isn't at that level - I mean it COULD also be a bug not in CSIT ssh client connections, but in Karaf's built-in ssh server itself... like perhaps it's dumb enough to keep a loose open thread for each session even after the SSH client disconnects? That would clearly cause a thread leak, over time, and be a "Karaf" bug (in quotes because this more likely would be a problem in whatever library does the SSH server for Karaf, not even Karaf itself). |
| Comment by Michael Vorburger [ 28/Feb/18 ] |
|
jluhrsen FYI I was looking a bit more into this with shague ... here are some new conclusions: 1. according to shague, when you wrote on https://lists.opendaylight.org/pipermail/infrautils-dev/2018-February/000604.html that "[b] had the same trouble and that was without odl-mdsal-trace" that really meant "without feature:instal odl-mdsal-trace" but still WITH https://git.opendaylight.org/gerrit/#/c/68447/6/csit/libraries/OpenStackOperations.robot, right? 2. shague "ran sandbox with c/68447 removed and it looked good for logs". That confirms that what caused these thread leaks to suddenly pop up, with absolutely no code having changed anywhere otherwise, is c/68447. 3. my gut now tells me this is more likely a problem in CSIT robot scripts, than in Karaf's SSH server (first suspicion, above). Is it at all possible that, somehow, that Issue_Command_On_Karaf_Console thing somehow lets hundreds of SSH sessions hanging open? I know nothing about that stuff, but e.g. seeing that in OpenStackOperations.robot there is a SSHLibrary.Close All Connections in line 1043 after one FOR Issue_Command_On_Karaf_Console trace:transaction but in line 560 there isn't - could that be a problem? |
| Comment by Jamo Luhrsen [ 28/Feb/18 ] |
|
I sent an email with some more details and observations. This is not anything to do with a bug in CSIT or dependant library. We can get to the same state with a bash script doing "ssh" in a loop (properly opening/closing etc). Anyway, the gist of it is that running karaf shell commands is opening some threads and they remain open for 60s. If we do enough of these at a fast enough rate, we'll cross this 1k threshold. I'm not sure what state we are in (thread count wise) in our CSIT environment, but I suppose we have to assume we are somewhere near that threshold to start with. I don't think CSIT is doing more than a handful of these things per second (if that. more like 1-2 every 10-300 seconds). |
| Comment by Michael Vorburger [ 01/Mar/18 ] |
|
> get to the same state with a bash script doing "ssh" in a loop (properly opening/closing etc). jluhrsen if you have it handy and wouldn't mind attaching that here, it could be useful for repro. in the future. |
| Comment by Stephen Kitt [ 01/Mar/18 ] |
|
BTW Karaf has a built-in “threads” command which outputs the thread dump, or a formatted list of threads with the “–list” option; you can combine that with “| wc -l” to get a quick measure of the number of running threads. |
| Comment by Kit Lou [ 01/Mar/18 ] |
|
This is the patch ready to be merged: https://git.opendaylight.org/gerrit/#/c/68937/ |
| Comment by Jamo Luhrsen [ 01/Mar/18 ] |
for i in {1..5000}; do sshpass -p karaf ssh -p 8101 karaf@localhost -o StrictHostKeyChecking=no feature:list -i ; done |
| Comment by Jamo Luhrsen [ 02/Mar/18 ] |
cool, this can be added to CSIT at some point if we think it's valuable. The fix to calm down the logging |