[CONTROLLER-1762] ODL is up and ports are listening but not functional
Created: 28/Aug/17  Updated: 19/Oct/17  Resolved: 12/Sep/17

| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Sai Sindhur Malleni | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 9063 |
| Priority: | Highest |
| Description |

Description of problem: While running longevity tests against a clustered ODL setup, we see that one of the ODL instances appears to be up and running, as reported by ps output, systemctl, and the netstat listening ports, but it is not functional. We could not even ssh into the Karaf shell with ssh -p 8101 karaf@172.16.0.16 until we restarted OpenDaylight. After a service restart we were able to get into the Karaf shell and ODL came back up.

Test:

Entire Karaf Log: http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz

ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch
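
For reference, a minimal sketch of the liveness checks described above; the systemd unit name "opendaylight" and the single-instance assumption are assumptions, while the port and address come from the report:

    # Process and service view - both looked healthy during the incident
    ps -ef | grep -i karaf
    systemctl status opendaylight

    # Listening sockets - 8101 is the Karaf SSH shell port
    netstat -tlnp | grep 8101

    # The actual functional check: this hung until OpenDaylight was restarted
    ssh -p 8101 karaf@172.16.0.16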
| Comments |
| Comment by Sai Sindhur Malleni [ 28/Aug/17 ] |
ODL became non-functional around 10:44 UTC on 08/28/2017. This was confirmed because collectd, which talks to the Karaf JMX, suddenly stopped reporting values for heap size. Collectd was able to talk to the Karaf JMX again after the service restart. The break can be clearly observed at: https://snapshot.raintank.io/dashboard/snapshot/nf6OWq7jNSeT6vwjM71jlUSWc31E9LdW
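
When a JMX-based collector like collectd goes silent, a host-side cross-check that bypasses JMX can show whether the JVM itself is still answering. A sketch, assuming a single Karaf JVM on the node and that the commands run as the user owning that JVM:

    # Locate the Karaf JVM (org.apache.karaf.main.Main is Karaf's main class)
    PID=$(pgrep -f org.apache.karaf.main.Main)

    # Heap usage sampled every 5 seconds, read directly from the JVM and
    # independent of the JMX connector that collectd uses
    jstat -gc "$PID" 5000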
| Comment by Jamo Luhrsen [ 28/Aug/17 ] |
The karaf.log file is ~1 GB, so it's hard to debug with. I did see a ton of timed-out messages in genius.lockmanager-impl at one point, but that could just be a
| Comment by Michael Vorburger [ 29/Aug/17 ] |
> it doesn't seem to be functional. We could not even ssh

Sai, when you hit this kind of situation, it would be interesting to have a "thread stack dump" of that JVM, to see if we can spot anything obvious (e.g. an obvious deadlock, or an extreme number of threads). Use the JDK's "jstack" utility to obtain it. Try "jstack -h" to learn it if you've never used it, and use the -l flag. You'll need to check which user you have to run jstack as for it to work - try it first on a non-stuck instance. If jstack doesn't work, its -F flag sometimes helps, but having to use that is a bad sign. If it still doesn't work, then we would probably need help from our OpenJDK team friends at Red Hat to understand what horrible thing the ODL code may be doing to get a JVM that badly stuck.
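
A sketch of the dump collection described above, assuming one Karaf JVM per node and that the commands run as the user that owns that JVM:

    # Locate the Karaf JVM
    PID=$(pgrep -f org.apache.karaf.main.Main)

    # -l adds java.util.concurrent lock information, which is what makes
    # deadlocks visible in the dump
    jstack -l "$PID" > /tmp/karaf-threads-$(date +%s).txt

    # Last resort if the normal dump hangs; needing -F is itself a bad sign
    jstack -F "$PID" > /tmp/karaf-threads-forced.txt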
| Comment by Sai Sindhur Malleni [ 29/Aug/17 ] |
Sure, Michael, next time I will do that. If it helps, here is the Karaf thread count: https://snapshot.raintank.io/dashboard/snapshot/EgrJsRB7HJ6tl1pjLlSY4hb6wWvJS7nT We can see that around 10:44 UTC the thread count suddenly spikes, and it falls back after a restart.
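
For the next occurrence, the spike can also be confirmed from the host without going through JMX; a sketch using the same assumed pgrep pattern as above:

    PID=$(pgrep -f org.apache.karaf.main.Main)

    # Kernel view of the thread count - works even if the JVM is wedged
    ls /proc/"$PID"/task | wc -l

    # JVM view: one "nid=" line per Java thread in a stack dump
    jstack "$PID" | grep -c 'nid='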
| Comment by Sai Sindhur Malleni [ 29/Aug/17 ] |
ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch
| Comment by Michael Vorburger [ 29/Aug/17 ] |
> If it helps: The karaf thread count

Sai, that is actually an interesting observation - it's certainly possible that, in addition to OOM problems due to memory leaks related to TransactionChain (
| Comment by Michael Vorburger [ 06/Sep/17 ] |
Wondering if