[CONTROLLER-1762] ODL is up and ports are listening but not functional Created: 28/Aug/17  Updated: 19/Oct/17  Resolved: 12/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Sai Sindhur Malleni Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by CONTROLLER-1755 RaftActor lastApplied index moves bac... Resolved
Duplicate
duplicates CONTROLLER-1756 OOM due to huge Map in ShardDataTree Resolved
External issue ID: 9063
Priority: Highest

 Description   

Description of problem: While running longevity tests on a clustered ODL setup, we see that one of the ODL instances appears to be up and running according to ps output, systemctl, and the listening ports shown by netstat, but it does not seem to be functional. We could not even ssh into the Karaf console using ssh -p 8101 karaf@172.16.0.16 until we restarted OpenDaylight; after a service restart we were able to get into the Karaf shell and ODL seemed to come back up.
Of the other two ODL instances, one was killed due to OOM and the other seemed to be running fine. This happens after about 42 hours of running the tests.
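For illustration, the gap between "ports are listening" and "the service is functional" can be probed directly. The following is a minimal Java sketch, not part of the original report: the host and port are taken from the ssh command above and the 5-second timeout is an arbitrary choice. It opens the Karaf SSH port and waits for the SSH identification banner; a connection that is accepted but never yields a banner matches the behaviour described here.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    // Distinguishes "TCP port accepts connections" from "the service behind it responds".
    // Host/port are taken from the report (172.16.0.16:8101) purely as an example.
    public class KarafSshProbe {
        public static void main(String[] args) throws Exception {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress("172.16.0.16", 8101), 5000); // TCP-level connect
                s.setSoTimeout(5000); // give the SSH server 5 seconds to send its banner
                BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
                try {
                    // A healthy SSH server sends an identification string (e.g. "SSH-2.0-...") right away.
                    System.out.println("service responded: " + in.readLine());
                } catch (SocketTimeoutException e) {
                    System.out.println("connected, but no banner within 5s (service appears hung)");
                }
            }
        }
    }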
Setup:
3 ODLs
3 OpenStack Controllers
3 Compute nodes

Test:
Create 40 neutron resources (routers, networks, etc.), 2 at a time, using Rally and then delete them, over and over again. This is a long-running, low-stress test.

Entire Karaf Log: http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz

ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch



 Comments   
Comment by Sai Sindhur Malleni [ 28/Aug/17 ]

ODL became non-functional around 10:44 UTC on 08/28/2017. This was confirmed because collectd, which talks to the Karaf JMX endpoint, suddenly stopped reporting values for heap size. collectd was able to talk to the Karaf JMX endpoint again after the service restart. The break can be clearly observed at: https://snapshot.raintank.io/dashboard/snapshot/nf6OWq7jNSeT6vwjM71jlUSWc31E9LdW
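For what it's worth, the same heap figure collectd polls can be read with a few lines of plain Java over JMX, which gives a quick independent check of whether the JMX endpoint is still answering at all. A rough sketch follows; the JMX service URL and the karaf/karaf credentials are assumptions based on common Karaf defaults and would have to match the affected instance's configuration.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.util.HashMap;
    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Reads the remote JVM's heap usage through the platform MemoryMXBean over JMX.
    // The service URL and credentials below are placeholders, not taken from this issue.
    public class HeapProbe {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://172.16.0.16:1099/karaf-root"); // assumption
            Map<String, Object> env = new HashMap<>();
            env.put(JMXConnector.CREDENTIALS, new String[] {"karaf", "karaf"});  // assumption
            JMXConnector jmxc = JMXConnectorFactory.connect(url, env);
            try {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                        conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
                System.out.println("heap used = " + mem.getHeapMemoryUsage().getUsed() + " bytes");
            } finally {
                jmxc.close();
            }
        }
    }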

Comment by Jamo Luhrsen [ 28/Aug/17 ]

The karaf.log file is ~1G, so it's hard to debug with it. I did see a ton of timed-out messages in genius.lockmanager-impl at one point, but that could just be a symptom; in reality they make up less than 1% of the total log messages.

Comment by Michael Vorburger [ 29/Aug/17 ]

> it doesn't seem to be functional. We could not even ssh
> into the karaf terminal using ssh -p 8101 karaf@172.16.0.16

Sai, when you hit this kind of situation, it would be interesting to have a "thread stack dump" of that JVM, to see if we can spot anything obvious (e.g. a deadlock, or an extreme number of threads). Use the JDK's "jstack" utility to obtain this. Try "jstack -h" to learn it if you've never used it, and use the -l flag. You'll need to check which user you have to run jstack as for it to work - try it first on a non-stuck instance? If jstack doesn't work, its -F flag sometimes helps, but having to use that is a bad sign. If it still doesn't work, then we would probably need help from our OpenJDK team friends at Red Hat to understand what horrible thing the ODL code may be doing to get a JVM that badly stuck.
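Since the hang only shows up after ~40+ hours, one option is to capture dumps periodically so the stuck state is caught without anyone watching. A minimal sketch of that idea follows; the 10-minute interval, the output file naming, and passing the Karaf PID as an argument are arbitrary choices here, not anything prescribed in this issue.

    import java.io.File;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    // Periodically runs "jstack -l <pid>" and writes the output to a timestamped file,
    // so a dump from the stuck period is available afterwards.
    // Run it as the same user that owns the Karaf JVM (as noted above).
    public class PeriodicJstack {
        public static void main(String[] args) throws Exception {
            String pid = args[0];               // PID of the Karaf JVM
            long intervalMs = 10 * 60 * 1000L;  // every 10 minutes (arbitrary)
            DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");
            while (true) {
                File out = new File("jstack-" + pid + "-" + LocalDateTime.now().format(fmt) + ".txt");
                new ProcessBuilder("jstack", "-l", pid)
                        .redirectErrorStream(true)   // merge stderr into the same file
                        .redirectOutput(out)
                        .start()
                        .waitFor();
                Thread.sleep(intervalMs);
            }
        }
    }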

Comment by Sai Sindhur Malleni [ 29/Aug/17 ]

Sure, Michael. Next time I will do that.

If it helps, here is the Karaf thread count: https://snapshot.raintank.io/dashboard/snapshot/EgrJsRB7HJ6tl1pjLlSY4hb6wWvJS7nT

We can see that around 10:44 UTC the thread count suddenly spikes, and it falls back only after a restart.

Comment by Sai Sindhur Malleni [ 29/Aug/17 ]

ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch

Comment by Michael Vorburger [ 29/Aug/17 ]

> If it helps: The karaf thread count

Sai, that actually is an interesting observation - it's certainly possible that, in addition to OOM problems due to memory leaks related to TransactionChain (CONTROLLER-1756) and broken clustering (CONTROLLER-1763), we also have a "thread leak" issue (i.e. unbounded creation of new threads instead of correctly using a pool / Executor) somewhere in the code... it could be anywhere - netvirt, genius, mdsal - who knows. But before digging into that further, we would first need definitive proof that this is the underlying root cause here - maybe the systemctl status in CONTROLLER-1763 will give us that; let's see.
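One cheap step towards that definitive proof would be to group the thread names in a jstack dump by prefix and see which group grows over time, since threads leaked from a single source usually share a common name prefix. A hedged sketch of that grouping (the dump file path is passed as an argument; the example thread name in the comment is only illustrative):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.TreeMap;

    // Counts threads per name prefix in a jstack dump. In jstack output every thread
    // entry starts with the thread name in double quotes, e.g.
    //   "opendaylight-cluster-data-akka.actor.default-dispatcher-42" #123 ...
    // Trailing digits are stripped so members of one pool collapse into one bucket.
    public class ThreadGroupCount {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                if (!line.startsWith("\"")) continue;              // only thread header lines
                String name = line.substring(1, line.indexOf('"', 1));
                String prefix = name.replaceAll("[-#]?\\d+$", ""); // drop per-thread numbering
                counts.merge(prefix, 1, Integer::sum);
            }
            counts.forEach((prefix, n) -> System.out.println(n + "\t" + prefix));
        }
    }

Comparing the output for dumps taken a few hours apart should show whether one prefix keeps climbing, which would point at the component creating the threads.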

Comment by Michael Vorburger [ 06/Sep/17 ]

Wondering if CONTROLLER-1755 may have helped fix this - let's re-test and confirm whether it is still seen.
