-
Bug
-
Resolution: Done
-
Medium
-
Oxygen SR2
-
None
A deadlock occurred between Application thread (reading config DS) and AKKA thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. Which seems to completely block all interactions with the Datastore and requires manual restart.
Attached is:
- DEADLOCK stacktraces from jstack
- GC logs
- Snippet from karaf.log which is related to this issue (the rest of the logs did not contain anything of substance, just netconf disconnect and reconnect details)
Initial analysis of jstack: Looks like an ABBA deadlock between 2 instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with invocation of ReadOnlyTransaction.close() which flows through 2 AbstractClientSession instances trying to acquire lock in each instance. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggerred by ClientActorBehavior.onReceiveCommand and due to timeout triggers "poison" path in the code, again passing through 2 instances of AbstractClientSession trying to acquire locks in the process (however the order is opposite). More details can be found in the stacktrace or in diagram:
I am really not sure why there are 2 instances of AbstractClientSession, nor have I any idea why the timeout/poison was triggerred (according to the log, it was 30 minutes inactive, but overall there is a lot of activity prior to this deadlock).
Any idea how this deadlock could be fixed ? Is it an issue inside AbstractClientSession or perhaps some mismanagement on the application side ? I tried to simulate this deadlock in a unit test but so far no luck.
ODL env info:
- version: Oxygen-SR2 based (But the code for AbstractClientSession is almost identical on master branch as well)
- deployment: Single node
- Uptime: approx 15 hours
- TELL based
- Xmx10G
- cores == 12
- Opendaylight was running netconf southbound + FRINX specific application
- Netconf southbound was connected to 1000 devices and frequently reconnecting them due to faulty network (intentinally)
- FRINX app was listening for the mountpoints, reading config from them and then READing and WRITEing some data into Datastore