CONTROLLER-1893: AbstractClientConnection deadlock


      A deadlock occurred between an application thread (reading the config datastore) and an Akka thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. It appears to completely block all interactions with the datastore and requires a manual restart.
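      For context, the application-side read that ends up blocked is an ordinary MD-SAL read transaction along the lines of the sketch below. This is illustrative only: the ConfigReader class and the InstanceIdentifier passed to it are placeholders, not our actual code.

      import com.google.common.base.Optional;
      import org.opendaylight.controller.md.sal.binding.api.DataBroker;
      import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
      import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
      import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
      import org.opendaylight.yangtools.yang.binding.DataObject;
      import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

      public final class ConfigReader {
          private final DataBroker dataBroker;

          public ConfigReader(DataBroker dataBroker) {
              this.dataBroker = dataBroker;
          }

          // "path" stands for whatever config node the application thread was reading.
          public <T extends DataObject> Optional<T> readConfig(InstanceIdentifier<T> path)
                  throws ReadFailedException {
              ReadOnlyTransaction tx = dataBroker.newReadOnlyTransaction();
              try {
                  // Blocks the calling thread until the backend answers.
                  return tx.read(LogicalDatastoreType.CONFIGURATION, path).checkedGet();
              } finally {
                  // close() is where Thread A in the attached jstack enters the
                  // cluster.access.client locking code and blocks.
                  tx.close();
              }
          }
      }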

      Attached are:

      • Deadlock stack traces from jstack
      • GC logs
      • A snippet from karaf.log related to this issue (the rest of the logs did not contain anything of substance, just NETCONF disconnect and reconnect details)

       

      Initial analysis of the jstack output: this looks like an ABBA deadlock between two instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with an invocation of ReadOnlyTransaction.close(), which flows through the two AbstractClientSession instances, trying to acquire the lock in each of them. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggered by ClientActorBehavior.onReceiveCommand and, due to a timeout, takes the "poison" path in the code, again passing through the two AbstractClientSession instances and trying to acquire their locks, but in the opposite order. More details can be found in the attached stack traces and in the attached diagram (image-2019-05-07-16-01-17-465.png).
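      For illustration, the shape of the problem with all ODL specifics stripped away is roughly the following plain-Java sketch; LOCK_A and LOCK_B merely stand in for the locks of the two instances, and the sleep only widens the race window:

      public final class AbbaDeadlockSketch {
          private static final Object LOCK_A = new Object();
          private static final Object LOCK_B = new Object();

          public static void main(String[] args) {
              // Analogous to Thread A (close() path): takes lock A, then wants lock B.
              new Thread(() -> { synchronized (LOCK_A) { pause(); synchronized (LOCK_B) { } } }).start();
              // Analogous to Thread B (timeout/poison path): takes lock B, then wants lock A.
              new Thread(() -> { synchronized (LOCK_B) { pause(); synchronized (LOCK_A) { } } }).start();
          }

          private static void pause() {
              try {
                  Thread.sleep(100);
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }
      }

      Both threads end up blocked forever on each other's monitor, which matches what the attached jstack shows for the two real threads.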

      I am really not sure why there are two instances of AbstractClientSession, nor do I have any idea why the timeout/poison path was triggered (according to the log it had been inactive for 30 minutes, but overall there was a lot of activity prior to this deadlock).

       

      Any idea how this deadlock could be fixed? Is it an issue inside AbstractClientSession, or perhaps some mismanagement on the application side? I tried to simulate this deadlock in a unit test, but so far no luck.
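      For reference, the reproduction attempt has roughly the shape of the test below: two threads take two locks in opposite order, and the test then asks the JVM's ThreadMXBean whether it sees a monitor deadlock. The sketch uses plain stand-in Objects; the part I have not managed yet is driving the real connection classes into the same state.

      import static org.junit.Assert.assertNotNull;

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadMXBean;
      import java.util.concurrent.CountDownLatch;
      import org.junit.Test;

      public class AbbaDeadlockDetectionTest {
          private final Object lockA = new Object();
          private final Object lockB = new Object();

          @Test(timeout = 10000)
          public void detectsAbbaDeadlock() throws Exception {
              // Both threads must hold their first lock before either attempts its second one.
              CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
              spawn(lockA, lockB, bothHoldFirstLock);
              spawn(lockB, lockA, bothHoldFirstLock);

              // Poll the JVM deadlock detector until it reports the two stuck threads.
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              long[] deadlocked = null;
              for (int i = 0; i < 50 && deadlocked == null; i++) {
                  Thread.sleep(100);
                  deadlocked = threads.findMonitorDeadlockedThreads();
              }
              assertNotNull("expected an ABBA deadlock between the two threads", deadlocked);
          }

          private static void spawn(Object first, Object second, CountDownLatch latch) {
              Thread t = new Thread(() -> {
                  synchronized (first) {
                      latch.countDown();
                      try {
                          latch.await();
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                      synchronized (second) {
                          // Unreachable: the other thread already holds "second".
                      }
                  }
              });
              // Daemon, so the intentionally stuck threads do not keep the test JVM alive.
              t.setDaemon(true);
              t.start();
          }
      }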

       

      ODL env info:

      • version: Oxygen-SR2 based (but the code for AbstractClientSession is almost identical on the master branch as well)
      • deployment: single node
      • uptime: approx. 15 hours
      • tell-based protocol
      • Xmx10G
      • cores: 12
      • OpenDaylight was running the NETCONF southbound + a FRINX-specific application
        • The NETCONF southbound was connected to 1000 devices and was frequently reconnecting them due to an (intentionally) faulty network
        • The FRINX app was listening for the mountpoints, reading config from them and then reading and writing some data into the datastore (a rough sketch of this access pattern follows below)
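      For completeness, the per-device access follows the usual MD-SAL mountpoint pattern, roughly as sketched below. The DeviceBrokerLookup helper and the mountPath parameter are placeholders, not our actual code; device config is read through the mountpoint's DataBroker, while the reads and writes that hit the controller datastore go through the global DataBroker.

      import com.google.common.base.Optional;
      import org.opendaylight.controller.md.sal.binding.api.DataBroker;
      import org.opendaylight.controller.md.sal.binding.api.MountPoint;
      import org.opendaylight.controller.md.sal.binding.api.MountPointService;
      import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

      public final class DeviceBrokerLookup {
          private DeviceBrokerLookup() {
          }

          // "mountPath" is the topology path of a mounted NETCONF node, built elsewhere.
          public static DataBroker deviceDataBroker(MountPointService mountService,
                  InstanceIdentifier<?> mountPath) {
              Optional<MountPoint> mountPoint = mountService.getMountPoint(mountPath);
              if (!mountPoint.isPresent()) {
                  throw new IllegalStateException("Mount point not present: " + mountPath);
              }
              Optional<DataBroker> deviceBroker = mountPoint.get().getService(DataBroker.class);
              if (!deviceBroker.isPresent()) {
                  throw new IllegalStateException("No DataBroker available on " + mountPath);
              }
              // Transactions created from this broker talk to the device itself.
              return deviceBroker.get();
          }
      }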

      Attachments:

        1. gc.log.0 (5.00 MB)
        2. gc.log.1.current (462 kB)
        3. image-2019-05-07-16-01-17-465.png (26 kB)
        4. jstack.txt (10 kB)
        5. karaf.partial.log (0.6 kB)
        6. stacktrace.poison.txt (34 kB)
