Uploaded image for project: 'controller'
  1. controller
  2. CONTROLLER-1893

AbstractClientConnection deadlock

    XMLWordPrintable

Details

    Description

      A deadlock occurred between Application thread (reading config DS) and AKKA thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. Which seems to completely block all interactions with the Datastore and requires manual restart.

      Attached is:

      • DEADLOCK stacktraces from jstack
      • GC logs 
      • Snippet from karaf.log which is related to this issue (the rest of the logs did not contain anything of substance, just netconf disconnect and reconnect details)

       

      Initial analysis of jstack: Looks like an ABBA deadlock between 2 instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with invocation of ReadOnlyTransaction.close() which flows through 2 AbstractClientSession instances trying to acquire lock in each instance. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggerred by ClientActorBehavior.onReceiveCommand and due to timeout triggers "poison" path in the code, again passing through 2 instances of AbstractClientSession trying to acquire locks in the process (however the order is opposite). More details can be found in the stacktrace or in diagram:

      I am really not sure why there are 2 instances of AbstractClientSession, nor have I any idea why the timeout/poison was triggerred (according to the log, it was 30 minutes inactive, but overall there is a lot of activity prior to this deadlock).

       

      Any idea how this deadlock could be fixed ? Is it an issue inside AbstractClientSession or perhaps some mismanagement on the application side ? I tried to simulate this deadlock in a unit test but so far no luck. 

       

      ODL env info:

      • version: Oxygen-SR2 based (But the code for AbstractClientSession is almost identical on master branch as well)
      • deployment: Single node
      • Uptime: approx 15 hours
      • TELL based
      • Xmx10G
      • cores == 12
      • Opendaylight was running netconf southbound + FRINX specific application
        • Netconf southbound was connected to 1000 devices and frequently reconnecting them due to faulty network (intentinally)
        • FRINX app was listening for the mountpoints, reading config from them and then READing and WRITEing some data into Datastore

      Attachments

        1. image-2019-05-07-16-01-17-465.png
          26 kB
          Maros Marsalek
        2. jstack.txt
          10 kB
          Maros Marsalek
        3. karaf.partial.log
          0.6 kB
          Maros Marsalek
        4. gc.log.1.current
          462 kB
          Maros Marsalek
        5. gc.log.0
          5.00 MB
          Maros Marsalek
        6. stacktrace.poison.txt
          34 kB
          Maros Marsalek
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            rovarga Robert Varga
            mmarsalek Maros Marsalek
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: