CONTROLLER-1893: AbstractClientConnection deadlock


      A deadlock occurred between an application thread (reading the config datastore) and an Akka thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. It appears to completely block all interactions with the datastore and requires a manual restart.
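      For context, the application-side read that ends up blocked is an ordinary MD-SAL read transaction along the lines of the sketch below. This is illustrative only: the ConfigReader class and the InstanceIdentifier passed to it are placeholders, not our actual code.

      import com.google.common.base.Optional;
      import org.opendaylight.controller.md.sal.binding.api.DataBroker;
      import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
      import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
      import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
      import org.opendaylight.yangtools.yang.binding.DataObject;
      import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

      public final class ConfigReader {
          private final DataBroker dataBroker;

          public ConfigReader(DataBroker dataBroker) {
              this.dataBroker = dataBroker;
          }

          // "path" stands for whatever config node the application thread was reading.
          public <T extends DataObject> Optional<T> readConfig(InstanceIdentifier<T> path)
                  throws ReadFailedException {
              ReadOnlyTransaction tx = dataBroker.newReadOnlyTransaction();
              try {
                  // Blocks the calling thread until the backend answers.
                  return tx.read(LogicalDatastoreType.CONFIGURATION, path).checkedGet();
              } finally {
                  // close() is where Thread A in the attached jstack enters the
                  // cluster.access.client locking code and blocks.
                  tx.close();
              }
          }
      }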

      Attached are:

      • Deadlock stack traces from jstack
      • GC logs
      • A snippet from karaf.log related to this issue (the rest of the logs did not contain anything of substance, just NETCONF disconnect and reconnect details)

       

      Initial analysis of the jstack output: this looks like an ABBA deadlock between two instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with an invocation of ReadOnlyTransaction.close(), which flows through the two AbstractClientSession instances, trying to acquire the lock in each of them. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggered by ClientActorBehavior.onReceiveCommand and, due to a timeout, takes the "poison" path in the code, again passing through the two AbstractClientSession instances and trying to acquire their locks, but in the opposite order. More details can be found in the attached stack traces and in the attached diagram (image-2019-05-07-16-01-17-465.png).
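      For illustration, the shape of the problem with all ODL specifics stripped away is roughly the following plain-Java sketch; LOCK_A and LOCK_B merely stand in for the locks of the two instances, and the sleep only widens the race window:

      public final class AbbaDeadlockSketch {
          private static final Object LOCK_A = new Object();
          private static final Object LOCK_B = new Object();

          public static void main(String[] args) {
              // Analogous to Thread A (close() path): takes lock A, then wants lock B.
              new Thread(() -> { synchronized (LOCK_A) { pause(); synchronized (LOCK_B) { } } }).start();
              // Analogous to Thread B (timeout/poison path): takes lock B, then wants lock A.
              new Thread(() -> { synchronized (LOCK_B) { pause(); synchronized (LOCK_A) { } } }).start();
          }

          private static void pause() {
              try {
                  Thread.sleep(100);
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }
      }

      Both threads end up blocked forever on each other's monitor, which matches what the attached jstack shows for the two real threads.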

      I am really not sure why there are two instances of AbstractClientSession, nor do I have any idea why the timeout/poison path was triggered (according to the log it had been inactive for 30 minutes, but overall there was a lot of activity prior to this deadlock).

       

      Any idea how this deadlock could be fixed? Is it an issue inside AbstractClientSession, or perhaps some mismanagement on the application side? I tried to simulate this deadlock in a unit test, but so far no luck.
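      For reference, the reproduction attempt has roughly the shape of the test below: two threads take two locks in opposite order, and the test then asks the JVM's ThreadMXBean whether it sees a monitor deadlock. The sketch uses plain stand-in Objects; the part I have not managed yet is driving the real connection classes into the same state.

      import static org.junit.Assert.assertNotNull;

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadMXBean;
      import java.util.concurrent.CountDownLatch;
      import org.junit.Test;

      public class AbbaDeadlockDetectionTest {
          private final Object lockA = new Object();
          private final Object lockB = new Object();

          @Test(timeout = 10000)
          public void detectsAbbaDeadlock() throws Exception {
              // Both threads must hold their first lock before either attempts its second one.
              CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
              spawn(lockA, lockB, bothHoldFirstLock);
              spawn(lockB, lockA, bothHoldFirstLock);

              // Poll the JVM deadlock detector until it reports the two stuck threads.
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              long[] deadlocked = null;
              for (int i = 0; i < 50 && deadlocked == null; i++) {
                  Thread.sleep(100);
                  deadlocked = threads.findMonitorDeadlockedThreads();
              }
              assertNotNull("expected an ABBA deadlock between the two threads", deadlocked);
          }

          private static void spawn(Object first, Object second, CountDownLatch latch) {
              Thread t = new Thread(() -> {
                  synchronized (first) {
                      latch.countDown();
                      try {
                          latch.await();
                      } catch (InterruptedException e) {
                          Thread.currentThread().interrupt();
                          return;
                      }
                      synchronized (second) {
                          // Unreachable: the other thread already holds "second".
                      }
                  }
              });
              // Daemon, so the intentionally stuck threads do not keep the test JVM alive.
              t.setDaemon(true);
              t.start();
          }
      }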

       

      ODL env info:

      • version: Oxygen-SR2 based (but the code for AbstractClientSession is almost identical on the master branch as well)
      • deployment: single node
      • uptime: approx. 15 hours
      • tell-based protocol
      • Xmx10G
      • cores: 12
      • OpenDaylight was running the NETCONF southbound + a FRINX-specific application
        • The NETCONF southbound was connected to 1000 devices and was frequently reconnecting them due to an (intentionally) faulty network
        • The FRINX app was listening for the mountpoints, reading config from them and then reading and writing some data into the datastore (a rough sketch of this access pattern follows below)
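      For completeness, the per-device access follows the usual MD-SAL mountpoint pattern, roughly as sketched below. The DeviceBrokerLookup helper and the mountPath parameter are placeholders, not our actual code; device config is read through the mountpoint's DataBroker, while the reads and writes that hit the controller datastore go through the global DataBroker.

      import com.google.common.base.Optional;
      import org.opendaylight.controller.md.sal.binding.api.DataBroker;
      import org.opendaylight.controller.md.sal.binding.api.MountPoint;
      import org.opendaylight.controller.md.sal.binding.api.MountPointService;
      import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

      public final class DeviceBrokerLookup {
          private DeviceBrokerLookup() {
          }

          // "mountPath" is the topology path of a mounted NETCONF node, built elsewhere.
          public static DataBroker deviceDataBroker(MountPointService mountService,
                  InstanceIdentifier<?> mountPath) {
              Optional<MountPoint> mountPoint = mountService.getMountPoint(mountPath);
              if (!mountPoint.isPresent()) {
                  throw new IllegalStateException("Mount point not present: " + mountPath);
              }
              Optional<DataBroker> deviceBroker = mountPoint.get().getService(DataBroker.class);
              if (!deviceBroker.isPresent()) {
                  throw new IllegalStateException("No DataBroker available on " + mountPath);
              }
              // Transactions created from this broker talk to the device itself.
              return deviceBroker.get();
          }
      }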

      Attachments:

        1. gc.log.0 (5.00 MB)
        2. gc.log.1.current (462 kB)
        3. image-2019-05-07-16-01-17-465.png (26 kB)
        4. jstack.txt (10 kB)
        5. karaf.partial.log (0.6 kB)
        6. stacktrace.poison.txt (34 kB)
