[CONTROLLER-1893] AbstractClientConnection deadlock Created: 07/May/19  Updated: 16/May/19  Resolved: 16/May/19

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Oxygen SR2
Fix Version/s: Sodium, Fluorine SR3, Neon SR2

Type: Bug Priority: Medium
Reporter: Maros Marsalek Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File gc.log.0     File gc.log.1.current     PNG File image-2019-05-07-16-01-17-465.png     Text File jstack.txt     File karaf.partial.log     Text File stacktrace.poison.txt    

 Description   

A deadlock occurred between an application thread (reading the config datastore) and an Akka thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. It seems to completely block all interactions with the Datastore and requires a manual restart.

Attached are:

  • DEADLOCK stack traces from jstack
  • GC logs
  • A snippet from karaf.log related to this issue (the rest of the logs did not contain anything of substance, just netconf disconnect and reconnect details)

 

Initial analysis of the jstack output: this looks like an ABBA deadlock between 2 instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with an invocation of ReadOnlyTransaction.close(), which flows through 2 AbstractClientSession instances, trying to acquire the lock in each instance. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggered by ClientActorBehavior.onReceiveCommand and, due to a timeout, takes the "poison" path in the code, again passing through 2 instances of AbstractClientSession and trying to acquire their locks in the process (however, in the opposite order). More details can be found in the stack traces or in the attached diagram:
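For illustration only, here is a minimal, self-contained Java sketch of an ABBA deadlock of the kind described above. It does not use the actual AbstractClientSession/AbstractClientConnection code; the two ReentrantLock instances stand in for the per-instance locks, and the two threads stand in for Uniconfig-task-20 and the Akka dispatcher thread.

    import java.util.concurrent.locks.ReentrantLock;

    public class AbbaDeadlockIllustration {
        // Stand-ins for the locks held by the two session/connection instances.
        private static final ReentrantLock LOCK_A = new ReentrantLock();
        private static final ReentrantLock LOCK_B = new ReentrantLock();

        public static void main(String[] args) {
            // Thread A: acquires A, then needs B (analogous to close() flowing into the second instance).
            Thread threadA = new Thread(() -> {
                LOCK_A.lock();
                try {
                    sleep(100);        // widen the race window
                    LOCK_B.lock();     // blocks forever once thread B holds B
                    LOCK_B.unlock();
                } finally {
                    LOCK_A.unlock();
                }
            }, "thread-A");

            // Thread B: acquires B, then needs A (analogous to the timeout/poison path).
            Thread threadB = new Thread(() -> {
                LOCK_B.lock();
                try {
                    sleep(100);
                    LOCK_A.lock();     // blocks forever once thread A holds A
                    LOCK_A.unlock();
                } finally {
                    LOCK_B.unlock();
                }
            }, "thread-B");

            threadA.start();
            threadB.start();
            // Both threads now wait on each other's lock: a classic ABBA deadlock, so the JVM never exits.
        }

        private static void sleep(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }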

I am really not sure why there are 2 instances of AbstractClientSession, nor do I have any idea why the timeout/poison was triggered (according to the log it had been inactive for roughly 30 minutes, but overall there was a lot of activity prior to this deadlock).

 

Any idea how this deadlock could be fixed? Is it an issue inside AbstractClientSession, or perhaps some mismanagement on the application side? I tried to simulate this deadlock in a unit test, but so far no luck.
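As an aside, when trying to reproduce this kind of deadlock in a unit test, the JDK's ThreadMXBean can be used to assert that no threads ended up deadlocked. The following is a generic JUnit 4 sketch, not the test referenced later in this issue; runScenario() is a hypothetical placeholder for whatever code drives the two contending threads.

    import static org.junit.Assert.assertNull;

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;
    import java.util.Arrays;
    import org.junit.Test;

    public class DeadlockDetectionTest {

        @Test
        public void scenarioDoesNotDeadlock() throws InterruptedException {
            // Hypothetical placeholder: start the threads that race on the locks.
            runScenario();

            // Give the threads a moment to collide.
            Thread.sleep(500);

            // findDeadlockedThreads() returns the IDs of deadlocked threads, or null if there are none.
            ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
            long[] deadlocked = threadMxBean.findDeadlockedThreads();
            assertNull("Detected deadlocked threads: " + Arrays.toString(deadlocked), deadlocked);
        }

        private void runScenario() {
            // Placeholder for the code that exercises the contending lock paths.
        }
    }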

 

ODL env info:

  • version: Oxygen-SR2 based (but the code for AbstractClientSession is almost identical on the master branch as well)
  • deployment: Single node
  • Uptime: approx 15 hours
  • TELL based
  • Xmx10G
  • cores == 12
  • Opendaylight was running netconf southbound + FRINX specific application
    • Netconf southbound was connected to 1000 devices and was frequently reconnecting them due to an (intentionally) faulty network
    • The FRINX app was listening for the mountpoints, reading config from them and then reading and writing some data in the Datastore (see the sketch after this list)
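A rough sketch of that application-side access pattern, assuming the Oxygen-era controller binding DataBroker API; the readAndWrite helper and the generic path parameter are illustrative placeholders, not the actual FRINX code.

    import com.google.common.base.Optional;
    import com.google.common.util.concurrent.CheckedFuture;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.ReadWriteTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    public final class DatastoreAccessExample {

        private DatastoreAccessExample() {
        }

        /**
         * Reads data at the given path from the CONFIGURATION datastore and, if present,
         * writes it back (a stand-in for whatever transformation the application performs).
         */
        static <T extends DataObject> void readAndWrite(DataBroker broker, InstanceIdentifier<T> path)
                throws ReadFailedException, TransactionCommitFailedException {
            ReadWriteTransaction tx = broker.newReadWriteTransaction();

            CheckedFuture<Optional<T>, ReadFailedException> readFuture =
                    tx.read(LogicalDatastoreType.CONFIGURATION, path);
            Optional<T> data = readFuture.checkedGet();

            if (data.isPresent()) {
                tx.put(LogicalDatastoreType.CONFIGURATION, path, data.get());
            }

            // With the TELL-based protocol the resulting requests are queued on the
            // cds-access-client connection discussed in this issue.
            tx.submit().checkedGet();
        }
    }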


 Comments   
Comment by Maros Marsalek [ 07/May/19 ]

Regarding:

karaf.log.7:2019-05-06T15:08:50,640 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-105 | AbstractClientConnection         | 203 - org.opendaylight.controller.cds-access-client - 1.3.2.Oxygen-SR2_4_2_1_rc3-frinxodl-SNAPSHOT | Queue ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#2024575561], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} has not seen progress in 1964 seconds, failing all requests

 

I don't really know what has happened here ... and what the consequences would have been had the DEADLOCK not occurred. Any details on this would also be helpful.

 

Thanks,

Maros

 

Comment by Robert Varga [ 07/May/19 ]

cds-access-client logs from INFO level would be nice, as we need to understand the lifecycle of the two connections. The connection being poisoned seems to be the successor in this case – which would indicate the originating transaction was allocated while it was still not there – which seems to be 1964 seconds by the reconnecting connection's accounting (which itself may be inaccurate).

If there is real activity going on, the connection should have reconnected and been replaced by a successor – which would have drained its queue, preventing the timer from being anything but a no-op.
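A simplified illustration of the behaviour described above, not the actual AbstractClientConnection code: once a successor has taken over and the queue has been drained, the "no progress" timer has nothing left to fail and degenerates into a no-op; the names and structure here are placeholders.

    import java.util.ArrayDeque;
    import java.util.Queue;

    final class ProgressTimerSketch {
        private final Queue<Object> pendingRequests = new ArrayDeque<>();
        private volatile Object successor;    // set once the connection has been replaced

        void setSuccessor(Object successor) {
            this.successor = successor;
        }

        // Invoked periodically by the actor's timer.
        void onProgressTimeout(long elapsedSeconds, long maxSilenceSeconds) {
            if (pendingRequests.isEmpty() || successor != null) {
                // Queue already drained or handed over to the successor: nothing to fail, no-op.
                return;
            }
            if (elapsedSeconds >= maxSilenceSeconds) {
                // The path hit in this issue: no progress seen, so all queued requests are failed ("poisoned").
                poisonAll();
            }
        }

        private void poisonAll() {
            pendingRequests.clear();    // in the real code each request's callback completes exceptionally
        }
    }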

 

Comment by Robert Varga [ 07/May/19 ]

I would suggest picking up the reestablishment patches that went into Neon, which can help mask the issue by preventing the poison from happening.

Comment by Robert Varga [ 09/May/19 ]

This seems similar to CONTROLLER-1745 – hence moving poisoning outside the lock should help.
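To illustrate the direction of that fix (completing the poison callbacks outside the lock), here is a generic Java sketch of the lock-narrowing pattern; it is not the actual cds-access-client code, and the Request/failWith names are placeholders.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    final class PoisonOutsideLockSketch {
        private final Lock lock = new ReentrantLock();
        private final Queue<Request> queue = new ArrayDeque<>();

        // Placeholder for whatever a queued request looks like.
        interface Request {
            void failWith(Throwable cause);
        }

        /**
         * Deadlock-prone shape: callbacks run while the lock is held, so a callback that
         * touches another instance's lock can form an ABBA cycle.
         */
        void poisonHoldingLock(Throwable cause) {
            lock.lock();
            try {
                for (Request request : queue) {
                    request.failWith(cause);    // may re-enter another instance's lock
                }
                queue.clear();
            } finally {
                lock.unlock();
            }
        }

        /**
         * Safer shape: detach the requests under the lock, then complete the callbacks
         * only after the lock has been released.
         */
        void poisonOutsideLock(Throwable cause) {
            List<Request> toFail;
            lock.lock();
            try {
                toFail = new ArrayList<>(queue);
                queue.clear();
            } finally {
                lock.unlock();
            }
            for (Request request : toFail) {
                request.failWith(cause);        // no lock held here, so no ABBA ordering is possible
            }
        }
    }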

Comment by Maros Marsalek [ 10/May/19 ]

I can confirm that https://git.opendaylight.org/gerrit/#/c/81949/ works. Verified by a unit test: https://git.opendaylight.org/gerrit/#/c/81979/ as well as by running an ODL instance.

Comment by Maros Marsalek [ 10/May/19 ]

Ad "verified by running ODL instance": I think the same problem occurred again and the poison was triggerred. This time however there was no deadlock. The Datastore responds but with an error:

   see attached stacktrace: stacktrace.poison.txt

Robert, you mentioned "reestablishment patches" could help here. Could you give me a pointer, please?

Also, I don't have "cds-access-client logs from INFO" yet, but I will try to acquire them from the next ODL deployment.
