[CONTROLLER-1893] AbstractClientConnection deadlock Created: 07/May/19 Updated: 16/May/19 Resolved: 16/May/19 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Oxygen SR2 |
| Fix Version/s: | Sodium, Fluorine SR3, Neon SR2 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Maros Marsalek | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
A deadlock occurred between an application thread (reading the config datastore) and an Akka thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. It seems to completely block all interactions with the datastore and requires a manual restart. Attached is:
Initial analysis of the jstack: it looks like an ABBA deadlock between 2 instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with an invocation of ReadOnlyTransaction.close(), which flows through 2 AbstractClientSession instances, trying to acquire the lock in each instance. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggered by ClientActorBehavior.onReceiveCommand and, due to a timeout, takes the "poison" path in the code, again passing through 2 instances of AbstractClientSession and trying to acquire their locks in the process (however, in the opposite order). More details can be found in the stacktrace or in the diagram:
I am really not sure why there are 2 instances of AbstractClientSession, nor do I have any idea why the timeout/poison was triggered (according to the log, it was inactive for 30 minutes, but overall there is a lot of activity prior to this deadlock).
Any idea how this deadlock could be fixed? Is it an issue inside AbstractClientSession, or perhaps some mismanagement on the application side? I tried to simulate this deadlock in a unit test, but so far no luck.
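For illustration only, here is a minimal, self-contained sketch of the suspected ABBA pattern: two threads each hold one lock and wait for the other, acquired in opposite order. The class, lock, and thread names (AbbaDeadlockSketch, CONNECTION_A/B, app-close-tx, actor-poison) are hypothetical stand-ins for the scenario described above, not the actual AbstractClientConnection code.
{code:java}
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the ABBA deadlock described above; not ODL code.
public class AbbaDeadlockSketch {
    private static final ReentrantLock CONNECTION_A = new ReentrantLock();
    private static final ReentrantLock CONNECTION_B = new ReentrantLock();

    public static void main(String[] args) {
        // Thread A: models the application thread closing a transaction,
        // taking connection A's lock first and then needing B's.
        Thread appThread = new Thread(() -> {
            CONNECTION_A.lock();
            try {
                pause();               // widen the race window
                CONNECTION_B.lock();   // blocks forever once thread B holds B
                CONNECTION_B.unlock();
            } finally {
                CONNECTION_A.unlock();
            }
        }, "app-close-tx");

        // Thread B: models the actor thread on the timeout/poison path,
        // taking the same two locks in the opposite order.
        Thread actorThread = new Thread(() -> {
            CONNECTION_B.lock();
            try {
                pause();
                CONNECTION_A.lock();   // blocks forever once thread A holds A
                CONNECTION_A.unlock();
            } finally {
                CONNECTION_B.unlock();
            }
        }, "actor-poison");

        appThread.start();
        actorThread.start();
    }

    private static void pause() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}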
ODL env info:
|
| Comments |
| Comment by Maros Marsalek [ 07/May/19 ] |
|
Regarding: karaf.log.7:2019-05-06T15:08:50,640 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-105 | AbstractClientConnection | 203 - org.opendaylight.controller.cds-access-client - 1.3.2.Oxygen-SR2_4_2_1_rc3-frinxodl-SNAPSHOT | Queue ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#2024575561], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} has not seen progress in 1964 seconds, failing all requests
I don't really know what happened here, nor what the consequences would have been had the deadlock not occurred. Any details on this would also be helpful.
Thanks, Maros
|
| Comment by Robert Varga [ 07/May/19 ] |
|
cds-access-client logs at INFO level would be nice, as we need to understand the lifecycle of the two connections. The connection being poisoned seems to be the successor in this case, which would indicate that the originating transaction was allocated while the successor was not yet there; by the reconnecting connection's accounting (which itself may be inaccurate) that was 1964 seconds ago. If there is real activity going on, the connection should have reconnected and been replaced by a successor, which would have drained its queue, preventing the timer from being anything but a no-op.
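As a rough illustration of the lifecycle described above, here is a hedged sketch of a connection with a request queue, a reconnect path that drains the queue into a successor, and a no-progress check that only fails requests still sitting in the queue. All names (SketchConnection, enqueue, replaceWith, runNoProgressCheck) are hypothetical and are not the cds-access-client API.
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: a timely reconnect drains the queue into a successor,
// so a later no-progress check finds nothing to fail and becomes a no-op.
final class SketchConnection {
    private final Queue<String> queue = new ArrayDeque<>();
    private volatile long lastProgressMillis = System.currentTimeMillis();

    void enqueue(String request) {
        queue.add(request);
    }

    // Reconnect path: move all pending requests to the successor connection.
    void replaceWith(SketchConnection successor) {
        while (!queue.isEmpty()) {
            successor.enqueue(queue.remove());
        }
    }

    // Timer path: if nothing has progressed for too long, fail what is queued.
    void runNoProgressCheck(long timeoutMillis) {
        if (queue.isEmpty()) {
            return; // queue already drained by a successor
        }
        if (System.currentTimeMillis() - lastProgressMillis >= timeoutMillis) {
            queue.forEach(request -> System.err.println("failing request: " + request));
            queue.clear();
        }
    }
}
{code}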
|
| Comment by Robert Varga [ 07/May/19 ] |
|
I would suggest picking up the reestablishment patches that went into Neon, which can help mask the issue by preventing the poisoning from happening. |
| Comment by Robert Varga [ 09/May/19 ] |
|
This seems similar to |
| Comment by Maros Marsalek [ 10/May/19 ] |
|
I can confirm that https://git.opendaylight.org/gerrit/#/c/81949/ works. Verified by a unit test, https://git.opendaylight.org/gerrit/#/c/81979/, as well as by running an ODL instance. |
| Comment by Maros Marsalek [ 10/May/19 ] |
|
Ad "verified by running ODL instance": I think the same problem occurred again and the poison was triggerred. This time however there was no deadlock. The Datastore responds but with an error: see attached stacktrace: stacktrace.poison.txt Robert, you mentioned "reestablishment patches" could help here. Could you give me a pointer please ? Also, I don't have "cds-access-client logs from INFO" yet, but I will try to acquire them from the next ODL deployment. |