[GENIUS-86] LockManager being deadlocked Created: 11/Aug/17  Updated: 31/Jul/18  Resolved: 31/Jul/18

Status: Resolved
Project: genius
Component/s: General
Affects Version/s: Nitrogen
Fix Version/s: None

Type: Bug
Reporter: Kency Kurian Assignee: Kency Kurian
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8975

 Description   

The lockManager can end up deadlocked when an AskTimeoutException occurs.



 Comments   
Comment by Sam Hague [ 19/Aug/17 ]

https://git.opendaylight.org/gerrit/61977

Comment by Faseela K [ 02/Nov/17 ]

The patch has a merge conflict. Can we incorporate Michael's review comments, or discuss what is missing?

Comment by Michael Vorburger [ 06/Nov/17 ]

kencykurian@gmail.com is this reproducible? Do you have logs showing those AskTimeoutException errors?

https://git.opendaylight.org/gerrit/#/c/65233/

Comment by Kency Kurian [ 07/Nov/17 ]

Hi Michael,

I am not able to find the logs for this. However, I had sent a mail to controller dev with just the exception message, which I am pasting here:

TransactionCommitFailedException{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@192.168.123.3:2550/), Path(/user/shardmanager-operational/192.168.123.3-shard-default-operational/shard-192.168.123.4:datastore-operational@1:790#1851512584)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.BatchedModifications".]]}

The transaction eventually succeeded. However, because of the retries, when the lockmanager reads the DS it concludes that the lock is held by some other thread, when it was actually taken by the same thread whose transaction eventually succeeded. To avoid these situations we have introduced the owner field.
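
For illustration only, the idea behind the owner field can be sketched with an in-memory map standing in for the datastore. This is not the genius lockmanager code: the class name OwnerAwareLockSketch, its methods, and the use of a plain string as the owner are all assumptions made for this example. The point it shows is that a retry by the same owner is treated as success instead of as a conflict with another thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: an in-memory map stands in for the datastore.
// None of these names come from the genius code base.
class OwnerAwareLockSketch {
    // lock name -> owner that currently holds the lock
    private final ConcurrentMap<String, String> locks = new ConcurrentHashMap<>();

    // Returns true if the lock was free, or if it is already held by the same owner
    // (for example because an earlier attempt reported a failure but actually committed).
    boolean tryLock(String lockName, String owner) {
        String current = locks.putIfAbsent(lockName, owner);
        return current == null || current.equals(owner);
    }

    // Removes the lock only if it is held by the given owner.
    boolean unlock(String lockName, String owner) {
        return locks.remove(lockName, owner);
    }

    public static void main(String[] args) {
        OwnerAwareLockSketch lockManager = new OwnerAwareLockSketch();
        // First attempt: the lock is free, so thread-A takes it.
        System.out.println(lockManager.tryLock("lock-1", "thread-A"));
        // Retry by the same owner: still reported as held by thread-A, so it succeeds.
        System.out.println(lockManager.tryLock("lock-1", "thread-A"));
        // A different owner is refused while thread-A holds the lock.
        System.out.println(lockManager.tryLock("lock-1", "thread-B"));
    }
}
```

Without the owner comparison, the second call would look like a lock held by someone else, and the caller would wait for a release that only it can perform.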

Comment by Michael Vorburger [ 07/Nov/17 ]

IMHO the TransactionCommitFailedException caused by an AskTimeoutException after 30s of cluster issues should not be handled by application (genius lockmanager) code; it should lead to a failure, not to retries, unlike the OptimisticLockFailedException. c/65233 does just that (whereas in c/61526 you retry on any failure).
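
As a rough sketch of that policy (retry only on an optimistic-lock conflict, fail immediately on anything else), the following uses stand-in exception types so that it compiles without OpenDaylight on the classpath. CommitFailedException and OptimisticConflictException are hypothetical names that only mirror the relationship between TransactionCommitFailedException and OptimisticLockFailedException; commitWithRetry and maxAttempts are likewise assumptions, and this is not the code from either Gerrit change.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of "retry only on optimistic-lock conflicts".
class SelectiveRetrySketch {
    // Stand-in for a generic commit failure (e.g. one caused by an ask timeout).
    static class CommitFailedException extends Exception {
        CommitFailedException(String message) {
            super(message);
        }
    }

    // Stand-in for the conflict case, modelled as a subclass of the generic failure.
    static class OptimisticConflictException extends CommitFailedException {
        OptimisticConflictException(String message) {
            super(message);
        }
    }

    // Runs the commit, retrying up to maxAttempts times on conflicts only.
    // Any other exception propagates to the caller on the first occurrence.
    static <T> T commitWithRetry(Callable<T> commit, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        OptimisticConflictException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return commit.call();
            } catch (OptimisticConflictException e) {
                last = e; // conflict: worth another attempt
            }
        }
        throw last; // every attempt ended in a conflict
    }
}
```

With this shape, a failure that is not a conflict surfaces to the caller after a single attempt, so the application never re-submits a transaction whose outcome is unknown.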

IMHO if we need to handle cluster issues longer than 30s, then perhaps we need to bump some timeout somewhere there, not retry from application code.

The lockmanager should still never deadlock though, of course, so if your change fixes that, then I'm all for that part!

Generated at Wed Feb 07 19:59:52 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.