[CONTROLLER-1629] AbstractConcurrentDataBrokerTest causes deadlocks which AbstractDataBrokerTest did not Created: 05/Apr/17  Updated: 25/Jul/23  Resolved: 13/Jul/17

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Michael Vorburger Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File bug8163-jstack-1.txt    
Issue Links:
Blocks
blocks BGPCEP-665 InstructionDeployerImplTest hangs per... Resolved
External issue ID: 8163

 Description   

The AbstractConcurrentDataBrokerTest, introduced in CONTROLLER-1578 to replace the AbstractDataBrokerTest, while working well in a number of tests I've been using it since, seems to cause deadlocks in some rare cases.

For example, https://git.opendaylight.org/gerrit/#/c/54353/ replaces AbstractDataBrokerTest in Daexim by AbstractConcurrentDataBrokerTest, and causes DataExportImportAppProviderTest to deadlock... Attached is a thread dump.

Tali Ben-Meir seems to have hit a similar issue in a not-yet-published new test she's writing for project Federation, according to private email exchange with a thread dump that looks similar.



 Comments   
Comment by Michael Vorburger [ 05/Apr/17 ]

Attachment bug8163-jstack-1.txt has been added with description: jstack of deadlock in DataExportImportAppProviderTest when changing from @Deprecated AbstractDataBrokerTest to AbstractConcurrentDataBrokerTest

Comment by Tom Pantelis [ 13/Apr/17 ]

The problem is the use of MoreExecutors$DirectExecutorService in the QueuedNotificationManager and the fact that the DataExportImportAppProvider does a blocking get on the submit Future in onDataTreeChanged. The solution is to not use a direct executor for publishing change notifications.

Comment by Michael Vorburger [ 14/Apr/17 ]

> DataExportImportAppProvider does a blocking get on the submit Future in onDataTreeChanged

I may be too tired to see it - but where? DataExportImportAppProvider has a registerDataTreeChangeListener() in init() with an (anonymous / lambda) onDataTreeChanged which goes into ipcHandler() - but I don't see a get() in there? There definitely are many in other places though (c/55022), so I'm probably just not seeing what you're pointing to...

> The problem is the use of MoreExecutors$DirectExecutorService in the QueuedNotificationManager (...) The solution is to not use a direct executor for publishing change notifications.

In Daexim or in controller? Oh wait, I'm seeing that this DirectExecutorService is created in AbstractDataBrokerTestCustomizer .. this sounds strangely familiar.. Is this the discussion we've already had in https://bugs.opendaylight.org/show_bug.cgi?id=7538#c9 ? I can try to pick that up from where left that off...

PS, just FTR: This WORKED before switching from the AbstractDataBrokerTest to the AbstractConcurrentDataBrokerTest, and more importantly it works in "production" (with the "real" not test DataBroker) .. so it's a problem in controller utilities?

Comment by Tom Pantelis [ 18/Apr/17 ]

(In reply to Michael Vorburger from comment #3)
> I may be too tired to see it - but where? DataExportImportAppProvider has a
> registerDataTreeChangeListener() in init() with an (anonymous / lambda)
> onDataTreeChanged which goes into ipcHandler() - but I don't see a get() in
> there?

One code path is ipcHandler -> updateNodeStatus. The code in this class really should be changed to be async and not block on Futures.

> Is this the discussion we've already had in
> https://bugs.opendaylight.org/show_bug.cgi?id=7538#c9 ? I can try to pick
> that up from where left that off...

Yes.

> PS, just FTR: This WORKED before switching from the AbstractDataBrokerTest
> to the AbstractConcurrentDataBrokerTest, and more importantly it works in
> "production" (with the "real" not test DataBroker) .. so it's a problem in
> controller utilities?

It works in production b/c there are no direct executors used in production core code. CDS uses akka actors and is all async. In this case the InMemoryDOMDataStore executor in the tests cannot not be a direct executor since the DataExportImportAppProvider code does blocking gets in its DTCL callback.

Comment by Claudio David Gasparini [ 06/May/17 ]

Hi, investigating BUG-8335 It seems that Im hitting the same issue.

dead lock is caused by AbstractConcurrentDataBrokerTest.
Debugging I see that it hangs when AbstractRegistrationTree#removeRegistration
is trying to take the log.

Created a demo class and test to easily reproduce the issue.
https://git.opendaylight.org/gerrit/#/c/56640/

moving the bug as confirmed, you can changed it you don't agree

Comment by Tom Pantelis [ 04/Jul/17 ]

Is this fixed now?

Comment by Michael Vorburger [ 05/Jul/17 ]

> Is this fixed now?

Nope, this one slipped through the cracks on my side; never got to this, due to other (internal, downstream) prios .. if you/someone would like to have a go at it, please don't let me stop you.. Otherwise I'll try to get to this in the next weeks, time permitting.

Comment by Tom Pantelis [ 05/Jul/17 ]

(In reply to Michael Vorburger from comment #7)
> > Is this fixed now?
>
> Nope, this one slipped through the cracks on my side; never got to this, due
> to other (internal, downstream) prios .. if you/someone would like to have a
> go at it, please don't let me stop you.. Otherwise I'll try to get to
> this in the next weeks, time permitting.

I commented earlier on the solution based on the original thread dump - do not use a direct executor for publishing change notifications. I can push a patch for that but I suspect some downstream tests may fail intermittently if they're not expecting DTCL notification to be async.

Comment by Tom Pantelis [ 06/Jul/17 ]

Submitted https://git.opendaylight.org/gerrit/#/c/60022/ to use a thread pool executor instead of direct executor for DTCL notifications. This fixes the deadlock with the DemoBugTest in https://git.opendaylight.org/gerrit/#/c/56640/.

This change may break existing tests if they expect DTCL notifications to be immediate but we can tackle those if/when they occur. If it becomes too problematic we can keep AbstractConcurrentDataBrokerTest using the direct executor and create another version that uses a thread pool although I'd rather not do that.

Comment by Michael Vorburger [ 13/Jul/17 ]

> create another version that uses a thread pool although I'd rather not do that

FTR: https://git.opendaylight.org/gerrit/#/c/60022/ ultimately ended up having to do this in order not to break backward compatibility, and introduces an option to choose whether or not you want the DataTreeChangeListenerExecutor to be multi threaded or not.

Comment by Michael Vorburger [ 13/Jul/17 ]

https://git.opendaylight.org/gerrit/#/c/60312/

Generated at Wed Feb 07 19:56:02 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.