[CONTROLLER-1186] Clustering : ConcurrentDOMDataBroker/DOMConcurrentDataCommitCoordinator cause deadlocks Created: 06/Mar/15  Updated: 30/Mar/15  Resolved: 30/Mar/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Moiz Raja Assignee: Moiz Raja
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2792
Priority: Highest

 Description   

Here is the scenario:

T1 and T2 are two threads which create Txn1 and Txn2. Both of these transactions touch both the config and operational datastores, so when Txn1 and Txn2 are submitted they each have 2 cohorts. On the CDS side we end up with 1 transaction for config and 1 transaction for operational per broker transaction, each with 1 cohort. Let's call the cohorts Txn11, Txn12, Txn21 and Txn22.

When Txn1 and Txn2 are submitted the concurrent broker attempts to do a canCommit on all the cohorts of Txn1 and Txn2 concurrently. On the CDS side we therefore try to do canCommit on the cohorts for all 4 Transactions.

When we do a canCommit on a CDS transaction, the cohortEntry on which the canCommit is triggered gets queued on the ShardCommitCoordinator. This cohortEntry is only removed when a commit is called on that cohort. On the DataBroker end, preCommit will be called on the cohorts only when the canCommit responses for both cohorts have been received.
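
That gating can be illustrated with a small sketch. This is not the actual ConcurrentDOMDataBroker code; the Cohort interface below is a simplified stand-in for the real three-phase commit cohort, and CompletableFuture is used instead of the broker's listenable futures, purely to show that preCommit and commit are only reached once every cohort has answered canCommit.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Simplified stand-in for a commit cohort: one per datastore touched by a
// broker transaction (so Txn1 has two of these, Txn11 and Txn12).
interface Cohort {
    CompletableFuture<Boolean> canCommit();
    CompletableFuture<Void> preCommit();
    CompletableFuture<Void> commit();
}

final class BrokerCommitSketch {

    // Phase 1 (canCommit) is fired on every cohort concurrently, but phases 2
    // and 3 are only reached once *all* canCommit replies have arrived.
    static CompletableFuture<Void> submit(List<Cohort> cohorts) {
        CompletableFuture<?>[] canCommits = cohorts.stream()
                .map(Cohort::canCommit)
                .toArray(CompletableFuture<?>[]::new);

        return CompletableFuture.allOf(canCommits)
                .thenCompose(v -> phase(cohorts, Cohort::preCommit))
                .thenCompose(v -> phase(cohorts, Cohort::commit));
    }

    private static CompletableFuture<Void> phase(
            List<Cohort> cohorts, Function<Cohort, CompletableFuture<Void>> step) {
        return CompletableFuture.allOf(cohorts.stream()
                .map(step)
                .toArray(CompletableFuture<?>[]::new));
    }
}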

Because canCommit is triggered asynchronously, it just so happens that on the ShardCommitCoordinators the cohort entries get queued as follows:

Txn21 -> Txn11 = Operational Shard Coordinator

Txn12 -> Txn22 = Config Shard Coordinator

The operational ShardCommitCoordinator processes the first item in its queue and sends a CanCommitTransactionReply for Txn21. Similarly, the config ShardCommitCoordinator sends the CanCommitTransactionReply for Txn12.

Txn21 and Txn12 can be removed from their queues only when a commit arrives for those cohorts, but that will never happen because it would require that a canCommit response first be received for cohorts Txn11 and Txn22. Thus the deadlock.
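
A minimal, self-contained sketch of that interleaving (using the cohort names from the description above; the classes are simplified stand-ins, not the real ShardCommitCoordinator): each per-shard coordinator replies to canCommit only for the head of its FIFO queue, the head is dequeued only when a commit arrives, and the broker sends commit only after both of a transaction's cohorts have received canCommit replies.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeadlockSketch {
    public static void main(String[] args) {
        // Per-shard FIFO queues, in the order the asynchronous canCommits happened to arrive.
        Deque<String> operationalQueue = new ArrayDeque<>(List.of("Txn21", "Txn11"));
        Deque<String> configQueue = new ArrayDeque<>(List.of("Txn12", "Txn22"));

        // Cohorts each broker transaction must hear canCommit replies from before it commits.
        Map<String, Set<String>> stillWaitingOn = new HashMap<>();
        stillWaitingOn.put("Txn1", new HashSet<>(List.of("Txn11", "Txn12")));
        stillWaitingOn.put("Txn2", new HashSet<>(List.of("Txn21", "Txn22")));

        // Each coordinator replies only to the cohort at the head of its queue.
        stillWaitingOn.get("Txn2").remove(operationalQueue.peek()); // reply for Txn21
        stillWaitingOn.get("Txn1").remove(configQueue.peek());      // reply for Txn12

        // Neither transaction has all of its replies, so no commit is ever sent,
        // so neither head is ever dequeued, so Txn11 and Txn22 never get replies.
        System.out.println("Txn1 still waiting on " + stillWaitingOn.get("Txn1")); // [Txn11]
        System.out.println("Txn2 still waiting on " + stillWaitingOn.get("Txn2")); // [Txn22]
        System.out.println("Queue heads stuck at " + operationalQueue.peek()
                + " and " + configQueue.peek());
    }
}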

In CDS this deadlock eventually resolves itself via Akka timeouts.

One way to reproduce this problem is by using the netconf scale test.

The manual instructions for running this test are as follows.

Prerequisite: an ODL distribution with the odl-restconf and odl-netconf-connector-all features installed.

Download the latest testtool:
https://nexus.opendaylight.org/service/local/artifact/maven/redirect?r=opendaylight.snapshot&g=org.opendaylight.controller&a=netconf-testtool&v=0.3.0-SNAPSHOT&e=jar&c=executable

Run testtool with the following command (don't just copy-paste and run it; make sure all the settings are correct according to the instructions below):

java -Dorg.apache.sshd.registerBouncyCastle=false -Xmx8G -XX:MaxPermSize=1G -jar ./netconf-testtool-0.3.0-SNAPSHOT-executable.jar --ssh true --generate-configs-batch-size 4000 --exi false --generate-config-connection-timeout 10000000 --generate-config-address 127.0.0.1 --device-count 10000 --distribution-folder ~/data/odl/distribution-karaf-0.3.0-SNAPSHOT/ --starting-port 17830 --schemas-dir ~/data/odl/yang --debug false

For an explanation of the arguments, consult the wiki:
https://wiki.opendaylight.org/index.php?title=OpenDaylight_Controller:Netconf:Testtool#Testtool_help

For now you only need to set --distribution-folder to the distribution that you are going to run and --schemas-dir to the directory with the YANG schemas you want the simulated devices to use. If you don't want to use any extra schemas, just remove that argument.

After testtool is done rewriting the configs in the distribution folder you can run Karaf.
Make sure Karaf can run with more than 2 GB of RAM, since with extra schemas or more features installed 2 GB of RAM is not sufficient for 10k devices.

You can monitor progress via RESTCONF:
http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/

When a device is connected, it has a <connected>true</connected> element in the RESTCONF response.
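
If you want to script that check rather than eyeball the response, a small sketch like the following counts connected devices. It assumes the default RESTCONF credentials admin/admin and XML output; adjust both to your setup.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ConnectedDeviceCount {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Default RESTCONF credentials assumed; change if your setup differs.
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/"))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/xml")
                .GET()
                .build();

        String body = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();

        // Each connected simulated device shows up as <connected>true</connected>.
        int connected = body.split("<connected>true</connected>", -1).length - 1;
        System.out.println("Connected devices: " + connected);
    }
}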

NOTE: In Helium, if you disable concurrent commits for the DistributedDataStore this problem disappears because of the sequential commit coordinator.



 Comments   
Comment by Moiz Raja [ 25/Mar/15 ]

https://git.opendaylight.org/gerrit/#/c/17136/
