controller / CONTROLLER-1186

Clustering : ConcurrentDOMDataBroker/DOMConcurrentDataCommitCoordinator cause deadlocks


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Done
    • Fix Version/s: Post-Helium
    • Affects Version/s: None
    • Component/s: mdsal
    • Labels: None
    • Operating System: All
      Platform: All
    • 2792
    • Priority: Highest

    Description

      Here is the scenario,

      T1 and T2 are two threads which create Txn1 and Txn2. Both of these transactions touch both the config and operational datastores, so when Txn1 and Txn2 are submitted they have two cohorts each. On the CDS side we end up with one transaction for config and one for operational per broker transaction, each with one cohort. Call the cohorts Txn11, Txn12, Txn21, Txn22.

      When Txn1 and Txn2 are submitted, the concurrent broker attempts to do a canCommit on all the cohorts of Txn1 and Txn2 concurrently. On the CDS side we therefore try to do a canCommit on the cohorts of all four transactions.

      When we do a canCommit on a CDS transaction, the cohortEntry on which the canCommit is triggered gets queued on the ShardCommitCoordinator. This cohortEntry is only removed when a commit is called on that cohort. On the DataBroker end, preCommit will only be called on the cohorts once the canCommit responses for both cohorts have been received.

      Because canCommit is triggered asynchronously, it can happen that the cohort entries get queued on the ShardCommitCoordinators as follows:

      Operational ShardCommitCoordinator queue: Txn21 -> Txn11

      Config ShardCommitCoordinator queue: Txn12 -> Txn22

      The Operational ShardCommitCoordinator processes the first item in its queue and sends a CanCommitTransactionReply for Txn21. Similarly, the Config ShardCommitCoordinator sends the CanCommitTransactionReply for Txn12.

      Txn21 and Txn12 can only be removed from the queue when a commit arrives for those cohorts, but that will never happen, because it would first require a canCommit response for cohorts Txn11 and Txn22 — and those sit behind the queue heads. Thus the deadlock.

      In CDS this deadlock eventually resolves via Akka timeouts.
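The queueing behaviour above can be sketched as a small simulation. This is a hypothetical model, not CDS code: the queue contents, the cohort-to-transaction ownership, and the rule "commit (and thus dequeue) only after both of a transaction's cohorts have replied to canCommit" are taken from the scenario described in this report.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ShardDeadlockSketch {

    /** Runs the two coordinator queues until no further progress is possible
     *  and returns them, so we can see that neither one drains. */
    static List<Deque<String>> simulate() {
        // Cohort queues as observed in the report (head = first element).
        Deque<String> operational = new ArrayDeque<>(List.of("Txn21", "Txn11"));
        Deque<String> config      = new ArrayDeque<>(List.of("Txn12", "Txn22"));

        // Which broker transaction each cohort belongs to.
        Map<String, String> owner = Map.of(
                "Txn11", "Txn1", "Txn12", "Txn1",
                "Txn21", "Txn2", "Txn22", "Txn2");

        Set<String> replied = new HashSet<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Deque<String> q : List.of(operational, config)) {
                // A coordinator replies canCommit only for the head of its queue.
                String head = q.peek();
                if (head != null && replied.add(head)) progress = true;
            }
            for (Deque<String> q : List.of(operational, config)) {
                // The broker sends commit (dequeuing the entry) only once BOTH
                // of a transaction's cohorts have replied to canCommit.
                String head = q.peek();
                if (head != null) {
                    String txn = owner.get(head);
                    long replies = replied.stream()
                            .filter(c -> owner.get(c).equals(txn)).count();
                    if (replies == 2) {
                        q.poll();
                        progress = true;
                    }
                }
            }
        }
        return List.of(operational, config);
    }

    public static void main(String[] args) {
        List<Deque<String>> queues = simulate();
        // Neither queue drains: Txn11 and Txn22 are stuck behind heads whose
        // commits can never arrive.
        System.out.println("operational: " + queues.get(0));
        System.out.println("config: " + queues.get(1));
    }
}
```

Running this shows both queues unchanged after the first round of canCommit replies: Txn21's commit needs a reply from Txn22, which is stuck behind Txn12, and Txn12's commit needs a reply from Txn11, which is stuck behind Txn21.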

      One way to reproduce this problem is by using the netconf scale test.

      The manual instructions for running this test are as follows.

      Use an ODL distribution with the odl-restconf and odl-netconf-connector-all features installed.

      Download latest testool:
      https://nexus.opendaylight.org/service/local/artifact/maven/redirect?r=opendaylight.snapshot&g=org.opendaylight.controller&a=netconf-testtool&v=0.3.0-SNAPSHOT&e=jar&c=executable

      Run the testtool with the following command (don't just copy-paste and run it; make sure all the settings are correct according to the instructions below):

      java -Dorg.apache.sshd.registerBouncyCastle=false -Xmx8G -XX:MaxPermSize=1G -jar ./netconf-testtool-0.3.0-SNAPSHOT-executable.jar --ssh true --generate-configs-batch-size 4000 --exi false --generate-config-connection-timeout 10000000 --generate-config-address 127.0.0.1 --device-count 10000 --distribution-folder ~/data/odl/distribution-karaf-0.3.0-SNAPSHOT/ --starting-port 17830 --schemas-dir ~/data/odl/yang --debug false

      For an explanation of the arguments, consult the wiki:
      https://wiki.opendaylight.org/index.php?title=OpenDaylight_Controller:Netconf:Testtool#Testtool_help

      For now you only need to set --distribution-folder to the distribution you are going to run and --schemas-dir to the directory with the YANG schemas you want the simulated devices to use. If you don't want to use any extra schemas, just remove that argument.

      After the testtool is done rewriting the configs in the distribution folder, you can run Karaf.
      Make sure Karaf can run with more than 2 GB of RAM, since with extra schemas or more features installed, 2 GB is not sufficient for 10k devices.

      You can monitor progress over RESTCONF:
      http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/

      When a device is connected, it has a <connected>true</connected> element in the RESTCONF response.
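A quick way to count how many devices have connected is to grep the RESTCONF response for that element. This is a sketch: the admin:admin credentials and the XML Accept header are assumptions about a default installation, not taken from this report.

```shell
# count_connected: count <connected>true</connected> occurrences on stdin.
count_connected() {
  grep -o '<connected>true</connected>' | wc -l | tr -d ' '
}

# Against a running controller (assumed default credentials admin:admin):
#   curl -s -u admin:admin -H 'Accept: application/xml' \
#     http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/ \
#     | count_connected

# Local demonstration on a sample response fragment:
printf '%s' \
  '<node><connected>true</connected></node><node><connected>false</connected></node><node><connected>true</connected></node>' \
  | count_connected
```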

      NOTE: In Helium, if you disable concurrent commits for the DistributedDataStore, this problem disappears because of the sequential coordinator.

      People

        Assignee: Moiz Raja (moraja@cisco.com)
        Reporter: Moiz Raja (moraja@cisco.com)