<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:56:34 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1836] Deadlock scenario with multi-shard transactions</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1836</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;The genius project has been running into a deadlock&#160;with multi-shard transactions. The log shows the following symptoms:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2018-05-30T15:08:06,198 | WARN&#160; | opendaylight-cluster-data-shard-dispatcher-88 | ShardDataTree&#160;| 240 - org.opendaylight.controller.sal-distributed-datastore -&#160;| 1.8.0.SNAPSHOT | member-1-shard-default-config: Current transaction member-1-datastore-config-fe-0-txn-1477-0 has timed out after 19233 ms in state CAN_COMMIT_COMPLETE
2018-05-30T15:08:06,198 | WARN&#160; | opendaylight-cluster-data-shard-dispatcher-65 | ShardDataTree&#160;| 240 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-1-shard-inventory-config:&#160;Current transaction member-1-datastore-config-fe-0-txn-1478-0 has timed out after 19234 ms in state READY
2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-88 | Shard&#160;| 232 - org.opendaylight.controller.sal-clustering-commons -&#160;| 1.8.0.SNAPSHOT | member-1-shard inventory-config: Cannot canCommit transaction member-1-datastore-config-fe-0-txn-1478-0 - no cohort entry found 
2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-65 | Shard&#160; &#160; &#160;  | 232&#160; - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-1-shard-default-config: Cannot commit transaction member-1-datastore-config-fe-0-txn-1477-0 - no cohort entry found
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The deadlock occurs if the ready messages are interleaved between 2 or more transactions that access the same shards. In this case, tx1 and tx2 are both writing to the inventory and default shards. tx1 sends the ready message to the default shard first and is added to the pendingTransactions queue before tx2. However, the opposite happens for the inventory shard, i.e. tx2 sends ready and is added to pendingTransactions first. So when tx2 sends CanCommit to the default shard, it&apos;s not at the head of pendingTransactions and is not processed - tx2 can&apos;t proceed until tx1 fully completes. tx1 sends CanCommit to the default shard and canCommit completes there. It then sends CanCommit to the inventory shard, but tx2 is at the head of that queue, so tx1 can&apos;t proceed either. Neither tx can make progress until they time out.&lt;/p&gt;
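&lt;p&gt;The interleaving described above can be reproduced with a small toy model. This is only an illustrative sketch, not code from the datastore: the function name stuck_transactions and its inputs are hypothetical, each tx is assumed to send CanCommit to its shards one after another, and timeouts are ignored so the stuck state is visible.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def stuck_transactions(queues, plan):
    """Return the txs that can never finish canCommit in this toy model.

    queues: shard name to tx ids in the order they were readied (assumed input).
    plan:   tx id to the shards it sends CanCommit to, in order (assumed input).
    """
    queues = {shard: list(txs) for shard, txs in queues.items()}
    step = {tx: 0 for tx in plan}  # index of the next shard each tx will ask
    progressed = True
    while progressed:
        progressed = False
        for tx, shards in plan.items():
            if step[tx] == len(shards):
                continue  # tx already finished on every shard
            shard = shards[step[tx]]
            if queues[shard][0] != tx:
                continue  # head-of-queue rule: only the head may canCommit
            step[tx] += 1
            progressed = True
            if step[tx] == len(shards):
                # Fully done: the tx leaves every pending queue.
                for pending in queues.values():
                    if tx in pending:
                        pending.remove(tx)
    return sorted(tx for tx in plan if step[tx] != len(plan[tx]))


plan = {"tx1": ["default", "inventory"], "tx2": ["default", "inventory"]}
interleaved = {"default": ["tx1", "tx2"], "inventory": ["tx2", "tx1"]}
print(stuck_transactions(interleaved, plan))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;With the interleaved ready order both txs are reported as stuck. If both shards see the same ready order, the model drains both queues.&lt;/p&gt;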

&lt;p&gt;This issue was originally reported via &lt;a href=&quot;https://jira.opendaylight.org/browse/GENIUS-166&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.opendaylight.org/browse/GENIUS-166&lt;/a&gt; and was fixed for single-node by &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/72650/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/72650/&lt;/a&gt;. However, the same issue can occur with 3 (or more) nodes.&lt;/p&gt;

&lt;p&gt;The following outlines a proposed solution that should work for single and multi-node.&lt;/p&gt;

&lt;p&gt;The crux of the problem is that the ShardDataTree doesn&apos;t allow 3PC to start for a tx unless it&apos;s at the head of the pendingTransactions queue - this is done to honor the order in which the tx&apos;s were &quot;readied&quot;, specifically to maintain tx chain integrity. The deadlock scenario occurs when 2 tx&apos;s access the same shards and, for the second shard in the sequence, the first tx is behind the second tx in the pendingTransactions queue. Therefore I propose we relax this rule for all but the first shard in a tx: include the sorted list of participating shard names in the ready messages (if multi-shard) and, on the CanCommit request, use that list to determine whether a tx can be moved ahead of another tx in the queue to avoid potential deadlock. Specifically, let tx A be a pending tx in the READY state that is ahead of the requesting tx in the queue. If tx A&apos;s preceding participating shard names (those sorted before the current shard) match those of the requesting tx, the requesting tx is moved ahead of tx A so that it is processed first. This avoids the potential deadlock where tx A is behind the requesting tx in the pendingTransactions queue for a preceding shard. If the requesting tx ends up at the head of the queue as a result, it proceeds with CanCommit.&lt;/p&gt;

&lt;p&gt;&lt;ins&gt;&lt;b&gt;Scenario&#160;1:&lt;/b&gt;&lt;/ins&gt;&lt;/p&gt;

&lt;p&gt;tx1 -&amp;gt; shard A, shard B&lt;br/&gt;
 tx2 -&amp;gt; A, B&lt;/p&gt;

&lt;p&gt;Queue for shard A -&amp;gt; tx1, tx2&lt;br/&gt;
 B -&amp;gt; tx2, tx1&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;tx2 sends CanCommit to A - tx1 is at the head of the Q so tx2 is not allowed to proceed because A is the first shard in the participating shard list.&lt;/li&gt;
	&lt;li&gt;tx1 sends CanCommit to A and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx1 sends CanCommit to B - tx2 is at the head of the Q but the preceding shards in tx1&apos;s participating shard list [A] match those of tx2 [A] so tx1 is moved ahead of tx2 and proceeds with CanCommit. &lt;b&gt;Note:&lt;/b&gt; previously this resulted in deadlock.&lt;/li&gt;
	&lt;li&gt;tx1 completes 3PC&lt;/li&gt;
	&lt;li&gt;tx2 proceeds on A etc&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;ins&gt;&lt;b&gt;Scenario 2&lt;/b&gt;&lt;/ins&gt;:&lt;/p&gt;

&lt;p&gt;tx1 -&amp;gt; A, C&lt;br/&gt;
 tx2 -&amp;gt; B, C&lt;/p&gt;

&lt;p&gt;A -&amp;gt; tx1&lt;br/&gt;
 B -&amp;gt; tx2&lt;br/&gt;
 C -&amp;gt; tx2, tx1&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;tx1 sends CanCommit to A and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx2 sends CanCommit to B and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx1 sends CanCommit to C - tx2 is at the head of the Q. The preceding shards in tx1&apos;s participating shard list [A] do not match those of tx2 [B] so tx1 is not moved and does not proceed with CanCommit. This preserves the ready order.&lt;/li&gt;
	&lt;li&gt;tx2 sends CanCommit to C and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx1 proceeds&#160;with&#160;CanCommit C etc&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;ins&gt;&lt;b&gt;Scenario 3&lt;/b&gt;&lt;/ins&gt;:&lt;/p&gt;

&lt;p&gt;tx1 -&amp;gt; A, B&lt;br/&gt;
 tx2 -&amp;gt; B, C&lt;/p&gt;

&lt;p&gt;A -&amp;gt; tx1&lt;br/&gt;
 B -&amp;gt; tx2, tx1&lt;br/&gt;
 C -&amp;gt; tx2&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;tx1 sends CanCommit to A and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx1 sends CanCommit to B - tx2 is at the head of the Q. The preceding shards in tx1&apos;s participating shard list [A] do not match those of tx2 [] so tx1 is not moved and does not proceed with CanCommit. This preserves the ready order.&lt;/li&gt;
	&lt;li&gt;tx2 sends CanCommit to B and is at the head of the Q so proceeds.&lt;/li&gt;
	&lt;li&gt;tx1 proceeds with CanCommit on B etc&lt;/li&gt;
&lt;/ul&gt;
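&lt;p&gt;The three scenarios can be checked against a simplified reading of the proposed rule. The sketch below is an assumption-laden model, not the actual ShardDataTree change: the helper names preceding_shards and try_can_commit are hypothetical, the per-shard tx state is passed in as a plain dict, and the requesting tx only hops over READY txs directly ahead of it.&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
def preceding_shards(participants, shard):
    """Shard names sorted before `shard` in a tx's sorted participant list."""
    names = sorted(participants)
    return names[:names.index(shard)]


def try_can_commit(queue, tx, shard, participants, state):
    """Apply the relaxed rule for `tx` on `shard` (hypothetical model).

    queue:        tx ids pending on this shard, in ready order.
    participants: tx id to the shard names the tx touches.
    state:        tx id to its state on this shard, e.g. "READY".
    Returns (allowed, new_queue); allowed means tx is now at the head.
    """
    queue = list(queue)
    mine = preceding_shards(participants[tx], shard)
    pos = queue.index(tx)
    # The rule is only relaxed when this is not the first shard of tx,
    # i.e. when tx has at least one preceding shard.
    while mine and pos != 0:
        other = queue[pos - 1]
        if state[other] != "READY":
            break  # never jump ahead of a tx that already started 3PC
        if preceding_shards(participants[other], shard) != mine:
            break  # different preceding shards: keep the ready order
        queue[pos - 1], queue[pos] = tx, other
        pos -= 1
    return queue[0] == tx, queue


ready = {"tx1": "READY", "tx2": "READY"}

# Scenario 1: tx1 and tx2 both touch A and B; B readied tx2 first.
s1 = {"tx1": ["A", "B"], "tx2": ["A", "B"]}
print(try_can_commit(["tx1", "tx2"], "tx2", "A", s1, ready))
print(try_can_commit(["tx2", "tx1"], "tx1", "B", s1, ready))

# Scenario 2: tx1 touches A, C and tx2 touches B, C; C readied tx2 first.
s2 = {"tx1": ["A", "C"], "tx2": ["B", "C"]}
print(try_can_commit(["tx2", "tx1"], "tx1", "C", s2, ready))

# Scenario 3: tx1 touches A, B and tx2 touches B, C; B readied tx2 first.
s3 = {"tx1": ["A", "B"], "tx2": ["B", "C"]}
print(try_can_commit(["tx2", "tx1"], "tx1", "B", s3, ready))
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Only the second call in Scenario 1 reorders the queue. In the other cases the requesting tx stays behind the head, which keeps the ready order intact.&lt;/p&gt;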
</description>
                <environment></environment>
        <key id="30124">CONTROLLER-1836</key>
            <summary>Deadlock scenario with multi-shard transactions</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.opendaylight.org/images/icons/priorities/critical.svg">High</priority>
                        <status id="10004" iconUrl="https://jira.opendaylight.org/images/icons/status_generic.gif" description="">Verified</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="tpantelis">Tom Pantelis</assignee>
                                    <reporter username="tpantelis">Tom Pantelis</reporter>
                        <labels>
                            <label>csit:3node</label>
                    </labels>
                <created>Mon, 11 Jun 2018 13:29:42 +0000</created>
                <updated>Wed, 27 Jun 2018 15:40:13 +0000</updated>
                            <resolved>Thu, 21 Jun 2018 08:48:01 +0000</resolved>
                                    <version>Nitrogen</version>
                    <version>Oxygen</version>
                    <version>Fluorine</version>
                                    <fixVersion>Fluorine</fixVersion>
                    <fixVersion>Oxygen SR3</fixVersion>
                                    <component>clustering</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                                                                <comments>
                            <comment id="63414" author="tpantelis" created="Tue, 12 Jun 2018 19:41:40 +0000"  >&lt;p&gt;Submitted&#160;&#160;&lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/72874/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/72874/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="63596" author="vorburger" created="Thu, 21 Jun 2018 12:20:36 +0000"  >&lt;p&gt;What are people&apos;s thoughts about (both the need for, and the effort required for) back-porting this one from Fluorine master to stable/oxygen for Oxygen SR3?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=shague&quot; class=&quot;user-hover&quot; rel=&quot;shague&quot;&gt;shague&lt;/a&gt;&#160;and &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=skitt&quot; class=&quot;user-hover&quot; rel=&quot;skitt&quot;&gt;skitt&lt;/a&gt;, is this important for us? &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=k.faseela&quot; class=&quot;user-hover&quot; rel=&quot;k.faseela&quot;&gt;k.faseela&lt;/a&gt;, do you have anyone wanting to do the back-port?&lt;/p&gt;</comment>
                            <comment id="63598" author="tpantelis" created="Thu, 21 Jun 2018 12:37:40 +0000"  >&lt;p&gt;cherry-pick failed so it would have to be done manually.&lt;/p&gt;</comment>
                            <comment id="63715" author="vorburger" created="Tue, 26 Jun 2018 16:35:21 +0000"  >&lt;p&gt;&lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/73454/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/73454/&lt;/a&gt;&#160;proposes to back-port this for Oxygen SR3.&lt;/p&gt;</comment>
                            <comment id="63742" author="faseela.k@ericsson.com" created="Wed, 27 Jun 2018 15:40:13 +0000"  >&lt;p&gt;Not seeing the exception currently.&#160;&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10000">
                    <name>Blocks</name>
                                            <outwardlinks description="blocks">
                                        <issuelink>
            <issuekey id="30045">GENIUS-166</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10002" key="com.pyxis.greenhopper.jira:gh-epic-link">
                        <customfieldname>Epic Link</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>NETVIRT-996</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i03fj3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>