<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:56:43 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1893] AbstractClientConnection deadlock</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1893</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;A deadlock occurred between an application thread (reading the config DS) and an Akka thread inside org.opendaylight.controller.cluster.access.client.AbstractClientConnection. It appears to completely block all interactions with the Datastore and requires a manual restart.&lt;/p&gt;

&lt;p&gt;Attached is:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;DEADLOCK stacktraces from jstack&lt;/li&gt;
	&lt;li&gt;GC logs&lt;/li&gt;
	&lt;li&gt;Snippet from karaf.log which is related to this issue (the rest of the logs did not contain anything of substance, just netconf disconnect and reconnect details)&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;Initial analysis of jstack: this looks like an ABBA deadlock between 2 instances of AbstractClientSession. Thread A (Uniconfig-task-20) starts with an invocation of ReadOnlyTransaction.close(), which flows through 2 AbstractClientSession instances, trying to acquire the lock in each instance. Thread B (opendaylight-cluster-data-akka.actor.default-dispatcher-105) is triggered by ClientActorBehavior.onReceiveCommand and, due to a timeout, triggers the &quot;poison&quot; path in the code, again passing through 2 instances of AbstractClientSession and trying to acquire locks in the process (but in the opposite order). More details can be found in the stacktrace or in the diagram:&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;image-wrap&quot; style=&quot;&quot;&gt;&lt;img src=&quot;https://jira.opendaylight.org/secure/attachment/15210/15210_image-2019-05-07-16-01-17-465.png&quot; style=&quot;border: 0px solid black&quot; /&gt;&lt;/span&gt;&lt;/p&gt;
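
&lt;p&gt;The lock ordering described above can be reproduced in isolation. The following is a minimal, hypothetical sketch and not the actual cds-access-client code: Conn, Worker and the thread names are made up for illustration, ReentrantLock is assumed as the lock type, and a timed tryLock() stands in for the blocking lock() so that the demo terminates instead of hanging.&lt;/p&gt;

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; Conn is a stand-in, not the real cds-access-client class.
final class AbbaDeadlockSketch {
    static final class Conn {
        final ReentrantLock lock = new ReentrantLock();
    }

    // One worker: lock 'first', wait until the peer holds its own first lock, then try 'second'.
    static final class Worker implements Runnable {
        private final Conn first;
        private final Conn second;
        private final CountDownLatch bothHoldFirst;
        private final CountDownLatch bothAttempted;
        private final AtomicInteger stuck;

        Worker(Conn first, Conn second, CountDownLatch bothHoldFirst,
                CountDownLatch bothAttempted, AtomicInteger stuck) {
            this.first = first;
            this.second = second;
            this.bothHoldFirst = bothHoldFirst;
            this.bothAttempted = bothAttempted;
            this.stuck = stuck;
        }

        @Override
        public void run() {
            first.lock.lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                // A plain lock() here would block forever; the timeout lets the demo terminate.
                if (second.lock.tryLock(200, TimeUnit.MILLISECONDS)) {
                    second.lock.unlock();
                } else {
                    stuck.incrementAndGet();
                }
                // Keep holding 'first' until the peer has also finished its attempt.
                bothAttempted.countDown();
                bothAttempted.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                first.lock.unlock();
            }
        }
    }

    // Returns how many of the two threads failed to take their second lock.
    static int run() {
        Conn a = new Conn();
        Conn b = new Conn();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        AtomicInteger stuck = new AtomicInteger();
        // Opposite acquisition order: a then b versus b then a.
        Thread t1 = new Thread(new Worker(a, b, bothHoldFirst, bothAttempted, stuck), "close-path");
        Thread t2 = new Thread(new Worker(b, a, bothHoldFirst, bothAttempted, stuck), "poison-path");
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stuck.get();
    }

    public static void main(String[] args) {
        System.out.println("threads unable to take second lock: " + run());
    }
}
```

&lt;p&gt;In this sketch both threads time out on their second lock, because each keeps its first lock until the other has finished trying. With a plain lock() in place of the timed tryLock(), both would wait on each other indefinitely, which is the state jstack reports as a deadlock.&lt;/p&gt;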

&lt;p&gt;I am really not sure why there are 2 instances of AbstractClientSession, nor do I have any idea why the timeout/poison was triggered (according to the log, it had been inactive for 30 minutes, but overall there is a lot of activity prior to this deadlock).&lt;/p&gt;


&lt;p&gt;Any idea how this deadlock could be fixed? Is it an issue inside AbstractClientSession or perhaps some mismanagement on the application side? I tried to simulate this deadlock in a unit test, but so far no luck.&lt;/p&gt;


&lt;p&gt;ODL env info:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;version: Oxygen-SR2 based (but the code for AbstractClientSession is almost identical on the master branch as well)&lt;/li&gt;
	&lt;li&gt;deployment: Single node&lt;/li&gt;
	&lt;li&gt;Uptime: approx 15 hours&lt;/li&gt;
	&lt;li&gt;TELL based&lt;/li&gt;
	&lt;li&gt;Xmx10G&lt;/li&gt;
	&lt;li&gt;cores == 12&lt;/li&gt;
	&lt;li&gt;OpenDaylight was running netconf southbound + a FRINX specific application
	&lt;ul&gt;
		&lt;li&gt;Netconf southbound was connected to 1000 devices and frequently reconnecting them due to a faulty network (intentionally)&lt;/li&gt;
		&lt;li&gt;FRINX app was listening for the mountpoints, reading config from them and then READing and WRITEing some data into Datastore&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
&lt;/ul&gt;
</description>
                <environment></environment>
        <key id="31668">CONTROLLER-1893</key>
            <summary>AbstractClientConnection deadlock</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.opendaylight.org/images/icons/priorities/major.svg">Medium</priority>
                        <status id="5" iconUrl="https://jira.opendaylight.org/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="rovarga">Robert Varga</assignee>
                                    <reporter username="mmarsalek">Maros Marsalek</reporter>
                        <labels>
                    </labels>
                <created>Tue, 7 May 2019 14:20:07 +0000</created>
                <updated>Thu, 16 May 2019 14:14:37 +0000</updated>
                            <resolved>Thu, 16 May 2019 14:14:37 +0000</resolved>
                                    <version>Oxygen SR2</version>
                                    <fixVersion>Sodium</fixVersion>
                    <fixVersion>Fluorine SR3</fixVersion>
                    <fixVersion>Neon SR2</fixVersion>
                                    <component>clustering</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                                                                <comments>
                            <comment id="66773" author="mmarsalek" created="Tue, 7 May 2019 14:27:40 +0000"  >&lt;p&gt;Regarding:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
karaf.log.7:2019-05-06T15:08:50,640 | ERROR | opendaylight-cluster-data-akka.actor.&lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt;-dispatcher-105 | AbstractClientConnection         | 203 - org.opendaylight.controller.cds-access-client - 1.3.2.Oxygen-SR2_4_2_1_rc3-frinxodl-SNAPSHOT | Queue ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka:&lt;span class=&quot;code-comment&quot;&gt;//opendaylight-cluster-data/user/shardmanager-config/member-1-shard-&lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt;-config#2024575561], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=&lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt;, dataTree=present}} has not seen progress in 1964 seconds, failing all requests&lt;/span&gt;&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I don&apos;t really know what happened here, or what the consequences would have been had the deadlock not occurred. Any details on this would also be helpful.&lt;/p&gt;


&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Maros&lt;/p&gt;

</comment>
                            <comment id="66775" author="rovarga" created="Tue, 7 May 2019 15:35:37 +0000"  >&lt;p&gt;cds-access-client logs from INFO level would be nice, as we need to understand the lifecycle of the two connections. The connection being poisoned seems to be the successor in this case &#8211; which would indicate the originating transaction was allocated while it was still not there &#8211; which seems to be 1964 seconds by the reconnecting connection&apos;s accounting (which itself may be inaccurate).&lt;/p&gt;

&lt;p&gt;If there is real activity going on, the connection should have reconnected and been replaced by a successor &#8211; which would have drained its queue, preventing the timer from being anything but a no-op.&lt;/p&gt;

</comment>
                            <comment id="66776" author="rovarga" created="Tue, 7 May 2019 17:23:52 +0000"  >&lt;p&gt;I would suggest picking up the reestablishment patches that went into Neon, which can help mask the issue by preventing poison from happening.&lt;/p&gt;</comment>
                            <comment id="66781" author="rovarga" created="Thu, 9 May 2019 08:50:44 +0000"  >&lt;p&gt;This seems similar to &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1745&quot; title=&quot;produce-transactions can get stuck when closing itemProducer&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1745&quot;&gt;&lt;del&gt;CONTROLLER-1745&lt;/del&gt;&lt;/a&gt; &#8211; hence moving poisoning outside the lock should help.&lt;/p&gt;</comment>
                            <comment id="66784" author="mmarsalek" created="Fri, 10 May 2019 07:29:53 +0000"  >&lt;p&gt;I can confirm that &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/81949/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/81949/&lt;/a&gt; works. Verified by a unit test, &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/81979/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/81979/&lt;/a&gt;, as well as by running an ODL instance.&lt;/p&gt;</comment>
                            <comment id="66785" author="mmarsalek" created="Fri, 10 May 2019 07:34:01 +0000"  >&lt;p&gt;Ad &quot;verified by running ODL instance&quot;: I think the same problem occurred again and the poison was triggered. This time, however, there was no deadlock. The Datastore responds, but with an error:&lt;/p&gt;

&lt;p&gt;See attached stacktrace: stacktrace.poison.txt&lt;/p&gt;
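
&lt;p&gt;For context, the general idea of moving poisoning outside the lock (as suggested earlier in this thread) can be sketched as below. This is a hypothetical illustration and not the code from the Gerrit change: QueueSketch, enqueue and poison are made-up names, and ReentrantLock is assumed as the lock type. The entries are detached while the lock is held, and their callbacks are invoked only after it has been released, so a callback that reaches into another connection never does so while still holding this one&apos;s lock.&lt;/p&gt;

```java
import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; QueueSketch is a stand-in, not the real AbstractClientConnection.
final class PoisonOutsideLockSketch {
    static final class QueueSketch {
        final ReentrantLock lock = new ReentrantLock();
        private Runnable[] pending = new Runnable[0];
        private boolean poisoned;

        // Registers a completion callback; refused once the queue has been poisoned.
        void enqueue(Runnable callback) {
            lock.lock();
            try {
                if (poisoned) {
                    throw new IllegalStateException("queue is poisoned");
                }
                Runnable[] grown = Arrays.copyOf(pending, pending.length + 1);
                grown[pending.length] = callback;
                pending = grown;
            } finally {
                lock.unlock();
            }
        }

        // Marks the queue poisoned and detaches the entries under the lock,
        // then runs the callbacks only after the lock has been released.
        void poison() {
            Runnable[] drained;
            lock.lock();
            try {
                poisoned = true;
                drained = pending;
                pending = new Runnable[0];
            } finally {
                lock.unlock();
            }
            for (Runnable callback : drained) {
                callback.run();
            }
        }
    }

    // Returns true if the callback ran and this queue's lock was not held at that point.
    static boolean run() {
        final QueueSketch queue = new QueueSketch();
        final boolean[] seen = new boolean[2];
        queue.enqueue(new Runnable() {
            @Override
            public void run() {
                seen[0] = true;
                seen[1] = queue.lock.isHeldByCurrentThread();
            }
        });
        queue.poison();
        if (!seen[0]) {
            return false;
        }
        return !seen[1];
    }

    public static void main(String[] args) {
        System.out.println("callback ran without the queue lock held: " + run());
    }
}
```

&lt;p&gt;The trade-off in a design like this is that callbacks can no longer assume the queue state is frozen while they run, since other threads may take the lock in between.&lt;/p&gt;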

&lt;p&gt;Robert, you mentioned &quot;reestablishment patches&quot; could help here. Could you give me a pointer, please?&lt;/p&gt;

&lt;p&gt;Also, I don&apos;t have &quot;cds-access-client logs from INFO&quot; yet, but I will try to acquire them from the next ODL deployment.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15207" name="gc.log.0" size="5244963" author="mmarsalek" created="Tue, 7 May 2019 14:19:59 +0000"/>
                            <attachment id="15206" name="gc.log.1.current" size="473438" author="mmarsalek" created="Tue, 7 May 2019 14:19:58 +0000"/>
                            <attachment id="15210" name="image-2019-05-07-16-01-17-465.png" size="26779" author="mmarsalek" created="Tue, 7 May 2019 14:01:17 +0000"/>
                            <attachment id="15209" name="jstack.txt" size="10358" author="mmarsalek" created="Tue, 7 May 2019 14:19:51 +0000"/>
                            <attachment id="15208" name="karaf.partial.log" size="661" author="mmarsalek" created="Tue, 7 May 2019 14:19:51 +0000"/>
                            <attachment id="15213" name="stacktrace.poison.txt" size="34758" author="mmarsalek" created="Fri, 10 May 2019 07:33:44 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i03npz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>