<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:56:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1789] Recover from shard stopped condition on akka persistence circuit-breaker failure</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1789</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;Under heavy transaction load in a cluster setup, an Akka Persistence circuit-breaker failure is sometimes observed. When this happens, the shard is stopped and never recovers. This bug requests that the shard recover from this condition.&lt;/p&gt;

&lt;p&gt;The problem is easy to reproduce:&lt;br/&gt;
1. On node-1, install the controller in an NFS-mounted directory.&lt;br/&gt;
2. Start all nodes and make sure node-1 is the leader of the target shard.&lt;br/&gt;
3. Issue write transactions against this shard at a high rate.&lt;br/&gt;
4. Stop and then start the NFS service on the server that exports the directory. The directory is not writable for some time.&lt;br/&gt;
5. During this period, Akka Persistence fails to write journal records. When this failure happens, the shard is stopped and is never started again.&lt;/p&gt;

&lt;p&gt;Increasing the call-timeout of the Akka Persistence journal circuit breaker (default 10s) makes the shard more tolerant of such outages. A write that takes longer than the new timeout still fails.&lt;/p&gt;
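&lt;p&gt;To see how close the NFS-mounted directory comes to the 10s default while the NFS service is restarted, one option is to time small synchronous writes there. The following is a minimal Python sketch and is not part of the controller. The function names, the 4 KiB write size and the directory path are hypothetical. A write followed by os.fsync is only assumed to be a rough proxy for a journal write.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
import os
import tempfile
import time


def timed_sync_write(directory, size=4096):
    """Write `size` bytes to a new file in `directory`, fsync it and return
    the elapsed wall-clock seconds. The file is removed afterwards."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.monotonic()
        os.write(fd, b'\0' * size)
        os.fsync(fd)  # force the data to stable storage before stopping the clock
        return time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)


def probe(directory, call_timeout=10.0, samples=10, interval=1.0):
    """Time `samples` synced writes; print each one and flag those slower
    than call_timeout. Returns the timings (None for a failed write)."""
    results = []
    for _ in range(samples):
        try:
            elapsed = timed_sync_write(directory)
        except OSError as exc:  # e.g. EIO or EROFS while the mount is unavailable
            print(f'write failed: {exc}')
            elapsed = None
        else:
            slow = elapsed > call_timeout
            print(f'{elapsed:.3f}s' + ('  exceeds call-timeout' if slow else ''))
        results.append(elapsed)
        time.sleep(interval)
    return results
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For example, probe('/mnt/nfs/odl', samples=60) could be left running during step 4. The path is a placeholder. On a hard NFS mount the write usually blocks rather than failing, so long timings are more likely than errors.&lt;/p&gt;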

&lt;p&gt;A few others have seen this issue; see &lt;a href=&quot;https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html&lt;/a&gt;.&lt;/p&gt;</description>
                <environment></environment>
        <key id="28772">CONTROLLER-1789</key>
            <summary>Recover from shard stopped condition on akka persistence circuit-breaker failure</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.opendaylight.org/images/icons/priorities/major.svg">Medium</priority>
                        <status id="10003" iconUrl="https://jira.opendaylight.org/images/icons/status_generic.gif" description="">Confirmed</status>
                    <statusCategory id="2" key="new" colorName="blue-gray"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="-1">Unassigned</assignee>
                                    <reporter username="ajayslele">Ajay Lele</reporter>
                        <labels>
                    </labels>
                <created>Tue, 14 Nov 2017 20:07:11 +0000</created>
                <updated>Tue, 18 May 2021 15:35:59 +0000</updated>
                                                                                <due></due>
                            <votes>0</votes>
                                    <watches>2</watches>
                                                                                                                <comments>
                            <comment id="64622" author="ajayslele" created="Mon, 13 Aug 2018 18:32:47 +0000"  >&lt;p&gt;The root cause of circuit-breaker failures is that disk I/O operations take longer than the circuit-breaker timeouts used by the Akka Persistence plugins. This can happen for several reasons: the disk is slow or saturated, the data store is very large (on the order of hundreds of MB), or transactions arrive at a very high rate. A combination of these factors can also cause it. Suggested troubleshooting steps:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Monitor disk performance with tools such as sar &lt;span class=&quot;error&quot;&gt;&amp;#91;0&amp;#93;&lt;/span&gt;, then identify and fix bottlenecks.&lt;/li&gt;
	&lt;li&gt;Tune the circuit-breaker settings of the Akka Persistence plugins. In controller/configuration/initial/akka.conf, override the defaults, especially call-timeout. See &lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt; and &lt;span class=&quot;error&quot;&gt;&amp;#91;2&amp;#93;&lt;/span&gt; for details. Example:&lt;/li&gt;
&lt;/ol&gt;


&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;akka {
  persistence {
    journal-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 60s
        reset-timeout = 30s
      }
    }
    snapshot-store-plugin-fallback {
      circuit-breaker {
        max-failures = 10
        call-timeout = 120s
        reset-timeout = 60s
      }
    }
  }
}&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
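&lt;p&gt;The following sketch illustrates how max-failures, call-timeout and reset-timeout interact. It is a simplified, single-threaded Python model of the closed, open and half-open states described in &lt;span class=&quot;error&quot;&gt;&amp;#91;2&amp;#93;&lt;/span&gt;, not Akka's implementation. The class and method names are hypothetical, and time and disk-I/O duration are passed in as plain numbers. The breaker does not have to open for a shard to stop. A single journal write that exceeds call-timeout is already reported to the persistent actor as a persist failure, and the Akka Persistence documentation states that the actor is then stopped. That matches the behaviour described in this issue.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;
```python
class CircuitBreaker:
    """Simplified model of a circuit breaker (closed / open / half-open).

    Not Akka's implementation: time and I/O duration are passed in
    explicitly so the state transitions can be followed step by step.
    """

    def __init__(self, max_failures, call_timeout, reset_timeout):
        self.max_failures = max_failures    # consecutive failures before opening
        self.call_timeout = call_timeout    # seconds a single call may take
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.state = 'closed'
        self.failures = 0
        self.opened_at = None

    def call(self, now, io_seconds):
        """Simulate one journal write starting at time `now` that needs
        `io_seconds` of disk I/O. Returns 'ok', 'timeout' or 'rejected'."""
        if self.state == 'open':
            if now - self.opened_at >= self.reset_timeout:
                self.state = 'half-open'  # allow a single trial call
            else:
                return 'rejected'         # fail fast without touching the disk
        if io_seconds > self.call_timeout:
            # The timeout fires call_timeout seconds after the call started.
            self._record_failure(now + self.call_timeout)
            return 'timeout'
        # Any success closes the breaker and resets the failure count.
        self.state = 'closed'
        self.failures = 0
        return 'ok'

    def _record_failure(self, when):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the breaker.
        if self.state == 'half-open' or self.failures >= self.max_failures:
            self.state = 'open'
            self.opened_at = when


# Hypothetical 25 s stall while the NFS server restarts.
# call-timeout 10 s (default mentioned in the description) vs 60 s (example above).
short = CircuitBreaker(max_failures=10, call_timeout=10, reset_timeout=30)
tuned = CircuitBreaker(max_failures=10, call_timeout=60, reset_timeout=30)
print(short.call(0, 25))  # timeout
print(tuned.call(0, 25))  # ok
```
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In this model, the 25s stall fails with a 10s call-timeout but completes with the 60s value from the journal example above. A stall longer than 60s would still fail.&lt;/p&gt;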
&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0&amp;#93;&lt;/span&gt; &lt;a href=&quot;https://linux.die.net/man/1/sar&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://linux.die.net/man/1/sar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt;&#160;&lt;a href=&quot;https://github.com/akka/akka/blob/master/akka-persistence/src/main/resources/reference.conf&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/blob/master/akka-persistence/src/main/resources/reference.conf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;2&amp;#93;&lt;/span&gt;&#160;&lt;a href=&quot;https://doc.akka.io/docs/akka/2.5/common/circuitbreaker.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://doc.akka.io/docs/akka/2.5/common/circuitbreaker.html&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="65620" author="rovarga" created="Wed, 14 Nov 2018 12:48:30 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=ajayslele&quot; class=&quot;user-hover&quot; rel=&quot;ajayslele&quot;&gt;ajayslele&lt;/a&gt;, any progress on this? Are you still working on it?&lt;/p&gt;</comment>
                            <comment id="65656" author="ajayslele" created="Wed, 14 Nov 2018 17:32:49 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=rovarga&quot; class=&quot;user-hover&quot; rel=&quot;rovarga&quot;&gt;rovarga&lt;/a&gt;, yes, I will be proposing a patch soon.&lt;/p&gt;</comment>
                            <comment id="68306" author="ajayslele" created="Sat, 27 Jun 2020 00:27:08 +0000"  >&lt;p&gt;The PR I opened for this &lt;span class=&quot;error&quot;&gt;&amp;#91;0&amp;#93;&lt;/span&gt; was in fairly good shape but did not get through review. Anyone who wants to work on this can pick it up from its current state.&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;0&amp;#93;&lt;/span&gt; &lt;a href=&quot;https://git.opendaylight.org/gerrit/c/controller/+/79328&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/c/controller/+/79328&lt;/a&gt;&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i037qf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>