<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:56:24 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1768] SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up.</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1768</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;This was seen in a netvirt 3node job:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-upstream-stateful-carbon/118/log.html.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-upstream-stateful-carbon/118/log.html.gz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The job has the karaf logs, and in the log for ODL2 I saw this:&lt;/p&gt;

&lt;p&gt;2017-09-08 12:43:45,101 | ERROR | Event Dispatcher | AbstractDataStore                | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up&lt;/p&gt;

&lt;p&gt;The system tests were passing OK until this point, after which a lot&lt;br/&gt;
of failures were seen.&lt;/p&gt;</description>
                <environment>&lt;p&gt;Operating System: All&lt;br/&gt;
Platform: All&lt;/p&gt;</environment>
        <key id="26322">CONTROLLER-1768</key>
            <summary>SyncStatus stays false for more than 5 minutes after bringing 2 of 3 nodes down and back up.</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.opendaylight.org/images/icons/priorities/critical.svg">High</priority>
                        <status id="10004" iconUrl="https://jira.opendaylight.org/images/icons/status_generic.gif" description="">Verified</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10003">Cannot Reproduce</resolution>
                                        <assignee username="-1">Unassigned</assignee>
                                    <reporter username="jluhrsen">Jamo Luhrsen</reporter>
                        <labels>
                            <label>csit:3node</label>
                    </labels>
                <created>Fri, 8 Sep 2017 23:10:18 +0000</created>
                <updated>Wed, 18 Jul 2018 21:35:16 +0000</updated>
                            <resolved>Wed, 18 Jul 2018 21:35:09 +0000</resolved>
                                    <version>Carbon</version>
                    <version>Oxygen</version>
                    <version>Oxygen SR3</version>
                                    <fixVersion>Fluorine</fixVersion>
                    <fixVersion>Oxygen SR3</fixVersion>
                                    <component>clustering</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>7</watches>
                                                                                                                <comments>
                            <comment id="52712" author="tpantelis" created="Sat, 9 Sep 2017 00:22:52 +0000"  >&lt;p&gt;It seems that shard leaders weren&apos;t resolved within 90 seconds but eventually were, around 2017-09-08 12:49, e.g.:&lt;/p&gt;

&lt;p&gt; INFO  | rd-dispatcher-30 | ShardManager                     | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | shard-manager-config: Received role changed for member-2-shard-inventory-config from Candidate to Leader&lt;/p&gt;

&lt;p&gt;I&apos;m not sure what the expectations of the test are but I suspect this is similar to &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1751&quot; title=&quot;Sporadic cluster failure when member is restarted in OF cluster test&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1751&quot;&gt;&lt;del&gt;CONTROLLER-1751&lt;/del&gt;&lt;/a&gt; where node connectivity occasionally gets delayed in the test environment for some reason (hypothesis is overload in the environment).&lt;/p&gt;</comment>
                            <comment id="52714" author="tpantelis" created="Sat, 9 Sep 2017 17:01:57 +0000"  >&lt;p&gt;I see in the bug summary that the test brings 2 out of the 3 nodes down then back up. The way akka works is that if a node becomes unreachable then it must become reachable or &quot;downed&quot; before it can allow another node to join. So if 2 nodes are unreachable then both have to come back before it allows either back in. We&apos;ve seen previously that this can take more than 5 min due to suspected intermittent load in the environment with 2 nodes restarting (and whatever other load there is in the VM environment). So most of the time it&apos;s relatively quick but sometimes it can take arbitrarily longer. So I would suggest increasing the 5 min expectation to avoid intermittent timeouts (I&apos;d say at least 15 min to be safe).&lt;/p&gt;</comment>
                            <comment id="52715" author="rovarga" created="Sun, 10 Sep 2017 08:15:05 +0000"  >&lt;p&gt;I wonder if the tests should use more aggressive retry timings. Also are ODL2/ODL3 brought up concurrently?&lt;/p&gt;</comment>
                            <comment id="52716" author="rovarga" created="Sun, 10 Sep 2017 08:15:46 +0000"  >&lt;p&gt;I meant retry timers at akka.conf layer &amp;#8211; as I think they do back off by default.&lt;/p&gt;</comment>
                            <comment id="52717" author="jluhrsen" created="Tue, 12 Sep 2017 22:48:57 +0000"  >&lt;p&gt;(In reply to Tom Pantelis from comment #3)&lt;br/&gt;
&amp;gt; I see in the bug summary that the test brings 2 out of the 3 nodes down then&lt;br/&gt;
&amp;gt; back up. The way akka works is that if a node becomes unreachable&lt;br/&gt;
&amp;gt; then it must become reachable or &quot;downed&quot; before it can allow another node&lt;br/&gt;
&amp;gt; to join. So if 2 nodes are unreachable then both have to come back before it&lt;br/&gt;
&amp;gt; allows either back in. We&apos;ve seen previously that this can take more than 5&lt;br/&gt;
&amp;gt; min due to suspected intermittent load in the environment with 2 nodes&lt;br/&gt;
&amp;gt; restarting (and whatever other load there is in the VM environment). So most&lt;br/&gt;
&amp;gt; of the time it&apos;s relatively quick but sometimes it can take arbitrarily&lt;br/&gt;
&amp;gt; longer. So I would suggest increasing the 5 min expectation to avoid&lt;br/&gt;
&amp;gt; intermittent timeouts (I&apos;d say at least 15 min to be safe).&lt;/p&gt;

&lt;p&gt;ok, I can change the test to wait for 15m for cluster sync, but what worries&lt;br/&gt;
me is that later in the same job, there are more fundamental failures from&lt;br/&gt;
netvirt&apos;s perspective (e.g. created networks not ending up in the config&lt;br/&gt;
store). Because of that I&apos;m worried that something is just broken under&lt;br/&gt;
the hood and waiting might not be the answer.&lt;/p&gt;

&lt;p&gt;but, I&apos;ll try.&lt;/p&gt;</comment>
                            <comment id="52718" author="jluhrsen" created="Tue, 12 Sep 2017 22:51:44 +0000"  >&lt;p&gt;(In reply to Robert Varga from comment #5)&lt;br/&gt;
&amp;gt; I meant retry timers at akka.conf layer &amp;#8211; as I think they do back off by&lt;br/&gt;
&amp;gt; default.&lt;/p&gt;

&lt;p&gt;do you have an example I can look at? It&apos;s not totally clear&lt;br/&gt;
to me what values I should tweak. This is worth a try too.&lt;/p&gt;</comment>
                            <comment id="52719" author="jluhrsen" created="Tue, 12 Sep 2017 22:52:58 +0000"  >&lt;p&gt;(In reply to Robert Varga from comment #4)&lt;br/&gt;
&amp;gt; I wonder if the tests should use more aggressive retry timings. Also are&lt;br/&gt;
&amp;gt; ODL2/ODL3 brought up concurrently?&lt;/p&gt;

&lt;p&gt;yeah, ODL2/ODL3 are brought up in the same 1-2s window.&lt;/p&gt;</comment>
                            <comment id="64012" author="jluhrsen" created="Wed, 11 Jul 2018 14:26:33 +0000"  >&lt;p&gt;just wanted to note that I have also seen this recently in a local setup with a recent Oxygen build&lt;/p&gt;</comment>
                            <comment id="64091" author="tpantelis" created="Wed, 18 Jul 2018 20:17:38 +0000"  >&lt;p&gt;On the kernel call, you mentioned you saw this when stopping/restarting 1 node. This issue is almost a year old and was related to 2 nodes stopping/restarting. From earlier analysis and comments from last year, it was determined that the 2 nodes did actually re-join but it happened after the 5 min test deadline. It was suggested to increase the test timeout. I assume there is still a controller test that does this (I know there was originally)? If so, if it hasn&apos;t been failing then can we close this issue?&lt;/p&gt;</comment>
                            <comment id="64092" author="jluhrsen" created="Wed, 18 Jul 2018 20:42:01 +0000"  >&lt;p&gt;so we want to use a new bug then?&lt;/p&gt;</comment>
                            <comment id="64093" author="tpantelis" created="Wed, 18 Jul 2018 20:54:18 +0000"  >&lt;p&gt;Well we already have &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1849&quot; title=&quot;controller not coming up healthy after being killed and restarted (401 after 5m)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1849&quot;&gt;&lt;del&gt;CONTROLLER-1849&lt;/del&gt;&lt;/a&gt; for a 1 node restart not rejoining. I assume the failure you saw that prompted your recent comment here was the same issue. So unless this issue is still current, let&apos;s close it.&lt;/p&gt;</comment>

&lt;p&gt;Keep in mind - the 401&apos;s due to the failed AAA reads and SyncStatus remaining false are all symptoms that the node did not rejoin the cluster on restart for whatever reason. At this point, AFAIK, this is only being seen occasionally when the first seed node is killed/restarted (ie &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1849&quot; title=&quot;controller not coming up healthy after being killed and restarted (401 after 5m)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1849&quot;&gt;&lt;del&gt;CONTROLLER-1849&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="64094" author="jluhrsen" created="Wed, 18 Jul 2018 21:33:19 +0000"  >&lt;p&gt;ok, I follow that logic. we&apos;ll open this back up if there is something specific about 2 nodes going down/up,&lt;br/&gt;
and we have 1849 to track all that we think is left at this point.&lt;/p&gt;</comment>
                            <comment id="64095" author="jluhrsen" created="Wed, 18 Jul 2018 21:35:09 +0000"  >&lt;p&gt;going with the assumption that this is not seen any more in this specific scenario&lt;br/&gt;
where 2 of 3 nodes are going down and back up. We do have a very similar bug&lt;br/&gt;
in &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1849&quot; title=&quot;controller not coming up healthy after being killed and restarted (401 after 5m)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1849&quot;&gt;&lt;del&gt;CONTROLLER-1849&lt;/del&gt;&lt;/a&gt; that only deals with one node (first seed node) going down&lt;br/&gt;
and up. We&apos;ll track that problem there.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10208" key="com.atlassian.jira.plugin.system.customfieldtypes:textfield">
                        <customfieldname>External issue ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9133</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10201" key="com.atlassian.jira.plugin.system.customfieldtypes:url">
                        <customfieldname>External issue URL</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[https://bugs.opendaylight.org/show_bug.cgi?id=9133]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i02slz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10203" key="com.atlassian.jira.plugin.system.customfieldtypes:textfield">
                        <customfieldname>Status Whiteboard</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>csit:sporadic_failures</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        </customfields>
    </item>
</channel>
</rss>