<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:55:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1273] Clustering: Shard actors are terminated when another cluster node is restarted</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1273</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;I&apos;m seeing strange behavior when a node in a 3-node cluster is restarted. It somehow causes the Shard and ShardManager actors on the other nodes in the cluster to terminate. I see these messages in the log:&lt;/p&gt;

&lt;p&gt;2015-04-22 13:11:08,317 | INFO  | lt-dispatcher-19 | Shard                            | 177 | 227 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT |  | Stopping Shard member-1-shard-topology-operational&lt;/p&gt;

&lt;p&gt;2015-04-22 13:11:08,323 | INFO  | lt-dispatcher-18 | ShardManager                     | 159 | 234 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT |  | Stopping ShardManager&lt;/p&gt;

&lt;p&gt;There are no other messages in the log except the usual akka INFO messages about node addresses being gated and nodes leaving and joining the cluster.&lt;/p&gt;

&lt;p&gt;Note that this occurs after the node is started up, not after it is shut down. Right after the Shard stopping messages above, I see the akka message that the downed node is now rejoining:&lt;/p&gt;

&lt;p&gt;2015-04-22 13:14:08,412 | INFO  | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$3 | 74 | 220 - com.typesafe.akka.slf4j - 2.3.9 |  | Cluster Node [akka.tcp://odl-cluster-rpc@127.0.0.1:2551] - Node [akka.tcp://odl-cluster-rpc@127.0.0.1:2555] is JOINING, roles []&lt;/p&gt;

&lt;p&gt;I have all 3 nodes running in the same VM on different ports, so I&apos;m not sure if that&apos;s a factor, but I&apos;ve been running with this setup for a while without seeing this issue.&lt;/p&gt;</description>
                <environment>&lt;p&gt;Operating System: All&lt;br/&gt;
Platform: All&lt;/p&gt;</environment>
        <key id="25827">CONTROLLER-1273</key>
            <summary>Clustering: Shard actors are terminated when another cluster node is restarted</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                                <status id="5" iconUrl="https://jira.opendaylight.org/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="-1">Unassigned</assignee>
                                    <reporter username="tpantelis">Tom Pantelis</reporter>
                        <labels>
                    </labels>
                <created>Thu, 23 Apr 2015 05:44:22 +0000</created>
                <updated>Tue, 25 Jul 2023 08:24:01 +0000</updated>
                            <resolved>Tue, 5 May 2015 16:45:18 +0000</resolved>
                                                                    <component>mdsal</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>1</watches>
                                                                                                                <comments>
                            <comment id="50488" author="moraja@cisco.com" created="Thu, 23 Apr 2015 11:41:48 +0000"  >&lt;p&gt;Can you attach the full log? Besides the logging what other behavior are you observing?&lt;/p&gt;</comment>
                            <comment id="50489" author="tpantelis" created="Thu, 23 Apr 2015 11:47:58 +0000"  >&lt;p&gt;The Shards and ShardManager disappear from JConsole.&lt;/p&gt;

&lt;p&gt;I&apos;m curious if you can reproduce this as well or if there&apos;s something fluky going on in my environment. I haven&apos;t dug into this yet with all the other patches going on right now. &lt;/p&gt;

</comment>
                            <comment id="50490" author="moraja@cisco.com" created="Thu, 23 Apr 2015 12:06:41 +0000"  >&lt;p&gt;I haven&apos;t seen this problem yet :) Possibly the ActorSystem was shut down; the logs may show why this happened.&lt;/p&gt;</comment>
                            <comment id="50491" author="tpantelis" created="Thu, 30 Apr 2015 20:15:26 +0000"  >&lt;p&gt;I found this link &lt;a href=&quot;https://groups.google.com/forum/#!topic/akka-user/jleFC7P66ao&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://groups.google.com/forum/#!topic/akka-user/jleFC7P66ao&lt;/a&gt; which appears to be recent. This outlines the same issue I see - it&apos;s a bug in akka. The issue is related to a node becoming reachable after being auto-downed.&lt;/p&gt;

&lt;p&gt;On my controller instances I still had auto-down-unreachable-after set to 10s. So I set it really high, and the actor system shutdown did not occur. That&apos;s good news. So it seems auto-downing is problematic at this point.&lt;/p&gt;

&lt;p&gt;However, the 3rd node, which I had restarted, didn&apos;t rejoin the cluster. On the other 2 nodes I see this INFO message every few seconds: &quot;Existing member [address of 3rd node] is trying to join, ignoring&quot;. Not sure what&apos;s going on there. Interestingly, the restarted node did become a follower, as it appears the leader was able to send heartbeats because we cache the remote actor address in the RaftActorContext. However, the restarted node did not have peer addresses for the other 2, so it did not get ClusterMemberUp messages.&lt;/p&gt;

&lt;p&gt;Also, akka&apos;s ClusterState mbean reported the stopped node as Unreachable, which makes sense. However, it also listed it in the members list as Up, which doesn&apos;t seem right. It continued to report this state even after the node was restarted.&lt;/p&gt;

&lt;p&gt;So it seems that either the cluster leader didn&apos;t Up the restarted node or the endpoint layer didn&apos;t report reachability, as evidenced by the &quot;ignoring&quot; message. It&apos;s unclear what it was ignoring and why.&lt;/p&gt;

&lt;p&gt;It seems akka&apos;s clustering code is either really funky or really buggy.&lt;/p&gt;</comment>
                            <comment id="50492" author="tpantelis" created="Thu, 30 Apr 2015 20:51:31 +0000"  >&lt;p&gt;After reading other posts, e.g. &lt;a href=&quot;https://groups.google.com/forum/#!topic/akka-user/AdRSv2yuwo4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://groups.google.com/forum/#!topic/akka-user/AdRSv2yuwo4&lt;/a&gt;, it seems clear that a node with the same host:port can&apos;t rejoin until the previous member with the same host:port has been removed from the cluster, which happens after auto-downing it. However, auto-downing seems to trigger the endpoint-layer bug that shuts down the actor system.&lt;/p&gt;

&lt;p&gt;I really don&apos;t get the rationale behind their design.&lt;/p&gt;</comment>
                            <comment id="50493" author="moraja@cisco.com" created="Thu, 30 Apr 2015 21:12:34 +0000"  >&lt;p&gt;Going through the 2.3.10 release notes (&lt;a href=&quot;http://akka.io/news/2015/04/23/akka-2.3.10-released.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://akka.io/news/2015/04/23/akka-2.3.10-released.html&lt;/a&gt;) I spotted this,&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;remove wrong assertion in remoting, which could lead to ActorSystem termination when restarted remote ActorSystem connects after being quarantined&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Maybe we should switch to akka 2.3.10; it has a bunch of remoting/clustering-related fixes.&lt;/p&gt;</comment>
                            <comment id="50494" author="tpantelis" created="Thu, 30 Apr 2015 21:15:12 +0000"  >&lt;p&gt;Yup - that&apos;s the issue. We should upgrade. &lt;/p&gt;

</comment>
                            <comment id="50495" author="moraja@cisco.com" created="Thu, 30 Apr 2015 21:32:08 +0000"  >&lt;p&gt;More details &lt;a href=&quot;https://github.com/akka/akka/issues/17213&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/17213&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="50496" author="tpantelis" created="Fri, 1 May 2015 01:17:07 +0000"  >&lt;p&gt;I tested upgrading to 2.3.10. I had auto-down-unreachable-after set to 10s and, after stopping and restarting a node after auto-down, the actor system shutdown issue didn&apos;t occur and the node successfully rejoined the cluster. So that issue is fixed.&lt;/p&gt;

&lt;p&gt;I also tested with auto-down-unreachable-after set high. After restarting a node, it successfully rejoined the cluster. So that issue is fixed as well. Tried it twice. &lt;/p&gt;

&lt;p&gt;Akka&apos;s ClusterState mbean still reported the member as both Up and Unreachable but I can live with that.&lt;/p&gt;

&lt;p&gt;Looks like we have a stable akka release with respect to remoting.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10208" key="com.atlassian.jira.plugin.system.customfieldtypes:textfield">
                        <customfieldname>External issue ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3049</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10201" key="com.atlassian.jira.plugin.system.customfieldtypes:url">
                        <customfieldname>External issue URL</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[https://bugs.opendaylight.org/show_bug.cgi?id=3049]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10206" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Issue Type</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10300"><![CDATA[Bug]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10204" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>ODL SR Target Milestone</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10315"><![CDATA[Lithium]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i02pjz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>