<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:55:26 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1396] Clustering: Node does not rejoin after restart</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1396</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description></description>
                <environment>&lt;p&gt;Operating System: All&lt;br/&gt;
Platform: All&lt;/p&gt;</environment>
        <key id="25950">CONTROLLER-1396</key>
            <summary>Clustering: Node does not rejoin after restart</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                                <status id="5" iconUrl="https://jira.opendaylight.org/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="gary.wu1@huawei.com">Gary Wu</assignee>
                                    <reporter username="ShaleenS">Shaleen Saxena</reporter>
                        <labels>
                    </labels>
                <created>Wed, 22 Jul 2015 15:44:51 +0000</created>
                <updated>Tue, 27 Oct 2015 15:34:31 +0000</updated>
                            <resolved>Tue, 27 Oct 2015 15:34:31 +0000</resolved>
                                    <version>Lithium</version>
                                                    <component>clustering</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>8</watches>
                                                                                                                <comments>
                            <comment id="50905" author="ssaxena@luminanetworks.com" created="Wed, 22 Jul 2015 15:58:48 +0000"  >&lt;p&gt;This issue is seen most commonly during clustering datastore integration tests. One out of five times, the &quot;140 Recovery Restart Follower&quot; test suite will fail. The first failed test case is &quot;Add cars to the first follower&quot;, and subsequent test cases also fail after that. From the log.html:&lt;/p&gt;

&lt;p&gt;the response of the POST to add car=&amp;lt;Response &lt;span class=&quot;error&quot;&gt;&amp;#91;500&amp;#93;&lt;/span&gt;&amp;gt;&lt;/p&gt;

&lt;p&gt;Looking in the logs, the newly started node has failed to join the cluster. The logs from all 3 members are attached. &lt;/p&gt;

&lt;p&gt;The three cluster nodes are: &lt;br/&gt;
 member-1 = 10.18.162.168&lt;br/&gt;
 member-2 = 10.18.162.170&lt;br/&gt;
 member-3 = 10.18.162.171&lt;/p&gt;

&lt;p&gt;The failure is seen in &quot;140 Recovery Restart Follower&quot; test. The time stamp for the start of test is 20:41:11.&lt;/p&gt;</comment>
                            <comment id="50932" author="ssaxena@luminanetworks.com" created="Wed, 22 Jul 2015 16:06:58 +0000"  >&lt;p&gt;Attachment Logs.zip has been added with description: Karaf logs from all members.&lt;/p&gt;</comment>
                            <comment id="50906" author="tpantelis" created="Thu, 23 Jul 2015 13:04:18 +0000"  >&lt;p&gt;I tested the scenario in the 140 test. The first seed node is node1 which became the akka cluster leader as expected. I stopped node2 and node3.&lt;/p&gt;

&lt;p&gt;node1 quickly declared the other nodes unreachable and continued to retry the connection, ie:&lt;/p&gt;

&lt;p&gt;2015-07-23 02:55:40,271 | WARN  | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$2 | 71 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Association with remote system &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt; has failed, address is now gated for &lt;span class=&quot;error&quot;&gt;&amp;#91;5000&amp;#93;&lt;/span&gt; ms. Reason: [Association failed with &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt;] Caused by: &lt;span class=&quot;error&quot;&gt;&amp;#91;Connection refused: /127.0.0.1:2552&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;I also got this message which is expected based on the akka docs:&lt;/p&gt;

&lt;p&gt;2015-07-23 02:56:16,234 | INFO  | ult-dispatcher-2 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -&amp;gt; akka.tcp://opendaylight-cluster-data@127.0.0.1:2552: Unreachable &lt;span class=&quot;error&quot;&gt;&amp;#91;Unreachable&amp;#93;&lt;/span&gt; (1), akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 -&amp;gt; akka.tcp://opendaylight-cluster-data@127.0.0.1:2554: Unreachable &lt;span class=&quot;error&quot;&gt;&amp;#91;Unreachable&amp;#93;&lt;/span&gt; (2)], member status: &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@127.0.0.1:2552 Up seen=false, akka.tcp://opendaylight-cluster-data@127.0.0.1:2554 Up seen=false&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;After restarting node2:&lt;/p&gt;

&lt;p&gt;2015-07-23 02:56:38,965 | INFO  | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - New incarnation of existing member &lt;span class=&quot;error&quot;&gt;&amp;#91;Member(address = akka.tcp://opendaylight-cluster-data@127.0.0.1:2552, status = Up)&amp;#93;&lt;/span&gt; is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.&lt;br/&gt;
2015-07-23 02:56:38,965 | INFO  | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - Marking unreachable node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt; as &lt;span class=&quot;error&quot;&gt;&amp;#91;Down&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;And then this message repeated over and over every 11 sec....&lt;/p&gt;

&lt;p&gt;2015-07-23 02:56:49,932 | INFO  | lt-dispatcher-24 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - New incarnation of existing member &lt;span class=&quot;error&quot;&gt;&amp;#91;Member(address = akka.tcp://opendaylight-cluster-data@127.0.0.1:2552, status = Down)&amp;#93;&lt;/span&gt; is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.&lt;/p&gt;

&lt;p&gt;So node1 wouldn&apos;t allow node2 back in until about 5 min, after which node2 and node3 were auto-downed and node2 was allowed to rejoin. &lt;/p&gt;

&lt;p&gt;I tried it with auto-down-unreachable-after set to off. This time node2 wasn&apos;t allowed to rejoin until node3 was started:&lt;/p&gt;

&lt;p&gt;2015-07-23 03:11:17,235 | INFO  | lt-dispatcher-16 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - Leader is removing unreachable node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-07-23 03:11:17,325 | WARN  | lt-dispatcher-22 | receive$1$$anonfun$applyOrElse$2 | 71 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Association with remote system &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt; has failed, address is now gated for &lt;span class=&quot;error&quot;&gt;&amp;#91;5000&amp;#93;&lt;/span&gt; ms. Reason: &lt;span class=&quot;error&quot;&gt;&amp;#91;Disassociated&amp;#93;&lt;/span&gt; &lt;br/&gt;
2015-07-23 03:11:25,935 | INFO  | ult-dispatcher-2 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt; is JOINING, roles &lt;span class=&quot;error&quot;&gt;&amp;#91;member-2&amp;#93;&lt;/span&gt;&lt;br/&gt;
2015-07-23 03:11:26,234 | INFO  | lt-dispatcher-14 | receive$1$$anonfun$applyOrElse$3 | 74 | 236 - com.typesafe.akka.slf4j - 2.3.10 |  | Cluster Node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2550&amp;#93;&lt;/span&gt; - Leader is moving node &lt;span class=&quot;error&quot;&gt;&amp;#91;akka.tcp://opendaylight-cluster-data@127.0.0.1:2552&amp;#93;&lt;/span&gt; to &lt;span class=&quot;error&quot;&gt;&amp;#91;Up&amp;#93;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Setting auto-down-unreachable-after to the original 10s, nodes 2 &amp;amp; 3 are auto-downed and removed quickly and thus are allowed to rejoin quickly after restart.&lt;/p&gt;

&lt;p&gt;So when node(s) are unreachable, the leader can&apos;t perform its duties, such as allowing nodes to join. I had previously thought this only applied to new nodes that hadn&apos;t previously joined, but apparently that&apos;s not the case - it also applies to previously joined nodes that are unreachable. &lt;/p&gt;

&lt;p&gt;So the behavior seems to be that all unreachable nodes must become reachable or downed before any are allowed back by the leader. This doesn&apos;t seem right.&lt;/p&gt;

&lt;p&gt;The interesting (or weird) part is that node2 remained a follower, saw node1 as the shard leader, and continued to receive heartbeats from node1 even though node2 wasn&apos;t allowed to rejoin the akka cluster and was still seen as unreachable. So things seem OK between the two until you try to initiate a transaction on node2. Since node2 didn&apos;t get the MemberUp event for node1, it didn&apos;t have node1&apos;s actor address, so transactions on node2 failed (from restconf).&lt;/p&gt;

&lt;p&gt;So akka remoting had a connection from node1 -&amp;gt; node2 and allowed messages to be sent while akka clustering deemed node2 unreachable. That seems broken: a major disconnect between the two components.&lt;/p&gt;

&lt;p&gt;Based on my testing, I don&apos;t see how the 140 test works at all when just one of the followers is restarted and it tries to add cars on that follower.&lt;/p&gt;</comment>
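The auto-down behavior being toggled in the test runs above is controlled by a single Akka cluster setting. A minimal illustrative akka.conf fragment, assuming the standard Akka 2.3 configuration layout (the exact file location and surrounding keys in the ODL distribution may differ):

```
akka {
  cluster {
    # "off" disables auto-down entirely (unreachable members stay in the
    # cluster until downed manually); 10s was the original aggressive
    # value that lets restarted members rejoin quickly.
    auto-down-unreachable-after = 10s
  }
}
```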
                            <comment id="50908" author="tpantelis" created="Thu, 23 Jul 2015 13:16:41 +0000"  >&lt;p&gt;Actually I do know how the 140 test works. The CDS will block startup waiting for shard leaders to be elected up to 90s. With both data stores it blocks for 3 min. The test waits for the cluster-test-app to be started which happens after 3 min. So the 3 min startup combined with the time it takes to shutdown the nodes and the 1 min retries for add cars to succeed gives the 5 min auto-down enough time to kick in.&lt;/p&gt;</comment>
                            <comment id="50909" author="tpantelis" created="Thu, 23 Jul 2015 15:03:15 +0000"  >&lt;p&gt;I played around with this a bit more. I stopped node3 and also invoked the &quot;leave&quot; operation manually via the akka JMX. As advertised, the cluster leader immediately transitioned node3 to exiting and removed it from the cluster. Then I stopped and restarted node2 and it quickly re-joined the cluster as node3 was removed. &lt;/p&gt;

&lt;p&gt;So &quot;leave&quot; is a graceful and immediate way to tell the leader to remove the node w/o waiting for unreachable and down to occur. This seems to be a reasonable solution to this issue (in lieu of setting auto-down-unreachable-after back to a low value which has other issues). I would think akka would issue a &quot;leave&quot; automatically on graceful shutdown but it doesn&apos;t. Maybe there&apos;s a setting for that. Otherwise we should be able to issue the leave programmatically on shutdown.&lt;/p&gt;

&lt;p&gt;Of course this wouldn&apos;t apply in the case of an ungraceful shutdown or network partition involving multiple nodes but those scenarios will be uncommon.&lt;/p&gt;</comment>
                            <comment id="50910" author="colin@colindixon.com" created="Thu, 23 Jul 2015 19:50:55 +0000"  >&lt;p&gt;It appears that this is caused by and/or related to this bug in Akka&lt;br/&gt;
&lt;a href=&quot;https://github.com/akka/akka/issues/13584&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/13584&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="50911" author="tpantelis" created="Fri, 24 Jul 2015 01:37:58 +0000"  >&lt;p&gt;I opened a new issue &lt;a href=&quot;https://github.com/akka/akka/issues/18067&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/18067&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="50912" author="colin@colindixon.com" created="Thu, 13 Aug 2015 16:11:58 +0000"  >&lt;p&gt;I wrote up this summary of the problem:&lt;/p&gt;


&lt;p&gt;Clustering Autodown Issue&lt;/p&gt;

&lt;p&gt;Problem Statement:&lt;br/&gt;
===&lt;br/&gt;
There is no &quot;right&quot; setting for how and when to down nodes. We are in a catch-22.&lt;/p&gt;

&lt;p&gt;If a node is downed, it has to be rebooted (or at least its actor system has to be rebooted) to get a new UUID in order to rejoin. This means downing nodes too aggressively requires regular manual intervention to maintain fault tolerance.&lt;/p&gt;

&lt;p&gt;Unreachable nodes can only be moved to up when the cluster is converged. This requires that all unreachable nodes be declared down, or all become reachable, before any unreachable node can be declared up again. This causes problems if you have 3 nodes and 2 become unreachable. Even if one becomes reachable again (in theory returning the cluster to a good state), it can&apos;t be moved to up unless the other node is declared down or becomes reachable again. This means that unless we down nodes reasonably aggressively, we can see periods of unnecessary unavailability.&lt;/p&gt;

&lt;p&gt;This is tracked in ODL as &lt;a href=&quot;https://jira.opendaylight.org/browse/CONTROLLER-1396&quot; title=&quot;Clustering: Node does not rejoin after restart&quot; class=&quot;issue-link&quot; data-issue-key=&quot;CONTROLLER-1396&quot;&gt;&lt;del&gt;CONTROLLER-1396&lt;/del&gt;&lt;/a&gt; (see below)&lt;/p&gt;

&lt;p&gt;Akka Definitions:&lt;br/&gt;
===&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;unreachable: the failure detector has marked this node as likely down and quarantined it; if it becomes reachable it will be allowed back in&lt;/li&gt;
	&lt;li&gt;downed: the node has been marked as dead and will not be allowed to rejoin&lt;/li&gt;
	&lt;li&gt;convergence: there are no unreachable nodes, i.e., all &quot;members&quot; are up or down&lt;br/&gt;
From: &lt;a href=&quot;http://doc.akka.io/docs/akka/snapshot/common/cluster.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://doc.akka.io/docs/akka/snapshot/common/cluster.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Solutions:&lt;br/&gt;
===&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;First, fix Akka&apos;s logic:
	&lt;ul&gt;
		&lt;li&gt;Possibly, fix it so that we allow nodes to rejoin without having &quot;convergence&quot; (this is Akka issue 18067 &lt;span class=&quot;error&quot;&gt;&amp;#91;see below&amp;#93;&lt;/span&gt; and they aren&apos;t super optimistic about it)&lt;/li&gt;
		&lt;li&gt;Possibly, implement our own auto-down version which would ???. This would likely require help from Akka/Typesafe and maybe consulting.&lt;/li&gt;
	&lt;/ul&gt;&lt;/li&gt;
	&lt;li&gt;Second, implement our Akka actors so that they detect whether they&apos;ve been downed and can reboot themselves automatically.&lt;/li&gt;
	&lt;li&gt;Third, have nodes leave on graceful shutdown. This is only a partial fix, as it won&apos;t help non-graceful shutdowns.&lt;/li&gt;
	&lt;li&gt;Fourth, use something other than Akka clustering:
	&lt;ul&gt;
		&lt;li&gt;Maybe just use Akka remoting, but not clustering.&lt;/li&gt;
		&lt;li&gt;Maybe just use something like AMQP.&lt;/li&gt;
	&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;a href=&quot;https://bugs.opendaylight.org/show_bug.cgi?id=4037&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugs.opendaylight.org/show_bug.cgi?id=4037&lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://github.com/akka/akka/issues/18067#issuecomment-129323444&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/18067#issuecomment-129323444&lt;/a&gt;&lt;/p&gt;</comment>
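The catch-22 in the summary above comes down to the convergence rule: the leader may only move JOINING members to Up when no member is flagged unreachable. A toy Python model of that rule (not Akka code; all names here are illustrative) reproduces the stuck-rejoin behavior seen in the 140 test:

```python
# Toy model of the Akka convergence rule described above: the leader
# only performs its duties (moving joining members to Up) when no
# member is currently flagged unreachable.

class ToyCluster:
    def __init__(self):
        self.members = {}        # name -> status: "up", "down", "joining"
        self.unreachable = set() # names the failure detector has flagged

    def mark_unreachable(self, name):
        self.unreachable.add(name)

    def mark_reachable(self, name):
        self.unreachable.discard(name)

    def down(self, name):
        # downing removes the member from the unreachable set
        self.members[name] = "down"
        self.unreachable.discard(name)

    def request_join(self, name):
        self.members[name] = "joining"
        self.leader_tick()

    def converged(self):
        # convergence: no member is currently unreachable
        return len(self.unreachable) == 0

    def leader_tick(self):
        # the leader can act only under convergence
        if self.converged():
            for name, status in self.members.items():
                if status == "joining":
                    self.members[name] = "up"

cluster = ToyCluster()
cluster.members = {"node1": "up", "node2": "up", "node3": "up"}
cluster.mark_unreachable("node2")
cluster.mark_unreachable("node3")

# node2 restarts and tries to rejoin while node3 is still unreachable:
cluster.mark_reachable("node2")
cluster.request_join("node2")
print(cluster.members["node2"])   # still "joining": node3 blocks convergence

# once node3 is downed (or becomes reachable), the leader can act:
cluster.down("node3")
cluster.leader_tick()
print(cluster.members["node2"])   # "up"
```

This is the behavior Tom observed: node2 could not rejoin until node3 was either started or auto-downed.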
                            <comment id="50913" author="colin@colindixon.com" created="Thu, 13 Aug 2015 16:22:37 +0000"  >&lt;p&gt;Based on some talking with TomP and discussion on the Akka issue, I think the best solution in the short run might be the following:&lt;/p&gt;

&lt;p&gt;1.) Set an aggressive, but reasonable auto-down timeout&lt;br/&gt;
2.) When we get a MemberDown event for any remote cluster instance, we send a YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10 seconds&lt;br/&gt;
3.) If a cluster instance gets a YouAreDown message, then it reboots itself to be able to rejoin the cluster despite being auto-downed&lt;/p&gt;

&lt;p&gt;We could optimize step 2 a bit if we knew when the node was reachable despite being down and only send the message then, but the core idea is the same.&lt;/p&gt;</comment>
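The three steps above, plus Tom's refinement in the next comment (stop the timer on MemberUp, and ignore YouAreDown until a node has seen its own first MemberUp), can be sketched as a small Python model. The message and method names are hypothetical, not real Akka API:

```python
# Sketch of the proposed YouAreDown recovery protocol. "YouAreDown",
# "peer_tick", and the event-handler names are illustrative inventions;
# only the protocol steps come from the comments above.

class Node:
    def __init__(self, name):
        self.name = name
        self.restarts = 0            # actor-system restarts, not process restarts
        self.saw_member_up = False   # ignore YouAreDown until our first MemberUp

    def on_member_up(self, member):
        if member == self.name:
            self.saw_member_up = True

    def on_you_are_down(self):
        # Refinement: only honor YouAreDown after we have seen our own
        # MemberUp, minimizing the chance of a "false" redundant restart.
        if self.saw_member_up:
            self.restarts += 1           # step 3: reboot the actor system
            self.saw_member_up = False   # the fresh incarnation must rejoin

def peer_tick(downed_members, nodes):
    # step 2: peers repeatedly nudge members they saw transition to Down
    for name in downed_members:
        nodes[name].on_you_are_down()

nodes = {"node2": Node("node2")}
nodes["node2"].on_member_up("node2")   # node2 joined normally

# node2 gets partitioned and auto-downed; when the partition heals,
# peers deliver YouAreDown and node2 reboots its actor system:
peer_tick({"node2"}, nodes)
print(nodes["node2"].restarts)         # 1

# repeated nudges before it rejoins do not pile up extra restarts:
peer_tick({"node2"}, nodes)
print(nodes["node2"].restarts)         # still 1
```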
                            <comment id="50914" author="tpantelis" created="Thu, 13 Aug 2015 18:02:51 +0000"  >&lt;p&gt;When a MemberUp occurs for the downed node, stop the YouAreDown timer. Also, I think a node could ignore YouAreDown until it receives the first MemberUp for itself. That should really minimize the chance of a &quot;false&quot; redundant restart, as the other nodes would also get the MemberUp within a short period of time.&lt;/p&gt;

&lt;p&gt;(In reply to Colin Dixon from comment #10)&lt;br/&gt;
&amp;gt; Based on some talking with TomP and discussion on the Akka issue, I think&lt;br/&gt;
&amp;gt; the best solution in the short run might be the following:&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; 1.) Set an aggressive, but reasonable auto-down timeout&lt;br/&gt;
&amp;gt; 2.) When we get a MemberDown event any remote cluster instance, we send a&lt;br/&gt;
&amp;gt; YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10&lt;br/&gt;
&amp;gt; seconds&lt;br/&gt;
&amp;gt; 3.) If a cluster instance gets a YouAreDown message, then it reboots itself&lt;br/&gt;
&amp;gt; to be able to rejoin the cluster despite being auto-downed&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; We could optimize step 2 a bit if we knew when the node was reachable&lt;br/&gt;
&amp;gt; despite being down and only send the message then, but the core idea is the&lt;br/&gt;
&amp;gt; same.&lt;/p&gt;</comment>
                            <comment id="50915" author="phillip.shea@hp.com" created="Thu, 13 Aug 2015 18:08:15 +0000"  >&lt;p&gt;(In reply to Colin Dixon from comment #10)&lt;/p&gt;

&lt;p&gt;Wouldn&apos;t setting auto-down to 10 seconds, then rebooting to rejoin, cause performance to fall off a cliff in an environment with intermittent connections? It takes over a minute to restart with a minimum set of modules installed. This would mean a 10-second interruption would result in the controller being unavailable for more than a minute. Am I wrong?&lt;/p&gt;



&lt;p&gt;&amp;gt; Based on some talking with TomP and discussion on the Akka issue, I think&lt;br/&gt;
&amp;gt; the best solution in the short run might be the following:&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; 1.) Set an aggressive, but reasonable auto-down timeout&lt;br/&gt;
&amp;gt; 2.) When we get a MemberDown event any remote cluster instance, we send a&lt;br/&gt;
&amp;gt; YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10&lt;br/&gt;
&amp;gt; seconds&lt;br/&gt;
&amp;gt; 3.) If a cluster instance gets a YouAreDown message, then it reboots itself&lt;br/&gt;
&amp;gt; to be able to rejoin the cluster despite being auto-downed&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; We could optimize step 2 a bit if we knew when the node was reachable&lt;br/&gt;
&amp;gt; despite being down and only send the message then, but the core idea is the&lt;br/&gt;
&amp;gt; same.&lt;/p&gt;</comment>
                            <comment id="50916" author="tpantelis" created="Thu, 13 Aug 2015 18:33:52 +0000"  >&lt;p&gt;We would just restart the actor systems, not the process. It looks like we have no choice based on the discussions with the akka dev. He did mention a new feature that is not released yet that he couldn&apos;t talk about. Even if this new feature helps us here, it will be a while before we could upgrade akka.&lt;/p&gt;

&lt;p&gt;If Colin&apos;s idea works and with auto-down set lower (e.g. 30s to give it some cushion), the issue with the &quot;140&quot; tests would be alleviated and we would also have automatic recovery of partitioned nodes. &lt;/p&gt;

&lt;p&gt;A potential issue could be if both sides of the partition somehow see each other down and, upon healing, each sends YouAreDown to the other side. Not sure if that could happen. If so, it probably could only happen in clusters larger than 3.&lt;/p&gt;

&lt;p&gt;(In reply to Phillip Shea from comment #12)&lt;br/&gt;
&amp;gt; (In reply to Colin Dixon from comment #10)&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; Wouldn&apos;t setting auto-down to 10 seconds, then rebooting to rejoin cause&lt;br/&gt;
&amp;gt; performance fall of a cliff in a environment with intermittent connections?&lt;br/&gt;
&amp;gt; It takes over a minute to restart with a minimum set of modules installed.&lt;br/&gt;
&amp;gt; This would mean a 10 second interruption would result in the controller&lt;br/&gt;
&amp;gt; being unavailable for more than a minute. Am I wrong?&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; Based on some talking with TomP and discussion on the Akka issue, I think&lt;br/&gt;
&amp;gt; &amp;gt; the best solution in the short run might be the following:&lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; 1.) Set an aggressive, but reasonable auto-down timeout&lt;br/&gt;
&amp;gt; &amp;gt; 2.) When we get a MemberDown event any remote cluster instance, we send a&lt;br/&gt;
&amp;gt; &amp;gt; YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10&lt;br/&gt;
&amp;gt; &amp;gt; seconds&lt;br/&gt;
&amp;gt; &amp;gt; 3.) If a cluster instance gets a YouAreDown message, then it reboots itself&lt;br/&gt;
&amp;gt; &amp;gt; to be able to rejoin the cluster despite being auto-downed&lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; We could optimize step 2 a bit if we knew when the node was reachable&lt;br/&gt;
&amp;gt; &amp;gt; despite being down and only send the message then, but the core idea is the&lt;br/&gt;
&amp;gt; &amp;gt; same.&lt;/p&gt;</comment>
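The quoted three-step plan (aggressive auto-down timeout, repeated YouAreDown sends, reboot on receipt) can be sketched in plain Java. This is an illustrative sketch only: sendYouAreDown() is a placeholder for an Akka remoting send, not a real Akka API, and the scheduling uses java.util.concurrent rather than Akka's scheduler:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of step 2 above: on a MemberDown event, keep nudging
// the downed node every 10 seconds until it reboots and rejoins.
class YouAreDownRetry {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    int sent = 0;  // visible for testing

    void onMemberDown(String downedNodeAddress) {
        // Step 2: resend YouAreDown every 10 seconds to the downed node.
        scheduler.scheduleAtFixedRate(
                () -> sendYouAreDown(downedNodeAddress),
                0, 10, TimeUnit.SECONDS);
    }

    void sendYouAreDown(String address) {
        sent++;
        // Placeholder: in the real design this would be an Akka remoting
        // message; the receiver (step 3) would reboot its ActorSystem.
        System.out.println("YouAreDown to " + address);
    }
}
```

As the later comments note, this scheme breaks down once the association to the downed node is quarantined, which is why the final approach pivoted to quarantine detection instead.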
                            <comment id="50917" author="phillip.shea@hp.com" created="Thu, 13 Aug 2015 18:35:28 +0000"  >&lt;p&gt;(In reply to Tom Pantelis from comment #13)&lt;/p&gt;

&lt;p&gt;Cool. Thanks for the explanation, Tom.&lt;/p&gt;

&lt;p&gt;&amp;gt; We would just restart the actor systems, not the process. It looks like we&lt;br/&gt;
&amp;gt; have no choice based on the discussions with the akka dev. He did mention a&lt;br/&gt;
&amp;gt; new feature that is not released yet that he couldn&apos;t talk about. Even if&lt;br/&gt;
&amp;gt; this new feature helps us here, it will be a while before we could upgrade&lt;br/&gt;
&amp;gt; akka.&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; If Colin&apos;s idea works and with auto-down set lower (e.g. 30s to give it some&lt;br/&gt;
&amp;gt; cushion), the issue with the &quot;140&quot; tests would be alleviated and we would&lt;br/&gt;
&amp;gt; also have automatic recovery of partitioned nodes. &lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; A potential issue could be if both sides of the partition somehow see each&lt;br/&gt;
&amp;gt; other down and, upon healing, each sends YouAreDown to the other side. Not&lt;br/&gt;
&amp;gt; sure if that could happen. If so, it probably could only happen in clusters&lt;br/&gt;
&amp;gt; larger than 3.&lt;br/&gt;
&amp;gt;  &lt;br/&gt;
&amp;gt; (In reply to Phillip Shea from comment #12)&lt;br/&gt;
&amp;gt; &amp;gt; (In reply to Colin Dixon from comment #10)&lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; Wouldn&apos;t setting auto-down to 10 seconds, then rebooting to rejoin cause&lt;br/&gt;
&amp;gt; &amp;gt; performance fall of a cliff in a environment with intermittent connections?&lt;br/&gt;
&amp;gt; &amp;gt; It takes over a minute to restart with a minimum set of modules installed.&lt;br/&gt;
&amp;gt; &amp;gt; This would mean a 10 second interruption would result in the controller&lt;br/&gt;
&amp;gt; &amp;gt; being unavailable for more than a minute. Am I wrong?&lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; Based on some talking with TomP and discussion on the Akka issue, I think&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; the best solution in the short run might be the following:&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; 1.) Set an aggressive, but reasonable auto-down timeout&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; 2.) When we get a MemberDown event any remote cluster instance, we send a&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; YouAreDown message via Akka remoting to that node repeatedly, e.g., every 10&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; seconds&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; 3.) If a cluster instance gets a YouAreDown message, then it reboots itself&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; to be able to rejoin the cluster despite being auto-downed&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; &lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; We could optimize step 2 a bit if we knew when the node was reachable&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; despite being down and only send the message then, but the core idea is the&lt;br/&gt;
&amp;gt; &amp;gt; &amp;gt; same.&lt;/p&gt;</comment>
                            <comment id="50918" author="colin@colindixon.com" created="Thu, 13 Aug 2015 19:31:28 +0000"  >&lt;p&gt;Short version: if we really have intermittent connections such that we routinely lose nodes for 10+ seconds, yes.&lt;/p&gt;

&lt;p&gt;In reality, I&apos;m not sure we can realistically function anyway in such an environment regardless of this solution.&lt;/p&gt;

&lt;p&gt;(In reply to Phillip Shea from comment #12)&lt;br/&gt;
&amp;gt; (In reply to Colin Dixon from comment #10)&lt;br/&gt;
&amp;gt; &lt;br/&gt;
&amp;gt; Wouldn&apos;t setting auto-down to 10 seconds, then rebooting to rejoin cause&lt;br/&gt;
&amp;gt; performance fall of a cliff in a environment with intermittent connections?&lt;br/&gt;
&amp;gt; It takes over a minute to restart with a minimum set of modules installed.&lt;br/&gt;
&amp;gt; This would mean a 10 second interruption would result in the controller&lt;br/&gt;
&amp;gt; being unavailable for more than a minute. Am I wrong?&lt;/p&gt;</comment>
                            <comment id="50919" author="gary.wu1@huawei.com" created="Thu, 10 Sep 2015 18:51:39 +0000"  >&lt;p&gt;I&apos;ll be working on this issue per Tom&apos;s request.&lt;/p&gt;</comment>
                            <comment id="50920" author="gary.wu1@huawei.com" created="Fri, 18 Sep 2015 22:04:03 +0000"  >&lt;p&gt;A quick update on where I&apos;m at with this bug.&lt;/p&gt;

&lt;p&gt;I implemented a prototype using the YouAreDown message mechanism suggested by Colin and Tom.&lt;/p&gt;

&lt;p&gt;While testing this out, I ran into an issue: sometimes the YouAreDown messages would not make it to the destination node because the Akka association (connection) to the auto-downed node had been quarantined.  When this happens, there&apos;s no way to communicate with the auto-downed node to tell it to restart.&lt;/p&gt;

&lt;p&gt;Since the quarantine is what requires the ActorSystem reboot in the first place, I&apos;m now exploring the possibility of relying on Akka&apos;s own quarantine detection, using that as the signal to reboot the ActorSystem instead of our own YouAreDown message mechanism.&lt;/p&gt;

&lt;p&gt;This is going okay so far, except for two issues:&lt;/p&gt;

&lt;p&gt;1. Even though Akka can accurately detect being quarantined by a remote node, it doesn&apos;t surface this as a named exception.  So I&apos;m having to resort to string matching on the exception message, which is fragile.&lt;/p&gt;

&lt;p&gt;2. Occasionally, even after the ActorSystem has been restarted, it will fail to rejoin the cluster properly (i.e. the other nodes fail to get the MemberUp event).  In a three-node cluster, this happens on just one of the nodes.  I&apos;m still investigating this one.&lt;/p&gt;

&lt;p&gt;Since we have a separate Akka cluster for the RPC system, I guess we would need to implement something similar for that as well?&lt;/p&gt;</comment>
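The fragile string-matching workaround described in issue 1 above can be sketched in plain Java. This is an illustrative sketch only; the "quarantined" marker string is an assumption based on the wording of Akka 2.3.x error messages, not a stable API:

```java
import java.util.Locale;

// Sketch of the fragile workaround described above: Akka 2.3.x raises no
// named exception for being quarantined by a remote system, so the only
// hook is the exception message text itself.
class QuarantineDetector {
    // Fragment assumed to appear in Akka's quarantine-related errors; if
    // Akka ever rewords its messages, this check silently stops working.
    private static final String MARKER = "quarantined";

    static boolean looksQuarantined(String exceptionMessage) {
        if (exceptionMessage == null) {
            return false;
        }
        return exceptionMessage.toLowerCase(Locale.ROOT).contains(MARKER);
    }
}
```

The fragility Gary notes is visible here: the detector is only as good as the marker string staying stable across Akka releases.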
<comment id="50921" author="tpantelis" created="Fri, 18 Sep 2015 22:19:57 +0000"  >&lt;p&gt;Thanks for exploring this. I was afraid the YouAreDown message might get blocked. It would be ideal if akka could provide some indication to our code on the partitioned node that it&apos;s been quarantined - it sounds like you may have achieved that?&lt;/p&gt;

&lt;p&gt;I imagine we would have to do the same with the RPC actor system. However we&apos;ve talked about using just one actor system. I don&apos;t remember if there&apos;s a bug for that but that could be another task to work on if you want.&lt;/p&gt;</comment>
                            <comment id="50922" author="gary.wu1@huawei.com" created="Fri, 18 Sep 2015 23:27:06 +0000"  >&lt;p&gt;Quarantine detection works if we&apos;re okay with doing string matching on the exception message. Hopefully Akka doesn&apos;t change their exception messages often.&lt;/p&gt;

&lt;p&gt;Right now the main challenge is figuring out why restarting the ActorSystem usually, but not always, allows the node to rejoin the cluster.&lt;/p&gt;

&lt;p&gt;I did observe, on occasion, nodes on the two sides of a partition mutually quarantining each other.  I haven&apos;t yet put much thought into how that case should be handled.&lt;/p&gt;

&lt;p&gt;If there is an existing bug on consolidating the datastore and RPC cluster systems, go ahead and send it my way.&lt;/p&gt;</comment>
                            <comment id="50923" author="gary.wu1@huawei.com" created="Wed, 30 Sep 2015 20:53:52 +0000"  >&lt;p&gt;Quick update on this bug.&lt;/p&gt;

&lt;p&gt;I&apos;m still working on the ActorSystem restart issue.  Namely, after a node restarts its ActorSystem, it is sporadically unable to rejoin the cluster.  The other two nodes do not get the MemberUp notification, and the node in question becomes a single-node cluster.  This happens despite the fact that Akka Remoting seems able to re-associate fine with the newly restarted node and drop the quarantine status.&lt;/p&gt;

&lt;p&gt;Still investigating.&lt;/p&gt;</comment>
<comment id="50924" author="gary.wu1@huawei.com" created="Thu, 1 Oct 2015 20:19:26 +0000"  >&lt;p&gt;Looks like the cluster rejoin problem is caused when the restarting node happens to be the first seed node specified in its own akka.conf.  Since Akka Clustering joins the seed node that responds first, sometimes the seed node that responds first is the node itself, and it ends up joining only itself and forming a single-node cluster.&lt;/p&gt;

&lt;p&gt;To prevent this, during ActorSystem restarts the list of seed nodes should not contain the self address of the restarting node.  My plan:&lt;/p&gt;

&lt;p&gt;For the initial boot, use the configuration in akka.conf as is.&lt;br/&gt;
For ActorSystem restarts, prepare a separate Akka config in memory that matches akka.conf but omits the node&apos;s self address from the list of seed nodes.&lt;/p&gt;

&lt;p&gt;Let me know if you have any thoughts on this approach.&lt;/p&gt;</comment>
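The restart-time tweak Gary proposes above can be sketched in plain Java. This is an illustrative sketch only: seed-node addresses are modeled as plain strings, and the raw List type is used just to keep the sketch simple (the real change would be applied while building the in-memory Akka config):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposal above: reuse akka.conf as-is on first boot, but
// when rebuilding the in-memory config for an ActorSystem restart, drop the
// node's own address from the seed-node list so it cannot join itself and
// form a single-node cluster.
class SeedNodeFilter {
    static List seedNodesForRestart(List configuredSeeds, String selfAddress) {
        List filtered = new ArrayList();
        for (Object seed : configuredSeeds) {
            if (!selfAddress.equals(seed)) {
                filtered.add(seed);
            }
        }
        return filtered;
    }
}
```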
                            <comment id="50925" author="tpantelis" created="Thu, 1 Oct 2015 20:30:43 +0000"  >&lt;p&gt;That sounds reasonable. Moiz has a draft patch to use a single ActorSystem for CDS and RPC which will make things easier.&lt;/p&gt;</comment>
                            <comment id="50926" author="gary.wu1@huawei.com" created="Fri, 2 Oct 2015 23:44:21 +0000"  >&lt;p&gt;I&apos;ve made an initial commit here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/27852/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/27852/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The case of mutual quarantines hasn&apos;t been addressed yet.  On rare occasions when multiple nodes restart simultaneously (due to mutual quarantines), islands can form.  Will need to think of a good way to address this scenario next.&lt;/p&gt;</comment>
                            <comment id="50927" author="gary.wu1@huawei.com" created="Wed, 21 Oct 2015 18:12:24 +0000"  >&lt;p&gt;In regard to the seed node issue (initial seed node intermittently forms an island on restart):&lt;/p&gt;

&lt;p&gt;I have verified that the initial seed node is unable to rejoin the cluster around 10% of the time, on both Akka 2.3.10 and 2.3.14.&lt;/p&gt;

&lt;p&gt;According to Patrik Nordwall from Typesafe, this is supposed to work, so maybe there is a bug in Akka.&lt;/p&gt;

&lt;p&gt;I&apos;ve created the following issues against Akka:&lt;/p&gt;

&lt;p&gt;Cluster initial seed node intermittently fails to rejoin cluster on restart&lt;br/&gt;
&lt;a href=&quot;https://github.com/akka/akka/issues/18757&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/18757&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add named exception to detect when a cluster node has been quarantined by others&lt;br/&gt;
&lt;a href=&quot;https://github.com/akka/akka/issues/18758&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/18758&lt;/a&gt;&lt;/p&gt;</comment>
<comment id="50928" author="gary.wu1@huawei.com" created="Wed, 21 Oct 2015 23:56:24 +0000"  >&lt;p&gt;I reran some tests and found that I can eliminate the initial seed node rejoin error by increasing seed-node-timeout to 10s.  This means the Akka documentation was correct, and something specific to my test system was sporadically preventing nodes 2 and 3 from responding to the first node within the default seed-node-timeout of 5s.&lt;/p&gt;

&lt;p&gt;This means that we can proceed with the node restart without having to modify the seed node configurations.&lt;/p&gt;

&lt;p&gt;What is the best way to restart the container?  Do we want to run an external script?  Or is it easy to programmatically restart bundle 0? (I&apos;m not familiar with this.)&lt;/p&gt;</comment>
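The tuning described above amounts to a one-line akka.conf override; a sketch, with the 10s value taken from the comment (the Akka default is 5s):

```
# Sketch of the akka.conf tuning described above: give the other members
# more time to answer the initial join before the first seed node gives up
# and joins itself (default seed-node-timeout is 5s).
akka {
  cluster {
    seed-node-timeout = 10s
  }
}
```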
<comment id="50929" author="tpantelis" created="Thu, 22 Oct 2015 00:33:19 +0000"  >&lt;p&gt;I&apos;ve programmatically restarted karaf in the past but it was a few years ago. Here&apos;s a link: &lt;a href=&quot;http://karaf.922171.n3.nabble.com/karaf-programmatic-restart-td4035671.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://karaf.922171.n3.nabble.com/karaf-programmatic-restart-td4035671.html&lt;/a&gt;. It says to set the &quot;karaf.restart&quot; property, which I vaguely remember. However, I don&apos;t know whether this is still valid with karaf 3.x.&lt;/p&gt;</comment>
                            <comment id="50930" author="tpantelis" created="Fri, 23 Oct 2015 14:51:35 +0000"  >&lt;p&gt;Patch &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/27852/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/27852/&lt;/a&gt;&lt;/p&gt;</comment>
<comment id="50931" author="gary.wu1@huawei.com" created="Fri, 23 Oct 2015 16:58:04 +0000"  >&lt;p&gt;In &lt;a href=&quot;https://github.com/akka/akka/issues/18757&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/akka/akka/issues/18757&lt;/a&gt; the Akka team is seeing strange behavior in our logs, so there may be a bug they will need to address.  Nonetheless, the intended behavior is that the node can rejoin without having to remove itself from the seed node config, so we can move forward with our patch to restart the karaf container.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10002">
                    <name>Duplicate</name>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="25437">CONTROLLER-883</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="25939">CONTROLLER-1385</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="13531" name="Logs.zip" size="232874" author="ShaleenS" created="Wed, 22 Jul 2015 16:06:58 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10208" key="com.atlassian.jira.plugin.system.customfieldtypes:textfield">
                        <customfieldname>External issue ID</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4037</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10201" key="com.atlassian.jira.plugin.system.customfieldtypes:url">
                        <customfieldname>External issue URL</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[https://bugs.opendaylight.org/show_bug.cgi?id=4037]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10206" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Issue Type</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10300"><![CDATA[Bug]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10204" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>ODL SR Target Milestone</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10315"><![CDATA[Lithium]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i02qbb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>