<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 19:56:38 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92">
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[CONTROLLER-1865] owner changed after bringing back isolated node</title>
                <link>https://jira.opendaylight.org/browse/CONTROLLER-1865</link>
                <project id="10113" key="CONTROLLER">controller</project>
                    <description>&lt;p&gt;&lt;a href=&quot;https://lists.opendaylight.org/pipermail/controller-dev/2018-September/014624.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;initial email conversation &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CSIT is sporadically failing when it finds that basic-rpc-test ownership has changed from&lt;br/&gt;
its current owner to some new owner after bringing back an isolated node. The expectation&lt;br/&gt;
is that no ownership should change when an isolated node re-joins the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/445/jamo-controller-csit-3node-clustering-ask-all-oxygen/1/robot-plugin/log.html.gz#s1-s14-t9-k2-k1-k1-k4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;an example of this failure &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the above example was run with debugs enabled for&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
controller.cluster.datastore.entityownership:DEBUG
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note the reproduction above ran this test suite 8 times, but it only failed once.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/445/jamo-controller-csit-3node-clustering-ask-all-oxygen/1/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;Karaf logs here &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To extract just the portion of the logs covering the single test suite that failed, you can run&lt;br/&gt;
this command:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
sed -n &lt;span class=&quot;code-quote&quot;&gt;&quot;/16.59.*ROBOT MESSAGE: Starting suite/,/restart_odl_with_tell_based_false/p&quot;&lt;/span&gt; odl1_karaf.log
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;the key piece of that command is the 16.59, which is taken from a timestamp I found&lt;br/&gt;
that appears in all of the karaf logs at that specific suite&apos;s start point.&lt;/p&gt;</description>
                <environment></environment>
        <key id="30817">CONTROLLER-1865</key>
            <summary>owner changed after bringing back isolated node</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.opendaylight.org/images/icons/priorities/major.svg">Medium</priority>
                        <status id="5" iconUrl="https://jira.opendaylight.org/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="green"/>
                                    <resolution id="10000">Done</resolution>
                                        <assignee username="tpantelis">Tom Pantelis</assignee>
                                    <reporter username="jluhrsen">Jamo Luhrsen</reporter>
                        <labels>
                            <label>csit:3node</label>
                    </labels>
                <created>Tue, 2 Oct 2018 17:36:50 +0000</created>
                <updated>Sat, 22 Dec 2018 15:45:24 +0000</updated>
                            <resolved>Sat, 22 Dec 2018 15:45:24 +0000</resolved>
                                                    <fixVersion>Neon</fixVersion>
                    <fixVersion>Oxygen SR4</fixVersion>
                    <fixVersion>Fluorine SR2</fixVersion>
                                    <component>clustering</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>5</watches>
                        <comments>
                            <comment id="65721" author="ecelgp" created="Tue, 20 Nov 2018 18:19:38 +0000"  >&lt;p&gt;AFAIR this has always been the behavior; it seems like when the isolated instance rejoins it forces an owner re-election. A long time back I commented out a test in OFP checking this because I was told this behavior was expected.&lt;/p&gt;</comment>
                            <comment id="65923" author="tpantelis" created="Sat, 8 Dec 2018 00:01:29 +0000"  >&lt;p&gt;The test isolates the node that originally owns the Basic-rpc-test entity. In this run that was odl3. During isolation, the majority partition, odl1 and odl2, elects a new owner since it is able to make progress. In this case that was odl2. When odl3 was un-isolated, since it re-joins as a minority member, it should not depose the new owner in the majority partition, ie the owner should remain odl2. This part actually worked as designed. The problem is that on partition heal, the new EOS leader, odl1, received a notification from akka that odl2 was unreachable:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;2018-10-02T17:01:16,076 | INFO  | opendaylight-cluster-data-shard-dispatcher-76 | ShardManager                     | 287 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.30.170.39:2550
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It then received a ReachableMember event about a second later at 17:01:17,557. But this little blip caused the EOS to re-assign ownership to odl3, since it had received a ReachableMember event for odl3 shortly before. &lt;/p&gt;

&lt;p&gt;The weird thing is that there&apos;s no &quot;Marking node(s) as UNREACHABLE ...&quot; log message from akka for odl2, just one for odl3 earlier which is expected since it was isolated. I don&apos;t know why akka sent a spurious UnreachableMember event for odl2 - I&apos;ve never seen that happen w/o a corresponding UNREACHABLE log message.  &lt;/p&gt;</comment>
                            <comment id="65925" author="tpantelis" created="Sat, 8 Dec 2018 14:43:34 +0000"  >&lt;p&gt;This was also reported in &lt;a href=&quot;https://bugs.opendaylight.org/show_bug.cgi?id=8430&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugs.opendaylight.org/show_bug.cgi?id=8430&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="65937" author="jluhrsen" created="Mon, 10 Dec 2018 17:35:42 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;, what do we want to do next with this one? I can try to reproduce it locally if&lt;br/&gt;
that&apos;s helpful. Or we can run our jobs with some tweaks or different logging levels.&lt;/p&gt;</comment>
                            <comment id="65938" author="tpantelis" created="Mon, 10 Dec 2018 17:56:12 +0000"  >&lt;p&gt;Easiest thing to do short term is to not fail the test if the owner changes - functionally it doesn&apos;t really matter if it does. I can also try to put in a workaround to check the cluster status to verify if the member is really unreachable in the hopes that the spurious/false event doesn&apos;t actually reflect the real status (since there was no akka log message indicating unreachability).  Other than that, we can try akka artery and hope that alleviates the spurious events.&lt;/p&gt;
</comment>
                            <comment id="65942" author="jluhrsen" created="Mon, 10 Dec 2018 18:15:05 +0000"  >&lt;p&gt;well, thinking worst case here: if whatever causes this spurious owner change starts happening&lt;br/&gt;
frequently, aren&apos;t we in a less-than-optimal state? &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=vpickard&quot; class=&quot;user-hover&quot; rel=&quot;vpickard&quot;&gt;vpickard&lt;/a&gt;, didn&apos;t you have some environment&lt;br/&gt;
that was seeing unexpected leader changes in your cluster? Was that strictly related to performance/scale&lt;br/&gt;
testing?&lt;/p&gt;

&lt;p&gt;Obviously something is funky here, and the akka default is what companies are deploying with. I understand&lt;br/&gt;
it&apos;s not a big deal that Leadership changes once in a while, but this is a pretty simple test with&lt;br/&gt;
pretty simple expectations. Hopefully we can find a way to explain it rather than just ignoring it.&lt;/p&gt;

&lt;p&gt;As to trying with artery, I had something working a while back to make testing in our infra possible, but&lt;br/&gt;
I&apos;ll have to refresh my memory for what I did.&lt;/p&gt;</comment>
                            <comment id="65943" author="vpickard" created="Mon, 10 Dec 2018 18:26:48 +0000"  >&lt;p&gt;Jamo,&lt;br/&gt;
Yes, I&apos;m currently running some scale testing downstream, and we have seen leadership changes in the cluster when none of the nodes were taken down.&lt;/p&gt;

&lt;p&gt;This is likely related to resource issues. I was monitoring the akka port (2550) on all cluster nodes, and it was clear that the recvq for the akka connection was not processed for an extended period of time (40+ seconds), which resulted in leadership changes. Analysis indicated that this was related to very long GC pauses (27+ seconds, 3 times in 20-30 min period). We had a JFR running, and this bug was opened (and patched) to help address the huge allocations. &lt;/p&gt;

&lt;p&gt;Still running scale testing with this patch &lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt; for this JIRA &lt;span class=&quot;error&quot;&gt;&amp;#91;2&amp;#93;&lt;/span&gt;  (and also doing GC tunings) to see if we can get this more stable.&lt;/p&gt;

&lt;p&gt;When you see this failure in the CSIT, are you monitoring the akka ports on all cluster nodes?&lt;/p&gt;

&lt;p&gt;I&apos;m attaching the simple script I use to monitor this (akkaMon.sh)&lt;/p&gt;

&lt;p&gt;&lt;span class=&quot;error&quot;&gt;&amp;#91;1&amp;#93;&lt;/span&gt; &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/78377/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/78377/&lt;/a&gt;&lt;br/&gt;
&lt;span class=&quot;error&quot;&gt;&amp;#91;2&amp;#93;&lt;/span&gt; &lt;a href=&quot;https://jira.opendaylight.org/browse/OPNFLWPLUG-1047&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;https://jira.opendaylight.org/browse/OPNFLWPLUG-1047&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="65946" author="tpantelis" created="Mon, 10 Dec 2018 19:25:09 +0000"  >&lt;p&gt;That is related to shard leadership changes, which aren&apos;t tied to akka membership events, so they wouldn&apos;t be affected by spurious UnreachableMember events. This CSIT is related to entity ownership, which does use UnreachableMember events to possibly re-assign ownership. &lt;/p&gt;

&lt;p&gt;I&apos;ll push a patch to dump the akka cluster state on UnreachableMember event so we can see if that status is actually reflected in the backend state. If it isn&apos;t then we can work around it and ignore the spurious event. Otherwise we can open a case with akka and hope they&apos;ll provide support, but you know the first thing they will say is to try artery &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.opendaylight.org/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;
</comment>
                            <comment id="65948" author="tpantelis" created="Mon, 10 Dec 2018 23:16:14 +0000"  >&lt;p&gt;I tried dumping the cluster state on UnreachableMember event in my local env. I brought a node down and saw the expected &quot;Marking node(s) as UNREACHABLE ...&quot; log message just prior to the ShardManager receiving the UnreachableMember event. I expected to see the node reflected in the unreachable set but it wasn&apos;t &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.opendaylight.org/images/icons/emoticons/sad.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt; Of course when I checked JMX afterwards it was in the Unreachable set. So unfortunately it appears it&apos;s reflected in the &quot;queryable&quot; state some time after we receive the event. Maybe if we wait a little or poll the state for a time to verify... but that would start getting into &quot;ugly&quot; territory.&lt;/p&gt;

&lt;p&gt;I say we try artery to see if it reproduces there. &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=jluhrsen&quot; class=&quot;user-hover&quot; rel=&quot;jluhrsen&quot;&gt;jluhrsen&lt;/a&gt;  I cherry-picked your Oxygen patch to master:&lt;br/&gt;
&lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/78629/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/78629/&lt;/a&gt;. Can you run the &quot;Global Rpc Isolate&quot; over and over with that patch?&lt;/p&gt;</comment>
                            <comment id="65949" author="jluhrsen" created="Tue, 11 Dec 2018 05:34:15 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;, yeah I&apos;ll work on this. I have a csit patch that we need to account&lt;br/&gt;
for a few differences in the environment when we use artery. I need to double check&lt;br/&gt;
the actual cluster config deployment too.&lt;/p&gt;</comment>
                            <comment id="65954" author="jluhrsen" created="Tue, 11 Dec 2018 22:01:26 +0000"  >&lt;p&gt;working on it here:&lt;br/&gt;
  &lt;a href=&quot;https://jenkins.opendaylight.org/sandbox/job/jamo-controller-csit-3node-clustering-ask-all-neon/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://jenkins.opendaylight.org/sandbox/job/jamo-controller-csit-3node-clustering-ask-all-neon/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="65958" author="jluhrsen" created="Wed, 12 Dec 2018 00:55:47 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;, I&apos;m having trouble... the controllers are not coming up properly in the csit. I can&lt;br/&gt;
tell the distro is the one coming from your neon cherry pick of my patch because the akka.conf&lt;br/&gt;
file (you can see it in the console log) has the aeron-udp tweak. Next, I&apos;m removing that .tcp&lt;br/&gt;
string from akka.tcp in akka.conf. That&apos;s all I remember having to do before and I got it working&lt;br/&gt;
(both locally and in csit).&lt;/p&gt;

&lt;p&gt;the controllers are giving a 401 Unauthorized to a GET on /restconf/modules&lt;/p&gt;

&lt;p&gt;can you think of anything?&lt;/p&gt;</comment>
                            <comment id="65959" author="tpantelis" created="Wed, 12 Dec 2018 01:58:18 +0000"  >&lt;p&gt;We need to look at the karaf log.&lt;/p&gt;</comment>
                            <comment id="65975" author="jluhrsen" created="Wed, 12 Dec 2018 22:11:49 +0000"  >&lt;p&gt;I got it figured out &lt;b&gt;finally&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jenkins.opendaylight.org/sandbox/job/jamo-controller-csit-3node-clustering-ask-all-neon/15/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this job &lt;/a&gt; ran and passed that global rpc isolate once. I&apos;ll start running this in&lt;br/&gt;
loops to see if it reproduces.&lt;/p&gt;</comment>
                            <comment id="65991" author="jluhrsen" created="Thu, 13 Dec 2018 18:59:34 +0000"  >&lt;p&gt;This bug also happens with artery. logs here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/566/jamo-controller-csit-3node-clustering-ask-all-neon/19/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/566/jamo-controller-csit-3node-clustering-ask-all-neon/19/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;what&apos;s next &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;&lt;/p&gt;
</comment>
                            <comment id="65992" author="tpantelis" created="Thu, 13 Dec 2018 19:26:28 +0000"  >&lt;p&gt;It&apos;s an issue in akka so, assuming we want to pursue it, then the next step would be to enable akka cluster and remoting debug and open an issue in akka upstream. &lt;/p&gt;</comment>
                            <comment id="65995" author="tpantelis" created="Thu, 13 Dec 2018 22:30:58 +0000"  >&lt;p&gt;I think I have an idea what&apos;s happening. Since there&apos;s no &quot;Marking node(s) as UNREACHABLE ...&quot; log message, as is usually seen when the UnreachableMember event is issued, and the spurious event occurs when the partition is healed, I think that indicates that it wasn&apos;t the local node that observed the unreachability of the third node (it really shouldn&apos;t have) - it was the state of the third node as observed by the previously isolated node.  So when gossip convergence occurs when the partition is healed, the previously isolated node reports that the third node is still unreachable from its vantage point. If so then technically the event may be valid even though it&apos;s from the vantage point of the previously isolated node. During isolation, split-brain occurs and, after isolation, both sides eventually converge on the correct state but, during that time, we may observe the temporary intermediate state. The fact that it&apos;s sporadic seems to bear this out, ie sometimes the  previously isolated node may have already re-connected to the third node and reports a reachable state on convergence and thus we don&apos;t receive the spurious UnreachableMember event.&lt;/p&gt;

&lt;p&gt;If the UnreachableMember event included which node observed the unreachability then we could possibly filter it out but unfortunately it doesn&apos;t. Also, from my testing, querying the current cluster state when the UnreachableMember event is received unfortunately doesn&apos;t report the node as unreachable so we can&apos;t rely on that.&lt;/p&gt;

&lt;p&gt;The only recourse I see short term is to remove the expectation that the owner won&apos;t change after partition heal. If it does sometimes, while not ideal, functionally we still end up  in a steady state with one owner in the cluster.&lt;/p&gt;

&lt;p&gt;For longer term, we could report it upstream and see what the akka folks say.&lt;/p&gt;

&lt;p&gt;PS - based on &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=rovarga&quot; class=&quot;user-hover&quot; rel=&quot;rovarga&quot;&gt;rovarga&lt;/a&gt; comment in &lt;a href=&quot;https://bugs.opendaylight.org/show_bug.cgi?id=8430#c9&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://bugs.opendaylight.org/show_bug.cgi?id=8430#c9&lt;/a&gt;, I think he may have theorized the same or similar.&lt;/p&gt;</comment>
                            <comment id="65997" author="jluhrsen" created="Thu, 13 Dec 2018 23:02:34 +0000"  >&lt;p&gt;ok, I think I can follow what you are saying. Essentially, when the isolated node is rejoining&lt;br/&gt;
the cluster and they are &quot;comparing notes&quot;, it might happen that it hasn&apos;t yet realized that it&lt;br/&gt;
can connect to one of the nodes, so it reports it as unreachable, and in that case we could get a new&lt;br/&gt;
leader election? Does this mean that all nodes are up for grabs to become Leader when this&lt;br/&gt;
happens? meaning the isolated node could also end up as the Leader?&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;, I think we could learn something if you reported this to the akka folks, even if it&apos;s to learn&lt;br/&gt;
that this possibility is expected.&lt;/p&gt;

&lt;p&gt;as for what to do with the test, we may just have to ignore this one if it&apos;s going to&lt;br/&gt;
randomly fail on us for a legit reason. But, I&apos;m also wondering if there is a slightly&lt;br/&gt;
more careful way to isolate and bring it back that would ensure leadership remains, but&lt;br/&gt;
I can&apos;t really think of anything.&lt;/p&gt;</comment>
                            <comment id="65998" author="tpantelis" created="Thu, 13 Dec 2018 23:31:37 +0000"  >&lt;p&gt;&amp;gt; Does this mean that all nodes are up for grabs to become Leader when this &lt;br/&gt;
happens? meaning the isolated node could also end up as the Leader?&lt;/p&gt;

&lt;p&gt;Any node that is known to be reachable. The  isolated node is reachable at that point so could become owner, in fact that does happen in some runs.&lt;/p&gt;

&lt;p&gt;I can&apos;t see any way of ensuring ownership remains from the CSIT perspective. It&apos;s dependent on network communication and the timing of when nodes re-connect and when gossip messages are received etc - no way to control that.&lt;/p&gt;</comment>
                            <comment id="65999" author="jluhrsen" created="Thu, 13 Dec 2018 23:56:02 +0000"  >&lt;p&gt;curious, what&apos;s the flow and state of things if we first isolate a node (I)&lt;br/&gt;
(doing it by blocking traffic &lt;b&gt;outgoing&lt;/b&gt; on the isolated node) and then only&lt;br/&gt;
bring back its communication to the Leader (L) and not the healthy&lt;br/&gt;
Follower (F)?&lt;/p&gt;

&lt;p&gt;Essentially, the Leader has two nodes it can talk with? Would the previously&lt;br/&gt;
isolated node tell the Leader that it can&apos;t reach F and that causes a new&lt;br/&gt;
Leader election? Or does the Leader keep the existing 2 node cluster with&lt;br/&gt;
L and F?&lt;/p&gt;</comment>
                            <comment id="66000" author="tpantelis" created="Fri, 14 Dec 2018 00:21:11 +0000"  >&lt;p&gt;Not sure I follow exactly... what is L the leader of?&lt;/p&gt;

&lt;p&gt;I assume you mean the new Leader of the EOS shard that is elected in the majority partition after isolation. Let&apos;s say &lt;b&gt;L&lt;/b&gt; and &lt;b&gt;F&lt;/b&gt; are in the majority partition and &lt;b&gt;I&lt;/b&gt; is isolated. So &lt;b&gt;L&lt;/b&gt; and &lt;b&gt;F&lt;/b&gt; can talk to each other but neither can talk to &lt;b&gt;I&lt;/b&gt;. Let&apos;s say entity &lt;b&gt;E&lt;/b&gt; is now owned by &lt;b&gt;F&lt;/b&gt; in the majority partition. &lt;b&gt;I&lt;/b&gt; also still owns &lt;b&gt;E&lt;/b&gt;&#160;on its side b/c it doesn&apos;t know it&apos;s isolated. At this point we have 2 owners in the cluster. Then communication between &lt;b&gt;L&lt;/b&gt; and &lt;b&gt;I&lt;/b&gt; is re-established. Initially &lt;b&gt;I&lt;/b&gt; rejoins and relinquishes ownership of &lt;b&gt;E&lt;/b&gt; leaving &lt;b&gt;F&lt;/b&gt; as the sole owner. So far so good. Via akka&apos;s gossip convergence, &lt;b&gt;L&lt;/b&gt; learns from &lt;b&gt;I&lt;/b&gt; that &lt;b&gt;F&lt;/b&gt; is unreachable from its vantage point. So &lt;b&gt;L&lt;/b&gt; receives an UnreachableMember event that &lt;b&gt;F&lt;/b&gt; is unreachable and selects a new owner for &lt;b&gt;E&lt;/b&gt;, which could be either &lt;b&gt;L&lt;/b&gt; or &lt;b&gt;I&lt;/b&gt;.&lt;/p&gt;</comment>
                            <comment id="66001" author="jluhrsen" created="Fri, 14 Dec 2018 00:52:32 +0000"  >&lt;p&gt;Ok, I&apos;m starting to understand better. can we try it the other way now?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;L&lt;/b&gt; and &lt;b&gt;F&lt;/b&gt; can talk to each other but not &lt;b&gt;I&lt;/b&gt;, so like you said &lt;b&gt;I&lt;/b&gt; owns&lt;br/&gt;
&lt;b&gt;E&lt;/b&gt; just like &lt;b&gt;L&lt;/b&gt; owns &lt;b&gt;E&lt;/b&gt; because it doesn&apos;t know it&apos;s isolated. Now,&lt;br/&gt;
let &lt;b&gt;I&lt;/b&gt; get communication to &lt;b&gt;F&lt;/b&gt; instead of &lt;b&gt;L&lt;/b&gt; this time.&lt;/p&gt;

&lt;p&gt;what is our end state then?&lt;/p&gt;</comment>
                            <comment id="66002" author="tpantelis" created="Fri, 14 Dec 2018 02:38:52 +0000"  >&lt;p&gt;That&apos;s an interesting scenario. If &lt;b&gt;I&lt;/b&gt; was the shard leader before isolation then it would remain as IsolatedLeader during&#160;isolation. When the connection is re-established to &lt;b&gt;F&lt;/b&gt;, it would send AppendEntries but &lt;b&gt;F&lt;/b&gt; would reject it b/c &lt;b&gt;I&lt;/b&gt;&apos;s term would be less than &lt;b&gt;F&lt;/b&gt;&apos;s (as &lt;b&gt;F&lt;/b&gt;&apos;s term was bumped due to the new leader election). &lt;b&gt;I&lt;/b&gt; would change to Follower due to the&#160;AppendEntriesReply with a higher term. Eventually it would time out due to not hearing from the leader &lt;b&gt;L&lt;/b&gt; and start a new election. It would request a vote from &lt;b&gt;F&lt;/b&gt; but would (or should) be rejected b/c &lt;b&gt;F&lt;/b&gt;&apos;s last log index and term would be higher (part of the brilliance of the Raft algorithm). &lt;b&gt;I&lt;/b&gt; would then go back to Follower and repeat. So &lt;b&gt;I&lt;/b&gt; would essentially remain isolated.&#160;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;L&lt;/b&gt; would or should not get an UnreachableMember event for itself - that wouldn&apos;t make sense. Looking at the logs in this case, &lt;b&gt;F&lt;/b&gt; (odl2) did not report an&#160;UnreachableMember event for itself.&lt;/p&gt;

&lt;p&gt;In any case, &lt;b&gt;L&lt;/b&gt; would remain the owner of &lt;b&gt;E&lt;/b&gt;.&lt;/p&gt;</comment>
                            <comment id="66003" author="tpantelis" created="Fri, 14 Dec 2018 05:15:15 +0000"  >&lt;p&gt;Looking at the logs again, odl1&#160;reported the UnreachableMember event for odl2 at 17:01:16,076. odl3 reported the ReachableMember event for odl2 at&#160;17:01:16,988 which is slightly after (912 ms) the&#160;UnreachableMember event. So that supports the theory. odl3 reported odl1 reachable at&#160;17:01:16,001 and odl1 reported odl3 reachable at 17:01:15,732. So odl1 and odl3 would have started gossip convergence before odl3 gained reachability with odl2.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="66004" author="rovarga" created="Fri, 14 Dec 2018 05:48:24 +0000"  >&lt;p&gt;Correct, the gory details are at &lt;a href=&quot;https://doc.akka.io/docs/akka/2.5/common/cluster.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://doc.akka.io/docs/akka/2.5/common/cluster.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="66005" author="tpantelis" created="Fri, 14 Dec 2018 06:05:48 +0000"  >&lt;p&gt;yeah -&#160;given akka&apos;s gossip convergence, it&apos;s FAD (and not to say it&apos;s wrong either).&lt;/p&gt;</comment>
                            <comment id="66017" author="jluhrsen" created="Fri, 14 Dec 2018 16:27:26 +0000"  >&lt;p&gt;thanks for walking me through these &lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=tpantelis&quot; class=&quot;user-hover&quot; rel=&quot;tpantelis&quot;&gt;tpantelis&lt;/a&gt;. I&apos;m learning a lot &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.opendaylight.org/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="66018" author="jluhrsen" created="Fri, 14 Dec 2018 16:30:17 +0000"  >&lt;p&gt;Since we cannot completely guarantee that this test case with an isolated node will keep the&lt;br/&gt;
same Leader when it rejoins, I will remove that test case. The only other option I could think&lt;br/&gt;
of would be to enhance the test case to get really &quot;white box&quot; and check the logs for&lt;br/&gt;
those UnreachableMember messages and somehow figure out whether the Leadership change&lt;br/&gt;
(when it happens) is expected. But then I think we are falling into the land of testing akka,&lt;br/&gt;
which we just have to trust is already done by the akka project themselves.&lt;/p&gt;</comment>
                            <comment id="66020" author="tpantelis" created="Fri, 14 Dec 2018 17:11:24 +0000"  >&lt;p&gt;yeah -&#160;we could do something fancy like that, ie before failing the test due to owner change, check the EOS shard leader&apos;s log for &quot;Received UnreachableMember&quot; for the member whose ownership was expected not to change. I don&apos;t think we&apos;d actually be testing akka but rather&#160;working around it, so at least when the spurious event doesn&apos;t occur we can verify no&#160;owner change. It&apos;s up to you if you think it&apos;s worth it - I&apos;m fine either way.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="15069" name="akkaMon.sh" size="191" author="vpickard" created="Mon, 10 Dec 2018 18:27:08 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i03j73:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>