[CONTROLLER-1865] owner changed after bringing back isolated node Created: 02/Oct/18  Updated: 22/Dec/18  Resolved: 22/Dec/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: Neon, Oxygen SR4, Fluorine SR2

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File akkaMon.sh    

 Description   

initial email conversation

CSIT is sporadically failing when it finds that basic-rpc-test ownership has changed from
its current owner to some new owner after bringing back an isolated node. The expectation
is that no ownership should change when a node is re-joined to the cluster.

an example of this failure

the above example was run with debugs enabled for

controller.cluster.datastore.entityownership:DEBUG
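
For anyone re-running this, a minimal sketch of enabling that logger from the Karaf console; the fully qualified logger name org.opendaylight.controller.cluster.datastore.entityownership is an assumption here:

# run on each controller from the karaf console (revert with log:set INFO <logger>);
# assumes the shorthand above maps to this fully qualified logger name
log:set DEBUG org.opendaylight.controller.cluster.datastore.entityownership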

Note the reproduction above ran this test suite 8 times, but it only failed once.

Karaf logs here

In order to extract just the portion of logs from the single test suite that failed, you can run
this command:

sed -n "/16.59.*ROBOT MESSAGE: Starting suite/,/restart_odl_with_tell_based_false/p" odl1_karaf.log

The main piece of that command is the 16.59, which is taken from a timestamp that marks
that specific suite's start point and appears in all of the karaf logs.
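
A hedged variation to pull the same window out of each node's log; the odl2_karaf.log and odl3_karaf.log filenames are assumptions alongside the odl1_karaf.log above:

# extract the failing suite's window from each controller's karaf log
for log in odl1_karaf.log odl2_karaf.log odl3_karaf.log; do
  echo "== $log =="
  sed -n "/16.59.*ROBOT MESSAGE: Starting suite/,/restart_odl_with_tell_based_false/p" "$log"
done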



 Comments   
Comment by Luis Gomez [ 20/Nov/18 ]

AFAIR this has always been the behavior; it seems like when the isolated instance rejoins it forces an owner re-election. A long time back I commented out a test in OFP that checked this because I was told this behavior was expected.

Comment by Tom Pantelis [ 08/Dec/18 ]

The test isolates the node that originally owns the Basic-rpc-test entity. In this run that was odl3. During isolation, the majority partition, odl1 and odl2, elects a new owner since it is able to make progress. In this case that was odl2. When odl3 was un-isolated, since it re-joins as a minority member, it should not depose the new owner in the majority partition, ie the owner should remain odl2. This part actually worked as designed. The problem is that on partition heal, the new EOS leader, odl1, received a notification from akka that odl2 was unreachable:

2018-10-02T17:01:16,076 | INFO  | opendaylight-cluster-data-shard-dispatcher-76 | ShardManager                     | 287 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-2}, address: akka.tcp://opendaylight-cluster-data@10.30.170.39:2550

It then received a ReachableMember event for odl2 about a second later, at 17:01:17,557. But this little blip caused the EOS to re-assign ownership to odl3, since it had received a ReachableMember event for odl3 shortly before.

The weird thing is that there's no "Marking node(s) as UNREACHABLE ..." log message from akka for odl2, just one for odl3 earlier which is expected since it was isolated. I don't know why akka sent a spurious UnreachableMember event for odl2 - I've never seen that happen w/o a corresponding UNREACHABLE log message.
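
For reference, a rough grep to pull the membership-related lines out of a karaf log when chasing this; "Received ReachableMember" is assumed to be the counterpart of the "Received UnreachableMember" message quoted above:

# membership events as seen by the ShardManager, plus akka's own unreachability log
grep -E "Received UnreachableMember|Received ReachableMember|Marking node\(s\) as UNREACHABLE" odl1_karaf.log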

Comment by Tom Pantelis [ 08/Dec/18 ]

This was also reported in https://bugs.opendaylight.org/show_bug.cgi?id=8430.

Comment by Jamo Luhrsen [ 10/Dec/18 ]

tpantelis, what do we want to do next with this one? I can try to reproduce it locally if
that's helpful. Or we can run our jobs with some tweaks or different logging levels.

Comment by Tom Pantelis [ 10/Dec/18 ]

The easiest thing to do short term is to not fail the test if the owner changes - functionally it doesn't really matter if it does. I can also try to put in a workaround that checks the cluster status to verify whether the member is really unreachable, in the hope that the spurious/false event doesn't actually reflect the real status (since there was no akka log message indicating unreachability). Other than that, we can try akka artery and hope that alleviates the spurious events.

Comment by Jamo Luhrsen [ 10/Dec/18 ]

well, thinking worst case here: if whatever causes this spurious owner change starts happening
frequently, aren't we in a less-than-optimal state? vpickard, didn't you have some environment
that was seeing unexpected leader changes in your cluster? Was that strictly related to performance/scale
testing?

Obviously something is funky here, and the default akka transport is what companies are deploying with. I understand
it's not a big deal that Leadership changes once in a while, but this is a pretty simple test with
pretty simple expectations. Hopefully we can find a way to explain it rather than just ignoring it.

As to trying with artery, I had something working a while back to make testing in our infra possible, but
I'll have to refresh my memory for what I did.

Comment by Victor Pickard [ 10/Dec/18 ]

Jamo,
Yes, I'm currently running some scale testing downstream, and we have seen leadership changes in the cluster when none of the nodes were taken down.

This is likely related to resource issues. I was monitoring the akka port (2550) on all cluster nodes, and it was clear that the recvq for the akka connection was not processed for an extended period of time (40+ seconds), which resulted in leadership changes. Analysis indicated that this was related to very long GC pauses (27+ seconds, 3 times in 20-30 min period). We had a JFR running, and this bug was opened (and patched) to help address the huge allocations.

Still running scale testing with this patch [1] for this JIRA [2] (and also doing GC tunings) to see if we can get this more stable.

When you see this failure in the CSIT, are you monitoring the akka ports on all cluster nodes?

I'm attaching the simple script I use to monitor this (akkaMon.sh)

[1] https://git.opendaylight.org/gerrit/#/c/78377/
[2] https://jira.opendaylight.org/browse/OPNFLWPLUG-1047
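
For reference, a minimal sketch of what that kind of monitor can look like; this is an approximation rather than the attached akkaMon.sh, and the 1-second interval is an assumption:

#!/bin/bash
# sample the Recv-Q / Send-Q of the akka remoting connections (port 2550) once a second
while true; do
    date
    ss -ntp '( sport = :2550 or dport = :2550 )'
    sleep 1
done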

Comment by Tom Pantelis [ 10/Dec/18 ]

That is related to shard leadership changes, which aren't tied to akka membership events and so wouldn't be affected by spurious UnreachableMember events. This CSIT is related to entity ownership, which does use UnreachableMember events to possibly re-assign ownership.

I'll push a patch to dump the akka cluster state on an UnreachableMember event so we can see if that status is actually reflected in the backend state. If it isn't, then we can work around it and ignore the spurious event. Otherwise we can open a case with akka and hope they'll provide support, but you know the first thing they will say is to try artery.

Comment by Tom Pantelis [ 10/Dec/18 ]

I tried dumping the cluster state on the UnreachableMember event in my local env. I brought a node down and saw the expected "Marking node(s) as UNREACHABLE ..." log message just prior to the ShardManager receiving the UnreachableMember event. I expected to see the node reflected in the unreachable set but it wasn't. Of course, when I checked JMX afterwards it was in the Unreachable set. So unfortunately it appears the "queriable" state only reflects it some time after we receive the event. Maybe if we wait a little or poll the state for a while to verify... but that would start getting into "ugly" territory.
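
For reference, a hedged sketch of checking that same state from outside the JVM; it assumes akka's default Cluster JMX MBean is registered and that the controller's Jolokia endpoint is reachable on 8181 with admin/admin credentials:

# read akka's Cluster MBean via Jolokia; the Unreachable attribute should list the
# members the local node currently considers unreachable
curl -u admin:admin http://localhost:8181/jolokia/read/akka:type=Cluster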

I say we try artery to see if it reproduces there. jluhrsen I cherry-picked your Oxygen patch to master:
https://git.opendaylight.org/gerrit/#/c/78629/. Can you run the "Global Rpc Isolate" over and over with that patch?

Comment by Jamo Luhrsen [ 11/Dec/18 ]

tpantelis, yeah I'll work on this. I have a csit patch that accounts for a few
differences in the environment when we use artery. I need to double check
the actual cluster config deployment too.

Comment by Jamo Luhrsen [ 11/Dec/18 ]

working on it here:
https://jenkins.opendaylight.org/sandbox/job/jamo-controller-csit-3node-clustering-ask-all-neon/

Comment by Jamo Luhrsen [ 12/Dec/18 ]

tpantelis, I'm having trouble... the controllers are not coming up properly in the csit. I can
tell the distro is the one coming from your neon cherry-pick of my patch because the akka.conf
file (you can see it in the console log) has the aeron-udp tweak. Next, I'm removing the .tcp
string from akka.tcp in akka.conf. That's all I remember having to do before to get it working
(both locally and in csit).
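
For reference, a rough one-liner for that edit; the configuration/initial/akka.conf path is an assumption about where the distro keeps the file:

# artery uses plain akka:// addresses, so drop the ".tcp" from the configured akka.tcp:// references
sed -i 's/akka\.tcp:/akka:/g' configuration/initial/akka.conf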

the controllers are giving a 401 Unauthorized to a GET on /restconf/modules

can you think of anything?

Comment by Tom Pantelis [ 12/Dec/18 ]

We need to look at the karaf log.

Comment by Jamo Luhrsen [ 12/Dec/18 ]

I got it figured out finally

this job ran and passed that global rpc isolate once. I'll start running this in
loops to see if it reproduces.

Comment by Jamo Luhrsen [ 13/Dec/18 ]

This bug also happens with artery. Logs here:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/566/jamo-controller-csit-3node-clustering-ask-all-neon/19/

what's next tpantelis

Comment by Tom Pantelis [ 13/Dec/18 ]

It's an issue in akka, so, assuming we want to pursue it, the next step would be to enable akka cluster and remoting debug logging and open an issue upstream with akka.

Comment by Tom Pantelis [ 13/Dec/18 ]

I think I have an idea what's happening. Since there's no "Marking node(s) as UNREACHABLE ..." log message, as is usually seen when the UnreachableMember event is issued, and the spurious event occurs when the partition is healed, I think that indicates it wasn't the local node that observed the unreachability of the third node (it really shouldn't have) - it was the state of the third node as observed by the previously isolated node. So when gossip convergence occurs as the partition is healed, the previously isolated node reports that the third node is still unreachable from its vantage point.

If so then technically the event may be valid, even though it's from the vantage point of the previously isolated node. During isolation, split-brain occurs and, after isolation, both sides eventually converge on the correct state but, during that time, we may observe the temporary intermediate state. The fact that it's sporadic seems to bear this out, ie sometimes the previously isolated node may have already re-connected to the third node and reports a reachable state on convergence, and thus we don't receive the spurious UnreachableMember event.

If the UnreachableMember event included which node observed the unreachability then we could possibly filter it out but unfortunately it doesn't. Also, from my testing, querying the current cluster state when the UnreachableMember event is received unfortunately doesn't report the node as unreachable so we can't rely on that.

The only recourse I see short term is to remove the expectation that the owner won't change after partition heal. If it does sometimes, while not ideal, functionally we still end up in a steady state with one owner in the cluster.

For longer term, we could report it upstream and see what the akka folks say.

PS - based on rovarga's comment in https://bugs.opendaylight.org/show_bug.cgi?id=8430#c9, I think he may have theorized the same or something similar.

Comment by Jamo Luhrsen [ 13/Dec/18 ]

ok, I think I can follow what you are saying. Essentially, when the isolated node is rejoining
the cluster and they are "comparing notes", it might happen that it hasn't yet realized it
can connect to one of the nodes, so it reports that node as unreachable, and in that case we could get a new
leader election? Does this mean that all nodes are up for grabs to become Leader when this
happens? meaning the isolated node could also end up as the Leader?

tpantelis, I think we could learn something if you reported this to the akka folks, even if it's to learn
that this possibility is expected.

as for what to do with the test, we may just have to ignore this one if it's going to
randomly fail on us for a legit reason. But, I'm also wondering if there is a slightly
more careful way to isolate and bring it back that would ensure leadership remains, but
I can't really think of anything.

Comment by Tom Pantelis [ 13/Dec/18 ]

> Does this mean that all nodes are up for grabs to become Leader when this
happens? meaning the isolated node could also end up as the Leader?

Any node that is known to be reachable. The isolated node is reachable at that point so could become owner, in fact that does happen in some runs.

I can't see any way of ensuring ownership remains from the CSIT perspective. It's dependent on network communication and the timing of when nodes re-connect and when gossip messages are received etc - no way to control that.

Comment by Jamo Luhrsen [ 13/Dec/18 ]

curious, what's the flow and state of things if we first isolate a node (I)
(doing it by blocking outgoing traffic on the isolated node) and then only
bring back its communication to the Leader (L) and not to the healthy
Follower (F)?

Essentially, the Leader has two nodes it can talk with? Would the previously
isolated node tell the Leader that it can't reach F, and would that cause a new
Leader election? Or does the Leader keep the existing 2-node cluster with
L and F?

Comment by Tom Pantelis [ 14/Dec/18 ]

Not sure I follow exactly... what is L the leader of?

I assume you mean the new Leader of the EOS shard that is elected in the majority partition after isolation. Let's say L and F are in the majority partition and I is isolated. So L and F can talk to each other but neither can talk to I. Let's say entity E is now owned by F in the majority partition. I also still owns E on its side b/c it doesn't know it's isolated. At this point we have 2 owners in the cluster. Then communication between L and I is re-established. Initially I rejoins and relinquishes ownership of E leaving F as the sole owner. So far so good. Via akka's gossip convergence, L learns from I that F is unreachable from its vantage point. So L receives an UnreachableMember event that F is unreachable and selects a new owner for E, which could be either L or I.

Comment by Jamo Luhrsen [ 14/Dec/18 ]

Ok, I'm starting to understand better. can we try it the other way now?

 

L and F can talk to each other but not I, so like you said I owns
E just like L owns E because it doesn't know it's isolated. Now,
let I get communication to F instead of L this time.

what is our end state then?

Comment by Tom Pantelis [ 14/Dec/18 ]

That's an interesting scenario. If I was the shard leader before isolation then it would remain as IsolatedLeader during isolation. When connection is re-established to F, it would send AppendEntries but F would reject it b/c I's term would be less than F's (as F's term was bumped due to the new leader election). I would change to Follower due to the AppendEntriesReply with a higher term. Eventually it would time out due to not hearing from the leader L and start a new election. It would request a vote from F but would (or should) be rejected b/c F's last log index and term would be higher (part of the brilliance of the Raft algorithm). I would then go back to Follower and repeat. So I would essentially remain isolated.

L would or should not get an UnreachableMember event for itself - that wouldn't make sense. Looking at the logs in this case, F (odl2) did not report an UnreachableMember event for itself.

In any case, L would remain the owner of E.

Comment by Tom Pantelis [ 14/Dec/18 ]

Looking at the logs again, odl1 reported the UnreachableMember event for odl2 at 17:01:16,076. odl3 reported the ReachableMember event for odl2 at 17:01:16,988 which is slightly after (912 ms) the UnreachableMember event. So that supports the theory. odl3 reported odl1 reachable at 17:01:16,001 and odl1 reported odl3 reachable at 17:01:15,732. So odl1 and odl3 would have started gossip convergence before odl3 gained reachability with odl2.

 

Comment by Robert Varga [ 14/Dec/18 ]

Correct, the gory details are at https://doc.akka.io/docs/akka/2.5/common/cluster.html

 

Comment by Tom Pantelis [ 14/Dec/18 ]

yeah - given akka's gossip convergence, it's FAD, ie functioning as designed (and not to say it's wrong either).

Comment by Jamo Luhrsen [ 14/Dec/18 ]

thanks for walking me through these tpantelis. I'm learning a lot

Comment by Jamo Luhrsen [ 14/Dec/18 ]

Since we cannot completely guarantee that this test case with an isolated node will keep the
same Leader when it rejoins, I will remove that test case. The only other option I could think
of would be to enhance the test case to get really "white box" and check the logs for
those UnreachableMember messages and somehow figure out whether the Leadership change
(when it happens) is expected. But now I think we would be falling into the land of testing akka
itself, which we just have to trust is already done by the akka project.

Comment by Tom Pantelis [ 14/Dec/18 ]

yeah - we could do something fancy like that, ie before failing the test due to an owner change, check the EOS shard leader's log for "Received UnreachableMember" for whichever member name the owner was expected to remain. I don't think we'd actually be testing akka but rather working around it, so at least when the spurious event doesn't occur we can still verify no owner change. It's up to you if you think it's worth it - I'm fine either way.
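
A rough sketch of that check; it assumes the test already knows the EOS shard leader's karaf log and the previous owner's member name (member-2 and odl1_karaf.log here are just the examples from the log line quoted earlier):

# only fail on an owner change if the EOS leader saw no spurious UnreachableMember
# event for the previous owner around the partition heal
if grep -q "Received UnreachableMember: memberName MemberName{name=member-2}" odl1_karaf.log; then
    echo "spurious UnreachableMember seen for previous owner; tolerating the owner change"
else
    echo "no UnreachableMember for previous owner; owner change is a real failure"
    exit 1
fi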
