follower reports 401 (unauthorized) and 500 (Internal Error) when leader is isolated. (CONTROLLER-1838)

[CONTROLLER-1839] leader vs follower data Created: 22/Jun/18  Updated: 23/Jun/18  Resolved: 23/Jun/18

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Sub-task
Priority: Medium
Reporter: Jamo Luhrsen
Assignee: Jamo Luhrsen
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

Check the results of the jobs to see whether there is any correlation between
passing and failing jobs and whether the follower chosen to talk to (by the
car-adding script and the robot REST calls) is the one that becomes the leader
or stays a follower when the initial leader is isolated.



 Comments   
Comment by Jamo Luhrsen [ 23/Jun/18 ]

TL;DR: there is no pattern in the passing or failing jobs regarding which follower ends up being the
leader when the initial leader is isolated.

I looked at the 10 most recent jobs, 4 of which passed. One of those passing jobs was sending
the car creation script to the follower that ended up being the leader, while the other 3 were
talking to the follower that remained a follower.

Also, of the 6 jobs that failed, one (job 123) saw a "read timed out" in the final test case, but
everything else was fine in that suite. We have a genius bug that was seeing a similar
thing. That bug had a fix merged, though, so now I wonder what could have caused it here.
This is not specific to CONTROLLER-1839, just noting it here.

The other 5 jobs that failed all show the python tool getting the 401 unauthorized problem,
which we think should be expected with the ask-based protocol (which is what these jobs
were running). Two of those were talking to the follower that would become the leader
and the other three were talking to the follower that stayed a follower.

So, all said and done, there is no pattern or anything to zero in on.

Below are my notes from going through the jobs. They might not make sense, but I wanted
to keep them.

A = read timed out (RTO) on the last test case (TC)
B = python tool got a 401, but completed without a traceback
C = PASS, new leader is dst
D = PASS, new leader is not dst

123 A . D
122 B . C
121 C
120 D
119 B . D
118 D
117 D
116 B . C
115 B . C
114 B . D
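
The notes above can be tallied mechanically. The following is a minimal sketch, not the tool used for the original analysis. It assumes that "dst" means the follower the script was talking to, that on a failing row (A or B) the trailing C or D only records whether the new leader is dst, and that a row without A or B is a pass. The names NOTES and tally are hypothetical.

```python
from collections import Counter

# Notes copied verbatim from the comment: job number, then outcome codes.
NOTES = """\
123 A . D
122 B . C
121 C
120 D
119 B . D
118 D
117 D
116 B . C
115 B . C
114 B . D
"""


def tally(notes: str) -> Counter:
    """Count jobs per (result, new_leader_is_dst) pair.

    Assumption: A/B mark a failure mode; a row with neither is a pass.
    Assumption: C means the new leader is dst, D means it is not.
    """
    counts = Counter()
    for line in notes.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _job, *codes = line.replace(".", " ").split()
        failure = next((c for c in codes if c in ("A", "B")), None)
        result = failure or "PASS"
        new_leader_is_dst = "C" in codes
        counts[(result, new_leader_is_dst)] += 1
    return counts


if __name__ == "__main__":
    for (result, is_dst), n in sorted(tally(NOTES).items()):
        print(f"{result:4} new leader is dst={is_dst}: {n}")
```

Under that reading, the four passing jobs split into one where the new leader is dst and three where it is not, which matches the summary earlier in this comment.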

Generated at Wed Feb 07 19:56:34 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.