follower reports 401 (unauthorized) and 500 (Internal Error) when leader is isolated. (CONTROLLER-1838)

[CONTROLLER-1841] skip checking /restconf/modules Created: 22/Jun/18  Updated: 05/Jul/18  Resolved: 05/Jul/18

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Sub-task Priority: Medium
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

The Robot test case that finds the new leader normally polls /restconf/modules first,
before going to Jolokia to check the car shard for leadership. In some failure cases
we notice that the polling on /restconf/modules goes on for 35s (polling 5 times with
5s timeouts) and fails. We want to skip that check and go straight to /jolokia,
just to collect that data point.
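
A minimal sketch of the "go straight to /jolokia" idea, in Python rather than Robot keywords. It is not the actual CSIT code: the helper names (shard_url, find_leader_and_followers, fetch), the MBean name pattern, and port 8181 are assumptions for illustration, and authentication is left out. Each member is asked for the RaftState of its own car shard replica, with no /restconf/modules check beforehand.

```python
import json
import urllib.request

# Assumed MBean naming for a config-datastore shard; adjust to the setup under test.
MBEAN_TEMPLATE = (
    "org.opendaylight.controller:Category=Shards,"
    "name=member-{index}-shard-{shard}-config,"
    "type=DistributedConfigDatastore"
)


def shard_url(host, index, shard="car", port=8181):
    """Build the Jolokia read URL for one member's shard replica (port is assumed)."""
    mbean = MBEAN_TEMPLATE.format(index=index, shard=shard)
    return "http://{}:{}/jolokia/read/{}".format(host, port, mbean)


def http_get_json(url, timeout=5):
    """Plain GET returning parsed JSON. Credentials are omitted in this sketch."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def find_leader_and_followers(members, fetch=http_get_json, shard="car"):
    """Query Jolokia on every member directly, skipping any RESTCONF pre-check.

    members maps member index -> host. Returns (leader_index_or_None, followers).
    Members that fail to answer (for example, an isolated node) are skipped.
    """
    leader = None
    followers = []
    for index, host in sorted(members.items()):
        try:
            state = fetch(shard_url(host, index, shard))["value"]["RaftState"]
        except Exception:
            continue  # unreachable member or unexpected reply: no data point
        if state == "Leader":
            leader = index
        elif state == "Follower":
            followers.append(index)
    return leader, followers
```

Passing the fetch function as a parameter keeps the leader-discovery logic testable without a running cluster. A real test would also need to supply credentials and some retry policy.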

/restconf/modules should still respond, imho, so we would want to figure that out
eventually as well.



 Comments   
Comment by Jamo Luhrsen [ 26/Jun/18 ]

When I remove the initial /restconf/modules check from the logic that figures out who the new leader and followers
are after isolation, those steps are passing. However, there are other REST interactions in these tests that will fail
for a similar reason.

I think at the end of all of this, we really need to understand and fix what is causing /restconf to be unresponsive.
That is too basic of a problem to ignore, and is just getting in the way of other debugging we might need to do.

For example, after skipping the /restconf call when figuring out the new leader, the test just fails when trying to use
restconf to add cars, and then again when trying to read from restconf to count how many cars are added.

If we are lucky, and this tell-based protocol is doing what we expect, then maybe this unresponsive /restconf/
issue is our only problem in this job to worry about.

Comment by Jamo Luhrsen [ 26/Jun/18 ]

example job

Comment by Jamo Luhrsen [ 05/Jul/18 ]

We did try this, and it did help somewhat as we didn't get quicker failures, but when restconf was getting locked up (401 or 500) the CSIT would eventually hit the problems in some fashion
(all ODL interaction is via REST).
