BAT SIGNAL on infra issues (RELENG-113)

[RELENG-114] Analyze CSIT Red Dots Created: 04/May/18  Updated: 01/Jun/18  Resolved: 01/Jun/18

Status: Closed
Project: releng
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Sub-task Priority: Medium
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

red dots are what we see on infra aborts. However, there are also other instances
when we will see red dots that are not infra aborts. For example, if some bug creeps
in that bloats our ODL logs to such a large size that we cannot archive them and
Jenkins gives up (speaking from experience).

this task is for someone to take a reasonable sample of the recent red dots in
CSIT jobs (preferably netvirt, openflowplugin and controller) and report back
the percentage of red dots that are infra aborts vs. those that are not.

Based on this data, we can decide how smart our bat signal needs to be about
deciding when to sound off. We don't want to create so much noise that it becomes
annoying and meaningless to the admins who will be expected to watch this.
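
One way to picture a "smarter" bat signal is a gate that stays quiet on a single infra abort and only sounds off on a streak of them. The sketch below is purely illustrative: the class name, the result labels ("pass", "fail", "infra_abort") and the streak threshold are all assumptions, not anything that exists in releng today.

```python
class BatSignal:
    """Hypothetical alert gate: sound off only on a streak of infra aborts."""

    def __init__(self, streak_threshold=3):
        # Number of consecutive infra aborts required before alerting (assumed value).
        self.streak_threshold = streak_threshold
        self.streak = 0

    def record(self, result):
        """Record one run result and return True if the signal should fire.

        `result` is one of "pass", "fail", "infra_abort" (labels are assumptions).
        """
        if result == "infra_abort":
            self.streak += 1
        else:
            # Any non-infra result breaks the streak.
            self.streak = 0
        return self.streak >= self.streak_threshold


signal = BatSignal(streak_threshold=2)
results = ["pass", "infra_abort", "pass", "infra_abort", "infra_abort"]
alerts = [signal.record(r) for r in results]
print(alerts)
```

With a threshold of 2, isolated aborts stay silent and only the back-to-back pair at the end would raise the signal, which keeps the noise down for whoever is watching.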



 Comments   
Comment by Jamo Luhrsen [ 21/May/18 ]

some analysis:

 
this job had 4 red dots in 30 runs over the course of a month. Only 1 red dot was an infra abort.

this job had 13 runs over a month, 2 of which were red dots. One of those red dots was an infra issue.

this job had 30 runs over a four day period with two red dots. Neither red dot was a problem with infra.

this job also with 30 runs over a four day period had 3 red dots, none of which were infra related.

To summarize, that's approx. 120 runs with 11 red dots, of which only 2 were infra related. So, essentially we have a 1.6% rate of infra
aborts. Also, the 2 infra aborts we saw were not an indication of infra being broken and affecting the jobs that ran afterward.
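
For anyone wanting to re-check the arithmetic, here is a small tally using only the per-job numbers listed above. Note that the four listed run counts add up to 103 rather than the approx. 120 quoted, so the per-run rate comes out slightly higher than 1.6%; either way the conclusion is the same. The tuple layout and function name are just for illustration.

```python
# (runs, red_dots, infra_aborts) for the four sampled jobs, as listed in this comment.
samples = [
    (30, 4, 1),
    (13, 2, 1),
    (30, 2, 0),
    (30, 3, 0),
]


def summarize(samples):
    """Total the sample and compute infra-abort ratios."""
    runs = sum(s[0] for s in samples)
    red_dots = sum(s[1] for s in samples)
    infra_aborts = sum(s[2] for s in samples)
    return {
        "runs": runs,
        "red_dots": red_dots,
        "infra_aborts": infra_aborts,
        # Share of red dots that were infra aborts.
        "infra_share_of_red_dots": infra_aborts / red_dots,
        # Infra aborts as a fraction of all runs.
        "infra_rate_per_run": infra_aborts / runs,
    }


summary = summarize(samples)
print(summary)
```

On these numbers, 2 of 11 red dots (about 18%) were infra aborts, which is roughly 1.9% of the 103 listed runs.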

I'm not sure it's justifiable to spend time on automation to warn the right people if we get an abort, unless someone has
the extra cycles or there is some other problem I'm missing. This is not to say we didn't have major issues in the past when
something like this would have been very useful.

Generated at Wed Feb 07 20:37:31 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.