[RELENG-75] Heat scripts fail to bring nodes online Created: 04/Jan/18  Updated: 12/Jan/18  Resolved: 12/Jan/18

Status: Resolved
Project: releng
Component/s: Jenkins Job Builder
Affects Version/s: None
Fix Version/s: None

Type: Story Priority: Medium
Reporter: Thanh Ha (zxiiro) Assignee: Thanh Ha (zxiiro)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

jluhrsen mentioned that CSIT job nodes often fail to come online, which fails the job builds. For an example, see the build history of [0]. It seems to be failing often enough to be a problem.

[0] https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/netvirt%20csit/job/netvirt-csit-1node-openstack-ocata-upstream-stateful-snat-conntrack-oxygen/



 Comments   
Comment by Thanh Ha (zxiiro) [ 04/Jan/18 ]

Not sure if this is the cause, but the Heat stack cleanup script currently runs every 15 minutes. There is a possible race condition between the cleanup and a stack that is still being created. While the VM is coming fully online and before it is passed back to the CSIT job, the script's list of stacks may include that in-progress stack, and the script may delete it before it can be handed to the CSIT job.

We should re-evaluate this script and see if we can add some smarts into it to improve things.

https://github.com/opendaylight/releng-builder/blob/master/jjb/opendaylight-infra-cleanup-stale-stacks.sh
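One way to add such "smarts" is sketched below in Python. This is not the logic of the linked shell script; the function name, the input format, and the 30-minute grace period are all assumptions made for illustration. The idea is to skip any stack that is still attached to a running job, is still in an `*_IN_PROGRESS` state, or is too young to have been handed back to its job.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical grace period; not taken from the actual cleanup script.
GRACE_PERIOD = timedelta(minutes=30)


def stacks_safe_to_delete(stacks, active_stack_names, now=None):
    """Return names of stacks that look orphaned and are safe to delete.

    stacks: list of dicts with 'name', 'status', and 'created'
            (a timezone-aware datetime). This shape is an assumption.
    active_stack_names: names of stacks still attached to running jobs.
    """
    now = now or datetime.now(timezone.utc)
    safe = []
    for stack in stacks:
        if stack["name"] in active_stack_names:
            continue  # still in use by a running job
        if stack["status"].endswith("IN_PROGRESS"):
            continue  # still being created, updated, or deleted
        if now - stack["created"] < GRACE_PERIOD:
            continue  # too young; may not have been passed to its job yet
        safe.append(stack["name"])
    return safe
```

The age check is the part that addresses the race: even if a stack has just finished creating and no job has claimed it yet, it is left alone until the grace period has passed.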

Comment by Thanh Ha (zxiiro) [ 04/Jan/18 ]

Proposed patches:

Merged.

Comment by Thanh Ha (zxiiro) [ 05/Jan/18 ]

After poking at it more deeply, I suspect the previous patch might not solve the problem. I decided to additionally add the patch https://git.opendaylight.org/gerrit/66883 to improve the debug output and make it more useful. Hopefully we can get more useful info the next time it happens.

Comment by Thanh Ha (zxiiro) [ 05/Jan/18 ]

With the additional logging we got from the stack patches last night, we now know what the real issue is. The error message is:

Resource CREATE failed: ResourceInError: resources.vm_0_group.resources[0].resources.instance: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500

This tells me we're using too many robot systems. Taking a look at releng/builder, I noticed that we now have both 1c and 2c robot nodes, each allowing 25 robots to run in parallel. This is not a good idea. Unfortunately, with robot systems every job should use the same robot VM type so that we can properly limit the max nodes. I think the next steps here are:

1. Contact the cloud provider (Done)
2. Switch all releng/builder jobs to the 2c robot nodes
3. If necessary reduce the limit of robot vms further
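The capacity concern behind these steps can be shown with a bit of arithmetic. The sketch below is only an illustration: the label names `robot-1c` and `robot-2c` are placeholders, and it assumes the per-type limits apply independently of each other, so the effective ceiling is their sum.

```python
def max_parallel_robots(caps_by_node_type):
    """Worst-case number of robot VMs if each node type has its own cap."""
    return sum(caps_by_node_type.values())


# Two robot node types, each capped at 25 (placeholder label names).
before = max_parallel_robots({"robot-1c": 25, "robot-2c": 25})

# A single robot node type capped at 25.
after = max_parallel_robots({"robot-2c": 25})

print(before, after)  # 50 25
```

With one node type, the configured limit is the real limit, which is what makes step 3 (reducing the limit further) meaningful.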

Comment by Thanh Ha (zxiiro) [ 05/Jan/18 ]

Proposed patch to move all robots to the same type: https://git.opendaylight.org/gerrit/66904

Comment by Thanh Ha (zxiiro) [ 05/Jan/18 ]

Add a test (https://git.opendaylight.org/gerrit/66905) to check that we only ever use 1 robot node. This ensures future changes do not miss this detail.
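A check of this kind could look roughly like the Python sketch below. It is not the content of the Gerrit change above. It assumes that job definitions set a `node:` key and that robot labels contain the word "robot"; the function names and the label names used with it are hypothetical.

```python
import re

# Assumption: robot node labels contain the word "robot".
ROBOT_LABEL = re.compile(r"^\s*node:\s*(\S*robot\S*)\s*$", re.MULTILINE)


def robot_labels(jjb_text):
    """Collect the distinct robot node labels referenced in JJB YAML text."""
    return set(ROBOT_LABEL.findall(jjb_text))


def check_single_robot_label(jjb_text):
    """Raise ValueError if more than one robot node label is in use."""
    labels = robot_labels(jjb_text)
    if len(labels) > 1:
        raise ValueError(f"multiple robot node labels in use: {sorted(labels)}")
    return labels
```

Run against the concatenated job definitions, a check like this would fail the build as soon as a second robot label is introduced.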

Comment by Thanh Ha (zxiiro) [ 12/Jan/18 ]

Infra seems to be stable now, so closing this off as resolved. It seems the changes we made last week made things a lot better.

Generated at Wed Feb 07 20:37:26 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.