[NETVIRT-512] CSIT Sporadic failures - tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_stop_start Created: 06/Mar/17  Updated: 27/Nov/19

Status: Confirmed
Project: netvirt
Component/s: General
Affects Version/s: Oxygen
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Srinivas Rachakonda
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: failing.log, passing.log
External issue ID: 7909

 Description   

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-carbon/407/archives/log.html.gz#s1-s2-s1-t11



 Comments   
Comment by Jamo Luhrsen [ 06/Jul/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-transparent-carbon/568/log.html.gz#s1-s2-s1-t11

Comment by Jamo Luhrsen [ 18/Jul/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-learn-carbon/334/log.html.gz#s1-s2-s1-t11

Comment by Jamo Luhrsen [ 27/Sep/17 ]

No longer seen in CSIT.

Comment by Jamo Luhrsen [ 07/Mar/18 ]

This is back now, and we need to track it. Starting it out as an Oxygen blocker since it was introduced recently and is very consistent.

Comment by Jamo Luhrsen [ 07/Mar/18 ]

It's consistent in Queens as well as Pike. Pike is easier to debug because the OpenStack services keep their logs separate. Apparently, in devstack Queens, the logs are all munged together in the system messages file.

Some clues:

  • The tempest test stops the instance and then tries to start it again to verify that connectivity
    remains (a minimal sketch of that flow follows this list). But the instance never comes back and is
    stuck in the SHUTOFF state. We collect debugs in each tempest scenario test before the test case
    tears down; here you can see the server list, which shows the instance stuck in SHUTOFF.
  • In the compute node's nova-cpu log there is this line:
    2018-03-05 05:24:08.014 27550 WARNING nova.virt.libvirt.driver [req-9f7b4144-e42a-49fd-b17b-6a53de727ac8 service nova] [instance: 11442853-e52d-43ff-af61-e028b62cd5ea] Timeout waiting for vif plugging callback for instance with vm_state stopped and task_state powering-on.: Timeout: 300 seconds

    That warning shows up a couple of minutes after the test has already given up and marked the run a failure.

  • It looks like this showed up sometime between Feb 28th and Mar 1st. Be careful when you look at the Mar 1st link because it still shows a failure, but it's not the SHUTOFF failure; it's failing because of one of our sporadic connectivity issues. If you look at the graph you can see that the test case was always passing, then there were two sporadic connectivity failures, and finally the rest are all for this SHUTOFF issue.
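
For reference, here is a minimal sketch of that stop/start flow using openstacksdk. The cloud name "devstack" and server name "test-server" are placeholders for illustration, not values taken from the CSIT job.

    # Minimal sketch of the stop/start sequence the tempest scenario exercises.
    # Cloud and server names are placeholders for illustration only.
    import openstack

    conn = openstack.connect(cloud="devstack")
    server = conn.compute.find_server("test-server", ignore_missing=False)

    # Stop the instance and wait until nova reports SHUTOFF.
    conn.compute.stop_server(server)
    server = conn.compute.wait_for_server(server, status="SHUTOFF", wait=300)

    # Start it again. In the failing runs this is where the instance never
    # comes back: it stays in SHUTOFF until nova's vif-plugging wait times out.
    conn.compute.start_server(server)
    server = conn.compute.wait_for_server(server, status="ACTIVE", wait=300)
    print("server status after restart:", server.status)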
Comment by Daniel Farrell [ 08/Mar/18 ]

k.faseela - Any update on this? Is someone actively working on it? Friendly reminder that we need to get this resolved ASAP as it's blocking the imminent Oxygen release.

Comment by Faseela K [ 08/Mar/18 ]

Jamo, the new failure is not anything on the flow-programming side; rather, the VM is not coming up, right?

I don't think I am the right person to debug this, some openstack expertise will be needed.

Vishal or Vivekanandan might be able to help?

Comment by Faseela K [ 08/Mar/18 ]

shague jluhrsen : Had a chat with Vishal; he says this needs to be looked at by someone who has nova/neutron expertise. He suggested checking with yamahata.

Comment by Daniel Farrell [ 09/Mar/18 ]

Has someone started digging into this? yamahata? Maybe jhershbe could help?

Comment by Jamo Luhrsen [ 10/Mar/18 ]

I've dug around a lot, looking for something obvious, but not really finding anything blatant.

There is a difference in some of the nova logs when I compare a passing job to a failing job.
(Attached are the outputs of the compute node's journalctl.log file for all entries matching the
relevant instance's UUID.)

In the passing job (timestamp Feb 08 07:03:54) you will see that the instance had the lifecycle
event "VM Resumed". That is not there in the failing job (timestamp Mar 07 23:00:51), where the
lifecycle event is "VM Started". In the failing log, you can next see (less than a second later) that
the instance gets a "VM Paused" event. I'm assuming that's what leaves it stuck in the SHUTOFF state
and ruins the test.
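
As a reference, here is a small stdlib-only sketch of how those lifecycle lines can be pulled out of a saved journalctl log for one instance. The log path, UUID argument, and the exact "VM ... (Lifecycle Event)" message format are assumptions based on the lifecycle events described above.

    # Pull nova lifecycle-event lines for a single instance out of a saved
    # journalctl log, roughly how the attached passing.log/failing.log
    # comparison was made. Log path and instance UUID come from the CLI.
    import re
    import sys

    LIFECYCLE = re.compile(r"VM (\w+) \(Lifecycle Event\)")

    def lifecycle_events(log_path, instance_uuid):
        """Yield (timestamp, event) pairs for one instance's lifecycle lines."""
        with open(log_path, errors="replace") as log:
            for line in log:
                if instance_uuid not in line:
                    continue
                match = LIFECYCLE.search(line)
                if match:
                    # journalctl lines start with "Mon DD HH:MM:SS host ..."
                    yield " ".join(line.split()[:3]), match.group(1)

    if __name__ == "__main__":
        for stamp, event in lifecycle_events(sys.argv[1], sys.argv[2]):
            print(stamp, event)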

I'm trying to think of ways to rule out OpenDaylight as the culprit so we can move this out of
blocker state, but we need to know why that instance got cycled off so quickly after we tried to
restart it.

 

Comment by Jamo Luhrsen [ 11/Mar/18 ]

I was wondering if this could somehow be a bug outside of ODL, maybe on the OpenStack side of things.
I tried running the Queens jobs with things rolled back to mid-February (when the bug wasn't happening),
but that did not help (I rolled back devstack, networking-odl, neutron and nova).

However, I did run Oxygen with the Ocata branch of openstack and the bug did not show up. log here

I'm not sure what to make of this yet.

Comment by Josh Hershberg [ 12/Mar/18 ]

So I also suspect that this might be an openstack issue or even a libvirt issue. When I looked at this last week I noticed this test was failing on the start (unpause) of the VM after it was stopped. Turns out, neutron and ODL are not involved in that process at all.

Comment by Daniel Farrell [ 12/Mar/18 ]

jluhrsen - Is pushing on this hard, looking like it's an OpenStack bug so far, will update by EOD PT.

Comment by Jamo Luhrsen [ 13/Mar/18 ]

> Jamo Luhrsen - Is pushing on this hard, looking like it's an OpenStack bug so far, will update by EOD PT.

It took me all day to get my local setup working, and I have not gotten this specific test to run yet. I'll hopefully get some time in my
late evening to work on it, but there is no specific news to update with now as I had intended. Stay tuned...

Comment by Sam Hague [ 13/Mar/18 ]

There are earlier jobs with the failure back to 2/8/18:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-upstream-stateful-oxygen/144/robot-plugin/log_05_tempest.html.gz

pike: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-pike-upstream-stateful-oxygen/190/robot-plugin/log_05_tempest.html.gz

  • but the Pike one looks like a different failure in the test
Comment by Jamo Luhrsen [ 14/Mar/18 ]

This is not a bug in OpenDaylight itself. I have removed the blocker status, but I'm leaving the bug open for now
until we have the appropriate OpenStack bug filed and tracked.

It appears this patch in nova is what introduced the failure. However, it's possible (and theorized by Mohamed Naser) that
this could be a problem with networking_odl not actually unplugging the port when it should. I don't totally understand those
details yet. That said, when we run our CSIT with nova rolled back to a commit hash just before the patch mentioned earlier,
the failure goes away.

We've also noticed that non-voting tempest tests that run against networking-odl also started failing around the same time
that this nova patch landed.
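
To make that theory a little more concrete, here is an illustrative Python simulation (an assumption-laden toy model, not nova's or networking-odl's actual code) of the wait nova performs for the network-vif-plugged notification before powering the instance back on; nova's vif_plugging_timeout default of 300 seconds matches the warning quoted earlier.

    # Toy model of nova waiting for neutron's "network-vif-plugged" external
    # event before powering the instance back on. If the event never arrives
    # (per the theory above, because the port was never unplugged/re-plugged),
    # the wait times out and the instance is left in SHUTOFF.
    import threading

    VIF_PLUGGING_TIMEOUT = 5  # seconds here; nova's vif_plugging_timeout defaults to 300

    vif_plugged = threading.Event()

    def receive_network_vif_plugged_event():
        # Model the passing case by uncommenting the next line.
        # vif_plugged.set()
        pass

    receive_network_vif_plugged_event()

    if vif_plugged.wait(timeout=VIF_PLUGGING_TIMEOUT):
        print("vif plugged -> power on the instance")
    else:
        print("Timeout waiting for vif plugging callback; instance stays SHUTOFF")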

Comment by Sam Hague [ 05/Apr/18 ]

https://review.openstack.org/553035 was merged to revert the original nova patch that was causing the problem. Running new gates to re-enable the test and see if it is fixed.

Comment by Sam Hague [ 10/Apr/18 ]

New patch to fix original issue in nova: https://review.openstack.org/558001

Comment by Abhinav Gupta [ 25/Nov/19 ]

Any update here?

Comment by Jamo Luhrsen [ 25/Nov/19 ]

I'm not keeping track of netvirt jobs any more. Can we get someone to go through the tempest jobs and see if this test case is still failing sporadically or not?

Actually, it's not that hard to figure out. Here is a link for the Sodium job that shows that it does still fail sporadically.
