[NETVIRT-1019] CSIT Sporadic failures - port leak in tempest suite Created: 20/Nov/17  Updated: 23/Jul/18  Resolved: 23/Jul/18

Status: Verified
Project: netvirt
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Done Votes: 0
Labels: csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Priority: Normal

 Description   

Some ports are not cleaned up during our tempest suite and the suite teardown
is unable to delete the external network which has these left over ports.

Functionally, I think all the tempest test cases are passing for the most
part. We are only catching this because of our suite teardown which is
failing to clean up. This is probably one of our resource leaks others
have found from time to time.

I can't find the smoking gun, and it could be coming from multiple tempest
test cases, but I think we are getting left overs even from the tempest.api.network
group of tests. Now that we are collecting debug logs at the teardown()
portion of each tempest.scenario test we can see a list of the ports that
exist. If we compare a passing job [0] we can see there are less ports
from the failing job [1] in the first case after tempest.api.network. Then
you can see at the very end of those jobs that our passing case only has 4
ports in use (I'm assuming those are the ports being used by that one last
tempest scenario case). But, in the failing job, you can see 10 ports are
still in there, so proabably we've leaked 6 ports at that point.

I saw an update_port_precommit exception when digging around:
_*

2017-11-20 10:14:20.192 16601 DEBUG neutron.plugins.ml2.managers [req-0e73cc7b-709c-422d-a3cd-50e84233e2fc - -] DB exception raised by Mechanism driver 'opendaylight_v2' in update_port_precommit _call_on_drivers /opt/stack/neutron/neutron/plugins/ml2/managers.py:433
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers   File /opt/stack/neutron/neutron/plugins/ml2/managers.py, line 426, in _call_on_drivers
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers     getattr(driver.obj, method_name)(context)
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers   File /usr/lib/python2.7/site-packages/oslo_log/helpers.py, line 67, in wrapper
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers     return method(*args, **kwargs)
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers   File /opt/stack/networking-odl/networking_odl/ml2/mech_driver_v2.py, line 139, in update_port_precommit
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers     context, odl_const.ODL_PORT, odl_const.ODL_UPDATE)
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers   File /opt/stack/networking-odl/networking_odl/ml2/mech_driver_v2.py, line 109, in _record_in_journal
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers     ml2_context=context)
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers   File /opt/stack/networking-odl/networking_odl/journal/journal.py, line 121, in record
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers     raise exception.RetryRequest(e)
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers RetryRequest
2017-11-20 10:14:20.192 16601 ERROR neutron.plugins.ml2.managers 

*_

I can also take the mac address of a port I think has leaked and grep the
karaf.log and see that it did get an "add event" but never a "remove event".
Whereas you can see the add and remove in the passing job's karaf.log

[0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-nitrogen/443/log_tempest.html.gz
[1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/99/log_tempest.html.gz



 Comments   
Comment by Jamo Luhrsen [ 27/Nov/17 ]

This is still a problem even though we thought maybe Vic's patch could have fixed it. see this email

Comment by Jamo Luhrsen [ 08/Dec/17 ]

because of this test patch, this will not be visible in our upstream tempest suite. Please do not mark this issue as resolved

until a new test patch is made that will include a final test case in the tempest suite to try to delete the external-net. That

test case should fail if/when the port leak happens.

Comment by Jamo Luhrsen [ 11/Dec/17 ]

here's a test patch to try to catch the port leak:

https://git.opendaylight.org/gerrit/#/c/66366/

Comment by Jamo Luhrsen [ 19/Dec/17 ]

below you can see the router leak, which is connected to the external subnet. This could be a tempest
bug in cleanup, but it does not happen every time and it does not happen in the Ocata branch. We now
have a good comprehensive cleanup step in CSIT so this leak does not affect future tests/suites, but
the leak needs to be fixed nonetheless.

 

# ROUTER LISTING ("test" router is unrelated to this bug)
(overcloud) [stack@undercloud-0 ~]$ openstack router list
-----------------------------------------+-------------------------------+---------------------------------
| ID                                   | Name | Status | State | Distributed | HA    | Project                          |
-----------------------------------------+-------------------------------+---------------------------------
| cdfb3cdd-ef04-4ae6-a683-8d45fdb38a76 | test | ACTIVE | UP    | False       | False | cad71c1bc5044672b4a4277a1f26ed80 |
-----------------------------------------+-------------------------------+---------------------------------

 

#RUNNING TEMPEST. NOTE THE REGEX WHICH MAKES IT QUICK TO REPRODUCE:
(overcloud) [stack@undercloud-0 ~]$ sudo tempest run --conf /etc/tempest/tempest.conf --regex tempest.api.network.admin.*snat
{{running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} }}
{{OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} }}
{{OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} }}
{{${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover --list }}
{{running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} }}
{{OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} }}
{{OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} }}
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmprfPdKT
{{running=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} }}
{{OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} }}
{{OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-500} }}
${PYTHON:-python} -m subunit.run discover -t /usr/lib/python2.7/site-packages/tempest /usr/lib/python2.7/site-packages/tempest/test_discover  --load-list /tmp/tmpwY_75j
{{

{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_default_snat_value [3.470747s] ... ok}}
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_default_snat_value [4.851977s] ... ok
{{{1}

tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_create_router_with_snat_explicit [6.949165s] ... ok}}
{{

{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_snat_explicit [6.545988s] ... ok}}
{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_reset_gateway_without_snat [4.042365s] ... ok
{{{0}

tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_reset_gateway_without_snat [4.330543s] ... ok}}
{{

{1} tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_with_snat_explicit [3.337132s] ... ok}}
{0} tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_with_snat_explicit [3.372157s] ... ok
{{{1}

tempest.api.network.admin.test_routers.RoutersIpV6AdminTest.test_update_router_set_gateway_without_snat [4.280598s] ... ok}}
{{

{0}

tempest.api.network.admin.test_routers.RoutersAdminTest.test_update_router_set_gateway_without_snat [4.387563s] ... ok}}

======
Totals
======
Ran: 10 tests in 32.0000 sec.
 - Passed: 10
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 45.5682 sec.

==============
Worker Balance
==============
 - Worker 0 (5 tests) => 0:00:23.490230
 - Worker 1 (5 tests) => 0:00:22.082137

# WHEN THE BUG IS HIT, A TEMPEST ROUTER WILL BE LEFT OVER:
(overcloud) [stack@undercloud-0 ~]$ openstack router list
-------------------------------------------------------------------------------+-------------------------------+---------------------------------
| ID                                   | Name                                       | Status | State | Distributed | HA    | Project                          |
-------------------------------------------------------------------------------+-------------------------------+---------------------------------
| 31ede456-748e-4862-89a0-edbc7f2b5043 | tempest-RoutersAdminTest-router-1427583233 | ACTIVE | DOWN  | False       | False | cd2e6d38909746a3aaa5dbc09b1e30a5 |
| cdfb3cdd-ef04-4ae6-a683-8d45fdb38a76 | test                                       | ACTIVE | UP    | False       | False | cad71c1bc5044672b4a4277a1f26ed80 |
-------------------------------------------------------------------------------+-------------------------------+---------------------------------
{{(overcloud) [stack@undercloud-0 ~]$ }}

#BELOW IS SOME EXTRA INFO TO ILLUSTRATE THE PROBLEM WE USED TO HAVE IN CSIT:

}}{{(overcloud) [stack@undercloud-0 ~]$ openstack router show 31ede456-748e-4862-89a0-edbc7f2b5043
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                   | Value                                                                                                                                                                                  |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up          | DOWN                                                                                                                                                                                   |
| availability_zone_hints | None                                                                                                                                                                                   |
| availability_zones      | None                                                                                                                                                                                   |
| created_at              | 2017-12-19T18:14:44Z                                                                                                                                                                   |
| description             |                                                                                                                                                                                        |
| distributed             | False                                                                                                                                                                                  |
{{| external_gateway_info   | {"network_id": "53a4903e-7676-488c-89b6-6a2823c69dd7", "enable_snat": true, "external_fixed_ips": [

{"subnet_id": "fdd949f5-2e3f-4651-857f-9a21e7e51c80", "ip_address": "10.0.0.110"}

]} |}}
| flavor_id               | None                                                                                                                                                                                   |
| ha                      | False                                                                                                                                                                                  |
| id                      | 31ede456-748e-4862-89a0-edbc7f2b5043                                                                                                                                                   |
| name                    | tempest-RoutersAdminTest-router-1427583233                                                                                                                                             |
| project_id              | cd2e6d38909746a3aaa5dbc09b1e30a5                                                                                                                                                       |
| revision_number         | None                                                                                                                                                                                   |
| routes                  |                                                                                                                                                                                        |
| status                  | ACTIVE                                                                                                                                                                                 |
| tags                    |                                                                                                                                                                                        |
| updated_at              | 2017-12-19T18:14:45Z                                                                                                                                                                   |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack subnet list
---------------------------------------------------------------------------------------------------------+
| ID                                   | Name            | Network                              | Subnet           |
---------------------------------------------------------------------------------------------------------+
| 995fc643-821a-42b6-b779-0259a50f1e52 | test            | a1d225c0-4699-4de5-97f7-fa3c79df84de | 123.123.123.0/24 |
| fdd949f5-2e3f-4651-857f-9a21e7e51c80 | external-subnet | 53a4903e-7676-488c-89b6-6a2823c69dd7 | 10.0.0.0/24      |
---------------------------------------------------------------------------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack network list
----------------------------------------------------------------------------------
| ID                                   | Name     | Subnets                              |
----------------------------------------------------------------------------------
| 53a4903e-7676-488c-89b6-6a2823c69dd7 | external | fdd949f5-2e3f-4651-857f-9a21e7e51c80 |
| a1d225c0-4699-4de5-97f7-fa3c79df84de | net_sfc  | 995fc643-821a-42b6-b779-0259a50f1e52 |
----------------------------------------------------------------------------------
(overcloud) [stack@undercloud-0 ~]$ openstack network delete external
Failed to delete network with name or ID 'external': HttpException: Conflict (HTTP 409) (Request-ID: req-dd33fe64-3d4d-4a29-9c8f-66e198d12e39), Unable to complete operation on network 53a4903e-7676-488c-89b6-6a2823c69dd7. There are one or more ports still in use on the network.
1 of 1 networks failed to delete.
{{(overcloud) [stack@undercloud-0 ~]$ }}

 

Comment by Itzik Brown [ 20/Dec/17 ]

jluhrsen If this is a tempest cleanup problem we need to open a bug in Launchpad (https://launchpad.net/tempest).

 

 

Comment by Jamo Luhrsen [ 20/Dec/17 ]

itzikb, if someone is able to figure out the root cause and it turns out to be a tempest issue, they yes we should open a bug against the tempest project.

Comment by Waldemar Znoinski [ 22/Mar/18 ]
  1. test results html: http://file.rdu.redhat.com/~wznoinsk/tempest-results-troubleshooting_leaks.1.html
  2. test_create_router_with_default_snat_value[id-847257cc-6afd-4154-b8fb-af49f5670ce8] created the below two routers
    ------------------------------------------------------------------------------------------------------------------------------------------
    ID Name Status State Distributed HA Project

    ------------------------------------------------------------------------------------------------------------------------------------------

    10e8215d-8d70-4817-aa9c-5293fa9ac820 tempest-RoutersAdminTest-router-920392578 ACTIVE DOWN False False c73655d095634c29a73940d319e69853
    d3ad894a-ce27-44ec-b1fc-ee76106249b8 tempest-RoutersAdminTest-router-920392578 ACTIVE DOWN False False c73655d095634c29a73940d319e69853

    ------------------------------------------------------------------------------------------------------------------------------------------

  1. test_create_router_with_snat_explicit[id-ea74068d-09e9-4fd7-8995-9b6a1ace920f] that run immediatelly after the above created 5 routers
    ------------------------------------------------------------------------------------------------------------------------------------------
    ID Name Status State Distributed HA Project

    ------------------------------------------------------------------------------------------------------------------------------------------

    6883e9ce-4755-4b84-aa68-bb8fcdaf445d tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    a7b4b847-db14-4f30-90c5-c558f05d608b tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    d3ad894a-ce27-44ec-b1fc-ee76106249b8 tempest-RoutersAdminTest-router-920392578 ACTIVE DOWN False False c73655d095634c29a73940d319e69853
    dece5135-a2a4-4936-ab42-b6684e3bbaa8 tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    e546ec63-8404-4733-92e4-aaa3e821899e tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    fe13097e-9316-47d9-a5f9-5ef90c67d3a8 tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2

    ------------------------------------------------------------------------------------------------------------------------------------------

NOTE: the 'tempest-RoutersAdminTest-router-920392578' leftover from previous test that will never be deleted

  1. after all the 10 tests from http://file.rdu.redhat.com/~wznoinsk/tempest-results-troubleshooting_leaks.1.html has finished, router show:
    -------------------------------------------------------------------------------------------------------------------------------------------
    ID Name Status State Distributed HA Project

    -------------------------------------------------------------------------------------------------------------------------------------------

    29ed6a06-540f-4ed8-9e55-f7de399cadbc tempest-snat-router-1847363806 ACTIVE UP False False 4f581326639c4e14be8429b13c940e73
    346c340a-7bdc-463a-98e2-89632151eefa tempest-RoutersAdminTest-router-1590746193 ACTIVE DOWN False False c73655d095634c29a73940d319e69853
    a7b4b847-db14-4f30-90c5-c558f05d608b tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    b7236109-8e0a-4c3d-95bc-2964cecc648e tempest-RoutersAdminTest-router-1590746193 ACTIVE DOWN False False c73655d095634c29a73940d319e69853
    d3ad894a-ce27-44ec-b1fc-ee76106249b8 tempest-RoutersAdminTest-router-920392578 ACTIVE DOWN False False c73655d095634c29a73940d319e69853
    dece5135-a2a4-4936-ab42-b6684e3bbaa8 tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2
    fe13097e-9316-47d9-a5f9-5ef90c67d3a8 tempest-snat-router-1700489628 ACTIVE UP False False f5b47e9fbbd34fc1a7989deae81913c2

    -------------------------------------------------------------------------------------------------------------------------------------------

this may indicate subsequent router creation request(s) are being sent (guess: due to lack of timely response to previous ones?) and only some of them are being registered for resource_cleanup later

Comment by Waldemar Znoinski [ 22/Mar/18 ]

additionally when I run one test on its own

tempest.api.network.admin.test_routers.RoutersAdminTest.test_create_router_with_default_snat_value

I can reproduce the issue (I see 2 routers being created but only one cleaned up) in 1 every ~5 runs

Comment by Waldemar Znoinski [ 27/Mar/18 ]

I think the leftovers issue is more complicated than initially looked like:

  1. osp13 neutron (non-odl) job which run tempest tests normally ran by opendaylight jobs shows leftovers: http://file.tlv.redhat.com/~wznoinsk/sosreport-testqosneutrontempestplugin.scenario.testqos.QoSTest-20180326234802/sos_commands/openstack_neutron/openstack_router_list the names of these left over routers indicate they were left by "neutron tempest plugin" - https://github.com/openstack/neutron-tempest-plugin/blob/118b52fb000f735267a0955a6ace95b4da39d63c/neutron_tempest_plugin/scenario/base.py#L173
  2. the osp13 odl job has both type of leftovers (the ones from tempest plugin as neutron job above) + routers created by the test from the main 'tempest' repository (router has 'className' in the name), the leftovers: http://file.tlv.redhat.com/~wznoinsk/sosreport-testtwosecgroupsneutrontempestplugin.scenario.testsecuritygroups.NetworkSecGroupTest-20180321094446/sos_commands/openstack_neutron/openstack_router_list , the way tempest creates router names (with ClassName): https://github.com/wznoinsk/tempest/blob/master/tempest/scenario/manager.py#L1090 

launchpad bug filled for the point 1. above: https://bugs.launchpad.net/neutron/+bug/1759321 

re the point 2. it doesn't seem like tempest issue (otherwise we'd see any of these leftovers in non-odl testing - point 1 above - because we run exactly the same test suite), it needs more investigation

 

Comment by Jamo Luhrsen [ 10/Apr/18 ]

when we have this fixed, this CSIT patch will help us catch any future regressions.

Here's what a failure would look like

Comment by OpenDaylight Release [ 03/May/18 ]

Since the bug is unassigned I'm currently assigning it to you.

Please assign to the relevant person. 

Comment by Sam Hague [ 25/May/18 ]

jluhrsen is this still an issue? The test case to catch port leaks was not merged yet.

Comment by Jamo Luhrsen [ 25/May/18 ]

yes.

we have a CSIT patch which will catch this in automation, but since we
dont want to see failures in our job we can't add it yet. Also Red Hat has a bug
as I know it affects their deployments. oh, and OPNFV complains about it too.

Comment by Jamo Luhrsen [ 12/Jul/18 ]

some patches in networking-odl that will hopefully resolve this issue:

https://review.openstack.org/#/c/575126/
https://review.openstack.org/#/c/575476/

Comment by Jamo Luhrsen [ 23/Jul/18 ]

This was a fix on the openstack side:
https://review.openstack.org/#/c/575476/

Comment by Jamo Luhrsen [ 23/Jul/18 ]

the openstack fix was merged on Jul 19th, 2018.

we can see a job the day before with leaked routers , and a job the next day with no leak

Generated at Wed Feb 07 20:23:02 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.