[NETVIRT-896] CSIT Sporadic failures - tempest.api.network precommit_delete failure on security group delete Created: 08/Sep/17  Updated: 03/May/18  Resolved: 05/Apr/18

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Nitrogen, Oxygen
Fix Version/s: Carbon-SR3

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: Isaku Yamahata
Resolution: Done Votes: 0
Labels: csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File server.log.xz     File sg-precommit-delete.diff     File sg-precommit-delete.diff     File sg-precommit-exception.diff    
External issue ID: 9130
Priority: High

 Description   

This CSIT job:

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/3/log.html.gz

tempest output snippet:

{2} tearDownClass (tempest.api.network.test_extensions.ExtensionsTestJSON) [0.000000s] ... FAILED

Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "tempest/test.py", line 190, in tearDownClass
    six.reraise(etype, value, trace)
  File "tempest/test.py", line 173, in tearDownClass
    teardown()
  File "tempest/test.py", line 477, in clear_credentials
    cls._creds_provider.clear_creds()
  File "tempest/lib/common/dynamic_creds.py", line 459, in clear_creds
    self._cleanup_default_secgroup(creds.tenant_id)
  File "tempest/lib/common/dynamic_creds.py", line 406, in _cleanup_default_secgroup
    nsg_client.delete_security_group(secgroup['id'])
  File "tempest/lib/services/network/security_groups_client.py", line 59, in delete_security_group
    return self.delete_resource(uri)
  File "tempest/lib/services/network/base.py", line 41, in delete_resource
    resp, body = self.delete(req_uri)
  File "tempest/lib/common/rest_client.py", line 301, in delete
    return self.request('DELETE', url, extra_headers, headers, body)
  File "tempest/lib/common/rest_client.py", line 659, in request
    self._error_checker(resp, resp_body)
  File "tempest/lib/common/rest_client.py", line 780, in _error_checker
    raise exceptions.Conflict(resp_body, resp=resp)
tempest.lib.exceptions.Conflict: An object with that identifier already exists
Details: {u'message': u'Security Group 462df320-f024-4810-85a7-5b1195ee3602 cannot perform precommit_delete due to Callback networking_odl.common.callback.OdlSecurityGroupsHandler.sg_callback_precommit--9223372036854618144 failed with "argument of type \'NoneType\' is not iterable".', u'type': u'SecurityGroupInUse', u'detail': u''}
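For reference, the text "argument of type 'NoneType' is not iterable" is exactly what CPython raises when a membership ("in") test is run against None, which suggests the precommit callback performed an "in" check on a value that arrived as None. A minimal interpreter session reproducing the message (illustrative only, not networking-odl code):

>>> 'some-rule-id' in None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument of type 'NoneType' is not iterable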

 



 Comments   
Comment by Jamo Luhrsen [ 18/Sep/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/16/log.html.gz#s1-s2-s1

Comment by Jamo Luhrsen [ 27/Sep/17 ]

another example:

snippet from the neutron log [0]:

2017-09-27 14:36:11.991 15977 DEBUG networking_odl.journal.journal [req-feaf99a4-cde1-4f5e-8f45-907690caa0a2 - -] Start processing journal entries sync_pending_entries /opt/stack/networking-odl/networking_odl/journal/journal.py:221
2017-09-27 14:36:12.010 15977 DEBUG neutron.api.rpc.handlers.resources_rpc [req-96f23e4d-1b20-47bc-8c27-e8badfa60e43 tempest-RoutersTest-1789903320 tempest-RoutersTest-1789903320] neutron.api.rpc.handlers.resources_rpc.ResourcesPushRpcApi method push called with arguments (<neutron_lib.context.Context object at 0xa9cacd0>, [Port(admin_state_up=True,allowed_address_pairs=[],binding=PortBinding,binding_levels=[],created_at=2017-09-27T14:36:07Z,data_plane_status=<?>,description='',device_id='4f943f20-2fb8-4a03-b4fb-a67955677cde',device_owner='network:router_interface',dhcp_options=[],distributed_binding=None,dns=None,fixed_ips=[IPAllocation],id=3d3cacc1-f87b-4bbc-96af-cc612de0f0be,mac_address=fa:16:3e:f3:40:5f,name='',network_id=79f4139a-7bbb-4577-8905-c3202156d36d,project_id='e59173ac2305426f9a849edff8b58b0a',qos_policy_id=None,revision_number=4,security=PortSecurity(3d3cacc1-f87b-4bbc-96af-cc612de0f0be),security_group_ids=set([]),status='DOWN',updated_at=2017-09-27T14:36:10Z)], 'updated') {} wrapper /usr/lib/python2.7/site-packages/oslo_log/helpers.py:66
2017-09-27 14:36:12.015 15977 DEBUG neutron_lib.callbacks.manager [req-112bdc5c-9147-4ee8-9959-63083496c140 admin admin] Callback networking_odl.common.callback.OdlSecurityGroupsHandler.sg_callback_precommit--9223372036846803568 raised argument of type 'NoneType' is not iterable _notify_loop /usr/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py:184
2017-09-27 14:36:12.020 15976 DEBUG neutron.quota.resource [req-db17e372-3bb5-4f33-84f4-f0f9ec2390d2 tempest-SecGroupIPv6Test-1702529477 tempest-SecGroupIPv6Test-1702529477] Synchronizing usage tracker for tenant:4047e9db93a744f48d093e3c3d9e914b on resource:security_group_rule resync /opt/stack/neutron/neutron/quota/resource.py:245
2017-09-27 14:36:12.023 15977 DEBUG oslo_concurrency.lockutils [req-96f23e4d-1b20-47bc-8c27-e8badfa60e43 tempest-RoutersTest-1789903320 tempest-RoutersTest-1789903320] Lock "event-dispatch" released by "neutron.plugins.ml2.ovo_rpc.dispatch_events" :: held 0.143s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2017-09-27 14:36:12.027 15976 DEBUG networking_odl.journal.journal [req-996ea9ed-d917-4ec8-98c7-06ca9a443272 - -] No journal entries to process sync_pending_entries /opt/stack/networking-odl/networking_odl/journal/journal.py:225
2017-09-27 14:36:12.027 15977 INFO neutron.api.v2.resource [req-112bdc5c-9147-4ee8-9959-63083496c140 admin admin] delete failed (client error): There was a conflict when trying to complete your request.
2017-09-27 14:36:12.029 15977 INFO neutron.wsgi [req-112bdc5c-9147-4ee8-9959-63083496c140 admin admin] 10.29.13.105 "DELETE /v2.0/security-groups/90cf7332-acac-40ca-87f2-abb2484cc9f1 HTTP/1.1" status: 409 len: 536 time: 1.0451260

Nothing stood out to me in the karaf.log [1]. For reference with these comments, the tempest log is here [2].

[0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/27/control_1/q-svc.log.2017-09-27-132116.gz
[1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/27/odl1_karaf.log.gz
[2] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/27/tempest/tempest.log.gz
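
To make the failure path in the q-svc log easier to follow, here is a minimal, self-contained Python sketch of what the log shows: the callbacks manager's notify loop catches the TypeError raised by the precommit handler, the delete is aborted, and the API layer reports a 409 Conflict back to the client. This is illustrative only, not neutron or networking-odl code; all names and the kwargs shape are stand-ins.

class CallbackFailure(Exception):
    """Raised when any subscribed callback fails during notification."""

def sg_callback_precommit(resource, event, trigger, **kwargs):
    # Stand-in for the networking-odl handler: a membership test against a
    # kwargs value that unexpectedly arrives as None reproduces the logged
    # "argument of type 'NoneType' is not iterable" error.
    if 'some-rule-id' in kwargs.get('security_group_rule_ids'):
        pass

def notify_loop(resource, event, trigger, **kwargs):
    # Simplified notify loop: run each subscriber, collect exceptions, and
    # fail the whole notification if any subscriber raised.
    errors = []
    for callback in (sg_callback_precommit,):
        try:
            callback(resource, event, trigger, **kwargs)
        except Exception as exc:
            errors.append(exc)   # logged as "Callback ... raised ..."
    if errors:
        raise CallbackFailure(errors)

try:
    notify_loop('security_group', 'precommit_delete', None,
                security_group_rule_ids=None)
except CallbackFailure as exc:
    # In Neutron this surfaces as SecurityGroupInUse and the client gets an
    # HTTP 409 Conflict, which is what tempest's teardown trips over.
    print('delete failed (client error):', exc)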

Comment by Aswin Suryanarayanan [ 28/Sep/17 ]

From an initial look, this appears to be the cleanup of resources failing. It is not always the same test case that fails: in run [1] it is tempest.api.network.test_extensions.ExtensionsTestJSON and in run [2] it is tempest.api.network.test_networks.NetworksIpV6Test. So are we too fast in doing the cleanup?

The error states that the SG we try to clean up is in use, which should mean it is associated with a port.

Are the tests always run in order, or is it purely random?

[1]
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/3/log.html.gz
[2]
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/27/control_1/q-svc.log.2017-09-27-132116.gz

Comment by Jamo Luhrsen [ 28/Sep/17 ]

(In reply to Aswin Suryanarayanan from comment #3)
> Are the tests always run in order, or is it purely random?

Aswin,

Good catch on realizing the same precommit error on SG was happening in different test
cases. AFAIK, by default tempest will run its sub-tests in parallel, so I guess that
probably means there is no guaranteed order. And it's not happening every
time either.

This may be a tough one to debug. Maybe we need some sort of extra log level turned
on for a specific module?

I can look into running the tests in sequence to see if that guarantees we never see
the failure, but we will need to fix this anyway, so not sure if it's worth the
effort.

any other ideas for debugging?

JamO

Comment by Aswin Suryanarayanan [ 03/Oct/17 ]

This has not been observed after run 32 [1]. Has anything changed here, or is it part of the random behavior? I am not able to reproduce this locally either.

[1]https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/36/log.html.gz

Comment by Aswin Suryanarayanan [ 03/Oct/17 ]

It is back again
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/38/log.html.gz#s1-s2-s1

Jamo,

Given the sporadic nature of the failure, the parallel execution of tempest tests could be a reason: for example, while an SG is associated with a port in one test, another test tries to delete/clean up the SGs. But I don't have an answer for why it happens only on Pike.
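
To make that hypothesis concrete, here is a contrived, self-contained sketch (invented names and forced timing, not networking-odl code) where one worker clears shared security-group state while another still runs a membership check against it; with unlucky timing the check hits None and raises the same TypeError:

import threading
import time

shared = {'sg_rule_ids': ['rule-1']}   # stand-in for per-SG state visible to both workers
errors = []

def cleanup_worker():
    # Simulates a parallel test/cleanup path clearing the state.
    shared['sg_rule_ids'] = None

def precommit_worker():
    # Simulates the precommit check; the sleep forces the unlucky ordering
    # that parallel tempest workers would only hit occasionally.
    time.sleep(0.1)
    try:
        if 'rule-1' in shared['sg_rule_ids']:
            pass
    except TypeError as exc:
        errors.append(exc)

workers = [threading.Thread(target=cleanup_worker),
           threading.Thread(target=precommit_worker)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(errors)   # [TypeError("argument of type 'NoneType' is not iterable")]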

Comment by Jamo Luhrsen [ 03/Oct/17 ]

(In reply to Aswin Suryanarayanan from comment #6)
> It is back again
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/38/log.html.gz#s1-s2-s1

Yeah, here is a good graphic to illustrate its sporadic nature:

https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/netvirt%20csit/job/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/38/robot/netvirt-1node-openstack.txt/Tempest/Tempest/tempest.api.network/graph?zoomSignificant=false&failedOnly=false&criticalOnly=false&maxBuildsToShow=0&hd=true

> Jamo,
>
> Given the sporadic nature of the failure, the parallel execution of tempest
> tests could be a reason: for example, while an SG is associated with a port in
> one test, another test tries to delete/clean up the SGs. But I don't have an
> answer for why it happens only on Pike.

From what I know about tempest, it usually uses unique names/ids for
the things each test is creating/updating/deleting. So, I think we should
assume that even running in parallel, there would not be one test case
modifying an SG that is being used by another test case. I could be wrong,
but that seems a reasonable assumption. Otherwise, I think a lot of the
community using tempest in parallel would run into problems all the time.

Comment by Jamo Luhrsen [ 03/Oct/17 ]

(In reply to Jamo Luhrsen from comment #7)

are there any debugs we can enable on the ODL side that might help?

Comment by Aswin Suryanarayanan [ 05/Oct/17 ]

(In reply to Jamo Luhrsen from comment #8)
> are there any debugs we can enable on the ODL side that might help?

My assumption is that this is related to neutron/networking-odl. I didn't see anything interesting in the karaf log either, so nothing comes to mind that could help from the ODL side.

Comment by Jamo Luhrsen [ 05/Oct/17 ]

(In reply to Aswin Suryanarayanan from comment #9)
> My assumption is that this is related to neutron/networking-odl. I didn't see
> anything interesting in the karaf log either, so nothing comes to mind that
> could help from the ODL side.

random thought:
I wonder if you could artificially load your local environment where you are
trying unsuccessfully to reproduce this issue. There has always been some
concern that our shared upstream cloud environment for our CSIT systems can
be heavily loaded at times. I wonder if that could cause certain timing situations
that would uncover some races?

Comment by Aswin Suryanarayanan [ 24/Oct/17 ]

There was a bug related to this fixed in tempest [1]. But we should already be running the latest?

[1]https://bugs.launchpad.net/tempest/+bug/1623691

Comment by Jamo Luhrsen [ 02/Nov/17 ]

I saw this in the sandbox tonight (11/1/2017). I'll keep it open for a while and keep checking
in releng.

Comment by Zhijiang Hu [ 02/Jan/18 ]

We also encountered this problem yesterday in OPNFV functest [1]; the log is at [2] and the related ODL version is [3].

[1]https://build.opnfv.org/ci/job/functest-daisy-baremetal-daily-master/240/console

[2]http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-01_13-53-07/functest.log

[3]http://cbs.centos.org/repos/nfv7-opendaylight-6-candidate/x86_64/os/Packages/opendaylight-6.3.0-0.1.20171211rel2065.el7.noarch.rpm

 

 

Comment by Sam Hague [ 09/Jan/18 ]

Latest runs still have the failure as of 1/8/18: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-pike-upstream-stateful-oxygen/114/log_05_tempest.html.gz

Comment by Jamo Luhrsen [ 09/Jan/18 ]

Using a local deployment (tripleo based), I can reproduce this with the following tempest run:

    --concurrency 2 --regex '(tempest.api.network.test_security_groups)'

that runs 43 tests and sometimes one or more will come back failed with this kind of error:

Details: {u'message': u'Security Group aa4f7835-44d9-489c-861b-552d25922d81 cannot perform precommit_delete due to Callback networking_odl.common.callback.OdlSecurityGroupsHandler.sg_callback_precommit-765033 failed with "argument of type \'NoneType\' is not iterable".', u'type': u'SecurityGroupInUse', u'detail': u''}

If concurrency is set to 1, I can't reproduce the problem.

Comment by Isaku Yamahata [ 10/Jan/18 ]

Can you please enable the debug option for neutron and get the log? Unfortunately the root-cause exception is masked.

Also, can you try the following patch? This patch is based on a pure guess, though.

Comment by Isaku Yamahata [ 10/Jan/18 ]

diff --git a/networking_odl/ml2/mech_driver_v2.py b/networking_odl/ml2/mech_driver_v2.py
index 44306f8ed..5a773b080 100644
--- a/networking_odl/ml2/mech_driver_v2.py
+++ b/networking_odl/ml2/mech_driver_v2.py
@@ -194,7 +194,8 @@ class OpenDaylightMechanismDriver(api.MechanismDriver):
         # be deleted. At NeutronDB layer rules are auto deleted with
         # cascade='all,delete'.
         if (object_type == odl_const.ODL_SG and
-                operation == odl_const.ODL_DELETE):
+                operation == odl_const.ODL_DELETE and
+                kwargs['security_group_rule_ids']):
             for rule_id in kwargs['security_group_rule_ids']:
                 journal.record(context, odl_const.ODL_SG_RULE,
                                rule_id, odl_const.ODL_DELETE, [object_uuid])
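
Functionally, the added condition just guards the rule-deletion loop so a None (or empty) security_group_rule_ids value is skipped instead of being iterated. A tiny standalone sketch of that behavior (the helper name is invented for illustration, this is not the real mech_driver_v2.py code):

def record_rule_deletes(kwargs, record):
    # With the guard: None or an empty list short-circuits and nothing is iterated.
    rule_ids = kwargs.get('security_group_rule_ids')
    if rule_ids:
        for rule_id in rule_ids:
            record(rule_id)

record_rule_deletes({'security_group_rule_ids': None}, print)          # no-op, no TypeError
record_rule_deletes({'security_group_rule_ids': ['r1', 'r2']}, print)  # prints r1, r2
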
Comment by Isaku Yamahata [ 10/Jan/18 ]

sg-precommit-delete.diff

Comment by Isaku Yamahata [ 10/Jan/18 ]

sg-precommit-delete.diff

Comment by Jamo Luhrsen [ 12/Jan/18 ]

yamahata, I applied the changes to my setup, but the problem persists.

Attached is the neutron server log (DEBUG logs) when running with your changes. I ran this tempest command,
which seems to be using 4 workers (concurrency).

tempest run --regex '(tempest.api.network.test_security_groups)'

I can easily try more patches/fixes, if you have things to try.

Comment by Isaku Yamahata [ 13/Jan/18 ]

sg-precommit-exception.diff
Okay, the log helped a lot. Can you please try the patch?
If it works, I'm going to bake it.

Comment by Isaku Yamahata [ 15/Jan/18 ]

the patch was uploaded to https://review.openstack.org/#/c/533706/

Comment by Zhijiang Hu [ 22/Jan/18 ]

Hello,

Thanks for catching this bug. I want to add that this also affects OpenStack Pike + ODL Carbon:

http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-21_19-53-22/functest.log

So when the patch is ready, could it be cherry-picked to networking-odl's stable/pike branch?

Comment by Jamo Luhrsen [ 23/Jan/18 ]

yamahata, local tests with your patch are passing for me. I've run the tempest.api.network.test_security_groups bucket
5 times back to back without failure using concurrency == 4. (NOTE: I hit it every time without the patch.) Also, the full
tempest.api.network bucket is passing fine.

looking forward to getting your patch merged and ODL CSIT not hitting this any more.

Much appreciated!

Comment by Jamo Luhrsen [ 25/Jan/18 ]

CaseyCain, can we get someone to update this Jira with an explanation of why it is a blocker?

BTW, in case it's not clear in the comments above, this bug has its fix coming on the OpenStack
side, not the OpenDaylight side. I don't know how fast that will happen.

Comment by Zhijiang Hu [ 26/Jan/18 ]

jluhrsen The reason I think it was a blocker for OPNFV MS3.2 is that I saw it happen many times in OPNFV functest (with the Daisy installer, which deploys OpenStack Pike + ODL Carbon), for example: http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-21_19-53-22/functest.log

 

Following this JIRA ticket, I got the solution patch (https://review.openstack.org/#/c/533706/) and patched it downstream into Daisy's repo first (https://gerrit.opnfv.org/gerrit/#/c/51031/). Functest for Daisy's sdn-nofeature scenario is stable now, so from the Daisy installer side it is no longer a blocker for OPNFV MS3.2. Looking forward to getting the patch merged upstream.

Comment by Jaime Caamaño Ruiz [ 26/Jan/18 ]

For the sake of completeness, this patch also got rid of the problem:

https://review.openstack.org/#/c/536935/

Comment by Jamo Luhrsen [ 26/Jan/18 ]

Thanks jaicaa.

Any chance we can get this ported to Pike? Here's my try for that:
  https://review.openstack.org/#/c/538352/

Comment by Kit Lou [ 09/Feb/18 ]

Changing severity to major as we have confirmation that this is not a blocker.
