[NETVIRT-896] CSIT Sporadic failures - tempest.api.network precommit_delete failure on security group delete Created: 08/Sep/17 Updated: 03/May/18 Resolved: 05/Apr/18 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | Nitrogen, Oxygen |
| Fix Version/s: | Carbon-SR3 |
| Type: | Bug | Priority: | Medium |
| Reporter: | Jamo Luhrsen | Assignee: | Isaku Yamahata |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:failures | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 9130 |
| Priority: | High |
| Description |
|
this csit job: tempest output snippet: {2}tearDownClass (tempest.api.network.test_extensions.ExtensionsTestJSON) [0.000000s] ... FAILED Captured traceback: File "tempest/test.py", line 190, in tearDownClass
six.reraise(etype, value, trace)
File "tempest/test.py", line 173, in tearDownClass
teardown()
File "tempest/test.py", line 477, in clear_credentials
cls._creds_provider.clear_creds()
File "tempest/lib/common/dynamic_creds.py", line 459, in clear_creds
self._cleanup_default_secgroup(creds.tenant_id)
File "tempest/lib/common/dynamic_creds.py", line 406, in _cleanup_default_secgroup
nsg_client.delete_security_group(secgroup['id'])
File "tempest/lib/services/network/security_groups_client.py", line 59, in delete_security_group
return self.delete_resource(uri)
File "tempest/lib/services/network/base.py", line 41, in delete_resource
resp, body = self.delete(req_uri)
File "tempest/lib/common/rest_client.py", line 301, in delete
return self.request('DELETE', url, extra_headers, headers, body)
File "tempest/lib/common/rest_client.py", line 659, in request
self._error_checker(resp, resp_body)
File "tempest/lib/common/rest_client.py", line 780, in _error_checker
raise exceptions.Conflict(resp_body, resp=resp)
tempest.lib.exceptions.Conflict: An object with that identifier already exists
Details:
{u'message': u'Security Group 462df320-f024-4810-85a7-5b1195ee3602 cannot perform precommit_delete due to Callback networking_odl.common.callback.OdlSecurityGroupsHandler.sg_callback_precommit--9223372036854618144 failed with "argument of type \'NoneType\' is not iterable".', u'type': u'SecurityGroupInUse', u'detail': u''}
|
| Comments |
| Comment by Jamo Luhrsen [ 18/Sep/17 ] |
| Comment by Jamo Luhrsen [ 27/Sep/17 ] |
|
another example: snippet from the neutron log [0]: 2017-09-27 14:36:11.991 15977 DEBUG networking_odl.journal.journal [req-feaf99a4-cde1-4f5e-8f45-907690caa0a2 - -] Start processing journal entries sync_pending_entries /opt/stack/networking-odl/networking_odl/journal/journal.py:221 nothing stood out to me in the karaf.log [1]. for the comments here, the tempest log is here [2] [0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-carbon/27/control_1/q-svc.log.2017-09-27-132116.gz |
| Comment by Aswin Suryanarayanan [ 28/Sep/17 ] |
|
From initial look this looks like the cleanup of the resources are failing. It is not the same test case that is failing always. In run [1] it is the tempest.api.network.test_extensions.ExtensionsTestJSON and in the run [2] it is tempest.api.network.test_networks.NetworksIpV6Test. So are we too fast in doing the clean up? The error states the SG we try to clean in use , which should mean it is associated with a port. Are the tests run in order always or it purely random? [1] |
| Comment by Jamo Luhrsen [ 28/Sep/17 ] |
|
(In reply to Aswin Suryanarayanan from comment #3) Aswin, good catch on realizing the same precommit error on SG was happening in different test This may be a tough one to debug. Maybe we need some sort of extra log level turned I can look in to running the tests in sequence to see if that guarantees we never see any other ideas for debugging? JamO |
| Comment by Aswin Suryanarayanan [ 03/Oct/17 ] |
|
This is not observed after run 32[1]. Is there any thing changed here? or it is a part of random behavior? I am not able to reproduce this locally either. [1]https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-pike-upstream-stateful-nitrogen/36/log.html.gz |
| Comment by Aswin Suryanarayanan [ 03/Oct/17 ] |
|
It is back again Jamo, Given the sporadic nature of the failure, the parallel execution of tempest tests could be a reason. Like when SG is associated with a port in a test another test tries to delete/clean the SGs? But I don't have an answer why it is only on pike. |
| Comment by Jamo Luhrsen [ 03/Oct/17 ] |
|
(In reply to Aswin Suryanarayanan from comment #6) yeah, here is a good graphic to illustrate it's sporadic nature: > Jamo, From what I know about tempest, it's usually using unique names/ids for |
| Comment by Jamo Luhrsen [ 03/Oct/17 ] |
|
(In reply to Jamo Luhrsen from comment #7) are there any debugs we can enable on the ODL side that might help? |
| Comment by Aswin Suryanarayanan [ 05/Oct/17 ] |
|
(In reply to Jamo Luhrsen from comment #8) My assumption is this related to some neutron/networking odl. Didn't see anything interesting in karaf log either. So nothing comes in mind that could help from ODL side |
| Comment by Jamo Luhrsen [ 05/Oct/17 ] |
|
(In reply to Aswin Suryanarayanan from comment #9) random thought: |
| Comment by Aswin Suryanarayanan [ 24/Oct/17 ] |
|
There was a bug related to this fixed in tempest[1]. But we should be having the latest? [1]https://bugs.launchpad.net/tempest/+bug/1623691 |
| Comment by Jamo Luhrsen [ 02/Nov/17 ] |
|
I saw this in the sandbox tonight (11/1/2017). I'll keep it open for a while and keep checking |
| Comment by Zhijiang Hu [ 02/Jan/18 ] |
|
We also encounter this problem yesterday in OPNFV functest [1], and the log is [2], related ODL version is [3]. [1]https://build.opnfv.org/ci/job/functest-daisy-baremetal-daily-master/240/console [2]http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-01_13-53-07/functest.log
|
| Comment by Sam Hague [ 09/Jan/18 ] |
|
latest runs still have failure 1/8/18: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-pike-upstream-stateful-oxygen/114/log_05_tempest.html.gz |
| Comment by Jamo Luhrsen [ 09/Jan/18 ] |
|
Using a local deployment (tripleo based), I can reproduce this with the following tempest run: --concurrency 2 --regex '(tempest.api.network.test_security_groups)' that runs 43 tests and sometimes one or more will come back failed with this kind of error: Details: {u'message': u'Security Group aa4f7835-44d9-489c-861b-552d25922d81 cannot perform precommit_delete due to Callback networking_odl.common.callback.OdlSecurityGroupsHandler.sg_callback_precommit-765033 failed with "argument of type \'NoneType\' is not iterable".', u'type': u'SecurityGroupInUse', u'detail': u''} If concurrency is set to 1, I can't reproduce the problem. |
| Comment by Isaku Yamahata [ 10/Jan/18 ] |
|
Can you please enable debug option for neutron and get log? Unfortunately the root cause exception is masked. Also can you try the following patch? this patch is based on pure guess, though. |
| Comment by Isaku Yamahata [ 10/Jan/18 ] |
|
diff --git a/networking_odl/ml2/mech_driver_v2.py b/networking_odl/ml2/mech_driver_v2.py
|
| Comment by Isaku Yamahata [ 10/Jan/18 ] |
| Comment by Isaku Yamahata [ 10/Jan/18 ] |
| Comment by Jamo Luhrsen [ 12/Jan/18 ] |
|
yamahata, I applied the changes to my setup, but the problem persists. Attached is the neutron server log (DEBUG logs) when running with your changes. I ran this tempest command, tempest run --regex '(tempest.api.network.test_security_groups)' I can easily try more patches/fixes, if you have things to try. |
| Comment by Isaku Yamahata [ 13/Jan/18 ] |
|
sg-precommit-exception.diff |
| Comment by Isaku Yamahata [ 15/Jan/18 ] |
|
the patch was uploaded to https://review.openstack.org/#/c/533706/ |
| Comment by Zhijiang Hu [ 22/Jan/18 ] |
|
Hello, Thanks for catching this bug, want to add that this also affects OpenStack Pike + ODL Carbon http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-21_19-53-22/functest.log So when the patch is ready, could it be cherry-picked to networking-odl's stable/pike branch? |
| Comment by Jamo Luhrsen [ 23/Jan/18 ] |
|
yamahata, local tests with your patch are passing for me. I've run the tempest.api.network.test_security_groups bucket looking forward to getting your patch merged and ODL CSIT not hitting this any more. Much appreciated! |
| Comment by Jamo Luhrsen [ 25/Jan/18 ] |
|
CaseyCain, can we get someone to update this Jira with an explanation of why it is a blocker? BTW, in case it's not clear in the comments above, this bug has it's fix coming on the openstack |
| Comment by Zhijiang Hu [ 26/Jan/18 ] |
|
jluhrsen The reason why I think it was an blocker for OPNFV MS3.2 becasue I saw it happened many times in OPNFV functest (with Daisy installer which deploy OpenStack Pike + ODL carbon), for example, http://artifacts.opnfv.org/logs/functest/zte-pod3/2018-01-21_19-53-22/functest.log
Following the JIRA ticket I got the solution patch (https://review.openstack.org/#/c/533706/) and I have patched it dowstream into Daisy's repo first(https://gerrit.opnfv.org/gerrit/#/c/51031/). And functest for Daisy's sdn-nofeature scenario is stable now. So from Daisy Installer side, it is no longer a blocker for OPNFV MS3.2. Looking forward to getting the patch merged upstream. |
| Comment by Jaime Caamaño Ruiz [ 26/Jan/18 ] |
|
For sake of completeness, this patch got rid of the problem also: |
| Comment by Jamo Luhrsen [ 26/Jan/18 ] |
|
Thanks jaicaa. any chance we can get this ported to pike? he's my try for that: |
| Comment by Kit Lou [ 09/Feb/18 ] |
|
Changing severity to major as we have confirmation that this is not a blocker. |