[NETVIRT-1637] L3VPN CSIT failure post MRI activity Created: 08/Nov/19  Updated: 09/Jan/20  Resolved: 09/Jan/20

Status: Verified
Project: netvirt
Component/s: vpnmanager
Affects Version/s: Magnesium
Fix Version/s: Magnesium

Type: Bug Priority: High
Reporter: Chetan Arakere Gowdru Assignee: Karthikeyan Krishnan
Resolution: Done Votes: 0
Labels: csit:failures
Remaining Estimate: 0 minutes
Time Spent: 3 days
Original Estimate: Not Specified


 Description   

As part of the MRI activity, a lot of code changes were made in netvirt (PFA). We are trying to stabilize CSIT after these changes and currently have 9 failures pending on l3vpn.

I tried to narrow down the issue to some extent (the root cause is not yet found) and see the following two reasons for these failures.

1. After the router is disassociated from L3VPN (2200:2), the FIB entries are not added back into the router VPN.
2. On deletion of L3VPN (2200:2), it remains in pending_delete state in vpn-instance-op-data because vpn-to-dpn-list is not cleared. This causes further test case failures, since later test cases attempt to create the same L3VPN.

Can one of you have a quick look at these issues so that we can close the MRI activity (which will unblock other pending patches to get merged)?

The job below has TRACE level logs enabled for the neutronvpn, vpnmanager and fibmanager modules. These logs will be cleared by tomorrow (as part of the weekly sandbox job clean-up).

https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/srini-netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-itm-direct-tunnels-magnesium/7/

CSIT Link : https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-magnesium/61/robot/



 Comments   
Comment by Abhinav Gupta [ 08/Nov/19 ]

New sandbox log link: https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/srini-netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-itm-direct-tunnels-magnesium/1/

Comment by Ashik Alias [ 06/Dec/19 ]

We have a script issue: the TCs are overlapping. The router delete was called before the router dissociation from the VPN had completed. We may need to add some delay after the router dissociation call.
The router is deleted at 9:05, whereas the flow update is attempted later at 9:06 and fails because the router was in pending delete.

2019-12-06T09:05:59,198 | INFO  | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncClusteredDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronRouterChangeListener      | 371 - org.opendaylight.netvirt.ipv6service-impl - 0.10.0.SNAPSHOT | Remove Router notification handler is invoked Uuid{_value=3c1e35cc-c027-45e4-aee4-2633416b0432}.

2019-12-06T09:05:59,210 | INFO  | jobcoordinator-main-task-41 | VpnInstanceListener              | 383 - org.opendaylight.netvirt.vpnmanager-impl - 0.10.0.SNAPSHOT | VPN-REMOVE: call: Operational status set to PENDING_DELETE for vpn 3c1e35cc-c027-45e4-aee4-2633416b0432 with rd 3c1e35cc-c027-45e4-aee4-2633416b0432

 

2019-12-06T09:06:31,864 | INFO  | jobcoordinator-main-task-39 | VpnInterfaceManager              | 383 - org.opendaylight.netvirt.vpnmanager-impl - 0.10.0.SNAPSHOT | updateVpnInstanceChange: failed to Add for update on VPNInterface 78bb3c21-4904-4588-b6a2-6ecbf5969331 from oldVpn(s) [4ae8cd92-48ca-49b5-94e1-b2921a261441] to newVpn 3c1e35cc-c027-45e4-aee4-2633416b0432 as the new vpn does not exist in oper DS or it is in pending state
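
The ordering argument above rests on comparing timestamps across log lines. The following is a minimal sketch of doing that comparison programmatically, assuming each log line starts with a timestamp in the form shown in the excerpts above, followed by a `|` separator. The function names are illustrative only and are not part of netvirt or the CSIT scripts.

```python
from datetime import datetime


def parse_log_timestamp(line: str) -> datetime:
    """Return the timestamp held in the first '|'-separated field of a log line.

    Assumes the 'YYYY-MM-DDTHH:MM:SS,mmm' layout seen in the excerpts above.
    """
    stamp = line.split("|", 1)[0].strip()
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S,%f")


def gap_seconds(earlier_line: str, later_line: str) -> float:
    """Return the number of seconds between two log lines (later minus earlier)."""
    delta = parse_log_timestamp(later_line) - parse_log_timestamp(earlier_line)
    return delta.total_seconds()
```

Applied to the PENDING_DELETE line (09:05:59,210) and the updateVpnInstanceChange failure line (09:06:31,864), `gap_seconds` returns 32.654, i.e. the update was attempted roughly 33 seconds after the VPN was marked for deletion.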

Comment by Abhinav Gupta [ 06/Dec/19 ]

Yes, Srini. This is a script failure.

TC8 hasn't completed when TC10 starts executing, causing the required DS entries to be wiped prematurely.
How can we ensure TCs don't overlap, apart from introducing sleeps?

If we go ahead with sleeps, we should probably add 10 secs of delay after the router-dissociation call so that it can succeed, and additionally check that the fib entries have disappeared before starting execution of the next TC.
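
One alternative to a fixed sleep is to poll for the expected state with a timeout, so the next TC starts as soon as the condition holds and fails clearly if it never does. The following is a minimal sketch of that pattern. `wait_until` and its parameters are hypothetical names, not part of the existing CSIT scripts, and the condition callable (for example, a check that the fib entries are gone) is left to the caller.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 60.0,
               interval: float = 2.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, or False if the deadline
    passes first. `clock` and `sleep` are injectable so the helper can be
    exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would pass its own check, e.g. `wait_until(lambda: not fib_entries_present(), timeout=60)`, where `fib_entries_present` is a placeholder for whatever query the test suite already uses. In Robot Framework suites, the BuiltIn keyword `Wait Until Keyword Succeeds` provides a similar retry-with-timeout behaviour.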

Comment by Abhinav Gupta [ 16/Dec/19 ]

Any update here, Srini?

Comment by Karthikeyan Krishnan [ 02/Jan/20 ]

Hi Srini,

   Please work on this issue with high priority. We need to get a 100% CSIT result.

 

Thanks & Regards,

Karthikeyan.

Comment by Srinivas Rachakonda [ 02/Jan/20 ]

Hi All,

Please let me know what delay/sleep time should be added after router disassociation.

 

As of now, the script has a 30-second delay after the router disassociation.

 

Thanks,

Srinivas

 

Comment by Chetan Arakere Gowdru [ 02/Jan/20 ]

All,

Please note that these CSIT UCs work fine in Fluorine/Sodium CSIT. As specified earlier, please verify the Magnesium MRI porting changes, which I strongly suspect could be the issue (as these were broken after this activity).

Comment by Karthikeyan Krishnan [ 08/Jan/20 ]

Hi All,

  I have raised the fix [0] for the L3VPN application-side problem.

 [0] https://git.opendaylight.org/gerrit/#/c/netvirt/+/86806/
