[NETVIRT-1372] Fail: vpn basic ipv6.Delete Router And Router Interfaces With L3VPN: Verify GWMAC Flow Entry Removed From Flow Table' failed Created: 16/Jul/18  Updated: 16/Oct/18  Resolved: 16/Oct/18

Status: Resolved
Project: netvirt
Component/s: ipv6service
Affects Version/s: None
Fix Version/s: Fluorine-SR1

Type: Bug Priority: High
Reporter: Sam Hague Assignee: Somashekar B
Resolution: Cannot Reproduce Votes: 0
Labels: csit:exception
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File odl1_karaf.log.tar.xz    

 Description   

The test fails when ti checks for the gw mac's in the table 19 flows. fa:16:3e:1e:c9:e6 in this test run is the one that failed to be removed. This test fails intermittently every other day or so.

 cookie=0x8220015, duration=7053.025s, table=19, n_packets=62, n_bytes=2604, priority=100,arp,arp_op=1 actions=resubmit(,17)
 cookie=0x8220016, duration=7053.025s, table=19, n_packets=28, n_bytes=1176, priority=100,arp,arp_op=2 actions=resubmit(,17)
 cookie=0x8000009, duration=49.935s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x30e0c/0xfffffe,dl_dst=fa:16:3e:1e:c9:e6 actions=goto_table:21
 cookie=0x1080000, duration=7053.025s, table=19, n_packets=3513, n_bytes=468116, priority=0 actions=resubmit(,17)
 cookie=0x88880000, duration=7052.683s, table=195, n_packets=193, n_bytes=8274, priority=0,arp actions=CONTROLLER:65535,learn(table=195,hard_timeout=5,priority=10,cookie=0x88880000,eth_type=0x806,NXM_OF_ARP_SPA[],NXM_OF_ARP_TPA[],OXM_OF_METADATA[24..39],load:0x1->NXM_NX_REG4[0..7]),learn(table=196,hard_timeout=5,priority=10,cookie=0x88880000,eth_type=0x806,NXM_OF_ARP_TPA[]=NXM_OF_ARP_SPA[],NXM_OF_ARP_SPA[]=NXM_OF_ARP_TPA[],OXM_OF_METADATA[24..39],load:0x1->NXM_NX_REG4[8..15])


 Comments   
Comment by Jamo Luhrsen [ 16/Jul/18 ]

shague, can you put a link to the robot log with this failure?

Comment by Somashekar B [ 03/Aug/18 ]

This issue is not observed on the recent builds.
I have verified builds from 76-111 which is from 27-July-18 to 2-Aug-18.
This testcase has failed in builds 79, 88, 90, 91, 94 but for different reasons, mainly because of new exceptions during it's execution. But it has not failed for the reason specified in this bug.

Comment by Waldemar Znoinski [ 15/Aug/18 ]

I see this issue in RH OSP CI recently, DEBUG karaf logs + robot log attached to: https://bugzilla.redhat.com/show_bug.cgi?id=1616288

Comment by Vishal Thapar [ 03/Sep/18 ]

Saw this issue in upstream Netvirt CSIT again:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-fluorine/52/robot-plugin/log_08_vpn_basic.html.gz

Comment by Jamo Luhrsen [ 04/Sep/18 ]

it's still happening, re-opening

Comment by Somashekar B [ 10/Sep/18 ]

I am not able to reproduce this issue.

Observation:
----------------
From the steps in CSIT, I could find that the stale flows in table 19 occur when router is associated with L3VPN.

When subnets are added to router, for every router interface (dl_dst=MAC), table 19 flow is programmed with router metadata.
When router is associated with L3VPN, the existing flows with router metadata will be deleted and a new set of flows with VPN metadata will be programmed for all the router interfaces.
In this case, one flow with router metadata is left over as stale on 2 out of 3 DPNs.
Now, when router interfaces are deleted, corresponding flows with VPN metadata will be deleted leaving router metadata flow as stale.
Later, router is dissociated with L3VPN.

In case, if we dissociate router from L3VPN and later delete router interfaces, we may not hit this stale issue.
I don't recommend to change the CSIT steps, just want to make a note of my observation here.

Logs with TRACE enabled might help a bit here.

Comment by Jamo Luhrsen [ 10/Sep/18 ]

the last time this failed in 1node was over two weeks ago, but maybe it's just more frequent in the 3node jobs, which is where
thapar saw it.

what modules would we need TRACE enabled for? One thing we can try is running this suite many times in a row
in the sandbox to see if it can be reproduced more commonly.

Comment by Somashekar B [ 11/Sep/18 ]

Please enable TRACE for below modules:

org.opendaylight.netvirt.neutronvpn
org.opendaylight.netvirt.vpnmanager
org.opendaylight.netvirt.fibmanager

Comment by Somashekar B [ 09/Oct/18 ]

jluhrsen, did you get a chance to get the TRACE logs for this issue?

Comment by Vishal Thapar [ 09/Oct/18 ]

I've started a run with just this test suite and trace logs:

https://jenkins.opendaylight.org/sandbox/user/thapar/my-views/view/MyView/job/thapar-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-neon/1/

Will share the results once available and rerun the suite if don't hit the issue.

Could you explain the workaround you mentioned? Do you think changing order can 'avoid' this issue? If yes, I'd like to try that too as backup plan.

Comment by Vishal Thapar [ 09/Oct/18 ]

First run didn't see the issue and I messed up the debugmap. Started a new run with correct debugmap.

Update: Second run passed but no logs. I think TRACE generates way too many log entries than our jenkins can manage. Can we get more directed traces to reduce amount of logs?

https://jenkins.opendaylight.org/sandbox/user/thapar/my-views/view/MyView/job/thapar-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-neon/2/

Comment by Somashekar B [ 09/Oct/18 ]

Hi Vishal,

Try setting "org.opendaylight.netvirt.neutronvpn.NeutronHostConfigChangeListener" to INFO level.

Comment by Somashekar B [ 09/Oct/18 ]

Hi Vishal,

Thanks for running the test suite.

From CSIT logs, issue has happened when router is associated with L3VPN.
For one of the router interface (dl_dst=fa:16:3e:0c:a0:75), table 19 flow having router metadata is not getting deleted in 2 out of 3 DPNs (refer [2]).
So, when we perform router dissociation, it should come back to same stage as [1] which was before router assocation with BGPVPN.
Now if we delete router interfaces (removing subnets from router), these table 19 flows should get deleted.
There should not be any stale flows in table 19.

[1] Before router assocation with BGPVPN:

cookie=0x8220015, duration=1035.396s, table=19, n_packets=45, n_bytes=1890, priority=100,arp,arp_op=1 actions=resubmit(,17)
cookie=0x8220016, duration=1035.396s, table=19, n_packets=22, n_bytes=924, priority=100,arp,arp_op=2 actions=resubmit(,17)
cookie=0x8000009, duration=173.732s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x33c9e/0xfffffe,dl_dst=fa:16:3e:fa:a9:90 actions=goto_table:21
cookie=0x8000009, duration=169.363s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x33c9e/0xfffffe,dl_dst=fa:16:3e:0c:a0:75 actions=goto_table:21
cookie=0x1080000, duration=1035.396s, table=19, n_packets=5503, n_bytes=569678, priority=0 actions=resubmit(,17)

[2] After router assocation with BGPVPN:

cookie=0x8220015, duration=1109.108s, table=19, n_packets=45, n_bytes=1890, priority=100,arp,arp_op=1 actions=resubmit(,17)
cookie=0x8220016, duration=1109.108s, table=19, n_packets=22, n_bytes=924, priority=100,arp,arp_op=2 actions=resubmit(,17)
cookie=0x8000009, duration=243.075s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x33c9e/0xfffffe,dl_dst=fa:16:3e:0c:a0:75 actions=goto_table:21
cookie=0x8000009, duration=44.144s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x33cb6/0xfffffe,dl_dst=fa:16:3e:fa:a9:90 actions=goto_table:21
cookie=0x8000009, duration=41.477s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x33cb6/0xfffffe,dl_dst=fa:16:3e:0c:a0:75 actions=goto_table:21
cookie=0x1080000, duration=1109.108s, table=19, n_packets=5764, n_bytes=593666, priority=0 actions=resubmit(,17)
   

Comment by Vishal Thapar [ 09/Oct/18 ]

Hi Som,

Could you clarify this? Set all other TRACE and this to INFO?

Comment by Somashekar B [ 10/Oct/18 ]

Yes, Vishal.

Comment by Jamo Luhrsen [ 10/Oct/18 ]

trying this here:
https://jenkins.opendaylight.org/sandbox/user/thapar/my-views/view/MyView/job/thapar-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-neon/4/

Comment by Vishal Thapar [ 12/Oct/18 ]

And we're not hitting the issue with TRACE enabled

Comment by Somashekar B [ 12/Oct/18 ]

Should we close this bug for now and re-open again if it's reproducible later?

Comment by Sam Hague [ 16/Oct/18 ]

Yes, let's go ahead and close it for now.

Generated at Wed Feb 07 20:23:54 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.