[NETVIRT-509] Dissociates l3vpn from router and then Associates with network has 100% packet loss Created: 03/Mar/17  Updated: 03/May/18  Resolved: 14/Dec/17

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Boron
Fix Version/s: None

Type: Bug
Priority: Medium
Reporter: Suvitha Balu
Assignee: Aswin Suryanarayanan
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 7893

 Description   

Steps:
1. Associate an l3vpn with a router that has two subnets and their corresponding VMs.
2. Datapath test works fine.
3. Dissociate the l3vpn from the router.
4. Remove the interfaces from the router and delete the router.
5. Associate the l3vpn with the network.
6. Verified the fib entry and flow table 21, which has the corresponding entry.
7. But the datapath test between the networks fails with 100% packet loss.
8. Also tested an OVS restart, which shows the same behavior (100% packet loss).
9. But when I restarted one VM instance on one of the networks, the datapath worked fine between the networks.

Attached the ODL Sandbox log which has ODL and OVS dump.

Issue observed on both carbon and boron.



 Comments   
Comment by Suvitha Balu [ 03/Mar/17 ]

Sandbox log: https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-learn-carbon/15/archives/log.html.gz

Comment by Vivekanandan Narasimhan [ 03/Mar/17 ]

Hi Hanumant,

This issue may well be related to the fix for NETVIRT-452 here:
https://git.opendaylight.org/gerrit/52655

Can you do a check in sandbox flows/fibEntries and let me know?

Vivek

Comment by Hanamantagoud V Kandagal [ 03/Mar/17 ]

Please find our analysis:

A ping to 20.1.1.10/32 (VM_IP_NET20) from 10.1.1.9/32 (VM_IP_NET10) is being done.

  • MAC addresses: 10.1.1.9/32 fa:16:3e:34:de:4a
    20.1.1.10/32 fa:16:3e:c3:ec:b2
  • On the source DPN (i.e. where VM_IP_NET10 is hosted), we see the packet path below.
  • Here the neutron router is not present; instead, the neutron network is associated with the L3VPN. Hence, when ARP is resolved for 10.1.1.1 (gateway IP), mac-addr fe:16:3e:34:de:4a is returned by the ARP responder table=81.

table0 (in_port=2 actions=write_metadata:0x30000000000) ==>
table17 (metadata=0x30000000000, actions=write_metadata:0x6000030000000000) ==>
table40 (here it doesn't match any flow, so it goes to the default flow, i.e. actions=resubmit(,41),resubmit(,42)) ==>

table41 (doesn't match any entry here)
table42 (here it matches below entry)

table=42, n_packets=122, n_bytes=11728, priority=61010,ip,metadata=0x30000000000/0xfffff0000000000 actions=learn(table=252,idle_timeout=300,priority=61010,delete_learned,cookie=0x6900000,eth_type=0x800,NXM_OF_IP_SRC[]=NXM_OF_IP_DST[],NXM_OF_IP_DST[]=NXM_OF_IP_SRC[],NXM_OF_IP_PROTO[],load:0x1->NXM_NX_REG5[0..7]),resubmit(,17) ==>

table17 ==> table19. In table19 the packet should have matched one of the two entries below, but it did not: both packet counts are 0.

cookie=0x8000009, duration=26.309s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x222f2/0xfffffffe,dl_dst=fe:16:3e:34:de:4a actions=goto_table:21
cookie=0x8000009, duration=26.119s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x222f2/0xfffffffe,dl_dst=fe:16:3e:8a:e2:a9 actions=goto_table:21

We suspect the packet's destination MAC has been tampered with, hence it is not able to match the dl_dst in table=19.
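
As a rough illustration of the check described above, the sketch below parses the two table=19 entries quoted in this comment, lists the dl_dst values that have never been hit, and repeats the masked metadata comparison used by the table=42 entry. The helper names (`unhit_dl_dsts`, `masked_match`) are hypothetical and not part of any ODL or OVS tooling; the regex assumes the `ovs-ofctl dump-flows` field order seen in this dump.

```python
import re

# The two table=19 entries quoted in this comment.
DUMP = """\
cookie=0x8000009, duration=26.309s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x222f2/0xfffffffe,dl_dst=fe:16:3e:34:de:4a actions=goto_table:21
cookie=0x8000009, duration=26.119s, table=19, n_packets=0, n_bytes=0, priority=20,metadata=0x222f2/0xfffffffe,dl_dst=fe:16:3e:8a:e2:a9 actions=goto_table:21
"""

# Assumes "table=N, n_packets=M" appears before the match fields, as above.
FLOW_RE = re.compile(
    r"table=(?P<table>\d+),\s*n_packets=(?P<n_packets>\d+).*?"
    r"dl_dst=(?P<dl_dst>[0-9a-f:]{17})"
)


def unhit_dl_dsts(dump, table):
    """Return dl_dst values of flows in `table` whose packet counter is 0."""
    macs = []
    for line in dump.splitlines():
        m = FLOW_RE.search(line)
        if m and int(m.group("table")) == table and int(m.group("n_packets")) == 0:
            macs.append(m.group("dl_dst"))
    return macs


def masked_match(metadata, value, mask):
    """OpenFlow-style masked comparison: only bits set in `mask` are compared."""
    return (metadata & mask) == (value & mask)


gateway_macs = unhit_dl_dsts(DUMP, 19)
vm_mac = "fa:16:3e:34:de:4a"  # MAC of 10.1.1.9 from the list above

# Metadata written in table17 versus the table=42 match (value/mask).
table42_hit = masked_match(0x6000030000000000, 0x30000000000, 0xFFFFF0000000000)
```

Here `gateway_macs` holds both `fe:`-prefixed addresses, neither of which equals the VM's own `fa:` MAC, and `table42_hit` is true, which is consistent with the table=42 counters being non-zero while table=19 stays at 0.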

Comment by Vivekanandan Narasimhan [ 16/Mar/17 ]

Hi Hanumant,

Thanks for the analysis.

It looks like, with the steps given by the submitter here, the VM would be completely unaware that it left a router-based VPN and re-entered a network-based VPN. The VM might be holding the old MAC that ARP resolved for the gateway IP address 10.1.1.1, and so it might have incorrectly used the same old gateway MAC address to send the IP packets.

Hi Suvitha,

Can we please check with Wireshark whether the VM attempts ARPing after it sees IP packet losses, and whether through that ARPing it gets the new MAC address now applied for 10.1.1.1, which is fe:xx:xx:xx:xx rather than the router-interface MAC address?
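
If a packet capture is not available, one possible alternative is to look at the VM's neighbor cache. The sketch below assumes the VM offers `ip neigh show` in the usual iproute2 layout; the sample output and the cached MAC in it are placeholders invented for illustration, not values from the sandbox logs, and the helper names are hypothetical.

```python
# Hypothetical `ip neigh show` output from the VM on NET10. The MACs here
# are placeholders, not values taken from the sandbox logs.
SAMPLE_NEIGH = """\
10.1.1.1 dev eth0 lladdr fa:16:3e:00:00:01 STALE
10.1.1.12 dev eth0 lladdr fa:16:3e:00:00:02 REACHABLE
"""

# Gateway MAC returned by the ARP responder, per the analysis in this issue.
EXPECTED_GW_MAC = "fe:16:3e:34:de:4a"


def cached_mac(neigh_output, ip):
    """Return the link-layer address cached for `ip`, or None if absent."""
    for line in neigh_output.splitlines():
        fields = line.split()
        if fields and fields[0] == ip and "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                return fields[idx + 1].lower()
    return None


def holds_stale_gateway_mac(neigh_output, gateway_ip, expected_mac):
    """True when a MAC is cached for the gateway and it is not the expected one."""
    mac = cached_mac(neigh_output, gateway_ip)
    return mac is not None and mac != expected_mac.lower()


stale = holds_stale_gateway_mac(SAMPLE_NEIGH, "10.1.1.1", EXPECTED_GW_MAC)
```

With the placeholder data above, `stale` is true, which is the situation the hypothesis describes: the VM keeps sending to the old router-interface MAC until it re-ARPs.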

Vivek

Comment by Jamo Luhrsen [ 20/Mar/17 ]

(In reply to Vivekanandan Narasimhan from comment #4)

Right now, we don't have a way to do pcaps in these openstack instances,
but I had started to work on that a while back. I gave up and abandoned
it when it sat for too long without work. Hanumant, do you have the
cycles to finish it up? It's here:

https://git.opendaylight.org/gerrit/#/c/45441/

This is the second time in maybe 6 months that we've wanted this, which
isn't a lot, but it surely would be helpful sooner or later.

JamO

Comment by Jamo Luhrsen [ 20/Mar/17 ]

(In reply to Jamo Luhrsen from comment #5)

I meant, Suvitha, not Hanumant

Comment by Hari Krishna [ 21/Mar/17 ]

Trying to reproduce this in local setup. Went through the pipeline, everything looked ok.

Comment by Suvitha Balu [ 21/Mar/17 ]

Sure Jamo, I can explore this.

Comment by Vivekanandan Narasimhan [ 03/Apr/17 ]

Hi Hari,

Do we have any updates on this?

Vivek

Comment by Hari Krishna [ 25/Apr/17 ]

ETA - 5th May 2017

Comment by Hari Krishna [ 01/May/17 ]

Hi Suvitha,

I tried to reproduce this bug locally, and wasn't able to. Could you please try it again and let me know if you still see this issue?

Regards
Hari

Comment by Suvitha Balu [ 03/May/17 ]

Log from Sandbox:

https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-1node-openstack-newton-upstream-learn-boron/1

https://logs.opendaylight.org/sandbox/jenkins091/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/2

Comment by Hari Krishna [ 03/May/17 ]

Hi Suvitha,

Thank you for running it again. The difference between the CSIT run and my setup is that I don't configure ACLs: I ran the test without ACLs enabled. This seems to be the root cause.

I am putting the ETA as 10th May 2017, as I need to investigate this further. There is nothing to be done from the L3VPN side.

Regards
Hari

Comment by Hari Krishna [ 05/May/17 ]

Hi Suvitha,

I have tried to recreate this manually. With manual steps it is working. I have tried this multiple times and don’t see an issue.

Could you please add a delay of 2-3 seconds after the router disassociation, and also introduce a delay of 2-3 seconds after adding the networks to the L3VPN and before you ping?

Can you try this and let me know?

Regards
Hari

Comment by Hari Krishna [ 08/May/17 ]

Hi Suvitha,
In the CSIT script, did you add a delay and check? Also, in the script the ping count after network association is 3. Can you make it 20 and test again? We saw that the ACL learning tables take time to populate, which introduces a delay.
Could you please check and get back to us?

Regards
Hari

Comment by Jamo Luhrsen [ 08/May/17 ]

(In reply to Hari Krishna from comment #15)

Hari,

how long of a delay do you suggest? I don't think this problem is
happening every time, so we would need to explain why sometimes there
is a delay and other times not.

I think the idea of increasing ping count to 20 for debugging is best,
as that will give us an easy way to know how long the delay is.
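
A minimal sketch of how the delay could be read off a 20-count ping run: find the first sequence number that received a reply. The sample output is made up for illustration (busybox-style `seq=`, while iputils prints `icmp_seq=`), and `first_reply_seq` is a hypothetical helper, not something in the CSIT suite.

```python
import re

# Invented ping output for illustration: the first replies arrive at seq 12.
SAMPLE_PING = """\
PING 20.1.1.10 (20.1.1.10): 56 data bytes
64 bytes from 20.1.1.10: seq=12 ttl=63 time=1.913 ms
64 bytes from 20.1.1.10: seq=13 ttl=63 time=0.642 ms
64 bytes from 20.1.1.10: seq=14 ttl=63 time=0.598 ms

--- 20.1.1.10 ping statistics ---
20 packets transmitted, 8 packets received, 60% packet loss
"""

# Accepts both "seq=" (busybox) and "icmp_seq=" (iputils) reply lines.
SEQ_RE = re.compile(r"bytes from .*?(?:icmp_)?seq=(\d+)")


def first_reply_seq(ping_output):
    """Return the lowest answered sequence number, or None if nothing replied."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    return min(seqs) if seqs else None


first = first_reply_seq(SAMPLE_PING)
```

With ping's default one-second interval, the first answered sequence number gives a rough estimate, in seconds, of how long traffic was dropped (busybox counts from 0, iputils from 1).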

Suvitha,

Any chance you know of a failure for this bug in a releng job? The
sandbox is purged on a weekly basis, so the log links in
this bug are not working any more.

Thanks,
JamO

Comment by Hari Krishna [ 09/May/17 ]

Hi Jamo/Suvitha,

In our local setup, what I have seen is that after disassociating from the router and associating to the networks, the ACL learn tables, namely 212, 213 and 214, are not learning in a timely manner. There could be two issues.
First, as soon as the networks are associated, the tables take some time to learn. That is why I suggested a 2-3 second delay after associating the networks and before trying the ping.
Second, what I have noticed is that the very first time a ping is done, there is packet loss. This loss is between 10 and 15 packets, before the ACL learn tables get populated. The CSIT test case only tried with a packet count of 3, which would fail initially. That is why I recommended that Suvitha increase the packet count to 20 and test.
But please note that once the ACL learn tables are populated, no matter how many times you disassociate/associate networks, the traffic flow becomes normal.
It’s only the first time that there is traffic loss.
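
One way to observe this, sketched under assumptions: count the flows per table in an `ovs-ofctl dump-flows` capture and see whether tables 212, 213 and 214 hold any entries yet. The sample dump is invented for illustration (it is not from the attached logs), and `learn_table_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Invented dump for illustration only. The second line shows why the regex
# anchors on "table=N, n_packets=": a learn() action carries its own table=.
SAMPLE_DUMP = """\
cookie=0x6900000, duration=5.0s, table=213, n_packets=3, n_bytes=294, priority=61010,ip actions=resubmit(,17)
cookie=0x6900000, duration=9.0s, table=42, n_packets=122, n_bytes=11728, priority=61010,ip actions=learn(table=252,idle_timeout=300),resubmit(,17)
"""

TABLE_RE = re.compile(r"\btable=(\d+),\s*n_packets=")


def learn_table_counts(dump, tables=(212, 213, 214)):
    """Return {table: number of flows} for the given learn tables."""
    counts = Counter()
    for line in dump.splitlines():
        m = TABLE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return {t: counts[t] for t in tables}


counts = learn_table_counts(SAMPLE_DUMP)
populated = all(n > 0 for n in counts.values())
```

Polling such a count after the network association, instead of sleeping for a fixed 2-3 seconds, would also show how long the learn tables actually take to fill.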

I am following up with the ACL team separately to find out whether the above two observations are valid and what is expected from the ACL learn tables.

Regards
Hari

Comment by Hari Krishna [ 09/May/17 ]

Hi Som,

The steps to reproduce this issue:

1. Create Neutron networks
2. List the networks
3. Create Neutron subnets
4. Add allow rules
5. Create Neutron ports
6. Create Nova VMs
7. Check ELAN data traffic within the networks
8. Create tunnel
9. Create routers
10. Add interface to router
11. Check L2 path with router
12. Create L3VPN
13. Associate to router
14. Disassociate router
15. Remove subnets from router
16. Delete router
17. Associate networks to L3VPN

I have sent you the log files anyway, and I am assigning this bug to you to have a look.

Regards
Hari

Hi Som,

In our local setup, what I have seen is that after disassociating from the router and associating to the networks, the ACL learn tables, namely 212, 213 and 214, are not learning in a timely manner. There could be two issues.
First, as soon as the networks are associated, the tables take some time to learn. That is why I suggested a 2-3 second delay after associating the networks and before trying the ping.
Second, what I have noticed is that the very first time a ping is done, there is packet loss. This loss is between 10 and 15 packets, before the ACL learn tables get populated. The CSIT test case only tried with a packet count of 3, which would fail initially. That is why I have recommended increasing the packet count to 20 and testing again.
But note that once the ACL learn tables are populated, no matter how many times you disassociate/associate networks, the traffic flow becomes normal.
It’s only the first time that there is traffic loss.

Yes, this testing is done on the latest Nitrogen.
I am also attaching the log files.

Regards
Hari

Comment by Somashekar Byrappa [ 09/May/17 ]

Hi Slava,

As this issue is related to ACL in Learn mode, I am assigning this one to you.
Please redirect to the right person, if needed.

Thanks,
Som

Comment by Aswin Suryanarayanan [ 14/Dec/17 ]

Learn mode is deprecated.
