[OVSDB-253] Arp responder (and Icmp echo responder) rules not removed properly from ovs (L3 DVR) Created: 10/Jan/16 Updated: 29/May/18 Resolved: 09/Mar/16 |
|
| Status: | Resolved |
| Project: | ovsdb |
| Component/s: | openstack.net-virt |
| Affects Version/s: | unspecified |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Josh Hershberg | Assignee: | Josh Hershberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 4917 |
| Priority: | High |
| Description |
|
After the router interface and the router are deleted, "ovs-ofctl -O OpenFlow13 dump-flows br-int" shows the rules still present. |
| Comments |
| Comment by Flavio Fernandes [ 11/Jan/16 ] |
|
this is duplicate of |
| Comment by Sam Hague [ 12/Jan/16 ] |
|
(In reply to Flavio Fernandes from comment #1) Possibly related but different. 4844 is the case where br-ex has gone away but the ARP responder code is still trying to ARP for a gateway. This bug is that there are still flows on br-ex that should have been removed when the neutron router interface was deleted. Both outcomes might be related to some cleanup code. |
| Comment by Flavio Fernandes [ 13/Jan/16 ] |
|
(In reply to Sam Hague from comment #2) ah, ok. the logic for stopping the arp responder is triggered by the |
| Comment by Sam Hague [ 02/Feb/16 ] |
|
Josh, is this bug fixed? |
| Comment by Hari Prasidh [ 05/Feb/16 ] |
|
From our observation, when the router is deleted and when the tenant network is deleted, the ARP flows for the default gateway IP are written in L3NeutronAdapter as part of the update event. Dump flows for the above operations are given below.
Dump flows before the router and tenant delete operations:
cookie=0x0, duration=309.454s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x419,arp_tpa=20.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Dump flows after the router delete operation:
cookie=0x0, duration=763.474s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x419,arp_tpa=20.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Dump flows after the tenant delete operation. The tenant in the 10.0.0.X subnet was deleted; the ARP flow for the DHCP IP 10.0.0.2 is removed, but the ARP flow for the default gateway IP 10.0.0.1 is added again:
cookie=0x0, duration=812.489s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x419,arp_tpa=20.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Our understanding is that the cache is not cleared correctly during the delete event handling of the default gateway IP. |
| Comment by Sam Hague [ 05/Feb/16 ] |
|
I think this is because of gatewayMacResolved() or something related. That gets triggered from packet-ins, so it is called asynchronously from the delete events. Eventually that method goes on to write the ARP flow. We should probably have a check before writing the flow to ensure that there is still a network; if not, then don't write the flow. We also might want to remove the ARPing for the gateway if it is no longer needed, that is, if there are no more networks using it. Another place to look is the doNeutronNetworkDeleted() method. That method was added before we had the caches, so I wonder if it is no longer needed since the caches keep track of the async events. This may not be related to the error in this bug, but we should probably still look at whether it is needed. |
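A minimal Java sketch of the check suggested above: before an asynchronous callback writes the gateway ARP flow, it confirms the network still exists. Everything here is hypothetical and for illustration only (the class `GatewayArpGuard`, its methods, and the string-based flow keys are not the net-virt API), and flows are recorded in memory rather than written to OVS.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from the net-virt code base.
final class GatewayArpGuard {
    // Networks that currently exist, maintained by the create/delete handlers.
    private final Set<String> liveNetworks = ConcurrentHashMap.newKeySet();
    // Stand-in for flows programmed on the switch, keyed as "networkId/gatewayIp".
    private final Set<String> writtenFlows = ConcurrentHashMap.newKeySet();

    void networkCreated(String networkId) {
        liveNetworks.add(networkId);
    }

    void networkDeleted(String networkId) {
        liveNetworks.remove(networkId);
        // Drop any flows that belonged to the deleted network.
        writtenFlows.removeIf(flow -> flow.startsWith(networkId + "/"));
    }

    // Called asynchronously, e.g. once a packet-in has resolved the gateway MAC.
    // Returns true only if the flow was written.
    boolean onGatewayMacResolved(String networkId, String gatewayIp) {
        if (!liveNetworks.contains(networkId)) {
            return false; // network is gone: skip the write
        }
        writtenFlows.add(networkId + "/" + gatewayIp);
        return true;
    }

    Set<String> flows() {
        return writtenFlows;
    }

    public static void main(String[] args) {
        GatewayArpGuard guard = new GatewayArpGuard();
        guard.networkCreated("net-1");
        System.out.println("written while network exists: "
                + guard.onGatewayMacResolved("net-1", "10.0.0.1"));
        guard.networkDeleted("net-1");
        System.out.println("written after network delete: "
                + guard.onGatewayMacResolved("net-1", "10.0.0.1"));
        System.out.println("remaining flows: " + guard.flows());
    }
}
```

The check and the write are not atomic in this sketch, so a delete racing with the callback could still slip through; a real fix would need to serialize the two or re-validate after the write.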
| Comment by Josh Hershberg [ 08/Feb/16 ] |
|
OK. Here's what I've found. When you remove the router interface the ARP and ICMP flows are successfully removed from OVS...BUT they are added again if you delete the subnet because the router is still in subnetIdToRouterInterfaceCache. Here are the details:
1) When you remove the router interface the local caches are not properly cleaned. This is due to the following code which is invoked at the end of handleNeutronPortEvent:
    if (neutronPort != null) {
        networkIdToRouterMacCache.remove(neutronPort.getNetworkUUID());
        networkIdToRouterIpListCache.remove(neutronPort.getNetworkUUID());
        subnetIdToRouterInterfaceCache.remove(neutronRouterInterface.getSubnetUUID());
    }
}
2) When you issue a "neutron subnet-delete" command, neutron starts by issuing an update to the ports on that subnet before the subnet is deleted.
3) In the function handleNeutronPortEvent the following loop causes the flows to be rewritten: } |
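The sequence described above can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the net-virt implementation: only the cache name `subnetIdToRouterInterfaceCache` is taken from the comment, while the class, its methods, and the reduction of each flow to a gateway IP string are made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the stale-cache behaviour; not the real adapter code.
final class StaleRouterCacheDemo {
    // Simplified stand-in for the cache named in the comment: subnet ID -> gateway IP.
    private final Map<String, String> subnetIdToRouterInterfaceCache = new HashMap<>();
    // Stand-in for ARP responder flows on the switch, keyed by target IP.
    private final Set<String> arpFlows = new HashSet<>();

    void addRouterInterface(String subnetId, String gatewayIp) {
        subnetIdToRouterInterfaceCache.put(subnetId, gatewayIp);
        arpFlows.add(gatewayIp);
    }

    // Removes the flow; the cache entry is removed only when cleanCache is true.
    void removeRouterInterface(String subnetId, boolean cleanCache) {
        String gatewayIp = subnetIdToRouterInterfaceCache.get(subnetId);
        if (gatewayIp != null) {
            arpFlows.remove(gatewayIp);
        }
        if (cleanCache) {
            subnetIdToRouterInterfaceCache.remove(subnetId);
        }
    }

    // Models the port update that neutron issues before deleting a subnet:
    // any router interface still cached for the subnet gets its flow rewritten.
    void handlePortUpdate(String subnetId) {
        String gatewayIp = subnetIdToRouterInterfaceCache.get(subnetId);
        if (gatewayIp != null) {
            arpFlows.add(gatewayIp);
        }
    }

    boolean hasArpFlow(String ip) {
        return arpFlows.contains(ip);
    }

    // Runs the add / remove / port-update sequence and reports whether the flow is back.
    static boolean flowReappears(boolean cleanCache) {
        StaleRouterCacheDemo demo = new StaleRouterCacheDemo();
        demo.addRouterInterface("subnet-1", "10.0.0.1");
        demo.removeRouterInterface("subnet-1", cleanCache);
        demo.handlePortUpdate("subnet-1");
        return demo.hasArpFlow("10.0.0.1");
    }

    public static void main(String[] args) {
        System.out.println("stale cache, flow reappears:   " + flowReappears(false));
        System.out.println("cleaned cache, flow reappears: " + flowReappears(true));
    }
}
```

With the cache entry left in place the port update re-adds the gateway flow; with the entry removed the update has nothing to rewrite.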
| Comment by Josh Hershberg [ 08/Feb/16 ] |
|
UPDATE: This issue is resolved by undoing this change https://github.com/stackforge/networking-odl which confirms the analysis I posted above. However, there are still a few wrinkles. 1) The ArpResponder flows for the DHCP addresses remain in the MD-SAL but are not in OVS. After restarting OVS, this disappears. 2) L3Forwarding flows are present in both the MD-SAL and OVS. 3) LocalTableMiss and TunnelMiss are still present in the MD-SAL and OVS. I suspect these are unrelated issues but want to make sure. |
| Comment by Josh Hershberg [ 09/Feb/16 ] |
|
Sorry, the commit this reverts is https://git.opendaylight.org/gerrit/#/c/27521/ |
| Comment by Hari Prasidh [ 12/Feb/16 ] |
|
Hi, I tested with the below patch, which was recently merged. I noticed that the ARP entries for the DHCP port on the control node are still not deleted, even after we deleted the router and network. Please find the dump flows below.
Add network (tenant1):
cookie=0x0, duration=83.198s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x434,arp_tpa=10.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Add router interface with tenant1:
cookie=0x0, duration=117.513s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x434,arp_tpa=10.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Delete router interface:
cookie=0x0, duration=257.230s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x434,arp_tpa=10.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[]
Delete network (tenant1):
cookie=0x0, duration=291.280s, table=20, n_packets=0, n_bytes=0, priority=1024,arp,tun_id=0x434,arp_tpa=10.0.0.2,arp_op=1 actions=move:NXM_OF_ETH_SRC[] |
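When comparing dumps like the ones above across several operations, it can help to filter the saved "ovs-ofctl dump-flows" output for leftover ARP flows mechanically. The following is a minimal sketch that assumes the text layout shown in this thread (ARP responder flows in table 20, matched on arp_tpa); the class and method names are hypothetical, and it works on captured lines rather than invoking ovs-ofctl.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for scanning captured dump-flows text; not part of any OVS or ODL tooling.
final class ArpFlowFilter {
    // Match whole fields only, so "table=20" does not match "table=200".
    private static final Pattern TABLE_20 = Pattern.compile("(^|[,\\s])table=20([,\\s]|$)");
    private static final Pattern ARP = Pattern.compile("[,\\s]arp([,\\s]|$)");

    // Returns the table-20 ARP flows whose arp_tpa equals targetIp exactly.
    static List<String> leftoverArpFlows(List<String> dumpLines, String targetIp) {
        Pattern target = Pattern.compile(
                "(^|[,\\s])arp_tpa=" + Pattern.quote(targetIp) + "([,\\s]|$)");
        return dumpLines.stream()
                .filter(line -> TABLE_20.matcher(line).find())
                .filter(line -> ARP.matcher(line).find())
                .filter(line -> target.matcher(line).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "cookie=0x0, duration=291.280s, table=20, n_packets=0, n_bytes=0, "
                        + "priority=1024,arp,tun_id=0x434,arp_tpa=10.0.0.2,arp_op=1 "
                        + "actions=move:NXM_OF_ETH_SRC[]",
                "cookie=0x0, duration=291.280s, table=0, n_packets=0, n_bytes=0, "
                        + "priority=0 actions=drop");
        // Prints only the table-20 ARP flow for 10.0.0.2.
        leftoverArpFlows(dump, "10.0.0.2").forEach(System.out::println);
    }
}
```

An empty result for the gateway and DHCP addresses after the delete operations would indicate the flows were cleaned up on that node.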
| Comment by Sam Hague [ 12/Feb/16 ] |
|
Hari, could you try the same tests using the latest build from stable/beryllium? We also found that the patch you mentioned didn't solve the problem completely. There were two other patches, and with all three together the flows looked to be cleaned up. |
| Comment by Sam Hague [ 12/Feb/16 ] |
|
(In reply to Sam Hague from comment #11) Hari, could you also describe your test setup: separate openstack control nodes and compute nodes, control+compute on one node with a second compute node, etc.? |
| Comment by Hari Prasidh [ 15/Feb/16 ] |
|
Attachment dump_flows&karaf_log.zip has been added with description: karaf logs, local conf and dump flows for your reference |
| Comment by Hari Prasidh [ 15/Feb/16 ] |
|
Hi Sam, I tested with the latest build from stable/beryllium, and I noticed that the ARP flows are still not cleared from table 20. I am using one control node and 2 compute nodes. I've attached the local conf for the control and compute nodes. I deleted the networks and router interface, but the ARP flows are still not cleared on the control node and compute nodes. I've attached karaf logs and a doc containing dump flows for all 3 nodes. |
| Comment by Josh Hershberg [ 15/Feb/16 ] |
|
Just to make sure, are they ARP flows for routers or VMs? The fix in this bug relates to routers. (In reply to hari prasad from comment #14) |
| Comment by Hari Prasidh [ 15/Feb/16 ] |
|
Those arp flows are(In reply to Joshua from comment #15) Hi JOshua, |
| Comment by Josh Hershberg [ 23/Feb/16 ] |
|
The remaining issue, the ARP flows for DHCP ports, is really a separate issue. Opening a new bug for that one: Bug-5408 |
| Comment by Josh Hershberg [ 23/Feb/16 ] |
|
New bug opened for the last lingering ARP flows, which are really a different bug. See Bug-5408 |
| Comment by Sam Hague [ 23/Feb/16 ] |
|
(In reply to Joshua from comment #18) Josh, can this bug be closed now? |
| Comment by Hari Prasidh [ 04/Mar/16 ] |
|
Hi, I tested with the latest build; the bug is fixed for the ARP Responder service. Can we close this bug? |