[OVSDB-439] Stale connection check is failing and ends up removing the node from the data store. Created: 15/Dec/17  Updated: 09/Jul/18  Resolved: 09/Jul/18

Status: Resolved
Project: ovsdb
Component/s: Southbound.Open_vSwitch
Affects Version/s: Carbon-SR3
Fix Version/s: Oxygen-SR3, Fluorine

Type: Bug Priority: High
Reporter: Sam Hague Assignee: suneelu varma
Resolution: Done Votes: 0
Labels: csit:3node, ds
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Relates
relates to OVSDB-443 Write CSIT test to capture this issue. Open
relates to OVSDB-462 Bridge randomly goes missing in topol... Verified
relates to OVSDB-433 OVSDB entity... node-id=ovsdb://uuid/... Resolved
relates to OVSDB-438 operational node goes missing upon ov... Resolved
Sub-Tasks:
Key         Summary                                  Type      Status    Assignee
OVSDB-443   Write CSIT test to capture this issue.   Sub-task  Open
Epic Link: Clustering Stability

 Description   

Hello Anil,
 
Looks like we hit the same issue in our local testing.
With ODL Carbon (+ Pike, OVS 2.7), during one reboot scenario we observed a race condition in ODL/OVSDB.
Can you please let us know whether this issue is already known or addressed in OVSDB?
 
Steps to Reproduce (in a working setup with a controller and two compute nodes); a command-level sketch follows the list:
1. Restart the compute node and wait for it to come back up.
2. Launch an instance on the compute node.
3. Observe that the instance initially stays in the "spawning" state and then transitions to the "error" state.
4. Restart Open vSwitch on the compute node.
5. Launch a new instance; it boots successfully.
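
A minimal command-level sketch of the above, assuming the OpenStack CLI on the controller and systemd-managed services on the compute node; the flavor, image, and network names are placeholders:

# On the compute node: reboot and wait for it to come back (step 1).
sudo reboot

# On the controller, after the compute node is back: launch an instance (step 2).
openstack server create --flavor m1.tiny --image cirros --network private vm-test1
# Step 3: the status moves from BUILD (spawning) to ERROR.
openstack server show vm-test1 -f value -c status

# On the compute node: restart Open vSwitch (step 4); the service name may be
# openvswitch-switch depending on the distribution.
sudo systemctl restart openvswitch

# On the controller: a new instance now boots successfully (step 5).
openstack server create --flavor m1.tiny --image cirros --network private vm-test2
openstack server show vm-test2 -f value -c status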
 
Basically, when we issue the reboot on the compute node, ODL identifies that the node is idle and triggers the disconnection chain.
But while this cleanup is still in progress, the compute node comes back up, and we could see a race condition between the cleanup events and the node-reconciliation events.
 
In this process, we could see that the compute node is finally deleted from the operational store [#] even though it is connected to the controller.
Since the node info is deleted from the datastore, the side effect is that port-binding fails and we are unable to spawn new VMs until we restart the OVS switch on the compute node.
The gist below [@] is a snapshot of the karaf logs showing this sequence, and a RESTCONF check to verify this state is sketched after the footnotes.
 
Additional notes:
If the compute node comes up with some delay (i.e., after the cleanup has completed in ODL), this issue (step 3 above) is not seen.
 
[#] 2017-08-01 07:48:16,660 | INFO  | lt-dispatcher-49 | OvsdbConnectionManager | 289 - org.opendaylight.ovsdb.southbound-impl - 1.4.1.Carbon-redhat-1 | Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/e9806896-8dc2-4f17-83ea-c1c957608915}]} has no owner, cleaning up the operational data store
[@] https://gist.github.com/sridhargaddam/3761ef080e11f2dd2429c8d7016ae6d0
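
A quick way to verify this state (the switch still connected while its node is gone from the operational datastore), assuming the default RESTCONF port and credentials on ODL Carbon:

# On the compute node: the Manager section still shows is_connected: true.
sudo ovs-vsctl show

# On the controller: the node is missing from the ovsdb:1 operational topology.
curl -u admin:admin http://<CONTROLLER_IP>:8181/restconf/operational/network-topology:network-topology/topology/ovsdb:1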



 Comments   
Comment by Anil Vishnoi [ 22/Dec/17 ]

This issue can be recreated with just the OVSDB southbound plugin using the following steps.

(1) Connect the OVS (set-manager) to the controller.

(2) Add the following iptables rules, which will block communication with the controller:

sudo iptables -A OUTPUT -d <CONTROLLER_IP> -j DROP
sudo iptables -A INPUT -s <CONTROLLER_IP> -j DROP

(3) Sleep for 90 seconds.

(4) Remove the rules and re-enable communication with the controller:

sudo iptables -D OUTPUT -d <CONTROLLER_IP> -j DROP
sudo iptables -D INPUT -s <CONTROLLER_IP> -j DROP

This triggers a second connection from the switch to the controller while the previous connection is still lingering in the controller. The current code has a bug that ends up disconnecting the second connection as well. A consolidated shell sketch of these steps is shown below.
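
A consolidated sketch of the above, run on the switch host (CONTROLLER_IP is a placeholder; 6640 is the default OVSDB manager port):

# (1) Connect the switch to the controller.
sudo ovs-vsctl set-manager tcp:<CONTROLLER_IP>:6640

# (2) Drop traffic in both directions so the old connection goes stale on the controller.
sudo iptables -A OUTPUT -d <CONTROLLER_IP> -j DROP
sudo iptables -A INPUT -s <CONTROLLER_IP> -j DROP

# (3) Wait long enough for the controller-side cleanup to start.
sleep 90

# (4) Remove the rules; the switch opens a second connection while the old one lingers.
sudo iptables -D OUTPUT -d <CONTROLLER_IP> -j DROP
sudo iptables -D INPUT -s <CONTROLLER_IP> -j DROP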

 

jluhrsen, this would be a good CSIT test to check connection flips between OVS and the controller.

Comment by Anil Vishnoi [ 22/Dec/17 ]

stable/carbon : https://git.opendaylight.org/gerrit/66718

Comment by Jamo Luhrsen [ 05/Jan/18 ]

I can't reproduce this. I have a CSIT patch in the works, but I want to make sure the CSIT
can actually hit the bug. Locally, I'm doing this:

  • set manager
  • iptables block 6640 incoming
  • wait until I see the connection is not there anymore (ovs-vsctl show AND netstat)
  • unblock 6640

at this point I see the connection is made again and seems to stay. I must be missing
some step.

BTW, I know OVS will continue to retry, and sometimes ovs-vsctl show will briefly say
is_connected: true, as I think it reports that as soon as the initial TCP three-way
handshake succeeds. But even so, if the connection is rejected by ovsdb-lib we'll see OVS
trying to reconnect again. Because of this, I'm trying to write the test to make sure
there is no long-term ESTABLISHED connection on port 6640 showing up with the same source
TCP port number; a possible check is sketched below.
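
A possible check along these lines, run on the OVS host (a sketch; assumes ss or netstat is available and 6640 is the manager port):

# Note the local (source) port of the established connection to the controller,
# then repeat after unblocking and compare.
sudo ss -tn state established '( dport = :6640 )'
# or, with netstat:
sudo netstat -tn | grep ':6640'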

Comment by Anil Vishnoi [ 06/Jan/18 ]

Basically, this issue reproduces only if there are parallel connections to the controller from the same switch. To get into that state you have to drop the traffic both ways (controller to switch and vice versa); that's the reason I was installing two rules:

sudo iptables -A OUTPUT -d <CONTROLLER_IP> -j DROP
sudo iptables -A INPUT -s <CONTROLLER_IP> -j DROP

 

You don't have to wait for the connection to be gone; just wait 60-90 seconds before removing the rules and the issue reproduces consistently. A quick way to confirm the parallel connections on the controller is sketched below.
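
To confirm the parallel-connection state on the controller side, something like the following could be used (a sketch; SWITCH_IP is a placeholder):

# Two ESTABLISHED connections from the same switch IP to port 6640 mean the old
# connection is still lingering alongside the new one.
sudo ss -tn state established '( sport = :6640 )' | grep <SWITCH_IP>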

Comment by Jamo Luhrsen [ 28/Jun/18 ]

this patch is still unmerged. what's the plan here?

Comment by Jamo Luhrsen [ 09/Jul/18 ]

marking this done, as the patch went in back in December for carbon. I think we just didn't
close this jira at that point.
