[OVSDB-288] passive connection not reconnected if ovs service is restarted Created: 03/Feb/16  Updated: 06/Apr/16  Resolved: 06/Apr/16

Status: Resolved
Project: ovsdb
Component/s: Southbound.Open_vSwitch
Affects Version/s: unspecified
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Anil Vishnoi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 5221

 Description   

Scenario:

1. The controller initiates the connection to a passive OVS instance (e.g. ptcp).
2. The instance connects and configuration can be applied.
3. Restart the OVS instance (reboot the entire system, or stop/start ovsdb-server/ovs-vswitchd).

Result:

The OVS instance is no longer connected and no longer present in the operational
store. The OVSDB plugin does not appear to retry the connection.



 Comments   
Comment by Anil Vishnoi [ 10/Feb/16 ]

We can add retry logic, but there is no deterministic way to decide when and for how long we should retry. There is a possibility that the switch is gone for good and will never come back up.

That said, I agree we should retry for a reasonable time to at least see whether the switch comes back: for example, make 10 retries, where the first attempt is 10 seconds after the disconnect and the delay grows by 10 seconds on each subsequent attempt (1st at 10 s, 2nd at 20 s, 3rd at 30 s, 4th at 40 s, ... 10th at 100 s). That gives the user a window of roughly 9 minutes to reboot or reconfigure the system.
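For illustration only, a minimal Python sketch of the proposed schedule (the function name is hypothetical, not part of the plugin):

    def proposed_retry_schedule(retries=10, step_seconds=10):
        """Linear back-off as proposed: 10 s, 20 s, ..., 100 s after the disconnect."""
        return [step_seconds * attempt for attempt in range(1, retries + 1)]

    delays = proposed_retry_schedule()
    print(delays)        # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(sum(delays))   # 550 seconds total, i.e. roughly a 9-minute window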

Comment by Jamo Luhrsen [ 10/Feb/16 ]

(In reply to Anil Vishnoi from comment #1)
> Although we can add the retry logic, but there is no deterministic approach
> about when and how long we should retry. There is a possibility that the
> switch is totally gone and it's not going to come back up at all.
>
> But i agree that we should have some retry logic for a reasonable time to
> atleast try and see if switch is back on, probably do 10 retry where first
> try will be 10 second after disconnection and then we increment the delay by
> 10 seconds for each next try e.g (1st try at = 10S, 2nd = 20S 3rd=30S
> 4th=40S.... 10th=100S ), that will give around 9 minutes window to user to
> reboot/reconfigure the system.

OFP has retry mechanisms, right? Why not use something similar? I'm not sure
we should ever give up, though. Maybe max out at 60 s and then keep retrying
once every 60 s, thinking about scheduled outages, etc.
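A tiny Python sketch of that alternative policy, purely for illustration (the name is made up, not plugin API):

    import itertools

    def capped_retry_delays(step=10, cap=60):
        """Back off by `step` seconds per attempt, cap at `cap`, never give up."""
        for attempt in itertools.count(1):
            yield min(attempt * step, cap)   # 10, 20, ..., 60, 60, 60, ...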

Comment by Anil Vishnoi [ 10/Feb/16 ]

(In reply to Jamo Luhrsen from comment #2)
> (In reply to Anil Vishnoi from comment #1)
> > Although we can add the retry logic, but there is no deterministic approach
> > about when and how long we should retry. There is a possibility that the
> > switch is totally gone and it's not going to come back up at all.
> >
> > But i agree that we should have some retry logic for a reasonable time to
> > atleast try and see if switch is back on, probably do 10 retry where first
> > try will be 10 second after disconnection and then we increment the delay by
> > 10 seconds for each next try e.g (1st try at = 10S, 2nd = 20S 3rd=30S
> > 4th=40S.... 10th=100S ), that will give around 9 minutes window to user to
> > reboot/reconfigure the system.
>
> ofp has retry mechanisms right? why not use something similar. Not sure
> I think we should ever give up though. maybe max out at 60s and then keep
> going once every 60s. Thinking about scheduled outages, etc.

OFP has no retry logic, because OFP only has active connections (switch to controller). Retrying forever is a bad idea: there is a possibility that the server/compute node will never join back, in which case we would keep retrying indefinitely. If this is a planned outage, the application needs to remove that configuration from the data store and add it back once the server/ovsdb is back.
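As an illustration of the suggested remove-and-re-add workaround for a planned outage, a rough Python sketch against the controller's RESTCONF interface (the controller address, credentials, OVS IP, and port below are placeholders, and the URL/payload shape follows the Beryllium-era southbound examples; verify against your deployment):

    import requests

    ODL = "http://127.0.0.1:8181"      # controller RESTCONF endpoint (assumed default)
    AUTH = ("admin", "admin")          # assumed default credentials
    NODE_ID = "ovsdb://192.0.2.10:6634"
    # the node-id is percent-encoded ("//" -> "%2F%2F") in the RESTCONF path
    URL = (ODL + "/restconf/config/network-topology:network-topology/"
           "topology/ovsdb:1/node/ovsdb:%2F%2F192.0.2.10:6634")

    # Planned outage: remove the controller-initiated connection from config ...
    requests.delete(URL, auth=AUTH)

    # ... and add it back once the server/ovsdb is up again.
    body = {"network-topology:node": [{
        "node-id": NODE_ID,
        "connection-info": {"ovsdb:remote-ip": "192.0.2.10",
                            "ovsdb:remote-port": 6634},
    }]}
    requests.put(URL, json=body, auth=AUTH)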

Comment by Sam Hague [ 10/Feb/16 ]

Another thought: OVS and OVSDB retry connections by default. That is in the other direction from what is being requested here, but since it is supported, maybe the functionality requested in this bug does not need to be a high priority.

Comment by Jamo Luhrsen [ 11/Feb/16 ]

(In reply to Anil Vishnoi from comment #3)
> (In reply to Jamo Luhrsen from comment #2)
> > (In reply to Anil Vishnoi from comment #1)
> > > Although we can add the retry logic, but there is no deterministic approach
> > > about when and how long we should retry. There is a possibility that the
> > > switch is totally gone and it's not going to come back up at all.
> > >
> > > But i agree that we should have some retry logic for a reasonable time to
> > > atleast try and see if switch is back on, probably do 10 retry where first
> > > try will be 10 second after disconnection and then we increment the delay by
> > > 10 seconds for each next try e.g (1st try at = 10S, 2nd = 20S 3rd=30S
> > > 4th=40S.... 10th=100S ), that will give around 9 minutes window to user to
> > > reboot/reconfigure the system.
> >
> > ofp has retry mechanisms right? why not use something similar. Not sure
> > I think we should ever give up though. maybe max out at 60s and then keep
> > going once every 60s. Thinking about scheduled outages, etc.
>
> OFP has no retry logic, because OFP there is only active connections (switch
> to controller). Keep re-trying is a bad idea, there is possibility that
> servers/compute node will never join back, in that case you will keep
> retrying it forever. If this is a planned outage, application need to remove
> that configuration from the data store and add it back once server/ovsdb is
> back.

ok, you are right about retrying indefinitely.

Comment by Jamo Luhrsen [ 11/Feb/16 ]

So this bug is being blamed for CSIT failures for not quite the right reason.
Maybe, as Sam says, this should be lowered to a simple enhancement request:
some short-term retry mechanism for the case where the services on our OVS
node go missing for a period.

The more serious failure that CSIT is hitting goes like this (it seems like a
new bug to track, but please advise on that); a reproduction sketch follows
the log excerpts below:

A. ovs in passive mode
B. initiate connection from controller
C. verify it exists in config and operational
D. ovs-ctl stop, then start, on the ovs node (simulates the node going away for a brief period)
E. ovs configured back in passive mode
F. verify operational does NOT see the node (because we don't retry)
G. verify it is still in config (we never deleted it)
H. delete from config (starting the suggested steps to recover)
I. verify it is not in operational or config (as expected)
J. initiating the connection from the controller now fails.

log messages of interest, I think, coming from step E:

2016-02-11 01:03:28,759 | INFO | entLoopGroup-8-1 | OvsdbConnectionService | 153 - org.opendaylight.ovsdb.library - 1.2.1.SNAPSHOT | Connection closed ConnectionInfo [Remote-address=209.132.179.50, Remote-port=6634, Local-address172.18.182.19, Local-port=36453, type=ACTIVE]
2016-02-11 01:03:28,760 | INFO | entLoopGroup-8-1 | OvsdbConnectionManager | 159 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | Library disconnected ACTIVE from /209.132.179.50:6634 to /172.18.182.19:36453. Cleaning up the operational data store
2016-02-11 01:03:28,783 | INFO | lt-dispatcher-21 | OvsdbConnectionManager | 159 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://209.132.179.50:6634}]} has no owner, cleaning up the operational data store

log messages of interest, I think, coming from step J:

2016-02-11 01:03:32,432 | INFO | lt-dispatcher-21 | OvsdbConnectionManager | 159 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | Disconnecting from 209.132.179.50:6634
2016-02-11 01:03:32,621 | WARN | ult-dispatcher-4 | OvsdbDataChangeListener | 159 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | Connection to device ConnectionInfo{getRemoteIp=IpAddress [_ipv4Address=Ipv4Address [_value=209.132.179.50], _value=[2, 0, 9, ., 1, 3, 2, ., 1, 7, 9, ., 5, 0]], getRemotePort=PortNumber [_value=6634], augmentations={}} already exists. Plugin does not allow multiple connections to same device, hence dropping the request OvsdbNodeAugmentation{getConnectionInfo=ConnectionInfo{getRemoteIp=IpAddress [_ipv4Address=Ipv4Address [_value=209.132.179.50], _value=[2, 0, 9, ., 1, 3, 2, ., 1, 7, 9, ., 5, 0]], getRemotePort=PortNumber [_value=6634], augmentations={}}}
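For reference, a rough Python sketch of the flow above (the controller address, credentials, and OVS node IP are placeholders; steps A, D, and E run on the OVS node itself, e.g. over SSH, and are only shown here as comments/local calls):

    import subprocess
    import requests

    ODL = "http://127.0.0.1:8181"          # controller RESTCONF (assumed defaults)
    AUTH = ("admin", "admin")
    OVS_IP = "192.0.2.10"                  # OVS node listening passively on ptcp:6634
    NODE_URL = (ODL + "/restconf/config/network-topology:network-topology/"
                "topology/ovsdb:1/node/ovsdb:%2F%2F" + OVS_IP + ":6634")
    OPER_URL = NODE_URL.replace("/config/", "/operational/")

    def connect():
        """Step B: controller-initiated connection, written into the config store."""
        body = {"network-topology:node": [{
            "node-id": "ovsdb://" + OVS_IP + ":6634",
            "connection-info": {"ovsdb:remote-ip": OVS_IP, "ovsdb:remote-port": 6634},
        }]}
        return requests.put(NODE_URL, json=body, auth=AUTH)

    def in_operational():
        """Steps C/F/I: is the node present in the operational store?"""
        return requests.get(OPER_URL, auth=AUTH).status_code == 200

    # Step A, on the OVS node:  ovs-vsctl set-manager ptcp:6634
    connect()                                       # B
    print("C:", in_operational())                   # expect True
    # Steps D/E, on the OVS node:
    subprocess.run(["ovs-ctl", "stop"])
    subprocess.run(["ovs-ctl", "start"])
    print("F:", in_operational())                   # False -- no retry (the bug)
    requests.delete(NODE_URL, auth=AUTH)            # H: try to recover
    print("J:", connect().status_code)              # re-connect attempt is rejected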

Comment by Anil Vishnoi [ 11/Feb/16 ]

(In reply to Sam Hague from comment #4)
> Another thought, OVS and OVSDB by default retry connections. It is in the
> other direction rather than the request here, but maybe since that is
> supported we don't really need the functionality requested in the bug as a
> high priority.

Yes, but that only works for active connections, not for passive connections; for passive connections this is more of an enhancement.

Comment by Sam Hague [ 11/Feb/16 ]

(In reply to Anil Vishnoi from comment #7)
> (In reply to Sam Hague from comment #4)
> > Another thought, OVS and OVSDB by default retry connections. It is in the
> > other direction rather than the request here, but maybe since that is
> > supported we don't really need the functionality requested in the bug as a
> > high priority.
>
> Yes, but this is something that works for active connection only, and not
> for passive connection and for passive connection it's kind of enhancement
> work.

Agreed, that is what I meant by "other direction" - it is the active connection rather than passive.

Comment by Anil Vishnoi [ 10/Mar/16 ]

stable/beryllium : https://git.opendaylight.org/gerrit/36028

There are two issues discussed in this bug:

(1) No retry for a controller-initiated connection if the connection gets dropped.

(2) Sometimes, when the switch goes away abruptly (machine down, etc.), the controller is not able to establish a connection to the switch and there is no operational data in the data store.

The above patch fixes issue (1). It adds a reconciliation mechanism: if a controller-initiated connection gets dropped (connection flapping, machine crash), the plugin immediately attempts to reconnect and then makes 10 more attempts with an incrementally increasing interval (10, 20, 30, 40, ... 100 seconds). Overall it waits roughly 9 minutes for the switch to come back up; after that it gives up. If there are use cases that need a longer window, I am open to suggestions.
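A minimal Python sketch of the behaviour described above, just to make the schedule concrete (the real implementation is the Java code in the patch; `connect` stands in for the plugin's connection attempt):

    import time

    def reconcile(connect, retries=10, step_seconds=10):
        """Immediate reconnect attempt, then up to 10 retries at 10 s, 20 s, ..., 100 s."""
        if connect():
            return True
        for attempt in range(1, retries + 1):
            time.sleep(attempt * step_seconds)   # 10, 20, ..., 100 seconds
            if connect():
                return True
        return False  # give up after roughly 9 minutes of waiting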

We still recommend that, for a planned outage, the user application explicitly disconnect the switch.

Issue (2) happens because the TCP connection is not reset/terminated properly on the controller side when the switch goes down without sending a TCP FIN packet. A fix for this issue is proposed in the following patch: https://git.opendaylight.org/gerrit/#/c/35436/
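As background on issue (2): when the peer dies without sending a FIN, the local socket stays established until something probes it, which TCP keepalive can do. The sketch below is a general illustration of the Linux socket options involved, not necessarily what the linked patch does:

    import socket

    # Illustration only: make the kernel probe an idle connection so a peer that
    # vanished without a FIN is detected instead of leaving the socket half-open.
    sock = socket.create_connection(("192.0.2.10", 6634))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)   # seconds idle before probing
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset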

Comment by Anil Vishnoi [ 31/Mar/16 ]

patch stable/beryllium : https://git.opendaylight.org/gerrit/#/c/36028/

Comment by Anil Vishnoi [ 31/Mar/16 ]

wiki : https://wiki.opendaylight.org/view/OVSDB_Integration:OVSDB_SB_Reconciliation

Comment by Jamo Luhrsen [ 06/Apr/16 ]

As outlined in comment 6, this bug is fixed. I have verified it in a stable/beryllium distribution built on 04/02/2016.

Moving to the Resolved/Fixed state.
