[OVSDB-288] passive connection not reconnected if ovs service is restarted Created: 03/Feb/16 Updated: 06/Apr/16 Resolved: 06/Apr/16 |
|
| Status: | Resolved |
| Project: | ovsdb |
| Component/s: | Southbound.Open_vSwitch |
| Affects Version/s: | unspecified |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jamo Luhrsen | Assignee: | Anil Vishnoi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 5221 |
| Description |
|
scenario:
controller initiates the connection to a passive ovs instance (e.g. ptcp)
instance is connected and configurations can be made
restart the ovs instance (reboot entire system, or stop/start ovsdb-server/ovs-vswitchd)

result: the ovs instance is no longer connected or present in the operational store. |
| Comments |
| Comment by Anil Vishnoi [ 10/Feb/16 ] |
|
We can add retry logic, but there is no deterministic way to decide when and for how long we should retry. There is a possibility that the switch is gone for good and will never come back. That said, I agree we should retry for a reasonable time to at least see whether the switch comes back: for example, make 10 retries, with the first attempt 10 seconds after disconnection and the delay increasing by 10 seconds for each subsequent attempt (1st try at 10s, 2nd at 20s, 3rd at 30s, 4th at 40s ... 10th at 100s). That gives the user a window of roughly 9 minutes to reboot/reconfigure the system. |
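To make the proposed schedule concrete, here is a minimal, hypothetical Java sketch of such a linear back-off reconnect loop. The class and method names (`ReconnectScheduler`, `tryConnect`) are invented for illustration and are not the actual OVSDB southbound plugin code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the linear back-off described above: up to 10
 * reconnect attempts, where the Nth attempt is scheduled N * 10 seconds
 * after the previous failure (10s, 20s, ... 100s).
 */
public class ReconnectScheduler {

    private static final int MAX_ATTEMPTS = 10;
    private static final long BASE_DELAY_SECONDS = 10;

    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Placeholder for the real connect call; returns true on success. */
    private boolean tryConnect(String host, int port) {
        // real code would open the OVSDB channel here
        return false;
    }

    public void scheduleReconnect(String host, int port, int attempt) {
        if (attempt > MAX_ATTEMPTS) {
            // give up: the switch is assumed gone; the user must re-add the node
            return;
        }
        long delay = attempt * BASE_DELAY_SECONDS; // 10, 20, ... 100 seconds
        executor.schedule(() -> {
            if (!tryConnect(host, port)) {
                scheduleReconnect(host, port, attempt + 1);
            }
        }, delay, TimeUnit.SECONDS);
    }
}
```

The delays sum to 10 + 20 + ... + 100 = 550 seconds, a little over 9 minutes, which matches the window mentioned above.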
| Comment by Jamo Luhrsen [ 10/Feb/16 ] |
|
(In reply to Anil Vishnoi from comment #1) OFP has retry mechanisms, right? Why not use something similar? Not sure |
| Comment by Anil Vishnoi [ 10/Feb/16 ] |
|
(In reply to Jamo Luhrsen from comment #2) OFP has no retry logic, because in OFP there are only active connections (switch to controller). Retrying forever is a bad idea: there is a possibility that the server/compute node will never come back, in which case we would keep retrying indefinitely. If this is a planned outage, the application needs to remove that configuration from the data store and add it back once the server/ovsdb is back. |
| Comment by Sam Hague [ 10/Feb/16 ] |
|
Another thought: OVS and OVSDB retry connections by default. That is in the other direction from what is requested here, but since that is supported, maybe we don't really need the functionality requested in this bug as a high priority. |
| Comment by Jamo Luhrsen [ 11/Feb/16 ] |
|
(In reply to Anil Vishnoi from comment #3) ok, you are right about retrying indefinitely. |
| Comment by Jamo Luhrsen [ 11/Feb/16 ] |
|
so this bug is being attributed to CSIT failures for not quite the right reason.

The more serious failure that CSIT is failing on goes like this:
A B C D E F G H I J

log messages of interest, I think, coming from step E:

2016-02-11 01:03:28,759 | INFO | entLoopGroup-8-1 | OvsdbConnectionService | 153 - org.opendaylight.ovsdb.library - 1.2.1.SNAPSHOT | Connection closed ConnectionInfo [Remote-address=209.132.179.50, Remote-port=6634, Local-address=172.18.182.19, Local-port=36453, type=ACTIVE] ]

/node/node[ {(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://209.132.179.50:6634}]} has no owner, cleaning up the operational data store

log messages of interest, I think, coming from step J:

2016-02-11 01:03:32,432 | INFO | lt-dispatcher-21 | OvsdbConnectionManager | 159 - org.opendaylight.ovsdb.southbound-impl - 1.2.1.SNAPSHOT | Disconnecting from 209.132.179.50:6634 |
| Comment by Anil Vishnoi [ 11/Feb/16 ] |
|
(In reply to Sam Hague from comment #4) Yes, but that only works for active connections, not for passive connections; for passive connections this is more of an enhancement. |
| Comment by Sam Hague [ 11/Feb/16 ] |
|
(In reply to Anil Vishnoi from comment #7) Agreed, that is what I meant by "other direction" - it is the active connection rather than passive. |
| Comment by Anil Vishnoi [ 10/Mar/16 ] |
|
stable/beryllium : https://git.opendaylight.org/gerrit/36028

There are two issues discussed in this bug:
(1) No retry for a controller-initiated connection if the connection gets dropped.
(2) Sometimes, when the switch goes away abruptly (machine down, etc.), the controller is not able to establish a connection to the switch and there is no operational data in the data store.

The above patch fixes issue (1). It adds a reconciliation mechanism: if a controller-initiated connection gets dropped (connection flapping, machine crash), the controller immediately attempts to connect back, and after that it makes 10 attempts to connect to the switch with an incrementing interval (10, 20, 30, 40 ... 100 seconds). Overall it waits about 9 minutes for the switch to come back up; after that it gives up. If there are use cases where we need to wait longer, I am open to suggestions. We still recommend that, if it is a planned outage, the user application explicitly disconnect the switch.

Issue (2) happens because the TCP connection is not reset/terminated properly from the controller side when the switch goes down without sending a TCP FIN packet. A fix for this issue is proposed in the following patch: https://git.opendaylight.org/gerrit/#/c/35436/ |
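The gerrit change linked above is the authoritative fix for issue (2); purely as an illustration of how a controller-initiated (ACTIVE) connection can notice a peer that disappeared without sending a TCP FIN, here is a minimal sketch that enables SO_KEEPALIVE on a Netty client bootstrap. Netty is assumed as the transport here (the log thread names above suggest it), and the class name and handler wiring are invented; the actual patch may take a different approach:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

/**
 * Minimal sketch (not the actual gerrit 35436 change): enable SO_KEEPALIVE on
 * the controller-initiated connection so the kernel eventually probes a peer
 * that died without a FIN and tears down the half-open socket, letting the
 * channel's close future fire and the operational data be cleaned up.
 */
public class ActiveConnectionBootstrap {

    public ChannelFuture connect(String host, int port) {
        NioEventLoopGroup group = new NioEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap()
                .group(group)
                .channel(NioSocketChannel.class)
                .option(ChannelOption.SO_KEEPALIVE, true)           // probe dead peers
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 5000) // fail fast on connect
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // real code would add the OVSDB JSON-RPC codec handlers here
                    }
                });
        // caller is responsible for shutting down the event loop group on close
        return bootstrap.connect(host, port);
    }
}
```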
| Comment by Anil Vishnoi [ 31/Mar/16 ] |
|
patch stable/beryllium : https://git.opendaylight.org/gerrit/#/c/36028/ |
| Comment by Anil Vishnoi [ 31/Mar/16 ] |
|
wiki : https://wiki.opendaylight.org/view/OVSDB_Integration:OVSDB_SB_Reconciliation |
| Comment by Jamo Luhrsen [ 06/Apr/16 ] |
|
As outlined in comment 6, this bug is fixed. I have verified it in a stable/beryllium distro built on 04/02/2016. Moving to Resolved/Fixed state. |