[GENIUS-119] Auto tunnels: race between creation of tun port and bridge ref, port not created Created: 09/Apr/18 Updated: 13/Jun/19 Resolved: 13/Jun/19 |
|
| Status: | Verified |
| Project: | genius |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Sodium |
| Type: | Bug | Priority: | Medium |
| Reporter: | Josh Hershberg | Assignee: | nidhi adhvaryu |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Description |
|
What went wrong functionally: I'm currently working on CSIT to load test auto-tunnels. The test starts 30 docker containers each containing an OVS. Once the containers are running the test quickly set-manager and set-controllers the OVS instances. Test results show that some of the OVS nodes do not get all tunnel ports. Why? The reason for this is a race condition where genius attempts to create a tunnel port on the br-int of a node before the BridgeRedEntry is created for that br-int. As such the following code in OvsInterfaceConfigAddHelper#addTunnelConfiguration never enters the if block and the port is never created: LOG.debug("creating bridge interfaceEntry in ConfigDS {}", dpId);
interfaceMetaUtils.createBridgeInterfaceEntryInConfigDS(dpId, interfaceNew.getName());
// create bridge on switch, if switch is connectedBridgeRefEntry bridgeRefEntry = interfaceMetaUtils.getBridgeRefEntryFromOperDS(dpId);
if (bridgeRefEntry != null && bridgeRefEntry.getBridgeReference() != null) {
LOG.debug("creating bridge interface on dpn {}", dpId);
InstanceIdentifier<OvsdbBridgeAugmentation> bridgeIid =
(InstanceIdentifier<OvsdbBridgeAugmentation>) bridgeRefEntry
.getBridgeReference().getValue();
if (createTunnelPort) {
southboundUtils.addPortToBridge(bridgeIid, interfaceNew, tunnelName);
}
...
This is evident from the times and line numbers on the following log statements. First we have: 36516 2018-04-08T12:38:16,674 | INFO | ForkJoinPool-1-worker-3 | OvsInterfaceConfigAddHelper | 328 - org.opendaylight.genius.interfacemanager-impl - 0.4.1.SNAPSHOT | a dding tunnel configuration for interface tuneebd1154715 36517 2018-04-08T12:38:16,674 | DEBUG | ForkJoinPool-1-worker-3 | OvsInterfaceConfigAddHelper | 328 - org.opendaylight.genius.interfacemanager-impl - 0.4.1.SNAPSHOT | c reating bridge interfaceEntry in ConfigDS 77216042541384 The next line should have been "creating bridge interface on dpn..." (from the code above) but that line never comes because the BridgeRefEntry is only created here: 37304 2018-04-08T12:38:17,159 | DEBUG | ForkJoinPool-1-worker-3 | InterfaceMetaUtils | 328 - org.opendaylight.genius.interfacemanager-impl - 0.4.1.SNAPSHOT | C reating bridge ref entry for dpn: 77216042541384 bridge: InstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.ovsdb .rev150105.OvsdbBridgeAugmentation, path=[org.opendaylight.yang.gen.v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.NetworkTopology, org.opendaylight.yang.gen .v1.urn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.Topology[key=TopologyKey [_topologyId=Uri [_value=ovsdb:1]]], org.opendaylight.yang.gen.v1.u rn.tbd.params.xml.ns.yang.network.topology.rev131021.network.topology.topology.Node[key=NodeKey [_nodeId=Uri [_value=ovsdb://uuid/9630ada5-c950-49ec-8599-a53e776ab729/ bridge/br-int]]], org.opendaylight.yang.gen.v1.urn.opendaylight.params.xml.ns.yang.ovsdb.rev150105.OvsdbBridgeAugmentation]}
|
| Comments |
| Comment by Faseela K [ 09/Apr/18 ] |
|
jhershbe : There is InterfaceTopologyStateListener which fetches all tunnel interfaces from DS and program them on the switch. Are you saying this also did not work? Because it ended up in race with tunnel-interface creation thread? |
| Comment by Faseela K [ 10/Apr/18 ] |
|
jhershbe : I can see two approaches to solve this :
Suggestions? |
| Comment by Josh Hershberg [ 10/Apr/18 ] |
I did not delve into the details but, no, it did not work. |
| Comment by Josh Hershberg [ 10/Apr/18 ] |
|
k.faseela, regarding the two approaches you propose, I'm thinking that since (1) is SO much simpler (much smaller change) and safer it's probably worth giving it a try. I figure since I have that CSIT job I'm working on I can implement (1) and then do some scale tests with and without it to make sure the extra registrations do not cause problems with scale. I don't expect it to be an issue since from 30-40 nodes this race conditions happens just 1-2 times. Sound good? |
| Comment by Faseela K [ 10/Apr/18 ] |
|
jhershbe : I remember vorburger initially indicating that this EventCallBackRegistrar has some performance issues, which we will fix in subsequent patches, not sure whether those patches actually came in. Please keep in mind that for Oxygen, Ericsson's requirements is to support 150 switches, and we do have plans to use auto-tunnels. |
| Comment by Michael Vorburger [ 10/Apr/18 ] |
|
k.faseela I never said that EventCallBackRegistrar has known some performance issues, just that if there are ever reasons that prove that its current implementation is a performance bottleneck, it could possibly be optimized; see also the in-line comment here. |
| Comment by Faseela K [ 11/Apr/18 ] |
|
Thanks vorburger ! Josh is proceeding with option1. |
| Comment by Josh Hershberg [ 12/Apr/18 ] |
|
Noted, that's why I want to test it I've been thinking more about this specific bug and fix and I think it is the perfect case for the registrar. The reason is that here we have a kind of corner case and we know that the "happy path" works 99.99% of the time. As such, it would be a shame to rewrite things for this corner case and we have a relatively clear and simply solution in the form of the regsitrar. Just my two cents. Anyway, I plan on checking the performance viability of it. |
| Comment by Josh Hershberg [ 12/Apr/18 ] |
|
Hah, just saw this and said the same thing a second ago. Sorry for the redundancy. |
| Comment by Michael Vorburger [ 19/Apr/18 ] |
|
|
| Comment by Faseela K [ 22/May/18 ] |
|
Josh, any update on this? All the required utils are already merged. Were you able to use them? |
| Comment by Michael Vorburger [ 24/Aug/18 ] |
|
Housekeeping: Looks like this issue didn't make it into Fluorine, thus Fix Version Neon. |
| Comment by nidhi adhvaryu [ 13/Jun/19 ] |
|
There is a timing issue between SB and NB event while creating bridge. When we are trying to add bridge to ovsdb from SB, NB is trying to add port for same bridge. during which NB will check DS/cache for bridge entry, which will be empty as SB is trying to add it in ovsdb at the same time. So we are adding eventcallback to synchronize both events.
The fix for this issue is merged here, https://git.opendaylight.org/gerrit/#/c/81980/ |
| Comment by nidhi adhvaryu [ 13/Jun/19 ] |
|
CSIT has passed on the fix. and it is locally tested. |