[GENIUS-115] OVS Nodes Connect Quickly, Not All Get TransportZone TEPs Created: 22/Feb/18 Updated: 09/Mar/18 Resolved: 09/Mar/18 |
|
| Status: | Resolved |
| Project: | genius |
| Component/s: | General |
| Affects Version/s: | Oxygen |
| Fix Version/s: | Oxygen |
| Type: | Bug | Priority: | Highest |
| Reporter: | Josh Hershberg | Assignee: | Josh Hershberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | patch_merged | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
When OVS nodes connect quickly, one after another, some nodes do not get all the TEPs. The cause is as follows: in org.opendaylight.genius.itm.confighelpers.ItmTepAddWorker we have the following code:

this.meshedDpnList = dpnTEPsInfoCache.getAllPresent();

Note that "meshedDpnList" is retrieved from a cache. This cache is a subclass of mdsalutil's DataObjectCache, which listens for notifications of type DpnTEPsInfo and caches them in memory. What I observed is that if the switches connect in quick succession, that cache is sometimes not yet updated when the code above executes, so the newly connected switch never gets the TEP it needs to reach the missing DPN.

Consider the following two log lines, separated by less than 0.1 seconds:

2018-02-22T09:35:57,918 | DEBUG | ForkJoinPool-1-worker-3 | ItmInternalTunnelAddWorker | 176 - org.opendaylight.genius.itm-impl - 0.4.0.SNAPSHOT | Updating CONFIGURATION datastore with DPN DPNTEPsInfo [_dPNID=37528269029441, _key=DPNTEPsInfoKey [_dPNID=37528269029441], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=37528269029441::0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=172.17.0.2]], _key=TunnelEndPointsKey [_portname=, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=172.17.0.2]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=255.255.255.255/32]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=default-transport-zone], _zoneName=default-transport-zone, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]]

And

2018-02-22T09:35:57,987 | DEBUG | ForkJoinPool-1-worker-3 | ItmTepAddWorker | 176 - org.opendaylight.genius.itm-impl - 0.4.0.SNAPSHOT | Invoking Internal Tunnel build method with Configured DpnList [DPNTEPsInfo [_dPNID=116901406912847, _key=DPNTEPsInfoKey [_dPNID=116901406912847], _tunnelEndPoints=[TunnelEndPoints [_gwIpAddress=IpAddress [_ipv4Address=Ipv4Address [_value=0.0.0.0]], _interfaceName=116901406912847::0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=172.17.0.3]], _key=TunnelEndPointsKey [_portname=, _vLANID=0, _ipAddress=IpAddress [_ipv4Address=Ipv4Address [_value=172.17.0.3]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan], _optionTunnelTos=0, _portname=, _subnetMask=IpPrefix [_ipv4Prefix=Ipv4Prefix [_value=255.255.255.255/32]], _tunnelType=class org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan, _tzMembership=[TzMembership [_key=TzMembershipKey [_zoneName=default-transport-zone], _zoneName=default-transport-zone, augmentation=[]]], _vLANID=0, _optionOfTunnel=false, augmentation=[]]], augmentation=[]]] ; Meshed DpnList []

Note that the DpnTEPsInfo for 172.17.0.2 is created in the first line but is absent from the meshed list in the second. |
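To make the timing window concrete, here is a minimal, self-contained sketch (plain JDK only, not genius/ODL code; all class and variable names are illustrative) of a cache that, like DataObjectCache, is filled asynchronously from change notifications. A worker that consults such a cache right after another node has been written can see a stale view, which is the same pattern as the getAllPresent() call above:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class CacheRaceSketch {
    // Authoritative store, written synchronously when a node's TEP info is created.
    private static final Map<String, String> DATASTORE = new ConcurrentHashMap<>();
    // In-memory cache, filled by a separate "notification" thread (like DataObjectCache).
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();
    private static final ExecutorService NOTIFIER = Executors.newSingleThreadExecutor();

    private static void addDpn(String dpnId) {
        DATASTORE.put(dpnId, "tep-info-" + dpnId);
        // The cache is only updated when the notification is delivered, some time later.
        NOTIFIER.submit(() -> CACHE.put(dpnId, DATASTORE.get(dpnId)));
        // Equivalent of ItmTepAddWorker reading getAllPresent(): the cache may not yet
        // contain DPNs whose notifications have not been processed.
        System.out.println("adding " + dpnId + ", meshed list seen = " + CACHE.keySet());
    }

    public static void main(String[] args) throws InterruptedException {
        addDpn("dpn-1");
        addDpn("dpn-2"); // often sees an empty or partial meshed list
        NOTIFIER.shutdown();
        NOTIFIER.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("cache eventually catches up: " + CACHE.keySet());
    }
}

Run repeatedly, the second addDpn() call often prints an empty or partial cache even though dpn-1 is already in the "datastore" map, mirroring the empty Meshed DpnList in the second log line.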
| Comments |
| Comment by Vivek Srivastava [ 28/Feb/18 ] |
|
Yes, the race condition exists with any 2 nodes getting added at the same time, so the probability of hitting it in a scale setup is higher: the larger the number of nodes, the greater the likelihood of any 2 nodes coming in at the same time. Regards, Vishal.
From: Josh Hershberg <jhershbe@redhat.com>
Slight correction. The functional failure is a race condition that can occur with just two nodes. Not always, but I've seen it. 20 is just how many docker/ovs instances I can do on my laptop at once.
On Feb 27, 2018 12:44 PM, "Vishal Thapar" <vishal.thapar@ericsson.com> wrote:
Hi Hema,
Using auto tunnels with the current code, it is running into functional failures with 20 DPNs. For the scale testing that we did, do you recall how the teps were added? With a delay after each tep add, in batches with a delay between batches, or all in one shot?
Regards, Vishal.
From: Josh Hershberg <jhershbe@redhat.com>
So I've been running this on my laptop with Docker containers. I can run about 20 of them.
On Feb 27, 2018 10:58, "Hema Gopalkrishnan" <hema.gopalkrishnan@ericsson.com> wrote:
Hi Vishal, Unless we test it in a scale setup we cannot conclude how Josh’s approach will perform in such a scenario. Josh, will you be able to test it in a scaled setup, and how many switches can you scale up to? Thanks, Hema
From: Vishal Thapar
Sent: 27 February 2018 13:26
To: Josh Hershberg <jhershbe@redhat.com>
Similar to what you have already done, but instead of just creation, find out how long it takes for all the tunnels to come up.
Hema, Can you share the log entry that we can use to track when each tunnel comes up? This way from logs we can find out when all tunnels came up and don’t have to poll for it.
Regards, Vishal.
From: Josh Hershberg <jhershbe@redhat.com>
What specifically do you want to test? Measure what, exactly?
On Feb 27, 2018 09:47, "Vishal Thapar" <vishal.thapar@ericsson.com> wrote:
Hi Hema,
It adds a bit more delay but is still susceptible to the same issue. This batching and caching was, at the time, tested and used with manual tep additions, where we have an inherent user-induced delay. Did we factor user-induced delay into the scale+perf numbers? I won’t be surprised if no caching + a DS read gives better performance for auto tunnels when you factor in the user delay.
Josh, since you have the perfect setup, why don’t we test with your patch once and then compare with Hema’s patch? Test numbers will be the best way to see which solution works better.
Regards, Vishal.
From: Josh Hershberg <jhershbe@redhat.com>
Sure. Let me know when it's rebased and then I can test it.
On Tue, Feb 27, 2018 at 8:50 AM, Hema Gopalkrishnan <hema.gopalkrishnan@ericsson.com> wrote:
Hi Josh, I have a fix for addition of teps in quick succession. Here is my patch, which I had raised in Nitrogen. It’s been sitting around for some time now and hasn’t been merged.
The fix here is that the read from the cache is deferred.
This will give it a little more delay for the subsequent teps to “see” the teps added earlier.
I will rebase this patch. Can you please test it with this?
Caching and batching are essential for ITM to scale. I would prefer not to change them.
[0] - https://git.opendaylight.org/gerrit/#/c/62950/
Regards Hema
From: Vishal Thapar
Sent: 27 February 2018 11:56
To: Josh Hershberg <jhershbe@redhat.com>
Yeah, as of now this is the only solution possible. Batching+caching are there to improve performance but were designed for manual addition, where there is a delay between adding teps. The alternative is to require the user to add a manual delay between additions.
My take is that it is okay to do a DS read; this is not the path that needs to be optimized, and correctness is more crucial. Any user-added delay between additions will always be slower than a DS read. So we might as well add the reads, drop the batching, and let things move as fast as they can.
Regards, Vishal.
From: Josh Hershberg <jhershbe@redhat.com>
Here is what I wrote in another thread about a working fix I have:
So, in order to get this to work I found that two fixes are necessary:
1) In ItmTepAddWorker, do NOT use the cached DPNTEPsInfos; read from the DS instead.
2) In ItmInternalTunnelAddWorker, do not queue the new DPNTEPsInfos via ITMBatchingUtils; instead, write them synchronously and unbatched in that thread.
Since this is a rather intrusive fix, I would really appreciate some feedback. I've also just pushed the fix here [0].
I'm sick at home today but will check emails and respond.
-Josh
[0] https://git.opendaylight.org/gerrit/68797
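To spell out what those two changes mean in MD-SAL terms, here is a rough, generic sketch using the Oxygen-era controller binding API. The class and method names below are mine and purely illustrative, not part of the actual patch (see [0] for that):

import com.google.common.base.Optional;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
import org.opendaylight.yangtools.yang.binding.DataObject;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

public final class SyncConfigDsAccess {
    private SyncConfigDsAccess() {
    }

    // Change 1 (sketch): read straight from the CONFIGURATION datastore instead of
    // trusting an asynchronously populated cache.
    public static <T extends DataObject> Optional<T> readConfig(DataBroker broker,
            InstanceIdentifier<T> path) throws ReadFailedException {
        ReadOnlyTransaction tx = broker.newReadOnlyTransaction();
        try {
            return tx.read(LogicalDatastoreType.CONFIGURATION, path).checkedGet();
        } finally {
            tx.close();
        }
    }

    // Change 2 (sketch): write the new data synchronously and unbatched, so it is
    // committed (and visible to later readers) before the current worker moves on.
    public static <T extends DataObject> void writeConfigSync(DataBroker broker,
            InstanceIdentifier<T> path, T data) throws TransactionCommitFailedException {
        WriteTransaction tx = broker.newWriteOnlyTransaction();
        tx.merge(LogicalDatastoreType.CONFIGURATION, path, data);
        tx.submit().checkedGet(); // block until the commit completes, no batching queue
    }
}

The trade-off debated in the rest of this thread sits in writeConfigSync: committing each DPN's TEP info before the worker proceeds closes the window in which a later worker could miss it, but gives up the batching that was added for scale.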
On Tue, Feb 27, 2018 at 7:35 AM, Vivek Srivastava V <vivek.v.srivastava@ericsson.com> wrote:
Hi,
Jira issue [1] is an Oxygen blocker on GENIUS. It is assigned to Josh; I hope he is working on it. I would request Hema and Vishal to team up with Josh to get this resolved ASAP. Please keep the Jira updated regularly.
Regards, Vivek
[1]: https://jira.opendaylight.org/browse/GENIUS-115?filter=10206
|
| Comment by Vivek Srivastava [ 28/Feb/18 ] |
|
A few thoughts on Hema's patch:
As Vishal pointed out above, this patch simply delays the read from the cache and as such does not completely alleviate the race condition. I would also add something about the time scales we seem to be dealing with. As far as I understand the patch in question, we're talking about a delay on the order of tenths to hundredths of a second. However, the time scale at which the race condition occurs is much larger: the docker containers spin up roughly every 1-2 seconds, and the race condition clearly occurs at that rate. As such, I find it highly questionable whether this patch will be effective, and even if it is for my test environment, it still leaves a race condition in the code.
Hema, I'm happy to give it a test on the setup on my computer. I looked at rebasing it and it's not a trivial rebase so I think it would be safer and quicker if you did it since I assume you are much more familiar with the code.
If I had to sum up my understanding of the issue at this point, it is the following: we must choose between three options (or a mix of them):
1) Hema's patch, which improves but does not solve the situation.
2) My patch, which solves the race condition but may introduce a performance hit.
3) Work on a better solution for the next release and in the meantime clearly document the limitations of the system.
-Josh |
| Comment by Vivek Srivastava [ 28/Feb/18 ] |
|
I've been tinkering and I have much better data at this point.
Tinker #1: I modified my test so that instead of connecting 20 OVS instances one every 1-3 seconds, I now connect all 20 in the span of only ~2 seconds.
Tinker #2: I cooked up my own quick-and-dirty version of Hema's patch [0]. Note that although Hema only mentioned the idea of deferring the cache retrieval, the actual patch also contains the non-batched writing of DPNTEPsInfo objects. I believe the modifications in [0] are functionally equivalent to those in the initial patch.
What I've found is that with [0] the problem still occurs. As such, I think we really have no choice but to read directly from the DS.
I also did some VERY initial performance measurements. What I've found is that for my patch (the one with the DS read), the time between the first and last invocation of ItmTepAddWorker.call is roughly 5 seconds.
-Josh
[0]

diff --git a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmExternalTunnelAddWorker.java b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmExternalTunnelAddWorker.java
index 7e8d756..8b4ec66 100644
--- a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmExternalTunnelAddWorker.java
+++ b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmExternalTunnelAddWorker.java
@@ -62,6 +62,7 @@ public class ItmExternalTunnelAddWorker {
     public List<ListenableFuture<Void>> buildTunnelsToExternalEndPoint(Collection<DPNTEPsInfo> cfgDpnList,
             IpAddress extIp, Class<? extends TunnelTypeBase> tunType) {
+        cfgDpnList = dpnTEPsInfoCache.getAllPresent();
         if (null != cfgDpnList) {
             WriteTransaction transaction = dataBroker.newWriteOnlyTransaction();
             for (DPNTEPsInfo teps : cfgDpnList) {
diff --git a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmInternalTunnelAddWorker.java b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmInternalTunnelAddWorker.java
index e30b39e..ec4f73e 100644
--- a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmInternalTunnelAddWorker.java
+++ b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmInternalTunnelAddWorker.java
@@ -87,18 +87,18 @@ public final class ItmInternalTunnelAddWorker {
                 meshedDpnList.add(dpn);
             }
             // Update the config datastore -- FIXME -- Error Handling
-            updateDpnTepInfoToConfig(dpn);
+            updateDpnTepInfoToConfig(dpn, transaction);
         }
     }));
 }

-    private static void updateDpnTepInfoToConfig(DPNTEPsInfo dpn) {
+    private static void updateDpnTepInfoToConfig(DPNTEPsInfo dpn, WriteTransaction tx) {
         LOG.debug("Updating CONFIGURATION datastore with DPN {} ", dpn);
         InstanceIdentifier<DpnEndpoints> dep = InstanceIdentifier.builder(DpnEndpoints.class).build() ;
         List<DPNTEPsInfo> dpnList = new ArrayList<>() ;
         dpnList.add(dpn) ;
         DpnEndpoints tnlBuilder = new DpnEndpointsBuilder().setDPNTEPsInfo(dpnList).build() ;
-        ITMBatchingUtils.update(dep, tnlBuilder, ITMBatchingUtils.EntityType.DEFAULT_CONFIG);
+        tx.merge(LogicalDatastoreType.CONFIGURATION, dep, tnlBuilder);
     }

     private void buildTunnelFrom(DPNTEPsInfo srcDpn, Collection<DPNTEPsInfo> meshedDpnList,
diff --git a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmTepAddWorker.java b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmTepAddWorker.java
index c5df528..aaff762 100644
--- a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmTepAddWorker.java
+++ b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/confighelpers/ItmTepAddWorker.java
@@ -54,6 +54,7 @@ public class ItmTepAddWorker implements Callable<List<ListenableFuture<Void>>> {
     @Override
     public List<ListenableFuture<Void>> call() {
         List<ListenableFuture<Void>> futures = new ArrayList<>() ;
+        this.meshedDpnList = ItmUtils.getDpnTEPsInfos(dataBroker);
         this.meshedDpnList = dpnTEPsInfoCache.getAllPresent();
         LOG.debug("Invoking Internal Tunnel build method with Configured DpnList {} ; Meshed DpnList {} ",
                 cfgdDpnList, meshedDpnList);
diff --git a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/impl/ItmUtils.java b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/impl/ItmUtils.java
index 3d5942d..f45728a 100644
--- a/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/impl/ItmUtils.java
+++ b/itm/itm-impl/src/main/java/org/opendaylight/genius/itm/impl/ItmUtils.java
@@ -1390,4 +1390,14 @@ public final class ItmUtils {
         }
         return exTunnel;
     }
+
+    public static List<DPNTEPsInfo> getDpnTEPsInfos(DataBroker dataBroker) {
+        InstanceIdentifier<DpnEndpoints> iid = InstanceIdentifier.builder(DpnEndpoints.class).build();
+        Optional<DpnEndpoints> dpnEndpoints = ItmUtils.read(LogicalDatastoreType.CONFIGURATION, iid, dataBroker);
+        if (dpnEndpoints.isPresent()) {
+            return dpnEndpoints.get().getDPNTEPsInfo();
+        } else {
+            return new ArrayList<>();
+        }
+    }
 }
|
| Comment by Josh Hershberg [ 28/Feb/18 ] |
|
Some non-scientific times. These are the times between the first and last time ItmTepAddWorker.call is invoked: 1) Non-batching DPNTEPsInfo writes + deferred cache read (my rough copy of Hema's patch):
2) Non-batching DPNTEPsInfo writes + DS read (my patch):
Note, these times are non-scientific and were measured on my laptop with 20 OVS instances connecting over a span of ~2 seconds to a local ODL. Also note that for (1) the configuration was not successful and not all nodes received all the TEPs they required. |
| Comment by Josh Hershberg [ 05/Mar/18 ] |
| Comment by Daniel Farrell [ 08/Mar/18 ] |
|
Waiting on verify job. |
| Comment by Sam Hague [ 09/Mar/18 ] |