[GENIUS-254] default tunnels are down and/or missing from config Created: 10/Dec/18  Updated: 13/Jun/19  Resolved: 13/Jun/19

Status: Verified
Project: genius
Component/s: General
Affects Version/s: None
Fix Version/s: Sodium

Type: Bug Priority: Medium
Reporter: Jamo Luhrsen Assignee: nidhi adhvaryu
Resolution: Done Votes: 0
Labels: apex:gate, csit:failures
Remaining Estimate: 0 minutes
Time Spent: 1 day
Original Estimate: Not Specified

Attachments: File karaf-1.tar.xz    

 Description   

In approximately 1 in 30 jobs, the suite setups is failing because the odltools check
to verify default tunnels are up is not passing.

example output:

2018-12-09 08:33:29,241 | ERR | common.rest_client   | 0052 | 404 Client Error: Not Found for url: http://192.0.2.19:8081/restconf/config/itm-state:dpn-teps-state
Analysing transport-zone:default-transport-zone
..TerminationPoint tun9c586aac495 between 192.0.2.19 and 192.0.2.5 missing from config
..Interface tuna642e00a5c1 is down between 192.0.2.5 and 192.0.2.19

Examples:
one

two

three

[Note that jobs 1 and 2 will be purged in a week, but infra is stopping me from copying logs to the logs server at the moment]



 Comments   
Comment by Jamo Luhrsen [ 10/Dec/18 ]

email thread discussing this problem

Comment by Jamo Luhrsen [ 11/Dec/18 ]

another example

Comment by Vishal Thapar [ 17/Dec/18 ]

Since this is NETVIRT Issue, will be fixed as https://jira.opendaylight.org/browse/NETVIRT-1541

Comment by Vishal Thapar [ 13/Jan/19 ]

Saw a different variation of the issue with bug in genius

Comment by Vishal Thapar [ 13/Jan/19 ]

While testing NETVIRT-1541 fixes, ran into another variation of this bug in d/s testing, this time in Genius.

tunea6b6a44e1b is not created. This is what sequence of log entries looks like:

2019-01-11T07:46:30,224 | INFO | nioEventLoopGroup-9-9 | opendaylight.openflowplugin.impl.device.initialization.OF13DeviceInitializer | Static node openflow:9444498845632 info: OFPMPPORTDESC collected
2019-01-11T07:46:30,226 | DEBUG | ForkJoinPool-1-worker-5 | org.opendaylight.genius.interfacemanager.listeners.InterfaceTopologyStateListener | Bridge entry not found in config DS for dpn: 154706290356175
2019-01-11T07:46:30,226 | INFO | ForkJoinPool-1-worker-5 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | adding tunnel configuration for interface tunea6b6a44e1b
2019-01-11T07:46:30,227 | DEBUG | ForkJoinPool-1-worker-5 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | creating bridge interfaceEntry in ConfigDS 154706290356175

No further references to this tunnel after this.

This is what they look like for a tunnel that gets created:

2019-01-11T07:46:30,208 | INFO | ForkJoinPool-1-worker-2 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | adding tunnel configuration for interface tun4fab841ac4e
2019-01-11T07:46:30,208 | DEBUG | ForkJoinPool-1-worker-2 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | creating bridge interfaceEntry in ConfigDS 9444498845632
2019-01-11T07:46:30,209 | DEBUG | ForkJoinPool-1-worker-2 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | creating bridge interface on dpn 9444498845632
2019-01-11T07:46:30,209 | DEBUG | ForkJoinPool-1-worker-2 | genius.interfacemanager.renderer.ovs.utilities.SouthboundUtils | adding tunnel port tun4fab841ac4e to bridge InstanceIdentifier

The highlighted log entry in second is missing from first.

Looking at the code, following check is failing

LOG.debug("creating bridge interfaceEntry in ConfigDS {}", dpId);
interfaceMetaUtils.createBridgeInterfaceEntryInConfigDS(dpId, interfaceNew.getName());

// create bridge on switch, if switch is connected
BridgeRefEntry bridgeRefEntry = interfaceMetaUtils.getBridgeRefEntryFromOperDS(dpId);
if (bridgeRefEntry != null && bridgeRefEntry.getBridgeReference() != null) {
LOG.debug("creating bridge interface on dpn {}", dpId);

This means bridgeRefEntry is missing, even though it was created couple log entries before the one above:

2019-01-11T07:46:30,216 | DEBUG | ForkJoinPool-1-worker-5 | org.opendaylight.genius.interfacemanager.commons.InterfaceMetaUtils | Creating bridge ref entry for dpn: 154706290356175 bridge:

For reference, this is log entry for bridgeRefEntry for the working case:

2019-01-11T07:46:29,351 | DEBUG | ForkJoinPool-1-worker-4 | org.opendaylight.genius.interfacemanager.commons.InterfaceMetaUtils | Creating bridge ref entry for dpn: 9444498845632 bridge:

In working there is a gap of about 1s between two. In failure case it is less than 10ms.

Comment by Vishal Thapar [ 14/Jan/19 ]

k.faseela Issue is tunnel missing from OVSDB config. From log entries I shared you can see ITM never pushes the tunnel to OVSDB config and have pointed out what I think is the race condition. Any suggestions on fix?

Comment by Faseela K [ 16/Jan/19 ]

thapar : Got the issue, let me think about an easy way to fix this.

Comment by Jamo Luhrsen [ 16/Jan/19 ]

great what's the TL;DR on the root cause?

Comment by Vishal Thapar [ 16/Jan/19 ]

jluhrsen Race condition. Genius stores BridgeRefEntry that maps a bridge to its corresponding DPNID. It gets populated when bridge is created or switch first connects, whichever is latter. Tunnels creation trigger is also same but different part of code. Due to extra steps in tunnel creation, we mostly have this entry in place before we get to point where we try creating tunnel port. But in this case by the time we read it is not present yet. It might have something to do with changes done as part of tx runner that we're hitting this race more frequently.

Comment by Faseela K [ 16/Jan/19 ]

thapar : Here is an overall idea of the fix I want to try out.

                         FYI, not tested yet.

                          https://git.opendaylight.org/gerrit/79553

 

Comment by Faseela K [ 17/Jan/19 ]

thapar/jluhrsen : Is it possible to run the patch against netvirt CSIT couple of times to see if the issue is not reproducible now? Have done local testing to make sure the patch is not breaking anything, but couldn't really hit the race condition timings to see if that problem is solved.

Comment by Jamo Luhrsen [ 17/Jan/19 ]

this gerrit keyword should trigger the 3node job: "test-genius-cluster-netvirt"
I triggered it once just now.

Comment by nidhi adhvaryu [ 31/May/19 ]

For ITM also we have added a similar fix.

https://git.opendaylight.org/gerrit/#/c/81980/

Comment by nidhi adhvaryu [ 13/Jun/19 ]

The fix for this issue has been merged here, https://git.opendaylight.org/gerrit/#/c/81980/ for ITM.

Generated at Wed Feb 07 20:00:18 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.