[GENIUS-254] default tunnels are down and/or missing from config Created: 10/Dec/18 Updated: 13/Jun/19 Resolved: 13/Jun/19 |
|
| Status: | Verified |
| Project: | genius |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Sodium |
| Type: | Bug | Priority: | Medium |
| Reporter: | Jamo Luhrsen | Assignee: | nidhi adhvaryu |
| Resolution: | Done | Votes: | 0 |
| Labels: | apex:gate, csit:failures | ||
| Remaining Estimate: | 0 minutes | ||
| Time Spent: | 1 day | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
In approximately 1 in 30 jobs, the suite setup is failing because the odltools check fails. Example output:

2018-12-09 08:33:29,241 | ERR | common.rest_client | 0052 | 404 Client Error: Not Found for url: http://192.0.2.19:8081/restconf/config/itm-state:dpn-teps-state
Analysing transport-zone:default-transport-zone
..TerminationPoint tun9c586aac495 between 192.0.2.19 and 192.0.2.5 missing from config
..Interface tuna642e00a5c1 is down between 192.0.2.5 and 192.0.2.19

Examples: [Note that jobs 1 and 2 will be purged in a week, but infra is stopping me from copying logs to the logs server at the moment] |
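For context, here is a minimal sketch of the kind of restconf query the odltools check performs against the controller (assumptions: Java 11+, default admin/admin credentials, and the controller address taken from the log line above); the 404 corresponds to ITM not having written itm-state:dpn-teps-state to the config datastore.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class DpnTepsStateCheck {
    public static void main(String[] args) throws Exception {
        // Controller address from the failing job's logs; credentials are assumed defaults.
        String url = "http://192.0.2.19:8081/restconf/config/itm-state:dpn-teps-state";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A 404 here reproduces the "Not Found" error in the odltools output above.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```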
| Comments |
| Comment by Jamo Luhrsen [ 10/Dec/18 ] |
|
email thread discussing this problem |
| Comment by Jamo Luhrsen [ 11/Dec/18 ] |
| Comment by Vishal Thapar [ 17/Dec/18 ] |
|
Since this is a NETVIRT issue, it will be fixed as https://jira.opendaylight.org/browse/NETVIRT-1541 |
| Comment by Vishal Thapar [ 13/Jan/19 ] |
|
Saw a different variation of the issue, with the bug in genius. |
| Comment by Vishal Thapar [ 13/Jan/19 ] |
|
While testing, tunnel tunea6b6a44e1b is not created. This is what the sequence of log entries looks like:

2019-01-11T07:46:30,224 | INFO | nioEventLoopGroup-9-9 | opendaylight.openflowplugin.impl.device.initialization.OF13DeviceInitializer | Static node openflow:9444498845632 info: OFPMPPORTDESC collected

There are no further references to this tunnel after this. This is what the log entries look like for a tunnel that does get created:

2019-01-11T07:46:30,208 | INFO | ForkJoinPool-1-worker-2 | genius.interfacemanager.renderer.ovs.confighelpers.OvsInterfaceConfigAddHelper | adding tunnel configuration for interface tun4fab841ac4e

The highlighted log entry in the second (working) case is missing from the first (failing) case. Looking at the code, the following check is failing:

LOG.debug("creating bridge interfaceEntry in ConfigDS {}", dpId); // create bridge on switch, if switch is connected

This means bridgeRefEntry is missing, even though it was created a couple of log entries before the one above:

2019-01-11T07:46:30,216 | DEBUG | ForkJoinPool-1-worker-5 | org.opendaylight.genius.interfacemanager.commons.InterfaceMetaUtils | Creating bridge ref entry for dpn: 154706290356175 bridge:

For reference, this is the bridgeRefEntry log entry for the working case:

2019-01-11T07:46:29,351 | DEBUG | ForkJoinPool-1-worker-4 | org.opendaylight.genius.interfacemanager.commons.InterfaceMetaUtils | Creating bridge ref entry for dpn: 9444498845632 bridge:

In the working case there is a gap of about 1 s between these two entries; in the failure case it is less than 10 ms. |
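To make the failing check concrete, here is a hypothetical Java sketch of the pattern described above (method names such as readBridgeRefEntry and addTunnelPortToBridge are illustrative, not the actual genius signatures): when the bridge-ref lookup comes back empty, the add path only emits the "creating bridge interfaceEntry in ConfigDS" debug line and never pushes the termination point to OVSDB.

```java
import java.math.BigInteger;
import java.util.Optional;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative sketch only; not the actual OvsInterfaceConfigAddHelper code. */
public class TunnelConfigAddSketch {

    private static final Logger LOG = LoggerFactory.getLogger(TunnelConfigAddSketch.class);

    void addTunnelConfiguration(BigInteger dpId, String tunnelName) {
        // Hypothetical lookup standing in for the bridge-ref entry read done via InterfaceMetaUtils.
        Optional<String> bridgeRef = readBridgeRefEntry(dpId);

        if (!bridgeRef.isPresent()) {
            // The failing case takes this branch: the entry written by the
            // bridge/switch-connection path is not visible yet, so the code falls back to
            // "create bridge on switch, if switch is connected" and never adds the port.
            LOG.debug("creating bridge interfaceEntry in ConfigDS {}", dpId);
            return;
        }

        // Working case: the entry exists, so the termination point is pushed to OVSDB config.
        LOG.info("adding tunnel configuration for interface {}", tunnelName);
        addTunnelPortToBridge(bridgeRef.get(), tunnelName);
    }

    private Optional<String> readBridgeRefEntry(BigInteger dpId) {
        return Optional.empty(); // placeholder for a datastore/cache read
    }

    private void addTunnelPortToBridge(String bridgeIid, String tunnelName) {
        // placeholder for the OVSDB termination-point write
    }
}
```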
| Comment by Vishal Thapar [ 14/Jan/19 ] |
|
k.faseela The issue is the tunnel missing from OVSDB config. From the log entries I shared, you can see that ITM never pushes the tunnel to OVSDB config, and I have pointed out what I think is the race condition. Any suggestions on a fix? |
| Comment by Faseela K [ 16/Jan/19 ] |
|
thapar : Got the issue, let me think about an easy way to fix this. |
| Comment by Jamo Luhrsen [ 16/Jan/19 ] |
|
great |
| Comment by Vishal Thapar [ 16/Jan/19 ] |
|
jluhrsen Race condition. Genius stores a BridgeRefEntry that maps a bridge to its corresponding DPNID. It gets populated when the bridge is created or the switch first connects, whichever is later. Tunnel creation has the same trigger but goes through a different part of the code. Because of the extra steps in tunnel creation, we usually have this entry in place before we reach the point where we try to create the tunnel port, but in this case the entry is not yet present by the time we read it. It might be because of the changes done as part of the tx runner that we're hitting this race more frequently. |
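The race comes down to two asynchronous paths touching the same entry: the bridge/switch-connection path writes the BridgeRefEntry, and the tunnel-creation path reads it shortly afterwards; when the gap shrinks from roughly 1 s to under 10 ms, the read can run before the write lands. Below is a hypothetical sketch of that timing and of one common mitigation, a bounded retry on the read; this is illustrative only and not necessarily what the linked gerrit patches do.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Illustrative race sketch; the real fix lives in the gerrit patches referenced below. */
public class BridgeRefRaceSketch {

    // Stand-in for the BridgeRefEntry datastore/cache keyed by DPN id.
    private final Map<BigInteger, String> bridgeRefEntries = new ConcurrentHashMap<>();

    /** Path 1: runs when the bridge is created or the switch first connects. */
    void onSwitchConnected(BigInteger dpId, String bridgeIid) {
        bridgeRefEntries.put(dpId, bridgeIid);
    }

    /** Path 2: tunnel creation; reads the entry written by path 1. */
    void onTunnelAdd(BigInteger dpId, String tunnelName) throws InterruptedException {
        // Mitigation sketch: retry the read for a bounded time instead of giving up
        // on the first miss.
        Optional<String> bridgeRef = readWithRetry(dpId, 10, 100);
        if (!bridgeRef.isPresent()) {
            // Without the retry this is the losing side of the race: the port is never added.
            return;
        }
        addTunnelPort(bridgeRef.get(), tunnelName);
    }

    private Optional<String> readWithRetry(BigInteger dpId, int attempts, long delayMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            String entry = bridgeRefEntries.get(dpId);
            if (entry != null) {
                return Optional.of(entry);
            }
            TimeUnit.MILLISECONDS.sleep(delayMillis);
        }
        return Optional.empty();
    }

    private void addTunnelPort(String bridgeIid, String tunnelName) {
        // placeholder for pushing the termination point to OVSDB config
    }
}
```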
| Comment by Faseela K [ 16/Jan/19 ] |
|
thapar : Here is an overall idea of the fix I want to try out. FYI, not tested yet. https://git.opendaylight.org/gerrit/79553
|
| Comment by Faseela K [ 17/Jan/19 ] |
|
thapar/jluhrsen : Is it possible to run the patch against netvirt CSIT a couple of times to see whether the issue is still reproducible? I have done local testing to make sure the patch is not breaking anything, but I couldn't really hit the race condition timing to see whether that problem is solved. |
| Comment by Jamo Luhrsen [ 17/Jan/19 ] |
|
This gerrit keyword should trigger the 3-node job: "test-genius-cluster-netvirt" |
| Comment by nidhi adhvaryu [ 31/May/19 ] |
|
We have added a similar fix for ITM as well. |
| Comment by nidhi adhvaryu [ 13/Jun/19 ] |
|
The fix for this issue for ITM has been merged here: https://git.opendaylight.org/gerrit/#/c/81980/ |