[OVSDB-472] not all tunnels getting created Created: 30/Oct/18  Updated: 28/Nov/18  Resolved: 28/Nov/18

Status: Resolved
Project: ovsdb
Component/s: Southbound.Open_vSwitch
Affects Version/s: None
Fix Version/s: Oxygen-SR4, Fluorine-SR2, Neon

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Tim Rozet
Resolution: Done Votes: 0
Labels: csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   

Sporadically in upstream CSIT jobs we are getting failures where instances are not able to
get their IP address. It looks like the cause might be that not all VXLAN tunnels are
getting created.

In this failure it seems that compute #2 does not have the tunnel to the control node,
even though the corresponding tunnel does exist on the control node. Somewhere this
creation is failing.
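
For reference, a minimal sketch of how the per-node check could be scripted (hypothetical helper; assumes ITM names its tunnel ports with a "tun" prefix on br-int, as noted in the comments below, and that ovs-vsctl is available on each node):

    # Hypothetical check: list ITM-created VXLAN tunnel ports on br-int
    # and flag a node that is missing them.
    import subprocess

    def tunnel_ports(bridge="br-int"):
        # ovs-vsctl list-ports prints one port name per line
        out = subprocess.check_output(
            ["ovs-vsctl", "list-ports", bridge], text=True)
        return [p for p in out.splitlines() if p.startswith("tun")]

    ports = tunnel_ports()
    if not ports:
        print("no ITM tunnel ports on br-int - tunnel creation may have failed")
    else:
        print("tunnel ports:", ports)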

All logs are under this directory.

An identical passing job, for comparison.



 Comments   
Comment by Jamo Luhrsen [ 08/Nov/18 ]

Is anyone looking at this?

Here's another example in our Fluorine SR1 release candidate:

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen: Will take a look at this today.

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen, shague: Could you please check whether the failing job link above is correct? It shows a passing job.

And I do see a total of 6 tunnels across control, compute-1, and compute-2 in the linked job.

And I assume that you are talking about the br-int bridge on OVS. Please note that Genius ITM manages only tunnels created with the "tun" prefix. If you are talking about the "br-datacentre" tunnels, that bridge is not even connected to the controller via OpenFlow, and I don't know who creates those tunnels.

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen: Looking at https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-itm-direct-tunnels-fluorine/119/robot-plugin/log_full.html.gz, the Validate Deployment step failed with a 404 error:

${output} = 2018-11-08 18:35:27,926 | ERR | common.rest_client | 0052 | 404 Client Error: Not Found for url: http://10.30.170.77:8181/restconf/config/ietf-interfaces:interfaces

This says the URL was not found. Are we checking this only once ODL and restconf are fully up, or is something wrong in the script?
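
For what it's worth, a minimal sketch of such a readiness check (hypothetical helper; assumes the restconf URL from the log above and default credentials):

    # Hypothetical readiness poll: retry the restconf URL until it answers
    # instead of validating immediately after the controller boots.
    import time
    import requests

    URL = "http://10.30.170.77:8181/restconf/config/ietf-interfaces:interfaces"

    def wait_for_restconf(url=URL, timeout=300, interval=10):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                r = requests.get(url, auth=("admin", "admin"), timeout=5)
                # a 404 here can simply mean restconf is not fully wired up yet
                if r.status_code == 200:
                    return True
            except requests.RequestException:
                pass  # controller not answering yet
            time.sleep(interval)
        return False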

Comment by Sam Hague [ 09/Nov/18 ]

Yes, the services are up, at least whatever the diagstatus check is looking for - they are checked just before the tunnels are checked: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-itm-direct-tunnels-fluorine/119/robot-plugin/log_full.html.gz#s1-s1-k1-k1-k9-k2-k1-k1-k1-k1-k1-k1-k1-k1-k1-k1-k4

But this last link is for the itm-direct-tunnels case, where it does look like the odltools check isn't right. I will push a patch to ignore the check for itm-direct-tunnels or use a different method for it.

Comment by Sam Hague [ 09/Nov/18 ]

The original job linked in the description does not have the issue described, so I think we have the wrong link there. We will look for a better link.

Comment by Jamo Luhrsen [ 09/Nov/18 ]

What about the link to the failing job I added yesterday? Is that not this bug?

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen: That was an itm-direct-tunnels job, where the validate tunnels script has to be modified, as the datastore URL being checked is not valid for ITM direct tunnels.

Comment by Sam Hague [ 09/Nov/18 ]

No, the itm-direct-tunnels CSIT is checking for tunnels using odltools, but odltools does not support that check today, so the validate check there always fails regardless of the actual tunnel state. Further down you can see that the tunnels are there.

We need to find a non itm-direct-tunnels CSIT run that has the issue. The one linked in the description does not have it, so I think we posted the wrong link there. I looked through the recent CSIT runs briefly and did not find a tunnel-failure job so far.

We also need to tweak the validation for the itm-direct-tunnels CSIT to use a different method, or get Vishal's changes in to add the support to odltools. For now, I think I will just pass the check when itm-direct-tunnels is configured so we don't falsely fail the suite setup.
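
Roughly this shape (hypothetical names; the real suite is Robot Framework, sketched here in Python):

    # Hypothetical sketch of the suite-setup workaround: when the job is
    # configured for itm-direct-tunnels, skip the odltools tunnel check
    # (which odltools cannot perform yet) instead of failing the suite.
    def validate_suite_setup(itm_direct_tunnels, run_odltools_check):
        if itm_direct_tunnels:
            return True  # pass for now; odltools support is pending
        return run_odltools_check()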

Comment by Jamo Luhrsen [ 09/Nov/18 ]

How is itm-direct-tunnels passing sometimes? Shouldn't it always be blowing up like the link I posted?

Comment by Jamo Luhrsen [ 09/Nov/18 ]

How about this guy?

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen, shague: The last link given by Jamo shows the tunnel as "DOWN". But it would have been nice if we had the TearDown dumps for this step, as I cannot easily debug without knowing whether the tunnel was present on the switch or not.

Comment by Faseela K [ 09/Nov/18 ]

jluhrsen: I cannot even find the karaf logs in the folder in this case. If the validation step fails, are those not captured?

Comment by Jamo Luhrsen [ 09/Nov/18 ]

I chatted with Faseela on IRC, but for the record, here are the test teardown debugs and karaf logs.

Comment by Sam Hague [ 09/Nov/18 ]

There was a bug in the validate code that was letting some runs pass; remember the FAIL or False thing? Previously the check was just "${status}" == "FAIL", but odltools was returning False, so the run was treated as validated. I changed that recently to check for both FAIL and False.
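
In Python terms, the pitfall looked something like this (hypothetical names; the real check is a Robot Framework condition):

    # The old check only treated the string "FAIL" as a failure, so the
    # boolean False coming back from odltools slipped through as a pass.
    def validated_old(status):
        return status != "FAIL"               # False != "FAIL" -> wrongly passes

    def validated_fixed(status):
        return status not in ("FAIL", False)  # both now count as failure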

Comment by Faseela K [ 10/Nov/18 ]

More details of the discussion at https://lists.opendaylight.org/pipermail/genius-dev/2018-November/003429.html

This seems not to be a Genius bug. jluhrsen / shague: should we move this to the appropriate project that handles the script?

Comment by Jamo Luhrsen [ 10/Nov/18 ]

I don't believe it yet.

I've replied on that email thread, but we can discuss things here too. What should we do next
to get to the bottom of this?

BTW, there is not an easy way yet to enable TRACE debugs with a Jenkins parameter,
since this is an apex job. We need to account for that somehow. I guess I can add the
debugging with a CSIT patch if we need it.

Comment by Faseela K [ 13/Nov/18 ]

jluhrsen: From the current INFO logs alone, it is clear that the tunnel deletion is not triggered from Genius.

thapar: Do you have any clue about the auto-bridge creation logic that could be going wrong here? If so, should we move the Jira to Netvirt?

Just an FYI: I am not currently working on this issue.

Comment by Faseela K [ 13/Nov/18 ]

jluhrsen: TRACE logs for org.opendaylight.genius.interfacemanager and org.opendaylight.genius.itm (along with whatever thapar is going to ask for).

Comment by Vishal Thapar [ 13/Nov/18 ]

TRACE logs for org.opendaylight.netvirt.elan.internal.ElanBridgeManager

Comment by Vishal Thapar [ 13/Nov/18 ]

ITM may not be needed if the interfaces are present in the IFM config. Too much logging will cause problems. I would prefer to go with DEBUG for interfacemanager rather than TRACE.

Comment by Faseela K [ 13/Nov/18 ]

I asked for ITM because I was wondering whether the auto-tunnels code, along with itm-direct-tunnels, somehow kicks in and deletes tunnels on the switch (of course the failed job is not an itm-direct-tunnels one, but what if the check has gone wrong somewhere).

interfacemanager has enough INFO logs, and none of them are showing up in this delete case, so maybe the interfacemanager logs are not needed.

Comment by Vishal Thapar [ 13/Nov/18 ]

Could be, if we're switching between the two, but I think we're not.
