[NETVIRT-589] stable/boron not usable in OPNFV test framework - DHCP times out Created: 03/Apr/17 Updated: 20/Apr/17 Resolved: 20/Apr/17 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | Boron |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Nikolas Hermanns | Assignee: | Vyshakh Krishnan |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 8142 |
| Priority: | High |
| Description |
|
Hey, we see big issues using stable/boron (nearly SR-3) in the OPNFV test pipeline. This is the test that is run: 2017-03-31 15:51:41,889 - keystoneauth.identity.v2 - DEBUG - Making authentication request to http://192.168.37.10:5000/v2.0/tokens 2017-03-31 15:51:53,095 - keystoneauth.identity.v2 - DEBUG - Making authentication request to http://192.168.37.10:5000/v2.0/tokens The VM we are using sends 3 DHCP requests, which takes around 120 seconds. All internal transport tunnels are already created beforehand. When I log in to the VM after those 120 seconds and run "ifup eth0", I get an IP immediately. Attached are the flows from OVS when it is not working and after we have waited 2 minutes. The flows from the controller and the flows from the compute node are both shown. What can be seen is that not even table 0 contains the in-port flow. One more interesting thing: the DHCP request either goes through directly (less than 10 seconds) or it needs this long time. |
| Comments |
| Comment by Nikolas Hermanns [ 03/Apr/17 ] |
|
Attachment odl-dhcp-issue.zip has been added with description: tables |
| Comment by Kency Kurian [ 04/Apr/17 ] |
|
Hi Nikolas, could you please let us know how many VMs are being spawned? Is it failing for all the VMs initially, and does it work fine when ifup eth0 is run after 2 minutes? >>One more interesting thing is either the dhcp request goes through directly (less than 10 seconds) or it needs this long time.<< Do you mean that the DHCP request is actually being sent within 10 seconds and not within 120 seconds as we expect? |
| Comment by Nikolas Hermanns [ 04/Apr/17 ] |
|
Hey Kency, thanks for looking into this. I spawn 2 VMs; both have the same issue with a probability of around 70%. Any additional VM has the same issue. In the rare cases where the flows are pushed directly, the DHCP request gets answered directly and I see that the VM is booted up in about 10 seconds. So, in short, 2 cases: You see in the attachment that I added all the flow tables 2 times. That is both times from 1. The not-working one is directly after the VM is spawned. The working one is after ~2 mins. Thanks! Nikolas |
| Comment by Nikolas Hermanns [ 05/Apr/17 ] |
|
I have a new finding! In the OVS logs I can see a lot of these messages: This is related to this bug: After a VM is deleted, groups and flows are not cleaned up. On a freshly deployed system with a clean ODL and a clean OVS everything seems to be working. After creating and removing VMs and networks for a while, we get a bad group entry in OVS: , } So the group is not correctly synced! So I think we still have 2 issues here: |
| Comment by Nikolas Hermanns [ 05/Apr/17 ] |
|
Just to state it one more time: OFPST_GROUP_DESC reply (OF1.3) (xid=0x2): note that group_id=210001 has nothing after it. In ODL it says: |
| Comment by Periyasamy Palanisamy [ 05/Apr/17 ] |
|
In the tables_not_working log, I don't see any flows/groups for Neutron VMs on either compute node; I only see flows/groups for the transparent port created on the DPN for the flat provider network. It looks like no VMs are spawned. At this stage, group_id=210001 is empty because there are no VMs. What do you see in ODL for group id 210001? Does it exist in the inventory config DS or not? When you add a new VM back into this network, is it only the ELAN Local BC group (210001) that is not updated with a bucket? What do you see in the Wireshark trace? Is it a group_mod or a group-add? What about the other flows for the VM? |
| Comment by Nikolas Hermanns [ 05/Apr/17 ] |
|
This is what I see in ODL, as already written in the comment above: , } Keep in mind that I reproduced the error a few times, and what you see here is not the original occurrence from when the bug was found. |
| Comment by Nikolas Hermanns [ 05/Apr/17 ] |
|
So very shortly before the VM is started I see the following in OVS: 3 computes: in ODL: , , } See the full inventory in the logs. |
| Comment by Nikolas Hermanns [ 05/Apr/17 ] |
|
Attachment full-inventory.zip has been added with description: full-inventory before vm is booted |
| Comment by Periyasamy Palanisamy [ 06/Apr/17 ] |
|
group_id=210002,type=all,bucket=actions=group:210001,bucket=actions=load:0x300->NXM_NX_REG6[],resubmit(,220) It looks like the above groups are created for the ELAN instance of the flat provider network type and don't have any VMs in them. Can we look at a tcpdump to know what kind of group request is sent? |
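To make the "group without buckets" symptom easier to spot across several dumps, a group dump could be scanned for groups that have nothing after their type. The following is only a minimal sketch in Python: the helper names `parse_groups` and `empty_groups` are hypothetical, the sample text combines the group line quoted above with an assumed `type=all` line for the empty group 210001, and the real dump format may contain fields this sketch does not handle.

```python
import re

# Sample text: header and group 210002 are taken from this thread; the
# "type=all" on the empty group 210001 line is an ASSUMPTION for illustration.
SAMPLE = """OFPST_GROUP_DESC reply (OF1.3) (xid=0x2):
 group_id=210001,type=all
 group_id=210002,type=all,bucket=actions=group:210001,bucket=actions=load:0x300->NXM_NX_REG6[],resubmit(,220)
"""


def parse_groups(dump: str) -> dict:
    """Map each group_id to the list of its bucket strings (hypothetical helper)."""
    groups = {}
    for line in dump.splitlines():
        line = line.strip()
        match = re.match(r"group_id=(\d+),type=\w+", line)
        if not match:
            continue  # header line or blank line, not a group entry
        rest = line[match.end():]
        # Every bucket is introduced by ",bucket="; drop the empty leading piece.
        buckets = [part for part in rest.split(",bucket=") if part]
        groups[int(match.group(1))] = buckets
    return groups


def empty_groups(groups: dict) -> list:
    """Return the sorted ids of groups that carry no bucket at all."""
    return sorted(gid for gid, buckets in groups.items() if not buckets)


if __name__ == "__main__":
    print("groups without buckets:", empty_groups(parse_groups(SAMPLE)))
```

Running the same check on a dump taken right after the VM is spawned and on one taken about 2 minutes later would show whether the broadcast group received its buckets in between.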
| Comment by Nikolas Hermanns [ 06/Apr/17 ] |
|
Reproduced again, this time fetching logs and thread dumps while ODL was hanging and afterwards: https://drive.google.com/open?id=0B_Rr7XjF0yoHc2EwT1psVGQwRVE |
| Comment by Periyasamy Palanisamy [ 07/Apr/17 ] |
|
I see there are 6 threads blocked in BgpConfigurationManager while advertising routes to BGP. It looks like there is an issue with establishing the neighbor relationship with the BGP peer. As a result, the thread that invokes BgpConfigurationManager#replay holds bgpconfigmgr's intrinsic lock for a long time, and eventually the other threads trying to advertise routes, etc. are blocked on this lock. We need to address the following.
1. Need to resolve the issue with establishing the BGP neighbor. In karaf.log, I see a lot of errors like: Is there any issue with setting up 6wind Quagga with ODL? Can you look into it?
2. Holding bgpconfigmgr's intrinsic lock for all method invocations is incorrect. It makes the system unusable in this situation. It has to be addressed. Suneelu/Siva, can you have a look? |
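The blocking pattern described here can be illustrated outside ODL. The sketch below is in Python, not the actual Java code: `CoarseManager`, `replay`, `advertise` and `demo` are hypothetical names, the sleep stands in for a stalled neighbor setup, and the bounded wait is shown only to make the contention visible. It is not the change that was proposed for this bug.

```python
import threading
import time


class CoarseManager:
    """Hypothetical stand-in: every method shares one reentrant lock,
    similar to declaring all methods of a Java class synchronized."""

    def __init__(self):
        self._lock = threading.RLock()
        self.advertised = []

    def replay(self, seconds, holding):
        with self._lock:
            holding.set()        # signal that the lock is now held
            time.sleep(seconds)  # stands in for a stalled neighbor setup

    def advertise(self, prefix, timeout):
        # Wait a bounded time for the lock so the contention becomes visible
        # instead of blocking the caller indefinitely.
        if not self._lock.acquire(timeout=timeout):
            return False
        try:
            self.advertised.append(prefix)
            return True
        finally:
            self._lock.release()


def demo():
    mgr = CoarseManager()
    holding = threading.Event()
    worker = threading.Thread(target=mgr.replay, args=(0.5, holding))
    worker.start()
    holding.wait()  # make sure replay owns the lock first
    during = mgr.advertise("10.0.0.0/24", timeout=0.1)
    worker.join()
    after = mgr.advertise("10.0.0.0/24", timeout=0.1)
    return during, after


if __name__ == "__main__":
    print(demo())
```

While the replay thread holds the lock, the advertise call cannot proceed; once the lock is released, the same call succeeds. With an unbounded wait, as with a synchronized method, the caller would simply hang for as long as replay is stuck.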
| Comment by Nikolas Hermanns [ 07/Apr/17 ] |
|
Hey, good finding! Yes, Quagga BGP is not working yet on this machine. This was a bug raised internally for OPNFV. I will now put more effort into that bug. You are right, we need to remove this sync in addVRF. Can we still pull that into SR-3? Br Nikolas |
| Comment by Nikolas Hermanns [ 07/Apr/17 ] |
|
just for more information!
|
| Comment by Nikolas Hermanns [ 07/Apr/17 ] |
|
I checked. It affects the master branch as well, to the same extent. |
| Comment by Periyasamy Palanisamy [ 12/Apr/17 ] |
|
Vyshakh, can you get https://git.opendaylight.org/gerrit/#/c/54578/ merged? It also has to be cherry-picked into stable/boron. |