[NETVIRT-1022] CSIT Sporadic failures - major functionality breakage Created: 21/Nov/17 Updated: 10/Feb/18 Resolved: 10/Feb/18 |
|
| Status: | Verified |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | Carbon-SR3 |
| Fix Version/s: | Carbon-SR3 |
| Type: | Bug | Priority: | Highest |
| Reporter: | Jamo Luhrsen | Assignee: | Arunprakash D |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:exception, csit:failures | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
From time to time we find some jobs that really go south and have lots of failures. In this job [0], the following message was printed over 900 times:

2017-11-21 00:15:51,069 | INFO | pool-43-thread-1 | NaptPacketInHandler | 336 - org.opendaylight.netvirt.natservice-impl - 0.6.0.SNAPSHOT | onPacketReceived : Retry Packet IN Queue Size : 0

When something like that happens, something fundamental is broken in the environment, and the message above is probably just a symptom. Here is the karaf.log [1] for that job [0]. The log also has a lot of NullPointerExceptions; more debugging and analysis is needed to figure out what is really broken. [0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/397/log_connectivity.html.gz |
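As a rough way to quantify the flooding and scope the exception analysis (a sketch; the file name assumes the job's karaf.log has been downloaded locally):

# Count how many times the NAPT retry message was logged.
grep -c "onPacketReceived : Retry Packet IN Queue Size" karaf.log
# Summarize the exception types seen in the same log (NullPointerException included).
grep -oE "java\.lang\.[A-Za-z]*Exception" karaf.log | sort | uniq -c | sort -rn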
| Comments |
| Comment by Hanamantagoud Kandagal [ 28/Nov/17 ] |
|
Hi Chetan / Karthikeyan, please set an appropriate log level for the message that is appearing 900 times; it seems to be flooding the karaf logs. Please analyze the other exceptions as well.
|
| Comment by Jamo Luhrsen [ 28/Nov/17 ] |
|
By the way, I already have a patch up that changes the log level of that message from INFO to DEBUG. |
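For anyone who needs relief before that patch lands, a minimal sketch of achieving the same effect at runtime from the Karaf console (the logger name is an assumption based on the natservice-impl bundle shown in the log line):

# Suppress the flooding message by raising the NAPT packet-in handler's logger above INFO.
log:set WARN org.opendaylight.netvirt.natservice.internal.NaptPacketInHandler
# Restore the default level when done.
log:set DEFAULT org.opendaylight.netvirt.natservice.internal.NaptPacketInHandler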
| Comment by Karthikeyan Krishnan [ 05/Dec/17 ] |
|
Reason for the failed FloatingIP ping and SNAT TCP traffic TCs: because VM instance 2 does not get an IP address, the FloatingIP (DNAT) and SNAT (TCP) traffic tests fail. Regarding SNAT, the packet loops because the ODL controller is not aware of the internal VM IP. In any case the packet loops for at most 4 seconds; if the SNAT session flow is not installed on the OVS switch within those 4 seconds, the packet is dropped by the ODL controller. This logging (whose level was changed from INFO to DEBUG with patch https://git.opendaylight.org/gerrit/#/c/63958/) does not impact any functionality.
|
| Comment by Sam Hague [ 12/Dec/17 ] |
|
With Jamo's https://git.opendaylight.org/gerrit/#/c/63958/ merged we no longer see the repeating log, but now what do we do with this bug? Karthikeyan, are you saying this is related to the issue where we don't get a DHCP address in some subnets, https://jira.opendaylight.org/browse/NETVIRT-1038? That issue is still being worked. |
| Comment by Karthikeyan Krishnan [ 12/Dec/17 ] |
|
We still observe that VM instances 1 and 2 do not get an IP address from DHCP, and the corresponding TC fails. Hence the NAT use cases for those VMs also fail. TC: Check Vm Instances Have Ip Address. Latest CSIT log: |
| Comment by Karthikeyan Krishnan [ 12/Dec/17 ] |
|
This issue depends on the DHCP failure issue. |
| Comment by Karthikeyan Krishnan [ 26/Dec/17 ] |
|
Please refer to the details below for the dump-flows/groups analysis of the passed (VM instance 1) and failed (VM instance 2) cases, based on the packet flow. The analysis shows that if the IP address of VM instance 2 had been assigned properly through DHCP, both SNAT and DNAT traffic would work as expected.

VM instance 2 fixed (local) IP: 41.0.0.9 (the IP was never actually assigned to VM instance 2)
FIP ping request reached VM instance 2, but there was no response from VM instance 2 (FIP response):
group_id=225002,duration=111.080s,ref_count=1,packet_count=0,byte_count=0,bucket0:packet_count=0,byte_count=0

Passed TC: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t14
FIP ping request reached VM instance 1 and there was a response from VM instance 1 (FIP response):
group_id=225002,duration=88.877s,ref_count=1,packet_count=3,byte_count=294,bucket0:packet_count=3,byte_count=294 |
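For reference, the group counters quoted above can be read directly on each compute node (a sketch; br-int is the assumed integration bridge name):

# Show the bucket/actions of the FIP response group and its counters; a packet_count stuck at 0
# on the failed node matches the "no response from VM instance 2" observation.
sudo ovs-ofctl -O OpenFlow13 dump-groups br-int | grep group_id=225002
sudo ovs-ofctl -O OpenFlow13 dump-group-stats br-int | grep group_id=225002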
| Comment by Karthikeyan Krishnan [ 09/Jan/18 ] |
|
As per the configured local.conf and ml2_conf.ini, the CSIT job runs with the OpenStack DHCP service (q-dhcp) for allocating IP addresses to the OpenStack compute VM instances. Also, based on the compute-1 and compute-2 flows, we do not see the DHCP-related flows (table 60 -> punt to controller) that would be installed if ODL DHCP were enabled. So I have requested someone from the DHCP team to look into the OpenStack DHCP service (q-dhcp) for further analysis. Since ODL DHCP plays no role in this scenario, we do not suspect that a flow-related issue is causing the IP allocation problem for the VM instances.

Compute-1 DHCP flow (only the default flow exists):
cookie=0x6800000, duration=2716.022s, table=60, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)
Compute-2 DHCP flow (only the default flow exists):
cookie=0x6800000, duration=2704.584s, table=60, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17) |
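The table 60 check above is easy to repeat on any compute node (a sketch; br-int is the assumed integration bridge):

# Dump only table 60 on the integration bridge. With ODL DHCP enabled there would be a
# higher-priority flow punting DHCP packets to the controller; seeing only the priority=0
# resubmit(,17) default flow confirms the VMs are expected to be served by q-dhcp.
sudo ovs-ofctl -O OpenFlow13 dump-flows br-int table=60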
| Comment by Jamo Luhrsen [ 30/Jan/18 ] |
|
Some discussion is happening over email as well. Do we need to move this bug to the openflowplugin project? I took the liberty of re-assigning this to Arun, since he is the one currently debugging it.
| Comment by Kit Lou [ 30/Jan/18 ] |
|
Do we have an ETA on a resolution? We need to consider unlocking the Carbon branch to allow other updates to come in. |
| Comment by Arunprakash D [ 31/Jan/18 ] |
|
The tunnel port on one of the controller/compute nodes is going missing from the inventory. From the logs we can see that the tunnel port addition was received by openflowplugin and submitted to the inventory, but there is an ERROR while submitting the transaction, which leads to the port never being added to the inventory. The exception shows a flow stats message being added to/removed from the inventory, which is not expected with stats disabled. We are currently analyzing further with a few more logs enabled on the openflowplugin stable/carbon branch, but the review is failing because the snapshot version is unavailable in Nexus: https://git.opendaylight.org/gerrit/#/c/67771/ Also, we are unable to reproduce the issue locally, so we have to depend on CSIT to verify the fix, and it might take some time. |
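One way to confirm the missing-port symptom outside the logs (a sketch; the node id placeholder, credentials and URL are assumptions for a default Carbon setup) is to query the operational inventory over RESTCONF and check whether the tunnel node-connector shows up under the affected openflow node:

# Look for the tunnel port (tun* interface) under the affected switch in the operational inventory.
# If the write transaction failed, the corresponding node-connector will be absent from the output.
curl -s -u admin:admin \
  "http://localhost:8181/restconf/operational/opendaylight-inventory:nodes/node/openflow:<dpid>" \
  | python -m json.tool | grep -i tun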
| Comment by Jamo Luhrsen [ 01/Feb/18 ] |
|
Arunprakash I used the distro created from your debug patch to run the job with TRACE logging and packet captures. Here are those logs: |
| Comment by Arunprakash D [ 01/Feb/18 ] |
|
Found the root cause and raised a new patch [1] to address it. CSIT output with the patch is at [2]. jluhrsen, please verify the CSIT results with patch [1]. klou, based on confirmation from Jamo we can merge patch [1]. [1] https://git.opendaylight.org/gerrit/#/c/67801/ |
| Comment by Jamo Luhrsen [ 01/Feb/18 ] |
|
Awesome, Arunprakash. I'll be running the distro from your 67801 patch in the sandbox to get more runs in and build our confidence in it.
|
| Comment by Sam Hague [ 01/Feb/18 ] |
|
Arun, good stuff! Job [1] looked good, with just one Tempest failure. Fired off [2] and [3] to get more coverage, but it's looking good. [2] https://jenkins.opendaylight.org/releng/job/openflowplugin-patch-test-netvirt-carbon/48/
|
| Comment by Sam Hague [ 04/Feb/18 ] |
|
There are still tons of table 50 flow exceptions in the latest CSIT. But it looks like the merge of https://git.opendaylight.org/gerrit/#/c/67801/ did not happen, so we need to get that merged. |
| Comment by Sam Hague [ 07/Feb/18 ] |
|
Hi Sam,

The logic that deletes a flow from the operational inventory on a flow-removed message doesn't exist in Nitrogen and Oxygen, so we don't need this review in Nitrogen and Oxygen.

Regards,
Arun

From: Sam Hague <shague@redhat.com>
Arun, Anil, the patch below was merged to carbon. Are we good that it is not needed on Nitrogen or Oxygen? Thanks, Sam |
| Comment by Kit Lou [ 09/Feb/18 ] |
|
Can we mark this issue as resolved now? Thanks! |