[NETVIRT-1022] CSIT Sporadic failures - major functionality breakage Created: 21/Nov/17  Updated: 10/Feb/18  Resolved: 10/Feb/18

Status: Verified
Project: netvirt
Component/s: General
Affects Version/s: Carbon-SR3
Fix Version/s: Carbon-SR3

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Arunprakash D
Resolution: Done Votes: 0
Labels: csit:exception, csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to NETVIRT-1038 vms fail to get dhcp address Resolved

 Description   

From time to time we find some jobs that really go south and have lots of failures. When
debugging and looking at the karaf.log there are many red flags, but one is the repeat
logging of this message:

2017-11-21 00:15:51,069 | INFO | pool-43-thread-1 | NaptPacketInHandler | 336 - org.opendaylight.netvirt.natservice-impl - 0.6.0.SNAPSHOT | onPacketReceived : Retry Packet IN Queue Size : 0

in this job [0] that message was printed over 900 times

something fundamental is broken in the environment and maybe the above log
message is benign. It does not appear very much in the jobs that are passing and
healthy.

here is the karaf.log [1] for that job [0]. The log also has a lot of NullPointerExceptions,
TransactionCommitFailedException, OptimisticLockFailedException, and UnsupportedOperationException.

more debugging and analysis is needed to figure out what is really broken.

[0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/397/log_connectivity.html.gz
[1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/397/odl_1/odl1_karaf.log.gz



 Comments   
Comment by Hanamantagoud Kandagal [ 28/Nov/17 ]

Hi Chetan / Karthikeyan ,

Please put the appropriate log level for the one which is appearing 900 times. It seems to be flooding the karaf logs.

Please analyze other exceptions as well.

 

Comment by Jamo Luhrsen [ 28/Nov/17 ]

btw, I already have this patch which changes debug level from INFO to DEBUG

Comment by Karthikeyan Krishnan [ 05/Dec/17 ]

Reason for failed TCs of FloatingIP Ping and SNAT TCP traffic:

================================================

Due to VM instance-2 is not getting IP address FloatingIP (DNAT) and SNAT (TCP) traffics are getting failed.

Regarding SNAT packet is getting looped because of internal VM IP is not aware of the ODL-Controller. Any way packet will be getting looped by maximum for 4 Secs (Within 4 Secs if the SNAT session flow is not getting installed on the OVS switch, packet will be dropped by ODL-Controller). This logging(which is changed from DEBUG to INFO with patch - https://git.opendaylight.org/gerrit/#/c/63958/) will not impact any functionality.

 

Comment by Sam Hague [ 12/Dec/17 ]

With Jamo's https://git.opendaylight.org/gerrit/#/c/63958/ merged, we don't see the repeating log, but now what do we do with this bug?

Karthikeyan, are you saying the issue is related to the issue where we don't get dhcp address in some subnets: https://jira.opendaylight.org/browse/NETVIRT-1038? That issue is still being worked.

Comment by Karthikeyan Krishnan [ 12/Dec/17 ]

Still we have been observed VM Instance 1 and 2 didn't get IP address due to DHCP and the corresponding TC also been failed. Hence corresponding VMs NAT use cases also been failed.

TC: Check Vm Instances Have Ip Address
--------------------------------------------------------
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t6

Latest CSIT Log:
------------------------
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/

Comment by Karthikeyan Krishnan [ 12/Dec/17 ]

This issue is dependent on DHCP failure issue.

https://jira.opendaylight.org/browse/NETVIRT-1038

Comment by Karthikeyan Krishnan [ 26/Dec/17 ]

NETVIRT-1022 Analysis:
===================
Have analysed VM Instance 2 failure use cases and observed that VM instance 2 did not responded to any SNAT/DNAT request because of missing IP address of the VM due to DHCP failure known issue (https://jira.opendaylight.org/browse/NETVIRT-1038)

Please refer the below details for passed (VM Instance 1) and failed (VM Instance 2) dump-flows/groups analysis based on the packet flow. Based on the below analysis if VM instance 2 IP address has been assigned properly through DHCP means both SNAT and DNAT traffic should work as expected.

Failed TC: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t15

VM Instance 2 Fixed(local) IP: 41.0.0.9 (Actually IP is not been assigned to VM Instance 2)
VM Instance 2 Floating-IP: 10.10.10.13

FIP Ping Request has reached to VM Instance 2
--------------------------------------------------------------------
cookie=0x8000003, duration=55.458s, table=21, n_packets=3, n_bytes=294, priority=42,ip,metadata=0x30d46/0xfffffe,nw_dst=10.10.10.13 actions=set_field:fa:16:3e:94:58:18->eth_dst,goto_table:25
cookie=0x8000004, duration=55.443s, table=25, n_packets=3, n_bytes=294, priority=10,ip,dl_dst=fa:16:3e:94:58:18,nw_dst=10.10.10.13 actions=set_field:41.0.0.9->ip_dst,write_metadata:0x30d54/0xfffffe,goto_table:27
cookie=0x8000004, duration=55.443s, table=27, n_packets=3, n_bytes=294, priority=10,ip,metadata=0x30d54/0xfffffe,nw_dst=41.0.0.9 actions=resubmit(,21)
cookie=0x8000003, duration=92.427s, table=21, n_packets=3, n_bytes=294, priority=42,ip,metadata=0x30d54/0xfffffe,nw_dst=41.0.0.9 actions=group:150002

No Response from VM Instance 2: (FIP Response)
-----------------------------------------------------------------------
cookie=0x8000004, duration=55.443s, table=26, n_packets=0, n_bytes=0, priority=10,ip,metadata=0x30d54/0xfffffe,nw_src=41.0.0.9 actions=set_field:10.10.10.13->ip_src,write_metadata:0x30d46/0xfffffe,goto_table:28
cookie=0x8000004, duration=55.443s, table=28, n_packets=0, n_bytes=0, priority=10,ip,metadata=0x30d46/0xfffffe,nw_src=10.10.10.13 actions=set_field:fa:16:3e:94:58:18->eth_src,resubmit(,21)
cookie=0x8000004, duration=109.905s, table=21, n_packets=0, n_bytes=0, priority=10,ip,metadata=0x30d46/0xfffffe actions=group:225002

group_id=225002,duration=111.080s,ref_count=1,packet_count=0,byte_count=0,bucket0:packet_count=0,byte_count=0

Passed TC: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t14
VM Instance 1 Fixed(local) IP: 41.0.0.6
VM Instance 1 Floating-IP: 10.10.10.5

FIP Ping Request has reached to VM Instance 1
----------------------------------------------------------------------
cookie=0x8000003, duration=38.717s, table=21, n_packets=3, n_bytes=294, priority=42,ip,metadata=0x30d46/0xfffffe,nw_dst=10.10.10.5 actions=set_field:fa:16:3e:e9:c1:4e->eth_dst,goto_table:25
cookie=0x8000004, duration=38.704s, table=25, n_packets=3, n_bytes=294, priority=10,ip,dl_dst=fa:16:3e:e9:c1:4e,nw_dst=10.10.10.5 actions=set_field:41.0.0.6->ip_dst,write_metadata:0x30d54/0xfffffe,goto_table:27
cookie=0x8000004, duration=38.704s, table=27, n_packets=3, n_bytes=294, priority=10,ip,metadata=0x30d54/0xfffffe,nw_dst=41.0.0.6 actions=resubmit(,21)
cookie=0x8000003, duration=70.233s, table=21, n_packets=3, n_bytes=294, priority=42,ip,metadata=0x30d54/0xfffffe,nw_dst=41.0.0.6 actions=group:150007

Response from VM Instance 1: (FIP Response)
---------------------------------------------------------------
cookie=0x8000004, duration=38.704s, table=26, n_packets=3, n_bytes=294, priority=10,ip,metadata=0x30d54/0xfffffe,nw_src=41.0.0.6 actions=set_field:10.10.10.5->ip_src,write_metadata:0x30d46/0xfffffe,goto_table:28
cookie=0x8000004, duration=38.704s, table=28, n_packets=3, n_bytes=294, priority=10,ip,metadata=0x30d46/0xfffffe,nw_src=10.10.10.5 actions=set_field:fa:16:3e:e9:c1:4e->eth_src,resubmit(,21)
cookie=0x8000004, duration=87.466s, table=21, n_packets=3, n_bytes=294, priority=10,ip,metadata=0x30d46/0xfffffe actions=group:225002

group_id=225002,duration=88.877s,ref_count=1,packet_count=3,byte_count=294,bucket0:packet_count=3,byte_count=294

Comment by Karthikeyan Krishnan [ 09/Jan/18 ]

As per the local.conf and ml2.conf.ini configured information, CSIT job is running with Openstack DHCP service (q-dhcp) for allocating IP address to openstack compute VM instances.

Also based on the compute-1 and compute-2 flows, we have not seen DHCP related flows (table 60 -> Punt to Controller) been installed (If ODL DHCP is enabled). So, i requested someone from DHCP to have a look into openstack DHCP service (q-dhcp) issue for further analysis. Since there is no role of ODL-DHCP in this problem scenario, we don’t suspect any flow related issue is causing the problem for allocating IP address to the VM instances.

Compute-1 DHCP flow: (Only default flow is existing)

cookie=0x6800000, duration=2716.022s, table=60, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)

Ref link: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t15-k3-k1-k2-k1-k15-k4

Compute-2 DHCP flow: (Only default flow is existing)

cookie=0x6800000, duration=2704.584s, table=60, n_packets=0, n_bytes=0, priority=0 actions=resubmit(,17)

Ref link: https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-oxygen/524/log_03_external_network.html.gz#s1-t15-k3-k1-k3-k1-k15-k4

Comment by Jamo Luhrsen [ 30/Jan/18 ]

some discussion is happening over email as well.

Do we need to move this bug to the openflowplugin project?

I took the liberty to re-assign this to Arun, as that is who is currently debugging
things.

Comment by Kit Lou [ 30/Jan/18 ]

Do we have ETA on resolution?  We need to consider unlocking the Carbon branch to allow other updates to come in.

Comment by Arunprakash D [ 31/Jan/18 ]

 

The tunnel port on one of the controller/compute node is getting missed in the inventory.

 From the logs, we can see the tunnel port addition was received at openflowplugin and the same has been submitted to inventory  by openflowplugin, but however there is an ERROR while submitting the transaction.

This leads to the port being missed to add into the inventory. 

The exception shows the flow stats message is being added/removed from the inventory, but with stats disabled it is not expected to happen.

Currently analysis further with few more logs enabled in openflowplugin stable/carbon branch. But the review is failing for unavailability of snapshot version in nexus.

https://git.opendaylight.org/gerrit/#/c/67771/

Also, we are unable to reproduce the issue locally and has to be dependent on CSIT to fix the issue and it might take sometime.

Comment by Jamo Luhrsen [ 01/Feb/18 ]

Arunprakash I used the distro created from your debug patch to run the job with

TRACE logging and packet captures. Here are those logs:

  https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/8/jamo-netvirt-csit-1node-openstack-ocata-upstream-stateful-carbon/16/

Comment by Arunprakash D [ 01/Feb/18 ]

Found out the root cause and raised new patch to address the same.

[1]  https://git.opendaylight.org/gerrit/#/c/67801/

CSIT output [2] with the patch [1]

[2] https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-openstack-ocata-gate-stateful-carbon/266/console

jluhrsen, please verify the csit results with patch [1].

klou, based on the confirmation from Jamo, we can merge the patch [1].

Comment by Jamo Luhrsen [ 01/Feb/18 ]

Awesome Arunprakash

I'll be running the distro from your 67801 patch in the sandbox to get more runs in to build our confidence in
the fix.

 

Comment by Sam Hague [ 01/Feb/18 ]

Arun, good stuff!

Job [1] looked good, one tempest failure. Fired off [2] and [3] to get more coverage. But looking good.

[1] https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-openstack-ocata-gate-stateful-carbon/266

[2] https://jenkins.opendaylight.org/releng/job/openflowplugin-patch-test-netvirt-carbon/48/

[3] https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-openstack-ocata-gate-stateful-carbon/267

 

 

Comment by Sam Hague [ 04/Feb/18 ]

There are still tons of table 50 flow exceptions is latest csit:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-ocata-upstream-stateful-carbon/316/odl_1/odl1_exception.log.gz

But looks like the merge for https://git.opendaylight.org/gerrit/#/c/67801/ did not happen, so need to get that merge in.

Comment by Sam Hague [ 07/Feb/18 ]

Hi Sam,

Deletion logic for flow from operational inventory on flow removed message doesn’t exist in nitrogen and oxygen, so we don’t need this review in nitrogen and oxygen.

Regards,

Arun

From: Sam Hague mailto:[shague@redhat.com 
Sent: Wednesday, February 07, 2018 11:02 PM
To: D Arunprakash <d.arunprakash@ericsson.com>; Anil Vishnoi <vishnoianil@gmail.com>; Jamo Luhrsen <jluhrsen@redhat.com>; openflowplugin-dev <openflowplugin-dev@lists.opendaylight.org>; odl netvirt dev <netvirt-dev@lists.opendaylight.org>
Subject: https://git.opendaylight.org/gerrit/#/c/67801/ - why is it not a problem in nitrogen and oxygen?

Arun, Anil,

the patch below was merged to carbon. Are we good that it is not needed on nitrogen or oxygen?

Thanks, Sam

Comment by Kit Lou [ 09/Feb/18 ]

Can we mark this issue as resolved now?  Thanks!

Generated at Wed Feb 07 20:23:02 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.