[NETVIRT-788] CSIT Sporadic failures - ElanService suite - openstack instances failing to get dhcp address for more than 3m Created: 17/Jul/17  Updated: 30/Oct/17  Resolved: 09/Aug/17

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Faseela K
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File odl1_karaf.log.gz    
External issue ID: 8859

 Description   

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/336/log.html.gz#s1-s6-s1-k1-k2



 Comments   
Comment by Jamo Luhrsen [ 17/Jul/17 ]

some discussion here:
https://lists.opendaylight.org/pipermail/netvirt-dev/2017-July/005001.html

but the gist of this bug is that it comes up periodically, and when we hit
it, the openstack instances that are created don't get their dhcp ip address
even after 3m of polling [0], which is too long. Even in a passing case example [1] it
still takes over 2m, which may be considered too long as well. Potentially we
have some performance issue in programming flows in time?

[0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/336/log.html.gz#s1-s6-s1-k1-k2-k3-k12
[1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-learn-carbon/67/log.html.gz#s1-s6-s1-k1-k2-k3-k12
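
For context, a minimal sketch (plain Java with hypothetical names; the real check lives in the Robot CSIT suite) of the poll-with-deadline pattern the test performs: keep re-reading the instance console log until a DHCP lease shows up or the time budget runs out. The success-message check is an assumption; the failure string is the one from the cirros console snippet later in this ticket.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class DhcpLeasePoller {

    // Poll until the console log shows a lease, a failure, or the deadline passes.
    static boolean waitForLease(Supplier<String> consoleLog, Duration timeout, Duration interval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            String log = consoleLog.get();
            if (log.contains("No lease, failing")) {
                return false;                          // udhcpc gave up
            }
            if (log.toLowerCase().contains("lease of")) {
                return true;                           // assumed success-message format
            }
            Thread.sleep(interval.toMillis());
        }
        return false;                                  // deadline hit without a lease
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical console-log source; in CSIT this would come from the nova console log.
        Supplier<String> console = () -> "Sending discover...\nNo lease, failing\n";
        System.out.println(waitForLease(console, Duration.ofMinutes(3), Duration.ofSeconds(10)));
    }
}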

Comment by Vishal Thapar [ 18/Jul/17 ]

Details of the analysis are captured in [2]; this looks like an OpenStack issue.

tl;dr version:
The test fails because 'nameserver' is not configured, and that comes from the Metadata service. VMs get an IP from DHCP, but not ssh keys, nameserver, etc. The Metadata proxy is not sending information about the VMs.

[2] https://lists.opendaylight.org/pipermail/netvirt-dev/2017-July/005001.html
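
For context, a minimal sketch (plain Java, not OpenStack or cirros code) of the metadata fetch a booting instance performs: it GETs the well-known link-local endpoint shown in the console snippet below, and if the metadata proxy does not answer, the nameserver/ssh-key configuration described above never arrives.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetadataCheck {

    // EC2-style metadata endpoint queried by cirros/cloud-init on boot.
    private static final String INSTANCE_ID_URL = "http://169.254.169.254/2009-04-04/instance-id";

    // Returns the instance-id, or null if the metadata service is not answering.
    static String fetchInstanceId(int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(INSTANCE_ID_URL).openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            if (conn.getResponseCode() != 200) {
                return null;                 // proxy reachable, but not serving this VM
            }
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8).trim();
            }
        } catch (IOException e) {
            return null;                     // no route or timeout: the "request failed" case
        }
    }

    public static void main(String[] args) {
        // Only meaningful when run from inside the instance itself.
        System.out.println(fetchInstanceId(5000));
    }
}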

Comment by Jamo Luhrsen [ 18/Jul/17 ]

for reference, this is an email exchange with some debugging info:
-------------------------------------------------------------------

Vishal,

I might be confused, but I think we had two issues last week, and the Elan Suite
issue is not the metadata issue where we got the dhcp lease but not the nameserver.
I think that was just a nitro issue in a gate job. Again, there is a high chance that I'm
just confused.

but check this link [i]. That's the console log of a VM in the elan suite after
3m of polling for its dhcp lease. It looks like dhcp just failed.

snippet:

Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending discover...
Sending discover...
Usage: /sbin/cirros-dhcpc <up|down>
No lease, failing
WARN: /etc/rc3.d/S40-network failed
cirros-ds 'net' up at 187.58
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 187.79. request failed

Also, I spent some time trying to look for clues in the various logs. I didn't
find anything in the q-* logs, which include the metadata log. The karaf log
has some interesting things. I'm comparing with a job that passed this Elan
Suite Setup (#335, whereas [i] is #337). The failing case has a lot of these
retry and 'timed out' logs, seemingly around ACL rule programming. Stuff like
this:

LockManager | 318 - org.opendaylight.genius.lockmanager-impl - 0.2.2.SNAPSHOT | Waiting for the lock acl.flow.priorities.pool.154761518883256.243.PERMITTCP_DESTINATION_1_0 is timed out. retrying again

LockManager | 318 - org.opendaylight.genius.lockmanager-impl - 0.2.2.SNAPSHOT | Already locked for acl.flow.priorities.pool.154761518883256.243.PERMITTCP_DESTINATION_1_0 after waiting 1000ms, try 12

That general pattern happens a few times once the Elan suite starts, and I didn't
see it in the passing case.
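
For what it's worth, a minimal model (plain java.util.concurrent, not the actual genius LockManager implementation) of the wait-1000ms-and-retry behavior those log lines describe, using the same lock key, just to illustrate how a pile-up on one key stalls everything queued behind it:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class NamedLockRetryModel {

    private static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();
    private static final long WAIT_MILLIS = 1000;   // matches the "after waiting 1000ms" in the log
    private static final int MAX_RETRIES = 30;      // illustrative cap, not the real setting

    // Try to take the named lock, logging a retry each time the 1s wait times out.
    static boolean acquire(String lockName) throws InterruptedException {
        ReentrantLock lock = LOCKS.computeIfAbsent(lockName, k -> new ReentrantLock());
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            if (lock.tryLock(WAIT_MILLIS, TimeUnit.MILLISECONDS)) {
                return true;
            }
            // The state the karaf log shows: still held elsewhere, so log and loop.
            System.out.printf("Waiting for the lock %s is timed out. retrying again (try %d)%n",
                    lockName, attempt);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        acquire("acl.flow.priorities.pool.154761518883256.243.PERMITTCP_DESTINATION_1_0");
    }
}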

Also, I saw a bunch of the mdsalutil-api threads getting terminated in a row due to
some NPE, which was not there in the passing case. example:

ERROR | eChangeHandler-1 | AsyncDataTreeChangeListenerBase | 292 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Thread terminated due to uncaught exception: AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-1
java.lang.NullPointerException
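
As an aside, a small model (not the actual mdsalutil-api AsyncDataTreeChangeListenerBase code) of that failure mode: if the change-handler task lets a NullPointerException escape, the handler thread dies with exactly that kind of "terminated due to uncaught exception" message, whereas a defensive catch keeps the queue draining.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ChangeHandlerThreadModel {

    // Stand-in for a data-tree change event; a null path is what triggers the NPE.
    record Change(String path) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Change> queue = new LinkedBlockingQueue<>();

        Thread handler = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Change change = queue.take();
                    // Unguarded dereference: without the catch below, a null path
                    // would end this thread just like the karaf log line above.
                    System.out.println("handling " + change.path().toUpperCase());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (RuntimeException e) {
                    // Defensive guard: log and keep the handler thread alive.
                    System.err.println("change dropped: " + e);
                }
            }
        }, "DataTreeChangeHandler-1");

        handler.setUncaughtExceptionHandler((t, e) ->
                System.err.println("Thread terminated due to uncaught exception: " + t.getName() + " " + e));
        handler.start();

        queue.put(new Change(null));   // hits the NPE path, but is caught
        queue.put(new Change("/ok"));  // still processed because the thread survived
        Thread.sleep(200);
        handler.interrupt();
    }
}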

Having said that, I want to point out that the passing case has its own set
(a lot) of exceptions and ERRORs occurring during this Elan Suite Setup,
and since it passes, that makes me doubt that the errors listed above are very
helpful.

I've pasted this thread into the BZ in case others are following there.

Thanks,
JamO

[i] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/337/log.html.gz#s1-s6-s1-k1-k2-k3-k12-k1-k2-k2-k15

On 07/17/2017 10:47 PM, Vishal Thapar wrote:
> Will take a look at it and see if it is perf degradation. I updated NETVIRT-788 with analysis.
>
> 8859 is an OpenStack issue as of now, the metadata service to be more specific. So I would appreciate help from folks more familiar with it, as I am clueless when it comes to the metadata service.
>
> Regards,
> Vishal.
>
> ----Original Message----
> From: Jamo Luhrsen jluhrsen@gmail.com
> Sent: 18 July 2017 11:04
> To: Vishal Thapar <vishal.thapar@ericsson.com>; netvirt-dev@lists.opendaylight.org
> Subject: Re: [netvirt-dev] netvirt daily 1node CSIT report (7/17)
>
> Vishal,
>
> every job has this [a], which shows the runtime for the full job. Also, we can drill down to each test case to see its duration, like this one [b] that checks that VMs get IP addresses in our L2 connectivity suite. That's interesting, actually, as it seems that around build 814 (5 days ago) that test case had an increase in run time from approx 1:20 to 1:50 (a 30s increase). Not sure what to make of that...
>
> Thanks,
> JamO
>
> [a]
> https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/netvirt%20csit/job/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-stateful-carbon/buildTimeTrend
> [b]
> https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/netvirt%20csit/job/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-stateful-carbon/839/robot/netvirt-1node-openstack.txt/Connectivity/01%20L2%20Tests/Check%20Vm%20Instances%20Have%20Ip%20Address/durationGraph?maxBuildsToShow=0&hd=true
>
> On 07/17/2017 10:11 PM, Vishal Thapar wrote:
>> Jamo will do.
>>
>> One question: we recently had a few changes to fix CSIT exceptions. I expected a knock-on impact where things move a bit faster, but couldn't figure out how to check. Is it possible to have a dashboard/graph for test runtimes, similar to the fail/success one?
>>
>> Regards,
>> Vishal.
>>
>> ----Original Message----
>> From: netvirt-dev-bounces@lists.opendaylight.org
>> netvirt-dev-bounces@lists.opendaylight.org On Behalf Of Jamo
>> Luhrsen
>> Sent: 17 July 2017 23:41
>> To: netvirt-dev@lists.opendaylight.org
>> Subject: [netvirt-dev] netvirt daily 1node CSIT report (7/17)
>>
>> 75% of our jobs passed 100% in their last run, which is as good as I've seen it in a long time.
>>
>> for today:
>>
>> opened 8859
>> re-opened 7885
>>
>>
>> - NETVIRT-506 - test_port_security_macspoofing_port
>> [0]
>>
>> - NETVIRT-520 - Flow(s) missing in VPNService suite on compute node
>> [0]
>>
>> - NETVIRT-788 - ElanService suite - openstack instances failing to get dhcp address for more than 3m
>> [1]
>> * for this one, it's not happening every time, obviously, but I am wondering if by fixing
>> it we'll also be fixing other random failures. Even in the passing case it's taking
>> 2+ minutes to get IPs, which is irritatingly long (as a user). Maybe something is
>> bogged down internally, and if we figure that out other random CSIT issues could go
>> away.
>> @Vishal, I know you did more debugging on this last week. I put a link to one email
>> thread in the bug. Can you update the bug with any other details you know?
>>
>> - NETVIRT-780 - v6 instance connectivity failures in vpnservice suite
>> * all learn jobs
>>
>> Thanks,
>> JamO
>>
>> [0]
>> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-ope
>> nstack-newton-nodl-v2-upstream-learn-carbon/333/log.html.gz
>> [1]
>> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-ope
>> nstack-newton-upstream-learn-carbon/336/log.html.gz
>> _______________________________________________
>> netvirt-dev mailing list
>> netvirt-dev@lists.opendaylight.org
>> https://lists.opendaylight.org/mailman/listinfo/netvirt-dev
>>

Comment by Jamo Luhrsen [ 23/Jul/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-carbon/467/log.html.gz#s1-s6

Comment by Jamo Luhrsen [ 24/Jul/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/342/log.html.gz#s1-s6-s1-k1-k2

Comment by Jamo Luhrsen [ 26/Jul/17 ]

here is a new example now that we have extra DEBUG logs introduced and
enabled:

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-carbon/76/log.html.gz#s1-s6

attached is the karaf.log from the start of the ELAN suite, but the above
logs link can be used to get to the full karaf.log if needed

Comment by Jamo Luhrsen [ 26/Jul/17 ]

Attachment odl1_karaf.log.gz has been added with description: karaf log from the start of the Elan suite

Comment by Jamo Luhrsen [ 26/Jul/17 ]

(In reply to Jamo Luhrsen from comment #6)
> here is a new example now that we have extra DEBUG logs introduced and
> enabled:
>
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-
> ocata-upstream-stateful-carbon/76/log.html.gz#s1-s6
>
> attached is the karaf.log from the start of the ELAN suite, but the above
> logs link can be used to get to the full karaf.log if needed

Ignore this please. The Elan Suite failed because of a mistake I had
in the code to disable DEBUG logging. I have a patch to fix that.

https://git.opendaylight.org/gerrit/#/c/60794/

Comment by Jamo Luhrsen [ 28/Jul/17 ]

I think we now have an example to look at after our better debugs have
been enabled in CSIT and the controller.

robot log:

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-snat-conntrack-carbon/125/log.html.gz#s1-s6-s1-k1-k2

karaf log:

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-snat-conntrack-carbon/125/odl1_karaf.log.gz

You can see a lot of those WARNing messages about waiting for locks
and retrying after the Elan Suite starts. If you grep for ROBOT|DEBUG in
the karaf log you will see a bunch of flow removal lines, but only a
single install message in the timeframe of the Elan Suite. I am still
wondering if we are stuck (locked and retrying) so nothing gets set up.

Comment by Jamo Luhrsen [ 03/Aug/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-carbon/478/log.html.gz

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-snat-conntrack-carbon/130/log.html.gz

Comment by Faseela K [ 07/Aug/17 ]

Fixing wrong table 200 flows with the below patch:

https://git.opendaylight.org/gerrit/#/c/61197/6

Comment by Faseela K [ 07/Aug/17 ]
  • Correction: table 220 (not table 200)

Comment by Sam Hague [ 08/Aug/17 ]

There are multiple aspects to this. It can be any of the following, and it might not always be due to the same issue, but let me know if any of the following symptoms are seen:

1. Any issues in Table 17 or Table 220 programming. For example, last time when Manu debugged, the flows in Table 220 were something like below:
[1]cookie=0x6900000, duration=127.462s, table=220, n_packets=28, n_bytes=2096, priority=6,reg6=0x8200 actions=load:0x70008200->NXM_NX_REG6[],write_metadata:0x138f000000/0xfffffffffe,goto_table:241
[2]cookie=0x8000007, duration=127.523s, table=220, n_packets=0, n_bytes=0, priority=9,reg6=0x90008200 actions=output:42
[3]cookie=0x6900000, duration=126.910s, table=241, n_packets=1, n_bytes=370, priority=63010,udp,reg6=0x8200/0xfffff00,tp_src=67,tp_dst=68 actions=resubmit(,220)
For the same interface-tag -> 8200, the higher priority service is loading service-index=7, but the next priority service is matching on service-index=9, which will never match (see the sketch at the end of this comment). Flows in Table 17 and Table 220 work on this service-binding logic: for the same interface-tag,
there will be “n” entries in the tables, where n = number of services bound on the interface. The highest priority service will set the next priority in the service-index before sending the packet to the service pipeline, so that on a resubmit the packet matches the next service-index in Table 17/220.

2. ConflictingModificationException for urn:opendaylight:inventory?revision=2013-08-19)nodes/node/node, where the flow-id has the “DPNID:TABLE-ID:INTERFACE-NAME:SERVICE-INDEX” format and TABLE-ID = 17 or 220

3. ConflictingModificationException for urn:opendaylight:genius:bound-services-state
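
To make the mismatch in symptom 1 concrete, here is a small sketch (with an assumed REG6 bit layout inferred from the flow dump above, service index in the top nibble; not the actual genius code) of why flow [2] can never match once flow [1] loads service-index 7:

public class ServiceBindingReg6Check {

    // Assumed layout, inferred from the dump: service index in the top nibble of REG6,
    // the interface-tag bits (0x8200 here) below it.
    static long reg6(int serviceIndex, long interfaceTagBits) {
        return ((long) serviceIndex << 28) | interfaceTagBits;
    }

    public static void main(String[] args) {
        long interfaceTagBits = 0x8200L;

        long loaded = reg6(7, interfaceTagBits);                  // 0x70008200, what flow [1] loads
        long expectedByNextService = reg6(9, interfaceTagBits);   // 0x90008200, what flow [2] matches

        System.out.printf("loaded=0x%X expected=0x%X match=%b%n",
                loaded, expectedByNextService, loaded == expectedByNextService);
        // match=false: the dispatcher loaded service-index 7, but the next bound
        // service only matches service-index 9, so the packet never reaches output:42.
    }
}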

Comment by Jamo Luhrsen [ 08/Aug/17 ]

(In reply to Sam Hague from comment #14)

The problem is that currently we only see the symptom of this bug in automation.
The only check we have to catch this issue, or similar issues, is higher up at the
level where VMs don't get IPs. Ideally, we would add checks for all 3 items above in
our CSIT code, but I'm not sure who can take that on. There are still many other
pending priorities for CSIT work.
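
In the meantime, a rough sketch (a hypothetical helper, not existing CSIT code) of what a post-run triage check could look like, scanning a downloaded odl1_karaf.log for signatures of the three symptoms listed above; the patterns are heuristics built from the log lines and formats quoted in this ticket:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class KarafSymptomScan {

    // Heuristic signatures taken from the symptoms discussed in this ticket.
    private static final Map<String, Pattern> SYMPTOMS = Map.of(
            "lock retries on acl.flow.priorities pool",
            Pattern.compile("Waiting for the lock acl\\.flow\\.priorities\\.pool.*timed out"),
            "ConflictingModification on a table 17/220 flow-id",
            Pattern.compile("ConflictingModification.*inventory.*:(17|220):"),
            "ConflictingModification on bound-services-state",
            Pattern.compile("ConflictingModification.*bound-services-state"));

    public static void main(String[] args) throws IOException {
        // Usage: java KarafSymptomScan odl1_karaf.log
        List<String> lines = Files.readAllLines(Path.of(args[0]));
        SYMPTOMS.forEach((name, pattern) -> {
            long hits = lines.stream().filter(line -> pattern.matcher(line).find()).count();
            System.out.printf("%-50s %d hit(s)%n", name, hits);
        });
    }
}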

Comment by Faseela K [ 09/Aug/17 ]

No need to add any more tests. If a test fails and, on further triaging, any of the above symptoms are seen, assign it directly to me.
