|
Possible duplicate of https://bugs.opendaylight.org/show_bug.cgi?id=7854
https://git.opendaylight.org/gerrit/#/c/52175/ is already merged at the time of this run.
https://git.opendaylight.org/gerrit/#/c/52277/ isn't merged yet.
|
|
just to update that this is still happening:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-carbon/407/archives/log.html.gz
|
|
(In reply to Jamo Luhrsen from comment #2)
> just to update that this is still happening:
>
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-carbon/407/archives/log.html.gz
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-stateful-carbon/173/archives/log.html.gz
|
|
(In reply to Jamo Luhrsen from comment #3)
> (In reply to Jamo Luhrsen from comment #2)
> > just to update that this is still happening:
> >
> > https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-carbon/407/archives/log.html.gz
>
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-stateful-carbon/173/archives/log.html.gz
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-stateful-carbon/197/archives/log.html.gz#s1-s4-t5
|
|
Hi Jamo / Alon,
Is this still happening after the fix from Faseela for NETVIRT-495 here?
Vivek
|
|
(In reply to Vivekanandan Narasimhan from comment #5)
> Hi Jamo / Alon,
>
> Is this still happening after the fix from Faseela for NETVIRT-495 here?
>
> Vivek
Yeah, this is still happening. Recent failure:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-learn-carbon/163/archives/log.html.gz#s1-s1-s1-t7-k10
|
|
In the TC (Check Vm Instances Have Ip Address (non-critical)), I don't see the ODL DHCP agent being used for assigning IP addresses; it is done from an L2 VM.
Also, the flows/groups (Ingress, Egress, ACL and ELAN) related to each VM are programmed correctly, within about 1 sec (judging from the duration property in the flows/groups), which should be okay.
But we need to ensure the DHCP VM is brought up first, followed by the other VMs. Can you confirm how this is being done?
|
|
Jamo, Peri,
In the latest link Jamo has posted, I think the "DHCP failure" is a false alarm.
All VMs got IP addresses. The reason the test failed seems to be that NET2_DHCP_IP is appended to the NET2_VM_IPS list, and for some reason it is "None". The test then fails because it tries to ping the DHCP server's "None" IP.
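A minimal Python sketch of the failure mode being described, as I read it - the variable names mirror the Robot variables, but the rest is an illustrative reconstruction, not the suite's real code:

# Hypothetical reconstruction of the test logic, not the actual Robot suite.
net2_vm_ips = ["20.0.0.3", "20.0.0.4"]  # the VMs did get their addresses
net2_dhcp_ip = None                     # the DHCP port IP lookup came back empty

# The suite effectively appends the DHCP IP to the VM IP list before pinging:
ping_targets = net2_vm_ips + [net2_dhcp_ip]

for ip in ping_targets:
    # Pinging "None" can never succeed, so the TC fails even though every
    # VM obtained a lease - a test-collection bug, not a DHCP failure.
    assert ip is not None, "DHCP server IP was never collected"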
|
|
test patch in progress to resolve this bug:
https://git.opendaylight.org/gerrit/#/c/53317/
|
|
The test patch to do better when collecting the instance IP and DHCP nameserver IP is merged, but CSIT still has failures in this area. For the sake of not creating a new bug, I'll keep tracking them here.
This failure does not find the nameserver:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-boron/265/archives/log.html.gz#s1-s1-s1-t8
But I'm not totally sure about the problem, because if you look at the instance console logs after the failure you can see the nameserver line; see here for example:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-stateful-boron/265/archives/log.html.gz#s1-s1-s1-t8-k4-k4-k1-k5
But you can also see that the instance is not able to ping the gateway, so I wonder if there are real connectivity issues anyway.
It's possible that the console log output did not yet have the nameserver line in it when the check was made earlier.
|
|
Jamo,
I think we can safely close this bug. The original problems causing this bug were fixed.
The failures are now caused by NETVIRT-551 (a drop rule in table=51, which causes L2 connectivity issues to the VM). It shows up as a "DHCP failure" only because the nameserver print is delayed when there is no connectivity to the metadata server.
|
|
(In reply to Koby Aizer from comment #11)
> Jamo,
>
> I think we can safely close this bug. The original problems causing this bug
> were fixed.
>
> The failures are now caused by NETVIRT-551 (a drop rule in table=51, which
> causes L2 connectivity issues to the VM). It shows up as a "DHCP failure"
> only because the nameserver print is delayed when there is no connectivity
> to the metadata server.
Koby,
what about this one?
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-boron/518/archives/log.html.gz#s1-s1-s1-t8-k16
That one is a VM not getting its IP, it seems.
|
|
Hi Jamo,
This one looks a bit different - but I think it is still related to NETVIRT-551. The VM's DHCP request never got to the Controller node, because the ELAN remote BC group was missing the tunnel rule:
cookie=0x870138a, duration=189.719s, table=52, n_packets=9, n_bytes=1500, priority=5,metadata=0x138a000000/0xffff000001 actions=write_actions(group:210004)
group_id=210004,type=all,bucket=actions=group:210003
From Peri's last mail about this issue, I think it is the same root cause:
"For the DHCP issue, it seems like VxLAN tunnel add/update DCNs (with auto-tunnel configuration) are not processed properly for certain ELANs, which leads to:
ELAN remote DMAC flow BC group is programmed with drop action [1]
ELAN remote BC group is not having buckets for tunnel ports [2]"
|
|
(In reply to Koby Aizer from comment #13)
> Hi Jamo,
>
> This one looks a bit different - but I think it is still related to
> NETVIRT-551. The VM's DHCP request never got to the Controller node, because
> the ELAN remote BC group was missing the tunnel rule:
> cookie=0x870138a, duration=189.719s, table=52, n_packets=9, n_bytes=1500, priority=5,metadata=0x138a000000/0xffff000001 actions=write_actions(group:210004)
> group_id=210004,type=all,bucket=actions=group:210003
>
> From Peri's last mail about this issue, I think it is the same root cause:
>
> "For the DHCP issue, it seems like VxLAN tunnel add/update DCNs (with
> auto-tunnel configuration) are not processed properly for certain ELANs,
> which leads to:
> ELAN remote DMAC flow BC group is programmed with drop action [1]
> ELAN remote BC group is not having buckets for tunnel ports [2]"
here's another one:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-stateful-carbon/268/archives/log.html.gz#s1-s1-s2-t10
I tried to look at the flows to verify this belongs to NETVIRT-551, and I was looking for a mistaken drop rule in table 52, but it was not there. Maybe it's not table 52? I didn't see the table=52 flow you put above, Koby, with the bucket=actions=group detail either. I checked a passing run as well and didn't see it.
Either way, that's fine as long as we track it. But I'd like to know more about how to figure out what exactly to look for.
|
|
Hi Jamo,
It seems that the new report (268) is different: the DHCP failure is happening in a VLAN network (and not a VXLAN network). So this might be a different issue (because the issue we have been talking about up until now was a race with the VXLAN tunnel creation). I will try taking a look into this failure as well - we might need a different bug.
===
Just to complete the analysis of the VXLAN-type failures (for example, report https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-transparent-boron/518/archives/log.html.gz#s1-s1-s1-t8-k16):
These failures appear in two forms:
1. A table=51 drop rule towards the problematic VM MAC address on the Control node (this rule is used to send the DHCP response to the VM - it should have been an output-to-tunnel rule).
2. A missing "output to tunnel towards the Control node" bucket in the ELAN remote BC group on the compute node of the VM (this bucket is used to broadcast the DHCP request towards the DHCP server).
In the report I mentioned, the issue is #2:
The problematic VM resides on OS_COMPUTE_2 and is part of elanId=0x138a (you can tell by the metadata match):
cookie=0x803138a, duration=256.645s, table=51, n_packets=0, n_bytes=0, priority=20,metadata=0x138a000000/0xffff000000,dl_dst=fa:16:3e:f8:a0:f4 actions=load:0xf00->NXM_NX_REG6[],resubmit(,220)
You can see that the table=52 rule directs broadcast packets in that ELAN towards group:210004:
cookie=0x870138a, duration=256.645s, table=52, n_packets=9, n_bytes=1500, priority=5,metadata=0x138a000000/0xffff000001 actions=write_actions(group:210004)
And in the "dump-groups" output you can see this group is missing its "output to tunnel towards the Control node" bucket:
group_id=210004,type=all,bucket=actions=group:210003
Just for comparison, this is how this group looks on OS_COMPUTE_1, which is OK:
group_id=210004,type=all,bucket=actions=group:210003,bucket=actions=set_field:0x138a->tun_id,output:4,bucket=actions=set_field:0x138a->tun_id,output:5
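For anyone triaging similar reports, here is a minimal Python sketch of the two checks described above, run against saved "dump-flows"/"dump-groups" output. The function names and the heuristic that a healthy remote BC group carries at least one tun_id bucket are my assumptions, not part of the CSIT suite; the sample strings are taken from the dumps above:

def table51_drops_vm_mac(flow_dump, vm_mac):
    # Signature #1: a table=51 rule for the VM MAC whose action is drop.
    return any("table=51" in line and "dl_dst=" + vm_mac in line
               and "actions=drop" in line
               for line in flow_dump.splitlines())

def bc_group_missing_tunnel_bucket(group_dump, group_id):
    # Signature #2: the ELAN remote BC group has no tunnel (tun_id) bucket.
    # Healthy: ...,bucket=actions=set_field:0x138a->tun_id,output:4,...
    # Broken:  group_id=210004,type=all,bucket=actions=group:210003
    for line in group_dump.splitlines():
        if line.strip().startswith("group_id=%d," % group_id):
            return "tun_id" not in line
    return True  # the group being absent altogether is also broken

# Usage against the OS_COMPUTE_2 dumps quoted above:
flows = ("cookie=0x803138a, duration=256.645s, table=51, n_packets=0, "
         "n_bytes=0, priority=20,metadata=0x138a000000/0xffff000000,"
         "dl_dst=fa:16:3e:f8:a0:f4 actions=load:0xf00->NXM_NX_REG6[],resubmit(,220)")
groups = "group_id=210004,type=all,bucket=actions=group:210003"
print(table51_drops_vm_mac(flows, "fa:16:3e:f8:a0:f4"))  # False: the rule exists and is not a drop
print(bc_group_missing_tunnel_bucket(groups, 210004))    # True: no tun_id bucket, i.e. form #2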
|
|
Yes, the report for the /518 CSIT run is because of a race condition during auto-tunnel configuration. This should get fixed by the following reviews:
Boron - https://git.opendaylight.org/gerrit/#/c/53963/
Master - https://git.opendaylight.org/gerrit/#/c/53958/
For the DHCP issue on the VLAN provider network (as per CSIT report /268), I observe the following, and these are must-fixes:
1. The ELAN remote BC group for the VLAN provider network has buckets pointing to VxLAN tunnels, for example: group_id=210008,type=all,bucket=actions=group:210007,bucket=actions=resubmit(,220),load:0x1300->NXM_NX_REG6[],set_field:0x138c->tun_id,bucket=actions=resubmit(,220),load:0xa00->NXM_NX_REG6[]
2. ELAN MAC learning doesn't happen for the VLAN provider network because of the following error (introduced by https://git.opendaylight.org/gerrit/#/c/52174/):
2017-03-28 16:20:31,872 | ERROR | pool-24-thread-1 | DOMNotificationRouterEvent | 144 - org.opendaylight.controller.sal-broker-impl - 1.5.0.SNAPSHOT | Delivery of notification org.opendaylight.controller.md.sal.binding.impl.LazySerializedDOMNotification@633698c0 caused an error in listener org.opendaylight.controller.md.sal.binding.impl.BindingDOMNotificationListenerAdapter@77f6d79b
java.lang.IllegalArgumentException: Cannot create IpAddress from 134235392
at org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.inet.types.rev130715.IpAddressBuilder.getDefaultInstance(IpAddressBuilder.java:43)[63:org.opendaylight.mdsal.model.ietf-inet-types-2013-07-15:1.2.0.SNAPSHOT]
at org.opendaylight.netvirt.elan.utils.ElanUtils.getSourceIpV4Address(ElanUtils.java:2212)[346:org.opendaylight.netvirt.elanmanager-impl:0.4.0.SNAPSHOT]
at org.opendaylight.netvirt.elan.utils.ElanUtils.getSourceIpAddress(ElanUtils.java:2231)[346:org.opendaylight.netvirt.elanmanager-impl:0.4.0.SNAPSHOT]
at org.opendaylight.netvirt.elan.internal.ElanPacketInHandler.onPacketReceived(ElanPacketInHandler.java:101)[346:org.opendaylight.netvirt.elanmanager-impl:0.4.0.SNAPSHOT]
at org.opendaylight.yangtools.yang.binding.util.NotificationListenerInvoker.invokeNotification(NotificationListenerInvoker.java:117)[47:org.opendaylight.mdsal.yang-binding:0.10.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMNotificationListenerAdapter.onNotification(BindingDOMNotificationListenerAdapter.java:44)[146:org.opendaylight.controller.sal-binding-broker-impl:1.5.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMNotificationRouterEvent.deliverNotification(DOMNotificationRouterEvent.java:56)[144:org.opendaylight.controller.sal-broker-impl:1.5.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMNotificationRouter$1.onEvent(DOMNotificationRouter.java:68)[144:org.opendaylight.controller.sal-broker-impl:1.5.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMNotificationRouter$1.onEvent(DOMNotificationRouter.java:65)[144:org.opendaylight.controller.sal-broker-impl:1.5.0.SNAPSHOT]
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)[131:com.lmax.disruptor:3.3.6]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_121]
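For what it's worth, the rejected value decodes cleanly if treated as a packed big-endian IPv4 address; IpAddressBuilder.getDefaultInstance expects a dotted-quad string, not the packed integer's decimal form. A quick Python illustration of that reading (my interpretation of the exception, not the actual ElanUtils fix):

import ipaddress

raw = 134235392                   # the value from the exception above
print(ipaddress.ip_address(raw))  # -> 8.0.69.0, a perfectly valid IPv4
# getDefaultInstance("134235392") fails because the packed integer was
# stringified instead of being unpacked to dotted-quad form first.
# (Speculatively, 0x08004500 also looks like an EtherType 0x0800 followed
# by an IPv4 version/IHL byte 0x45, which could hint at reading the wrong
# packet offset - but that is a guess.)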
|
|
Raised a patch to fix #1 and #2:
https://git.opendaylight.org/gerrit/#/c/54240/
|
|
Since the above review is merged, I suggest Koby look at any other problems in ELAN for the VLAN provider network which cause the DHCP issue.
|
|
Thanks Peri.
Hi Koby,
Can you please take this forward for the VLAN-provider-network based DHCP use cases?
Vivek
|
|
Vivek/Peri,
We can consider the VLAN provider issues as "resolved" - those were caused by Genius duplicate service bindings.
A workaround patch was merged (https://git.opendaylight.org/gerrit/#/c/54247/), and Faseela is working on a proper fix (https://bugs.opendaylight.org/show_bug.cgi?id=7451).
Given that Peri also pushed his fixes - would you like to close this bug?
|
|
(In reply to Koby Aizer from comment #20)
> Vivek/Peri,
>
> We can consider the VLAN provider issues as "resolved" - those were caused
> by Genius duplicate service bindings.
> A workaround patch was merged
> (https://git.opendaylight.org/gerrit/#/c/54247/), and Faseela is working on
> a proper fix (https://bugs.opendaylight.org/show_bug.cgi?id=7451).
>
>
> Given that Peri also pushed his fixes - would you like to close this bug?
I think we can keep this bug open to track the other reasons this might be causing failures. I saw the high-level symptom (an instance didn't get its DHCP lease) here:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/210/archives/log.html.gz#s1-s1-s3-t5
I tried to check for any of the root causes discussed here already, but didn't think they applied. This may be something new to analyze.
We can file a new bug, but that seems like extra overhead at this point.
|
|
seen in vpnservice suite:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-boron/256/archives/log.html.gz#s1-s4-t5
|
|
Faseela, as per this CSIT report (/256), I see that the table 0 flow rule itself is not programmed for any of the VMs, and the interface state is not populated for the VM interfaces (i.e., the VLAN trunk interfaces). Can you have a look?
|
|
Just had a quick look at the failing suite.
I do see a swap of parent-interface and interface-name in the ietf-interfaces config DS for trunk-member interfaces between the failing and passing TCs.
The parent-interface specified in the failing TC does not match the actual port name, and hence the table 0 flow won't get created.
Failing one
===========
{
"enabled": true,
"name": "273764871643029:br-physnet1-pa:167",
"odl-interface:external": true,
"odl-interface:l2vlan-mode": "trunk-member",
"odl-interface:parent-interface": "273764871643029:br-physnet1-pa:trunk",
"odl-interface:vlan-id": 167,
"type": "iana-if-type:l2vlan"
}
Passing one
===========
{
"enabled": true,
"name": "97426923480895:br-physnet1-pa:trunk",
"odl-interface:external": true,
"odl-interface:l2vlan-mode": "trunk",
"odl-interface:parent-interface": "97426923480895:br-physnet1-pa",
"type": "iana-if-type:l2vlan"
}
|
|
Ignore my previous comment - I just debugged the issue with Peri.
The neutron ports list shows all the ports, and all the ports are present on the switch as well; however, none of them are configured in config/ietf-interfaces.
Somebody with NeutronVpn expertise should have a look at this.
Thanks,
Faseela
|
|
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-stateful-snat-conntrack-boron/6/archives/log.html.gz#s1-s1-s1-t8
|
|
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-carbon/236/archives/log.html.gz
|
|
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-boron/285/archives/log.html.gz
|
|
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-transparent-boron/533/archives/log.html.gz
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-stateful-snat-conntrack-boron/15/archives/log.html.gz
|
|
[0] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-learn-carbon/30/log.html.gz
[1] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-carbon/30/log.html.gz
[2] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-stateful-snat-conntrack-carbon/30/log.html.gz
[3] https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-transparent-carbon/30/log.html.gz
|
|
Closing this, as we are not seeing this general problem any more. If/when we do have a failure with instances not getting their IP addresses, we can re-open, or even better, file a new, more specific bug.
|