|
Adding email thread:
jira here:
https://jira.opendaylight.org/browse/NETVIRT-1549
We can maybe debug further there, but I'm starting to wonder if
we have two different bugs here.
In the one apex example I gave, the failure was in the initial
PNF ping, which is expected to pass. In that job it fails, but the
flow eventually gets programmed and further test cases are OK.
In the devstack job that we were looking at, it was eventually
failing in the test case after the floating IP was assigned. In
that case, the flow never got programmed at all.
Also, Aswin noted and I verified that the devstack job for
conntrack does not see this problem. BUT the apex job, which is
also conntrack, does see it, so I don't think we can say
that it's not there in conntrack.
It happened on the first gate job I ran in the sandbox just now,
so hopefully we can figure it out and resolve it so I can move
on with my plan to make the gate job voting.
JamO
On 1/7/19 9:36 AM, Aswin Suryanarayanan wrote:
>
>
> On Mon, Jan 7, 2019 at 10:45 PM Sam Hague <shague@redhat.com> wrote:

>
>
>
> On Mon, Jan 7, 2019 at 11:41 AM Aswin Suryanarayanan <asuryana@redhat.com> wrote:
>
> When I look at [1] I see that the ARP to the PNF(10.10.10.253) is never answered. Can we suspect anything wrong
> with the PNF? Is it another virtual node?
>
> Would increasing the number of pings help here? Or the TTL piece? One thing I did add in the ping patch was
> -W1 in the ping args; this sets the wait timeout to 1s. The timeout applies when no response is received, so it
> made sense to lower it to the minimum, because we retry via the keyword rather than letting ping retry, which gives
> us better control.
>
> The -t1 was always there, and that means a TTL of 1. Maybe that should be bumped and could help? Not sure how to
> verify this, but when you say the ARP is never answered, I wonder if there is some other hop sometimes. How did you
> verify the ARP is not answered: via debug logs, mdsal, flows, punt to controller, or tcpdump?
>
> I don't think it is a timing issue. Even though the ping test failed, if the PNF sends a response ODL will learn it,
> even after the test case. I was expecting this log to be present for 10.10.10.253, but it is not there.
>
> AbstractIpLearnNotificationHandler | 368 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | Received ARP/NA for sender MAC E6:3F:53:B4:D9:CF and sender IP 10.10.10.253 via interface 48400575920930:br-physnet1-pa:trunk
>
> The log below confirms the ARP was sent.
>
> ArpUtils | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SubnetRoutePacketInHandler: sendArpRequest dpnId 70852734155410, actions [], groupId 210022, senderIPAddress 10.10.10.5, targetIPAddress 10.10.10.253
>
>
> It could be an issue with the pipeline too, but since ARP learning is successful for the dc-gw (10.10.10.250), I
> tend to think the pipeline is fine, because both of them hit almost the same flows.
>
>
> ping -W1 -t1 -c1 10.10.10.253
>
>
> Surprisingly, the conntrack job [2] does not seem to be hitting this for the time being.
>
> [1]https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-neon/830/odl_1/odl1_karaf.log.gz
> [2]https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-gate-stateful-snat-conntrack-fluorine/
>
> On Mon, Jan 7, 2019 at 9:48 PM Jamo Luhrsen <jluhrsen@redhat.com> wrote:
>
> Yep, I noticed this too before the break. This is after Sam's fix to add more
> ping retries to get around the issue you saw before, where there was a slight
> delay before flows got programmed.
>
> JamO
>
> On 1/7/19 4:12 AM, Sam Hague wrote:
> > Aswin,
> >
> > Looks like the PNF is randomly failing again. [1] is the report showing the below test failing in the job
> > numbers listed. Any ideas?
> >
> > Ping External Network PNF from Vm Instance 1 After Floating IP Ass... [733, 743, 802, 815, 819, 820, 828]
> >
> > Thanks, Sam
> >
> > [1]
> > https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-job-reports/87/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-neon.console.txt.gz
>
|
|
On Mon, Jan 7, 2019 at 10:35 PM Jamo Luhrsen <jluhrsen@redhat.com> wrote:
Aswin,
I saw this is happening in the apex/gate job too:
https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-0cmb-1ctl-2cmp-apex-queens-gate-snat-conntrack-neon/159/robot/openstack/02_external_network/Initial%20Ping%20To%20External%20Network%20PNF%20from%20Vm%20Instance%201/
The gate job issue was for the run I started. I have fixed it as part of this patch. This should remove some of the exceptions too.
[1]https://git.opendaylight.org/gerrit/#/c/73180/8/vpnmanager/impl/src/main/java/org/opendaylight/netvirt/vpnmanager/iplearn/AbstractIpLearnNotificationHandler.java
In that apex job, the PNF is on the control node, which is also the ODL node.
In the non-apex job, the PNF is on the control node, but ODL is on its own
node.
Also, note that the apex job is running with conntrack.
Should we just get a JIRA filed and start asking for others to help
too? I think it might be one of our final bugs to figure out before
making the apex jobs fully gating.
Did the apex job fail recently? Agreed, it could be something else too; we can file a JIRA for this.
JamO
On 1/7/19 8:41 AM, Aswin Suryanarayanan wrote:
> When I look at [1] I see that the ARP to the PNF(10.10.10.253) is never answered. Can we suspect anything wrong with the
> PNF? Is it another virtual node?
>
> Surprisingly, the conntrack job [2] does not seem to be hitting this for the time being.
>
> [1]https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-neon/830/odl_1/odl1_karaf.log.gz
> [2]https://jenkins.opendaylight.org/releng/job/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-gate-stateful-snat-conntrack-fluorine/
>
> On Mon, Jan 7, 2019 at 9:48 PM Jamo Luhrsen <jluhrsen@redhat.com> wrote:
>
> Yep, I noticed this too before the break. This is after Sam's fix to add more
> ping retries to get around the issue you saw before, where there was a slight
> delay before flows got programmed.
>
> JamO
>
> On 1/7/19 4:12 AM, Sam Hague wrote:
> > Aswin,
> >
> > Looks like the PNF is randomly failing again. [1] is the report showing the below test failing in the job numbers
> > listed. Any ideas?
> >
> > Ping External Network PNF from Vm Instance 1 After Floating IP Ass... [733, 743, 802, 815, 819, 820, 828]
> >
> > Thanks, Sam
> >
> > [1]
> > https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-job-reports/87/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-neon.console.txt.gz
>
|
|
BTW Aswin, the apex job ODL karaf log is here
|
|
I can see the first ARP we send is not replied to; we are receiving it back via the external interface and ignoring it, since we are the sender. So the flows should be fine.
2019-01-07T19:31:49,627 | INFO | pool-20-thread-1 | SubnetRoutePacketInHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IP Packet received with Source IP 10.10.10.26 and Target IP 10.10.10.253 and vpnId 100001
2019-01-07T19:31:49,631 | INFO | pool-20-thread-1 | ArpUtils | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SubnetRoutePacketInHandler: sendArpRequest dpnId 238484387869016, actions [], groupId 210012, senderIPAddress 10.10.10.26, targetIPAddress 10.10.10.253
2019-01-07T19:31:49,635 | ERROR | pool-20-thread-1 | SubnetRoutePacketInHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | Vpn interface tun9c586aac495 doesn't exist.
2019-01-07T19:31:49,640 | INFO | pool-20-thread-1 | ArpNotificationHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Non-Gratuitous Request Received from interface 75753836899984:br-datacentre-:trunk and IP 10.10.10.26 having MAC FA:16:3E:25:14:FE target destination 10.10.10.253, ignoring..
2019-01-07T19:31:49,640 | INFO | pool-20-thread-1 | ArpNotificationHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Non-Gratuitous Request Received from interface 260122599241271:br-datacentre-:trunk and IP 10.10.10.26 having MAC FA:16:3E:25:14:FE target destination 10.10.10.253, ignoring..
When the same ARP is sent again, it works, even though the sender and the other parameters remain the same.
2019-01-07T19:32:17,325 | INFO | pool-20-thread-1 | SubnetRoutePacketInHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IP Packet received with Source IP 10.10.10.26 and Target IP 10.10.10.253 and vpnId 100001
2019-01-07T19:32:17,326 | INFO | pool-20-thread-1 | ArpUtils | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SubnetRoutePacketInHandler: sendArpRequest dpnId 238484387869016, actions [], groupId 210012, senderIPAddress 10.10.10.26, targetIPAddress 10.10.10.253
2019-01-07T19:32:17,327 | ERROR | pool-20-thread-1 | SubnetRoutePacketInHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | Vpn interface tun39aee49f130 doesn't exist.
2019-01-07T19:32:17,332 | INFO | pool-20-thread-1 | ArpNotificationHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Response Received from interface 238484387869016:br-datacentre-:trunk and IP 10.10.10.253 having MAC 1A:F5:1B:58:E8:89, learning MAC
2019-01-07T19:32:17,333 | INFO | pool-20-thread-1 | AbstractIpLearnNotificationHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | Received ARP/NA for sender MAC 1A:F5:1B:58:E8:89 and sender IP 10.10.10.253 via interface 238484387869016:br-datacentre-:trun
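For clarity, here is a minimal sketch of the learning decision these logs suggest. It is inferred only from the log messages above, not taken from the actual vpnmanager code, and the names are illustrative:

// Minimal sketch of the ARP-learning decision suggested by the logs above.
// Inferred from the log messages only; not the actual ODL implementation.
public class ArpLearnSketch {

    // A response, or a gratuitous request (sender IP == target IP), triggers
    // MAC learning; a non-gratuitous request -- e.g. our own request flooded
    // back to us on the external bridge -- is ignored.
    static boolean shouldLearn(boolean isResponse, String senderIp, String targetIp) {
        return isResponse || senderIp.equals(targetIp);
    }

    public static void main(String[] args) {
        // Our own request echoed back: "Non-Gratuitous Request ... ignoring"
        System.out.println(shouldLearn(false, "10.10.10.26", "10.10.10.253")); // false
        // The PNF's reply on the second attempt: "Response Received ... learning MAC"
        System.out.println(shouldLearn(true, "10.10.10.253", "10.10.10.26"));  // true
    }
}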
Though we have many pings, only one reached the controller; the rest are all dropped. This is due to the rule below, which installs a drop rule with a 10s hard timeout while also sending the packet to the controller. This keeps the controller less loaded and gives it some time to learn the MAC via ARP. Once the ARP is resolved, a new flow is installed in table 21 and the ping no longer reaches table 22. So all the ping retries finish before the 10s timeout expires, and the controller tries only one ARP for the whole test case. Can we have a bigger interval between the pings, so that the controller gets another chance to ARP?
table=22, n_packets=1, n_bytes=98, priority=0,ip actions=CONTROLLER:65535,learn(table=22,hard_timeout=10,priority=10,cookie=0x8000010,eth_type=0x800,NXM_OF_IP_DST[],OXM_OF_METADATA[1..23])
|
|
Yes, we can easily bump the retries, from 8 to 16 for example. The current call has retries set to 8 and each try is 1s. That was a recent change, so it makes sense that this started around the same time.
The question is why the different path happens at all, and why randomly? I wonder if we need to configure the deployment of the PNF differently.
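As a quick back-of-the-envelope, here is a small sketch of why all the retries currently miss the window. The 10s hard_timeout and the 1s-per-try cadence are the numbers from this thread; the code itself is only illustrative:

// All retries landing inside the 10s drop window means the controller
// punts (and ARPs) only once per test case; 16 retries would outlast it.
public class PingRetryTiming {
    public static void main(String[] args) {
        final int dropWindowSec = 10;  // hard_timeout on the learned drop flow in table 22
        final int perTrySec = 1;       // one ping try per second (-W1 wait, keyword retries)
        for (int retries : new int[] {8, 16}) {
            int lastTryAt = retries * perTrySec;
            String outcome = lastTryAt > dropWindowSec
                ? "a retry lands after the drop flow expires, so the controller can ARP again"
                : "every retry is dropped, so the controller ARPs only once";
            System.out.printf("retries=%d -> last try at %ds: %s%n", retries, lastTryAt, outcome);
        }
    }
}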
|
|
Yeah, we have yet to figure out why this is random. But since we are seeing the ARP we send, we are sure ODL is sending it out. So the only things to check are whether ODL ignored the response for some reason, or whether the PNF didn't send one. Do we have packet counts for the PNF?
2019-01-07T19:31:49,640 | INFO | pool-20-thread-1 | ArpNotificationHandler | 363 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Non-Gratuitous Request Received from interface 260122599241271:br-datacentre-:trunk and IP 10.10.10.26 having MAC FA:16:3E:25:14:FE target destination 10.10.10.253, ignoring..
|
|
This is not reproducing any more. In more than 70 tries now it did not come back. This is regarding the
apex job failure. I'll let my looping sandbox job keep running, but at this point I don't know how much
time we want to sink into it any more.
We have the other flavor of this bug (in the devstack, non-conntrack job) that we may want to worry
about at some point, but even that hasn't happened in over a week.
|
|
The devstack flavor has the same root cause. In the non-conntrack job there is only one test case for the PNF; the rest are all skipped. So we never see the route learned, as there is no second test case.
2019-01-07T11:06:43,240 | INFO | pool-14-thread-1 | SubnetRoutePacketInHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IP Packet received with Source IP 10.10.10.13 and Target IP 10.10.10.253 and vpnId 100024
2019-01-07T11:06:43,249 | INFO | pool-14-thread-1 | ArpUtils | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | SubnetRoutePacketInHandler: sendArpRequest dpnId 70852734155410, actions [], groupId 210022, senderIPAddress 10.10.10.5, targetIPAddress 10.10.10.253
2019-01-07T11:06:43,264 | INFO | pool-14-thread-1 | ArpNotificationHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Non-Gratuitous Request Received from interface 273634256852812:br-physnet1-pa:trunk and IP 10.10.10.5 having MAC FA:16:3E:60:70:D0 target destination 10.10.10.253, ignoring..
2019-01-07T11:06:43,266 | INFO | pool-14-thread-1 | ArpNotificationHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.8.0.SNAPSHOT | ArpNotification Non-Gratuitous Request Received from interface 206713479921264:br-physnet1-pa:trunk and IP 10.10.10.5 having MAC FA:16:3E:60:70:D0 target destination 10.10.10.253, ignoring..
|
|
@Jamo,
I can see the External Network PNF cases are stable in magnesium.
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-1node-1cmb-0ctl-0cmp-openstack-rocky-upstream-stateful-snat-conntrack-magnesium/69/robot/openstack/03_external_network/
Do we still see this as an open issue?
|
|
xcheara,
It's been a really long time since I paid attention to netvirt failures, but I found this job, which seems to still
have it.
A couple more:
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-magnesium/135/robot/
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-1node-1cmb-0ctl-0cmp-openstack-queens-upstream-stateful-magnesium/70/robot/openstack/03_external_network/
|