[NETVIRT-1144] PNF failure: Ping External Network PNF from Vm Instance 1 Created: 09/Mar/18  Updated: 03/Dec/18  Resolved: 03/Dec/18

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: None
Fix Version/s: Oxygen, Fluorine

Type: Bug Priority: High
Reporter: Sam Hague Assignee: Kiran N Upadhyaya
Resolution: Duplicate Votes: 0
Labels: csit:exception, patch_merged
Remaining Estimate: 0 minutes
Time Spent: 6 hours
Original Estimate: Not Specified

Attachments: File odl1_karaf.pnf.log.tar.xz    
Issue Links:
Relates
relates to NETVIRT-1399 .Ping External Network PNF from Vm In... Resolved

 Description   

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-upstream-stateful-oxygen/210/odl_1/odl1_karaf.log.gz



 Comments   
Comment by Chetan Arakere Gowdru [ 09/Mar/18 ]

Hi Kiran,

 

I raised a review(https://git.opendaylight.org/gerrit/#/c/69263/). Now this Exception is masked and table=21->22(10.10.10.0/24) been programmed on all computes, but the PNF use-case is still failing and this looks to be different issue.

 

On analysis on dump, I see the FIB rule(for destination 10.10.10.253) is not getting hit even programmed, but the subsequent packets were also been getting punted to CSC

 

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-gate-stateful-oxygen/54/robot-plugin/log_03_external_network.html.gz

 

Before PNF Test-Case Execution

 

cookie=0x8000003, duration=103.572s, table=21, n_packets=0, n_bytes=0, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22

cookie=0x8000004, duration=76.026s, table=25, n_packets=6, n_bytes=588, priority=10,ip,dl_dst=fa:16:3e:79:e3:6a,nw_dst=10.10.10.10 actions=set_field:41.0.0.7->ip_dst,write_metadata:0x30d7c/0xfffffe,goto_table:27

cookie=0x8000004, duration=76.026s, table=26*, n_packets=6*, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_src=41.0.0.7 actions=set_field:10.10.10.10->ip_src,write_metadata:0x30d76/0xfffffe,goto_table:28

cookie=0x8000004, duration=76.026s, table=27, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_dst=41.0.0.7 actions=resubmit(,21)

cookie=0x8000004, duration=76.026s, table=28, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d76/0xfffffe,nw_src=10.10.10.10 actions=set_field:fa:16:3e:79:e3:6a->eth_src,resubmit(,21)

 

After PNF Test-Case Execution (3 packet sent to 10.10.10.253 via ping)

 

cookie=0x8000003, duration=12.320s, table=21, n_packets=0, n_bytes=0, priority=42,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.253 actions=set_field:f6:b3:8f:89:e1:2c->eth_dst,load:0x1d00->NXM_NX_REG6[],resubmit(,220) (Flow to reach destination been programmed, but not hit)

cookie=0x8000003, duration=128.005s, table=21, n_packets=3, n_bytes=294, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22 (all 3 packets been punted to CSC)

cookie=0x8000004, duration=2450.898s, table=22, n_packets=3, n_bytes=294, priority=0 actions=CONTROLLER:65535

cookie=0x8000004, duration=100.459s, table=25, n_packets=6, n_bytes=588, priority=10,ip,dl_dst=fa:16:3e:79:e3:6a,nw_dst=10.10.10.10 actions=set_field:41.0.0.7->ip_dst,write_metadata:0x30d7c/0xfffffe,goto_table:27

cookie=0x8000004, duration=100.459s, table=26, n_packets=9, n_bytes=882, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_src=41.0.0.7 actions=set_field:10.10.10.10->ip_src,write_metadata:0x30d76/0xfffffe,goto_table:28

 cookie=0x8000004, duration=100.459s, table=27, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_dst=41.0.0.7 actions=resubmit(,21)

cookie=0x8000004, duration=100.459s, table=28, n_packets=9, n_bytes=882, priority=10,ip,metadata=0x30d76/0xfffffe,nw_src=10.10.10.10 actions=set_field:fa:16:3e:79:e3:6a->eth_src,resubmit(,21)

 

When compared with pass case, only first packet punted to CSC and subsequent packet been hit the actual FIB rule(looks like the first packet is lost here too)

[1]https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-gate-stateful-oxygen/11/robot-plugin/log_03_external_network.html.gz#s1-t17-k4-k1-k2-k1-k12-k4

 

cookie=0x8000003, duration=112.184s, table=21, n_packets=1, n_bytes=98, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22

cookie=0x8000003, duration=5.254s, table=21, n_packets=2, n_bytes=196, priority=42,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.253 actions=set_field:7a:1e:56:99:cb:cc->eth_dst,load:0x1e00->NXM_NX_REG6[],resubmit(,220)

 

Request to let me know your inputs on this.

 

Thanks in Advance,

Chetan

Comment by Chetan Arakere Gowdru [ 09/Mar/18 ]

Hi Sam,

 

I don’t see the current PNF issue is related to any of these bug list(but might be due to delayed installation of FIB flow for dest-10.10.10.253)

 

@Kiran,

 

From the karaf logs, I see by the time FIB flow is processed, all the 3 ICMP packet been punted to CSC.

 

3 ping packets punted to CSC

2018-03-08T15:49:30,875 | INFO  | pool-50-thread-1 | SubnetRoutePacketInHandler       | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027

2018-03-08T15:49:31,877 | INFO  | pool-50-thread-1 | SubnetRoutePacketInHandler       | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027

2018-03-08T15:49:32,878 | INFO  | pool-50-thread-1 | SubnetRoutePacketInHandler       | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027

 

ARP response obtained immediated in 30 sec

2018-03-08T15:49:30,884 | INFO  | pool-50-thread-1 | ArpNotificationHandler           | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | Received ARP for sender MAC F6:B3:8F:89:E1:2C and sender IP 10.10.10.253 via interface 229136718462639:br-physnet1-pa:trunk

2018-03-08T15:49:30,884 | INFO  | pool-50-thread-1 | ArpNotificationHandler           | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | ARP being processed for Source IP 10.10.10.253

 

Creation of FIB entry for this IP(10.10.10.253) is been done after in 32 sec

2018-03-08T15:49:32,898 | INFO  | ForkJoinPool-1-worker-0 | FibUtil                          | 350 - org.opendaylight.netvirt.fibmanager-impl - 0.6.0.SNAPSHOT | addOrUpdateFibEntry: Created/Updated vrfEntry for rd d1075ac5-6277-4e63-bc18-13330d751f73 prefix 10.10.10.253/32 nexthop [] label 0 l3vni 0 origin LOCAL encapType null

2018-03-08T15:49:32,899 | INFO  | ForkJoinPool-1-worker-0 | VpnInterfaceManager              | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | update: new Adjacency 10.10.10.253/32 with nextHop null label null subnet Uuid [_value=d1075ac5-6277-4e63-bc18-13330d751f73] added to vpn interface 229136718462639:br-physnet1-pa:trunk on vpn dd4397b6-25c6-4532-84f6-cabc71760846 dpnId 229136718462639

 

So, I suspect due to delayed installation of FIB flows for PNF, all the packets(3) been punted to CSC and been lost.

 

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-gate-stateful-oxygen/54/robot-plugin/log_03_external_network.html.gz

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-openstack-queens-gate-stateful-oxygen/54/odl_1/odl1_karaf.log.gz

 

Thanks,

Chetan

Comment by Chetan Arakere Gowdru [ 09/Mar/18 ]

The fix https://git.opendaylight.org/gerrit/#/c/69263/ is handling the OptimisticLockFailedException.

Post this fix, two CSIT jobs been ranned and this exception not observed.

But, the PNF UC is still failed during these run due to delayed installation of actual FIB flow for PNF(10.10.10.253) as commented above.

This needs to be investigated.

 

Comment by Daniel Farrell [ 09/Mar/18 ]

Thanks for the patch xcheara. Are you driving progress on resolving the remaining failure?

Comment by Chetan Arakere Gowdru [ 09/Mar/18 ]
kiran.n.upadhyaya@ericsson.com is further investigating on the issue of delayed installation of FIB flow.
Comment by Kit Lou [ 11/Mar/18 ]

https://git.opendaylight.org/gerrit/#/c/69263/ ready for merge

Comment by Daniel Farrell [ 12/Mar/18 ]

Patch is merged, thanks shague.

jluhrsen et al are worried that something may still be going wrong under the hood, may a perf regression (used to take 1 second, now takes 3 or more). Three or four times worse perf could be a big deal if it's code that might be very active in some use case and we're just not hitting that code much in our CSIT.

ecelgp - To check on perf regressions in flow programming. In quick check he doesn't see regression from Nitrogen to Oxygen for flow programming time.

shague - To work with kiranvasudeva to try to clarify exactly what logic path is taking so long, seems to be narrowly scoped problem.

Discussion about if this should be a blocker is ongoing. Will do more digging and see how it looks when we know more.

Comment by Kiran Upadhyaya [ 13/Mar/18 ]

FIxed the two second delay in PNF route learning here:

https://git.opendaylight.org/gerrit/#/c/69429/

 

Comment by Daniel Farrell [ 13/Mar/18 ]

eupakir - Awesome, thanks for tracking down the root cause.

shague - Please review 69429.

Comment by Jamo Luhrsen [ 19/Oct/18 ]

this is still happening.

examples here

Comment by Shaik Zakir Basha [ 22/Oct/18 ]

Hi Kiran/all,

From the logs, we notice that the remote FIB flow for ip 10.10.10.253 is not getting installed in failure case though the packet is punted to controller, ping has failed due to this

Please find below the flows in both success and failure cases

Success case:

cookie=0x8000003, duration=96.736s, table=21, n_packets=4, n_bytes=392, priority=34,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d70/0xfffffffffe,goto_table:22

cookie=0x8000010, duration=2781.341s, table=22, n_packets=2, n_bytes=196, priority=0,ip actions=CONTROLLER:65535,learn(table=22,hard_timeout=10,priority=10,cookie=0x8000010,eth_type=0x800,NXM_OF_IP_DST[],OXM_OF_METADATA[1..23])

cookie=0x8000003, duration=11.138s, table=21, n_packets=2, n_bytes=196, priority=42,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.253 actions=set_field:2a:44:f6:95:27:78->eth_dst,load:0x1000->NXM_NX_REG6[],resubmit(,220)

Failure case:

cookie=0x8000003, duration=74.976s, table=21, n_packets=3, n_bytes=294, priority=34,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d70/0xfffffffffe,goto_table:22

cookie=0x8000010, duration=2759.581s, table=22, n_packets=1, n_bytes=98, priority=0,ip actions=CONTROLLER:65535,learn(table=22,hard_timeout=10,priority=10,cookie=0x8000010,eth_type=0x800,NXM_OF_IP_DST[],OXM_OF_METADATA[1..23])

Regards,
Zakir

Comment by Chetan Arakere Gowdru [ 29/Oct/18 ]

Reassigned Bug to Kiran for further debugging on Subnet Route Handling Feature.

Comment by Jamo Luhrsen [ 30/Nov/18 ]

anything going on with this? we see it periodically still. Just saw it in our new apex gate job
which we want to make 100% stable so we can turn that in to a voting gate.

Comment by Kiran Upadhyaya [ 30/Nov/18 ]

Not yet jluhrsen. I haven't been able to further debug it.

Generated at Wed Feb 07 20:23:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.