[NETVIRT-1144] PNF failure: Ping External Network PNF from Vm Instance 1 Created: 09/Mar/18 Updated: 03/Dec/18 Resolved: 03/Dec/18 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Oxygen, Fluorine |
| Type: | Bug | Priority: | High |
| Reporter: | Sam Hague | Assignee: | Kiran N Upadhyaya |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | csit:exception, patch_merged | ||
| Remaining Estimate: | 0 minutes | ||
| Time Spent: | 6 hours | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
|
| Comments |
| Comment by Chetan Arakere Gowdru [ 09/Mar/18 ] |
|
Hi Kiran,
I raised a review(https://git.opendaylight.org/gerrit/#/c/69263/). Now this Exception is masked and table=21->22(10.10.10.0/24) been programmed on all computes, but the PNF use-case is still failing and this looks to be different issue.
On analysis on dump, I see the FIB rule(for destination 10.10.10.253) is not getting hit even programmed, but the subsequent packets were also been getting punted to CSC
Before PNF Test-Case Execution
cookie=0x8000003, duration=103.572s, table=21, n_packets=0, n_bytes=0, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22 cookie=0x8000004, duration=76.026s, table=25, n_packets=6, n_bytes=588, priority=10,ip,dl_dst=fa:16:3e:79:e3:6a,nw_dst=10.10.10.10 actions=set_field:41.0.0.7->ip_dst,write_metadata:0x30d7c/0xfffffe,goto_table:27 cookie=0x8000004, duration=76.026s, table=26*, n_packets=6*, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_src=41.0.0.7 actions=set_field:10.10.10.10->ip_src,write_metadata:0x30d76/0xfffffe,goto_table:28 cookie=0x8000004, duration=76.026s, table=27, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_dst=41.0.0.7 actions=resubmit(,21) cookie=0x8000004, duration=76.026s, table=28, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d76/0xfffffe,nw_src=10.10.10.10 actions=set_field:fa:16:3e:79:e3:6a->eth_src,resubmit(,21)
After PNF Test-Case Execution (3 packet sent to 10.10.10.253 via ping)
cookie=0x8000003, duration=12.320s, table=21, n_packets=0, n_bytes=0, priority=42,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.253 actions=set_field:f6:b3:8f:89:e1:2c->eth_dst,load:0x1d00->NXM_NX_REG6[],resubmit(,220) (Flow to reach destination been programmed, but not hit) cookie=0x8000003, duration=128.005s, table=21, n_packets=3, n_bytes=294, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22 (all 3 packets been punted to CSC) cookie=0x8000004, duration=2450.898s, table=22, n_packets=3, n_bytes=294, priority=0 actions=CONTROLLER:65535 cookie=0x8000004, duration=100.459s, table=25, n_packets=6, n_bytes=588, priority=10,ip,dl_dst=fa:16:3e:79:e3:6a,nw_dst=10.10.10.10 actions=set_field:41.0.0.7->ip_dst,write_metadata:0x30d7c/0xfffffe,goto_table:27 cookie=0x8000004, duration=100.459s, table=26, n_packets=9, n_bytes=882, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_src=41.0.0.7 actions=set_field:10.10.10.10->ip_src,write_metadata:0x30d76/0xfffffe,goto_table:28 cookie=0x8000004, duration=100.459s, table=27, n_packets=6, n_bytes=588, priority=10,ip,metadata=0x30d7c/0xfffffe,nw_dst=41.0.0.7 actions=resubmit(,21) cookie=0x8000004, duration=100.459s, table=28, n_packets=9, n_bytes=882, priority=10,ip,metadata=0x30d76/0xfffffe,nw_src=10.10.10.10 actions=set_field:fa:16:3e:79:e3:6a->eth_src,resubmit(,21)
When compared with pass case, only first packet punted to CSC and subsequent packet been hit the actual FIB rule(looks like the first packet is lost here too)
cookie=0x8000003, duration=112.184s, table=21, n_packets=1, n_bytes=98, priority=34,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d76/0xfffffffffe,goto_table:22 cookie=0x8000003, duration=5.254s, table=21, n_packets=2, n_bytes=196, priority=42,ip,metadata=0x30d76/0xfffffe,nw_dst=10.10.10.253 actions=set_field:7a:1e:56:99:cb:cc->eth_dst,load:0x1e00->NXM_NX_REG6[],resubmit(,220)
Request to let me know your inputs on this.
Thanks in Advance, Chetan |
| Comment by Chetan Arakere Gowdru [ 09/Mar/18 ] |
|
Hi Sam,
I don’t see the current PNF issue is related to any of these bug list(but might be due to delayed installation of FIB flow for dest-10.10.10.253)
@Kiran,
From the karaf logs, I see by the time FIB flow is processed, all the 3 ICMP packet been punted to CSC.
3 ping packets punted to CSC 2018-03-08T15:49:30,875 | INFO | pool-50-thread-1 | SubnetRoutePacketInHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027 2018-03-08T15:49:31,877 | INFO | pool-50-thread-1 | SubnetRoutePacketInHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027 2018-03-08T15:49:32,878 | INFO | pool-50-thread-1 | SubnetRoutePacketInHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | SUBNETROUTE: onPacketReceived: Processing IPv4 Packet received with Source IP 10.10.10.10 and Target IP 10.10.10.253 and vpnId 100027
ARP response obtained immediated in 30 sec 2018-03-08T15:49:30,884 | INFO | pool-50-thread-1 | ArpNotificationHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | Received ARP for sender MAC F6:B3:8F:89:E1:2C and sender IP 10.10.10.253 via interface 229136718462639:br-physnet1-pa:trunk 2018-03-08T15:49:30,884 | INFO | pool-50-thread-1 | ArpNotificationHandler | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | ARP being processed for Source IP 10.10.10.253
Creation of FIB entry for this IP(10.10.10.253) is been done after in 32 sec 2018-03-08T15:49:32,898 | INFO | ForkJoinPool-1-worker-0 | FibUtil | 350 - org.opendaylight.netvirt.fibmanager-impl - 0.6.0.SNAPSHOT | addOrUpdateFibEntry: Created/Updated vrfEntry for rd d1075ac5-6277-4e63-bc18-13330d751f73 prefix 10.10.10.253/32 nexthop [] label 0 l3vni 0 origin LOCAL encapType null 2018-03-08T15:49:32,899 | INFO | ForkJoinPool-1-worker-0 | VpnInterfaceManager | 367 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.SNAPSHOT | update: new Adjacency 10.10.10.253/32 with nextHop null label null subnet Uuid [_value=d1075ac5-6277-4e63-bc18-13330d751f73] added to vpn interface 229136718462639:br-physnet1-pa:trunk on vpn dd4397b6-25c6-4532-84f6-cabc71760846 dpnId 229136718462639
So, I suspect due to delayed installation of FIB flows for PNF, all the packets(3) been punted to CSC and been lost.
Thanks, Chetan |
| Comment by Chetan Arakere Gowdru [ 09/Mar/18 ] |
|
The fix https://git.opendaylight.org/gerrit/#/c/69263/ is handling the OptimisticLockFailedException. Post this fix, two CSIT jobs been ranned and this exception not observed. But, the PNF UC is still failed during these run due to delayed installation of actual FIB flow for PNF(10.10.10.253) as commented above. This needs to be investigated.
|
| Comment by Daniel Farrell [ 09/Mar/18 ] |
|
Thanks for the patch xcheara. Are you driving progress on resolving the remaining failure? |
| Comment by Chetan Arakere Gowdru [ 09/Mar/18 ] |
kiran.n.upadhyaya@ericsson.com
|
| Comment by Kit Lou [ 11/Mar/18 ] |
|
https://git.opendaylight.org/gerrit/#/c/69263/ ready for merge |
| Comment by Daniel Farrell [ 12/Mar/18 ] |
|
Patch is merged, thanks shague. jluhrsen et al are worried that something may still be going wrong under the hood, may a perf regression (used to take 1 second, now takes 3 or more). Three or four times worse perf could be a big deal if it's code that might be very active in some use case and we're just not hitting that code much in our CSIT. ecelgp - To check on perf regressions in flow programming. In quick check he doesn't see regression from Nitrogen to Oxygen for flow programming time. shague - To work with kiranvasudeva to try to clarify exactly what logic path is taking so long, seems to be narrowly scoped problem. Discussion about if this should be a blocker is ongoing. Will do more digging and see how it looks when we know more. |
| Comment by Kiran Upadhyaya [ 13/Mar/18 ] |
|
FIxed the two second delay in PNF route learning here: https://git.opendaylight.org/gerrit/#/c/69429/
|
| Comment by Daniel Farrell [ 13/Mar/18 ] |
|
eupakir - Awesome, thanks for tracking down the root cause. shague - Please review 69429. |
| Comment by Jamo Luhrsen [ 19/Oct/18 ] |
|
this is still happening. |
| Comment by Shaik Zakir Basha [ 22/Oct/18 ] |
|
Hi Kiran/all, From the logs, we notice that the remote FIB flow for ip 10.10.10.253 is not getting installed in failure case though the packet is punted to controller, ping has failed due to this Please find below the flows in both success and failure cases Success case: cookie=0x8000003, duration=96.736s, table=21, n_packets=4, n_bytes=392, priority=34,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d70/0xfffffffffe,goto_table:22 cookie=0x8000010, duration=2781.341s, table=22, n_packets=2, n_bytes=196, priority=0,ip actions=CONTROLLER:65535,learn(table=22,hard_timeout=10,priority=10,cookie=0x8000010,eth_type=0x800,NXM_OF_IP_DST[],OXM_OF_METADATA[1..23]) cookie=0x8000003, duration=11.138s, table=21, n_packets=2, n_bytes=196, priority=42,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.253 actions=set_field:2a:44:f6:95:27:78->eth_dst,load:0x1000->NXM_NX_REG6[],resubmit(,220) Failure case: cookie=0x8000003, duration=74.976s, table=21, n_packets=3, n_bytes=294, priority=34,ip,metadata=0x30d70/0xfffffe,nw_dst=10.10.10.0/24 actions=write_metadata:0x1393030d70/0xfffffffffe,goto_table:22 cookie=0x8000010, duration=2759.581s, table=22, n_packets=1, n_bytes=98, priority=0,ip actions=CONTROLLER:65535,learn(table=22,hard_timeout=10,priority=10,cookie=0x8000010,eth_type=0x800,NXM_OF_IP_DST[],OXM_OF_METADATA[1..23]) Regards, |
| Comment by Chetan Arakere Gowdru [ 29/Oct/18 ] |
|
Reassigned Bug to Kiran for further debugging on Subnet Route Handling Feature. |
| Comment by Jamo Luhrsen [ 30/Nov/18 ] |
|
anything going on with this? we see it periodically still. Just saw it in our new apex gate job |
| Comment by Kiran Upadhyaya [ 30/Nov/18 ] |
|
Not yet jluhrsen. I haven't been able to further debug it. |