[NETVIRT-515] CSIT Sporadic failures - tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot Created: 07/Mar/17  Updated: 15/Dec/17  Resolved: 27/Sep/17

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Chetan Arakere Gowdru
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File test_server_connectivity_reboot.txt    
External issue ID: 7919

 Description   

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-upstream-learn-boron/163/archives/log.html.gz#s1-s2-s1-t12



 Comments   
Comment by Jamo Luhrsen [ 08/Mar/17 ]

(In reply to Jamo Luhrsen from comment #0)
> https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-
> newton-upstream-learn-boron/163/archives/log.html.gz#s1-s2-s1-t12

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-mitaka-upstream-learn-boron/224/archives/log.html.gz#s1-s2-s1-t12

Comment by Chetan Arakere Gowdru [ 04/Apr/17 ]

Attachment test_server_connectivity_reboot.txt has been added with description: logs specific to TC - test_server_connectivity_reboot

Comment by Chetan Arakere Gowdru [ 04/Apr/17 ]

Hi Koby/Vivek,

As pointed out earlier, it looks like as part of FIP VM response is getting hit by (26->46) as a result of which the packet where punted to controller and lost.(I can see packets continuously punted to controller for the full 2 min duration of retry)

Ex:(flows from my local system):

1) cookie=0x8000004, duration=6693.738s, table=26, n_packets=0, n_bytes=0, priority=10,ip,metadata=0x30d42/0xfffffe,nw_src=10.0.0.5 actions=set_field:172.160.0.100->ip_src,write_metadata:0x30d48/0xfffffe,goto_table:28
2) cookie=0x8000006, duration=6694.508s, table=26, n_packets=0, n_bytes=0, priority=5,ip,metadata=0x30d42/0xfffffe actions=goto_table:46
3) cookie=0x81286aa, duration=6971.619s, table=46, n_packets=0, n_bytes=0, priority=5,ip,metadata=0x30d42/0xfffffe actions=CONTROLLER:65535,write_metadata:0x30d42/0xfffffe

If such packet where required to be punted controller, then the possibility is missing rule-1(internal-ip to external-ip DNAT translation rule)

From the NAT logs, I can clearly see request to program rule-1 is been sent, but I do observe openflow error during is TC run(I’m not sure if these are related to rule-1 programming)

NAT logs
Line 340: 2017-03-08 15:22:30,302 | INFO | pool-47-thread-2 | FloatingIPListener | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : Bulding DNAT Flow entity for ip 10.10.10.49
Line 341: 2017-03-08 15:22:30,302 | INFO | pool-47-thread-2 | FloatingIPListener | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : Bulding DNAT Flow entity for ip 10.10.10.49
Line 342: 2017-03-08 15:22:30,303 | INFO | pool-47-thread-2 | FloatingIPListener | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : Building PSNAT Flow entity for ip 10.100.0.3
Line 343: 2017-03-08 15:22:30,303 | INFO | pool-47-thread-2 | FloatingIPListener | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | Building SNAT Flow entity for ip 10.100.0.3
Line 344: 2017-03-08 15:22:30,304 | INFO | pool-47-thread-2 | FloatingIPListener | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : Updating operational DS for floating ip config : 10.100.0.3 with label 70059
Line 345: 2017-03-08 15:22:30,305 | INFO | pool-47-thread-2 | NatUtil | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : ADD: Adding Fib entry rd 94f9c934-83ab-4193-bc92-baac147ce211 prefix 10.10.10.49/32 nextHop 10.29.12.117 label 70059
Line 346: 2017-03-08 15:22:30,313 | INFO | pool-47-thread-2 | NatUtil | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | NAT Service : ADD: Added Fib entry rd 94f9c934-83ab-4193-bc92-baac147ce211 prefix 10.10.10.49/32 nextHop 10.29.12.117 label 70059
Line 347: 2017-03-08 15:22:30,313 | INFO | pool-47-thread-2 | VpnFloatingIpHandler | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | create terminatingServiceAction on DpnId = 229155452844485 and serviceId = 70059 and actions = {}
Line 352: 2017-03-08 15:22:30,317 | INFO | pool-47-thread-2 | VpnFloatingIpHandler | 326 - org.opendaylight.netvirt.natservice-impl - 0.3.3.SNAPSHOT | Successfully installed custom FIB routes for prefix 10.10.10.49

Openflow Error
2017-03-08 15:22:30,797 | WARN | entLoopGroup-9-2 | OFDecoder | 263 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.3.SNAPSHOT | Message deserialization failed
java.lang.IllegalStateException: Deserializer for key: msgVersion: 4 objectClass: org.opendaylight.yang.gen.v1.urn.opendaylight.openflow.oxm.rev150225.match.entries.grouping.MatchEntry msgType: 15 oxm_field: 127 experimenterID: null was not found - please verify that all needed deserializers ale loaded correctly
at org.opendaylight.openflowjava.protocol.impl.deserialization.DeserializerRegistryImpl.getDeserializer(DeserializerRegistryImpl.java:69)[263:org.opendaylight.openflowjava.openflow-protocol-impl:0.8.3.SNAPSHOT]
at

Unfortunately in these temptest TC, Teardown logs were taken post clean-up, there are not much useful to debug further on this issue.

With the current inputs, I’m not able to further triage on this issue. Any inputs apricated.

PFA(Logs specific this TC run)

Thanks,
Chetan

Comment by Chetan Arakere Gowdru [ 11/Apr/17 ]

Hi Chetan,

I know you're still into debugging tempest tests failures. It seems that most issues around tempest tests are resolved, and I have a feeling the rest of the bug fixes has contributed to that (fixes for table=21 rules, DHCP, etc).

However, I can still see "total meldowns" here and there (Jamo has reported a few), like this report:
https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-learn-carbon/209/archives/log.html

Just wanted to update you on my analysis so far just in case you'll run into debugging those (we're entering a holiday this week, so won't be able to continue).

The tests are basically failing on the inability to connect to VM FIP.
During the times the test tries to connect to the VM, I’m constantly seeing the ARP counters in the ODL increasing:
2017-04-09 17:36:40,781 | INFO | CountersThread | CountersDumperThread | 294 - org.opendaylight.infrautils.counters-impl - 1.1.0.SNAPSHOT | auc.arp_req_rcv_notification: +4, auc.arp_req_rcv: +4
2017-04-09 17:36:41,781 | INFO | CountersThread | CountersDumperThread | 294 - org.opendaylight.infrautils.counters-impl - 1.1.0.SNAPSHOT | auc.arp_req_rcv_notification: +3, auc.arp_req_rcv: +3
2017-04-09 17:36:42,781 | INFO | CountersThread | CountersDumperThread | 294 - org.opendaylight.infrautils.counters-impl - 1.1.0.SNAPSHOT | auc.arp_req_rcv_notification: +3, auc.arp_req_rcv: +3

My feeling is that one of relevant ARP responder rules is missing, and therefore when the connection to the VM is attempted ARP requests are constantly sent and punted to the controller.
It's impossible to tell from the counters whether those ARP requests are directed to the FIP address OR to the VM router interface, but my guess is those are directed to the VM router interface (because there are prints on the FIP ARP responder rules being installed).

I guess adding more logs around the installation of ARP responder rules for router interfaces might help here, or maybe even adding temporary INFO prints for ARP requests punted to the controller until we resolve those tempest issues.

Figured out this might save you some time in case you're looking into those failures this week.

Have a nice week,
Koby

Comment by Chetan Arakere Gowdru [ 11/Apr/17 ]

Hi Koby/All,

In order further investigate on this issue, I have added info logs as suggested.

1) Added info logs for Arp Responder Install flow(to print flow-name which will have the FIP/VM router interface ip ex: "Arp:tbl_81:lport_2:gw_192.168.225.5")
2) Change log from trace to info when ARP requests are punted to controller.

I hope these info logs are good enough to identify if there are ARP responder rules got missed resulting in this failure.

https://git.opendaylight.org/gerrit/#/c/54593/

Thanks,
Chetan

Comment by Jamo Luhrsen [ 25/Apr/17 ]

tempest failures appear to mostly have been resolved. We can re-open any
of the tempest bugs if/when they appear again in CSIT.

Comment by Jamo Luhrsen [ 05/May/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-transparent-boron/526/archives/log.html.gz#s1-s2-s1

Comment by Jamo Luhrsen [ 03/Aug/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-newton-nodl-v2-upstream-transparent-carbon/595/log.html.gz#s1-s2-s1-t12

Comment by Jamo Luhrsen [ 08/Aug/17 ]

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-1node-openstack-ocata-upstream-learn-carbon/89/log.html.gz#s1-s2-s1-t12

Comment by Jamo Luhrsen [ 27/Sep/17 ]

no longer seen in CSIT

Generated at Wed Feb 07 20:21:45 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.