[OPNFLWPLUG-790] failed to update port status - all new OVS ports fail from this point Created: 09/Oct/16  Updated: 27/Sep/21  Resolved: 19/Jun/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Arnon Yahel Assignee: Arnon Yahel
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf.logs.tar     Zip Archive karaf.zip     Zip Archive karaf.zip     Zip Archive qsvc.zip    
Issue Links:
Duplicate
is duplicated by NETVIRT-260 Attaching new interface to VM does no... Resolved
is duplicated by UNIMGR-27 sometimes the CPE port remains in dow... Resolved
External issue ID: 6908

 Description   

openflowplugin fails to update port status on creation of VM using netvirt.
it always happens on the same compute host. VMs created on other computes work fine.

this is the relevant lines in the jaraf.log
2016-10-09 07:07:17,263 | WARN | ntLoopGroup-9-15 | DeviceContextImpl | 265 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Error processing port status message for port 13 on device Uri [_value=openflow:79818282140722] : java.lang.IllegalStateException: Previous transaction member-1-datastore-operational-fe-0-chn-591-txn-6 is not ready yet
2016-10-09 07:07:17,364 | WARN | Thread-16795 | DeviceContextImpl | 265 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Error processing port status message for 0 on port 13 on device Uri [_value=openflow:79818282140722] : java.lang.IllegalStateException: Previous transaction member-1-datastore-operational-fe-0-chn-591-txn-6 is not ready yet
2016-10-09 07:07:17,475 | WARN | Thread-16795 | DeviceContextImpl | 265 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Error processing port status message for 1 on port 13 on device Uri [_value=openflow:79818282140722] : java.lang.IllegalStateException: Previous transaction member-1-datastore-operational-fe-0-chn-591-txn-6 is not ready yet
2016-10-09 07:07:17,576 | WARN | Thread-16795 | DeviceContextImpl | 265 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Error processing port status message for 2 on port 13 on device Uri [_value=openflow:79818282140722] : java.lang.IllegalStateException: Previous transaction member-1-datastore-operational-fe-0-chn-591-txn-6 is not ready yet
2016-10-09 07:07:17,576 | WARN | Thread-16795 | DeviceContextImpl | 265 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Failed to update port status for port 13 on device Uri [_value=openflow:79818282140722] even after 3 retries



 Comments   
Comment by Arnon Yahel [ 09/Oct/16 ]

Attachment karaf.logs.tar has been added with description: karaf logs

Comment by dotans [ 08/Nov/16 ]

I also encountered that problem:
repetetive creating of VMs failed

Comment by dotans [ 08/Nov/16 ]

Attachment qsvc.zip has been added with description: q-svc log (fail @08/11, 9:22)

Comment by dotans [ 08/Nov/16 ]

Attachment karaf.zip has been added with description: Karaf log (fail @08/11, 9:22)

Comment by dotans [ 13/Nov/16 ]

This bug is reproduced when using this script:
for i in

{1..1000}

;do
date;
neutron net-create net1;
neutron subnet-create net1 192.168.1.0/24 --name subnet1;
nova boot --image cirros-0.3.4-x86_64-uec --flavor m1.nano --nic net-name=net1 VM$i;
IP_NETNS=`ip netns | head -n 1`;
sleep 15;
echo sudo ip netns exec ${IP_NETNS} ping -c 3 192.168.1.3;
sudo ip netns exec ${IP_NETNS} ping -c 3 192.168.1.3;
nova delete VM$i;
neutron net-delete net1;
done
At some point, the update port failure appears and from that point, each new VM is not pingable.

Comment by Koby Aizer [ 16/Nov/16 ]

Attachment karaf.zip has been added with description: karaf.log - TransactionChainManager in DEBUG level

Comment by Alon Kochba [ 06/Dec/16 ]

openflowplugin can't figure it out - need to try to recreate with a running tcpdump and debug logs for whole openflowplugin.impl
Shuva, I think this happened when using only 2 computes - will ask Dotan to try to recreate, have you tried his script though?

From Shuva:
https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-December/006436.html

"Sorry for the late response, got pulled into multiple things.
The actual issue was reported as a part of OPNFLWPLUG-760(https://bugs.opendaylight.org/show_bug.cgi?id=6595) wherein the OpenFlow plugin failed to process a port status message because the message came from a switch before it started the service as MASTER for the switch.

Analysing the logs attached to this bug I was not able to come to any conclusion. To analyse the issue further we need the logs by setting the log level as
log:set DEBUG org.opendaylight.openflowplugin.impl

I assume that we see the issue when the number of switches(DPNs) are high, hence we need to check what is the typical delay between the MASTERSHIP grant and the ports coming up and adjust the delay accordingly.

We can also get the relevant information in a wireshark capture that would demarcate when the ROLE_REQ is processed by the switch and when the PORT STATUS is sent."

Comment by Alon Kochba [ 11/Dec/16 ]

This was reproduced using Dotan's instructions, with only 1 compute (one devstack machine that is "all in one" - controller+compute).

An updated karaf.log with all openflowplugin set to DEBUG was provided via mail on 12/8 (too large to attach here), as requested by Shuva.

Updating to blocker to at least get openflowplugin's feedback as this bug has blocked us from proceeding several times. Thanks.

Comment by A H [ 13/Dec/16 ]

Is this only for Carbon or does it apply to Boron-SR2 as well?

Comment by Alon Kochba [ 13/Dec/16 ]

(In reply to A H from comment #10)
> Is this only for Carbon or does it apply to Boron-SR2 as well?

This applies to Boron SR2.

When it occurs, that DPN is stuck in a zombie state and cannot be used further.
Waiting for analysis/feedback from openflowplugin, details and reproduction info provided to Shuva below and by Koby, at least to figure out a workaround for SR2.

Comment by A H [ 19/Dec/16 ]

As per the discussion on the following email thread below, this bug is being retargeted for Boron SR3:

https://lists.opendaylight.org/pipermail/openflowplugin-dev/2016-December/006490.html

Comment by Abhijit Kumbhare [ 11/Jan/17 ]

Is this still blocking?

Comment by Koby Aizer [ 11/Jan/17 ]

I think we can say this is no longer blocking (at least for netvirt), since we have a workaround (disabling statistics polling in openflowplugin). I believe it is still critical, as statistics collection is on by default, and users may encounter those issues which make the entire OVS unusable.

CC'd Sam & Alon - maybe they'd like to comment as well.

Comment by Tomas Slusny [ 21/Mar/17 ]

Changing importance to critical from blocking as Koby said it is non blocking anymore.

Comment by Sam Hague [ 22/Mar/17 ]

Is this on boron or just master?

Is the workaround a simple config option?

What deployment and setup does this bug happen?

Answers to the above would help to decide lowering the priority.

Comment by Tomas Slusny [ 28/Mar/17 ]

Can you test this again on this patch please? https://git.opendaylight.org/gerrit/#/c/53905/

I added collecting of early port status updates and after initial statistics gathering is done these collected port status updates will be written to operational datastore.

Comment by dotans [ 28/Mar/17 ]

I will test this patch.
usually, it takes about 1 day to reproduce.

Comment by dotans [ 02/Apr/17 ]

I'm sorry for the delay-
I encountered several problems on testing it.
As for now, it looks good.
I will give another update when finish testing it

Comment by Tomas Slusny [ 03/Apr/17 ]

Okay great

Comment by dotans [ 04/Apr/17 ]

It looks good.
The message of "failed to update port status" was not reproduced
Thanks

Comment by Tomas Slusny [ 24/Apr/17 ]

So patch was merged, is it working and can this one be closed?

Comment by Tomas Slusny [ 24/Apr/17 ]

Last patch will probably not work properly with statistics gathering disabled, so I need to make another one that tries to solve this better.

Comment by Tomas Slusny [ 27/Apr/17 ]

GERRIT: https://git.opendaylight.org/gerrit/#/c/55868/

Comment by Tomas Slusny [ 17/May/17 ]

Patch was merged, but there is also cherry-pick for stable/carbon, so it needs to be updated and merged too before this can be closed: https://git.opendaylight.org/gerrit/#/c/56132/

Comment by Tomas Slusny [ 05/Jun/17 ]

Closing this.

Comment by Tomas Slusny [ 05/Jun/17 ]

Forgot that this is still waiting for carbon merge, so leaving this in review.

Generated at Wed Feb 07 20:33:24 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.