[OPNFLWPLUG-290] Flows failed to get install in the same openflow table when programmed from different Threads simultaneously. Created: 20/Sep/14  Updated: 27/Sep/21  Resolved: 19/Feb/15

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Madhu Venugopal Assignee: Madhu Venugopal
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
is duplicated by OPNFLWPLUG-272 no implement:Sending BarrierMessage a... Resolved
External issue ID: 1997

 Description   

When an openflow node connects to the controller, ovsdb's net-virt-provider bundle programs the switch with a set of pipeline flows (From Table 0 -> 110 in increments of 10, except Table 10 and 70). These pipeline flows are vital for openstack operation.

But, randomly some of the flows fails to get programmed. 1 of the cases that we narrowed down is on Table 0 flow. This randomness happens based on the timing of another flow being installed from another thread. This is the LLDP flow programmed by another thread. When these 2 threads happen to operate simultaneously (or almost), one of the flows gets lost somewhere in the system.

Actual flows programmed in OVS : https://gist.github.com/05b25649edd8ec6fa91c
[Table 10 and 70 will be missing and that is intentional]. Table 0 pipeline flow is missing.

The strangeness of the situation is that, the Table0 pipeline flow (DEFAULT_PIPELINE_FLOW_0) that is missing in the actual flow-table in OVS is seen in the config tree & operational tree.

Config : http://localhost:8181/restconf/config/opendaylight-inventory:nodes
=> https://gist.github.com/cef3a914ca18cb76e330

Operational : http://localhost:8181/restconf/operational/opendaylight-inventory:nodes => https://gist.github.com/bfc9e48b3d4fde91cb11



 Comments   
Comment by Madhu Venugopal [ 20/Sep/14 ]

Reproduction steps :
https://lists.opendaylight.org/pipermail/ovsdb-dev/2014-September/000720.html

Comment by Ed Warnicke [ 20/Sep/14 ]

Madhu,

Question... are you guys writing only to the config tree?

I ask, because writing to the operational tree is normally done by the statsmanager in response to the switch reporting back the flow. The only way off the top of my head I could think of that we would see the flow in the operational tree if it wasn't on the switch is if you guys are writing to both config and operational store (not accusing, just trying to eliminate possibilities).

Comment by Madhu Venugopal [ 20/Sep/14 ]

OVSDB debug logs : https://gist.github.com/17f081e1dfe37c9ee56d
The above logs are from writeFlow() method in AbstractServiceInstance.java

CheckedFuture<Void, TransactionCommitFailedException> commitFuture = modification.submit();
try

{ commitFuture.get(); // TODO: Make it async (See CONTROLLER-624) logger.debug("Transaction success for write of Flow "+flowBuilder.getFlowName()); Thread.sleep(500); }

catch (Exception e)

{ logger.error(e.getMessage(), e); modification.cancel(); }
Comment by Madhu Venugopal [ 20/Sep/14 ]

(In reply to Ed Warnicke from comment #2)
> Madhu,
>
> Question... are you guys writing only to the config tree?

Yes. we write ONLY to the config tree.

>
> I ask, because writing to the operational tree is normally done by the
> statsmanager in response to the switch reporting back the flow. The only
> way off the top of my head I could think of that we would see the flow in
> the operational tree if it wasn't on the switch is if you guys are writing
> to both config and operational store (not accusing, just trying to eliminate
> possibilities).

I know, that is why I don't know what to do with this issue .
Everything seems normal and appropriate from the state perspective, but the flows are missing.

Comment by Tony Tkacik [ 20/Sep/14 ]

Quick questions: Are you able to see all your flows in configuration store? IF yes, this is bug of Openflowplugin/FRM/Stats Manager and not MD-SAL Infrastructure.

This multiple threads are writing to datastore?
Or which multiple threads are you referencing?

E.g callback from datastore and LLDP discovery thread?

Comment by Madhu Venugopal [ 20/Sep/14 ]

Reproduced again.. this time with pcap :

Flows : https://gist.github.com/1926c569ff46bfcfac74
(Missing DEFAULT_PIPELINE_FLOW_0 flow)

Config : https://gist.github.com/aaaa5d72a499fca19127

Operational : https://gist.github.com/592be004ee3fd21dc1c8

OVSDB Debug logs : https://gist.github.com/e31255c6ea9a82196e45

PCAP : https://www.dropbox.com/s/9rgolqu0x7enc9p/pipeline_flows_2.pcap?dl=0
[Filter with openflow_v4.type == 14 to see all the flowmods & you will see that the DEFAULT_PIPELINE_FLOW_0 (table =0 , goto_table=20) is missing.

Comment by Madhu Venugopal [ 20/Sep/14 ]

(In reply to Tony Tkacik from comment #5)
> Quick questions: Are you able to see all your flows in configuration store?
Yes.

> IF yes, this is bug of Openflowplugin/FRM/Stats Manager and not MD-SAL
> Infrastructure.

Could be. But when am opening the bug, it is difficult for me to spot it and hence opened it against MD-SAL. Please move it to appropriate component once the
root-cause is identified.

>
> This multiple threads are writing to datastore?
> Or which multiple threads are you referencing?

Both these threads are internal to OVSDB codebase.
But you can think as 2 different threads both registering to the inventory update event which triggers these flow programming.

>
> E.g callback from datastore and LLDP discovery thread?

Comment by Keith Burns [ 21/Sep/14 ]

Was requested by Ed Warnicke to provide some data on this bug from the GroupBasedPolicy ProofOfConcept perspective, which we believe is hitting the same issue.

The net that we have discovered is that flows are missing in our Proof of Concept (POC) which involves two OVS instances (each on a separate VM).

Essentially, VM:odlgbp1 runs the OpenDaylight controller AND an OVS instance odlgbp2 runs an OVS instance (both in mininet).

OVA's of both these VMs are available in my DropBox and access can be granted on request and Thomas and I are also available to share our test environment and run additional tests.

Some of our debug steps have been captured in the following google doc (please note, this is our "scratchpad" so we can document some troubleshooting history, not a clean definitive document) :
https://docs.google.com/document/d/1cTnBBdQFaoR1hvsdHFuKlzcQKU1bvxvT8lCtBaVWTjQ/edit?usp=sharing

To simplify:

1. We see the requisite expected flows via REST. ASSUMPTION: Flows are correctly placed into the CONFIG datastore.
2. We see the requisite expected flows via Karaf debugging (see link above for specific debug levels) and have correlated the flowIDs between REST and Karaf logs via a python script. ASSUMPTION: Requisite flows are being passed to FRM.
3. Via jConsole, we see the correct number of messages being ENQUEUED/DEQUEUED. ASSUMPTION: Requisite flows are being passed through OpenFlowPlugin
4. Wireshark. Have attached wireshark trace, still parsing this.

Per Thomas' comments below: The net of this is that it appears that the flows make it to openflowjava, but get lost somewhere thereafter.

From Thomas Bachman:
*********************
TARBALL of data here:https://drive.google.com/a/noironetworks.com/#folders/0BztNICcppsJKa1B6WXlFYmFtWFE
The tarball has the following directories:

configstore/
jconsole/
karaf-logs/
ovs-logs
wireshark/

The configstore dir contains the result of REST queries to /restconf/config/opendaylight-inventory:nodes/node/openflow:<n>.

The jconsole directory contains a text file of the stats from the msg-spy-service-impl RuntimeBean.

The karaf-log directory contains the karaf log files obtained during the run (note: things to look for are "xid" for openflowjava and "Calling the FlowMod RPC method on MessageDispatchService AddFlow" for openflowplugin).

The ovs-logs dir contains the result of running ovs-ofctl
dump-flows [s1|s2] -O OpenFlow13 on each vSwitch.

The wireshark dir contains the pcapng file and a text file containing the openflow messages seen on the wire between the two vSwitches.

The net of this is that it appears that the flows make it to openflowjava, but get lost somewhere thereafter.
*********************

Comment by Ed Warnicke [ 21/Sep/14 ]

Madhu,

Could you attach:

a) Your config datastore output from RESTCONF
b) Your expected flows on the switch
c) Your actual flows on the switch
d) Your wireshark capture showing missing flows

to the bug so we can look at that case?

Comment by Madhu Venugopal [ 21/Sep/14 ]

(In reply to Ed Warnicke from comment #9)
> Madhu,
>
> Could you attach:
>
> a) Your config datastore output from RESTCONF
> b) Your expected flows on the switch
> c) Your actual flows on the switch
> d) Your wireshark capture showing missing flows
>
> to the bug so we can look at that case?

I already did that yesterday :

https://bugs.opendaylight.org/show_bug.cgi?id=1997#c6

Comment by Ed Warnicke [ 21/Sep/14 ]

Madhu,

Apologies You are correct, you did that yesterday

Comment by Keith Burns [ 21/Sep/14 ]

> The wireshark dir contains the pcapng file and a text file containing the
> openflow messages seen on the wire between the two vSwitches.

Applying following filter:
openflow_v4.type == 14 && (openflow_v4.flowmod.priority ==1 || openflow_v4.flowmod.priority ==100 || openflow_v4.flowmod.priority ==110 || openflow_v4.flowmod.priority ==111 || openflow_v4.flowmod.priority ==112 || openflow_v4.flowmod.priority ==120 || openflow_v4.flowmod.priority ==121 || openflow_v4.flowmod.priority ==132 || openflow_v4.flowmod.priority ==140 || openflow_v4.flowmod.priority ==150 || openflow_v4.flowmod.priority ==200 || openflow_v4.flowmod.priority ==50 || openflow_v4.flowmod.priority ==64999 || openflow_v4.flowmod.priority ==65000)

and manually counting the expanded OpenFlow 1.3 messages in each TCP message we get 116 across both switches.

ie. Filtering on FlowMODs (14) and also the priorities we set in the GBP project, we should have 114 flows (57 per OVS instances). We are still tracking down the extra 1 flow per switch but suspect it is administration outside of our project.

BUT

Executing <sudo ovs-ofctl dump-flows <switch> -O Openflow13 | wc -l > on each OVS we get:
host1 (s1): 55 (we discount the header as wc -l returns 56 which inc. a header)
host2 (s2): 57 (as above)

This is perplexing as it would appear we are getting 116 flowmods (as per filter above) on the wire but only 112 flows total installed.

Comment by Michal Rehak [ 22/Sep/14 ]

moved barrier after message:
https://git.opendaylight.org/gerrit/#/c/11457/

Comment by Ed Warnicke [ 22/Sep/14 ]

Fix: https://git.opendaylight.org/gerrit/#/c/11457/

Comment by Michal Rehak [ 23/Sep/14 ]

Mentioned patch eliminates collisions during add session and remove session phases. Those can occur simultaneously and a collision usually results into NPE in remove session thread and registered but invalid session.

Those can cause message lost and failing rpcs.

Please retest and eventually mark as resolved.

Generated at Wed Feb 07 20:32:04 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.