[OPNFLWPLUG-762] When OpenFlow-capable switch goes down while RPC {add,remove,update}-flow is invoked, RESTCONF sockets are leaked Created: 02/Sep/16  Updated: 27/Sep/21  Resolved: 08/Sep/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Alexis de Talhouët Assignee: Andrej Leitner
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File add-flow.sh    
External issue ID: 6625

 Description   

When an OpenFlow-capable switch goes down while an {add,remove,update}-flow RPC is being invoked, RESTCONF sockets are leaked and end up in the CLOSE_WAIT state.


HOW TO REPRODUCE:
1. start OFP
2. feature:install odl-openflowplugin-flow-services-ui
3. connect a switch (OvS for instance)
4. run a script adding flow continuously (e.g. script attached)
5. disconnect the OvS

--> Observed results:
a. the terminal window looping on the script hangs on the last request sent
b. run lsof -i :8181 or ss -nat | grep 8181 (or any other command that lists the open sockets on port 8181):
   i. the socket is in ESTABLISHED state while a. holds
   ii. stop the loop in the terminal --> the socket goes to CLOSE_WAIT, for example:
java 37558 adetalhouet 55u IPv6 0x2ef40e38b9d5893f 0t0 TCP localhost:8181->localhost:50860 (CLOSE_WAIT)
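
To make the check in step b. repeatable, the socket states can be counted from the text printed by `ss -nat`. The following is only a minimal sketch: the function name `count_states` and the sample output are made up for illustration, and it assumes the usual `ss` column order (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port). Note that `ss` abbreviates the states as `ESTAB` and `CLOSE-WAIT`, whereas `lsof` prints `ESTABLISHED` and `CLOSE_WAIT`.

```python
from collections import Counter

# Made-up sample of `ss -nat` output, for illustration only.
SAMPLE_SS_OUTPUT = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    *:8181              *:*
ESTAB      0      0      127.0.0.1:8181      127.0.0.1:50858
CLOSE-WAIT 1      0      127.0.0.1:8181      127.0.0.1:50860
CLOSE-WAIT 1      0      [::1]:8181          [::1]:50862
ESTAB      0      0      127.0.0.1:50858     127.0.0.1:8181
"""


def count_states(ss_output, port=8181):
    """Count TCP states of sockets whose *local* port equals `port`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        local_address = fields[3]
        # rsplit keeps IPv6 addresses such as [::1]:8181 intact.
        if local_address.rsplit(":", 1)[-1] == str(port):
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    for state, count in sorted(count_states(SAMPLE_SS_OUTPUT).items()):
        print(state, count)
```

On a live system the same function could be fed `subprocess.run(["ss", "-nat"], capture_output=True, text=True).stdout`; a CLOSE-WAIT count that keeps growing after the switch is disconnected would indicate the leak described above.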

In a scaled environment, the network can flap while switches are being provisioned, causing those operations to fail.

Even though nobody has complained about this, I believe it is a critical bug, as it is fairly easy to reproduce.



 Comments   
Comment by Alexis de Talhouët [ 02/Sep/16 ]

Attachment add-flow.sh has been added with description: add-flow script

Comment by Alexis de Talhouët [ 02/Sep/16 ]

Proposed fix:

https://git.opendaylight.org/gerrit/#/c/45112/

Comment by Alexis de Talhouët [ 02/Sep/16 ]

A better fix should be provided to make this configurable.

Comment by Andrej Leitner [ 06/Sep/16 ]

Hi Alexis,
as you have already seen, I first reworked your patch to use the timeout as a config parameter. However, at this point I do not think that setting a timer to fail the RPC future after a certain limit is a good idea. What about heavy traffic and longer response times? It could accidentally fail RPC requests that we do not want to fail. We need to fail them only when we know what we are doing (e.g. the device disconnected; I am not sure why they do not fail by themselves). I will look into that.
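
The concern about a fixed timer can be illustrated with a small, self-contained sketch. It is not OpenFlowPlugin code: `slow_but_healthy_rpc` and `call_with_timeout` are hypothetical names, and a sleeping thread stands in for a device that answers late under heavy traffic.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_but_healthy_rpc(delay_s):
    """Stand-in for an RPC that the device answers late, but does answer."""
    time.sleep(delay_s)
    return "flow added"


def call_with_timeout(delay_s, timeout_s):
    """Return (verdict seen by the caller, what the device eventually replied)."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_but_healthy_rpc, delay_s)
        try:
            return ("ok", future.result(timeout=timeout_s))
        except FutureTimeout:
            # The timer gave up, yet the request still succeeds on the device.
            return ("timed out", future.result())


if __name__ == "__main__":
    print(call_with_timeout(delay_s=0.2, timeout_s=0.01))
    print(call_with_timeout(delay_s=0.0, timeout_s=1.0))
```

In the first call the caller is told the request failed even though the device applied it, which is the accidental failure described in the comment above.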

Comment by Alexis de Talhouët [ 06/Sep/16 ]

Andrej, I understand that what I've done isn't the right thing, and I believe I underlined this as well.
For sure, a good fix would be to find where the resources aren't closed, and to trigger the failure of ongoing RPCs there.
That said, I have very little knowledge of the OFP architecture, so the purpose of my patch was just to outline where the issue resides, leaving its resolution to experts like you.
If you want me to test a potential fix, let me know.

Comment by Andrej Leitner [ 07/Sep/16 ]

Hi Alexis,
more patches were merged yesterday for blocking bugs in openflowplugin (and related bugs in openflowjava). I built the current master of ofjava and ofplugin locally and tested your scenario. After the mininet disconnect:

(ad a.) I do not see the looping script hanging. It continues sending requests and getting (unsuccessful) responses.

(ad b. ii.) After stopping the loop, the socket changed to CLOSE_WAIT for a while and was then closed.

Could you test it again with latest codebase?

The open question for me is who should take care of in-flight requests during a device disconnect. Currently, the error I see in the last REST response is an OutboundQueueException from openflowjava with a "Device disconnected" message. If we actively failed the requests from openflowplugin, we could get around the exception and fail with only an error message.
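
The idea of actively failing in-flight requests on disconnect, with an error message instead of an exception, could look roughly like the sketch below. This is an assumption-laden illustration and not the plugin's actual design: the `PendingRequests` class, its methods, and the result dictionary shape are all hypothetical.

```python
from concurrent.futures import Future


class PendingRequests:
    """Tracks unresolved RPC futures per device (hypothetical helper)."""

    def __init__(self):
        self._pending = {}  # device id -> list of unresolved futures

    def submit(self, device_id):
        """Register a new in-flight request and return its future."""
        future = Future()
        self._pending.setdefault(device_id, []).append(future)
        return future

    def complete(self, device_id, future, result):
        """Resolve a request with the device's reply; ignore late replies."""
        pending = self._pending.get(device_id, [])
        if future in pending:
            pending.remove(future)
            future.set_result(result)

    def on_disconnect(self, device_id):
        """Fail every request still in flight with an error result."""
        failed = 0
        for future in self._pending.pop(device_id, []):
            future.set_result(
                {"successful": False, "error": "Device disconnected"}
            )
            failed += 1
        return failed


if __name__ == "__main__":
    tracker = PendingRequests()
    request = tracker.submit("openflow:1")
    tracker.on_disconnect("openflow:1")
    print(request.result(timeout=0))
```

Because every future is resolved as soon as the disconnect is known, no caller is left waiting, and no timer is needed that could fire on a slow but healthy device.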

Comment by Alexis de Talhouët [ 07/Sep/16 ]

> Could you test it again with latest codebase?

I recompiled ofj and ofp and re-tested the scenario, but I'm still facing the same issue. I was never able to get it working as you did. Maybe the artifacts weren't published to nexus yet.
I'll give it another try tomorrow.

Comment by Alexis de Talhouët [ 08/Sep/16 ]

I have just retried with a fresh recompilation of ofj and ofp, and I can say that this bug is indeed fixed.

AFAIK the fix is related to this patch: https://git.opendaylight.org/gerrit/#/c/45231/

So this BUG might have been a duplicate of OPNFLWJAVA-79 with a different impact.

Thanks Andrej.

Generated at Wed Feb 07 20:33:19 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.