[OPNFLWPLUG-1049] Switch handshaking loops indefinitely Created: 15/Nov/18 Updated: 27/Sep/21 Resolved: 14/Nov/19 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | openflowplugin-impl |
| Affects Version/s: | Nitrogen-SR1, Fluorine |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Leonardo Milleri | Assignee: | Somashekhar Javalagi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
I'm running the ODL Fluorine stable release (opendaylight-0.9.0.tar.gz) and facing this problem: when a simulated switch (mininet) connects and disconnects very frequently, the openflowplugin enters a loop and cannot recover the handshake properly. This also affects the connection with some other switches. I can reproduce the same issue in a network of real switches (Extreme/Edgecore switches). When this problem happens there is also a severe memory leak: the number of DeviceContextImpl instances grows indefinitely.
Steps to reproduce:
1) run opendaylight-0.9.0
2) feature:install features-openflowplugin
3) run mininet (sudo mn --topo linear,20 --switch ovsk,protocols=OpenFlow13 --mac --controller remote,port=6633,ip=127.0.0.1)
4) simulate a switch disconnection by running the command "./changectrl.sh 10000 0.1" (script in attachments)
5) wait 1-2 minutes; you should see ODL trying indefinitely to regain the connection
6) stop the script; the memory leak is now growing (you can check the number of instances of DeviceContextImpl by running "jcmd <pid> GC.class_histogram | grep -e "org.opendaylight.openflowplugin.impl.device.DeviceContextImpl$")
In attachment also the karaf.log |
| Comments |
| Comment by Luis Gomez [ 25/Nov/18 ] |
|
FYI, I downloaded the latest fluorine controller from: and installed the feature odl-openflowplugin-flow-services-rest. Then I started mininet and ran the attached script on a VM with OVS 2.8.1. After the above I am not able to reproduce; maybe the CPU plays a role here. I am using a 4-CPU VM for the controller plus a 2-CPU VM for mininet. |
| Comment by Leonardo Milleri [ 26/Nov/18 ] |
|
I've just tried with the latest fluorine controller (0.9.2-SNAPSHOT) and I can still reproduce the problem. The first attempt was OK, but I then tried two more times and in both cases the problem was reproduced. To be honest, it is better to wait until the last moment, when the changectrl script ends. I'm running mininet (Open vSwitch 2.9.0) on the same machine as the ODL controller (Dell Precision 3530 laptop)
|
| Comment by Luis Gomez [ 26/Nov/18 ] |
|
OK, I changed the test setup to run on my laptop and now I see a couple of issues:
1) Switches other than the flapping one (s1) start to disconnect and reconnect after the default inactivity probe (3 sec) kicks in. This aggravates the issue, but it can easily be suppressed by setting the OVS inactivity probe to some value bigger than the controller's default inactivity probe (15 sec). You can use the following script after mininet is up to set the inactivity probe:
#!/bin/bash
x=`sudo ovs-vsctl --columns=_uuid list Controller | awk '{print $NF}'`
echo $x
for i in $x
do
  sudo ovs-vsctl set Controller $i inactivity_probe=20
done
2) After applying the above and letting the flap run for a while, I see that after stopping the flap the switch s1 cannot connect to the controller anymore. This is something we have to fix, as for now the only workaround is to restart the controller.
leonardo.milleri, can you try with the inactivity probe configuration I posted and check if you see the same (switch s1 cannot connect anymore to the controller) or any other faulty behavior? |
| Comment by Leonardo Milleri [ 27/Nov/18 ] |
|
Thank you Luis. Today, for some reason, I can't reproduce the problem with inactivity_probe=3. As for inactivity_probe=20, this is also working perfectly for me; eventually all the switches are connected (including s1). Can you please provide some more details about how the inactivity probe can affect the OpenFlow connection? That may help with reproducing the problem as well. I'll carry on doing some other tests and let you know
|
| Comment by Leonardo Milleri [ 28/Nov/18 ] |
|
Adding some more information: I've just seen the problem in a real network of Edgecore switches running PICA8 NOS; the OpenDaylight version is Nitrogen-SR1. The following log fragment is related to the issue:
2018-11-28 05:10:11,251 | WARN | tLoopGroup-16-14 | ClusterSingletonServiceGroupImpl | 397 - org.opendaylight.mdsal.singleton-dom-impl - 2.3.1 | Service group openflow:2465031308174919169 stopping unregistered service org.opendaylight.openflowplugin.impl.lifecycle.ContextChainImpl@111fb98b |
| Comment by Luis Gomez [ 28/Nov/18 ] |
|
If you are running OpenFlow in a real network, the first thing to do is to set the switches' inactivity_probe, or the equivalent timer in the switch (i.e. the time the switch waits before sending an echo request to the controller when it does not receive any packet from the controller), to more than 15 sec. The reason is that when the controller is busy it can fail to respond to the switch's echo request, and that causes the switch to disconnect and leads to further problems when the switch reconnects. |
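To make the timing relationship above concrete, here is a purely illustrative sketch (the class and method names are invented for illustration, not part of openflowplugin): a switch stays connected only if its probe interval exceeds the longest stall during which the controller misses echo replies, which is why a probe above the controller's 15 sec figure is recommended.

```java
// Illustrative model of the probe-vs-controller-stall relationship
// described in the comment above. Assumption: the switch drops the
// connection if its echo request goes unanswered for roughly the
// probe interval.
public class ProbeTiming {
    static boolean switchStaysConnected(int probeSeconds, int controllerStallSeconds) {
        // The switch tolerates a busy controller only if its probe
        // interval outlasts the controller's unresponsive window.
        return probeSeconds > controllerStallSeconds;
    }

    public static void main(String[] args) {
        System.out.println(switchStaysConnected(3, 15));   // false: default OVS probe flaps
        System.out.println(switchStaysConnected(20, 15));  // true: the recommended setting
    }
}
```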
| Comment by Somashekhar Javalagi [ 02/Dec/18 ] |
|
leonardo.milleri, the script in the attached file has only a connect command for the switch; should it be modified to trigger a disconnect as well? |
| Comment by Luis Gomez [ 03/Dec/18 ] |
|
Somashekhar, if you open the script, the line that says: |
| Comment by Leonardo Milleri [ 05/Dec/18 ] |
|
ecelgp, do you know if anyone is working on this issue in ODL and how long it would take to fix it? Is there anything I can help with? In the meantime, I'll try to understand the implications of the workaround (inactivity_probe) and whether it has any side effects. |
| Comment by Somashekhar Javalagi [ 10/Dec/18 ] |
|
leonardo.milleri, can you please try testing the scenario with the gerrit review to check whether the issue still occurs? |
| Comment by Luis Gomez [ 10/Dec/18 ] |
|
Somashekhar, FYI I quickly tried your patch distribution: and I still see that, after running the flap script for a while, the switch s1 (openflow:1) cannot connect to the controller anymore. |
| Comment by Leonardo Milleri [ 12/Dec/18 ] |
|
I tried the fix on top of the stable/fluorine branch and I was able to reproduce the problem when running the script the third time (for the first 2 attempts I couldn't reproduce it). The master branch, however, does not seem to have the same problem (I ran the script 5 times), or at least it is more robust. Are you aware of any commits on the master branch actually fixing/mitigating the issue? |
| Comment by Somashekhar Javalagi [ 17/Dec/18 ] |
|
leonardo.milleri, we are planning to introduce a device connection hold time, which is the minimum amount of time a switch has to wait until it can connect again, to reduce load on the controller. We will record the switch's last connected time. If the switch connects again within the hold time, the connection will not be accepted. |
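A minimal sketch of how such a hold-time check could work, assuming a per-node timestamp map; all names here (ConnectionDampener, HOLD_TIME_MS) are hypothetical and do not reflect the actual openflowplugin patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a "device connection hold time" check:
// remember when each switch last connected successfully, and reject
// a reconnect that arrives within the hold window.
public class ConnectionDampener {
    static final long HOLD_TIME_MS = 5_000;
    private final Map<String, Long> lastConnected = new ConcurrentHashMap<>();

    /** Returns true if the incoming connection should be accepted. */
    public boolean onConnect(String nodeId, long nowMs) {
        Long last = lastConnected.get(nodeId);
        if (last != null && nowMs - last < HOLD_TIME_MS) {
            return false; // flapping: reject reconnect within hold time
        }
        lastConnected.put(nodeId, nowMs); // record last accepted connect
        return true;
    }

    public static void main(String[] args) {
        ConnectionDampener d = new ConnectionDampener();
        System.out.println(d.onConnect("openflow:1", 0));     // true:  first connect
        System.out.println(d.onConnect("openflow:1", 1_000)); // false: within hold time
        System.out.println(d.onConnect("openflow:1", 6_000)); // true:  hold time elapsed
    }
}
```

Note the design choice that a rejected reconnect does not refresh the timestamp, so a persistently flapping switch can still get back in once the hold window measured from its last accepted connection has passed.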
| Comment by Leonardo Milleri [ 18/Dec/18 ] |
|
Thank you, I'll import the attached changes and retest |
| Comment by Anil Vishnoi [ 07/Jan/19 ] |
|
ecelgp If you get a chance can you please test the latest patch for this issue? |
| Comment by Anil Vishnoi [ 28/Jan/19 ] |
|
Discussed a dampening mechanism for the 3-node cluster setup. Having a local dampening mechanism (connection dampening in the context of a single node) and a global dampening mechanism (connection dampening across the three-node cluster) would be a great value add. |
| Comment by Somashekhar Javalagi [ 30/Jul/19 ] |
|
Hi ecelgp, for this issue we are waiting for CSIT to pass before proceeding further. I am seeing some of the openflowplugin Sodium CSIT jobs failing consistently, all with the reason ConnectionError: HTTPConnectionPool(host='10.30.170.90', port=8181). Has this been seen before? I just ran CSIT on a dummy test review; the logs are below. https://jenkins.opendaylight.org/releng/job/openflowplugin-patch-test-core-sodium/83/
|
| Comment by Luis Gomez [ 30/Jul/19 ] |
|
This is due to this regression: |
| Comment by Somashekhar Javalagi [ 14/Nov/19 ] |
|
Merged in the master branch, which is Magnesium. |