[OPNFLWPLUG-796] Flow matching function (operational flow reconciliation) is not stable Created: 11/Oct/16 Updated: 27/Sep/21 Resolved: 13/Apr/17 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Luis Gomez | Assignee: | Luis Gomez |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 6917 |
| Description |
|
To reproduce:
1) Start mininet with 1 switch: sudo mn --controller=remote,ip=127.0.0.1 --topo tree,1 --switch ovsk,protocols=OpenFlow13
2) Push this flow:
3) Check that the flow is reconciled in operational: GET restconf/operational/opendaylight-inventory:nodes/node/openflow:1/table/2
4) Delete the flow:
5) Repeat 2-4 until you see an alien ID in step 3 (normally 4-5 times) |
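The check in step 3 can be automated by comparing the flow IDs returned from operational with the IDs that were pushed to config. The following is a minimal sketch only: the JSON shape (`flow-node-inventory:table` containing a `flow` list) is an assumption about the RESTCONF response, and the sample body and IDs are made up for illustration, not taken from this bug.

```python
import json


def operational_flow_ids(table_json):
    """Return the flow IDs found in an operational table response body."""
    doc = json.loads(table_json)
    # Assumed response shape: {"flow-node-inventory:table": [{"id": 2, "flow": [...]}]}
    tables = doc.get("flow-node-inventory:table", [])
    return [flow["id"] for table in tables for flow in table.get("flow", [])]


def unexpected_ids(table_json, configured_ids):
    """Return operational flow IDs that were never pushed to config."""
    return [fid for fid in operational_flow_ids(table_json)
            if fid not in configured_ids]


# Made-up sample body: one flow with the configured ID, one with some other ID.
sample = json.dumps({
    "flow-node-inventory:table": [
        {"id": 2, "flow": [{"id": "f21"}, {"id": "not-pushed-by-us"}]}
    ]
})

print(unexpected_ids(sample, {"f21"}))
```

Comparing against the configured IDs avoids depending on the exact format the plugin uses for alien IDs.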
| Comments |
| Comment by Luis Gomez [ 11/Oct/16 ] |
|
BTW this is reproducible with latest Boron. |
| Comment by Shuva Jyoti Kar [ 11/Oct/16 ] |
|
One question I have: does the flow with the alien ID remain in the Oper-DS forever? |
| Comment by Luis Gomez [ 11/Oct/16 ] |
|
Once you get an alien ID, further flow delete/push attempts also produce an alien ID, if that is your question. |
| Comment by Luis Gomez [ 11/Oct/16 ] |
|
BTW this issue could be the same affecting the FRS test (random alien ID). |
| Comment by Shuva Jyoti Kar [ 11/Oct/16 ] |
|
(In reply to Luis Gomez from comment #3) There is a window in which the stats polling fetches the flow from the switch before the rpc add-flow callback succeeds, resulting in the alien ID. If we increase the stats polling interval, the window shrinks. Manually it is really difficult to hit that, so it is quite surprising to see that you are able to reproduce it deterministically. |
| Comment by Luis Gomez [ 11/Oct/16 ] |
|
I think there is something else:
|
| Comment by Luis Gomez [ 12/Oct/16 ] |
|
Another observation is that I have to wait 5 secs in automation between deleting and creating the flow to consistently reproduce. |
| Comment by Shuva Jyoti Kar [ 12/Oct/16 ] |
|
I tried it around 10/11 times manually today and could reproduce it only once. |
| Comment by Luis Gomez [ 12/Oct/16 ] |
|
FYI I wrote a test that consistently reproduces the issue: https://git.opendaylight.org/gerrit/#/c/46797/ Hopefully this gets merged soon. |
| Comment by A H [ 13/Oct/16 ] |
|
Based on comment #1, I believe this bug is targeted for Boron. |
| Comment by Luis Gomez [ 14/Oct/16 ] |
|
Shuva, is there a way to disable the stats polling? If this is really interfering, we can run the test without stats collection and see what happens. |
| Comment by Shuva Jyoti Kar [ 14/Oct/16 ] |
|
We can use PUT @ http://localhost:8181/restconf/config/openflow-provider-config:openflow-provider-config/ with the following configuration { } to turn stats polling off. Please do it before connecting switches, since the last time I checked it would restart the modules, and with switches connected it can result in huge errors. FYI https://git.opendaylight.org/gerrit/#/c/46718/ |
| Comment by Luis Gomez [ 14/Oct/16 ] |
|
Hi Shuva, When I disable stats polling I do not get anything in inventory or topology when the switch connects. I think something is broken when doing that, because I also see this exception in the karaf console: opendaylight-user@root>Exception in thread "Thread-297" io.netty.channel.unix.Errors$NativeIoException: bind() failed: Address already in use |
| Comment by Shuva Jyoti Kar [ 14/Oct/16 ] |
|
Did you change it before connecting the switch? |
| Comment by Luis Gomez [ 14/Oct/16 ] |
|
Yes, also in a second attempt I do not get the exception but still the switch does not show in topology or inventory. Have you tried this lately? I am using latest Boron. |
| Comment by Shuva Jyoti Kar [ 14/Oct/16 ] |
|
I had tried it about a month back; will try again and keep you posted |
| Comment by Shuva Jyoti Kar [ 14/Oct/16 ] |
|
(In reply to Shuva Jyoti Kar from comment #16) Also try provisioning a flow to see if that triggers an update? |
| Comment by Luis Gomez [ 14/Oct/16 ] |
|
Just tried with boron release and same issue so I opened this: https://bugs.opendaylight.org/show_bug.cgi?id=6941 BR/Luis |
| Comment by Luis Gomez [ 14/Oct/16 ] |
|
For this specific issue I have 2 more findings:
In particular this is what the automation does (consistently fails): Loop on: Sleep 5 |
| Comment by Shuva Jyoti Kar [ 15/Oct/16 ] |
|
I tried the 5 sec delay between deletion and repush, and I could see the alien flow ID once, but when I retried the next second it was showing the actual one. I tried about 23 times today, no luck. |
| Comment by Shuva Jyoti Kar [ 15/Oct/16 ] |
|
I tried with this flow <?xml version="1.0" encoding="UTF-8" standalone="no"?> |
| Comment by Luis Gomez [ 15/Oct/16 ] |
|
Shuva, I was wrong: the issue normally shows with big flows (a lot of matches, like the one I posted in the bug), and I was also able to reproduce it by just adding (not deleting) different big flows every 8 secs. I will update the CI test to show this too. |
| Comment by Shuva Jyoti Kar [ 15/Oct/16 ] |
|
No issues Luis. I will try it with the flow that you had reported initially and then come back tomorrow. Thanks |
| Comment by Luis Gomez [ 15/Oct/16 ] |
|
Here it is: |
| Comment by Shuva Jyoti Kar [ 17/Oct/16 ] |
|
(In reply to Luis Gomez from comment #24) I am trying with ovs2.4 and getting OFP_BMC_BAD_WILDCARDS errors. Is there any other example flow that will work with ovs2.4? |
| Comment by Miroslav Macko [ 17/Oct/16 ] |
|
(In reply to Shuva Jyoti Kar from comment #25) Hello Shuva, You can use the flow from Luis at the very top. You just need to change <ipv6-source>1234:5678:9ABC:DEF0:FDCD:A987:6543:210F/76</ipv6-source> to <ipv6-source>1234:5678:9ABC:DEF0:FDC0::/76</ipv6-source> Miro <flow xmlns="urn:opendaylight:flow:inventory"> |
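The substitution suggested above amounts to clearing the address bits beyond the /76 prefix. As a small sketch, Python's standard `ipaddress` module can derive the masked form; the helper name `normalize_prefix` is made up for illustration, and whether the stray host bits are what OVS 2.4 rejects is an inference from this thread, not something verified here.

```python
import ipaddress


def normalize_prefix(cidr):
    """Return the CIDR string with all bits beyond the prefix length cleared."""
    # strict=False accepts an address with host bits set and masks them off;
    # with strict=True the same input would raise ValueError.
    return str(ipaddress.ip_network(cidr, strict=False))


print(normalize_prefix("1234:5678:9ABC:DEF0:FDCD:A987:6543:210F/76"))
# prints 1234:5678:9abc:def0:fdc0::/76
```

The result matches the replacement value given in the comment, apart from letter case.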
| Comment by Shuva Jyoti Kar [ 17/Oct/16 ] |
|
Strangely enough, I tried this 13 times and do not see it in my set-up, with 8 secs delay between additions and a total of 10-15 flows. @Miroslav, any pointers you have? |
| Comment by Miroslav Macko [ 17/Oct/16 ] |
|
Hello Shuva, I think this happens when the flow is removed from config and statistics (at an approx. 3 second interval) bring the flow back. It is not present in DeviceFlowRegistry, so a flow with an alien ID is created. I was able to reproduce it. I will attach the karaf log. We have an idea of what could fix it. I will try. If you have some idea, please feel free to do it. Thanks, |
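The sequence described above can be modelled with a toy registry. This is an illustrative sketch only, not the plugin's DeviceFlowRegistry code: the class `ToyFlowRegistry`, its methods, the match key string, and the `alien-` ID prefix are all hypothetical.

```python
class ToyFlowRegistry:
    """Toy model: maps a flow's match key to the ID it was configured with."""

    def __init__(self):
        self._ids = {}
        self._alien_counter = 0

    def store(self, match_key, flow_id):
        """Register a flow that was pushed through config."""
        self._ids[match_key] = flow_id

    def remove(self, match_key):
        """Forget a flow that was deleted from config."""
        self._ids.pop(match_key, None)

    def id_for_stats_reply(self, match_key):
        """Resolve the ID for a flow reported by the switch in a stats reply."""
        if match_key not in self._ids:
            # No known ID for this match: invent one (hypothetical format).
            self._alien_counter += 1
            self._ids[match_key] = f"alien-{self._alien_counter}"
        return self._ids[match_key]


registry = ToyFlowRegistry()
registry.store("table=2,match=big-flow", "f21")
print(registry.id_for_stats_reply("table=2,match=big-flow"))  # configured ID

# Flow deleted from config, but a stats reply still reports it.
registry.remove("table=2,match=big-flow")
print(registry.id_for_stats_reply("table=2,match=big-flow"))  # generated ID
```

In this model the generated ID appears only when a stats reply arrives while the registry has no entry for the match, which is the window the comment points at.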
| Comment by Miroslav Macko [ 17/Oct/16 ] |
|
Attachment device-flow-registry-karaf.log has been added with description: karaf log |
| Comment by Shuva Jyoti Kar [ 17/Oct/16 ] |
|
(In reply to Miroslav Macko from comment #28) Yes Miroslav, I too guessed the same. Please try your fix; I am yet to arrive at a solution to address it permanently. However, I feel that with stats turned off we might not see this problem |
| Comment by Luis Gomez [ 17/Oct/16 ] |
|
Shuva, please save the flow Miroslav sent as f21.xml and run this script in the same folder: #!/bin/bash At least on my laptop I am getting an alien ID very consistently before the script reaches 5 attempts. |
| Comment by Luis Gomez [ 17/Oct/16 ] |
|
I think the failure is even more reproducible when we add some fixed sleep after stats have been updated in the controller, like we do in the robot test. |
| Comment by Luis Gomez [ 17/Oct/16 ] |
|
Definitely it has to do with stats collection: if I use this script, sleeping 6 secs after stats are received for the deleted flow, I hit the issue much more often: #!/bin/bash |
| Comment by Shuva Jyoti Kar [ 18/Oct/16 ] |
|
Luis, With https://git.opendaylight.org/gerrit/#/c/47004/, turning off stats would be easy. And with stats off we might not see this issue |
| Comment by Miroslav Macko [ 18/Oct/16 ] |
|
Hi guys, @Shuva if stats are turned off, flow will get to the operational only with reconciliation. Right? Is it ok? @Luis I have pushed the patch https://git.opendaylight.org/gerrit/#/c/47085/ Could you please test it? Thank you, |
| Comment by Shuva Jyoti Kar [ 18/Oct/16 ] |
|
(In reply to Miroslav Macko from comment #35) If stats are turned off, then by current design they will appear in the Oper-DS on RPC success. It would be better not to change that behaviour, since it gives applications a fair guarantee of whether the flow succeeded or not. I could not access your patch; could you please add me to it? |
| Comment by Miroslav Macko [ 18/Oct/16 ] |
|
Hello Shuva, But this is not added by RPCs. It is put to config and it is added to operational by FRM or FRS using statistics. Isn't it? Thanks, |
| Comment by Luis Gomez [ 18/Oct/16 ] |
|
Hi Miroslav, this is the result of your patch: It does fix the alien ID issue but for some reason it breaks the RPC update operation. |
| Comment by Miroslav Macko [ 19/Oct/16 ] |
|
Hello Luis, I see. I will check it. Thanks a lot. Miro |
| Comment by Shuva Jyoti Kar [ 19/Oct/16 ] |
|
(In reply to Miroslav Macko from comment #37) You are correct Miroslav! |
| Comment by Shuva Jyoti Kar [ 19/Oct/16 ] |
|
(In reply to Luis Gomez from comment #38) Luis, in the CSIT TCs, do we do some kind of a retry to check if the flow is in the operational DS? Or is it a one time check? |
| Comment by Miroslav Macko [ 19/Oct/16 ] |
|
Hi guys, I have pushed the patch: https://git.opendaylight.org/gerrit/#/c/47138/ It should work already. But I still need to check the behavior with statistics turned off. I will do it tomorrow. Thanks. |
| Comment by Luis Gomez [ 19/Oct/16 ] |
|
In the failing test the flows seem to be updated, but no flow shows up in operational after 3 sec. There are multiple retries afterwards during 5 secs (check updated flow), but in the last retry I still see no flow. I can retrigger the test to see if this is consistent or just a transient failure. |
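A retry check of the kind described above can be sketched as a generic polling helper. This is not the CSIT keyword itself; the function `wait_until`, its parameters, and the stand-in check are hypothetical, and a real check would query the operational datastore.

```python
import time


def wait_until(check, timeout_s=5.0, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            # Last attempt failed and the time budget is used up.
            return False
        sleep(interval_s)


# Stand-in check that succeeds on the third attempt.
attempts = []


def flow_visible():
    attempts.append(1)
    return len(attempts) >= 3


# sleep is replaced with a no-op so the demo finishes immediately.
print(wait_until(flow_visible, sleep=lambda seconds: None))
```

Passing `sleep` and `clock` as parameters keeps the helper testable without real delays.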
| Comment by Luis Gomez [ 19/Oct/16 ] |
|
Transient failure discarded: something in Miroslav's patch impacts stats collection after the RPC update operation. BR/Luis |
| Comment by Miroslav Macko [ 19/Oct/16 ] |
|
(In reply to Luis Gomez from comment #44) Hello Luis, Did this test (175) run on this patch? https://git.opendaylight.org/gerrit/#/c/47138/ Thanks, |
| Comment by Luis Gomez [ 19/Oct/16 ] |
|
No, still the old patch, you want to run the test with the new patch? I can do that too. |
| Comment by Miroslav Macko [ 20/Oct/16 ] |
|
Hello Luis, It would be great if you could test it. It should fix RPC updates in test: /test/csit/suites/openflowplugin/Flows_Additional_TCs/Stat_Manager_extended/020_SM_sal_add_upd_del_flows.robot But I still need to test the use case when statistics are off. I will let you know when it is done. Thanks a lot, |
| Comment by Miroslav Macko [ 20/Oct/16 ] |
|
Hello Luis, I have pushed the patch: https://git.opendaylight.org/gerrit/#/c/47138/ Could you please test it? I am not seeing the problem with "add port" locally. Please let me know if the problem persists. Thank you, |
| Comment by Shuva Jyoti Kar [ 23/Oct/16 ] |
|
Pushed on master: and stable/bo |
| Comment by Luis Gomez [ 24/Oct/16 ] |
|
Yes, this is fixed now. |
| Comment by Luis Gomez [ 08/Dec/16 ] |
|
We have to reopen this bug for stable boron as current fix has been reverted because it changes plugin behavior. |
| Comment by A H [ 13/Dec/16 ] |
|
Are there plans to fix it now, or can we retarget for Boron-SR3 for more time? |
| Comment by Luis Gomez [ 13/Dec/16 ] |
|
Lowering priority, as this does not impact forwarding, just operational info, and some conditions are required in order to reproduce (e.g. big flows pushed at a very specific interval) |
| Comment by Tomas Slusny [ 24/Feb/17 ] |
|
Pushed correct fix to Gerrit: https://git.opendaylight.org/gerrit/#/c/52237/ I tested it with Luis's reproduction steps: without this patch the bug was still present, but after my changes I was not able to reproduce it anymore. I also tested the case that the previous patch was not checking for, which is manually adding a flow via the OVS CLI, and it got properly propagated to operational. Luis, can you check this and verify that it resolves this issue? |
| Comment by Luis Gomez [ 24/Feb/17 ] |
|
Hi Tomas, to verify you just need to run the test-openflowplugin-core, and check this job: https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-1node-flow-services-only-carbon/532/archives/log.html.gz It seems the issue is still there (see the |
| Comment by Luis Gomez [ 13/Mar/17 ] |
|
I think current patch is good. |
| Comment by Tomas Slusny [ 16/Mar/17 ] |
|
Patch was merged, so I think that this one can be closed, right? |
| Comment by Luis Gomez [ 19/Mar/17 ] |
|
Tomas, can you cherry pick for stable/boron? |
| Comment by Luis Gomez [ 19/Mar/17 ] |
|
Reopen so we do not forget to cherry-pick. |
| Comment by Tomas Slusny [ 20/Mar/17 ] |
|
Created cherry-pick for stable/boron here: https://git.opendaylight.org/gerrit/#/c/53545/ |
| Comment by Abhijit Kumbhare [ 23/Mar/17 ] |
|
Raise to blocker - so that the patch can be added in SR 3. |
| Comment by Luis Gomez [ 28/Mar/17 ] |
|
Changing milestone to SR4. |
| Comment by Tomas Slusny [ 03/Apr/17 ] |
|
Patch that fixes performance regression: https://git.opendaylight.org/gerrit/#/c/53972/ Can be cherry-picked to boron (requires also re-revert of original patch). |
| Comment by Luis Gomez [ 03/Apr/17 ] |
|
Sounds good, we can merge all these changes after boron is unlocked. |
| Comment by Jozef Bacigal [ 13/Apr/17 ] |
|
Merged into Boron. Closing bug. |