[OPNFLWPLUG-918] Regression: Controller fails to delete 100K flows from switches Created: 03/Jul/17  Updated: 27/Sep/21

Status: Confirmed
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: Nitrogen, Oxygen, Fluorine
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Luis Gomez Assignee: Anil Vishnoi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
is duplicated by OPNFLWPLUG-890 ODL disconnects from switch during pe... Resolved
External issue ID: 8787

 Description   

This issue happens in Carbon and Nitrogen (not in Boron). It is tracked here:

To reproduce:

1) Start mininet linear 32:
sudo mn --controller 'remote,ip=192.168.0.1,port=6633' --topo linear,32

2) Push 100K flows (script available in int/test repo):
python odl_tester.py --threads 5 --flows 100000 --no-delete --fpr 200

Observe that after the 100K flows are added and the controller is stable, CPU usage is still very high.

3) Remove flows from inventory:
DELETE http://controller:8181/restconf/config/opendaylight-inventory:nodes/

Observe that after the 100K flows are removed and the controller is stable, there are still flows in the operational datastore that never get removed.
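
Step 3 and the follow-up check on the operational datastore can be scripted. The sketch below is only an illustration: it assumes the controller answers on controller:8181 with admin/admin credentials (not stated in this report), that the operational inventory is readable under /restconf/operational/opendaylight-inventory:nodes/, and that flows appear under a "flow-node-inventory:table" key in the JSON. The helper names are hypothetical.

```python
import base64
import json
import urllib.request

BASE = "http://controller:8181/restconf"  # host and port as used in step 3
INVENTORY = "opendaylight-inventory:nodes/"


def build_request(datastore, method, user="admin", password="admin"):
    """Build (but do not send) a RESTCONF request for the inventory root.

    The admin/admin credentials are an assumption, not taken from this report.
    """
    url = f"{BASE}/{datastore}/{INVENTORY}"
    req = urllib.request.Request(url, method=method)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    req.add_header("Accept", "application/json")
    return req


def count_flows(inventory):
    """Count flows in a parsed inventory document (assumed JSON layout)."""
    total = 0
    for node in inventory.get("nodes", {}).get("node", []):
        for table in node.get("flow-node-inventory:table", []):
            total += len(table.get("flow", []))
    return total


def delete_and_count_leftovers():
    """Delete the config inventory, then count flows left in operational."""
    urllib.request.urlopen(build_request("config", "DELETE")).close()
    with urllib.request.urlopen(build_request("operational", "GET")) as resp:
        return count_flows(json.load(resp))
```

A single read right after the DELETE may still show flows that are being removed, so calling the count repeatedly until it stops changing is probably a fairer check for flows that "never get removed".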

 

CSIT job for this test: https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-1node-periodic-scale-stats-collection-daily-only-carbon/



 Comments   
Comment by Luis Gomez [ 04/Jul/17 ]

Forgot to add the link:

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-1node-periodic-scale-stats-collection-daily-only-carbon/

Comment by Tomas Slusny [ 12/Jul/17 ]

So it looks like this is already working on nitrogen, and it will work on carbon once this cherry-pick is merged: https://git.opendaylight.org/gerrit/#/c/60196/1

Comment by Tomas Slusny [ 12/Jul/17 ]

Oh sorry, wrong 100k flows bug; this was meant for the 6755 one.

Comment by Luis Gomez [ 31/Jul/17 ]

This issue only happens in Carbon now, and it is FRM related because flows are not removed from the switch when we delete them in the inventory.

Comment by Abhijit Kumbhare [ 28/Aug/17 ]

This is a regression for Carbon (throughout Carbon) - works in Boron and in Nitrogen.

Comment by Tomas Slusny [ 13/Sep/17 ]

This is sporadically failing on Nitrogen too. Anyway, here is a patch that will resolve this issue: https://git.opendaylight.org/gerrit/#/c/63101/

Comment by Luis Gomez [ 14/Sep/17 ]

Yeah, also if you look at the karaf logs when it fails, there are a few switch disconnects after the DS is cleared in the case of nitrogen, and a bunch of disconnects in the case of carbon. This means the controller gets so busy with the DS clear operation that it misses the ECHO requests from the switches, and these start to disconnect and reconnect, aggravating the problem. If we do not fix this for nitrogen, I will open a blocker for carbon, as the issue is much more apparent in carbon.

Comment by Luis Gomez [ 17/Sep/17 ]

BTW, to support my observation I ran a couple of tests with switch ECHO messages disabled, and both passed, in carbon and in nitrogen:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-gate-scale-stats-collection-daily-only-carbon/154/

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-gate-scale-stats-collection-daily-only-nitrogen/281/

BR/Luis

Comment by Tomas Slusny [ 18/Sep/17 ]

This was caused by heavy load on the Netty thread, which was probably caused by single-layer serialization. Explanation: with multi-layer, we do the conversion on a different thread and then serialize a simple OFJ data structure. With single-layer, we do everything on the Netty thread. Overall, single-layer is faster by around 1/3, based on what I observed in YourKit during a local 100k flow test, but all the load is on the Netty thread, so the switches disconnect. The patch I mentioned before solves this by pre-serializing the data coming via single-layer and then sending raw bytes to the Netty thread, which puts almost no load on that thread.
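
The pre-serialization idea described in this comment can be sketched in a few lines. This is an illustration of the general technique only, not the OpenFlowPlugin or OFJ code: the message layout, the function names, and the use of a Python queue in place of a Netty channel are all assumptions made for the example. The point is that the expensive serialization runs on the caller's thread, and the I/O thread only moves finished bytes.

```python
import queue
import struct
import threading


def serialize(message):
    """Stand-in for the expensive conversion and serialization step.

    Encodes a hypothetical (xid, payload) pair as a length-prefixed frame.
    This is NOT the OpenFlow wire format.
    """
    xid, payload = message
    body = payload.encode()
    return struct.pack("!IH", xid, len(body)) + body


def io_loop(outbox, wire):
    """Plays the role of the I/O (Netty) thread: it only copies ready bytes."""
    while True:
        frame = outbox.get()
        if frame is None:  # sentinel: no more frames
            break
        wire.append(frame)


def send_preserialized(messages):
    """Serialize on the calling thread, then hand raw bytes to the I/O thread."""
    outbox, wire = queue.Queue(), []
    io_thread = threading.Thread(target=io_loop, args=(outbox, wire))
    io_thread.start()
    for message in messages:
        outbox.put(serialize(message))  # heavy work stays off the I/O thread
    outbox.put(None)
    io_thread.join()
    return wire
```

With this split the I/O thread stays free for small, latency-sensitive work such as answering ECHO requests, which is the behaviour the patch is aiming for.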

Comment by Luis Gomez [ 18/Sep/17 ]

Right, your patch works in oxygen:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-gate-scale-stats-collection-daily-only-oxygen/13/

Do you want to cherry-pick to the other branches now to test, or later when your patch is merged in master?

Comment by Tomas Slusny [ 22/Sep/17 ]

Probably after it is merged.

Comment by Tomas Slusny [ 25/Sep/17 ]

Created cherry-pick for stable/nitrogen: https://git.opendaylight.org/gerrit/#/c/63491/

Cherry-pick to stable/carbon will come after that.

Comment by Tomas Slusny [ 25/Sep/17 ]

stable/carbon: https://git.opendaylight.org/gerrit/#/c/63496/

Comment by Tomas Slusny [ 26/Sep/17 ]

Alright, I updated the patch for stable/carbon; once again I had forgotten to actually load the configuration flag that enables pre-serialization.

Comment by Sam Hague [ 26/Sep/17 ]

Tomas, does this patch also cover what was in https://git.opendaylight.org/gerrit/#/c/62792/? That patch was also tagged bug-8787 and was abandoned.

That abandoned patch was also supposed to fix https://bugs.opendaylight.org/show_bug.cgi?id=7826, but it didn't, so I am also wondering whether 7826 is fixed by this patch now?

Comment by Tomas Slusny [ 26/Sep/17 ]

No, the flow-related patch tried to improve the performance of flow deletion, but it did not (and it also caused more issues), so I abandoned it, as the flow deletion method we are currently using seems more performant. Also, that patch was never supposed to solve OPNFLWPLUG-858; I just told Sunil about it because he was looking at device registries, and I had made some modifications to the registries and to the deletion of marks there.

The current patch that solves this issue does not modify the registry at all, so it will not help OPNFLWPLUG-858 once it is merged.

Comment by Luis Gomez [ 27/Sep/17 ]

The last patch seems to work, so we can close this bug after it is merged.

Comment by Tomas Slusny [ 29/Sep/17 ]

The patches were not working for extensions, so I updated them to include real calls to the serialization registry. Unfortunately that required changes in OFJ, and for stable/carbon it is still a separate project, so here is an additional patch for OFJ for stable/carbon: https://git.opendaylight.org/gerrit/#/c/63843/

Comment by Luis Gomez [ 04/Oct/17 ]

We have this issue in all branches because no patch has been merged yet.

Comment by Abhijit Kumbhare [ 16/Oct/17 ]

Moving it to Arun.

Comment by Abhijit Kumbhare [ 30/Oct/17 ]

Anil is reviewing this set of patches:

https://git.opendaylight.org/gerrit/#/q/topic:bug/8787+status:open

Comment by Luis Gomez [ 06/Mar/18 ]

From last test results:

  • Carbon controller is not able to delete flows from the switches: this is where we see switches disconnecting.
  • Nitrogen and Oxygen controllers delete the flows from the switches, but the stats are not correct (flows still show in the operational inventory).

Comment by Luis Gomez [ 11/Jun/18 ]

This job started to work after adjusting 2 things:

  • poll stats interval: 3 -> 10 sec.
  • disable switch echo messages.

I think we still have the issue of echo messages not being answered when the controller is busy, so I will leave this bug open for now and reduce the severity to Major.

Comment by Anil Vishnoi [ 11/Jun/18 ]

Ideally this test should work without the workarounds mentioned above. This bug is kept open so the issue can be addressed without them.

Comment by Anil Vishnoi [ 25/Jun/18 ]

Given that we have a workaround for this issue, reducing the priority.

Generated at Wed Feb 07 20:33:43 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.