[OPNFLWPLUG-875] Switch scalability regression due to missing table miss flows Created: 29/Mar/17  Updated: 27/Sep/21  Resolved: 16/Oct/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8103

 Description   

This is detected here for both carbon and boron:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-carbon/plot/Switch%20Scalability/

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-boron/

Basically, the scalability test fails because of missing table-miss flows, which are required for topology discovery.



 Comments   
Comment by Tomas Slusny [ 30/Mar/17 ]

Isn't this a duplicate of 7770?

Comment by Luis Gomez [ 30/Mar/17 ]

Very possible; we can fix 7770 and then recheck this one.

Comment by Luis Gomez [ 30/Mar/17 ]

Note also that 7770 is cluster related, while this one happens with a single instance.

Comment by D Arunprakash [ 28/Apr/17 ]

Hi Luis,
Could you please let us know from which build number you have seen this issue?

Now the scalability TCs are passing in both boron and carbon.

Regards,
Arun

Comment by Luis Gomez [ 28/Apr/17 ]

Yes, they are passing, but see the regression from 500 to 200 switches:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-boron/plot/Switch%20Scalability/

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-carbon/plot/Switch%20Scalability/

The regression happened a while back, around the beginning of the year. BTW, if I remove the topology test, the test goes back to 500 switches, because the issue is related to missing table-miss flows.

BR/Luis

Comment by D Arunprakash [ 03/May/17 ]

Thanks Luis.

Since this is new to me, could you please share the steps to disable the topology tests and run the regression?

Regards,
Arun

Comment by Luis Gomez [ 09/May/17 ]

To reproduce the issue, you can just generate multiple iterations of a mininet linear topology with 100, 200, 300 nodes, etc... After a while, you will observe that not all links are properly discovered. Nodes are fine, though.
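The reproduction loop above can be sketched as a small helper that builds the mininet invocations; the controller IP, the OVS switch options, and the topology sizes are illustrative assumptions, not values taken from the CSIT job:

```python
# Hedged sketch: build the mininet command lines used to reproduce the issue
# by connecting linear topologies of increasing size to a remote ODL
# controller. Values here (IP, sizes) are assumptions for illustration.

def mn_cmd(switches: int, controller_ip: str = "127.0.0.1") -> str:
    """Return a mininet command line for a linear topology of N switches
    pointed at a remote OpenFlow 1.3 controller."""
    return (
        f"sudo mn --topo linear,{switches} "
        f"--controller remote,ip={controller_ip} "
        f"--switch ovsk,protocols=OpenFlow13"
    )

if __name__ == "__main__":
    # Iterate the sizes mentioned above; after each run, verify link
    # discovery in the ODL topology before moving to the next size.
    for n in (100, 200, 300):
        print(mn_cmd(n))
```

After each iteration, the ODL operational topology can be compared against the expected linear topology to spot the undiscovered links.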

Comment by Luis Gomez [ 10/May/17 ]

I always thought the issue was in the table-miss application, but actually the issue could be this ERROR, present in all the karaf logs [1] for this test:

2017-05-10 07:13:04,862 | ERROR | pool-27-thread-1 | OutboundQueueProviderImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.Carbon | No queue present, failing request

[1] https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-carbon/259/archives/odl1_karaf.log.gz

Comment by Luis Gomez [ 10/May/17 ]

Note this issue can be connected to: https://bugs.opendaylight.org/show_bug.cgi?id=8401

Comment by Luis Gomez [ 10/May/17 ]

Note that the karaf log for the scale/perf test is set to ERROR only. I can also enable full debug if required.

Comment by D Arunprakash [ 11/May/17 ]

Luis,
Could you please enable the debug logs and run the test again, so we can confirm whether it is the same issue that was reported in

https://bugs.opendaylight.org/show_bug.cgi?id=8401

Regards,
Arun

Comment by Luis Gomez [ 14/May/17 ]

The log is too big to attach; you can grab it from here:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-carbon/272/archives/odl1_karaf.log.gz

After looking at it, it does not look like the same issue as OPNFLWPLUG-887.

BR/Luis

Comment by D Arunprakash [ 15/May/17 ]

The following exception is seen several times in the karaf log:

2017-05-14 18:54:15,488 | WARN | entLoopGroup-7-5 | OFFrameDecoder | 182 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.9.0.SNAPSHOT | Unexpected exception from downstream.
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)
2017-05-14 18:54:15,490 | WARN | entLoopGroup-7-5 | OFFrameDecoder | 182 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.9.0.SNAPSHOT | Closing connection.
2017-05-14 18:54:15,491 | INFO | entLoopGroup-7-5 | SystemNotificationsListenerImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | ConnectionEvent: Connection closed by device, Device:/10.29.12.106:46524, NodeId:openflow:78

2017-05-14 19:06:26,955 | WARN | entLoopGroup-7-2 | OFFrameDecoder | 182 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.9.0.SNAPSHOT | Unexpected exception from downstream.
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source)
2017-05-14 19:06:26,956 | WARN | entLoopGroup-7-2 | OFFrameDecoder | 182 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.9.0.SNAPSHOT | Closing connection.
2017-05-14 19:06:26,956 | INFO | entLoopGroup-7-2 | SystemNotificationsListenerImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | ConnectionEvent: Connection closed by device, Device:/10.29.12.106:47065, NodeId:openflow:271

I'm seeing the below error as well...
=============================================

2017-05-14 19:06:24,160 | ERROR | pool-31-thread-1 | OutboundQueueProviderImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | No queue present, failing request
2017-05-14 19:06:24,160 | WARN | pool-31-thread-1 | RpcContextImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:166
2017-05-14 19:06:24,161 | ERROR | pool-31-thread-1 | OutboundQueueProviderImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | No queue present, failing request
2017-05-14 19:06:24,161 | WARN | pool-31-thread-1 | RpcContextImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:29
2017-05-14 19:06:24,162 | ERROR | pool-31-thread-1 | OutboundQueueProviderImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | No queue present, failing request
2017-05-14 19:06:24,165 | WARN | pool-31-thread-1 | RpcContextImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:180
2017-05-14 19:06:24,165 | ERROR | pool-31-thread-1 | OutboundQueueProviderImpl | 193 - org.opendaylight.openflowplugin.impl - 0.4.0.SNAPSHOT | No queue present, failing request

Comment by Luis Gomez [ 16/May/17 ]

OK, I just reproduced the issue on my laptop using 300 switches. I can see all switches come up fine, but some do not have the table-miss flow. There is nothing relevant in karaf.log when this happens. Since the table-miss flow is installed by a test application, I will modify the Robot test to skip this app and use NB-pushed table-miss flows instead.
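For context, a NB-pushed table-miss flow is just a priority-0, match-all flow that punts unmatched packets to the controller. A minimal sketch of such a payload in the OpenDaylight flow-node-inventory JSON shape (the flow id, node id, and localhost:8181 endpoint are assumptions for illustration; PUT the payload to the URL with admin credentials, e.g. via curl):

```python
import json

def table_miss_flow(flow_id: str = "1", table_id: int = 0) -> dict:
    """Build a priority-0, match-all flow that sends unmatched packets to
    CONTROLLER, in the flow-node-inventory JSON shape."""
    return {
        "flow-node-inventory:flow": [{
            "id": flow_id,
            "table_id": table_id,
            "priority": 0,   # lowest priority: hit only when nothing else matches
            "match": {},     # match-all
            "instructions": {"instruction": [{
                "order": 0,
                "apply-actions": {"action": [{
                    "order": 0,
                    "output-action": {
                        "output-node-connector": "CONTROLLER",
                        "max-length": 65535,
                    },
                }]},
            }]},
        }]
    }

def restconf_url(node: str, table: int = 0, flow_id: str = "1") -> str:
    # Config-datastore RESTCONF path (assumption: controller on localhost:8181).
    return (f"http://localhost:8181/restconf/config/opendaylight-inventory:nodes"
            f"/node/{node}/flow-node-inventory:table/{table}/flow/{flow_id}")

if __name__ == "__main__":
    print(restconf_url("openflow:1"))
    print(json.dumps(table_miss_flow(), indent=2))
```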

Comment by Anil Vishnoi [ 16/May/17 ]

I believe the following patch should fix the "No queue present" issue reported in the above comment.

https://git.opendaylight.org/gerrit/#/c/56927/

Comment by Luis Gomez [ 16/May/17 ]

Good point, let me try with the latest carbon to see if things look any better.

Comment by Luis Gomez [ 16/May/17 ]

And it works in carbon; I do not see missing table-miss flows with the latest code.

Anil, is it possible to cherry-pick the change to boron?

Comment by Luis Gomez [ 16/May/17 ]

I was too quick to say it works. It seems better now, but the same logging is still present:

https://logs.opendaylight.org/sandbox/jenkins091/openflowplugin-csit-1node-periodic-sw-scalability-daily-only-carbon/2/archives/odl1_karaf.log.gz

Comment by Luis Gomez [ 20/May/17 ]

So after reproducing this locally, I can see the main issue is still the missing table-miss flow after ~200 switches. Nothing in the logs gives any hint of what the problem is.

Comment by Luis Gomez [ 22/May/17 ]

To reproduce, just generate a linear,300 or linear,400 topology and check the flows in mininet (dpctl dump-aggregate -O OpenFlow13). There should be one table-miss flow per switch; instead, some switches miss the flow, and therefore topology verification fails in this test.
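The per-switch check described above can be automated by scanning each switch's flow dump (e.g. from `ovs-ofctl dump-flows`, a per-switch variant of the `dpctl` check) for a priority-0 flow in table 0 that outputs to CONTROLLER. A sketch, with the line-matching heuristic and sample dump format as assumptions:

```python
import re

def switches_missing_table_miss(dumps: dict) -> list:
    """Given {switch_name: flow-dump text}, return the switches that have
    no priority=0 flow in table 0 punting packets to the controller."""
    missing = []
    for switch, text in dumps.items():
        found = any(
            "priority=0" in line
            and "CONTROLLER" in line
            and re.search(r"table=0\b", line)
            for line in text.splitlines()
        )
        if not found:
            missing.append(switch)
    return missing
```

Running this over all 300-400 switches after the topology settles would list exactly the switches where reconciliation dropped the table-miss flow.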

Comment by Luis Gomez [ 23/May/17 ]

So the real issue seems to be flow installation after switches connect (i.e. flow reconciliation). Whether I pre-program the flows in the DS or use the table-miss-flow feature, I always miss flows when many switches connect in a short time. The only workaround is to install the table-miss flows after all switches are fully connected to ODL. I created this patch to prove it:

https://git.opendaylight.org/gerrit/#/c/57660

Comment by Sunil Kumar M S [ 15/Jun/17 ]

Hello Luis,

I tried to reproduce the issue with the steps you provided, gradually increasing the number of switches up to 250; my laptop ran out of resources at 300.
I have cherry-picked this patch on top of the master branch:
https://git.opendaylight.org/gerrit/#/c/57814/
I observed that the table-miss flow is programmed on all the connected switches.

Please try with the patch and let us know your findings.

Thanks.

Comment by Tomas Slusny [ 26/Jun/17 ]

So the patch mentioned by Sunil was merged on both carbon and nitrogen, and it should also fix this issue. Can you confirm, Luis?

Comment by Luis Gomez [ 03/Jul/17 ]

When I test in my setup, 300 switches work fine, but when I go to 400, I still see missing table-miss flows.

Comment by Luis Gomez [ 25/Jul/17 ]

Also, this test shows that the table-miss flows are not even stable:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-flow-services-only-carbon/

Luis

Comment by Abhijit Kumbhare [ 18/Sep/17 ]

Not a blocker for Nitrogen. However, it can be checked after https://bugs.opendaylight.org/show_bug.cgi?id=9089 is merged.

Comment by Tomas Slusny [ 28/Sep/17 ]

So OPNFLWPLUG-939 was merged; can you recheck this one, Luis?

Comment by Luis Gomez [ 16/Oct/17 ]

OK, I just tested on my laptop, and this seems to be fixed now.

Generated at Wed Feb 07 20:33:37 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.