[CONTROLLER-1550] Perf test 100K flow installation does not always work in Cluster Created: 20/Sep/16  Updated: 25/Jul/23  Resolved: 05/Oct/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Sanjib Mohapatra Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf.rar    
External issue ID: 6755

 Description   

Basic flow installation is not working in Boron. Refer to the regression report below.

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-daily-only-boron/97/robot/

Steps where it is failing:

openflowplugin-bulkomatic-clustering-daily.txt.Cluster HA Data Recovery BulkFlow 2Node Cluster.Verify Flows In Switch Before Cluster Restart
Tags:
Message:
Keyword 'MininetKeywords.Mininet Sync Status' failed after retrying for 1 minute 40 seconds. The last error was: '*** s1 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=0
mininet>' contains 'flow_count=1000' 0 times, not 1 time.
no 0:01:41.536 2
openflowplugin-bulkomatic-clustering-daily.txt.Cluster HA Data Recovery BulkFlow 2Node Cluster.Verify Flows In Switch Before Follower Restart
Tags:
Message:
Keyword 'MininetKeywords.Mininet Sync Status' failed after retrying for 1 minute 40 seconds. The last error was: '*** s1 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=0
mininet>' contains 'flow_count=1000' 0 times, not 1 time.

Scenario Execution Steps:
==========================

1. Get Inventory Follower and Leader information

2. Start Mininet Connect To Follower Node1

3. Add Bulk Flow (1000 flows) From Follower Node1

4. Get Bulk Flows and Verify In Cluster config DS that all flows are installed

5. Verify Flows In Switch. The test is failing at this step: no flows are seen in the OVS switch (see the sketch below).
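
For reference, a minimal manual sketch of the checks in steps 4 and 5, assuming the default RESTCONF port 8181, admin/admin credentials, and the Mininet switch s1 from the failure message above; <controller-ip> is a placeholder and anything not quoted in the test output is an assumption:

# Step 4 (Boron-era draft02 RESTCONF path): dump the config DS inventory and inspect the flow entries
curl -s -u admin:admin http://<controller-ip>:8181/restconf/config/opendaylight-inventory:nodes

# Step 5: ask the OVS switch for its aggregate flow stats, the same OFPST_AGGREGATE
# check the test performs through Mininet; a passing run should show flow_count=1000
sudo ovs-ofctl -O OpenFlow13 dump-aggregate s1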

The follower node1 karaf logs are attached.



 Comments   
Comment by Sanjib Mohapatra [ 20/Sep/16 ]

Attachment karaf.rar has been added with description: follower node1 karaf logs

Comment by Tomas Slusny [ 20/Sep/16 ]

Isn't this a duplicate of this one? https://bugs.opendaylight.org/show_bug.cgi?id=6552

Comment by Luis Gomez [ 20/Sep/16 ]

I do not think there is a regression here; the failures are in the cluster performance test, which we have to separate into its own job and adjust so that it passes.

Comment by Luis Gomez [ 20/Sep/16 ]

Reducing severity and changing the title to describe exactly what is going on.

Comment by Luis Gomez [ 22/Sep/16 ]

The test is sporadically failing with just 32 switches and 2K flows/switch (64K flows):

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/

Therefore raising to critical for now.

Comment by Luis Gomez [ 08/Oct/16 ]

I am adding more memory to see if it helps:

https://git.opendaylight.org/gerrit/#/c/46644/

Comment by Abhijit Kumbhare [ 23/Feb/17 ]

Luis,

Did adding more memory to the test (as per your last comment) help?

Abhijit

Comment by Luis Gomez [ 23/Feb/17 ]

No it did not

Comment by Luis Gomez [ 23/Feb/17 ]

For this particular bug I have an action point to collect cluster-related ERRORs in the karaf log and open bugs against the cluster project.

Comment by Luis Gomez [ 06/Mar/17 ]

It seems that Boron is passing fine but not Carbon. For the latter I opened this bug in the controller project to better understand the ERRORs:

https://bugs.opendaylight.org/show_bug.cgi?id=7901

Comment by Abhijit Kumbhare [ 23/Mar/17 ]

Anil Vishnoi is looking at this. The patch for this was failing netvirt test cases and was reverted by https://git.opendaylight.org/gerrit/#/c/53532/.

Comment by Luis Gomez [ 14/Apr/17 ]

My current observation is that stats collection for 100K flows works most of the time in Boron:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-daily-only-boron

But never works in Carbon:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-daily-only-carbon/

So we may be facing a perf regression in Carbon.

BR/Luis

Comment by Jozef Bacigal [ 09/May/17 ]

Luis, the tests are showing

Variable '${temp_json_config_get}' not found.

Is it some issue in the test itself?

Comment by Luis Gomez [ 09/May/17 ]

I actually posted the wrong links; these are the good ones:

Boron:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-boron

Carbon:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/

BR/Luis

Comment by Miroslav Macko [ 29/May/17 ]

Hi Luis,

When I tried to reproduce this locally, I received the following warning (after all the flows were added to the config):

15159:2017-05-29 14:40:34,956 | WARN | entLoopGroup-5-2 | OperationLimiter | 207 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Failed to acquire operation permit for transaction member-1-datastore-operational-fe-0-chn-12-txn-734-0

It does not appear in the Jenkins logs. Have you seen it somewhere already?

It already happens when I try to add 1000 flows on 10 switches.

Flows are getting to the config and also to the switch.

But then the mentioned WARN appears and later some switches are disconnected.

I connected only one controller, as in the Jenkins test (one of the followers), and it timed out.

The Master was lost, and since no other controller was connected, we were finished...

I am wondering if this can happen on Jenkins also, since the logs don't seem to be the same.

Thanks,
Miro

Comment by Luis Gomez [ 30/May/17 ]

This is because in scalability/perf tests we use log:set ERROR. I will file a patch to set WARN on this test.
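
For context, a minimal sketch of the Karaf console commands involved, assuming the root logger is the one being changed (the patch in the next comment is the authoritative change):

# what the scalability/perf suites set today, so only ERROR and above reach karaf.log
log:set ERROR

# what this test will be changed to, so WARNs like the OperationLimiter one above become visible
log:set WARN

# the current level can be checked with
log:get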

Comment by Luis Gomez [ 30/May/17 ]

https://git.opendaylight.org/gerrit/#/c/58016

Comment by Miroslav Macko [ 01/Jun/17 ]

Hi Luis,

Great. Thanks a lot.

One run was completed already, but there are no logs.

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/259/

I believe it happens just occasionally, since there are also other builds without logs. Right?

Hopefully the logs will be there for the next run.

Thanks,
Miro

Comment by Luis Gomez [ 01/Jun/17 ]

Right, the next build worked: https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/260/

Comment by Miroslav Macko [ 09/Jun/17 ]

Hello Luis,

We need to check the results of this test:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-nitrogen/

But it looks like it is not working properly. The graph shows a green pass, but the tests are red and the logs are not there.

Could you please check what can be wrong?

Thanks a lot,
Miro

Comment by Luis Gomez [ 09/Jun/17 ]

There is already a thread on this; the issue is impacting more projects (not just ofp):

https://lists.opendaylight.org/pipermail/integration-dev/2017-June/009906.html

I will let you know when it gets fixed.

Comment by Luis Gomez [ 03/Jul/17 ]

Hi Miroslav, Nitrogen passes sometimes now:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-nitrogen/

But Carbon fails all the time:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/

The issue seems to be flow stats collection.

Comment by Tomas Slusny [ 12/Jul/17 ]

So it looks like this is already working on Nitrogen, and on Carbon it will work once this cherry-pick is merged: https://git.opendaylight.org/gerrit/#/c/60196/

Comment by Tomas Slusny [ 17/Jul/17 ]

The cherry-pick was merged and it looks like it is working on Carbon too now (https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-3node-periodic-gate-bulkomatic-clustering-perf-daily-only-carbon/98/), so closing this bug.

Comment by Luis Gomez [ 17/Jul/17 ]

Yes, that is fine; we have all branches behaving the same. But do we want to leave this open to track the sporadic failures in this test? Also, looking at karaf.log when it fails, it looks like we are hitting cluster issues:

2017-07-17 18:04:34,185 | WARN | tp1066220241-427 | BrokerFacade | 276 - org.opendaylight.netconf.sal-rest-connector - 1.5.2.SNAPSHOT | Error reading /(urn:opendaylight:inventory?revision=2013-08-19)nodes from datastore OPERATIONAL
ReadFailedException

{message=Error checking ReadData for path /(urn:opendaylight:inventory?revision=2013-08-19)nodes, errorList=[RpcError [message=Error checking ReadData for path /(urn:opendaylight:inventory?revision=2013-08-19)nodes, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.29.12.99:2550/), Path(/user/shardmanager-operational/member-3-shard-inventory-operational/shard-inventory-member-1:datastore-operational@0:8_1309#1833809227)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".]]}

Comment by Luis Gomez [ 25/Jul/17 ]

Just realized there is a regression if we compare to Boron:

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-boron/

Or even Nitrogen:

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-nitrogen/

So reopening the bug for this:

https://jenkins.opendaylight.org/releng/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/

Luis

Comment by Luis Gomez [ 21/Sep/17 ]

Looking at a failed job log, I think this is a cluster issue, so moving the bug to the right project:

2017-09-17 15:53:44,930 | WARN | tp1225015068-434 | BrokerFacade | 276 - org.opendaylight.netconf.sal-rest-connector - 1.5.2.SNAPSHOT | Error reading /(urn:opendaylight:inventory?revision=2013-08-19)nodes from datastore OPERATIONAL
ReadFailedException

{message=Error checking ReadData for path /(urn:opendaylight:inventory?revision=2013-08-19)nodes, errorList=[RpcError [message=Error checking ReadData for path /(urn:opendaylight:inventory?revision=2013-08-19)nodes, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.29.14.94:2550/), Path(/user/shardmanager-operational/member-3-shard-inventory-operational/shard-inventory-member-2:datastore-operational@0:6_870#-1165819976)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".]]}

at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$1.onComplete(RemoteTransactionContext.java:184)[196:org.opendaylight.controller.sal-distributed-datastore:1.5.2.SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:258)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.OnComplete.internal(Future.scala:256)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)[170:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[170:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)[170:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)[170:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.29.14.94:2550/), Path(/user/shardmanager-operational/member-3-shard-inventory-operational/shard-inventory-member-2:datastore-operational@0:6_870#-1165819976)]] after [5000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadData".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[170:com.typesafe.akka.actor:2.4.18]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[170:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[166:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[170:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[170:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[170:com.typesafe.akka.actor:2.4.18]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[170:com.typesafe.akka.actor:2.4.18]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]
2017-09-17 15:53:47,162 | WARN | lt-dispatcher-47 | ClusterCoreDaemon | 171 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.14.161:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.14.94:2550, status = Up)]. Node roles [member-2]
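
As a side note, the "after [5000 ms]" in the AskTimeoutException above is the distributed datastore operation timeout. A hedged sketch of where it could be raised for experiments, assuming the standard controller datastore configuration file and property name (verify against the deployed distribution before relying on it):

# etc/org.opendaylight.controller.cluster.datastore.cfg
# Maximum time for Akka-backed datastore operations (such as the ReadData above) to complete.
# The default of 5 seconds matches the 5000 ms in the stack trace; 30 is only an illustrative value.
operation-timeout-in-seconds=30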

Comment by Luis Gomez [ 21/Sep/17 ]

Forgot the link to the failing job:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-periodic-bulkomatic-clustering-perf-daily-only-carbon/365/

Comment by Vratko Polak [ 21/Sep/17 ]

> AskTimeoutException

Let me guess. Ongoing transactions (read by Restconf in this case) while isolation (or JVM kill) happens?
That is expected to fail when using the ask-based protocol. Try using the tell-based protocol (by prepending integration/test/csit/suites/controller/dom_data_broker/restart_odl_with_tell_based_true.robot to the testplan).

TODO: Add a link to official documentation on the tell-based protocol here.
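
For readers unfamiliar with the two protocols, a hedged sketch of what that robot suite effectively toggles, assuming the standard controller datastore configuration file and property name (treat both as assumptions and check the deployed defaults):

# etc/org.opendaylight.controller.cluster.datastore.cfg
# With the ask-based front-end (the default here), reads and writes fail with AskTimeoutException
# under load or isolation; the tell-based front-end is designed to handle these cases more gracefully.
use-tell-based-protocol=true

The restart_odl_with_tell_based_true.robot suite mentioned above, judging by its name, restarts ODL with this option enabled before the rest of the testplan runs.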

Comment by Luis Gomez [ 21/Sep/17 ]

This test does not take down any cluster instance; it just pushes a bunch of flows into the config DS on a member that is neither the leader nor the switch master (worst case).

Comment by Luis Gomez [ 27/Sep/17 ]

Changing to the tell-based protocol seems to help in Carbon but not in Nitrogen; I am not sure why. I will prepare a patch to change the test anyway.

Comment by Luis Gomez [ 27/Sep/17 ]

https://git.opendaylight.org/gerrit/#/c/63592/

Comment by OpenDaylight Release [ 03/May/18 ]

Luis, moving this to you for now since Sanjib is no longer active. 

Comment by Luis Gomez [ 03/May/18 ]

I am not sure we can do much here: the same 100K-flow perf test works for a single node, but performance obviously takes a hit when the cluster is used, due to the backend data replication.

Comment by Luis Gomez [ 24/May/18 ]

I am lowering the priority of this and will adjust the test to pass with fewer flows. After that I will close the ticket.

Comment by Luis Gomez [ 11/Jun/18 ]

The test passes with 80K flows. We will leave it like this for now.
