[CONTROLLER-1330] Clustering errors under concurrent load Created: 21/May/15  Updated: 19/Oct/17  Resolved: 21/May/15

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Gary Wu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 3329

 Description   

I set up a simple 3 node cluster and ran the flow_config_blaster.py script in the integration repo against it. This script allows you to create flows using multiple threads. I found that whenever I use multiple threads (e.g. 10 threads), a substantial majority of the add-flow requests would fail. It seems that only the first request succeeds, while other concurrent requests get stuck. This is with the latest integration build from master.

The leader node produces log entries like these:

2015-05-20 13:07:18,443 | WARN | lt-dispatcher-17 | ConcurrentDOMDataBroker | 212 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Tx: DOM-336 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-inventory-config#-1821580364)] after [5000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[197:com.typesafe.akka.actor:2.3.10]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[197:com.typesafe.akka.actor:2.3.10]
at java.lang.Thread.run(Thread.java:745)[:1.7.0_75]
2015-05-20 13:07:18,446 | ERROR | lt-dispatcher-17 | LocalThreePhaseCommitCohort | 212 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Failed to prepare transaction member-1-txn-344 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-inventory-config#-1821580364)] after [5000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[197:com.typesafe.akka.actor:2.3.10]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)[194:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[197:com.typesafe.akka.actor:2.3.10]
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[197:com.typesafe.akka.actor:2.3.10]
at java.lang.Thread.run(Thread.java:745)[:1.7.0_75]

While the follower nodes produce log entries like this:

2015-05-20 13:07:14,093 | WARN | lt-dispatcher-17 | Shard | 205 - org.opendaylight.controller.sal-akka-raft - 1.2.0.SNAPSHOT | ApplyState took more time than expected. Elapsed Time = 72 ms ApplyState = ApplyState

{identifier='null', replicatedLogEntry.index =1771, startTime=1828178749622}

Sequential add-flow (e.g. one thread only) works fine. Multithreaded add-flow works fine when the cluster has only a single node.



 Comments   
Comment by Gary Wu [ 21/May/15 ]

Sequential test case that works fine:
flow_config_blaster.py --flows 10 --threads 1

Parallel test case that fails:
flow_config_blaster.py --flows 1 --threads 10

Three node cluster with every shard replicated onto every node.

Comment by Gary Wu [ 21/May/15 ]

The problem seems to have gone away with today's build:

https://nexus.opendaylight.org/service/local/repositories/opendaylight.snapshot/content/org/opendaylight/integration/distribution-karaf/0.3.0-SNAPSHOT/distribution-karaf-0.3.0-20150521.163820-1722.zip

Comment by Moiz Raja [ 21/May/15 ]

Most likely due to this fix https://git.opendaylight.org/gerrit/#/c/20890/

Generated at Wed Feb 07 19:55:16 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.