[CONTROLLER-1072] Clustering: akka.pattern.AskTimeoutException when sending large amounts of BGP data to EXABGP Created: 16/Dec/14  Updated: 04/Jun/15  Resolved: 04/Jun/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Post-Helium
Fix Version/s: None

Type: Bug
Reporter: Jozef Behran Assignee: Harman Singh
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File 15k.cfg.xz     File test.tar.gz    
Issue Links:
Duplicate
duplicates CONTROLLER-1333 Clustering : Performance issues in BG... Resolved
External issue ID: 2518
Priority: High

 Description   

The following exception occurs when I try to send 100000 routes through BGP into any clustered setup. Tested on https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/lastSuccessfulBuild/artifact/distributions/extra/karaf/target/distribution-karaf-0.3.0-SNAPSHOT.tar.gz:

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-default-operational/shard-member-1-txn-88#1797666910)] after [5000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)[312:com.typesafe.akka.actor:2.3.4]
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[312:com.typesafe.akka.actor:2.3.4]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)[309:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)[309:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[312:com.typesafe.akka.actor:2.3.4]
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[312:com.typesafe.akka.actor:2.3.4]
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[312:com.typesafe.akka.actor:2.3.4]
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[312:com.typesafe.akka.actor:2.3.4]
at java.lang.Thread.run(Unknown Source)[:1.7.0_67]

This problem does not occur when I use a non-clustered setup.



 Comments   
Comment by Jozef Behran [ 16/Dec/14 ]

Additional information:

  • Topology provider was turned ON during all tests.
  • It is triggered even without topology (RIB alone is enough).
Comment by Jozef Behran [ 16/Dec/14 ]

After long investigation I realized that only about 6000 routes will go through before these timeouts show up. And it does not matter whether I run the topology provider or not. At 7500 I start to get this AKKA timeout error.

Comment by Jozef Behran [ 17/Dec/14 ]

To hit this bug, configure EXABGP with 10000 routes and then point it to an ODL instance configured with clustering enabled. I advise switching topology off, as RIB alone is enough to hit the bug. The attachment is an EXABGP configuration file with 15000 routes, which is fairly large but not too large.

Comment by Jozef Behran [ 17/Dec/14 ]

Attachment 15k.cfg.xz has been added with description: EXABGP configuration with 15000 routes

Comment by Moiz Raja [ 17/Dec/14 ]

How do we unzip this xz file ? tar -xzf does not work with it

Comment by Moiz Raja [ 17/Dec/14 ]

I had to download Ez7z on mac to unzip the xz attachment.

Comment by Vratko Polak [ 18/Dec/14 ]

(In reply to Moiz Raja from comment #4)
> How do we unzip this xz file ? tar -xzf does not work with it

The 'z' from -xzf is for .gz only. The letter for .xz is 'J'.
Every Linux-based tar program I have seen works well with just "tar -xf", no matter whether it is pointed at a .tar, .tar.gz (.tgz) or .tar.xz (.txz).

Comment by Vratko Polak [ 18/Dec/14 ]

(In reply to Vratko Polák from comment #6)
> (In reply to Moiz Raja from comment #4)
> > How do we unzip this xz file ? tar -xzf does not work with it

Oh, there was no .tar before .xz
In that case, "unxz" command is available in Linux distributions, usually in a package named xz-utils.
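
For anyone without xz-utils at hand (for example on a Mac), here is a minimal sketch of the same decompression using Python's standard lzma module. The helper name decompress_xz is made up for this example; it assumes a bare .xz file with no tar layer inside, as is the case for 15k.cfg.xz.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress a bare .xz file (no tar layer), keeping the original.

    If dst is not given, the trailing ".xz" is dropped from the name,
    e.g. "15k.cfg.xz" becomes "15k.cfg".
    """
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # lzma.open yields the decompressed byte stream; copy it out in chunks
    # so that large files do not have to fit in memory.
    with lzma.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dst
```

Calling decompress_xz("15k.cfg.xz") in the directory holding the attachment should leave a 15k.cfg file next to it.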

Comment by Moiz Raja [ 14/Jan/15 ]

https://git.opendaylight.org/gerrit/#/c/14155/ - master

Comment by Jozef Behran [ 03/Mar/15 ]

The change increased resiliency against large batches of data but after trying to push 1 million prefixes I still get akka.TimeOutException. Or a "dead letter encountered" log record.

After reviewing the fix I found a suspicious line (in TransactionProxy.java):

this.operationLimiter = new Semaphore(actorContext.getTransactionOutstandingOperationLimit());

I suspect the limit here is too large. It occurs to me that it does not give AKKA enough "breathing space" so when e.g. disk swapping occurs at a time when there is a bunch of huge transactions in the queue and another bunch waiting at the limiter to be transmitted, then AKKA starts getting timeouts.

However consultation with other developers revealed that there are multiple timeouts and other buffering limits involved so I am not sure at all what is wrong here and/or how to fix it.

Increasing the akka timeout to 900 seconds makes the problem go away.
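
To make the limiter concern concrete, here is a toy model of what a semaphore-based limit on outstanding operations does. This is a Python sketch, not the Java code in TransactionProxy.java: the class OperationLimiter, the function simulate and all parameter values are invented for illustration, and the operations are simulated with short sleeps instead of real transactions. It also runs each operation synchronously on a worker thread, which is a simplification of any asynchronous request/reply flow.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class OperationLimiter:
    """Caps the number of in-flight operations with a semaphore (hypothetical helper)."""

    def __init__(self, limit):
        self._permits = threading.Semaphore(limit)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak = 0  # highest number of simultaneously running operations seen

    def run(self, operation):
        # Blocks the caller while `limit` operations are already outstanding.
        self._permits.acquire()
        with self._lock:
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
        try:
            return operation()
        finally:
            with self._lock:
                self._in_flight -= 1
            self._permits.release()


def simulate(limit, operations=50, workers=16, delay=0.005):
    """Push `operations` fake operations through the limiter and report the peak."""
    limiter = OperationLimiter(limit)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(limiter.run, lambda: time.sleep(delay))
            for _ in range(operations)
        ]
        for future in futures:
            future.result()
    return limiter.peak
```

In this model simulate(limit=4) never reports a peak above 4, no matter how many workers are submitting. The larger the limit, the more work can pile up behind a slow receiver before the senders are forced to wait, which is the "breathing space" worry described above.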

Comment by Moiz Raja [ 04/Mar/15 ]

Jozef,

Couple of questions,

1. Which akka timeout did you increase to 900 seconds to fix this issue? Was it the operation timeout or was it the transaction timeout?

2. How do we reproduce this problem with 1 Million prefixes?

3. How much time does it take to ingest 1 Million prefixes?

Comment by Jozef Behran [ 09/Mar/15 ]

After more testing I discovered that this exception occurs reliably when I try to push 2 million prefixes into BGP. Changing the AKKA timeout does not affect anything in this case. The exception will occur after roughly 5 minutes even when I set the timeout to 15 minutes. Additionally, I realized that the exception can occur roughly once in 5 tries even when running with 1 million prefixes.

Answers:

1. I increased "operational-timeout-in-seconds" in module "distributed-operational-store-module" and "distributed-config-store-module". I have no idea how to set the "transaction timeout" to anything because the config files here are different from what I could find by searching for "akka" in Google. However, I found a config file which mentions "akka" and "timeout" in one section, so I am going to take a look at that.

2. I need to build a testcase and it will take a while (the test I use right now uses a repackaged build with custom configuration, I need to extract the customizations etc). Once done I will attach it here.

3. About 4 minutes with IMDS. When CDS does not fail, then it also takes about 4 minutes. When it fails, it may take up to 1 hour (while it generates multi-GB log file).

Comment by Jozef Behran [ 10/Mar/15 ]

Steps to reproduce with 1M of routes:

1. Extract ODL tarball into your home directory.
2. Enter the directory that was made by step 1 and run bin/karaf.
3. Enter "feature:install odl-bgpcep-bgp-all".
4. Enter "feature:install odl-restconf-noauth".
5. Install clustering according to your wishes (persistence, replication, etc).
6. Enter "logout" and wait until karaf exits.
7. Copy the file "41-bgp-example.xml" from the attached package into etc/opendaylight/karaf (overwrite the file with the same name that is there).
8. Run bin/karaf again and wait about 5 minutes for ODL to boot (you can use "top" to shorten the wait, watch for the CPU usage of the Java process to drop below about 10% and stay there).
9. In another terminal extract the play.py from the attached package and then run "python play.py --gencount=1000000".

Notes:

  • The "41-bgp-example.xml" file from the package sets up topology updating. That makes it much more likely to hit the bug.
  • If you intend to change the content of "41-bgp-example.xml", do a DIFF between the original file and the one in the package.
  • In the last step you can specify any count you want in the --gencount argument, up to 180 million.
Comment by Jozef Behran [ 10/Mar/15 ]

Attachment test.tar.gz has been added with description: Package with tools for testing ODL with up to 180 million prefixes

Comment by Jozef Behran [ 10/Mar/15 ]

According to Vratko, here is a faster path to hit the bug:

1. Extract ODL tarball into your home directory.
2. Enter the directory that was made by step 1.
3. Copy the file "41-bgp-example.xml" from the attached package into directory etc/opendaylight/karaf (create the directory if it does not exist).
4. Run bin/karaf
5. Enter "feature:install odl-bgpcep-bgp-all".
6. Enter "feature:install odl-restconf-noauth".
7. Install clustering according to your wishes (persistence, replication, etc).
8. In another terminal extract the play.py from the attached package and then run "python play.py --gencount=1000000".

Additionally, try 2 million routes if you have difficulty hitting the bug.

Comment by Harman Singh [ 14/Apr/15 ]

Hi Jozef,

I could not make your instructions work. The play.py script fails with the following error when I try to use it:

Traceback (most recent call last):
  File "play.py", line 366, in <module>
    Main()
  File "play.py", line 319, in Main
    FromODL,ToODL = ConnectToODL(args.myip, args.myport, args.peerip, args.peerport, CtlLog)
  File "play.py", line 192, in ConnectToODL
    ODL.connect((peerip, int(peerport)))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 60] Operation timed out

Can you give me pointers on what I need to do to reproduce it? I followed the steps in your comments above.

Comment by Jozef Behran [ 30/Apr/15 ]

(sorry for very late reply, I was overbooked during this month)

1. Did you try to run the "play.py" command multiple times?
2. Did you try to restart ODL and try again? I have sometimes seen ODL hang while installing a feature; maybe in your case it got stuck around the code that handles the connection.
3. Do you have a link to the ODL build you are trying to test?

Basically, the problem is that I could not reproduce this "Operation timed out" error on my setup.
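
One way to narrow down an "Operation timed out" like the one above is to check, independently of play.py, whether the peer address and port accept TCP connections at all. The sketch below is a generic check written for Python 3 (the traceback above is from Python 2.7); the function name can_connect and its default timeout are made up for this example, and it knows nothing about BGP itself.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake;
        # refused connections, unreachable hosts and timeouts all raise OSError.
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with the same values that were passed to play.py as the peer IP and peer port. A False result suggests that nothing is listening there yet (for example because the BGP feature has not finished starting) or that something on the network path is dropping the packets, rather than a problem inside play.py.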

Comment by Moiz Raja [ 30/Apr/15 ]

Jozef, in the meantime we got past the problem of executing play.py. It works now.

Comment by Moiz Raja [ 04/Jun/15 ]

Will track this issue as part of 3340 as it's newer and has more relevant info
