[CONTROLLER-1072] Clustering: akka.pattern.AskTimeoutException when sending large amounts of BGP data to EXABGP Created: 16/Dec/14 Updated: 04/Jun/15 Resolved: 04/Jun/15 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Jozef Behran | Assignee: | Harman Singh |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| External issue ID: | 2518 | ||||||||
| Priority: | High | ||||||||
| Description |
|
The following exception occurs when I try to send 100000 routes through BGP into any clustered setup. Tested on https://jenkins.opendaylight.org/integration/view/Integration%20jobs/job/integration-master-project-centralized-integration/lastSuccessfulBuild/artifact/distributions/extra/karaf/target/distribution-karaf-0.3.0-SNAPSHOT.tar.gz:

akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-1-shard-default-operational/shard-member-1-txn-88#1797666910)] after [5000 ms]

This problem does not occur when I use a non-clustered setup. |
| Comments |
| Comment by Jozef Behran [ 16/Dec/14 ] |
|
Additional information:
|
| Comment by Jozef Behran [ 16/Dec/14 ] |
|
After a long investigation I found that only about 6000 routes go through before these timeouts show up, and it does not matter whether I run the topology provider or not. At 7500 routes I start to get this AKKA timeout error. |
| Comment by Jozef Behran [ 17/Dec/14 ] |
|
To hit this bug, configure EXABGP with 10000 routes and then point it at an ODL instance with clustering enabled. I advise switching topology off, as the RIB alone is enough to hit the bug. The attachment is an EXABGP configuration file with 15000 routes, which is fairly large but not too large. |
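For readers without the attachment, a route list of a chosen size can be generated with a short script. This is only a sketch: the function name, the starting address, the next-hop and in particular the route line format are assumptions for illustration, not taken from the attached configuration, and should be checked against the EXABGP version in use.

```python
import ipaddress
from typing import Iterator


def generate_prefixes(count: int, first: str = "10.0.0.0", prefix_len: int = 32) -> Iterator[str]:
    """Yield `count` distinct, consecutive IPv4 prefixes starting at `first`."""
    base = int(ipaddress.IPv4Address(first))
    # Address distance between two consecutive prefixes of this length.
    step = 1 << (32 - prefix_len)
    for i in range(count):
        yield f"{ipaddress.IPv4Address(base + i * step)}/{prefix_len}"


# ASSUMPTION: illustrative route line format and next-hop; verify against
# the configuration syntax of the EXABGP version you actually run.
ROUTE_TEMPLATE = "route {prefix} next-hop 192.0.2.1;"

# 10000 routes, the amount mentioned above as enough to hit the bug.
lines = [ROUTE_TEMPLATE.format(prefix=p) for p in generate_prefixes(10000)]
```

The resulting lines would still have to be placed inside a complete neighbor definition before EXABGP can use them.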
| Comment by Jozef Behran [ 17/Dec/14 ] |
|
Attachment 15k.cfg.xz has been added with description: EXABGP configuration with 15000 routes |
| Comment by Moiz Raja [ 17/Dec/14 ] |
|
How do we unzip this xz file? tar -xzf does not work with it. |
| Comment by Moiz Raja [ 17/Dec/14 ] |
|
I had to download Ez7z on mac to unzip the xz attachment. |
| Comment by Vratko Polak [ 18/Dec/14 ] |
|
(In reply to Moiz Raja from comment #4) The 'z' from -xzf is for .gz only. The letter for .xz is 'J'. |
| Comment by Vratko Polak [ 18/Dec/14 ] |
|
(In reply to Vratko Polák from comment #6) Oh, there was no .tar before .xz |
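Since the attachment is a bare .xz file with no tar layer, it can also be unpacked without extra tools. A minimal sketch using only the Python standard library follows; the function name `decompress_xz` is hypothetical and the output name is simply the input name without its `.xz` suffix.

```python
import lzma
import shutil
from pathlib import Path
from typing import Optional


def decompress_xz(src: str, dst: Optional[str] = None) -> Path:
    """Decompress a bare .xz file (no tar layer) and return the output path."""
    src_path = Path(src)
    # Default output name: strip the trailing ".xz" (15k.cfg.xz -> 15k.cfg).
    dst_path = Path(dst) if dst else src_path.with_suffix("")
    with lzma.open(src_path, "rb") as compressed, open(dst_path, "wb") as out:
        # Stream the data so large files are not loaded into memory at once.
        shutil.copyfileobj(compressed, out)
    return dst_path
```

Calling `decompress_xz("15k.cfg.xz")` in the directory holding the attachment should leave a `15k.cfg` file next to it.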
| Comment by Moiz Raja [ 14/Jan/15 ] |
| Comment by Jozef Behran [ 03/Mar/15 ] |
|
The change increased resiliency against large batches of data, but after trying to push 1 million prefixes I still get akka.TimeOutException or a "dead letter encountered" log record.

After reviewing the fix I found a suspicious line (in TransactionProxy.java):

this.operationLimiter = new Semaphore(actorContext.getTransactionOutstandingOperationLimit());

I suspect the limit here is too large. It seems to me that it does not give AKKA enough "breathing space": when, for example, disk swapping occurs while a bunch of huge transactions sit in the queue and another bunch wait at the limiter to be transmitted, AKKA starts getting timeouts. However, consultation with other developers revealed that there are multiple timeouts and other buffering limits involved, so I am not sure at all what is wrong here or how to fix it.

Increasing the akka timeout to 900 seconds makes the problem go away. |
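To illustrate the concern about the limiter, here is a minimal sketch of how a semaphore bounds the number of outstanding operations. It is not the TransactionProxy code: the class `OperationLimiter`, its method names and the tiny limit are hypothetical, and Python is used only for brevity.

```python
import threading


class OperationLimiter:
    """Hypothetical stand-in for a semaphore-based outstanding-operation limiter."""

    def __init__(self, limit: int) -> None:
        # One permit per operation that may be in flight at the same time.
        self._permits = threading.Semaphore(limit)

    def try_submit(self, timeout_s: float) -> bool:
        # Wait for a free permit; give up after timeout_s seconds.
        return self._permits.acquire(timeout=timeout_s)

    def complete(self) -> None:
        # Called when the receiving side acknowledges an operation.
        self._permits.release()


limiter = OperationLimiter(limit=2)
# Three submissions against two permits: the third one waits and times out.
accepted = [limiter.try_submit(timeout_s=0.05) for _ in range(3)]
# An acknowledgement frees a permit, so the next submission goes through.
limiter.complete()
after_ack = limiter.try_submit(timeout_s=0.05)
```

In this model a smaller limit leaves fewer operations queued at the receiver at any moment, at the cost of the sender waiting more often. Whether that is what actually causes the timeouts reported here is not established.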
| Comment by Moiz Raja [ 04/Mar/15 ] |
|
Jozef, a couple of questions:
1. Which akka timeout did you increase to 900 seconds to fix this issue? Was it the operation timeout or the transaction timeout?
2. How do we reproduce this problem with 1 million prefixes?
3. How much time does it take to ingest 1 million prefixes? |
| Comment by Jozef Behran [ 09/Mar/15 ] |
|
After more testing I discovered that this exception occurs reliably when I try to push 2 million prefixes into BGP. Changing the AKKA timeout does not affect anything in this case: the exception occurs after roughly 5 minutes even when I set the timeout to 15 minutes. Additionally, I realized that the exception can occur roughly once in 5 tries even when running with 1 million prefixes.

Answers:
1. I increased "operational-timeout-in-seconds" in the modules "distributed-operational-store-module" and "distributed-config-store-module". I have no idea how to set the "transaction timeout" because the config files here are different from what I could find by searching for "akka" in Google. However, I found a config file which mentions "akka" and "timeout" in one section, so I am going to take a look at that.
2. I need to build a test case and it will take a while (the test I use right now uses a repackaged build with custom configuration, and I need to extract the customizations etc.). Once done I will attach it here.
3. About 4 minutes with IMDS. When CDS does not fail, it also takes about 4 minutes. When it fails, it may take up to 1 hour (while it generates a multi-GB log file). |
| Comment by Jozef Behran [ 10/Mar/15 ] |
|
Steps to reproduce with 1M of routes: 1. Extract ODL tarball into your home directory. Notes:
|
| Comment by Jozef Behran [ 10/Mar/15 ] |
|
Attachment test.tar.gz has been added with description: Package with tools for testing ODL with up to 180 million prefixes |
| Comment by Jozef Behran [ 10/Mar/15 ] |
|
According to Vratko, here is a faster path to hit the bug: 1. Extract ODL tarball into your home directory. Additionally, try 2 million routes if you have difficulty hitting the bug. |
| Comment by Harman Singh [ 14/Apr/15 ] |
|
Hi Jozef, I could not make your instructions work. The play.py script fails with the following error when I try to use it:

Traceback (most recent call last):

Can you give me pointers on what I need to do to reproduce it? I followed your comments above. |
| Comment by Jozef Behran [ 30/Apr/15 ] |
|
(Sorry for the very late reply, I was overbooked this month.)
1. Did you try to run the "play.py" command multiple times?
Basically, the problem is that I could not reproduce this "Operation timed out" error on my setup. |
| Comment by Moiz Raja [ 30/Apr/15 ] |
|
Jozef, in the meantime we did get over the problem of executing play.py. It works |
| Comment by Moiz Raja [ 04/Jun/15 ] |
|
Will track this issue as part of 3340 as it's newer and has more relevant info |