[CONTROLLER-1333] Clustering: Performance issues in BGP scale test Created: 22/May/15 Updated: 10/Jun/15 Resolved: 10/Jun/15 |
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Post-Helium |
| Fix Version/s: | None |
| Type: | Bug | | |
| Reporter: | Dana Kutenicsova | Assignee: | Moiz Raja |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | | |
| Remaining Estimate: | Not Specified | | |
| Time Spent: | Not Specified | | |
| Original Estimate: | Not Specified | | |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 3340 | | |
| Description |
|
After connecting BGP to a feed, it times out after about 3 minutes with ask.timeout exceptions. Build is distribution-karaf-0.3.0-20150522.085838-1738. I'll send steps to reproduce in an email. Here's the exception:

2015-05-22 07:48:52,740 | INFO | oupCloseable-3-1 | AbstractBGPSessionNegotiator | 227 - org.opendaylight.bgpcep.bgp-rib-impl - 0.4.0.SNAPSHOT | BGP Session established successfully. |
| Comments |
| Comment by Moiz Raja [ 27/May/15 ] |
|
I enabled debug logging, grepped for TimeoutException, and then grepped on some of the transactions that were failing. I see a pattern (which I cannot explain yet). Most of the transactions that fail with the timeout do so because "FindPrimaryShard" fails with a timeout. See this snippet -> http://pastebin.com/yBh2nrSs

A couple of curious things:
a. The AskTimeout shows the actor to be the Shard. This does not make sense because the FindPrimary request is sent to the ShardManager, which is the parent of the Shard. |
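For context on why the reported actor path matters: an Akka ask attributes any AskTimeoutException to the actor the message was sent to. A minimal sketch of such an ask, using a plain String as an illustrative stand-in for the real FindPrimary message class (the actual call lives in ActorContext):

```java
import java.util.concurrent.TimeUnit;

import akka.actor.ActorRef;
import akka.pattern.Patterns;
import akka.util.Timeout;
import scala.concurrent.Future;
import scala.concurrent.duration.Duration;

// Sketch only: the message below is a String stand-in for the real
// FindPrimary message class, and the 5 s timeout mirrors the logs.
final class FindPrimaryAskSketch {
    private final ActorRef shardManager;

    FindPrimaryAskSketch(ActorRef shardManager) {
        this.shardManager = shardManager;
    }

    Future<Object> findPrimary(String shardName) {
        Timeout timeout = new Timeout(Duration.create(5, TimeUnit.SECONDS));
        // The ask targets the ShardManager, so an AskTimeoutException on
        // this Future should name the ShardManager's path - not the child
        // Shard's path, which is what makes the logged exception puzzling.
        return Patterns.ask(shardManager, "FindPrimary:" + shardName, timeout);
    }
}
```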
| Comment by Moiz Raja [ 27/May/15 ] |
|
I reran the test and now I see the timeouts at 5 seconds, with the actor still being the Shard - puzzling. I'm wondering if there is some mix-up happening here and the timeout is being delivered to the wrong handler. |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Based on what I am seeing, I think the following patch (which was merged May 26th) will help, because it avoids looking up the primary shard unless the ShardManager actually removes it from the PrimaryShardInfo cache: https://git.opendaylight.org/gerrit/#/c/20571/

Dana, if you could try this out it would be good. I'm hitting some trouble building an integration distribution; otherwise I would have tried it myself. |
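For illustration, the shape of that change is roughly a cache-first lookup like the sketch below; the class names come from this discussion, but the signatures are assumptions rather than the merged patch:

```java
import scala.concurrent.Future;

// Sketch of the cache-first lookup idea behind change 20571; names and
// signatures here are assumptions, not the actual patch.
final class CacheFirstLookup {
    private final PrimaryShardInfoFutureCache cache;  // class named in this issue
    private final ActorContext actorContext;          // class named in this issue

    CacheFirstLookup(PrimaryShardInfoFutureCache cache, ActorContext actorContext) {
        this.cache = cache;
        this.actorContext = actorContext;
    }

    Future<PrimaryShardInfo> findPrimaryShardAsync(String shardName) {
        Future<PrimaryShardInfo> cached = cache.getIfPresent(shardName); // assumed accessor
        if (cached != null) {
            return cached; // hit: no FindPrimary message is sent at all
        }
        // miss: ask the ShardManager; the entry stays cached until the
        // ShardManager explicitly invalidates it (e.g. on a leader change)
        return actorContext.findPrimaryShardAsync(shardName);
    }
}
```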
| Comment by Tom Pantelis [ 27/May/15 ] |
|
That is odd. What's even odder is that the ActorContext uses the shard-initialization-timeout for FindPrimary, which is set to 5 min. Did you have ActorContext and ShardManager debug on as well? If the PrimaryShardInfo future wasn't cached yet, then we should've seen the "Sending message FindPrimary..." output from ActorContext.

It seems the PrimaryShardInfo future was not yet cached, b/c the "Find primary for shard default failed" output was logged from an akka dispatcher thread, which indicates it was invoked via the OnComplete callback. The AbstractTransactionContextFactory first checks Future#isCompleted to elide the async OnComplete wait on the Future, so if it was already completed then the output would've been logged on the "oupCloseable-3-1" thread.

Getting back to the AskTimeoutException indicating the "default" shard actor path, a possible explanation is that the call to actorSelection in onFindPrimaryShardFound is what threw the AskTimeoutException. However, where did the 30000 ms timeout come from? Also, the "findPrimaryShardAsync received ..." debug output should've been seen (assuming you had ActorContext debug on).

In the pastebin, the NoOpTransactionContext didn't print the AskTimeoutException stack trace - possibly b/c it was already printed earlier too many times. Hopefully that would give an indication as to the source of the ask failure. |
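To make the thread-name reasoning concrete, here is a rough sketch of that isCompleted fast path, with hypothetical callback names standing in for the real AbstractTransactionContextFactory methods:

```java
import akka.dispatch.OnComplete;
import scala.concurrent.ExecutionContext;
import scala.concurrent.Future;
import scala.util.Try;

// Sketch of the isCompleted fast path described above. The callback
// names are hypothetical stand-ins, not the real factory methods.
abstract class PrimaryResolutionSketch {
    abstract void onPrimaryFound(PrimaryShardInfo info);
    abstract void onPrimaryFailed(Throwable failure);

    void resolvePrimary(Future<PrimaryShardInfo> future, ExecutionContext ec) {
        if (future.isCompleted()) {
            // Already complete: handled inline on the caller's thread,
            // e.g. "oupCloseable-3-1" in the logs.
            Try<PrimaryShardInfo> result = future.value().get();
            if (result.isSuccess()) {
                onPrimaryFound(result.get());
            } else {
                onPrimaryFailed(result.failed().get());
            }
        } else {
            // Not yet complete: the callback runs later on an akka
            // dispatcher thread - seeing the failure logged from such a
            // thread is what implies the Future was not cached yet.
            future.onComplete(new OnComplete<PrimaryShardInfo>() {
                @Override
                public void onComplete(Throwable failure, PrimaryShardInfo info) {
                    if (failure != null) {
                        onPrimaryFailed(failure);
                    } else {
                        onPrimaryFound(info);
                    }
                }
            }, ec);
        }
    }
}
```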
| Comment by Moiz Raja [ 27/May/15 ] |
|
Since Bugzilla won't let me attach it, I've uploaded the whole debug-enabled log file here: https://cisco.box.com/s/eesdiopbocetz2qkv4orjy3inknde7v1. Note that in this log you will see Akka timeouts happening at 5 seconds, but the actor is still the Shard. |
| Comment by Tom Pantelis [ 27/May/15 ] |
|
Notice the time interval between these 2 log entries for tx member-1-txn-4668:

2015-05-26 21:11:39,003 | DEBUG | oupCloseable-3-1 | SingleCommitCohortProxy | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Tx member-1-txn-4668 canCommit

Over 7 minutes passed between the time it was created and readied by the caller and the time it failed to find the primary shard. The ActorContext sent out 2 FindPrimary messages ~9 sec after the tx was created:

2015-05-26 21:11:48,516 | DEBUG | lt-dispatcher-39 | ActorContext | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Sending message class org.opendaylight.controller.cluster.datastore.messages.FindPrimary to Actorakka://opendaylight-cluster-data/user/shardmanager-operational#967672581

I assume 1 of them was for this tx. It appears the 9 sec gap is primarily due to waiting for the previous ready Future to complete. The requests were processed by the ShardManager and the responses came back quickly to the ActorContext:

2015-05-26 21:11:48,517 | DEBUG | lt-dispatcher-39 | ShardManager | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | shard-manager-operational: Found primary for default: LocalPrimaryShardFound [primaryPath=akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-shard-default-operational#-701003399, localShardDataTree=org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree@58bde7ca]

2015-05-26 21:11:48,517 | DEBUG | ult-dispatcher-5 | ActorContext | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | findPrimaryShardAsync received: LocalPrimaryShardFound [primaryPath=akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-shard-default-operational#-701003399, localShardDataTree=org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree@58bde7ca]

So the FindPrimary message itself was quick. Also, it succeeded, so the AskTimeoutEx didn't come from that message. So what accounts for the 7 min gap between "findPrimaryShardAsync received" and the AbstractTransactionContextFactory receiving the Future callback? There could've been a long GC pause (for that long?). Or it could've taken that long for a thread to become available and process the OnComplete callback (not likely). But neither of those account for the AskTimeoutEx.

I suspect it was onFindPrimaryShardFound that took 7 min. The only code I see that could be the culprit is the call to actorSelection and the call to onShardInfoUpdated while holding the lock. The latter shouldn't take that long even if the lock was contended. The call to actorSelection looks suspicious. I'm not sure what it does internally - maybe it tries to "ping" the shard actor but times out and throws the AskTimeoutEx - again, that would explain the presence of the shard actor path in the AskTimeoutEx. It would be interesting to put a try/catch around actorSelection.

I agree that https://git.opendaylight.org/gerrit/#/c/20571/ should help here. It should alleviate the excessive FindPrimary messages. |
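A quick diagnostic along those lines might look like the following sketch; the probe class, the path argument, and the logger are illustrative, and the broad catch is deliberate since the suspected exception type is unproven:

```java
import akka.actor.ActorSelection;
import akka.actor.ActorSystem;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Diagnostic sketch for the try/catch suggestion above; primaryPath is
// an illustrative stand-in for the path resolved from PrimaryShardFound.
final class ActorSelectionProbe {
    private static final Logger LOG = LoggerFactory.getLogger(ActorSelectionProbe.class);

    ActorSelection selectPrimary(ActorSystem system, String primaryPath) {
        try {
            return system.actorSelection(primaryPath);
        } catch (RuntimeException e) {
            // If actorSelection is the culprit, the exception (and the
            // shard path it names) would be caught and logged right here.
            LOG.error("actorSelection failed for {}", primaryPath, e);
            throw e;
        }
    }
}
```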
| Comment by Moiz Raja [ 27/May/15 ] |
|
Another curious thing: most of the transactions failing with AskTimeout had at most 1 or 2 modifications. Why would that be, when most transactions seem to have thousands? |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Tom, I don't think actorSelection taking a long time is the root cause. Creating an actorSelection is supposed to be cheap; using an actor selection is more expensive than an ActorRef. If we go with your theory, then actorSelection would have to do an ask to the actor and wait for a response. Note that the timeouts we are seeing (5s/30s) are the kinds of timeouts we have set. I think there is something else going on here.

Also note a bunch of consecutive transactions failed:

550777:2015-05-26 21:18:51,324 | DEBUG | ult-dispatcher-3 | bstractTransactionContextFactory | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Tx member-1-txn-4668: Find primary for shard default failed |
| Comment by Dana Kutenicsova [ 27/May/15 ] |
|
I tried build https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/distribution-karaf/0.3.0-SNAPSHOT/distribution-karaf-0.3.0-20150527.075221-1862.tar.gz, which should contain change 20571. From my point of view it didn't help; the logs are in ~/27-05-fail-karaf.log |
| Comment by Tom Pantelis [ 27/May/15 ] |
|
An actorSelection shouldn't take that long, but then again nothing should. I think the bunch of consecutive failed transactions is b/c they were all backed up and then timed out similarly. I agree the 5s/30s timeouts look like ours - that part is odd. The 5s is more odd b/c that would correspond to the operation timeout, but we only use that now for remote tx operations - all these txs are local. (In reply to Moiz Raja from comment #8) |
| Comment by Tom Pantelis [ 27/May/15 ] |
|
I can't get to ~/27-05-fail-karaf.log. (In reply to Dana Kutenicsova from comment #9) |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Here is 27-05-karaf.log: https://cisco.box.com/s/m5o616g8qsye2urcvr844zfim2l5ikt5 - but it does not have debug logging. I'll enable that again, rerun the test, and then post the log. |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Debug logs: https://cisco.box.com/s/qbw7pexrgkesgbuqpjgtij2p7kojf324 |
| Comment by Tom Pantelis [ 27/May/15 ] |
|
The default-operational shard sent the LeaderStateChanged event at:

2015-05-27 12:20:34,187 | INFO | er-data-akka.actor.default-dispatcher-34 | ShardManager | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | shard-manager-operational: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-default-operational, leaderId=member-1-shard-default-operational]

The FindPrimary message was sent and received 5 sec later:

2015-05-27 12:20:39,113 | DEBUG | er-data-akka.actor.default-dispatcher-16 | ActorContext | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | findPrimaryShardAsync received: LocalPrimaryShardFound [primaryPath=akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-shard-default-operational#-1521885687, localShardDataTree=org.opendaylight.yangtools.yang.data.impl.schema.tree.InMemoryDataTree@3eb14c15]

In fact, this is the only such log entry for default-operational, which makes sense b/c it should've been cached thereafter and there were no more LeaderStateChanged events to clear it. Also, we see that the local data tree was resolved, so it must have gotten put in the cache:

2015-05-27 12:20:39,114 | DEBUG | er-data-akka.actor.default-dispatcher-16 | bstractTransactionContextFactory | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Shard default resolved to local data tree - adding local factory

But yet a tx ~35 sec later failed to find the primary shard. Notice the 11 sec gap:

2015-05-27 12:21:14,949 | DEBUG | er-data-akka.actor.default-dispatcher-19 | SingleCommitCohortProxy | 169 - org.opendaylight.controller.sal-distributed-datastore - 1.2.0.SNAPSHOT | Tx member-1-txn-7907 canCommit

But I see no FindPrimary message going out between 12:21:14 and 12:32:15, which indicates it got the Future from the cache. But how could the Future indicate failure when we only put immediate, successful Futures in the cache? Plus, other transactions at 12:20:39+ succeeded, b/c I see ReadyLocalTransaction messages going out.

It sort of seems like there's something funky going on with akka/scala Futures - almost like it somehow loses its completed state and gets some unrelated failure associated to it. Maybe we should try switching to ListenableFutures for the cache... |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Could the future returned to us be mutable? Could it be reused by akka and set to something else? What if we cached the PrimaryShardInfo instead of the future? |
| Comment by Moiz Raja [ 27/May/15 ] |
|
Mutable futures are making less sense to me now. In PrimaryShardInfoFutureCache#putSuccessful we create a new future instead of storing the one from akka#ask. So now it just looks like this may be a case of a confused future... |
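To illustrate that point: since putSuccessful stores a freshly created, already-completed Future rather than the one returned by ask, reuse of the ask's Future can't explain a failed cached entry. A sketch under assumed signatures (not the actual PrimaryShardInfoFutureCache source):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import akka.dispatch.Futures;
import scala.concurrent.Future;

// Sketch of the putSuccessful behavior described above; signatures are
// assumptions, not the real PrimaryShardInfoFutureCache.
final class PrimaryShardInfoFutureCacheSketch {
    private final ConcurrentMap<String, Future<PrimaryShardInfo>> cache =
            new ConcurrentHashMap<>();

    void putSuccessful(String shardName, PrimaryShardInfo info) {
        // Futures.successful creates a new, immutable, already-completed
        // Future - independent of the Future returned by akka ask.
        cache.put(shardName, Futures.successful(info));
    }

    Future<PrimaryShardInfo> getIfPresent(String shardName) {
        return cache.get(shardName);
    }
}
```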
| Comment by Tom Pantelis [ 27/May/15 ] |
|
Yeah - it doesn't make sense. We need more debug logging - I'll push a patch for that. I want to try to prove that the scala/akka Future got messed up somehow, or maybe I did something wrong in the isCompleted code block. We really don't need to use a scala Future here anyway. ListenableFutures are more of a known commodity and I feel more comfortable with them - I've read the source code and fully understand how they work. I don't really know enough scala to be able to understand their code at this point. |
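For reference, the ListenableFuture-based alternative being floated might look roughly like this; the class and method names mirror the ones discussed here but are assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;

// Sketch of a Guava ListenableFuture-based cache (assumed names, not a
// patch). immediateFuture yields a completed, immutable future whose
// callback semantics are easy to audit from Guava's source.
final class ListenablePrimaryShardInfoCache {
    private final ConcurrentMap<String, ListenableFuture<PrimaryShardInfo>> cache =
            new ConcurrentHashMap<>();

    void putSuccessful(String shardName, PrimaryShardInfo info) {
        cache.put(shardName, Futures.immediateFuture(info));
    }

    void onPrimary(String shardName, FutureCallback<PrimaryShardInfo> callback) {
        ListenableFuture<PrimaryShardInfo> future = cache.get(shardName);
        if (future != null) {
            // For an already-completed future the callback runs
            // immediately on the calling thread.
            Futures.addCallback(future, callback, MoreExecutors.directExecutor());
        }
    }
}
```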
| Comment by Moiz Raja [ 27/May/15 ] |
|
Tom, hold off on the ListenableFuture changes till we put in more diagnostics. I'll try to push a diagnostic patch as well. |
| Comment by Moiz Raja [ 04/Jun/15 ] |
|
This appears to have been fixed by https://git.opendaylight.org/gerrit/#/c/21749/ |
| Comment by Moiz Raja [ 09/Jun/15 ] |
|
On my machine, running a test with 100K routes takes 15 mins for CDS vs 4.5 mins for IMDS. |
| Comment by Moiz Raja [ 10/Jun/15 ] |
|
Performance is acceptable now: 1M routes ingested in 5 min |