[BGPCEP-392] BGP scaling target not met in 3-node cluster Created: 09/Feb/16  Updated: 16/Nov/21

Status: Confirmed
Project: bgpcep
Component/s: BGP
Affects Version/s: Bugzilla Migration
Fix Version/s: None

Type: Bug
Reporter: Radovan Sajben Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by BGPCEP-601 B and C: rejoining isolated node caus... Resolved
is blocked by CONTROLLER-1483 akka.pattern.AskTimeoutException on f... Resolved
is blocked by CONTROLLER-1610 InvalidActorNameException Resolved
is blocked by CONTROLLER-1616 Exception Cannot append new entry Resolved
External issue ID: 5282

 Description   

SUT: http://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/distribution-karaf/0.4.0-SNAPSHOT/distribution-karaf-0.4.0-20160208.111345-3983.zip

Capacity test target 1M prefixes failed in 3-node cluster environment: https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-bgpclustering-only-beryllium/19/

Karaf.log
=========

2016-02-09 10:12:13,834 | ERROR | ult-dispatcher-4 | LocalThreePhaseCommitCohort | 143 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | Failed to prepare transaction member-1-chn-29-txn-3836 on backend
java.lang.RuntimeException: The transacton was aborted due to inflight leadership change.
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:485)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onStateChanged(Shard.java:695)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleBehaviorChange(RaftActor.java:422)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.switchBehavior(RaftActor.java:361)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:267)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:36)[139:org.opendaylight.controller.sal-clustering-commons:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:273)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)[139:org.opendaylight.controller.sal-clustering-commons:1.3.0.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[128:com.typesafe.akka.actor:2.3.14]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[128:com.typesafe.akka.actor:2.3.14]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.run(Mailbox.scala:220)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)[128:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]

...

2016-02-09 10:12:13,860 | ERROR | CommitFutures-5 | BGPPeer | 287 - org.opendaylight.bgpcep.bgp-rib-impl - 0.5.0.SNAPSHOT | Transaction chain failed.
TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.RuntimeException: The transacton was aborted due to inflight leadership change.]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)[116:org.opendaylight.controller.sal-broker-impl:1.3.0.SNAPSHOT]
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)[116:org.opendaylight.controller.sal-broker-impl:1.3.0.SNAPSHOT]
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:96)[42:org.opendaylight.yangtools.util:0.8.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:209)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:44)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:117)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)[39:com.google.guava:18.0.0]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[39:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[39:com.google.guava:18.0.0]
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[39:com.google.guava:18.0.0]
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)[39:com.google.guava:18.0.0]
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)[39:com.google.guava:18.0.0]
at org.opendaylight.controller.cluster.datastore.SingleCommitCohortProxy$1.onComplete(SingleCommitCohortProxy.java:60)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:247)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.OnComplete.internal(Future.scala:245)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:175)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:172)[128:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)[128:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)[128:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[125:org.scala-lang.scala-library:2.11.7.v20150622-112736-1fbce4612c]
Caused by: java.lang.RuntimeException: The transacton was aborted due to inflight leadership change.
at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:485)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onStateChanged(Shard.java:695)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleBehaviorChange(RaftActor.java:422)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.switchBehavior(RaftActor.java:361)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:267)[140:org.opendaylight.controller.sal-akka-raft:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:36)[139:org.opendaylight.controller.sal-clustering-commons:1.3.0.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.onReceiveCommand(Shard.java:273)[143:org.opendaylight.controller.sal-distributed-datastore:1.3.0.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(Eventsourced.scala:430)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:97)[139:org.opendaylight.controller.sal-clustering-commons:1.3.0.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:534)[128:com.typesafe.akka.actor:2.3.14]
at akka.persistence.Recovery$State$class.process(Recovery.scala:30)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.process(Processor.scala:103)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.ProcessorImpl$$anon$2.aroundReceive(Processor.scala:114)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Recovery$class.aroundReceive(Recovery.scala:265)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(Eventsourced.scala:428)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.doAroundReceive(Eventsourced.scala:82)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$$anon$2.aroundReceive(Eventsourced.scala:78)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:369)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.persistence.UntypedPersistentActor.aroundReceive(Eventsourced.scala:428)[133:com.typesafe.akka.persistence.experimental:2.3.14]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)[128:com.typesafe.akka.actor:2.3.14]
at akka.actor.ActorCell.invoke(ActorCell.scala:487)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)[128:com.typesafe.akka.actor:2.3.14]
at akka.dispatch.Mailbox.run(Mailbox.scala:220)[128:com.typesafe.akka.actor:2.3.14]
... 5 more



 Comments   
Comment by Radovan Sajben [ 09/Feb/16 ]

Test references:
================

bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_talking_ipv4_topology_count_1
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_for_empty_ipv4_topology_after_listening_1

Comment by Robert Varga [ 02/May/16 ]

BUG-5280 should result in this particular exception not being thrown and the transactions should recover behind the scenes. Root cause of why the migration happened needs to be analyzed, as that may point out to some other issue.

Comment by Peter Gubka [ 11/Apr/17 ]

I tried the test again. The present numbers are low, approximately 170k and 225k prefixes. This week it can be seen in sandbox:
https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/1/
https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/2/
Both runs are done with "latest" zip from nexus. These zips should contain the log message from
https://git.opendaylight.org/gerrit/#/c/54320/, but I haven't found this message in the logs.
For enabling tell-based protocol i used https://git.opendaylight.org/gerrit/#/c/54340/ as before.

Comment by Peter Gubka [ 02/May/17 ]

There is a longevity job https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-carbon/
which configures and de-configures 300k prefixes and it uses tell-based=True

In the run #1 https://jenkins.opendaylight.org/releng/view/bgpcep/job/bgpcep-csit-3node-bgpclustering-longevity-only-carbon/1/
it did not reach the target number of prefixes. It reaches 25k only.
https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-bgpclustering-longevity-only-carbon/1/archives/log.html.gz#s1-s2-t1-k9-k1-k1-k1-k1-k1-k1-k1-k1-k1-k2-k1-k3-k7-k3-k1-k6

On the other hand, in https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/245/archives/log.html.gz the 300k prefixes passed.

Comment by Robert Varga [ 05/Aug/17 ]

Does this still happen on master?

Generated at Wed Feb 07 19:12:55 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.