[CONTROLLER-1882] Inject DataBroker only when all shards have leaders Created: 11/Jan/19  Updated: 10/Jan/21  Resolved: 22/Sep/20

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: 2.0.4

Type: Task Priority: Medium
Reporter: Faseela K Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
is blocked by MDSAL-392 Clean up BindingRuntimeContext instan... Resolved
Issue split
split to CONTROLLER-1960 Publish Distributed EOS only after th... Resolved
Relates
relates to CONTROLLER-1914 Allow initial settle time to be tuned Resolved
relates to MDSAL-520 Replace blueprint containers with OSG... Confirmed
relates to OVSDB-484 Check for the SHARD Status before ope... Resolved

 Description   

We are hitting an issue in genius on stable/oxygen where, randomly, the idmanager-impl bundle does not come up because the datastore read in the blueprint initialization code fails with NoShardLeaderException.

https://lists.opendaylight.org/pipermail/genius-dev/2019-January/003554.html

While applications work on putting proper failure handling in place, I was thinking it would be better if the dataBroker were injected only once all shards have leaders, rather than relying on a 2 min timeout.

 

ODL :: genius :: idmanager-impl (260)

-------------------------------------

Status: Failure

Blueprint

1/8/19 10:47 AM

Exception:

 

org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
org.osgi.service.blueprint.container.ComponentDefinitionException: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
                at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:310)
                at org.apache.aries.blueprint.container.ServiceRecipe.internalGetService(ServiceRecipe.java:252)
                at org.apache.aries.blueprint.container.ServiceRecipe.internalCreate(ServiceRecipe.java:149)
                at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)
                at org.apache.aries.blueprint.container.BlueprintRepository.createInstances(BlueprintRepository.java:255)
                at org.apache.aries.blueprint.container.BlueprintRepository.createAll(BlueprintRepository.java:186)
                at org.apache.aries.blueprint.container.BlueprintContainerImpl.instantiateEagerComponents(BlueprintContainerImpl.java:704)
                at org.apache.aries.blueprint.container.BlueprintContainerImpl.doRun(BlueprintContainerImpl.java:410)
                at org.apache.aries.blueprint.container.BlueprintContainerImpl.run(BlueprintContainerImpl.java:275)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at org.apache.aries.blueprint.container.ExecutorServiceWrapper.run(ExecutorServiceWrapper.java:106)
                at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)
Caused by: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
                at org.apache.aries.blueprint.container.BeanRecipe.wrapAsCompDefEx(BeanRecipe.java:361)
                at org.apache.aries.blueprint.container.BeanRecipe.getInstanceFromType(BeanRecipe.java:351)
                at org.apache.aries.blueprint.container.BeanRecipe.getInstance(BeanRecipe.java:282)
                at org.apache.aries.blueprint.container.BeanRecipe.internalCreate2(BeanRecipe.java:830)
                at org.apache.aries.blueprint.container.BeanRecipe.internalCreate(BeanRecipe.java:811)
                at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)
                at org.apache.aries.blueprint.di.RefRecipe.internalCreate(RefRecipe.java:62)
                at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:106)
                at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:285)
                ... 21 more
Caused by: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.]]}
                at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:71)
                at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:98)
                at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:194)
                at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:109)
                at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:37)
                at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:136)
                at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:130)
                at akka.dispatch.OnComplete.internal(Future.scala:260)
                at akka.dispatch.OnComplete.internal(Future.scala:258)
                at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
                at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)
                at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
                at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
                at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
                at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
                at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
                at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
                at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
                at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
                at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
                at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
                at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
                at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.

 



 Comments   
Comment by Faseela K [ 11/Jan/19 ]

One important observation is that in stable/oxygen we seem to be hitting the ID Manager issue more often, even during a normal deployment, whereas in Boron we hit it less often and only in a scale setup on reboot.

So delaying the dataBroker may not completely fix the problem: the shards are taking an unusually long time to come up, and if ID Manager waits for the shards to be up, we may hit Blueprint timeout issues in genius/netvirt applications.

The fundamental concern here is why the shards are coming up so late.

Comment by Tom Pantelis [ 11/Jan/19 ]

Take a look at the karaf logs and the timestamps of messages - hopefully that will provide a clue. The actual shard election process is mostly network bound, i.e. however long it takes to send/receive messages over the wire; with all nodes starting around the same time, split votes may lengthen the process, but that shouldn't take minutes. Perhaps akka is taking an unusually long time to establish connectivity and form the cluster in some cases in your environment. One thing to note is that the first seed node listed in akka.conf has to be up - the other 2 nodes will not form a cluster by themselves - so I'd suggest always starting the first seed node first.

Interestingly, I haven't heard of this occurring in CSIT....

Comment by Robert Varga [ 11/Jan/19 ]

This is related to CDS, hence it is not an MD-SAL issue.

Comment by Faseela K [ 11/Jan/19 ]

Sure, I am checking all of these. We will also work towards adding retries in the application-side logic (a rough sketch of what that could look like follows below).

Please do check whether there would be a benefit in increasing this 2 min timeout. Or is it already configurable, so that we can pick a value based on the scale of the deployment?
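
A rough sketch of such application-side retry logic, using the old controller MD-SAL binding API that appears in the stack trace above. This is a hypothetical helper, not genius code: the method name, the 2-second backoff and the maxAttempts parameter are made up for illustration.

    import java.util.concurrent.TimeUnit;

    import com.google.common.base.Optional;
    import com.google.common.util.concurrent.Uninterruptibles;

    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.ReadOnlyTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.ReadFailedException;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    final class ShardAwareReads {
        private ShardAwareReads() {
            // utility class
        }

        // Retry a configuration-datastore read a few times when the shard has not yet
        // elected a leader, instead of letting blueprint bean instantiation fail outright.
        static <T extends DataObject> Optional<T> readWithRetry(final DataBroker broker,
                final InstanceIdentifier<T> path, final int maxAttempts) throws ReadFailedException {
            for (int attempt = 1; ; attempt++) {
                final ReadOnlyTransaction tx = broker.newReadOnlyTransaction();
                try {
                    return tx.read(LogicalDatastoreType.CONFIGURATION, path).checkedGet();
                } catch (ReadFailedException e) {
                    // NoShardLeaderException surfaces as a DataStoreUnavailableException cause,
                    // as seen in the exception above; anything else is rethrown immediately.
                    if (attempt >= maxAttempts || !(e.getCause() instanceof DataStoreUnavailableException)) {
                        throw e;
                    }
                    Uninterruptibles.sleepUninterruptibly(2, TimeUnit.SECONDS);
                } finally {
                    tx.close();
                }
            }
        }
    }
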

Comment by Robert Varga [ 16/Sep/19 ]

There are different trade-offs here, depending on how the applications are wired and how BP is configured.

Comment by Robert Varga [ 28/Apr/20 ]

So the general approach we will take to this is that we will completely eliminate the use of BP in mdsal/controller in favor of OSGi DS. This work will result in the system being decomposed into smaller components (as opposed to BP containers). Each component can (and will) start as soon as it is able to. There will be no deadlines on activation, as OSGi DS is inherently asynchronous and aligned with the OSGi lifecycle.

This will result in services being available as components, and thus the choice of DI framework will be left to downstreams – who can opt to switch to OSGi DS, too, eliminating the need to battle timeouts.
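
A minimal sketch of what a downstream consumer could look like under OSGi DS, assuming constructor injection (DS 1.4) and the mdsal binding DataBroker; ExampleIdManager is a hypothetical class name, not the actual genius component.

    import org.opendaylight.mdsal.binding.api.DataBroker;
    import org.osgi.service.component.annotations.Activate;
    import org.osgi.service.component.annotations.Component;
    import org.osgi.service.component.annotations.Deactivate;
    import org.osgi.service.component.annotations.Reference;

    // The component activates only once a DataBroker service has been registered;
    // with the refactored datastore that registration happens after the shards have
    // settled, so there is no blueprint-style timeout to fight.
    @Component(immediate = true)
    public final class ExampleIdManager {
        private final DataBroker dataBroker;

        @Activate
        public ExampleIdManager(@Reference final DataBroker dataBroker) {
            this.dataBroker = dataBroker;
            // shards have leaders by the time we get here, so initial reads are safe
        }

        @Deactivate
        void deactivate() {
            // release any per-component resources here
        }
    }
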

 

Comment by Robert Varga [ 31/Jul/20 ]

Most of the refactoring is done, such that core services no longer really rely on blueprint. Most notably, AbstractDataStore is started asynchronously and published to the Service Registry only after the shards have settled. There is still a notable gap: Distributed EOS does not wait for settle:

2020-07-31T15:18:16,998 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service starting
2020-07-31T15:18:17,009 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service started
2020-07-31T15:18:17,012 | INFO  | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard             | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Shard created : member-1-shard-entity-ownership-operational, persistent : false
2020-07-31T15:18:17,013 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | DistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Successfully created entity-ownership shard
2020-07-31T15:18:17,014 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | RoleChangeNotifier               | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-shard-entity-ownership-operational/member-1-shard-entity-ownership-operational-notifier#533402702 created and ready for shard:member-1-shard-entity-ownership-operational
2020-07-31T15:18:17,018 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker                | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker starting
2020-07-31T15:18:17,039 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | ConcurrentDOMDataBroker          | 185 - org.opendaylight.yangtools.util - 5.0.5 | ThreadFactory created: CommitFutures
2020-07-31T15:18:17,051 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker                | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker started
2020-07-31T15:18:17,051 | INFO  | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard             | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Starting recovery for member-1-shard-entity-ownership-operational with journal batch size 1
2020-07-31T15:18:17,055 | INFO  | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard             | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational: Recovery completed  - Switching actor to Follower - last log index = -1, last log term = -1, snapshot index = -1, snapshot term = -1, journal size = 0
2020-07-31T15:18:17,059 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | RoleChangeNotifier               | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier for member-1-shard-entity-ownership-operational , received role change from null to Follower
2020-07-31T15:18:17,060 | INFO  | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard             | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Candidate): Starting new election term 1
2020-07-31T15:18:17,061 | INFO  | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard             | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Follower) :- Switching from behavior Follower to Candidate, election term: 1

i.e. DistributedEntityOwnershipService itself does not wait for settle before being published. This will need to be fixed in a follow-up patch.
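
A rough illustration of the "publish only after settle" pattern described above - not the actual controller patch. It assumes a hypothetical settleFuture that completes once the entity-ownership shard has elected a leader, and simply defers the OSGi service registration until then.

    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.MoreExecutors;

    import org.opendaylight.mdsal.eos.dom.api.DOMEntityOwnershipService;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class EosPublisher {
        private static final Logger LOG = LoggerFactory.getLogger(EosPublisher.class);

        private volatile ServiceRegistration<DOMEntityOwnershipService> registration;

        // The service object exists eagerly, but it is only registered with OSGi once
        // the (hypothetical) settleFuture reports that the backing shard has a leader.
        void publishWhenSettled(final BundleContext bundleContext, final DOMEntityOwnershipService eos,
                final ListenableFuture<Void> settleFuture) {
            Futures.addCallback(settleFuture, new FutureCallback<Void>() {
                @Override
                public void onSuccess(final Void result) {
                    registration = bundleContext.registerService(DOMEntityOwnershipService.class, eos, null);
                }

                @Override
                public void onFailure(final Throwable cause) {
                    LOG.error("Entity ownership shard failed to settle, not publishing service", cause);
                }
            }, MoreExecutors.directExecutor());
        }
    }
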
