[CONTROLLER-1882] Inject DataBroker only when all shards have leaders Created: 11/Jan/19 Updated: 10/Jan/21 Resolved: 22/Sep/20 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | 2.0.4 |
| Type: | Task | Priority: | Medium |
| Reporter: | Faseela K | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Description |
|
We are hitting an issue in genius on stable/oxygen where, at random, the idmanager-impl bundle does not come up: the datastore read in the blueprint initialization code fails with NoShardLeaderException. https://lists.opendaylight.org/pipermail/genius-dev/2019-January/003554.html While applications work on adding proper failure handling, it would be better to inject the dataBroker only when all shards have leaders, rather than relying on a 2-minute timeout.
ODL :: genius :: idmanager-impl (260) ------------------------------------- Status: Failure Blueprint 1/8/19 10:47 AM Exception:
org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
org.osgi.service.blueprint.container.ComponentDefinitionException: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:310)
at org.apache.aries.blueprint.container.ServiceRecipe.internalGetService(ServiceRecipe.java:252)
at org.apache.aries.blueprint.container.ServiceRecipe.internalCreate(ServiceRecipe.java:149)
at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)
at org.apache.aries.blueprint.container.BlueprintRepository.createInstances(BlueprintRepository.java:255)
at org.apache.aries.blueprint.container.BlueprintRepository.createAll(BlueprintRepository.java:186)
at org.apache.aries.blueprint.container.BlueprintContainerImpl.instantiateEagerComponents(BlueprintContainerImpl.java:704)
at org.apache.aries.blueprint.container.BlueprintContainerImpl.doRun(BlueprintContainerImpl.java:410)
at org.apache.aries.blueprint.container.BlueprintContainerImpl.run(BlueprintContainerImpl.java:275)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.aries.blueprint.container.ExecutorServiceWrapper.run(ExecutorServiceWrapper.java:106)
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager
at org.apache.aries.blueprint.container.BeanRecipe.wrapAsCompDefEx(BeanRecipe.java:361)
at org.apache.aries.blueprint.container.BeanRecipe.getInstanceFromType(BeanRecipe.java:351)
at org.apache.aries.blueprint.container.BeanRecipe.getInstance(BeanRecipe.java:282)
at org.apache.aries.blueprint.container.BeanRecipe.internalCreate2(BeanRecipe.java:830)
at org.apache.aries.blueprint.container.BeanRecipe.internalCreate(BeanRecipe.java:811)
at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)
at org.apache.aries.blueprint.di.RefRecipe.internalCreate(RefRecipe.java:62)
at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:106)
at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:285)
... 21 more
Caused by: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.]]}
at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:71)
at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:98)
at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:194)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:109)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:37)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:136)
at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:130)
at akka.dispatch.OnComplete.internal(Future.scala:260)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.
|
| Comments |
| Comment by Faseela K [ 11/Jan/19 ] |
|
One important observation is that on stable/oxygen we seem to be hitting the ID Manager issue more often, even during a normal deployment, whereas on Boron we hit it less often and only in a scale setup on reboot. So delaying the dataBroker may not completely fix the problem: the shards are taking an unusually long time to come up, and if ID Manager waits for the shards to be up, we may hit blueprint timeout issues in genius/netvirt applications. The fundamental concern here is why the shards are coming up so late. |
| Comment by Tom Pantelis [ 11/Jan/19 ] |
|
Take a look at the karaf logs and the timestamps of messages - hopefully that will provide a clue. The actual shard election process is mostly network bound, i.e. it takes however long it takes to send/receive messages over the wire; with all nodes starting around the same time, split votes may lengthen the process, but that shouldn't take minutes. Perhaps akka is taking an unusually long time to establish connectivity and form the cluster in some cases in your environment - one thing to note is that the first seed node listed in akka.conf has to be up; the other 2 nodes will not form a cluster by themselves. So I'd suggest always starting the first seed node first. Interestingly, I haven't heard of this occurring in CSIT.... |
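For context, the seed-node ordering mentioned above comes from the akka.conf shipped with the ODL distribution; a typical three-node list looks roughly like the following (the addresses are illustrative), and it is the first entry that must be up for the others to join:

```
cluster {
  # The node listed first must be started first; the remaining
  # members will not form a cluster among themselves.
  seed-nodes = ["akka.tcp://opendaylight-cluster-data@10.0.0.1:2550",
                "akka.tcp://opendaylight-cluster-data@10.0.0.2:2550",
                "akka.tcp://opendaylight-cluster-data@10.0.0.3:2550"]
}
```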
| Comment by Robert Varga [ 11/Jan/19 ] |
|
This is related to CDS, hence it is not an MD-SAL issue. |
| Comment by Faseela K [ 11/Jan/19 ] |
|
Sure, I am checking all of these. We will also work towards adding retries in the application-side logic. Please do check whether there would be a benefit in increasing this 2-minute timeout - or is it already configurable, so that we can pick a value based on the scale of the deployment? |
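A minimal sketch of what such application-side retries might look like (the helper and its names are hypothetical, not genius code): a bounded retry loop around the initial read, backing off between attempts while the transient failure persists.

```java
import java.util.concurrent.Callable;

// Hypothetical helper sketching bounded retries around an initial datastore read.
public final class RetryingRead {

    // Stand-in for a transient failure such as NoShardLeaderException.
    public static final class TransientException extends Exception { }

    // Retry the read up to maxAttempts times, backing off linearly between attempts.
    public static <T> T withRetries(Callable<T> read, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return read.call();
            } catch (TransientException e) {
                last = e;
                Thread.sleep(backoffMillis * attempt);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated read that fails until the shard has elected a leader.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new TransientException();
            }
            return "id-pools";
        }, 5, 10);
        System.out.println(result + " read after " + calls[0] + " attempts");
        // prints "id-pools read after 3 attempts"
    }
}
```

Note this only papers over the symptom: the retry budget still has to be sized against how long shard elections can take, which is the same tuning problem as the 2-minute timeout.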
| Comment by Robert Varga [ 16/Sep/19 ] |
|
There are different trade-offs here, depending on how the applications are wired and how BP is configured. |
| Comment by Robert Varga [ 28/Apr/20 ] |
|
So the general approach we will take to this is that we will completely eliminate use of BP in mdsal/controller in favor of using OSGi DS. This work will result in the system being decomposed into smaller components (as opposed to BP containers). Each component can (and will) start as soon as it can. There will be no deadlines to activation, as OSGi DS is inherently asynchronous and aligned with OSGi lifecycle. This will result in services being available as components, and thus the choice of DI framework will be left to downstreams – who can opt to switch to OSGi DS, too, eliminating the need for battling timeouts.
|
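The shape of that change can be illustrated with a small sketch (plain Java standing in for the OSGi DS machinery; all names are illustrative): the broker is handed to consumers only once a "shards settled" future completes, so there is no timeout to tune - consumers simply activate whenever the datastore does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Illustrative sketch: a service is "published" only after the shards have settled,
// mirroring how an OSGi DS component would register with the Service Registry.
public final class SettledPublisher<T> {
    private final CompletableFuture<T> settled = new CompletableFuture<>();

    // Consumers declare interest up front; they run only after settlement.
    public void whenAvailable(Consumer<T> consumer) {
        settled.thenAccept(consumer);
    }

    // Called by the datastore once every shard has an elected leader.
    public void shardsSettled(T service) {
        settled.complete(service);
    }

    public static void main(String[] args) {
        SettledPublisher<String> publisher = new SettledPublisher<>();
        StringBuilder log = new StringBuilder();
        publisher.whenAvailable(broker -> log.append("got ").append(broker));
        // Nothing has run yet: the consumer is parked, with no deadline attached.
        publisher.shardsSettled("dataBroker");
        System.out.println(log); // prints "got dataBroker"
    }
}
```

The key design point is the inversion: instead of a consumer blocking on a broker lookup with a timeout, the producer drives activation, which is exactly what publishing the service only after settle achieves.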
| Comment by Robert Varga [ 31/Jul/20 ] |
|
Most of the refactoring is done, such that core services do not really rely on blueprint. Most notably, AbstractDataStore is started asynchronously and published to the Service Registry only after shards have settled. There is still a notable gap: Distributed EOS does not wait for settle:
2020-07-31T15:18:16,998 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service starting
2020-07-31T15:18:17,009 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service started
2020-07-31T15:18:17,012 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Shard created : member-1-shard-entity-ownership-operational, persistent : false
2020-07-31T15:18:17,013 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | DistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Successfully created entity-ownership shard
2020-07-31T15:18:17,014 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | RoleChangeNotifier | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-shard-entity-ownership-operational/member-1-shard-entity-ownership-operational-notifier#533402702 created and ready for shard:member-1-shard-entity-ownership-operational
2020-07-31T15:18:17,018 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker starting
2020-07-31T15:18:17,039 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | ConcurrentDOMDataBroker | 185 - org.opendaylight.yangtools.util - 5.0.5 | ThreadFactory created: CommitFutures
2020-07-31T15:18:17,051 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker started
2020-07-31T15:18:17,051 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Starting recovery for member-1-shard-entity-ownership-operational with journal batch size 1
2020-07-31T15:18:17,055 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational: Recovery completed - Switching actor to Follower - last log index = -1, last log term = -1, snapshot index = -1, snapshot term = -1, journal size = 0
2020-07-31T15:18:17,059 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | RoleChangeNotifier | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier for member-1-shard-entity-ownership-operational , received role change from null to Follower
2020-07-31T15:18:17,060 | INFO | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Candidate): Starting new election term 1
2020-07-31T15:18:17,061 | INFO | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Follower) :- Switching from behavior Follower to Candidate, election term: 1
i.e. DistributedEntityOwnershipService itself does not wait for settle before being published. This will need to be fixed in a follow-up patch. |