[CONTROLLER-1642] DistributedShardedDOMDataTreeRemotingTest.testProducerRegistrations fails intermittently Created: 25/Apr/17  Updated: 25/Jul/23  Resolved: 06/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8301

 Description   

testProducerRegistrations(org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTreeRemotingTest) Time elapsed: 108.576 sec <<< ERROR!
java.lang.RuntimeException: org.opendaylight.controller.cluster.sharding.DOMDataTreeShardCreationFailedException: Unable to create a cds shard.
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree.initDefaultShard(DistributedShardedDOMDataTree.java:559)
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree.init(DistributedShardedDOMDataTree.java:247)
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTreeRemotingTest.initEmptyDatastores(DistributedShardedDOMDataTreeRemotingTest.java:165)
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTreeRemotingTest.testProducerRegistrations(DistributedShardedDOMDataTreeRemotingTest.java:178)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Caused by: org.opendaylight.controller.cluster.sharding.DOMDataTreeShardCreationFailedException: Unable to create a cds shard.
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree$2$2.apply(DistributedShardedDOMDataTree.java:358)
at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree$2$2.apply(DistributedShardedDOMDataTree.java:355)
at scala.concurrent.Future$$anonfun$transform$1$$anonfun$apply$2.apply(Future.scala:224)
at scala.concurrent.Future$$anonfun$transform$1$$anonfun$apply$2.apply(Future.scala:224)
at scala.util.Try$.apply(Try.scala:192)
at scala.concurrent.Future$$anonfun$transform$1.apply(Future.scala:224)
at scala.concurrent.Future$$anonfun$transform$1.apply(Future.scala:222)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.sharding.DOMDataTreeShardCreationFailedException: Unable to find the backend shard.Failing..
at org.opendaylight.controller.cluster.sharding.LookupTask.fail(LookupTask.java:50)
at org.opendaylight.controller.cluster.sharding.LookupTask.tryReschedule(LookupTask.java:40)
at org.opendaylight.controller.cluster.sharding.ShardedDataTreeActor$ShardCreationLookupTask$1.onComplete(ShardedDataTreeActor.java:451)
at org.opendaylight.controller.cluster.sharding.ShardedDataTreeActor$ShardCreationLookupTask$1.onComplete(ShardedDataTreeActor.java:447)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
... 13 more
Caused by: org.opendaylight.controller.cluster.datastore.exceptions.LocalShardNotFoundException: Local shard for default does not exist.
at org.opendaylight.controller.cluster.datastore.utils.ActorContext$3.checkedApply(ActorContext.java:294)
at org.opendaylight.controller.cluster.datastore.utils.ActorContext$3.checkedApply(ActorContext.java:283)
at akka.dispatch.Mapper.apply(Future.scala:359)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
... 13 more

Since this is happening in initEmptyDatastores, it could affect any of the tests. I've looked into it but haven't found the cause yet. It seems to happen if member-2 becomes the leader of the prefix-configuration shard. The PrefixedShardConfigUpdateHandler DTCL for the ShardManager doesn't get notified, so the local default shard on member-1 isn't found. I've also seen it fail where the PrefixedShardConfigUpdateHandler DTCL for the DistributedShardedDOMDataTree doesn't get notified.

For now I'll submit a patch to ignore the tests.
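For reference, a minimal sketch of what temporarily ignoring a flaky JUnit 4 test looks like (illustrative only, not the actual patch; the annotation text is an assumption):

    import org.junit.Ignore;
    import org.junit.Test;

    public class DistributedShardedDOMDataTreeRemotingTest {

        // Temporarily disabled until the intermittent shard-creation failure is resolved.
        @Ignore("Intermittent failure - see bug 8301")
        @Test
        public void testProducerRegistrations() throws Exception {
            // test body omitted
        }
    }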



 Comments   
Comment by Tom Pantelis [ 27/Apr/17 ]

I fixed a couple of things but I'm still seeing a really weird scenario wrt the DTCL registrations for the prefix-configuration-shard. I've added more debug, and the following shows the ShardDataTreeChangeListenerPublisherActorProxy sending the RegisterListener message to the ShardDataTreeChangePublisherActor:

09:03:42,342 PM [cluster-test-akka.actor.default-dispatcher-9] [DEBUG] ShardDataTreeChangeListenerPublisherActorProxy - member-1-shard-prefix-configuration-shard-config: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$e)]], initialState present=true] to publisher actor Actor[akka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher#1456071537], ACTORREF: 1171733008, REF: 2141773456

The FE DTCL actor's path ends in "/user/$e". This is for the CONFIG shard: member-1-shard-prefix-configuration-shard-config. The ShardDataTreeChangePublisherActor's path ends in "member-1-shard-prefix-configuration-shard-config-DTCL-publisher#1456071537". I also printed the publisher ActorRef reference identity (ACTORREF) and "this" ref identity (REF).

The ShardDataTreeChangePublisherActor for member-1-shard-prefix-configuration-shard-config receives the message as expected:

09:03:42,346 PM [cluster-test-akka.actor.default-dispatcher-2] [DEBUG] ShardDataTreeChangePublisherActor - member-1-shard-prefix-configuration-shard-config: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$e)]], initialState present=true]: path: akka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher, ref: 446773763

So far so good. Shortly thereafter another DTCL is registered for the same shard. Here's the ShardDataTreeChangeListenerPublisherActorProxy log message:

09:03:42,346 PM [cluster-test-akka.actor.default-dispatcher-9] [DEBUG] ShardDataTreeChangeListenerPublisherActorProxy - member-1-shard-prefix-configuration-shard-config: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$h)]], initialState present=true] to publisher actor Actor[akka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher#1456071537], ACTORREF: 1171733008, REF: 2141773456

The DTCL listener actor path ends in "/user/$h". Same ACTORREF, REF, and ShardDataTreeChangePublisherActor path. Looks good. However, the message doesn't go to the same (member-1-shard-prefix-configuration-shard-config) ShardDataTreeChangePublisherActor instance. Instead it inexplicably goes to the member-1-shard-prefix-configuration-shard-operational instance:

09:03:42,351 PM [cluster-test-akka.actor.default-dispatcher-12] [DEBUG] ShardDataTreeChangePublisherActor - member-1-shard-prefix-configuration-shard-operational: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$h)]], initialState present=true]: path: akka://cluster-test/user/shardmanager-operational/member-1-shard-prefix-configuration-shard-operational/member-1-shard-prefix-configuration-shard-operational-DTCL-publisher, ref: 202318682

This only happens very intermittently - most times it takes up to 100 runs for it to happen (once it was about 250). I'm scratching my head. It seems like Akka is routing the message to the wrong actor - that's the only thing I can see at this point.
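As a side note, here is a minimal, self-contained sketch of the diagnostic pattern used above - logging the receiving actor's path and identity hash so a misrouted message becomes visible. It assumes Akka classic's Java API and uses stand-in actor and message names, not the actual ODL proxy/publisher classes:

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;

    public class RoutingDiagnostic {

        // Stand-in for ShardDataTreeChangePublisherActor: logs which instance got the message.
        static class PublisherActor extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder()
                    .match(String.class, msg -> System.out.println("Received '" + msg + "' at path "
                        + getSelf().path() + ", instance " + System.identityHashCode(this)))
                    .build();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            ActorSystem system = ActorSystem.create("cluster-test");

            // Two publishers, analogous to the -config and -operational DTCL publisher actors.
            ActorRef configPublisher = system.actorOf(Props.create(PublisherActor.class), "config-DTCL-publisher");
            ActorRef operPublisher = system.actorOf(Props.create(PublisherActor.class), "oper-DTCL-publisher");

            // Log the target before sending, as the proxy does, then compare with the receiver's log.
            System.out.println("Sending to " + configPublisher
                + ", ACTORREF: " + System.identityHashCode(configPublisher));
            configPublisher.tell("RegisterListener", ActorRef.noSender());

            Thread.sleep(1000);
            system.terminate();
        }
    }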

Comment by Tom Pantelis [ 27/Apr/17 ]

Submitted https://git.opendaylight.org/gerrit/#/c/56150/ to fix a couple issues.

Comment by Jakub Morvay [ 02/May/17 ]

(In reply to Tom Pantelis from comment #1)

Hi Tom,

We have also witnessed this strange behavior. The only difference is that we see it on almost half of our test runs. Our test is pretty simple: we have a 3-node cluster and we are creating a prefix-based shard in the config datastore.

I can attach logs here, but basically there is nothing new in them. It boils down to bad registration.

2017-04-28 14:51:45,409 | DEBUG | lt-dispatcher-49 | hangeListenerPublisherActorProxy | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-prefix-configuration-shard-operational: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/$g)]], initialState present=true] to publisher actor Actor[akka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-prefix-configuration-shard-operational/member-1-shard-prefix-configuration-shard-operational-DTCL-publisher#1253759913]

2017-04-28 14:51:45,412 | DEBUG | on-dispatcher-57 | hardDataTreeChangePublisherActor | 219 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-prefix-configuration-shard-config: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/$g)]], initialState present=true]

Comment by Tomas Cere [ 04/May/17 ]

https://git.opendaylight.org/gerrit/#/c/56456/
https://git.opendaylight.org/gerrit/#/c/56472/

We think that if the create is called from multiple threads, it raced and multiple actors were created with the same queue. That would explain the mixed-up messages for the config and operational listeners, since the actors otherwise had no way to actually have an ActorRef to the counterpart in the other datastore.
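For illustration, here is a minimal, self-contained Java sketch of the effect being described - not the actual MeteredBoundedMailbox code, and shown deterministically rather than as a timing race: a factory that hands the same queue to two consumers means a message enqueued for one side can be drained by the other, matching the mixed-up config/operational delivery seen in the logs.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class SharedQueueIllustration {

        // Broken factory: caches the first queue it creates and hands the same
        // instance out on subsequent calls (the net effect of the race described
        // above, reproduced deterministically here for clarity).
        static class QueueFactory {
            private BlockingQueue<String> cached;

            synchronized BlockingQueue<String> create() {
                if (cached == null) {
                    cached = new LinkedBlockingQueue<>();
                }
                return cached;   // second caller gets the first caller's queue
            }
        }

        public static void main(String[] args) throws InterruptedException {
            QueueFactory factory = new QueueFactory();

            // "config" and "operational" consumers, each believing it has its own mailbox queue.
            BlockingQueue<String> configQueue = factory.create();
            BlockingQueue<String> operQueue = factory.create();

            startConsumer("operational", operQueue);

            // A message intended for the config side ends up consumed by the operational side,
            // because both queues are in fact the same object.
            configQueue.put("RegisterListener (meant for config)");
            TimeUnit.SECONDS.sleep(1);
        }

        private static void startConsumer(String name, BlockingQueue<String> queue) {
            Thread t = new Thread(() -> {
                try {
                    System.out.println(name + " consumer received: " + queue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.setDaemon(true);
            t.start();
        }
    }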

Comment by Tom Pantelis [ 04/May/17 ]

(In reply to Tomas Cere from comment #4)

yeah - good find. I forgot all about the MeteredBoundedMailbox.
