[CONTROLLER-1642] DistributedShardedDOMDataTreeRemotingTest.testProducerRegistrations fails intermittently Created: 25/Apr/17 Updated: 25/Jul/23 Resolved: 06/May/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tom Pantelis | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 8301 |
| Description |
|
testProducerRegistrations(org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTreeRemotingTest) Time elapsed: 108.576 sec <<< ERROR! Since this is happening in initEmptyDatastores it could affect any of the tests. I've looked into it but haven't found the cause yet. It seems to happen if member-2 becomes the leader of prefix-configuration shard. The PrefixedShardConfigUpdateHandler DTCL for the ShardManager doesn't get notified so the local default shard on member-1 isn't found. I've also seen it fail where the PrefixedShardConfigUpdateHandler DTCL for the DistributedShardedDOMDataTree doesn't get notified. For now I'll submit a patch to ignore the tests. |
| Comments |
| Comment by Tom Pantelis [ 27/Apr/17 ] |
|
I fixed a couple things but still seeing a really weird scenario wrt the DTCL registations for the prefix-configuration-shard. I've added more debug and the following shows the ShardDataTreeChangeListenerPublisherActorProxy sending 09:03:42,342 PM [cluster-test-akka.actor.default-dispatcher-9] [DEBUG] ShardDataTreeChangeListenerPublisherActorProxy - member-1-shard-prefix-configuration-shard-config: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$e)]], initialState present=true] to publisher actor Actorakka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher#1456071537, ACTORREF: 1171733008, REF: 2141773456 The FE DTCL actor's path ends in "/user/$e". This is for the CONFIG shard: The ShardDataTreeChangePublisherActor for member-1-shard-prefix-configuration-shard-config receives the message as expected: 09:03:42,346 PM [cluster-test-akka.actor.default-dispatcher-2] [DEBUG] ShardDataTreeChangePublisherActor - member-1-shard-prefix-configuration-shard-config: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$e)]], initialState present=true]: path: akka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher, ref: 446773763 So far so good. Shortly thereafter another DTCL is registered for the same shard. Here's the ShardDataTreeChangeListenerPublisherActorProxy log message: 09:03:42,346 PM [cluster-test-akka.actor.default-dispatcher-9] [DEBUG] ShardDataTreeChangeListenerPublisherActorProxy - member-1-shard-prefix-configuration-shard-config: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$h)]], initialState present=true] to publisher actor Actorakka://cluster-test/user/shardmanager-config/member-1-shard-prefix-configuration-shard-config/member-1-shard-prefix-configuration-shard-config-DTCL-publisher#1456071537, ACTORREF: 1171733008, REF: 2141773456 The DTCL listener actor path ends in "user/$h". Same ACTORREF, REF, and 09:03:42,351 PM [cluster-test-akka.actor.default-dispatcher-12] [DEBUG] ShardDataTreeChangePublisherActor - member-1-shard-prefix-configuration-shard-operational: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://cluster-test/), Path(/user/$h)]], initialState present=true]: path: akka://cluster-test/user/shardmanager-operational/member-1-shard-prefix-configuration-shard-operational/member-1-shard-prefix-configuration-shard-operational-DTCL-publisher, ref: 202318682 This only happens very intermittently. Most times it takes up to 100 runs for it to happen (once was about 250). I'm scratching my head. It seems like akka is routing the message to the wrong actor - that's the only thing I can see at this point. |
| Comment by Tom Pantelis [ 27/Apr/17 ] |
|
Submitted https://git.opendaylight.org/gerrit/#/c/56150/ to fix a couple issues. |
| Comment by Jakub Morvay [ 02/May/17 ] |
|
(In reply to Tom Pantelis from comment #1) Hi Tom, We have also witnessed this strange behavior. The only difference is that we can see this on almost half of our test runs. Our test is pretty simple. We have 3-node cluster and we are creating a prefix based shard in config DS. I can attach logs here, but basically there is nothing new in them. It boils down to bad registration. 2017-04-28 14:51:45,409 | DEBUG | lt-dispatcher-49 | hangeListenerPublisherActorProxy | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-prefix-configuration-shard-operational: Sending RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/$g)]], initialState present=true] to publisher actor Actorakka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-prefix-configuration-shard-operational/member-1-shard-prefix-configuration-shard-operational-DTCL-publisher#1253759913 2017-04-28 14:51:45,412 | DEBUG | on-dispatcher-57 | hardDataTreeChangePublisherActor | 219 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-prefix-configuration-shard-config: Received RegisterListener [path=/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:clustering:prefix-shard-configuration?revision=2017-01-10)prefix-shards/shard, listener=ForwardingDataTreeChangeListener [actor=ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/$g)]], initialState present=true] |
| Comment by Tomas Cere [ 04/May/17 ] |
|
https://git.opendaylight.org/gerrit/#/c/56456/ We think if the create is called from multiple threads it raced and multiple actors were created with the same queue which would explain the mixed up messages for the config and operational listeners when the actors otherwise had no way to actually have an ActorRef to the counterpart in the otherdatastore. |
| Comment by Tom Pantelis [ 04/May/17 ] |
|
(In reply to Tomas Cere from comment #4) yeah - good find. I forgot all about the MeteredBoundedMailbox. |