[CONTROLLER-1637] Intermittent failure in DistributedDataStoreIntegrationTest.testChangeListenerRegistration Created: 14/Apr/17  Updated: 25/Jul/23  Resolved: 21/Apr/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8231

 Description   

This test failed on jenkins:

00:50:06 Tests run: 50, Failures: 1, Errors: 0, Skipped: 2, Time elapsed: 10.977 sec <<< FAILURE! - in org.opendaylight.controller.cluster.datastore.DistributedDataStoreIntegrationTest
00:50:06 testChangeListenerRegistration[class org.opendaylight.controller.cluster.databroker.ClientBackedDataStore](org.opendaylight.controller.cluster.datastore.DistributedDataStoreIntegrationTest) Time elapsed: 0.108 sec <<< FAILURE!
00:50:06 java.lang.AssertionError: Change 1 does not contain /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test. Actual: {}
00:50:06 at org.junit.Assert.fail(Assert.java:88)
00:50:06 at org.junit.Assert.assertTrue(Assert.java:41)
00:50:06 at org.opendaylight.controller.cluster.datastore.utils.MockDataChangeListener.waitForChangeEvents(MockDataChangeListener.java:66)
00:50:06 at org.opendaylight.controller.cluster.datastore.DistributedDataStoreIntegrationTest$21.<init>(DistributedDataStoreIntegrationTest.java:1178)
00:50:06 at org.opendaylight.controller.cluster.datastore.DistributedDataStoreIntegrationTest.testChangeListenerRegistration(DistributedDataStoreIntegrationTest.java:1162)

It failed in the second run for the tell-based protocol. The test writes the test model container, registers a DCL, then expects the initial created data notification for the test container. Afterwards it writes a couple entries to the contained "outer-list" and verifies the notification updates.

However, in the failed run, the initial notification received by the DCL was:

12:50:04,36 AM [cluster-test-akka.actor.default-dispatcher-10] [DEBUG] DataChangeListener - Sending change notification DOMImmutableDataChangeEvent [created=[], updated=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test], removed=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[

{(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=1}

], /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[

{(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=2}

]/id, /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[

{(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=2}

], /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list, /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[

{(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=1}

]/id]] to listener org.opendaylight.controller.cluster.datastore.utils.MockDataChangeListener@d65e2b4

Right after it received the expected notification:

12:50:04,38 AM [cluster-test-akka.actor.default-dispatcher-10] [DEBUG] DataChangeListener - Sending change notification DOMImmutableDataChangeEvent [created=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test], updated=[], removed=[]] to listener org.opendaylight.controller.cluster.datastore.utils.MockDataChangeListener@d65e2b4

There's 2 problems here. One is that the DistributedDataStoreIntegrationTest class doesn't clear the InMemoryJournal for each test case. So when testChangeListenerRegistration runs for the tell-based protocol it observes the data left over from the ask-based test run. So when the test overwrites the test model container, a remove notification is generated for the stale outer list entries.

The second problem is why the DCL even received the remove notification in the first place since the initial write completes before the DCL is registered. It should just receive the created data notification. This is indirectly due to the sharing of the ListenerTree between the ShardDataTree and the ShardDataTreeNotificationPublisherActor. On registration, the listener is first added to the shared ListenerTree and then it sends a PublishNotifications message to the publisher actor to generate the initial notification using a local ListenerTree. The problem here is that listener is leaked to the shared ListenerTree prior to the initial notification so if the publisher actor already has a queued PublishNotifications message for a prior update, the listener will inadvertently observe that notification first.

To alleviate this, we need to have the ListenerTree owned by the ShardDataTreeNotificationPublisherActor and not share it with the ShardDataTree. On registration, the ShardDataTree should send a message to the ShardDataTreeNotificationPublisherActor to perform the on-boarding of the new listener, ie atomically generate and send the initial notification and then add the listener to the ListenerTree.



 Comments   
Comment by Robert Varga [ 14/Apr/17 ]

Right, which effectively means the registration requests are proxied by ShardDataTree, which attaches the current state of the tree and forwards the request to the notifier actor.

An alternative is to take a snapshot of the listener tree at the time when we are sending the request. This has the downside of blocking the shard actor's ability to process register/unregister requests until all previous notifications are dispatched.

Comment by Tom Pantelis [ 14/Apr/17 ]

(In reply to Robert Varga from comment #1)
> Right, which effectively means the registration requests are proxied by
> ShardDataTree, which attaches the current state of the tree and forwards the
> request to the notifier actor.
>
> An alternative is to take a snapshot of the listener tree at the time when
> we are sending the request. This has the downside of blocking the shard
> actor's ability to process register/unregister requests until all previous
> notifications are dispatched.

yeah.

Having the ShardDataTreeNotificationPublisherActor own the ListenerTree and sending s registration message to it as I proposed is also problematic b/c the Shard's support class needs the actual ListenerRegistration immediately in order to create the registration actor that closes it. We could also move the registration actor creation to the ShardDataTreeNotificationPublisherActor but other logic would have to move as well. Essentially we'd have to basically refactor the support class into the ShardDataTreeNotificationPublisherActor - that might get a bit tricky.

Another option is to initially disable the new listener (or drop notifications) until the initial notification is sent.

I'll look at it further.

Comment by Tom Pantelis [ 17/Apr/17 ]

master - https://git.opendaylight.org/gerrit/#/c/55108/

Generated at Wed Feb 07 19:56:03 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.