[CONTROLLER-1637] Intermittent failure in DistributedDataStoreIntegrationTest.testChangeListenerRegistration Created: 14/Apr/17 Updated: 25/Jul/23 Resolved: 21/Apr/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tom Pantelis | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8231 |
| Description |
|
This test failed on jenkins:

00:50:06 Tests run: 50, Failures: 1, Errors: 0, Skipped: 2, Time elapsed: 10.977 sec <<< FAILURE! - in org.opendaylight.controller.cluster.datastore.DistributedDataStoreIntegrationTest

It failed in the second run, for the tell-based protocol. The test writes the test model container, registers a DCL, then expects the initial created-data notification for the test container. Afterwards it writes a couple of entries to the contained "outer-list" and verifies the notification updates. However, in the failed run, the initial notification received by the DCL was:

12:50:04,36 AM [cluster-test-akka.actor.default-dispatcher-10] [DEBUG] DataChangeListener - Sending change notification DOMImmutableDataChangeEvent [created=[], updated=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test], removed=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[ {(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=1}], /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[ {(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=2}]/id, /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[ {(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=2}], /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list, /(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test/outer-list/outer-list[ {(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)id=1}]/id]] to listener org.opendaylight.controller.cluster.datastore.utils.MockDataChangeListener@d65e2b4

Right after, it received the expected notification:
12:50:04,38 AM [cluster-test-akka.actor.default-dispatcher-10] [DEBUG] DataChangeListener - Sending change notification DOMImmutableDataChangeEvent [created=[/(urn:opendaylight:params:xml:ns:yang:controller:md:sal:dom:store:test?revision=2014-03-13)test], updated=[], removed=[]] to listener org.opendaylight.controller.cluster.datastore.utils.MockDataChangeListener@d65e2b4

There are two problems here. First, the DistributedDataStoreIntegrationTest class doesn't clear the InMemoryJournal for each test case, so when testChangeListenerRegistration runs for the tell-based protocol it observes the data left over from the ask-based run. When the test overwrites the test model container, a remove notification is therefore generated for the stale outer-list entries.

The second problem is why the DCL received the remove notification at all, since the initial write completes before the DCL is registered; it should receive only the created-data notification. This is indirectly caused by the sharing of the ListenerTree between the ShardDataTree and the ShardDataTreeNotificationPublisherActor. On registration, the listener is first added to the shared ListenerTree, and a PublishNotifications message is then sent to the publisher actor to generate the initial notification using a local ListenerTree. The problem is that the listener is leaked to the shared ListenerTree prior to the initial notification, so if the publisher actor already has a queued PublishNotifications message for a prior update, the listener inadvertently observes that notification first.

To alleviate this, the ListenerTree should be owned by the ShardDataTreeNotificationPublisherActor and not shared with the ShardDataTree. On registration, the ShardDataTree should send a message to the ShardDataTreeNotificationPublisherActor to perform the on-boarding of the new listener, i.e. atomically generate and send the initial notification and then add the listener to the ListenerTree. |
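The proposed ownership change can be sketched roughly as follows. This is a minimal hypothetical model, not actual ODL code: the publisher actor's mailbox is stood in for by a single-threaded executor that owns the listener list, so registration is serialized with queued publications and a newly on-boarded listener can never observe a notification that was queued before it registered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for the ShardDataTreeNotificationPublisherActor:
// it owns the listener collection, and both publications and registrations
// run on the same single mailbox thread, in submission order.
class NotificationPublisher {
    private final List<Consumer<String>> listeners = new ArrayList<>();
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    // A data-change publication: delivered only to listeners that were
    // already on-boarded when this message is processed.
    void publish(String change) {
        mailbox.execute(() -> listeners.forEach(l -> l.accept(change)));
    }

    // On-boarding: atomically deliver the initial-state notification and
    // then add the listener, all within the mailbox thread. A publication
    // queued before this message can therefore never reach the listener.
    void register(Consumer<String> listener, String initialState) {
        mailbox.execute(() -> {
            listener.accept(initialState);
            listeners.add(listener);
        });
    }

    void shutdown() {
        mailbox.shutdown();
        try {
            mailbox.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this ordering, a listener registered after an earlier update sees only its initial notification plus later changes, which is exactly the behavior the failing test expects.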
| Comments |
| Comment by Robert Varga [ 14/Apr/17 ] |
|
Right, which effectively means the registration requests are proxied by ShardDataTree, which attaches the current state of the tree and forwards the request to the notifier actor. An alternative is to take a snapshot of the listener tree at the time when we are sending the request. This has the downside of blocking the shard actor's ability to process register/unregister requests until all previous notifications are dispatched. |
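The proxied registration Robert describes could carry the captured state alongside the listener in a single message, roughly like this (class and accessor names are hypothetical, not ODL code): the shard attaches the current tree state at registration time and forwards the request, so the notifier can emit the initial event in mailbox order, before any later publication.

```java
import java.util.Optional;

// Hypothetical registration request forwarded from the ShardDataTree to the
// notifier actor: it bundles the listener with the data tree state captured
// at registration time (empty if no data exists yet).
final class RegisterListenerRequest<L, S> {
    private final L listener;
    private final Optional<S> initialState;

    RegisterListenerRequest(L listener, Optional<S> initialState) {
        this.listener = listener;
        this.initialState = initialState;
    }

    L getListener() { return listener; }
    Optional<S> getInitialState() { return initialState; }
}
```

Because the state travels with the request, the notifier needs no access to a shared ListenerTree to build the initial notification.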
| Comment by Tom Pantelis [ 14/Apr/17 ] |
|
(In reply to Robert Varga from comment #1) Yeah. Having the ShardDataTreeNotificationPublisherActor own the ListenerTree and sending a registration message to it, as I proposed, is also problematic because the Shard's support class needs the actual ListenerRegistration immediately in order to create the registration actor that closes it. We could also move the registration actor creation to the ShardDataTreeNotificationPublisherActor, but other logic would have to move as well; essentially we'd have to refactor the support class into the ShardDataTreeNotificationPublisherActor, which might get a bit tricky. Another option is to initially disable the new listener (or drop notifications) until the initial notification is sent. I'll look at it further. |
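The "initially disable the new listener" option could look roughly like the wrapper below (a hypothetical sketch, not ODL code): anything delivered through the shared ListenerTree is dropped until the initial notification has been sent, so a queued prior update cannot leak through.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical gating wrapper around a data-change listener: notifications
// arriving before on-boarding completes are dropped, so the listener's first
// observed event is always its initial-state notification.
class GatedListener<T> implements Consumer<T> {
    private final Consumer<T> delegate;
    private final AtomicBoolean enabled = new AtomicBoolean(false);

    GatedListener(Consumer<T> delegate) {
        this.delegate = delegate;
    }

    // Called once with the initial-state notification; enables delivery.
    void deliverInitial(T initial) {
        delegate.accept(initial);
        enabled.set(true);
    }

    @Override
    public void accept(T change) {
        if (enabled.get()) { // drop anything queued before on-boarding
            delegate.accept(change);
        }
    }
}
```

The trade-off is that "drop" silently discards the stale notification rather than serializing it behind the initial one, which is acceptable here only because the initial notification already reflects the post-update state.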
| Comment by Tom Pantelis [ 17/Apr/17 ] |