[CONTROLLER-1711] Listener registration lost when local replica is removed Created: 07/Jun/17  Updated: 25/Jul/23  Resolved: 05/Oct/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8629

 Description   

This was seen on Sandbox, on code that is already merged to stable/carbon.

The title says "possibly" because, of the two similar test cases, only one failed [0]: the one where the listener was on the same member as the shard leader, which was then moved by calling remove-prefix-shard-replica.

In the huge karaf.log [1] (with some debug logging enabled), look between the removal starting at 19:23:10,608 and the suite adding the replica back at 19:23:41,773.

[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/log.html.gz#s1-s20-t1-k2-k16-k2-k1-k4-k7-k1
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl3_karaf.log.gz



 Comments   
Comment by Tom Pantelis [ 08/Jun/17 ]

So, from my understanding, a DTCL is registered for shard A on member1, then shard A is removed from member1. Later, shard A is re-added to member1 and the DTCL is not notified. Is this the scenario? If so, that's to be expected, as the DTCL registrations belong to the shard and thus "go away" when the shard does. Kind of an edge case...
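
To make the described behavior concrete, here is a toy model of registrations that are owned by a local shard replica. It is only a sketch: ToyMember, ToyShard and all method names are hypothetical and are not controller APIs, and the real implementation is certainly more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Toy model only; class and method names are hypothetical, not controller APIs. */
final class ToyMember {
    /** A local shard replica that owns its listener registrations. */
    private static final class ToyShard {
        final List<Consumer<String>> listeners = new ArrayList<>();
    }

    private final Map<String, ToyShard> localShards = new HashMap<>();

    /** Creates an empty local replica if none exists yet. */
    void addReplica(String shardName) {
        localShards.putIfAbsent(shardName, new ToyShard());
    }

    /** Dropping the replica also drops every registration it owned. */
    void removeReplica(String shardName) {
        localShards.remove(shardName);
    }

    /** Returns false if there is no local replica to register with. */
    boolean registerListener(String shardName, Consumer<String> listener) {
        ToyShard shard = localShards.get(shardName);
        if (shard == null) {
            return false;
        }
        shard.listeners.add(listener);
        return true;
    }

    /** Delivers a change to the local replica's listeners; returns how many were notified. */
    int commit(String shardName, String change) {
        ToyShard shard = localShards.get(shardName);
        if (shard == null) {
            return 0;
        }
        shard.listeners.forEach(l -> l.accept(change));
        return shard.listeners.size();
    }
}
```

In this model, removing and re-adding the replica yields a fresh shard with no listeners, so a listener registered before the removal is never notified again unless it is registered anew.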

Comment by Vratko Polak [ 09/Jun/17 ]

Now seen on RelEng [5].

> Later shard A is re-added to member1

That is only done in test teardown, so that is not part of the scenario.

> DTCL is registered for shard A on member1 then shard A is removed from member1.

This is a DDTL [2], not a DTCL.

If the listener is on a follower, removing the leader's shard replica does not lead to failures [3]. So either the new leader continues to send notifications without missing any, or the new leader does not commit anything, which I would expect to lead to errors in the producers [4].

[2] https://github.com/opendaylight/controller/blob/5997e14efab9c12e7be2b7fb83f7efe16c2bfe7c/opendaylight/md-sal/samples/clustering-test-app/provider/src/main/java/org/opendaylight/controller/clustering/it/provider/impl/IdIntsDOMDataTreeLIstener.java#L23
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/log.html.gz#s1-s38-t3-k2-k16-k2-k1-k4-k7-k1
[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/log.html.gz#s1-s38-t3-k2-k14
[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/log.html.gz#s1-s38-t1-k2-k16-k2-k1-k4-k7-k1

Comment by Vratko Polak [ 13/Jun/17 ]

Sandbox testing results.
Transaction producers confirmed a successful end at 11:41:45.908 [6].
The unsubscribe-ddtl was called between 11:41:46.006 and 11:41:47.293 [7], but the (huge) karaf.log [8] shows a change being received as late as 09:42:03,183, just because the member was restarted at that point [9].

Is there a reasonable way to wait until no notification is being processed?

[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/21/log.html.gz#s1-s4-t1-k2-k14
[7] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/21/log.html.gz#s1-s4-t1-k2-k16-k2-k1-k4-k6-k1
[8] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/21/odl3_karaf.log.gz
[9] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/21/log.html.gz#s1-s4-t2-k2-k1-k3-k2-k3-k1-k2-k2-k1-k2-k1-k6
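
Regarding the question above about waiting until no notification is being processed: one possible approach, sketched here under assumptions, is to have the test listener record its own activity and let the test poll for an idle window. QuiescenceTracker and its methods are hypothetical names, not part of the controller or the test app, and the check is only a heuristic: it cannot see notifications that are still queued upstream of the listener.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical test helper: tracks listener callbacks so a caller can wait for an idle window. */
final class QuiescenceTracker {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicLong lastActivityNanos = new AtomicLong(System.nanoTime());

    /** Wrap the body of the listener callback with this method. */
    void onNotification(Runnable body) {
        inFlight.incrementAndGet();
        try {
            body.run();
        } finally {
            // Record the finish time before decrementing, so an observer that
            // sees inFlight == 0 also sees a fresh timestamp.
            lastActivityNanos.set(System.nanoTime());
            inFlight.decrementAndGet();
        }
    }

    /**
     * Polls until no callback is running and none has finished within the last
     * idleMillis. Returns false if timeoutMillis elapses first or the thread
     * is interrupted.
     */
    boolean awaitQuiescence(long idleMillis, long timeoutMillis) {
        final long idleNanos = TimeUnit.MILLISECONDS.toNanos(idleMillis);
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() - deadline < 0) {
            boolean idle = inFlight.get() == 0
                    && System.nanoTime() - lastActivityNanos.get() >= idleNanos;
            if (idle) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

A test could call awaitQuiescence after the producers finish and before calling unsubscribe-ddtl, choosing the idle window to be comfortably longer than the expected gap between notifications.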

Comment by Vratko Polak [ 23/Jun/17 ]

> the new leader continues to send notifications without missing any

Okay, so this is where I was wrong. The current cluster-wide listener implementation apparently relies on local replica data. No local replica, no data change notifications.

Since this Bug now tracks the missing functionality, I will open another one for listener failures not related to a missing local replica.

Generated at Wed Feb 07 19:56:15 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.