[CONTROLLER-890] Clustering: Handle shard initialization failures in DistributedDataStore#registerChangeListener Created: 23/Sep/14  Updated: 10/Nov/14  Resolved: 10/Nov/14

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-976 Clustering: Leaderless default shard ... Resolved
External issue ID: 2055

 Description   

Currently, if a shard isn't initialized yet (i.e. hasn't fully recovered from persistence), the RegisterChangeListener request to the shard times out and a NoOpDataChangeListenerRegistration is created, which is essentially a failed registration.

DistributedDataStore#registerChangeListener needs to be more resilient and use some mechanism to wait for the shard to be initialized before attempting the RegisterChangeListener request.

A potential solution:

In the ShardManager, on FindLocalShard, if the shard is not present, send back LocalShardNotFound immediately. If the shard is present but not yet initialized, record the sender locally but don't send back a response yet. When the ActorInitialized message is received from the shard, send the LocalShardFound response to the recorded senders. In this manner, the ShardManager doesn't return LocalShardFound until the shard is initialized and ready for use.
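The deferral described above can be sketched roughly as follows. This is only an illustration in plain Java with no Akka dependency: ShardManagerSketch, the Reply enum, and the Consumer callback standing in for the message sender are all hypothetical names, not the actual ShardManager code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for the ShardManager's FindLocalShard handling (no Akka). */
final class ShardManagerSketch {
    /** Replies a sender can receive; the names mirror the messages in the description. */
    enum Reply { LOCAL_SHARD_FOUND, LOCAL_SHARD_NOT_FOUND }

    /** Per-shard state: whether it is initialized, plus the senders waiting for it. */
    private static final class ShardInfo {
        boolean initialized;
        final List<Consumer<Reply>> waitingSenders = new ArrayList<>();
    }

    private final Map<String, ShardInfo> localShards = new HashMap<>();

    /** Registers a local shard that has been created but has not finished recovery. */
    void addLocalShard(String shardName) {
        localShards.put(shardName, new ShardInfo());
    }

    /** FindLocalShard: reply immediately unless the shard exists but is still recovering. */
    void onFindLocalShard(String shardName, Consumer<Reply> sender) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            sender.accept(Reply.LOCAL_SHARD_NOT_FOUND);
        } else if (info.initialized) {
            sender.accept(Reply.LOCAL_SHARD_FOUND);
        } else {
            // Record the sender and defer the response until ActorInitialized arrives.
            info.waitingSenders.add(sender);
        }
    }

    /** ActorInitialized: mark the shard ready and answer every deferred sender. */
    void onActorInitialized(String shardName) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            return;
        }
        info.initialized = true;
        for (Consumer<Reply> sender : info.waitingSenders) {
            sender.accept(Reply.LOCAL_SHARD_FOUND);
        }
        info.waitingSenders.clear();
    }
}
```

In this sketch a caller that asks for an uninitialized shard simply receives nothing until onActorInitialized runs, so the caller's own timeout determines how long it is willing to wait.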

If shard recovery fails, Akka sends RecoveryFailure, but I don't think RecoveryCompleted is sent. Either way, the shard needs to notify the ShardManager. The ShardManager needs to know the shard is ready for normal messages, but I don't think it needs to know whether recovery failed.

In DistributedDataStore#registerChangeListener, always create a DataChangeListenerRegistrationProxy. Move the code that performs the findLocalShard and RegisterChangeListener operations into DataChangeListenerRegistrationProxy. In DataChangeListenerRegistrationProxy, perform the findLocalShard operation asynchronously (with a long or infinite timeout). On success, send the RegisterChangeListener message. A failure would essentially indicate the shard doesn't exist and would be unrecoverable.
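A minimal sketch of that proxy flow, assuming the two operations are exposed as futures. RegistrationProxySketch and its constructor parameters are hypothetical; the real proxy would use Akka ask futures and actor references rather than CompletableFuture and strings.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Hypothetical stand-in for DataChangeListenerRegistrationProxy. */
final class RegistrationProxySketch {
    /** Completes with the registration once both asynchronous steps have succeeded. */
    private final CompletableFuture<String> registration;

    /**
     * @param findLocalShard         pending result of the findLocalShard operation (assumed shape)
     * @param registerChangeListener sends RegisterChangeListener to the shard that was found (assumed shape)
     */
    RegistrationProxySketch(CompletableFuture<String> findLocalShard,
                            Function<String, CompletableFuture<String>> registerChangeListener) {
        // The listener is registered only after findLocalShard succeeds;
        // a failed lookup propagates and leaves the proxy in a failed state.
        this.registration = findLocalShard.thenCompose(registerChangeListener);
    }

    CompletableFuture<String> registration() {
        return registration;
    }

    boolean isRegistered() {
        return registration.isDone() && !registration.isCompletedExceptionally();
    }

    boolean isFailed() {
        return registration.isCompletedExceptionally();
    }
}
```

The caller always gets a proxy back immediately; whether the registration has actually happened is a property of the proxy that changes over time.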

Another solution is to have the FindLocalShard operation fail fast with the NotInitializedException and have the DataChangeListenerRegistrationProxy schedule retries until it either succeeds or fails with some other error.
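The retry alternative might look roughly like this. Again this is only a sketch: FindLocalShardRetrier, the Supplier-based lookup, and the fixed retry delay are assumptions made for illustration, and the NotInitializedException declared here is a local stand-in for the exception named above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Local stand-in for the NotInitializedException named in the description. */
final class NotInitializedException extends RuntimeException {
    NotInitializedException(String message) {
        super(message);
    }
}

/** Hypothetical helper that retries a fail-fast findLocalShard lookup. */
final class FindLocalShardRetrier {
    private final ScheduledExecutorService scheduler;
    private final long retryDelayMillis;

    FindLocalShardRetrier(ScheduledExecutorService scheduler, long retryDelayMillis) {
        this.scheduler = scheduler;
        this.retryDelayMillis = retryDelayMillis;
    }

    /** Runs the lookup, rescheduling it for as long as it fails with NotInitializedException. */
    <T> CompletableFuture<T> findWithRetry(Supplier<T> findLocalShard) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(findLocalShard, result);
        return result;
    }

    private <T> void attempt(Supplier<T> findLocalShard, CompletableFuture<T> result) {
        try {
            result.complete(findLocalShard.get());
        } catch (NotInitializedException e) {
            // Transient: the shard exists but has not finished recovery, so try again later.
            scheduler.schedule(() -> attempt(findLocalShard, result),
                    retryDelayMillis, TimeUnit.MILLISECONDS);
        } catch (RuntimeException e) {
            // Any other error is treated as unrecoverable.
            result.completeExceptionally(e);
        }
    }
}
```

This sketch retries indefinitely; a real implementation would probably also want a retry limit or an overall deadline so that a shard that never initializes does not keep the retries going forever.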



 Comments   
Comment by Tom Pantelis [ 21/Oct/14 ]

Submitted https://git.opendaylight.org/gerrit/#/c/11994/

Comment by Tom Pantelis [ 24/Oct/14 ]

Also submitted https://git.opendaylight.org/gerrit/#/c/12215/ to make transaction creation more resilient to transient scenarios, i.e. the shard not initialized yet and no leader elected yet, with limited waits or retries.
