Details
- Bug
- Status: Resolved
- Resolution: Done
- Helium
- Operating System: All
- Platform: All
- 2055
Description
Currently, if a shard isn't initialized yet (i.e. it hasn't fully recovered from persistence), the RegisterChangeListener request to the shard times out, and a NoOpDataChangeListenerRegistration is created, which is essentially a failed registration.
DistributedDataStore#registerChangeListener needs to be more resilient and use some mechanism to wait for the shard to be initialized before attempting the RegisterChangeListener request.
A potential solution:
In the ShardManager, on FindLocalShard, if the shard is not present send back LocalShardNotFound immediately. If the shard is present but not yet initialized, then record the sender locally but don't send back a response yet. When the ActorInitialized message is received from the shard, send the LocalShardFound response. In this manner, the ShardManager doesn't return LocalShardFound until the shard is initialized and ready for use.
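The deferred-reply idea above can be sketched without Akka as a small state model. This is only an illustration under stated assumptions: the class `ShardManagerSketch`, its method names, and the use of a `Consumer<String>` callback in place of the Akka sender are all hypothetical, and the message names are plain strings rather than the real message classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, Akka-free model of the proposed ShardManager behaviour.
class ShardManagerSketch {
    /** Per-shard state: the initialized flag plus the callers still waiting for a reply. */
    private static final class ShardInfo {
        boolean initialized;
        final List<Consumer<String>> pendingSenders = new ArrayList<>();
    }

    private final Map<String, ShardInfo> localShards = new HashMap<>();

    /** Registers a local shard that has started but not yet finished recovery. */
    void addLocalShard(String shardName) {
        localShards.put(shardName, new ShardInfo());
    }

    /** Models FindLocalShard; replyTo stands in for the Akka sender (an assumption). */
    void onFindLocalShard(String shardName, Consumer<String> replyTo) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            replyTo.accept("LocalShardNotFound"); // unknown shard: answer immediately
        } else if (info.initialized) {
            replyTo.accept("LocalShardFound");    // shard is ready: answer immediately
        } else {
            info.pendingSenders.add(replyTo);     // still recovering: record sender, no reply yet
        }
    }

    /** Models the ActorInitialized message arriving from the shard. */
    void onActorInitialized(String shardName) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            return;
        }
        info.initialized = true;
        // Flush every reply that was deferred while the shard was recovering.
        for (Consumer<String> replyTo : info.pendingSenders) {
            replyTo.accept("LocalShardFound");
        }
        info.pendingSenders.clear();
    }
}
```

In this model a caller that asks for a recovering shard hears nothing until `onActorInitialized` runs, so it needs its own (long) timeout to guard against a shard that never initializes.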
If shard recovery fails, RecoveryFailed is sent by Akka, but I don't think RecoveryComplete is sent. Either way, the shard needs to notify the ShardManager. The ShardManager needs to know the shard is ready for normal messages, but I don't think it needs to know whether recovery failed.
In DistributedDataStore#registerChangeListener, always create a DataChangeListenerRegistrationProxy. Move the code that performs the findLocalShard and RegisterChangeListener operations into DataChangeListenerRegistrationProxy. In DataChangeListenerRegistrationProxy, do the findLocalShard operation asynchronously (with a long or infinite timeout). On success, send the RegisterChangeListener message. A failure would essentially indicate that the shard doesn't exist and would be unrecoverable.
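A rough sketch of the proxy side, again without Akka: the proxy is created up front and only sends the registration once the asynchronous lookup completes. The class `ListenerRegistrationProxySketch`, its `State` enum, and the use of `CompletableFuture<String>` for the findLocalShard result are assumptions made for illustration, not the actual DataChangeListenerRegistrationProxy API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical stand-in for DataChangeListenerRegistrationProxy.
class ListenerRegistrationProxySketch {
    enum State { PENDING, REGISTERED, FAILED }

    private volatile State state = State.PENDING;

    /**
     * Starts the asynchronous registration.
     *
     * @param findLocalShard completes with the local shard's path, or fails if the shard does not exist
     * @param sendRegisterChangeListener models sending RegisterChangeListener to that shard
     */
    void init(CompletableFuture<String> findLocalShard, Consumer<String> sendRegisterChangeListener) {
        findLocalShard.whenComplete((shardPath, failure) -> {
            if (failure != null) {
                // Lookup failed: treated as unrecoverable, as described above.
                state = State.FAILED;
            } else {
                // Shard is initialized: only now send the registration.
                sendRegisterChangeListener.accept(shardPath);
                state = State.REGISTERED;
            }
        });
    }

    State state() {
        return state;
    }
}
```

The caller gets the proxy back immediately and never blocks on the shard; the registration simply stays `PENDING` until the lookup future completes.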
Another solution is to have the FindLocalShard operation fail fast with the NotInitializedException and have the DataChangeListenerRegistrationProxy schedule retries until it either succeeds or fails with some other error.
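The fail-fast alternative could look roughly like the following. It is a sketch only: `FindLocalShardRetrySketch`, `findWithRetry`, the fixed retry delay, and the local `NotInitializedException` class (a stand-in for the real exception) are all hypothetical, and the lookup is modelled as a plain `Supplier` rather than an actor message.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical retry helper for the fail-fast variant.
class FindLocalShardRetrySketch {
    /** Stand-in for the NotInitializedException mentioned above. */
    static class NotInitializedException extends RuntimeException {
        NotInitializedException(String message) {
            super(message);
        }
    }

    /** Retries the lookup until it succeeds or fails with some other error. */
    static <T> CompletableFuture<T> findWithRetry(Supplier<T> findLocalShard,
                                                  ScheduledExecutorService scheduler,
                                                  long retryDelayMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(findLocalShard, scheduler, retryDelayMillis, result);
        return result;
    }

    private static <T> void attempt(Supplier<T> findLocalShard,
                                    ScheduledExecutorService scheduler,
                                    long retryDelayMillis,
                                    CompletableFuture<T> result) {
        try {
            result.complete(findLocalShard.get());
        } catch (NotInitializedException e) {
            // Shard exists but is still recovering: schedule another attempt.
            scheduler.schedule(
                () -> attempt(findLocalShard, scheduler, retryDelayMillis, result),
                retryDelayMillis, TimeUnit.MILLISECONDS);
        } catch (RuntimeException e) {
            // Any other error is treated as final.
            result.completeExceptionally(e);
        }
    }
}
```

Compared with the deferred-reply approach, this keeps the ShardManager stateless with respect to waiting callers, at the cost of polling. A real implementation would probably also cap the number of retries, which this sketch does not do.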
Issue Links
- blocks: CONTROLLER-976 Clustering: Leaderless default shard during feature installation. (Resolved)