  1. controller
  2. CONTROLLER-890

Clustering: Handle shard initialization failures in DistributedDataStore#registerChangeListener

Details

    • Bug
    • Status: Resolved
    • Resolution: Done
    • Helium
    • None
    • mdsal
    • None
    • Operating System: All
      Platform: All

    • 2055

    Description

      Currently, if a shard isn't initialized yet (i.e. it hasn't fully recovered from persistence), the RegisterChangeListener request to the shard times out. This results in a NoOpDataChangeListenerRegistration being created, which is essentially a failed registration.

      DistributedDataStore#registerChangeListener needs to be more resilient and use some mechanism to wait for the shard to be initialized before attempting the RegisterChangeListener request.

      A potential solution:

      In the ShardManager, on FindLocalShard, if the shard is not present, send back LocalShardNotFound immediately. If the shard is present but not yet initialized, record the sender locally but don't send back a response yet. When the ActorInitialized message is received from the shard, send the LocalShardFound response to the recorded senders. In this manner, the ShardManager doesn't return LocalShardFound until the shard is initialized and ready for use.
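
A minimal sketch of the deferred-reply idea described above, written without Akka so it stays self-contained. The class name `ShardManagerSketch`, its method names, and the use of plain strings and `Consumer` callbacks in place of actor messages and senders are all assumptions for illustration, not the actual ShardManager code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical, Akka-free model of the proposed ShardManager behaviour.
// Reply strings stand in for the LocalShardFound / LocalShardNotFound messages.
class ShardManagerSketch {
    // Per-shard state: whether it has initialized, plus the senders waiting on it.
    private static final class ShardInfo {
        boolean initialized;
        final List<Consumer<String>> pendingSenders = new ArrayList<>();
    }

    private final Map<String, ShardInfo> localShards = new HashMap<>();

    // Registers a local shard that exists but has not finished recovery yet.
    void addLocalShard(String shardName) {
        localShards.put(shardName, new ShardInfo());
    }

    // Models FindLocalShard: 'sender' receives the reply, possibly later.
    void onFindLocalShard(String shardName, Consumer<String> sender) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            sender.accept("LocalShardNotFound"); // unknown shard: reply immediately
        } else if (info.initialized) {
            sender.accept("LocalShardFound");    // ready: reply immediately
        } else {
            info.pendingSenders.add(sender);     // not ready: record sender, defer reply
        }
    }

    // Models ActorInitialized: mark the shard ready and answer deferred senders.
    void onActorInitialized(String shardName) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            return; // not a shard we manage
        }
        info.initialized = true;
        List<Consumer<String>> waiting = new ArrayList<>(info.pendingSenders);
        info.pendingSenders.clear();
        for (Consumer<String> sender : waiting) {
            sender.accept("LocalShardFound");
        }
    }
}
```

In this model a caller that asks for an uninitialized shard simply hears nothing until `onActorInitialized` runs, which is why the caller side would need a long or infinite timeout.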

      If shard recovery fails, RecoveryFailed is sent by Akka, but I don't think RecoveryComplete is sent. Either way, the shard needs to notify the ShardManager. The ShardManager needs to know that the shard is ready for normal messages, but I don't think it needs to know whether recovery failed.

      In DistributedDataStore#registerChangeListener, always create a DataChangeListenerRegistrationProxy. Move the code that performs the findLocalShard and RegisterChangeListener operations into DataChangeListenerRegistrationProxy. In DataChangeListenerRegistrationProxy, perform the findLocalShard operation asynchronously (with a long or infinite timeout). On success, send the RegisterChangeListener message. A failure would essentially indicate that the shard doesn't exist and would be unrecoverable.
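
The proxy flow above could look roughly like the following. This is only a sketch: it uses `CompletableFuture` in place of Akka futures, and `RegistrationProxySketch`, `LocalShard`, `init`, and the `State` enum are hypothetical names introduced here, not the real DataChangeListenerRegistrationProxy API.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for DataChangeListenerRegistrationProxy.
class RegistrationProxySketch {
    enum State { PENDING, REGISTERED, FAILED }

    // Hypothetical handle for a located, initialized shard.
    interface LocalShard {
        void registerChangeListener(String path);
    }

    private State state = State.PENDING;

    // Attaches to the asynchronous findLocalShard result without blocking the
    // caller. The future completes once the shard is ready, or exceptionally
    // if the shard does not exist.
    void init(CompletableFuture<LocalShard> findLocalShard, String path) {
        findLocalShard.whenComplete((shard, failure) -> onLookupDone(shard, failure, path));
    }

    private synchronized void onLookupDone(LocalShard shard, Throwable failure, String path) {
        if (failure != null) {
            state = State.FAILED; // treated as unrecoverable, per the description
            return;
        }
        shard.registerChangeListener(path); // models sending RegisterChangeListener
        state = State.REGISTERED;
    }

    synchronized State state() {
        return state;
    }
}
```

The caller always gets a proxy back right away; whether the registration eventually succeeds is tracked inside the proxy rather than decided at creation time.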

      Another solution is to have the FindLocalShard operation fail fast with the NotInitializedException and have the DataChangeListenerRegistrationProxy schedule retries until it either succeeds or fails with some other error.
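
The fail-fast alternative could be sketched as a scheduled retry loop. Everything below is an assumption for illustration: the `NotInitializedException` class is a local stand-in for the exception named above, and `FindLocalShardRetrySketch`, `findWithRetry`, and the fixed retry delay are hypothetical choices rather than anything taken from the controller code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Local stand-in for the NotInitializedException mentioned in the description.
class NotInitializedException extends RuntimeException {
    NotInitializedException(String message) {
        super(message);
    }
}

class FindLocalShardRetrySketch {
    private final ScheduledExecutorService scheduler;
    private final long retryDelayMillis;

    FindLocalShardRetrySketch(ScheduledExecutorService scheduler, long retryDelayMillis) {
        this.scheduler = scheduler;
        this.retryDelayMillis = retryDelayMillis;
    }

    // Calls 'findLocalShard' until it returns a value or fails with anything
    // other than NotInitializedException.
    <T> CompletableFuture<T> findWithRetry(Supplier<T> findLocalShard) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(findLocalShard, result);
        return result;
    }

    private <T> void attempt(Supplier<T> findLocalShard, CompletableFuture<T> result) {
        try {
            result.complete(findLocalShard.get());
        } catch (NotInitializedException e) {
            // Shard exists but is still recovering: schedule another attempt.
            scheduler.schedule(() -> attempt(findLocalShard, result),
                    retryDelayMillis, TimeUnit.MILLISECONDS);
        } catch (RuntimeException e) {
            // Any other error ends the retries and fails the registration.
            result.completeExceptionally(e);
        }
    }
}
```

Compared with the deferred-reply approach, this keeps the ShardManager stateless with respect to waiting callers, at the cost of repeated lookups while the shard recovers.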

            People

              tpantelis Tom Pantelis
              tpantelis Tom Pantelis