[CONTROLLER-890] Clustering: Handle shard initialization failures in DistributedDataStore#registerChangeListener Created: 23/Sep/14 Updated: 10/Nov/14 Resolved: 10/Nov/14 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | mdsal |
| Affects Version/s: | Helium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Tom Pantelis | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| External issue ID: | 2055 | ||||||||
| Description |
|
Currently, if a shard isn't initialized yet (i.e. hasn't fully recovered from persistence), the RegisterChangeListener request to the shard times out, resulting in a NoOpDataChangeListenerRegistration being created, which is essentially a failed registration. DistributedDataStore#registerChangeListener needs to be more resilient and use some mechanism to wait for the shard to be initialized before attempting the RegisterChangeListener request.

A potential solution:

In the ShardManager, on FindLocalShard, if the shard is not present, send back LocalShardNotFound immediately. If the shard is present but not yet initialized, record the sender locally but don't send back a response yet. When the ActorInitialized message is received from the shard, send the LocalShardFound response to the recorded senders. In this manner, the ShardManager doesn't return LocalShardFound until the shard is initialized and ready for use.

If shard recovery fails, RecoveryFailed is sent by akka but I don't think RecoveryComplete is sent. Either way, the shard needs to notify the ShardManager. The ShardManager needs to know the shard is ready for normal messages, but I don't think it needs to know whether recovery failed.

In DistributedDataStore#registerChangeListener, always create a DataChangeListenerRegistrationProxy. Move the code that performs the findLocalShard and RegisterChangeListener operations into DataChangeListenerRegistrationProxy. In DataChangeListenerRegistrationProxy, do the findLocalShard operation asynchronously (with a long or infinite timeout). On success, send the RegisterChangeListener message. A failure would essentially indicate that the shard doesn't exist and would be unrecoverable.

Another solution is to have the FindLocalShard operation fail fast with the NotInitializedException and have the DataChangeListenerRegistrationProxy schedule retries until it either succeeds or fails with some other error. |
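The deferred-reply idea in the first proposal could be sketched roughly as below. This is a minimal, Akka-free model and not the actual ShardManager code: the class `ShardLookupSketch`, its methods, and the use of a `Consumer` callback in place of an actor sender are all assumptions made for illustration. Only the message names (FindLocalShard, LocalShardFound, LocalShardNotFound, ActorInitialized) come from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical, Akka-free model of how the ShardManager could handle FindLocalShard. */
public class ShardLookupSketch {
    /** Replies mirroring the message names used in the description. */
    public enum Reply { LOCAL_SHARD_FOUND, LOCAL_SHARD_NOT_FOUND }

    /** Per-shard state: whether it is initialized and who is waiting for it. */
    private static final class ShardInfo {
        boolean initialized;
        final List<Consumer<Reply>> waitingSenders = new ArrayList<>();
    }

    private final Map<String, ShardInfo> localShards = new HashMap<>();

    /** Registers a local shard that has not finished recovery yet. */
    public void addShard(String shardName) {
        localShards.put(shardName, new ShardInfo());
    }

    /** FindLocalShard: reply at once if possible, otherwise park the sender. */
    public void onFindLocalShard(String shardName, Consumer<Reply> sender) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            // Unknown shard: fail immediately.
            sender.accept(Reply.LOCAL_SHARD_NOT_FOUND);
        } else if (info.initialized) {
            sender.accept(Reply.LOCAL_SHARD_FOUND);
        } else {
            // Shard exists but is still recovering: record the sender, no reply yet.
            info.waitingSenders.add(sender);
        }
    }

    /** ActorInitialized: mark the shard ready and answer every parked sender. */
    public void onActorInitialized(String shardName) {
        ShardInfo info = localShards.get(shardName);
        if (info == null) {
            return;
        }
        info.initialized = true;
        for (Consumer<Reply> sender : info.waitingSenders) {
            sender.accept(Reply.LOCAL_SHARD_FOUND);
        }
        info.waitingSenders.clear();
    }
}
```

With this shape, a caller that asks for a recovering shard simply sees a late LocalShardFound instead of a timeout, which is why the proxy side would need a long or infinite timeout on the lookup.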
| Comments |
| Comment by Tom Pantelis [ 21/Oct/14 ] |
| Comment by Tom Pantelis [ 24/Oct/14 ] |
|
Also submitted https://git.opendaylight.org/gerrit/#/c/12215/ to make transaction creation more resilient to transient scenarios, i.e. the shard not initialized yet and no leader elected yet, with limited waits or retries. |
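As an illustration of what "limited retries" can look like, here is a small generic sketch. It is not taken from the Gerrit change above: `BoundedRetrySketch`, `NotReadyException`, and the parameters `maxAttempts` and `delayMillis` are hypothetical names standing in for a transient condition such as an uninitialized shard or a missing leader.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retry an operation a limited number of times on transient failures. */
public class BoundedRetrySketch {
    /** Stand-in for a transient condition (shard not initialized, no leader yet). */
    public static class NotReadyException extends RuntimeException {
        public NotReadyException(String message) {
            super(message);
        }
    }

    /**
     * Runs the operation up to maxAttempts times, sleeping delayMillis between
     * attempts. Only NotReadyException triggers a retry; any other exception
     * propagates immediately as a non-transient error.
     */
    public static <T> T retry(Supplier<T> operation, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        NotReadyException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (NotReadyException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(delayMillis);
                    } catch (InterruptedException interrupted) {
                        // Preserve the interrupt flag and give up early.
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
        // All attempts failed with the transient error.
        throw lastFailure;
    }
}
```

Capping the attempts keeps a caller from blocking forever if the shard never becomes ready, at the cost of surfacing the transient error once the budget is exhausted.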