[CONTROLLER-1077] Clustering: Race condition may cause missing routes in RPC BucketStore Created: 19/Dec/14  Updated: 09/Jan/15  Resolved: 09/Jan/15

Status: Resolved
Project: controller
Component/s: mdsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: Tom Pantelis Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2526
Priority: High

 Description   

The workflow by which the RpcRegistry adds route identifiers into the BucketStore is as follows:

  • The RpcRegistry sends the GetLocalBucket message to the BucketStore and waits on the Future.
  • BucketStore returns GetLocalBucketReply containing its internal local Bucket instance.
  • On Future complete, the RpcRegistry extracts the Bucket instance from the GetLocalBucketReply.
  • RpcRegistry calls getData on the Bucket instance to obtain the RoutingTable. getData returns a copy of its internal RoutingTable.
  • RpcRegistry adds the new RouteIdentifiers into its local RoutingTable copy.
  • RpcRegistry calls setData on the Bucket with the new RoutingTable instance.
  • RpcRegistry sends an UpdateBucket message with the Bucket instance.
  • The BucketStore extracts the Bucket instance from the UpdateBucket message and sets its localBucket member.

The problem is that the BucketStore exposes its internal localBucket, thus allowing it to be modified externally. The manner in which the RpcRegistry modifies the Bucket instance is not thread safe and introcudes a race condition. This can result in missing RouteIdentifiers as in the following scenario:

  • A GetLocalBucket Future callback is notified on Thread 1.
  • At the same time, another GetLocalBucket Future callback is notified on Thread 2.
  • Both threads call getData on the returned Bucket instance and get a local copy.
  • Both add their new routes into their local RoutingTable.
  • Thread 1 calls setData on the Bucket instance.
  • Thread 2 calls setData on the Bucket instance.
  • Since they both had their own local RoutingTable this results in Thread 2 overwriting the instance set by Thread 1 and thus Thread 1's routes are lost.

In addition, neither the 'data' member nor "version" member in Bucket are volatile so the writes may not be visible to other threads. Also setData should atomically set 'data' and 'version' for consistency.

Propose solution:

The BucketStore should not expose its internal state and allow it to be mutated externally. We should remove the GetLocalBucket message. Instead of getting the current routing table, modifying a local copy, then writing it back, which is problematic, we should just send an AddToLocalBucket message to the BucketStore and have it update its local Bucket. Similar with remove.



 Comments   
Comment by Tom Pantelis [ 19/Dec/14 ]

I created a unit test that illustrates the bug. It sends N AddOrUpdateRoutes messages and then verifies the routes IDs are all in the BucketStore. The test fails sporadically with a varying # missing – of course, the higher N, the more often it fails. I even saw it fail once with only 2 (it got 1 out of 2).

Comment by Tom Pantelis [ 19/Dec/14 ]

Submitted https://git.opendaylight.org/gerrit/#/c/13753/

Comment by Tom Pantelis [ 06/Jan/15 ]

Manually cherry picked to stable/helium: https://git.opendaylight.org/gerrit/#/c/13938/

Generated at Wed Feb 07 19:54:38 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.