[CONTROLLER-1991] Tell-based protocol can leak transaction IDs on the backend Created: 23/May/19 Updated: 12/Nov/21 Resolved: 12/Nov/21 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Oxygen SR4, 2.0.0, 3.0.0, 4.0.0 |
| Fix Version/s: | 3.0.13, 4.0.7, 2.0.10 |
| Type: | Bug | Priority: | High |
| Reporter: | Robert Varga | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
Transactions which touch (or do not touch) various shards can leave holes in the transaction sequence numbers of the shards they do not touch. Those holes grow the tracking rangesets, effectively leaking storage. We need to devise a recovery strategy which ensures that backends can merge such holes in a (relatively) timely fashion. |
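To illustrate the shape of the problem (not the controller's actual data structures), the sketch below models the backend's per-history tracking of purged transaction sequences as a Guava RangeSet. Contiguous purges coalesce into a single range, while sequences skipped because the transaction went to other shards leave holes that keep the set growing:

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;

// Simplified model of a backend's per-history set of purged transaction sequences.
// Names and types are illustrative only, not the controller's actual classes.
public final class PurgeTrackingDemo {
    public static void main(String[] args) {
        RangeSet<Long> purged = TreeRangeSet.create();

        // A frontend that purges every sequence it allocates keeps the set at one range.
        for (long seq = 0; seq < 5; seq++) {
            purged.add(Range.closedOpen(seq, seq + 1));
        }
        System.out.println(purged.asRanges().size()); // 1, i.e. [0..5)

        // A frontend whose transactions go to other shards skips sequences here.
        // The holes prevent coalescing, so the set grows with every purge.
        for (long seq = 6; seq < 100; seq += 2) {
            purged.add(Range.closedOpen(seq, seq + 1));
        }
        System.out.println(purged.asRanges().size()); // dozens of disjoint ranges
    }
}
```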
| Comments |
| Comment by Robert Varga [ 08/Jun/19 ] |
|
What we need is to maintain the lowest known non-GC'd TransactionId which has been allocated for a particular history. We then need to communicate this ID to the backend once in a while, so that it can consolidate its maps. Note that well-behaved pipelines will be sending a rapid succession of transactions towards a single shard, hence we need an efficient protocol for that. To make a reasonable decision, we need to estimate the holes on the backend based on this information; I am not quite sure how to do that yet. One option would be to send periodic cleanups, but that is not workable when multiple clients are talking to the backend – the amount of potential backend leak then depends on timing. This would not work even if the timer ran on the backend: it would have to be quite aggressive to keep up with the discontinuities produced by clients.
|
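A minimal sketch of that idea, with entirely hypothetical class names and assuming sequence numbers are allocated from 0 per history: the frontend tracks the lowest still-live sequence, and the backend collapses everything below the advertised bound into a single range.

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import com.google.common.collect.TreeRangeSet;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical frontend-side bookkeeping: which allocated sequences are still live.
final class HistoryFrontendSketch {
    private final NavigableSet<Long> live = new TreeSet<>();
    private long nextSequence = 0;

    synchronized long allocateTransaction() {
        final long seq = nextSequence++;
        live.add(seq);
        return seq;
    }

    synchronized void transactionGarbageCollected(final long seq) {
        live.remove(seq);
    }

    // Advertised to the backend once in a while: every sequence below this is dead.
    synchronized long lowestLiveSequence() {
        return live.isEmpty() ? nextSequence : live.first();
    }
}

// Hypothetical backend-side consolidation driven by that advertisement.
final class HistoryBackendSketch {
    private final RangeSet<Long> purged = TreeRangeSet.create();

    void onTransactionPurged(final long seq) {
        purged.add(Range.closedOpen(seq, seq + 1));
    }

    void consolidateBelow(final long lowestLive) {
        if (lowestLive > 0) {
            // Collapses any holes below the bound into one contiguous range.
            purged.add(Range.closedOpen(0L, lowestLive));
        }
    }
}
```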
| Comment by Robert Varga [ 08/Jun/19 ] |
|
Another option would be for the backend to detect purgedTransactions discontinuities and communicate the lowest hole back to the frontend. The frontend should then determine whether it has corresponding transactions queued for resolution (by maintaining an identifier queue?), or else enqueue a mass-purge request to the backend up to the first known transaction in that history. This means we would need to keep track of the identifier of the last transaction considered purged – and that should not be hard. |
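A sketch of the backend side of that alternative, again with made-up names: scan the purged ranges for the first discontinuity and report the lowest missing sequence back to the frontend, which can then either wait for a queued transaction or issue a mass purge up to its first known transaction.

```java
import com.google.common.collect.Range;
import com.google.common.collect.RangeSet;
import java.util.Iterator;
import java.util.OptionalLong;

// Hypothetical discontinuity detection over the backend's purged ranges.
final class HoleDetector {
    // Lowest sequence that is not purged but lies below a higher purged sequence,
    // i.e. the start of the first hole, assuming sequences start at 0.
    static OptionalLong lowestHole(final RangeSet<Long> purged) {
        final Iterator<Range<Long>> it = purged.asRanges().iterator();
        if (!it.hasNext()) {
            return OptionalLong.empty();
        }
        final Range<Long> first = it.next();
        if (first.lowerEndpoint() > 0) {
            // Nothing below the first range has been purged: the hole starts at 0.
            return OptionalLong.of(0);
        }
        // With closed-open ranges, a second disjoint range means the first one ends at a hole.
        return it.hasNext() ? OptionalLong.of(first.upperEndpoint()) : OptionalLong.empty();
    }
}
```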
| Comment by Robert Varga [ 05/Nov/21 ] |
|
So the story is that:
When we transition the AbstractClientHandle state in ensureClosed(), we need to issue a callback to AbstractClientHistory, which can then see whether there are any ProxyHistories not aware of the transaction being closed. If there are any, the transaction ID needs to be forwarded to each of them. The second part kicks in inside ProxyHistory: when we allocate a new transaction, or when the number of such identifiers exceeds a threshold, we need to send out a mass-purge request towards the shard. This way we should be healing any discontinuities rather speedily, preventing this leak. |
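A condensed sketch of that mechanism with hypothetical names and an assumed threshold value (the real classes live in the controller's clustered datastore client code): the history forwards the closed transaction's ID to every ProxyHistory that never saw it, and the proxy batches those IDs into a single mass-purge request, flushed either on the next allocation or once the threshold is reached.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative proxy-side batching of "skipped" transaction identifiers; names and the
// threshold value are assumptions, not the controller's actual API.
final class ProxyHistorySketch {
    private static final int SKIP_THRESHOLD = 32;

    private final List<Long> skippedTransactions = new ArrayList<>();

    // Called from the history when a transaction closed without ever touching this
    // proxy's shard, which would otherwise leave a hole in the backend's sequence.
    synchronized void onTransactionSkipped(final long transactionId) {
        skippedTransactions.add(transactionId);
        if (skippedTransactions.size() >= SKIP_THRESHOLD) {
            flushSkipped();
        }
    }

    // Called when a new transaction is allocated towards this shard: purge any
    // accumulated holes first, so the backend's ranges stay contiguous.
    synchronized void onTransactionAllocated() {
        if (!skippedTransactions.isEmpty()) {
            flushSkipped();
        }
    }

    private void flushSkipped() {
        sendMassPurgeRequest(List.copyOf(skippedTransactions));
        skippedTransactions.clear();
    }

    private void sendMassPurgeRequest(final List<Long> transactionIds) {
        // Stand-in for a single request towards the shard leader covering all IDs.
    }
}
```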