CONTROLLER-2122

Provide efficient RAFT term storage


    • Improvement
    • Resolution: Unresolved
    • Medium
    • 10.0.0
    • None
    • clustering

      Our current organization of persisted data via atomix-storage leaves a lot to be desired for reasons that are mostly historic: atomix-storage itself, Akka persistence, etc.

      At the end of the day, what we need is RAFT journal storage. In a RAFT journal, each entry has 129 bits of metadata, whose intrinsic properties offer room for optimization.

      Each entry has:

      • a 64bit index, counting either from 1 or 0 (which is a detail), increasing monotonically
      • a 64bit term, counting from 0, never decreasing (but not necessarily contiguous: a term may end up with no entries at all, for example when a leader fails to commit its first entry)
      • a one-bit commit indicator, which is implied by commitIndex – all entries up to and including commitIndex are committed
      • associated state transition data
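
      The commit indicator needs no storage of its own, since it follows from comparing an entry's index with commitIndex. A minimal sketch of that relationship; EntryMeta and isCommitted are illustrative names, not existing atomix-storage or controller API:

```java
/** Illustrative only: the per-entry metadata described in the list above. */
final class EntryMeta {
    final long index; // position in the journal, monotonically increasing
    final long term;  // RAFT term in which the entry was created

    EntryMeta(long index, long term) {
        this.index = index;
        this.term = term;
    }

    /** The one-bit commit indicator is derived from commitIndex, not stored. */
    boolean isCommitted(long commitIndex) {
        return index <= commitIndex;
    }
}
```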

      atomix-storage uses index for its internal entries and has a provision to provide commitIndex, but it leaves the term information with the state data.

      This is a missed optimization opportunity because, in steady-state operation with no leader elections, the term remains constant.

      We are currently tracking each entry via:

      • 32bit entry length
      • 32bit entry CRC32

      where length == 0 is not allowed.
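
      For reference, the current fixed 8-byte framing could be sketched roughly as follows. This is not the atomix-storage code: the class name is made up, and the sketch assumes the CRC32 covers only the payload, which is not stated above.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Illustrative sketch of a fixed header: [length:4][crc32:4][payload]. */
final class FixedHeaderFraming {
    static final int HEADER_SIZE = Integer.BYTES + Integer.BYTES;

    private FixedHeaderFraming() {
        // utility class
    }

    static ByteBuffer frame(byte[] payload) {
        if (payload.length == 0) {
            throw new IllegalArgumentException("length == 0 is not allowed");
        }
        // Assumption: the checksum is computed over the payload bytes only
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);

        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        buf.putInt(payload.length);
        buf.putInt((int) crc.getValue());
        buf.put(payload);
        buf.flip();
        return buf;
    }
}
```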

      Perhaps we should be tracking it via:

      • 32bit entry CRC32
      • variable 64bit term increment
      • variable 64bit entry length, perhaps trimmed to 32bits, but future-proof to 64bits

      which would lend itself to using WritableObjects.writeTwoLongs() for the second and third items. With the first long almost always being 0, this leads to efficient storage in 1-17 bytes, with the usual case being 3-11 bytes (I think, this needs to be checked).
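
      To illustrate the idea of packing two longs behind a single header byte, here is a self-contained sketch. It is not the WritableObjects wire format: the TwoLongs class and its layout (high nibble holds the byte count of the first value, low nibble the byte count of the second, a zero value occupies no bytes) are assumptions made for illustration, so the byte counts it produces need not match the figures above.

```java
import java.nio.ByteBuffer;

/** Illustrative encoding: [header:1][first:0-8][second:0-8], big-endian, 1-17 bytes total. */
final class TwoLongs {
    private TwoLongs() {
        // utility class
    }

    /** Bytes needed for the significant bits of value; 0 when value == 0. */
    static int significantBytes(long value) {
        return (Long.SIZE - Long.numberOfLeadingZeros(value) + 7) / 8;
    }

    static void write(ByteBuffer out, long first, long second) {
        int firstLen = significantBytes(first);
        int secondLen = significantBytes(second);
        // Both byte counts (0-8) share the header byte, one nibble each
        out.put((byte) (firstLen << 4 | secondLen));
        putBytes(out, first, firstLen);
        putBytes(out, second, secondLen);
    }

    static long[] read(ByteBuffer in) {
        int header = in.get() & 0xFF;
        long first = getBytes(in, header >>> 4);
        long second = getBytes(in, header & 0x0F);
        return new long[] { first, second };
    }

    private static void putBytes(ByteBuffer out, long value, int len) {
        // Emit only the low-order len bytes, most significant first
        for (int shift = (len - 1) * 8; shift >= 0; shift -= 8) {
            out.put((byte) (value >>> shift));
        }
    }

    private static long getBytes(ByteBuffer in, int len) {
        long value = 0;
        for (int i = 0; i < len; i++) {
            value = value << 8 | (in.get() & 0xFFL);
        }
        return value;
    }
}
```

      In the steady state the term increment is 0 and contributes no bytes in this layout, so the encoded size is driven by the entry length alone.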

      If we can achieve this, then we can project RaftEntryMeta from the storage layer, without having to involve entry data serialization code – which we only need when we are replicating/applying entries.
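
      A rough sketch of such a projection, working from entry headers alone. Header and Meta are hypothetical stand-ins (Meta is not the real RaftEntryMeta), and the sketch assumes that each term increment is relative to the previous entry and that the first index and base term are known from elsewhere, for example a segment header:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative projection of (index, term) pairs without touching entry payloads. */
final class MetaProjection {
    private MetaProjection() {
        // utility class
    }

    /** Hypothetical decoded entry header: term increment and payload length. */
    static final class Header {
        final long termIncrement;
        final long length;

        Header(long termIncrement, long length) {
            this.termIncrement = termIncrement;
            this.length = length;
        }
    }

    /** Hypothetical stand-in for entry metadata: just index and term. */
    static final class Meta {
        final long index;
        final long term;

        Meta(long index, long term) {
            this.index = index;
            this.term = term;
        }
    }

    static List<Meta> project(long firstIndex, long baseTerm, List<Header> headers) {
        List<Meta> result = new ArrayList<>(headers.size());
        long index = firstIndex;
        long term = baseTerm;
        for (Header header : headers) {
            // The index is implied by position, the term by the running sum of increments
            term += header.termIncrement;
            result.add(new Meta(index++, term));
        }
        return result;
    }
}
```

      The length field is only needed to skip over the payload when locating the next header; the payload itself is never deserialized.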

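      Based on the 3-11 bytes figure above, that has the potential to save 1-9 bytes overall for each entry: the current layout costs 16 bytes (a fixed 8-byte header plus an implied 8 bytes of payload), while the proposed header would take 7-15 bytes (4 bytes of CRC32 plus 3-11 bytes). The cost is slightly more complex buffer management code, as we would be dealing with variable-length headers rather than fixed 8-byte ones.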

            Unassigned
            rovarga Robert Varga