Uploaded image for project: 'controller'
  1. controller
  2. CONTROLLER-2034

Limit repeated replication of the same selection of Entries

XMLWordPrintable

    • Icon: Improvement Improvement
    • Resolution: Unresolved
    • Icon: Medium Medium
    • None
    • None
    • None
    • None

      Consided following scenario:

      • the Leader is under load - there are 50 new Entries added to his Journal each second.
      • every time a new Entry is added, an AppendEntries message is sent to a Follower.
      • the AppendEntries message contains selection of Entries from the last index the Follower confirmed to the maximum index which can fit in the AppendEntries message. 
      • the Follower is slower (less resources, overloaded CPU, slow HW, etc...) and the AppendEntries message is filled to it's max capacity, therefore it takes him about 2 seconds to reply to this AppendEntries message.
      • in those 2 seconds the Leader sends 100 more AppendEntries containing the exact same selection of entries.
      • this creates 3 issues:
          - it puts the Follower further and further behind, since now his MessageQueue is stuffed with hundreds if not thousands of AppendEntries while many of them don't really provide any value. He still has to process each one of them.
          - it can lead to OOM issue on the Follower, since his MessageQueue is unbounded.
          - it leads to unnecessary overhead on the Leader's side and a possible GS overload, since every such AppendEntry has to be constructed, serialized, pushed through the EndpointWriter and then destroyed.

      Proposed solution - introduce timer to limit this pointless repeated replication.
      When the Leader attempts to send AppendEntries message to a Follower, the selection of Entries is checked first. If it's the same list of entries as in the last AppendEntries, a timeout is started. If the next attempt at AppendEntries contains the same Entries and appears within the timeout, it won't be sent (this case is also logged at the debug level). When the timeout elapses and the next AppendEntries still contains the same Entries it will be sent, but the timeout is increased. The initial timeout is 50ms and each increase is done by multiplying it with repeated-replication-timeout-multiplier up to the maximum timeout, which is configured by repeated-replication-max-timeout-seconds.

      • Leader tries to send the same Entries over and over again every 30ms
      • the first AppendEntries is sent immediately - no problem here
      • the second one contains the same Entries. Is is still sent immediately, but the timer is started (50ms)
      • after 30ms a new AppendEntries should be created, but the entries are the same and the timeout hasn't elapsed yet, so skip.
      • 30 ms later another one is attempted. Entries are the same, but timeout has elapsed. Send AppendEntries and multiply timeout
      • next attempts happen after 30, 60 and 90ms. Skip all of them.
      • 120ms elapsed, so send the next AppendEntries and increase timeout to 200ms.
      • if this keeps happening, the next AppendEntries will be sent after 400ms, then 800, 1600, 3200, 6400, 12800, and then every 150000ms which is the default maximum value. 
      • however whenever a different selection of Entries is chosen, the timeout is reset back to 0 and this new selection is stored as the last replicated. The limiter is also only applied to AppendEntries, which contain some entries, since empty AppendEntries are used as heartbeats.

            tibor.kral Tibor Král
            tibor.kral Tibor Král
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: