CONTROLLER-1789

Recover from shard stopped condition on akka persistence circuit-breaker failure

    Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Medium
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Under heavy transaction load in a cluster setup, an Akka persistence circuit-breaker failure is sometimes observed. When this happens, the shard is stopped and never recovers. This bug requests that the shard be recovered once it gets into this condition.

      The problem can be reproduced fairly easily:
      1. On node-1, install the controller in an NFS-mounted directory.
      2. Start all nodes and make sure node-1 is the leader of the target shard.
      3. Issue write transactions at a high rate against this shard.
      4. Stop and restart the NFS service on the server from which the directory is mounted. This makes the directory unwritable for some time.
      5. During this period, Akka persistence of journal records fails. When this failure happens, the shard is stopped and never started again.

      Increasing the Akka journal persistence circuit-breaker call-timeout value (default 10s) does help make the system more tolerant of such an outage.
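
      For reference, a minimal sketch of what such an override might look like in HOCON. The journal-plugin-fallback.circuit-breaker path and its default values come from Akka's reference configuration. The 30s call-timeout is only an illustrative value, and the enclosing odl-cluster-data block assumes the usual layout of the controller's akka.conf. If the configured journal plugin defines its own circuit-breaker block, the override would need to go under that plugin's path instead.

      ```
      # Sketch only: values are illustrative, not recommendations.
      odl-cluster-data {        # assumed wrapper block in the controller's akka.conf
        akka {
          persistence {
            journal-plugin-fallback {
              circuit-breaker {
                max-failures = 10     # Akka default
                call-timeout = 30s    # Akka default is 10s; raised here for illustration
                reset-timeout = 30s   # Akka default
              }
            }
          }
        }
      }
      ```

      This only widens the window the journal can be unavailable before the breaker trips. It does not address the shard staying stopped once the failure occurs.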

      A few others have seen this issue; see https://lists.opendaylight.org/pipermail/controller-dev/2017-August/013777.html


            People

            Assignee:
            Unassigned
            Reporter:
            Ajay Lele (ajayslele)