CONTROLLER-1463

[Clustering] Datastore: Unrecoverable failure when a high volume of write transactions is initiated


    • Type: Bug
    • Resolution: Cannot Reproduce
    • None
    • Beryllium
    • clustering
    • None
    • Operating System: All
      Platform: All

    • 4823
    • High

      Build used :
      ===================
      Karaf distro from latest ODL Beryllium master code

      Test Type :
      ===================
      Adding OpenFlow (OF) flows to the inventory shard of the config datastore only (no switches connected)

      Objective of test :
      ===================
      To stress the datastore by adding flows spread across switches

      Test Steps :
      ============
      1. Bring up a healthy 3-node cluster
      2. Write OF flows into the inventory config datastore
         a. Used newWriteOnlyTransaction
         b. Flows are spread across switches
         c. Flows for a single switch are pushed sequentially (the next flow is submitted in the onSuccess callback of the previous transaction), while transactions for different switches are pushed in parallel

      3. Check whether all flows were pushed and record metrics such as rate, state, and latency
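
The write pattern in step 2c can be sketched roughly as follows. This is not the actual test code, which is not attached to this report. `FlowWriter` is a hypothetical stand-in for "open a newWriteOnlyTransaction, put one flow, submit", and the switch IDs are made up. It uses only JDK `CompletableFuture` to show the shape of the load: one sequential chain per switch, with all chains running in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class FlowWritePattern {

    /** Hypothetical stand-in for one write-only transaction that writes a single flow. */
    interface FlowWriter {
        CompletableFuture<Void> write(String switchId, int flowIndex);
    }

    /**
     * Chains the writes for one switch: each write is started only after the
     * previous one completed successfully. A failure stops the rest of the chain.
     */
    static CompletableFuture<Void> writeSequentially(FlowWriter writer, String switchId, int flowsPerSwitch) {
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (int i = 0; i < flowsPerSwitch; i++) {
            final int flowIndex = i;
            chain = chain.thenCompose(ignored -> writer.write(switchId, flowIndex));
        }
        return chain;
    }

    /** Starts one sequential chain per switch; the chains themselves run in parallel. */
    static CompletableFuture<Void> writeAll(FlowWriter writer, List<String> switchIds, int flowsPerSwitch) {
        CompletableFuture<?>[] chains = switchIds.stream()
                .map(id -> writeSequentially(writer, id, flowsPerSwitch))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(chains);
    }

    public static void main(String[] args) {
        // Fake writer that only records what it was asked to write.
        Map<String, List<Integer>> written = new ConcurrentHashMap<>();
        FlowWriter fake = (sw, idx) -> CompletableFuture.runAsync(() ->
                written.computeIfAbsent(sw, k -> Collections.synchronizedList(new ArrayList<>())).add(idx));

        writeAll(fake, Arrays.asList("openflow:1", "openflow:2", "openflow:3"), 5).join();
        written.forEach((sw, flows) -> System.out.println(sw + " -> " + flows));
    }
}
```

With this shape, the number of in-flight transactions at any moment is bounded by the number of switches, so the load on the shard leader scales with the switch count rather than the total flow count.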

      Controllers (to cross-check logs):
      ===================================
      c1 - Controller 1 with IP 10.183.181.41 - config-inventory-shard leader
      c2 - Controller 2 with IP 10.183.181.42 - config-inventory-shard follower (flow transactions are initiated from c2)
      c3 - Controller 3 with IP 10.183.181.43 - config-inventory-shard follower

      Enclosed Logs:
      ==============
      c1.karaf.log for controller c1
      c2.karaf.log for controller c2
      c3.karaf.log for controller c3

      Observations and issue-summary:
      ================================
      1. Leader (c1) marks one of the followers (c3) as UNREACHABLE (line 1230 of c1.karaf.log)
      2. Leader (c1) marks follower (c3) as REACHABLE again within a few milliseconds (line 1232 of c1.karaf.log)
      3. At line 1566 of c2.karaf.log, follower c2 marks leader c1 as UNREACHABLE and never recovers from that state. All transactions initiated from c2 then fail with org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Shard member-2-shard-inventory-config currently has no leader. Try again later.
      4. From this point on, transactions from the follower do not go through, which is understandable as a transient state while the leader is UNREACHABLE
      5. However, even 2-3 minutes after the above incident, transactions from follower c2 or c3 keep failing.
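
To cross-check the line references above against the enclosed logs, a small scanner along these lines may help. It is only a sketch. The exact log message format is not shown in this report, so it matches the bare tokens UNREACHABLE, REACHABLE and NoShardLeaderException rather than full messages. The class name and the output format are illustrative choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterLogScan {

    // Word boundaries keep REACHABLE from matching inside UNREACHABLE.
    private static final Pattern EVENT =
            Pattern.compile("\\b(UNREACHABLE|REACHABLE|NoShardLeaderException)\\b");

    /** Returns "lineNumber:token" for the first matching token on each line (1-based line numbers). */
    static List<String> scan(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = EVENT.matcher(lines.get(i));
            if (m.find()) {
                hits.add((i + 1) + ":" + m.group(1));
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage (file names taken from the "Enclosed Logs" list):
        //   java ClusterLogScan c1.karaf.log c2.karaf.log c3.karaf.log
        for (String file : args) {
            // ISO-8859-1 decodes any byte sequence, so stray bytes in a log do not abort the scan.
            List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.ISO_8859_1);
            for (String hit : scan(lines)) {
                System.out.println(file + ":" + hit);
            }
        }
    }
}
```

Comparing the first UNREACHABLE hit in c2.karaf.log with the first NoShardLeaderException hit gives a quick view of how soon after the reachability change the transactions start failing.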

            Assignee: Unassigned
            Reporter: Muthukumaran Kothandaraman (muthukumaran.k@ericsson.com)