controller / CONTROLLER-1756

OOM due to huge Map in ShardDataTree


    • Type: Bug
    • Resolution: Done
    • Carbon
    • mdsal
    • Operating System: All
      Platform: All

    • 9034

      We're seeing an OOM in Red Hat internal scale testing:

      Our scenario is a cluster of 3 nodes with odl-netvirt-openstack being stress tested by OpenStack's rally benchmarking tool.

      The ODL version this is being seen with is a Carbon build from last Thursday; specifically https://nexus.opendaylight.org/content/repositories/opendaylight-carbon-epel-7-x86_64-devel/org/opendaylight/integration-packaging/opendaylight/6.2.0-0.1.20170817rel1931.el7.noarch/opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch.rpm

      We've started testing by giving all 3 ODL node VMs just 2 GB, in an effort to better understand ODL memory requirements. If it's simply "normal" that we cannot run with 2 GB in such a "real world" scenario, we'll gradually increase Xmx in this environment - but we wanted community feedback on this OOM at 2 GB first.

      I'll attach, or provide links to, the usual HPROF, plus a "Leak Suspects" report produced by https://www.eclipse.org/mat/, plus the Karaf log.
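For anyone reproducing this: besides running the JVM with -XX:+HeapDumpOnOutOfMemoryError, an HPROF snapshot like the one attached can be captured programmatically on a HotSpot JVM. This is a minimal sketch (not part of ODL; the class name HeapDump and the output path are mine) using the standard HotSpotDiagnosticMXBean:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HeapDump {
    /** Writes an HPROF snapshot of live objects to the given file (HotSpot JVMs only). */
    public static void dump(String file) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file, true); // true = dump only live (reachable) objects
    }

    public static void main(String[] args) throws Exception {
        Path out = Paths.get("heap.hprof");
        Files.deleteIfExists(out); // dumpHeap refuses to overwrite an existing file
        dump(out.toString());
        System.out.println(Files.exists(out) && Files.size(out) > 0 ? "dump written" : "dump failed");
    }
}
```

The resulting heap.hprof can be opened directly in MAT to regenerate the Leak Suspects report.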

      Basically what we're seeing is a huge (>1 GB) Map in ShardDataTree (I'm not sure if that's its Map<LocalHistoryIdentifier, ShardDataTreeTransactionChain> transactionChains or its Map<Payload, Runnable> replicationCallbacks).
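To illustrate the kind of growth pattern MAT is flagging - this is a toy sketch, NOT the actual ShardDataTree code, with hypothetical HistoryId/Chain classes standing in for LocalHistoryIdentifier and ShardDataTreeTransactionChain - a map keyed per history leaks if entries are only ever added and never purged when a chain is closed:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of an unbounded per-history map (not real ODL code). */
public class ChainMapLeak {
    static final class HistoryId {
        final long id;
        HistoryId(long id) { this.id = id; }
        @Override public int hashCode() { return Long.hashCode(id); }
        @Override public boolean equals(Object o) {
            return o instanceof HistoryId && ((HistoryId) o).id == id;
        }
    }
    static final class Chain { /* would hold per-chain transaction state */ }

    private final Map<HistoryId, Chain> chains = new HashMap<>();

    void openChain(long historyId)  { chains.put(new HistoryId(historyId), new Chain()); }
    // Without this removal step, the map grows with every chain ever opened.
    void closeChain(long historyId) { chains.remove(new HistoryId(historyId)); }
    int size()                      { return chains.size(); }

    public static void main(String[] args) {
        ChainMapLeak leaky = new ChainMapLeak();
        for (long i = 0; i < 100_000; i++) leaky.openChain(i); // opened but never closed
        System.out.println("without purge: " + leaky.size());
        for (long i = 0; i < 100_000; i++) leaky.closeChain(i);
        System.out.println("after purge: " + leaky.size());
    }
}
```

If the >1 GB map is transactionChains, the question would be whether closed chains are being removed; if it's replicationCallbacks, whether callbacks are dropped once replication completes.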

      As far as I can tell from my limited understanding, this is not the same as CONTROLLER-1746 (there's nothing about "closedTransactions" anywhere..), and not CONTROLLER-1755 either?

      The Karaf log, among other errors which are less relevant in this context AFAIK, shows:

      (1) many "ERROR ShardDataTree org.opendaylight.controller.sal-distributed-datastore - 1.5.2.Carbon | member-0-shard-default-operational: Failed to commit transaction ... java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@78fe0203 and candidate base org.opendaylight.yangtools.yang.data.api.schema.tree.spi.MaterializedContainerNode@686861e8 differ" errors - this seems vaguely familiar from recent list posts; can someone remind me what those were all about?

      (2) at the very end, just before it blows up, genius' lockmanager seems unhappy: " Waiting for the lock ... is timed out. retrying again" - probably just an effect of this OOM? Or could the lockmanager somehow be related and actually be the cause rather than the effect - could "bad application code" (such as not closing a DataBroker transaction correctly) cause this OOM?
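On the "not closing a transaction correctly" theory: the sketch below is a hypothetical mini-broker (the Broker/Tx classes and their submit()/cancel() methods are mine, not the real MD-SAL DataBroker API) showing why a write transaction that is neither submitted nor cancelled can pin state on the broker side indefinitely:

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of transaction lifecycle tracking (not real MD-SAL). */
public class TxLifecycle {
    static class Broker {
        final Set<Tx> open = new HashSet<>();
        Tx newWriteTx() { Tx tx = new Tx(this); open.add(tx); return tx; }
    }
    static class Tx {
        final Broker broker;
        Tx(Broker b) { broker = b; }
        void submit() { broker.open.remove(this); } // commit path releases the tx
        void cancel() { broker.open.remove(this); } // abandoning must release it too
    }

    public static void main(String[] args) {
        Broker broker = new Broker();
        Tx tx = broker.newWriteTx();
        try {
            // ... application writes would go here ...
            tx.submit();
        } finally {
            tx.cancel(); // no-op after submit; releases the tx if submit was skipped
        }
        Tx leaked = broker.newWriteTx(); // never submitted nor cancelled: stays pinned
        System.out.println("open transactions: " + broker.open.size());
    }
}
```

An application-side pattern like the try/finally above would rule this theory out; the leaked transaction at the end is what a misbehaving caller would leave behind.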

            Assignee: Unassigned
            Reporter: Michael Vorburger (vorburger)