[ODLPARENT-125] Oxygen development is limited by SingleFeatureTest running out of heap space Created: 27/Sep/17 Updated: 13/Apr/18 Resolved: 13/Apr/18 |
|
| Status: | Resolved |
| Project: | odlparent |
| Component/s: | General |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | 3.1.0 |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 9218 |
| Description |
|
SingleFeatureTest is failing on the odl-integration-all feature for some contributions, preventing them from merging (as distribution-check fails). As reported in e-mail [0], there is a workaround (Distribution sets a higher heap size so that SingleFeatureTest passes), which just needs Odlparent to release version 2.0.5. Local testing showed this might be just a consequence of Also, this Bug is more severe than Additional example of a Change failing distribution-check: [1].
[0] https://lists.opendaylight.org/pipermail/odlparent-dev/2017-September/001341.html |
| Comments |
| Comment by Robert Varga [ 27/Sep/17 ] |
|
Based on a heap dump we've got, the problem lies somewhere between Karaf and the Felix resolver. The resolver is tasked with resolving 24837 requirements, which quickly falls over due to OpenHashMapList cloning during permutations (978 copies recorded) and 24M CandidateSelector objects. Moreover, there are huge numbers of duplicate Karaf SimpleFilter objects, in groups of 400-5000 objects, wasting around 5MB of memory. Furthermore, there is a ton of duplicate (Linked)HashMap entries retained from RequirementImpl, typically containing only a single element, wasting quite a bit of memory there. To top it all off, there are 21K sparsely populated – pre-allocated to 10 entries, but holding 1 or 2 entries – most of which come from SimpleFilter. |
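The duplicate-object waste described in this comment is the kind of thing a canonicalizing cache (interning) addresses. The following is a minimal sketch under stated assumptions, not a proposed Karaf patch: `Filter` is a hypothetical stand-in for an immutable filter object (it is not Karaf's SimpleFilter), and the attribute and value strings are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical illustration only; Filter is NOT Karaf's SimpleFilter.
final class FilterInterner {
    /** Small immutable value object with value-based equals()/hashCode(). */
    static final class Filter {
        private final String attribute;
        private final String value;

        Filter(String attribute, String value) {
            this.attribute = attribute;
            this.value = value;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof Filter)) {
                return false;
            }
            Filter other = (Filter) obj;
            return attribute.equals(other.attribute) && value.equals(other.value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(attribute, value);
        }
    }

    // Maps each distinct filter to the single instance we keep for it.
    private final Map<Filter, Filter> canonical = new HashMap<>();

    /** Returns the first instance seen that is equal to the argument. */
    Filter intern(Filter filter) {
        Filter existing = canonical.putIfAbsent(filter, filter);
        return existing == null ? filter : existing;
    }

    int size() {
        return canonical.size();
    }

    /** Creates the given number of equal filters and counts the retained instances. */
    static int retainedAfterInterning(int duplicates) {
        FilterInterner interner = new FilterInterner();
        for (int i = 0; i < duplicates; i++) {
            // Made-up example values; every iteration builds an equal object.
            interner.intern(new Filter("osgi.wiring.package", "org.example.api"));
        }
        return interner.size();
    }

    public static void main(String[] args) {
        System.out.println(retainedAfterInterning(5000));
    }
}
```

Interning only works for objects that are immutable and implement value-based `equals()`/`hashCode()`; whether the real filter objects meet those conditions is not established in this ticket.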
| Comment by Robert Varga [ 27/Sep/17 ] |
|
A test case is quite simple:
|
| Comment by Robert Varga [ 28/Sep/17 ] |
|
So this is definitely exposed by us packaging bundles multiple times. The problem is that the Felix resolver is performing SUBSTITUTE permutations, which are the memory hog. These are emitted each time a bundle is duplicated – e.g. bgp-parser-api, which is packaged three times, adds two permutation candidates. I suspect there is a bug in Karaf, but we can definitely hide the issue by being careful with our packaging. |
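To get a feel for why duplicated packaging matters, here is a back-of-the-envelope sketch. It is not the Felix resolver's algorithm: it only assumes that each duplicated bundle contributes (copies - 1) extra candidates, as stated above, and that in the worst case the choices for different bundles multiply. The class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Rough arithmetic only; this does not model the Felix resolver itself.
final class PermutationGrowth {
    /** Extra candidates contributed by a bundle packaged the given number of times. */
    static int extraCandidates(int copies) {
        return Math.max(0, copies - 1);
    }

    /**
     * Assumed worst case: one copy is picked per bundle and the picks are
     * independent, so the number of combinations is the product of the copy counts.
     */
    static BigInteger worstCaseCombinations(int... copiesPerBundle) {
        BigInteger total = BigInteger.ONE;
        for (int copies : copiesPerBundle) {
            total = total.multiply(BigInteger.valueOf(Math.max(1, copies)));
        }
        return total;
    }

    public static void main(String[] args) {
        // A bundle packaged three times, as with bgp-parser-api in the comment.
        System.out.println(extraCandidates(3));

        // Hypothetical scenario: twenty bundles, each packaged three times.
        int[] twentyTriplicated = new int[20];
        Arrays.fill(twentyTriplicated, 3);
        System.out.println(worstCaseCombinations(twentyTriplicated));
    }
}
```

Under these assumptions, twenty triplicated bundles already give 3^20 (about 3.5 billion) combinations, which is why packaging each bundle only once shrinks the search space so effectively.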
| Comment by Robert Varga [ 28/Sep/17 ] |
|
So this issue is a combination of us multi-packaging our features and how Karaf implements the Resource, Capability and Requirement interfaces – these have well-defined semantics and are tied together (i.e. they cross-reference each other). Karaf's implementations violate the API contract by not providing proper hashCode()/equals() as the contract specifies. While Capability and Requirement are easily retro-fitted, Karaf does not record Resource (i.e. bundle) content sufficiently to make Resource equals() independent of Capability. This has two impacts: 2) The Felix resolver is led to believe that duplicate bundles are actually different potential candidates which provide the same capabilities (expanding the search space along the permutation axis), and it also treats each such bundle's requirements as unique (expanding the search space along the requirement axis). The end result is that the resolver has a large working set for each permutation and also has a large number of permutations, quickly exhausting memory. Fixing this completely correctly on the Karaf side is probably going to be major surgery and a multi-week effort. Raising the severity of this issue to blocker. Will re-target to Nitrogen SR1 and open per-project issues once there is an SR1 milestone in bugzilla. |
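The effect of a missing hashCode()/equals() can be reproduced in a few lines. This is a minimal sketch with hypothetical stand-in classes (`IdentityResource`, `ValueResource`) and a made-up version string; it does not use Karaf's or Felix's actual classes. It only shows that objects relying on default identity equality look like distinct candidates to any hash-based collection, while value-based equality collapses the duplicates.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-ins; these are not Karaf or OSGi classes.
final class DuplicateCandidates {
    /** Inherits Object.equals()/hashCode(), so every instance is unique. */
    static final class IdentityResource {
        final String symbolicName;
        final String version;

        IdentityResource(String symbolicName, String version) {
            this.symbolicName = symbolicName;
            this.version = version;
        }
    }

    /** Same data, but compared by value. */
    static final class ValueResource {
        final String symbolicName;
        final String version;

        ValueResource(String symbolicName, String version) {
            this.symbolicName = symbolicName;
            this.version = version;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) {
                return true;
            }
            if (!(obj instanceof ValueResource)) {
                return false;
            }
            ValueResource other = (ValueResource) obj;
            return symbolicName.equals(other.symbolicName) && version.equals(other.version);
        }

        @Override
        public int hashCode() {
            return Objects.hash(symbolicName, version);
        }
    }

    /** Number of distinct candidates seen when the same bundle is packaged several times. */
    static int distinctWithIdentityEquality(int copies) {
        Set<IdentityResource> candidates = new HashSet<>();
        for (int i = 0; i < copies; i++) {
            candidates.add(new IdentityResource("bgp-parser-api", "1.0.0")); // version is made up
        }
        return candidates.size();
    }

    static int distinctWithValueEquality(int copies) {
        Set<ValueResource> candidates = new HashSet<>();
        for (int i = 0; i < copies; i++) {
            candidates.add(new ValueResource("bgp-parser-api", "1.0.0")); // version is made up
        }
        return candidates.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctWithIdentityEquality(3)); // 3: each copy looks unique
        System.out.println(distinctWithValueEquality(3));    // 1: duplicates collapse
    }
}
```

With identity equality, a bundle packaged three times yields three candidates; with value equality it yields one, which is the behaviour the resolver would need in order to stop treating duplicates as separate options.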
| Comment by Robert Varga [ 28/Sep/17 ] |
| Comment by Vratko Polak [ 05/Oct/17 ] |
|
Even with Odlparent 2.0.5 (Karaf 4.0.10) we can run out of heap when installing a large enough tree of features. See [2]. Note that SingleFeatureTest during distribution-check currently uses 4 GB of heap space, but the subsequent boot (to verify Restconf) uses only 3 GB. Distribution-check currently does not archive the heap dump. Even if it did, the .gz would be almost 900 MB, so the Nexus upload would fail. |
| Comment by Vratko Polak [ 06/Oct/17 ] |
|
> .xz would have around 120MB, it would take some time to make
Some time has passed, releng/builder improvement is ready: [3]. |
| Comment by Robert Varga [ 07/Oct/17 ] |
|
Patches for individual projects are collected under https://git.opendaylight.org/gerrit/#/q/topic:bug9218 |
| Comment by Vratko Polak [ 09/Oct/17 ] |
|
>> Even with Odlparent 2.0.5 (Karaf 4.0.10) we can run out of heap when
> releng/builder improvement is ready: [3].
Compressed heap dump is archived: [4].
[4] https://logs.opendaylight.org/releng/jenkins092/distribution-check-nitrogen/425/hprof.tar.xz |