[CONTROLLER-1844] Unable to start blueprint container for bundle org.opendaylight.netconf.restconf-nb-bierman02-auth/1.8.0 Created: 26/Jun/18 Updated: 28/Jun/18 Resolved: 28/Jun/18 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering, netconf, restconf |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | Medium |
| Reporter: | Victor Pickard | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
This bundle is failing to start in the controller clustering job 127, after ODL is restarted with Tell Based False.
From then on, RPC failures occur with error 500.
Snippet from odl1_karaf.log
======================
2018-06-25T13:04:34,012 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 75 - org.apache.aries.blueprint.core - 1.8.3 | Unable to start blueprint container for bundle org.opendaylight.netconf.restconf-nb-bierman02-auth/1.8.0 |
| Comments |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Can you please provide exact reproduction steps? |
| Comment by Victor Pickard [ 27/Jun/18 ] |
|
Hi Tom, This is the controller csit clustering job. The job stops ODL on all nodes, with a kill -9 on karaf pid. Then, the job starts ODL on all nodes, with ../bin/start. When ODL starts, we see the exception, and restconf fails with error 500 from then on.
|
| Comment by Victor Pickard [ 27/Jun/18 ] |
|
Here is a link to the job with the exception. This will make it a little easier to see all the karaf logs, etc.
|
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
I think there's more detail in there. You mentioned changing tell-based setting - I assume that happens after kill -9. Does it delete the data dir before restarting? Is this reproducible every time or intermittent? Is it reproducible with single node? |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
I think I see the problem - I see both restconf-nb-bierman02-noauth and restconf-nb-bierman02-auth features being installed. It should be one or the other although we should put in a guard to at least avoid the ISE and emit a warning in Bierman02WebRegistrarImpl if the web context has already been created. |
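For illustration only (this is not the actual patch), the guard described above can be sketched as a registrar that warns and skips a duplicate registration instead of throwing. The class name `WebRegistrarSketch`, its methods, and the `/restconf` context path are hypothetical and do not come from the ODL code base.

```python
import logging

log = logging.getLogger("web_registrar_sketch")


class WebRegistrarSketch:
    """Hypothetical stand-in for a web-context registrar; not ODL code."""

    def __init__(self):
        # Maps each claimed context path to the bundle that claimed it.
        self._registered = {}

    def register(self, context_path, owner):
        """Register context_path for owner; return True if it took effect."""
        existing = self._registered.get(context_path)
        if existing is not None:
            # Guard: warn and skip instead of raising on a duplicate.
            log.warning(
                "Web context %s already registered by %s; ignoring %s",
                context_path, existing, owner,
            )
            return False
        self._registered[context_path] = owner
        return True

    def owner_of(self, context_path):
        """Return the bundle that holds context_path, or None."""
        return self._registered.get(context_path)
```

With a guard like this, the second bundle to start is ignored with a warning and the first one keeps serving the context.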
| Comment by Victor Pickard [ 27/Jun/18 ] |
|
Great, if you put a link to the patch when you have it ready, I should be able to run it through csit to see how it looks.
|
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Well CSIT should not install both features |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
featuresBoot = odl-integration-compatible-with-all,odl-jolokia,odl-restconf-noauth,odl-clustering-test-app, 1a22a137-5311-4fe4-b894-bff5147a7819 |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
where is the misconfig? maybe int/dist has something wrong with the compatible-with-all? otherwise, the misconfig is in the cluster-test-app or restconf-noauth or odl-jolokia features. |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
I suspect odl-integration-compatible-with-all installs odl-restconf. It's not wrong - it's just that installing odl-restconf-noauth in addition causes the issue. Like I said, I'll push a patch to guard against it but you'll get non-deterministic behavior across runs, ie one run uses auth, the next doesn't. Maybe that's why we're still seeing time outs - we think it's using noauth but it's really not. It looks like we need the equivalent of odl-integration-compatible-with-all that installs noauth. Or take restconf out of odl-integration-compatible-with-all and install the desired version separately. |
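A CSIT-side check for this misconfiguration could look roughly like the sketch below. It parses a `featuresBoot` line and reports which entries would bring in the auth variant alongside `odl-restconf-noauth`. The function names are hypothetical, and treating `odl-integration-compatible-with-all` as a source of `odl-restconf` is an assumption taken from the suspicion stated above, not something verified here.

```python
# Features assumed (per the discussion above) to install the auth variant.
AUTH_SOURCES = ("odl-restconf", "odl-integration-compatible-with-all")
NOAUTH_FEATURE = "odl-restconf-noauth"


def parse_features_boot(line):
    """Return the feature names from a 'featuresBoot = a,b,c' line."""
    _, _, value = line.partition("=")
    return [name.strip() for name in value.split(",") if name.strip()]


def conflicting_auth_sources(features,
                             auth_sources=AUTH_SOURCES,
                             noauth=NOAUTH_FEATURE):
    """Return the auth-installing features present together with noauth.

    An empty list means no conflict was detected.
    """
    if noauth not in features:
        return []
    return [name for name in features if name in auth_sources]


if __name__ == "__main__":
    line = ("featuresBoot = odl-integration-compatible-with-all,odl-jolokia,"
            "odl-restconf-noauth,odl-clustering-test-app")
    print(conflicting_auth_sources(parse_features_boot(line)))
```

Run against the feature list quoted earlier in this thread, the check flags `odl-integration-compatible-with-all`; dropping either that feature or `odl-restconf-noauth` clears it.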
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Patch to guard against multiple registrations: https://git.opendaylight.org/gerrit/#/c/73488/ |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
I assume odl-integration-compatible-with-all installs the "world" (at least the managed world). If so and this is a controller CSIT why are we installing all that? All you should need is odl-clustering-test-app, odl-jolokia and odl-restconf(-noauth). |
| Comment by Victor Pickard [ 27/Jun/18 ] |
|
Thanks Tom. I've just queued a csit job that removes the odl-netconf-noauth from the features list.
The odl-integration-compatible-with-all feature is installed as part of clustering config; I'll have to dig some more or get input from Jamo on how best to adjust that one. |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
ok, so we got a good theory here, and I know vpickard is testing this in the sandbox now, but running the job without |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
We can do that but that patch really shouldn't be needed for CSIT - as I've mentioned you don't want to install both auth and no-auth - it will be non-deterministic which one will take effect on each run. You should be able to just remove odl-integration-compatible-with-all as I've mentioned. odl-clustering-test-app installs mdsal etc - that's all I install, along with restconf, when I'm testing with the cars stuff. |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
the point of compatible-with-all is to load I wasn't implying the patch is needed for CSIT. If this is really what's happening, the patch to protect btw, I don't understand how this explains the sporadic nature of this problem. If compatible-with-all |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
It repro'ed every time I installed both - same exact error - so the theory is correct - based on how pax web works it is deterministic wrt failure. Are you sure you're really seeing it sporadically? Has there ever been a case where you saw it occur and didn't install both? I thought compatible-with-all was for dist-check jobs... is it necessary for CSIT?
|
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
ok, good that it's easy to reproduce. vpickard, I assumed this was one of the sporadic failures. Is this happening compatible-with-all is what is used in every CSIT job you see with |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Maybe it's a legacy thing. I mean if a CSIT is testing clustering with cars, why install eg lispflowmapping, sfc et al? What if one of those unused features has an orthogonal issue that affects the clustering tests? |
| Comment by Victor Pickard [ 27/Jun/18 ] |
|
I took a closer look at the logs for this CSIT run. ODL is stopped and started a number of times in the CSIT.
From what I see, one of the features will end up successfully registering, and the other one will fail. And it isn't the same one that wins every time. I think we are/should be using the bierman02-auth version, right? So, if that is the one that fails to register, then 500 from then on until ODL is restarted in next test case down the line. |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Right - that's what I mentioned before - it's non-deterministic which one gets in first. We should be using auth. In fact no-auth really should not even exist anymore - it was kept for legacy from the days before we even had authentication. Devs wanted to keep it for convenience (including me). That said, the reason for using no-auth now in CSIT was to try to isolate the restconf failures in https://jira.opendaylight.org/browse/CONTROLLER-1838. Which then led to this issue b/c CSIT was installing both which is a no-no. I'll lower the priority.
|
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
I think that's the point, to find (and fix) the orthogonal issues that come up. If it's deemed some kind of we used to have both |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
So, it's non-deterministic, but tpantelis can hit it every time which sounds deterministic maybe the CSIT timing is random enough that it doesn't happen every time... Also, tpantelis this is not the same job or same feature set from vpickard, once we know this never comes once we don't have -noauth, we can close this tpantelis, I wonder if we should hide this -noauth feature somehow, since it's strictly a |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
Maybe I'm not understanding something here... but I thought it's the purpose of the distribution-check jobs which run for every gerrit patch to test feature compatibility across all the projects. Hence why compatible-with-all was created - distribution-check jobs install and verify all the features come up. If all that is correct (which I think it is), why do project CSITs need to duplicate what distribution-check jobs already do? |
| Comment by Jamo Luhrsen [ 27/Jun/18 ] |
|
no, that's not correct. compatible-with-all was there from the start and used in CSIT. distribution-check came example, l2switch, if loaded along side of netvirt will break things. l2switch is not compatible with other anyway, none of this really matters as it applies to this jira. this jira is about us investigating and figuring out (we hope) that we should not be using -noauth |
| Comment by Tom Pantelis [ 27/Jun/18 ] |
|
I mean it's non-deterministic which feature will install and which will fail, ie first one in wins. The failure (w/o my patch) is deterministic. Not sure how you hide it. It might not just be a dev tool - it's been released for many versions so someone else may be using it out there so we technically can't/shouldn't just remove it from the release w/o going thru the deprecation/EOL cycle. That process can certainly be started if someone wants to drive it...
|
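The distinction made above (the winner varies, the failure does not) can be illustrated with a toy model, assuming only what the thread states: two bundles compete for one web context and the first to start wins. `start_bundles` is a hypothetical helper, not part of ODL or pax web.

```python
AUTH = "restconf-nb-bierman02-auth"
NOAUTH = "restconf-nb-bierman02-noauth"


def start_bundles(order):
    """Simulate bundles claiming one shared web context in start order.

    Returns (winner, failed): the first bundle wins, every later one fails.
    """
    winner = None
    failed = []
    for bundle in order:
        if winner is None:
            winner = bundle
        else:
            failed.append(bundle)
    return winner, failed


# Both possible start orders: exactly one bundle fails in each case,
# but which one fails depends on the order.
for order in ([AUTH, NOAUTH], [NOAUTH, AUTH]):
    winner, failed = start_bundles(order)
    print(f"order={order}: winner={winner}, failed={failed}")
```

Whenever both bundles are installed one of them always fails, which matches the reliable repro; which one fails depends on start order, which matches the varying winner seen across CSIT restarts.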
| Comment by Victor Pickard [ 28/Jun/18 ] |
|
Here is the patch that removed odl-netconf-noauth from the clustering csit job. Patch has been merged. |