[CONTROLLER-1844] Unable to start blueprint container for bundle org.opendaylight.netconf.restconf-nb-bierman02-auth/1.8.0 Created: 26/Jun/18  Updated: 28/Jun/18  Resolved: 28/Jun/18

Status: Verified
Project: controller
Component/s: clustering, netconf, restconf
Affects Version/s: None
Fix Version/s: Fluorine

Type: Bug Priority: Medium
Reporter: Victor Pickard Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress

 Description   

This bundle is failing to start in the controller clustering job 127, after ODL is restarted with Tell Based False.

 

From then on, RPC failures occur with error 500.

 

 

Snippet from odl1_karaf.log

====================== 

2018-06-25T13:04:34,012 | ERROR | Blueprint Extender: 2 | BlueprintContainerImpl | 75 - org.apache.aries.blueprint.core - 1.8.3 | Unable to start blueprint container for bundle org.opendaylight.netconf.restconf-nb-bierman02-auth/1.8.0
org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean .component-1 of class org.opendaylight.restconf.nb.bierman02.web.auth.WebInitializer
at org.apache.aries.blueprint.container.BeanRecipe.wrapAsCompDefEx(BeanRecipe.java:361) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BeanRecipe.getInstanceFromType(BeanRecipe.java:351) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BeanRecipe.getInstance(BeanRecipe.java:282) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BeanRecipe.internalCreate2(BeanRecipe.java:830) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BeanRecipe.internalCreate(BeanRecipe.java:811) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79) [75:org.apache.aries.blueprint.core:1.8.3]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BlueprintRepository.createInstances(BlueprintRepository.java:255) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BlueprintRepository.createAll(BlueprintRepository.java:186) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BlueprintContainerImpl.instantiateEagerComponents(BlueprintContainerImpl.java:704) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BlueprintContainerImpl.doRun(BlueprintContainerImpl.java:410) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.container.BlueprintContainerImpl.run(BlueprintContainerImpl.java:275) [75:org.apache.aries.blueprint.core:1.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
at org.apache.aries.blueprint.container.ExecutorServiceWrapper.run(ExecutorServiceWrapper.java:106) [75:org.apache.aries.blueprint.core:1.8.3]
at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48) [75:org.apache.aries.blueprint.core:1.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
at java.lang.Thread.run(Thread.java:748) [?:?]
Caused by: java.lang.IllegalStateException: Http context already used. Context params can be set/changed only before first usage
at org.ops4j.pax.web.service.internal.HttpServiceStarted.setContextParam(HttpServiceStarted.java:707) ~[?:?]
at org.ops4j.pax.web.service.internal.HttpServiceProxy.setContextParam(HttpServiceProxy.java:271) ~[?:?]
at org.opendaylight.aaa.web.osgi.PaxWebServer$WebContextImpl.<init>(PaxWebServer.java:164) ~[?:?]
at org.opendaylight.aaa.web.osgi.PaxWebServer$2$1.<init>(PaxWebServer.java:116) ~[?:?]
at org.opendaylight.aaa.web.osgi.PaxWebServer$2.registerWebContext(PaxWebServer.java:116) ~[?:?]
at Proxy61f92867_78e3_4dd2_a6c9_e40871b2c69a.registerWebContext(Unknown Source) ~[?:?]
at org.opendaylight.netconf.sal.restconf.web.Bierman02WebRegistrarImpl.register(Bierman02WebRegistrarImpl.java:82) ~[?:?]
at org.opendaylight.netconf.sal.restconf.web.Bierman02WebRegistrarImpl.registerWithAuthentication(Bierman02WebRegistrarImpl.java:55) ~[?:?]
at Proxyb86b267b_05f2_4bbe_bd31_3106e600b567.registerWithAuthentication(Unknown Source) ~[?:?]
at Proxyf7cb35fe_3ba9_43d0_a739_d4c847b9747b.registerWithAuthentication(Unknown Source) ~[?:?]
at org.opendaylight.restconf.nb.bierman02.web.auth.WebInitializer.<init>(WebInitializer.java:19) ~[?:?]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:?]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:?]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:?]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:?]
at org.apache.aries.blueprint.utils.ReflectionUtils.newInstance(ReflectionUtils.java:331) ~[?:?]
at org.apache.aries.blueprint.container.BeanRecipe.newInstance(BeanRecipe.java:984) ~[?:?]
at org.apache.aries.blueprint.container.BeanRecipe.getInstanceFromType(BeanRecipe.java:349) ~[?:?]
... 22 more



 Comments   
Comment by Tom Pantelis [ 27/Jun/18 ]

Can you please provide exact reproduction steps?

Comment by Victor Pickard [ 27/Jun/18 ]

Hi Tom,

This is the controller csit clustering job. 

The job stops ODL on all nodes, with a kill -9 on karaf pid.

Then, the job starts ODL on all nodes, with ../bin/start.

When ODL starts, we see the exception, and restconf fails with error 500 from then on.

 

Comment by Victor Pickard [ 27/Jun/18 ]

Here is a link to the job with the exception. This will make it a little easier to see all the karaf logs, etc.

 

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-all-fluorine/127/

 

Comment by Tom Pantelis [ 27/Jun/18 ]

I think there's more detail in there. You mentioned changing tell-based setting - I assume that happens after kill -9. Does it delete the data dir before restarting? Is this reproducible every time or intermittent? Is it reproducible with single node?

Comment by Tom Pantelis [ 27/Jun/18 ]

I think I see the problem - I see both restconf-nb-bierman02-noauth and restconf-nb-bierman02-auth features being installed. It should be one or the other, although we should put in a guard in Bierman02WebRegistrarImpl to at least avoid the ISE and emit a warning if the web context has already been created.
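
A minimal sketch of the kind of guard described above, assuming a single registrar instance is shared by both the auth and noauth bundles. The class GuardedWebRegistrar, its register method and the variant names are hypothetical illustrations only; this is not the actual Bierman02WebRegistrarImpl code.

```java
import java.util.logging.Logger;

// Hypothetical stand-in for a shared web registrar; names are illustrative only.
class GuardedWebRegistrar {
    private static final Logger LOG = Logger.getLogger(GuardedWebRegistrar.class.getName());

    // True once some caller has successfully created the web context.
    private boolean registered = false;

    /**
     * Runs doRegister only for the first caller. Later callers get a warning
     * instead of triggering an IllegalStateException in the underlying web server.
     *
     * @return true if this call performed the registration, false if it was skipped
     */
    synchronized boolean register(String variant, Runnable doRegister) {
        if (registered) {
            LOG.warning("Web context already registered; ignoring registration for " + variant);
            return false;
        }
        doRegister.run();
        registered = true; // only set after the registration succeeded
        return true;
    }

    public static void main(String[] args) {
        GuardedWebRegistrar registrar = new GuardedWebRegistrar();
        registrar.register("auth", () -> System.out.println("web context created"));
        // Second attempt is skipped with a warning rather than failing.
        registrar.register("noauth", () -> System.out.println("not reached"));
    }
}
```

Note that a guard like this only avoids the exception; which variant ends up registered still depends on which bundle starts first.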

Comment by Victor Pickard [ 27/Jun/18 ]

Great, if you put a link to the patch when you have it ready, I should be able to run it through csit to see how it looks. 

 

Comment by Tom Pantelis [ 27/Jun/18 ]

Well, CSIT should not install both features, so I'd suggest fixing that to unblock. Seeing that it's a misconfiguration, I think we can lower the severity/priority.

Comment by Jamo Luhrsen [ 27/Jun/18 ]

featuresBoot = odl-integration-compatible-with-all,odl-jolokia,odl-restconf-noauth,odl-clustering-test-app, 1a22a137-5311-4fe4-b894-bff5147a7819

Comment by Jamo Luhrsen [ 27/Jun/18 ]

where is the misconfig? maybe int/dist has something wrong with the compatible-with-all? otherwise, the misconfig is in the cluster-test-app or restconf-noauth or odl-jolokia features.

Comment by Tom Pantelis [ 27/Jun/18 ]

I suspect odl-integration-compatible-with-all installs odl-restconf. It's not wrong - it's just that installing odl-restconf-noauth in addition causes the issue. Like I said, I'll push a patch to guard against it, but you'll get non-deterministic behavior across runs, i.e. one run uses auth, the next doesn't. Maybe that's why we're still seeing timeouts - we think it's using noauth but it's really not. It looks like we need the equivalent of odl-integration-compatible-with-all that installs noauth. Or take restconf out of odl-integration-compatible-with-all and install the desired version separately.
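
A rough sketch of how a featuresBoot line could be checked for this conflict. The PULLS_IN map encodes the suspicion stated above (that odl-integration-compatible-with-all installs odl-restconf) as an assumption, not a verified fact, and the class FeaturesBootCheck with its methods is hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of any OpenDaylight or Karaf tooling.
class FeaturesBootCheck {
    // ASSUMPTION: compatible-with-all transitively installs odl-restconf (suspected above, unverified).
    static final Map<String, Set<String>> PULLS_IN =
            Map.of("odl-integration-compatible-with-all", Set.of("odl-restconf"));

    /** Splits a "featuresBoot = a,b,c" line and adds the assumed transitive features. */
    static Set<String> effectiveFeatures(String featuresBootLine) {
        String value = featuresBootLine.substring(featuresBootLine.indexOf('=') + 1);
        Set<String> features = new TreeSet<>();
        for (String raw : value.split(",")) {
            String feature = raw.trim();
            if (feature.isEmpty()) {
                continue;
            }
            features.add(feature);
            features.addAll(PULLS_IN.getOrDefault(feature, Set.of()));
        }
        return features;
    }

    /** True if both the auth and noauth restconf features would end up installed. */
    static boolean installsBothRestconfVariants(String featuresBootLine) {
        Set<String> features = effectiveFeatures(featuresBootLine);
        return features.contains("odl-restconf") && features.contains("odl-restconf-noauth");
    }

    public static void main(String[] args) {
        String line = "featuresBoot = odl-integration-compatible-with-all,odl-jolokia,"
                + "odl-restconf-noauth,odl-clustering-test-app";
        System.out.println("conflict: " + installsBothRestconfVariants(line));
    }
}
```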

Comment by Tom Pantelis [ 27/Jun/18 ]

Patch to guard against multiple registrations: https://git.opendaylight.org/gerrit/#/c/73488/

Comment by Tom Pantelis [ 27/Jun/18 ]

I assume odl-integration-compatible-with-all installs the "world" (at least the managed world). If so, and this is a controller CSIT, why are we installing all that? All you should need is odl-clustering-test-app, odl-jolokia and odl-restconf(-noauth).

Comment by Victor Pickard [ 27/Jun/18 ]

Thanks Tom. I just queued a csit job that removes the odl-netconf-noauth from the features list.

 

The odl-integration-compatible-with-all feature is installed as part of clustering config, I'll have to dig some more or get input from Jamo on how best to adjust that one.

Comment by Jamo Luhrsen [ 27/Jun/18 ]

ok, so we've got a good theory here, and I know vpickard is testing this in the sandbox now, by running the job without
the noauth feature being installed. The protection patch c/73488 seems like a good idea as well. Need it cherry picked
to oxygen too, I guess.

Comment by Tom Pantelis [ 27/Jun/18 ]

We can do that but that patch really shouldn't be needed for CSIT - as I've mentioned you don't want to install both auth and no-auth - it will be non-deterministic which one will take effect on each run.  You should be able to just remove odl-integration-compatible-with-all as I've mentioned. odl-clustering-test-app installs mdsal etc - that's all I install, along with restconf, when I'm testing with the cars stuff.

Comment by Jamo Luhrsen [ 27/Jun/18 ]

the point of compatible-with-all is to load all the features we have determined should be able to
run together without breaking things.

I wasn't implying the patch is needed for CSIT. If this is really what's happening, the patch to protect
/restconf from being totally dead is nice to have. We will now also know that putting noauth on the
same job as running compatible-with-all is wrong.

btw, I don't understand how this explains the sporadic nature of this problem. If compatible-with-all
is bringing in -auth and we are also always trying to install -noauth, then why doesn't the issue show
up every time?

Comment by Tom Pantelis [ 27/Jun/18 ]

It repro'ed every time I installed both - same exact error - so the theory is correct - based on how pax web works it is deterministic wrt failure. Are you sure you're really seeing it sporadically? Has there ever been a case where you saw it occur and didn't install both?

I thought compatible-with-all was for dist-check jobs... isn't necessary for CSIT?

 

 

Comment by Jamo Luhrsen [ 27/Jun/18 ]

ok, good that it's easy to reproduce. vpickard, I assumed this was one of the sporadic failures. Is this happening
every time? Seems that the end result of restconf being totally dead would make every test case fail.

compatible-with-all is what is used in every CSIT job you see with "all" in its name. The jobs with "only" in the name do not
install that.

Comment by Tom Pantelis [ 27/Jun/18 ]

Maybe it's a legacy thing. I mean if a CSIT is testing clustering with cars, why install eg lispflowmapping, sfc et al? What if one of those unused features has an orthogonal issue that affects the clustering tests?

Comment by Victor Pickard [ 27/Jun/18 ]

I took a closer look at the logs for this CSIT run. ODL is stopped and started a number of times in the CSIT.

 

From what I see, one of the features will end up successfully registering, and the other one will fail. And it isn't the same one that wins every time.

I think we are/should be using the bierman02-auth version, right? So, if that is the one that fails to register, then 500 from then on until ODL is restarted in next test case down the line.

Comment by Tom Pantelis [ 27/Jun/18 ]

Right - that's what I mentioned before - it's non-deterministic which one gets in first. We should be using auth. In fact no-auth really should not even exist anymore - it was kept for legacy from the days before we even had authentication. Devs wanted to keep it for convenience (including me). You may notice the new rfc8040 restconf feature no longer has a no-auth option - I removed it recently.

That said, the reason for using no-auth now in CSIT was to try to isolate the restconf failures in https://jira.opendaylight.org/browse/CONTROLLER-1838. That then led to this issue b/c CSIT was installing both, which is a no-no.

I'll lower the priority.

 

Comment by Jamo Luhrsen [ 27/Jun/18 ]

I think that's the point, to find (and fix) the orthogonal issues that come up. If it's deemed some kind of
unresolvable conflict, then we remove the feature from the compatible-with-all list.

we used to have both only and all jobs, so we could more easily know if the problems were
coming like this, but we were overloading jenkins, so where we could we removed the only
jobs. this is one such case.

Comment by Jamo Luhrsen [ 27/Jun/18 ]

So, it's non-deterministic, but tpantelis can hit it every time which sounds deterministic

maybe the CSIT timing is random enough that it doesn't happen every time...

Also, tpantelis this is not the same job or same feature set from CONTROLLER-1838. let's
not tie them together.

vpickard, once we know this never occurs when we don't have -noauth, we can close this
bug, I think.

tpantelis, I wonder if we should hide this -noauth feature somehow, since it's strictly a
dev tool?

Comment by Tom Pantelis [ 27/Jun/18 ]

Maybe I'm not understanding something here... but I thought it's the purpose of the distribution-check jobs which run for every gerrit patch to test feature compatibility across all the projects. Hence why compatible-with-all was created - distribution-check jobs install and verify all the features come up.

If all that is correct (which I think it is), why do project CSITs need to duplicate what distribution-check jobs already do? 

Comment by Jamo Luhrsen [ 27/Jun/18 ]

no, that's not correct. compatible-with-all was there from the start and used in CSIT. distribution-check came
later and took advantage of it. And, distribution-check just loads features. CSIT executes functional/system
level tests against those features.

example, l2switch, if loaded alongside netvirt, will break things. l2switch is not compatible with other
openflow-y applications. However, l2switch will load just fine alongside netvirt. Thus, we ended
up creating those buckets (compatible or not) so we can make sure we can keep track of these kinds of
things.

anyway, none of this really matters as it applies to this jira.

this jira is about us investigating and figuring out (we hope) that we should not be using -noauth
(anywhere, I guess) in our CSIT jobs.

Comment by Tom Pantelis [ 27/Jun/18 ]

I mean it's non-deterministic which feature will install and which will fail, ie first one in wins. The failure (w/o my patch) is deterministic.
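
The "first one in wins" behavior can be illustrated with a small simulation: two threads stand in for the auth and noauth bundles, and whichever claims the shared context first becomes the owner. This is only a sketch of the race; the class FirstOneWins is hypothetical and does not model pax web or blueprint startup ordering.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical simulation of two bundles racing to claim one web context.
class FirstOneWins {
    /** Starts two "bundles" together and returns the name of whichever claimed the context. */
    static String race() {
        AtomicReference<String> owner = new AtomicReference<>();
        CountDownLatch start = new CountDownLatch(1);
        String[] names = {"auth", "noauth"};
        Thread[] threads = new Thread[names.length];
        for (int i = 0; i < names.length; i++) {
            String name = names[i];
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // wait so both threads are released together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                // Only the first caller succeeds; the second finds the context taken.
                owner.compareAndSet(null, name);
            });
            threads[i].start();
        }
        start.countDown();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for race to finish", e);
            }
        }
        return owner.get();
    }

    public static void main(String[] args) {
        // The winner may differ from run to run, but there is always exactly one.
        for (int i = 0; i < 5; i++) {
            System.out.println("run " + i + ": winner = " + race());
        }
    }
}
```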

Not sure how you hide it. It might not just be a dev tool - it's been released for many versions so someone else may be using it out there so we technically can't/shouldn't just remove it from the release w/o going thru the deprecation/EOL cycle. That process can certainly be started if someone wants to drive it...

 

 

Comment by Victor Pickard [ 28/Jun/18 ]

Here is the patch that removed odl-netconf-noauth from the clustering csit job. Patch has been merged.

https://git.opendaylight.org/gerrit/#/c/73535/
