[CONTROLLER-1732] nitrogen: server unhealthy for karaf4 when feature installed Created: 29/Jun/17  Updated: 25/Jul/23  Resolved: 15/Aug/17

Status: Resolved
Project: controller
Component/s: config
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Tomas Cere
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Text File unhealthy_karaf.log    
External issue ID: 8771

 Description   

odl: https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/karaf/0.7.0-SNAPSHOT/karaf-0.7.0-20170628.124408-72.zip

It was built from integration/distribution=07/59607/4 in https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-nitrogen/12. This patch added odl-netconf-clustered-topology into the karaf4 distribution.

Steps to reproduce
-unzip -q karaf-0.7.0-20170628.124408-72.zip
-start karaf and feature:install odl-netconf-clustered-topology

Exception:
2017-06-29 10:02:59,015 | ERROR | config-pusher | ConfigRegistryImpl | 138 - org.opendaylight.controller.config-manager - 0.7.0.SNAPSHOT | Configuration Transaction failed on 2PC, server is unhealthy
java.lang.IllegalStateException: Error - getInstance() failed for ModuleIdentifier

{factoryName='netty-threadgroup-fixed', instanceName='global-worker-group'}

in transaction TransactionIdentifier

{name='ConfigTransaction-21-23'}

at org.opendaylight.controller.config.manager.impl.ConfigTransactionControllerImpl.secondPhaseCommit(ConfigTransactionControllerImpl.java:446)
at org.opendaylight.controller.config.manager.impl.ConfigRegistryImpl.secondPhaseCommit(ConfigRegistryImpl.java:342)
at org.opendaylight.controller.config.manager.impl.ConfigRegistryImpl.commitConfigSafe(ConfigRegistryImpl.java:283)
at org.opendaylight.controller.config.manager.impl.ConfigRegistryImpl.commitConfig(ConfigRegistryImpl.java:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)[:1.8.0_131]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)[:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498)[:1.8.0_131]
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)[:1.8.0_131]
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498)[:1.8.0_131]
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)[:1.8.0_131]
at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:193)[:1.8.0_131]
at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:175)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:117)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:54)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)[:1.8.0_131]
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)[:1.8.0_131]
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)[:1.8.0_131]
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MXBeanProxy$InvokeHandler.invoke(MXBeanProxy.java:150)[:1.8.0_131]
at com.sun.jmx.mbeanserver.MXBeanProxy.invoke(MXBeanProxy.java:167)[:1.8.0_131]
at javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:258)[:1.8.0_131]
at com.sun.proxy.$Proxy35.commitConfig(Unknown Source)[68:org.opendaylight.controller.config-api:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.util.ConfigRegistryJMXClient.commitConfig(ConfigRegistryJMXClient.java:110)[70:org.opendaylight.controller.config-util:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.facade.xml.transactions.TransactionProvider.commitTransaction(TransactionProvider.java:142)[69:org.opendaylight.controller.config-manager-facade-xml:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.facade.xml.ConfigSubsystemFacade.commitSilentTransaction(ConfigSubsystemFacade.java:138)[69:org.opendaylight.controller.config-manager-facade-xml:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.ConfigPusherImpl.pushConfig(ConfigPusherImpl.java:291)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.ConfigPusherImpl.pushConfigWithConflictingVersionRetries(ConfigPusherImpl.java:159)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.ConfigPusherImpl.internalPushConfigs(ConfigPusherImpl.java:124)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.ConfigPusherImpl.processSingle(ConfigPusherImpl.java:78)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.ConfigPusherImpl.process(ConfigPusherImpl.java:70)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at org.opendaylight.controller.config.persist.impl.osgi.ConfigPersisterActivator$2.run(ConfigPersisterActivator.java:131)[142:org.opendaylight.controller.config-persister-impl:0.7.0.SNAPSHOT]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_131]
Caused by: org.opendaylight.controller.config.api.osgi.ServiceNotFoundException: Wait for OSGi service interface io.netty.channel.EventLoopGroup was interrrupted
at org.opendaylight.controller.config.api.osgi.WaitingServiceTracker.waitForService(WaitingServiceTracker.java:54)
at org.opendaylight.controller.config.yang.netty.threadgroup.NettyThreadgroupModule.createInstance(NettyThreadgroupModule.java:58)
at org.opendaylight.controller.config.spi.AbstractModule.getInstance(AbstractModule.java:87)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)[:1.8.0_131]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)[:1.8.0_131]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)[:1.8.0_131]
at java.lang.reflect.Method.invoke(Method.java:498)[:1.8.0_131]
at org.opendaylight.controller.config.manager.impl.dependencyresolver.DependencyResolverManager$ModuleInvocationHandler.handleInvocation(DependencyResolverManager.java:148)
at com.google.common.reflect.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:84)
at com.sun.proxy.$Proxy47.getInstance(Unknown Source)
at org.opendaylight.controller.config.manager.impl.ConfigTransactionControllerImpl.secondPhaseCommit(ConfigTransactionControllerImpl.java:439)
... 35 more



 Comments   
Comment by Peter Gubka [ 29/Jun/17 ]

Attachment unhealthy_karaf.log has been added with description: karaf log

Comment by Vratko Polak [ 29/Jun/17 ]

> feature:install odl-netconf-clustered-topology

Does this happen also if you do the following instead?
feature:install odl-config-netty

Comment by Peter Gubka [ 29/Jun/17 ]

(In reply to Vratko Polák from comment #1)
> > feature:install odl-netconf-clustered-topology
>
> Does this happen also if you do the following instead?
> feature:install odl-config-netty

No, it does not happen with feature:install odl-config-netty.

Comment by Tomas Cere [ 13/Jul/17 ]

https://git.opendaylight.org/gerrit/#/c/60253/
https://git.opendaylight.org/gerrit/#/c/60254/

There seems to be no need afaik to start the threadpools via cfg subsystem,
this gets rid of the unhealthy message, but there is some weird behaviour where the config pusher seems to be waiting for some unrelated features? Im kind of stumped because everything with regards to the feature seems to be working.

Comment by Tom Pantelis [ 14/Jul/17 ]

The netty threadpool CSS modules are still used by bpgcep. It looks like the netty-threadgroup-fixed module was interrupted on shut down waiting for the corresponding blueprint container to advertise the services. This indicates either the blueprint container failed startup or the bundle hadn't activated yet, ie some other previous bundle was hung/stuck activating. If it's the former, there would be an error in the log to indicate it.

Comment by Tom Pantelis [ 14/Jul/17 ]

Either way, a CSS failure should not affect netconf now as all CSS remnants are gone from netconf/restconf.

Comment by Tomas Cere [ 14/Jul/17 ]

It doesnt really affect netconf apart from distribution check as everythings working, but i was told by the testers that distribution-check keeps failing due to this in the logs.

Comment by Vratko Polak [ 14/Jul/17 ]

> CSS failure should not affect netconf

Distribution check is failing if it detects "server is unhealthy", because at that point Config SubSystem is unable to make further progress, e.g. subsequent feature installations which rely on config pusher would not see their CSS config files applied.

If Netconf is not affected, it means it does not really need the config file which caused the failure. Or perhaps there were some bundle refreshes and whatnot, and config pusher is attempting to apply the same config for a second time? Either way, Netconf should not be causing this behavior.

Comment by Tom Pantelis [ 14/Jul/17 ]

right - it's not an issue with netconf unless something in netconf is blocking bundle activation - you need to get a thread dump. The CSS modules in questions are from the controller and unfortunately they must remain a while longer for backwards compat.

Comment by Tomas Cere [ 14/Jul/17 ]

So whats the next step here? We cant get the netconf features into the distribution with failing distribution-check..

Comment by Tom Pantelis [ 14/Jul/17 ]

(In reply to Tomas Cere from comment #9)
> So whats the next step here? We cant get the netconf features into the
> distribution with failing distribution-check..

As I mentioned, need to get a thread dump.

Comment by Tomas Cere [ 17/Jul/17 ]

I took a thread dump but there doesnt seem to be anything suspicous

https://gist.github.com/cerko2/9a46093fcd4e15dee575dd2318bab665

Comment by Stephen Kitt [ 18/Jul/17 ]

https://git.opendaylight.org/gerrit/60544 and https://git.opendaylight.org/gerrit/60545 should avoid the bcpkix-related refreshes.

Comment by Vratko Polak [ 15/Aug/17 ]

I believe this is fixed since Odlparent 2.0.3

Generated at Wed Feb 07 19:56:18 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.