[CONTROLLER-1809] Failed to restart all blueprint containers within 5 minutes. Attempted to restart 1 [org.opendaylight.controller.netty-threadgroup-config_0.8.0.SNAPSHOT [209]] but only 0 completed restart Created: 22/Jan/18  Updated: 24/Jan/18  Resolved: 24/Jan/18

Status: Resolved
Project: controller
Component/s: blueprint
Affects Version/s: None
Fix Version/s: Oxygen

Type: Bug Priority: Highest
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: karaf.log, karaf_with_c67429.log

 Description   

<rant>It shocks me personally that one week after the Big Bump, after the release code freeze and the projects' milestone reports, we are still in a complete mess with netvirt totally broken, and everyone (incl. the full TSC) just finds this "perfectly normal" and sort of "a week or two of instability is a fair price to pay for the supposed stability gained by releasing odlparent and yangtools separately". It seems to me that we are doing this all wrong. Continuous integration across all projects, with better tooling to gauge impacts, rather than separate releasing, seems to me a better model for ODL to discover this kind of thing faster and earlier.</rant>

shague reports that, following the big odlparent/yangtools bump, the netvirt CSIT which runs off integration/distribution/karaf with odl-infrautils-ready and odl-netvirt-openstack "frequently" fails during boot up, before the CSIT actually "does anything" ... we suspect that something shuts things down, and indeed this log looks very suspicious; here are the most relevant/interesting lines:

2018-01-22T11:01:25,818 | INFO  | awaitility[checkBundleDiagInfos] | SystemReadyImpl                  | 278 - org.opendaylight.infrautils.ready-impl - 1.3.0.SNAPSHOT | checkBundleDiagInfos: Elapsed time 17s, remaining time 282s, diag: Active {Installed=0, Resolved=6, Unknown=0, GracePeriod=0, Waiting=0, Starting=0, Active=461, Stopping=0, Failure=0}
2018-01-22T11:01:25,846 | INFO  | SystemReadyService-0 | TestBundleDiag                   | 278 - org.opendaylight.infrautils.ready-impl - 1.3.0.SNAPSHOT | diag: Active {Installed=0, Resolved=6, Unknown=0, GracePeriod=0, Waiting=0, Starting=0, Active=461, Stopping=0, Failure=0}
2018-01-22T11:01:25,847 | INFO  | SystemReadyService-0 | SystemReadyImpl                  | 278 - org.opendaylight.infrautils.ready-impl - 1.3.0.SNAPSHOT | System ready; AKA: Aye captain, all warp coils are now operating at peak efficiency! [M.]
2018-01-22T11:01:25,860 | INFO  | SystemReadyService-0 | SystemReadyImpl                  | 278 - org.opendaylight.infrautils.ready-impl - 1.3.0.SNAPSHOT | Now notifying all its registered SystemReadyListeners...
2018-01-22T11:01:27,018 | INFO  | SystemReadyService-0 | MBeanUtils                       | 269 - org.opendaylight.infrautils.diagstatus-api - 1.3.0.SNAPSHOT | JMX Connector Server started for url service:jmx:rmi:///jndi/rmi://127.0.0.1:6886/server
2018-01-22T11:01:27,572 | INFO  | paxweb-extender-1-thread-2 | ContextHandler                   | 163 - org.eclipse.jetty.util - 9.3.14.v20161028 | Started HttpServiceContext{httpContext=WebAppHttpContext{org.opendaylight.aaa.shiro - 188}}
2018-01-22T11:01:27,575 | INFO  | paxweb-extender-1-thread-3 | ContextHandler                   | 163 - org.eclipse.jetty.util - 9.3.14.v20161028 | Started HttpServiceContext{httpContext=WebAppHttpContext{org.opendaylight.netconf.restconf-nb-bierman02 - 328}}
2018-01-22T11:01:27,723 | INFO  | paxweb-extender-1-thread-1 | ContextHandler                   | 163 - org.eclipse.jetty.util - 9.3.14.v20161028 | Started HttpServiceContext{httpContext=WebAppHttpContext{org.opendaylight.neutron.northbound-api - 365}}
2018-01-22T11:01:27,724 | INFO  | Thread-94        | TcpHandler                       | 388 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.0.SNAPSHOT | Switch listener started and ready to accept incoming tcp/tls connections on port: 6653
2018-01-22T11:01:27,735 | INFO  | Thread-95        | OpenFlowPluginProviderImpl       | 380 - org.opendaylight.openflowplugin.impl - 0.6.0.SNAPSHOT | All switchConnectionProviders are up and running (2).
2018-01-22T11:01:27,736 | INFO  | Thread-95        | TcpHandler                       | 388 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.0.SNAPSHOT | Switch listener started and ready to accept incoming tcp/tls connections on port: 6633
2018-01-22T11:01:30,091 | INFO  | Thread-91        | StateManager                     | 358 - org.opendaylight.netvirt.statemanager-impl - 0.6.0.SNAPSHOT | StateManager all is ready
2018-01-22T11:03:08,300 | INFO  | metrics-file-reporter-1-thread-1 | MetricsFileReporter              | 276 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Directory does not exist, creating it: metrics
2018-01-22T11:03:08,303 | INFO  | metrics-file-reporter-1-thread-1 | MetricsFileReporter              | 276 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | File does not exist, creating it: metrics.2.11
2018-01-22T11:05:55,249 | WARN  | BlueprintContainerRestartService | BlueprintContainerRestartServiceImpl | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | Failed to restart all blueprint containers within 5 minutes. Attempted to restart 1 [org.opendaylight.controller.netty-threadgroup-config_0.8.0.SNAPSHOT [209]] but only 0 completed restart
2018-01-22T11:05:55,251 | INFO  | BlueprintContainerRestartService | BlueprintContainerRestartServiceImpl | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | Restarting blueprint containers for bundle org.opendaylight.controller.netty-threadgroup-config_0.8.0.SNAPSHOT [209] and its dependent bundles []
2018-01-22T11:05:55,259 | INFO  | Blueprint-Container-ThreadPool | BlueprintExtender                | 79 - org.apache.aries.blueprint.core - 1.8.3 | Destroying BlueprintContainer for bundle org.opendaylight.controller.netty-threadgroup-config/0.8.0.SNAPSHOT
2018-01-22T11:05:55,265 | INFO  | Blueprint-Container-ThreadPool | BlueprintContainerRestartServiceImpl | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | Restarting blueprint container for bundle org.opendaylight.controller.netty-threadgroup-config_0.8.0.SNAPSHOT [209] with paths [bundleentry://209.fwk576936864/org/opendaylight/blueprint/netty-threadgroup.xml]
2018-01-22T11:05:55,278 | ERROR | CM Configuration Updater (ManagedService Update: pid=[org.opendaylight.netty.threadgroup]) | configadmin                      | 5 - org.apache.felix.configadmin - 1.8.16 | [org.osgi.service.cm.ManagedService, id=576, bundle=209/mvn:org.opendaylight.controller/netty-threadgroup-config/0.8.0-SNAPSHOT]: Unexpected problem updating configuration org.opendaylight.netty.threadgroup
org.osgi.service.blueprint.container.ServiceUnavailableException: The Blueprint container is being or has been destroyed: (objectClass=org.opendaylight.controller.blueprint.BlueprintContainerRestartService)

so if I read the above correctly, we have infrautils.ready confirming that we have 461 active bundles after an elapsed time of only 17s (not bad at all), so all good. Then some background threads start 3x Jetty web contexts, openflowplugin is happy, netvirt.statemanager-impl cheers, and then... controller.blueprint decides to restart controller.netty-threadgroup-config?? That is probably the root cause of all the havoc we see afterwards.



 Comments   
Comment by Tom Pantelis [ 22/Jan/18 ]

controller.netty-threadgroup-config and org.opendaylight.controller.netty-timer-config both restart but they're not the problem as they have no dependent bundles. The problem is:

 

2018-01-22T11:15:55,263 | INFO  | BlueprintContainerRestartService | BlueprintContainerRestartServiceImpl | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | Restarting blueprint containers for bundle org.opendaylight.controller.sal-broker-impl

which attempts to restart the world. 

The restarts are due to the ComponentProcessor thinking the cfg file has been updated. It has a ManagedService implementation:

    public void updated(final Dictionary<String, ?> properties) {
        LOG.debug("{}: ManagedService updated for persistentId {}, properties: {}, initialUpdate: {}",
                logName(), persistentId, properties, initialUpdate);

        // The first update occurs when the service is registered so ignore it as we want subsequent
        // updates when it changes. The ConfigAdmin will send an update even if the cfg file doesn't
        // yet exist.
        if (initialUpdate) {
            initialUpdate = false;
        } else {
            blueprintContainerRestartService.restartContainerAndDependents(bundle);
        }
    }

 
I wonder if we're getting extraneous updates causing it to think there's an actual update. Enabling debug for org.opendaylight.controller.blueprint.ext.ComponentProcessor would help.

Comment by Tom Pantelis [ 22/Jan/18 ]

The full stack trace for the exception is

2018-01-22T11:05:55,278 | ERROR | CM Configuration Updater (ManagedService Update: pid=[org.opendaylight.netty.threadgroup]) | configadmin                      | 5 - org.apache.felix.configadmin - 1.8.16 | [org.osgi.service.cm.ManagedService, id=576, bundle=209/mvn:org.opendaylight.controller/netty-threadgroup-config/0.8.0-SNAPSHOT]: Unexpected problem updating configuration org.opendaylight.netty.threadgroup
org.osgi.service.blueprint.container.ServiceUnavailableException: The Blueprint container is being or has been destroyed: (objectClass=org.opendaylight.controller.blueprint.BlueprintContainerRestartService)
	at org.apache.aries.blueprint.container.ReferenceRecipe.getService(ReferenceRecipe.java:241) [79:org.apache.aries.blueprint.core:1.8.3]
	at org.apache.aries.blueprint.container.ReferenceRecipe.access$000(ReferenceRecipe.java:56) [79:org.apache.aries.blueprint.core:1.8.3]
	at org.apache.aries.blueprint.container.ReferenceRecipe$ServiceDispatcher.call(ReferenceRecipe.java:306) [79:org.apache.aries.blueprint.core:1.8.3]
	at Proxyf38dd91d_6455_476d_85a8_204bc3b78dae.restartContainerAndDependents(Unknown Source) [?:?]
	at org.opendaylight.controller.blueprint.ext.ComponentProcessor$1.updated(ComponentProcessor.java:149) [192:org.opendaylight.controller.blueprint:0.8.0.SNAPSHOT]
	at org.apache.felix.cm.impl.helper.ManagedServiceTracker.updated(ManagedServiceTracker.java:189) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.helper.ManagedServiceTracker.updateService(ManagedServiceTracker.java:152) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.helper.ManagedServiceTracker.provideConfiguration(ManagedServiceTracker.java:85) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.ConfigurationManager$ManagedServiceUpdate.provide(ConfigurationManager.java:1479) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.ConfigurationManager$ManagedServiceUpdate.run(ConfigurationManager.java:1435) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.UpdateThread.run0(UpdateThread.java:141) [5:org.apache.felix.configadmin:1.8.16]
	at org.apache.felix.cm.impl.UpdateThread.run(UpdateThread.java:109) [5:org.apache.felix.configadmin:1.8.16]
	at java.lang.Thread.run(Thread.java:748) [?:?]

So the ConfigAdmin is notifying, for some reason, after/while the BP container is being destroyed. It seems some behavior in the ConfigAdmin has changed.

Comment by Michael Vorburger [ 22/Jan/18 ]

This IS actually reliably reproducible locally, and fairly easily...

> Enabling debug for org.opendaylight.controller.blueprint.ext.ComponentProcessor would help.

I've attached a karaf.log obtained like this (exact steps recorded here mostly so this can be repeated in the future, if ever needed):

cd integration/distribution/karaf
mvn clean package
cd target/assembly

nano etc/org.ops4j.pax.logging.cfg
log4j2.logger.org_opendaylight_controller_blueprint_ext_componentprocessor.level = DEBUG
log4j2.logger.org_opendaylight_controller_blueprint_ext_componentprocessor.name = org.opendaylight.controller.blueprint.ext.ComponentProcessor
log4j2.logger.org_opendaylight_controller_blueprint_ext_datastoreappconfigmetadata.level = DEBUG
log4j2.logger.org_opendaylight_controller_blueprint_ext_datastoreappconfigmetadata.name = org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata
log4j2.logger.org_opendaylight_controller_blueprint_blueprintcontainerrestartserviceimpl.level = DEBUG
log4j2.logger.org_opendaylight_controller_blueprint_blueprintcontainerrestartserviceimpl.name = org.opendaylight.controller.blueprint.BlueprintContainerRestartServiceImpl

nano etc/org.apache.karaf.features.cfg
add "odl-infrautils-ready odl-netvirt-openstack" before that UUID in featuresBoot = ...

bin/karaf

as per tpantelis' suggested workaround on https://lists.opendaylight.org/pipermail/odlparent-dev/2018-January/001592.html (Thank You!), I'll now try setting "odl:restart-dependents-on-updates=false" instead of the current true in sal-broker-impl to see if that helps...
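For reference, that's a one-attribute change on the blueprint root element. A minimal, trimmed sketch of what it looks like (the real sal-broker-impl blueprint XML obviously has much more content):

    <blueprint xmlns="http://www.osgi.org/xmlns/blueprint/v1.0.0"
               xmlns:odl="http://opendaylight.org/xmlns/blueprint/v1.0.0"
               odl:restart-dependents-on-updates="false">
        <!-- beans and services as before; only the attribute above changes from true to false -->
    </blueprint>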

Comment by Michael Vorburger [ 22/Jan/18 ]

>  try to see if setting "odl:restart-dependents-on-updates=false" instead of current true in sal-broker-impl helps...

https://git.opendaylight.org/gerrit/#/c/67429/ does this (in a few more places than sal-broker-impl) and I can confirm this solves the problem above.

Whether that is the solution we want to adopt is something I think we should decide together... finding the real fix would be even better, of course. Alternatively, tpantelis suggested we could just comment out the reload in code instead (sketched below), which has the advantage of affecting and helping all projects without having to change individual BP XML files, but which I fear could be VERY confusing in the future if we don't find the real fix, so personally I'm less of a fan.
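To make that alternative concrete, here is a hypothetical sketch of what commenting out the reload in the ComponentProcessor's updated() method (quoted in the first comment above) would look like; this is NOT a pushed change:

    if (initialUpdate) {
        initialUpdate = false;
    } else {
        // Restart disabled until the duplicate ConfigAdmin notifications are understood:
        // blueprintContainerRestartService.restartContainerAndDependents(bundle);
        LOG.debug("{}: ignoring config update for persistentId {}", logName(), persistentId);
    }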

NB: Beyond c/67429, I'm seeing more occurrences of odl:restart-dependents-on-updates which may have to be changed from true to false, in:

./lispflowmapping/mappingservice/southbound/src/main/resources/org/opendaylight/blueprint/mappingservice-southbound.xml: odl:restart-dependents-on-updates="true"
./netconf/netconf/mdsal-netconf-impl/src/main/resources/org/opendaylight/blueprint/mdsal-netconf-impl.xml: odl:restart-dependents-on-updates="true"
./netconf/netconf/mdsal-netconf-ssh/src/main/resources/org/opendaylight/blueprint/netconf-ssh.xml: odl:restart-dependents-on-updates="true">
./netconf/netconf/netconf-config/src/main/resources/org/opendaylight/blueprint/netconf-config.xml: odl:restart-dependents-on-updates="true">
./netconf/netconf/mdsal-netconf-notification/src/main/resources/org/opendaylight/blueprint/mdsal-netconf-notification.xml: odl:restart-dependents-on-updates="true">
./netconf/netconf/mdsal-netconf-tcp/src/main/resources/org/opendaylight/blueprint/netconf-tcp.xml: odl:restart-dependents-on-updates="true">
./controller/opendaylight/md-sal/samples/toaster-provider/src/main/resources/org/opendaylight/blueprint/toaster-provider.xml: odl:restart-dependents-on-updates="true" odl:use-default-for-reference-types="true">
./tsdr/datapurge/src/main/resources/org/opendaylight/blueprint/tsdr-datapurge.xml: odl:use-default-for-reference-types="true" odl:restart-dependents-on-updates="true">
./tsdr/collectors/syslog-collector/src/main/resources/org/opendaylight/blueprint/syslog-collector.xml: odl:use-default-for-reference-types="true" odl:restart-dependents-on-updates="true">

Comment by Tom Pantelis [ 22/Jan/18 ]

As I suspected, for some reason ConfigAdmin sends an extraneous duplicate update:

2018-01-22T16:18:05,630 | DEBUG | CM Configuration Updater (ManagedService Update: pid=[org.ops4j.pax.url.mvn]) | ComponentProcessor | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | org.opendaylight.controller.netty-threadgroup-config: ManagedService updated for persistentId org.opendaylight.netty.threadgroup, properties: null, initialUpdate: true
2018-01-22T16:18:05,631 | DEBUG | CM Configuration Updater (ManagedService Update: pid=[org.opendaylight.netty.threadgroup]) | ComponentProcessor | 192 - org.opendaylight.controller.blueprint - 0.8.0.SNAPSHOT | org.opendaylight.controller.netty-threadgroup-config: ManagedService updated for persistentId org.opendaylight.netty.threadgroup, properties: null, initialUpdate: false

Oddly, this doesn't reproduce when running the controller karaf distro with the toaster installed. Looks like we'll have to de-duplicate, but we can disable restarts for now.

Comment by Tom Pantelis [ 22/Jan/18 ]

I pushed https://git.opendaylight.org/gerrit/#/c/67448/ to store the last update and only restart if it changed, although, in my testing, the ConfigAdmin does not notify unless a change actually occurred. It's a mystery why the duplicate updates occur when running integration/CSIT, as I don't see the strange behavior when running the controller karaf distro. There may be something else insidious going on - it behaves like there are two ConfigAdmin instances running in karaf.
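The gist of that de-duplication approach, as a minimal sketch (illustrative only, not the actual c/67448 diff; previousProperties and copyOf() are hypothetical names, the other members are the existing ones from the updated() snippet quoted earlier):

    import java.util.Dictionary;
    import java.util.Enumeration;
    import java.util.Hashtable;

    private Hashtable<String, Object> previousProperties;

    public void updated(final Dictionary<String, ?> properties) {
        // Defensive copy so equality uses Hashtable/Map semantics rather than
        // whatever Dictionary implementation the ConfigAdmin hands us.
        final Hashtable<String, Object> current = copyOf(properties);
        if (initialUpdate) {
            // The first callback happens on registration - remember it but don't restart.
            initialUpdate = false;
            previousProperties = current;
            return;
        }

        if (current.equals(previousProperties)) {
            // Extraneous duplicate notification from the ConfigAdmin - ignore it.
            LOG.debug("{}: ignoring duplicate update for persistentId {}", logName(), persistentId);
            return;
        }

        previousProperties = current;
        blueprintContainerRestartService.restartContainerAndDependents(bundle);
    }

    private static Hashtable<String, Object> copyOf(final Dictionary<String, ?> dict) {
        final Hashtable<String, Object> copy = new Hashtable<>();
        if (dict != null) {
            for (final Enumeration<String> keys = dict.keys(); keys.hasMoreElements();) {
                final String key = keys.nextElement();
                copy.put(key, dict.get(key));
            }
        }
        return copy;
    }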

Comment by Faseela K [ 23/Jan/18 ]

Tom,

  Could you please try running the netvirt distro? For me the genius distro used to work fine, but whenever I built the netvirt distro and did feature:install odl-netvirt-openstack, several issues started popping up.

 

Thanks,

Faseela

Comment by Michael Vorburger [ 24/Jan/18 ]

https://git.opendaylight.org/gerrit/#/c/67444/
