[CONTROLLER-1867] Cluster: Member fail to start with odl-netconf-clustered-topology feature Created: 17/Oct/18  Updated: 12/Jan/19  Resolved: 07/Nov/18

Status: Resolved
Project: controller
Component/s: blueprint
Affects Version/s: Fluorine, Oxygen SR3
Fix Version/s: Neon, Fluorine SR1, Oxygen SR4

Type: Bug Priority: High
Reporter: Luis Gomez Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by DAEXIM-11 CSIT blocker: daexim-csit-3node-clust... Resolved
Relates
relates to NETCONF-591 Netconf connection failing in master ... Resolved
relates to ODLPARENT-172 Provide secure Java XML stack configu... Confirmed

 Description   

The issue only happens in oxygen and was hidden because current CSIT test installs all compatible features (all test), however if we run same test with just odl-netconf-clustered-topology feature (only test) we see that a cluster member that has been killed cannot be restarted:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-3node-clustering-all-oxygen/421/robot-plugin/log.html.gz

In the karaf log of the restarting member (1):

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-3node-clustering-all-oxygen/421/odl_1/odl1_karaf.log.gz

we see this ERROR:

2018-10-17T19:33:22,521 | INFO | config-pusher | ConfigPusherImpl | 204 - org.opendaylight.controller.config-persister-impl - 0.8.4.SNAPSHOT | Pushing configuration snapshot 06-clustered-entity-ownership.xml(odl-netconf-connector,odl-netconf-connector)
2018-10-17T19:33:22,627 | ERROR | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ActorSystemImpl | 50 - com.typesafe.akka.slf4j - 2.5.11 | Uncaught error from thread [opendaylight-cluster-data-akka.actor.default-dispatcher-4]: org/w3c/dom/ElementTraversal, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
at java.lang.ClassLoader.defineClass(ClassLoader.java:763) ~[?:?]
at org.eclipse.osgi.internal.loader.ModuleClassLoader.defineClass(ModuleClassLoader.java:276) ~[?:?]
at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.defineClass(ClasspathManager.java:655) ~[?:?]
at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findClassImpl(ClasspathManager.java:578) ~[?:?]
at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClassImpl(ClasspathManager.java:538) ~[?:?]
at org.eclipse.osgi.internal.loader.classpath.ClasspathManager.findLocalClass(ClasspathManager.java:525) ~[?:?]
at org.eclipse.osgi.internal.loader.ModuleClassLoader.findLocalClass(ModuleClassLoader.java:328) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findLocalClass(BundleLoader.java:368) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:446) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:395) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:387) ~[?:?]
at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:150) ~[?:?]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:?]
at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source) ~[378:wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar:0.0.0]
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121) ~[?:?]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigDefaultXMLReader.parsePossibleDefaultAppConfigXMLFile(DataStoreAppConfigDefaultXMLReader.java:174) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigDefaultXMLReader.createDefaultInstance(DataStoreAppConfigDefaultXMLReader.java:126) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata.createDefaultInstance(DataStoreAppConfigMetadata.java:262) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata.setInitialAppConfig(DataStoreAppConfigMetadata.java:229) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata.access$200(DataStoreAppConfigMetadata.java:56) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata$1.onSuccess(DataStoreAppConfigMetadata.java:172) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at org.opendaylight.controller.blueprint.ext.DataStoreAppConfigMetadata$1.onSuccess(DataStoreAppConfigMetadata.java:165) ~[194:org.opendaylight.controller.blueprint:0.8.4.SNAPSHOT]
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1393) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:691) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.setResult(AbstractTransformFuture.java:237) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:159) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:398) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1015) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:868) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.AbstractFuture.set(AbstractFuture.java:691) ~[38:com.google.guava:23.6.1.jre]
at com.google.common.util.concurrent.SettableFuture.set(SettableFuture.java:48) ~[38:com.google.guava:23.6.1.jre]
at org.opendaylight.controller.cluster.datastore.messages.ReadData.processResponse(ReadData.java:40) ~[226:org.opendaylight.controller.sal-distributed-datastore:1.7.4.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.RemoteTransactionContext$2.onComplete(RemoteTransactionContext.java:250) ~[226:org.opendaylight.controller.sal-distributed-datastore:1.7.4.SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:261) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.OnComplete.internal(Future.scala:258) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185) ~[44:com.typesafe.akka.actor:2.5.11]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) ~[353:org.scala-lang.scala-library:2.12.6.v20180426-183816-VFINAL-d7ae7a8]
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) ~[44:com.typesafe.akka.actor:2.5.11]
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) ~[353:org.scala-lang.scala-library:2.12.6.v20180426-183816-VFINAL-d7ae7a8]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) ~[353:org.scala-lang.scala-library:2.12.6.v20180426-183816-VFINAL-d7ae7a8]
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43) ~[44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [44:com.typesafe.akka.actor:2.5.11]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [44:com.typesafe.akka.actor:2.5.11]
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal cannot be found by wrap_file__tmp_karaf-0.8.4-SNAPSHOT_system_xerces_xercesImpl_2.11.0_xercesImpl-2.11.0.jar_0.0.0
at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:433) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:395) ~[?:?]
at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:387) ~[?:?]
at org.eclipse.osgi.internal.loader.ModuleClassLoader.loadClass(ModuleClassLoader.java:150) ~[?:?]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:?]
... 61 more{comment}


 Comments   
Comment by Robert Varga [ 23/Oct/18 ]

Ugh, this looks very weird. We are executing on akka's dispatch pool, from where we switch to xerces, which attempts to load org.w3c.dom, which should be available from the JRE, I think. This feels like a TCCL issue w.r.t. blueprint.

Comment by Robert Varga [ 23/Oct/18 ]

tpantelis any ideas what is going on here?

Comment by Tom Pantelis [ 23/Oct/18 ]

Not sure off the top of my head. I'd think org.w3c.dom would be exported from the JRE via the root class loader (I do see it listed in jre.properties in my local karaf install). Is this happening only in Oxygen SR3 and not <= Oxygen SR2?

Comment by Richard Kosegi [ 24/Oct/18 ]

We just seen similar problem in fluorine : https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/daexim-csit-3node-clustering-basic-only-fluorine/249/odl_3/odl3_karaf.log.gz

 

Comment by Robert Varga [ 24/Oct/18 ]

This smells quite a bit. rkosegi ecelgp can we reproduce this with

 

org.eclipse.osgi/debug/loader = true

in etc/equinox-debug.properties ?

 

Comment by Robert Varga [ 24/Oct/18 ]

Looking at the xerces bundle, it should be able to resolve org.w3c.dom, as it has a optional import for that package, which seems to be resolved (from root bundle) in my local environment. We need that debug it identify which ModuleClassLoader is being used to resolve the class.

Comment by Robert Varga [ 24/Oct/18 ]

Well, it seems it's the correct loader, trying to resolve an imported class:

        // 3) search the imported packages
        PackageSource source = findImportedSource(pkgName, null);
        if (source != null) {
            if (debug.DEBUG_LOADER) {
                Debug.println("BundleLoader[" + this + "] loading from import package: " + source); //$NON-NLS-1$ //$NON-NLS-2$
            }
            // 3) found import source terminate search at the source
            result = source.loadClass(name);
            if (result != null)
                return result;
            throw new ClassNotFoundException(name + " cannot be found by " + this); //$NON-NLS-1$
        }

We need to know what "source" is.

Comment by Robert Varga [ 24/Oct/18 ]

My working theory is that this has something to do with the order in which the bundles are deployed and activated. xercesImpl provides a DocumentBuilderFactory instance, via ServiceLoader, which uses TCCL – hence if xercesImpl.jar is visible via TCCL when UntrustedXML.class is loaded, it may pick it up as the DocumentBuilder.

There are multiple ways to fix this:

  • odlparent should provide a safely-configured XML stack via system properties, which would force a concrete implementation to be used
  • UntrustedXML could force TCCL to be the same as its class loader, hence potentially removing xercesImpl from its view
  • blueprint could ping UntrustedXML from its Activator, which presumably does not see xercesImpl either
  • netconf-netty-util could somehow defang xercesImpl, as it is pulled in by exificient for just a few (DBF-unrelated) classes. One way would be to completely shade/relocate exificient within netconf-netty-util.
  • exificient should shade pars of xercesImpl it needs and use them internally

 

Comment by Robert Varga [ 24/Oct/18 ]

exificient issue: https://github.com/EXIficient/exificient-grammars/issues/11

Comment by Luis Gomez [ 25/Oct/18 ]

Strangely I do not see the issue in stable/fluorine branch, maybe it is not consistent there. For oxygen is very consistent so would it be possible to cherry-pick https://git.opendaylight.org/gerrit/#/c/77266/ to stable/oxygen?

Comment by Robert Varga [ 25/Oct/18 ]

done

Comment by Robert Varga [ 25/Oct/18 ]

btw. org/w3c/dom/ElementTraversal is present in xml-apis-1.4.01, but not in Java 8. This is fixes with Java 9+, which includes ElementTraversal.

Comment by Luis Gomez [ 25/Oct/18 ]

FYI this patch: https://git.opendaylight.org/gerrit/#/c/77274/ works in Oxygen.

BTW, it does fix the problem in question but it may also bring or expose some other issues as this is also seen in the log of a failed test using the patch:

2018-10-26T00:31:40,438 | ERROR | Blueprint Extender: 2 | AbstractDataStore | 226 - org.opendaylight.controller.sal-distributed-datastore - 1.7.4.SNAPSHOT | Shard leaders failed to settle in 90 seconds, giving up

Test:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-3node-clustering-all-oxygen/434/robot-plugin/log.html.gz#s1-s4

Karaf log:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-3node-clustering-all-oxygen/434/odl_1/odl1_karaf.log.gz

In any case I keep my +1 as the original problem was consistent and the new one seems like more sporadic.

Comment by Robert Varga [ 07/Nov/18 ]

All active release streams have this fixed. The NETCONF part of the fix for Neon needs some more work, but will complete for Neon.

Generated at Wed Feb 07 19:56:39 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.