[CONTROLLER-1224] An attempt to push 5k paths to ODL SR3 through BGP exposed a bug which causes the odl-restconf-noauth to hang on install approx 3% of the time Created: 24/Mar/15  Updated: 23/Jul/15  Resolved: 23/Jul/15

Status: Resolved
Project: controller
Component/s: restconf
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: RichardHill Assignee: Jozef Behran
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2891

 Description   

Jozef Behran created a test job that repeatedly runs a small load test of Helium SR3 while attempting to push 5k paths to ODL through BGP.

It exposes a bug that occasionally causes odl-restconf-noauth to hang on install.

He captured a YourKit performance snapshot at the moment the test failed due to the hang, so it is now possible to see exactly what is blocked and where.

I'll add it as an attachment.



 Comments   
Comment by RichardHill [ 24/Mar/15 ]

https://drive.google.com/file/d/0ByXiyf4iY7RYbXZWdlVYdHV6clE/view?usp=sharing

Comment by RichardHill [ 24/Mar/15 ]

Needs the YourKit profiler to view

https://drive.google.com/file/d/0ByXiyf4iY7RYTjVQMFpNV2ExZmc/view?usp=sharing

https://drive.google.com/file/d/0ByXiyf4iY7RYX2Y0c3hVdTF6NFU/view?usp=sharing

Comment by RichardHill [ 24/Mar/15 ]

This test runs a performance test of Helium SR3 in the following scenario:

Unclustered configuration
Topology and RIB are both updated
IMDS used
5k paths

Comment by RichardHill [ 25/Mar/15 ]

Investigation of the profiling snapshot revealed that the suspicious method is
OsgiRegistry.setOSGiServiceFinderIteratorProvider in the file "OsgiRegistry.java"

Comment by Jozef Behran [ 25/Mar/15 ]

My suspicion is that odl-restconf-noauth is doing something wrong with the OSGi registry (or with Karaf's InstallFeature functionality), and when the constellation of bytes in the process happens to be just right, it hangs the whole thing.

Comment by Jozef Behran [ 25/Mar/15 ]

I found the following:

  • There is a thread named pool-X-thread-Y (X and Y are numbers that differ from run to run) which is "runnable" and whose stack trace looks like this:

com.sun.jersey.core.osgi.OsgiRegistry.setOSGiServiceFinderIteratorProvider()
at OsgiRegistry.java:422
org.apache.karaf.features.internal.FeaturesServiceImpl.installFeature(String, String, EnumSet)
at FeaturesServiceImpl.java:362
Proxy<some_horrible_number_in_GUID_format>.installFeature(String, String, EnumSet)
at ??? (maybe somehow autogenerated ???)
java.lang.Thread.run()

To me this thread seems to be the one trying to install odl-restconf-noauth.

  • There is another thread named pool-X-thread-Y (X and Y are numbers that differ from run to run and from those of the first thread) which is blocked here:

com.sun.jersey.core.osgi.OsgiRegistry.getInstance()
at OsgiRegistry.java:113
org.eclipse.jetty.server.handler.ContextHandler.doStart()
at ContextHandler.java:717
org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doStart()
at HttpServiceContext.java:222
org.eclipse.jetty.util.component.AbstractLifeCycle.start()
at AbstractLifeCycle.java:64
org.ops4j.pax.web.service.jetty.internal.JettyServerImpl$1.start()
at JettyServerImpl.java:197
org.ops4j.pax.web.service.internal.HttpServiceStarted.end(HttpContext)
at HttpServiceStarted.java:1032
org.ops4j.pax.web.service.internal.HttpServiceProxy.end(HttpContext)
at HttpServiceProxy.java:422
org.ops4j.pax.web.extender.war.internal.RegisterWebAppVisitorWC.end()
at RegisterWebAppVisitorWC.java:341
org.ops4j.pax.web.extender.war.internal.model.WebApp.accept(WebAppVisitor)
at WebApp.java:678
org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.register(WebAppDependencyHolder, HttpService)
at WebAppPublisher.java:237
org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(ServiceReference)
at WebAppPublisher.java:182
org.ops4j.pax.web.extender.war.internal.WebAppPublisher$WebAppDependencyListener.addingService(ServiceReference)
at WebAppPublisher.java:135
org.osgi.util.tracker.ServiceTracker.open()
at ServiceTracker.java:261
org.ops4j.pax.web.extender.war.internal.WebAppPublisher.publish(WebApp)
at WebAppPublisher.java:101
org.ops4j.pax.web.extender.war.internal.WebObserver.deploy(WebApp)
at WebObserver.java:213
org.ops4j.pax.web.extender.war.internal.WebObserver$1.doStart()
at WebObserver.java:175
org.ops4j.pax.web.extender.war.internal.extender.SimpleExtension.start()
at SimpleExtension.java:58
org.ops4j.pax.web.extender.war.internal.extender.AbstractExtender$1.run()
at AbstractExtender.java:266
java.lang.Thread.run()

I am not sure about the purpose of this thread or why it is still hanging around, but in every failure I see it hung at exactly this one spot. However, I was not able to find this thread's stack trace in a snapshot of a build where the test passed.
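
For anyone without YourKit, a similar picture of blocked threads can be obtained from the JVM itself. The following is a minimal sketch, not part of the test job: the class name BlockedThreadReport and the thread names in main are hypothetical, and the lock in main is only a stand-in for whatever monitor the real threads contend on. It uses the standard java.lang.management.ThreadMXBean API to list every BLOCKED thread together with the monitor it is waiting for and the thread that holds it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadReport {

    /** Returns one line per BLOCKED thread: its name, the monitor it waits for, and the owner. */
    public static List<String> describeBlocked() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<String>();
        // (false, false): skip locked-monitor details; lock name and owner are still reported.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() != Thread.State.BLOCKED) {
                continue;
            }
            StackTraceElement[] stack = info.getStackTrace();
            String top = stack.length > 0 ? stack[0].toString() : "<no frames>";
            lines.add(String.format("%s blocked on %s held by %s at %s",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName(), top));
        }
        return lines;
    }

    /** Demo only: one thread holds a lock while a second one blocks on it. */
    public static void main(String[] args) throws Exception {
        final Object standInLock = new Object();          // hypothetical stand-in monitor
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(new Runnable() {
            public void run() {
                synchronized (standInLock) {
                    held.countDown();
                    try { release.await(); } catch (InterruptedException ignored) { }
                }
            }
        }, "demo-holder");
        Thread waiter = new Thread(new Runnable() {
            public void run() {
                synchronized (standInLock) { /* only reached after the holder releases */ }
            }
        }, "demo-waiter");

        holder.start();
        held.await();                                      // holder now owns the lock
        waiter.start();
        while (waiter.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);                              // wait until the waiter is really blocked
        }
        for (String line : describeBlocked()) {
            System.out.println(line);
        }
        release.countDown();                               // let both threads finish
        holder.join();
        waiter.join();
    }
}
```

Run inside the affected JVM (or adapted to a JMX connection), the same report would show whether the thread stuck in OsgiRegistry.getInstance() is waiting on a monitor owned by the thread running installFeature. That relationship is only a guess based on the two traces above; the snapshot data quoted here does not confirm it.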

Comment by Vaclav Demcak [ 13/Jul/15 ]

Hi all.
I'd like to ask for all the steps needed to reproduce the described behavior. Please provide some test inputs too (e.g. the robot test path or file).
I'd also like to ask for the whole log file, because it could be a problem with loading required dependencies (as J. Behran's description suggests) that occurs at some earlier step. Did you try to reload odl-restconf-noauth while it was in the described unexpected state?

Comment by RichardHill [ 23/Jul/15 ]

Hi Vaclav

To clarify, you want the test rerun with full logging and the log provided?

Comment by Jozef Behran [ 23/Jul/15 ]

Testing for this bug requires running the test job for several days. Additionally, full logging is likely to make this type of heisenbug disappear. As Helium SR4 is the last release and its deadline is today, there is no time even to determine whether the bug is still there, let alone fix it.

Generated at Wed Feb 07 19:54:59 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.