temporary link for sandbox(run 1 and 2): https://jenkins.opendaylight.org/sandbox/job/tomas-bgpcep-csit-1node-periodic-bgp-ingest-all-oxygen/
First Errors from sandbox test
2018-02-20T10:57:38,373 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Oh nose - there are 2 deadlocked threads!! :-( 2018-02-20T10:57:38,377 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-92 locked on org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl@43eaef86 (owned by epollEventLoopGroup-10-7): at org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl.getPeerGroup(ExportPolicyPeerTrackerImpl.java:114) at org.opendaylight.protocol.bgp.mode.spi.AbstractRouteEntry.getRoutePeerIdRole(AbstractRouteEntry.java:96) at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.lambda$fillAdjRibsOut$0(BaseAbstractRouteEntry.java:187) at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry$$Lambda$1418/229233901.accept(Unknown Source) at org.opendaylight.protocol.bgp.rib.impl.PeerExportGroupImpl.forEach(PeerExportGroupImpl.java:48) at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.fillAdjRibsOut(BaseAbstractRouteEntry.java:186) at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.addPathToDataStore(BaseAbstractRouteEntry.java:161) at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.updateRoute(BaseAbstractRouteEntry.java:111) at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.walkThrough(LocRibWriter.java:276) at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.onDataTreeChanged(LocRibWriter.java:179) at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67) at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41) at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:166) at akka.actor.Actor.aroundReceive(Actor.scala:514) at akka.actor.Actor.aroundReceive$(Actor.scala:512) at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:96) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527) at akka.actor.ActorCell.invoke(ActorCell.scala:496) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) at akka.dispatch.Mailbox.run(Mailbox.scala:224) at akka.dispatch.Mailbox.exec(Mailbox.scala:234) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Locally I reproduced this deadlock on nitrogen with installed features:
odl-restconf odl-bgpcep-bgp
I added scripts for quick config.
They should be run from integration/test project in directory tools/fastbgp
First "configureall.py" is for setup bgp to accept connection from our 10 peers (127.0.0.2-127.0.0.11)
Second "manypeersibgp.sh" is to start play.py itself
Almost immediately after start there is deadlock occuring, from which odl doesn't recover even after killing the script.
I am adding deadlock.log from yourkit profiler, and communication from wireshark. There's some problem with open message communication between the script and odl.
I also added countroutes.sh which is restconf script to get the number of routes which went through. pretty easy to spot when deadlock occured with this.