bgpcep / BGPCEP-760

Deadlock in manypeers_changecount test


    • Type: Bug
    • Resolution: Done
    • Priority: Medium
    • Nitrogen, Carbon, Oxygen
    • Nitrogen, Carbon, Oxygen
    • BGP
    • None

      Temporary link to the sandbox job (runs 1 and 2): https://jenkins.opendaylight.org/sandbox/job/tomas-bgpcep-csit-1node-periodic-bgp-ingest-all-oxygen/

      First errors from the sandbox test:

      2018-02-20T10:57:38,373 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Oh nose - there are 2 deadlocked threads!! :-(
      2018-02-20T10:57:38,377 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-92 locked on org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl@43eaef86 (owned by epollEventLoopGroup-10-7):
      	 at org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl.getPeerGroup(ExportPolicyPeerTrackerImpl.java:114)
      	 at org.opendaylight.protocol.bgp.mode.spi.AbstractRouteEntry.getRoutePeerIdRole(AbstractRouteEntry.java:96)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.lambda$fillAdjRibsOut$0(BaseAbstractRouteEntry.java:187)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry$$Lambda$1418/229233901.accept(Unknown Source)
      	 at org.opendaylight.protocol.bgp.rib.impl.PeerExportGroupImpl.forEach(PeerExportGroupImpl.java:48)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.fillAdjRibsOut(BaseAbstractRouteEntry.java:186)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.addPathToDataStore(BaseAbstractRouteEntry.java:161)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.updateRoute(BaseAbstractRouteEntry.java:111)
      	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.walkThrough(LocRibWriter.java:276)
      	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.onDataTreeChanged(LocRibWriter.java:179)
      	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
      	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
      	 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
      	 at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:166)
      	 at akka.actor.Actor.aroundReceive(Actor.scala:514)
      	 at akka.actor.Actor.aroundReceive$(Actor.scala:512)
      	 at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:96)
      	 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
      	 at akka.actor.ActorCell.invoke(ActorCell.scala:496)
      	 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
      	 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
      	 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
      	 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
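
When triaging logs like the one above, it can help to pull out which thread is blocked, on which lock, and which thread owns that lock. The following is a minimal sketch that assumes the ThreadsWatcher message format shown in this log; the function names are illustrative and are not part of infrautils or bgpcep.

```python
import re

# Pattern for the ThreadsWatcher message format seen in the log above
# (assumed stable): "Deadlocked thread stack trace: <thread> locked on
# <lock> (owned by <owner>):"
DEADLOCK_RE = re.compile(
    r"Deadlocked thread stack trace: (?P<thread>\S+) "
    r"locked on (?P<lock>\S+) \(owned by (?P<owner>[^)]+)\)"
)


def parse_deadlock_line(line):
    """Return (blocked_thread, lock, owner_thread), or None if no match."""
    match = DEADLOCK_RE.search(line)
    if match is None:
        return None
    return match.group("thread"), match.group("lock"), match.group("owner")


def wait_for_edges(lines):
    """Map each blocked thread to the thread owning the lock it waits on."""
    edges = {}
    for line in lines:
        parsed = parse_deadlock_line(line)
        if parsed is not None:
            thread, _lock, owner = parsed
            edges[thread] = owner
    return edges
```

Feeding every log line through wait_for_edges gives one entry per reported deadlocked thread; a cycle in the resulting mapping corresponds to the deadlock that ThreadsWatcher reports.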
      

      I reproduced this deadlock locally on Nitrogen with the following features installed:

      odl-restconf
      odl-bgpcep-bgp
      

      I added scripts for quick configuration.
      They should be run from the tools/fastbgp directory of the integration/test project.
      The first, "configureall.py", configures BGP to accept connections from our 10 peers (127.0.0.2-127.0.0.11).
      The second, "manypeersibgp.sh", starts play.py itself.
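
For reference, the peer address range mentioned above can be enumerated as follows. This is only an illustrative sketch, not the contents of the attached configureall.py; the helper name peer_addresses is hypothetical.

```python
import ipaddress


def peer_addresses(first="127.0.0.2", count=10):
    """Return `count` consecutive IPv4 addresses starting at `first`."""
    start = ipaddress.IPv4Address(first)
    # IPv4Address supports integer addition, so each offset yields the
    # next address in sequence.
    return [str(start + offset) for offset in range(count)]
```

With the defaults, the list runs from 127.0.0.2 through 127.0.0.11, one address per configured peer.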

      The deadlock occurs almost immediately after the start, and ODL does not recover from it even after the script is killed.

      I am attaching deadlock.log from the YourKit profiler and a Wireshark capture of the communication. There appears to be a problem with the OPEN message exchange between the script and ODL.

      I also added countroutes.sh, a RESTCONF script that returns the number of routes that went through. With it, it is easy to spot when the deadlock occurred.
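
One way to spot the stall from a series of sampled route counts is sketched below. It assumes the counts were collected at regular intervals (for example by running countroutes.sh repeatedly); the function name and the patience threshold are hypothetical and not part of the attached scripts.

```python
def first_stall(counts, patience=3):
    """Return the index of the first sample of a plateau, or None.

    A plateau is `patience` consecutive samples that equal the sample
    before them, i.e. the route count has stopped changing.
    """
    unchanged = 0
    for i in range(1, len(counts)):
        if counts[i] == counts[i - 1]:
            unchanged += 1
            if unchanged >= patience:
                # Index of the sample where the plateau began.
                return i - patience
        else:
            unchanged = 0
    return None
```

A plateau alone does not prove a deadlock, since a test that has finished advertising routes also stops changing; it only marks where to look in the log.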

        1. manypeersibgp.sh
          0.2 kB
        2. manypeers_changecount.pcapng
          2.29 MB
        3. deadlock.log
          10 kB
        4. countroutes.sh
          0.2 kB
        5. configureall.py
          2 kB

            cdgasparini Claudio David Gasparini
            tomas.markovic Tomas Markovic