  1. bgpcep
  2. BGPCEP-760

Deadlock in manypeers_changecount test


    • Type: Bug
    • Resolution: Done
    • Priority: Medium
    • Nitrogen, Carbon, Oxygen
    • Nitrogen, Carbon, Oxygen
    • BGP
    • None

      Temporary link to the sandbox job (runs 1 and 2): https://jenkins.opendaylight.org/sandbox/job/tomas-bgpcep-csit-1node-periodic-bgp-ingest-all-oxygen/

      First errors from the sandbox test:

      2018-02-20T10:57:38,373 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Oh nose - there are 2 deadlocked threads!! :-(
      2018-02-20T10:57:38,377 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-92 locked on org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl@43eaef86 (owned by epollEventLoopGroup-10-7):
      	 at org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl.getPeerGroup(ExportPolicyPeerTrackerImpl.java:114)
      	 at org.opendaylight.protocol.bgp.mode.spi.AbstractRouteEntry.getRoutePeerIdRole(AbstractRouteEntry.java:96)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.lambda$fillAdjRibsOut$0(BaseAbstractRouteEntry.java:187)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry$$Lambda$1418/229233901.accept(Unknown Source)
      	 at org.opendaylight.protocol.bgp.rib.impl.PeerExportGroupImpl.forEach(PeerExportGroupImpl.java:48)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.fillAdjRibsOut(BaseAbstractRouteEntry.java:186)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.addPathToDataStore(BaseAbstractRouteEntry.java:161)
      	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.updateRoute(BaseAbstractRouteEntry.java:111)
      	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.walkThrough(LocRibWriter.java:276)
      	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.onDataTreeChanged(LocRibWriter.java:179)
      	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
      	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
      	 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
      	 at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:166)
      	 at akka.actor.Actor.aroundReceive(Actor.scala:514)
      	 at akka.actor.Actor.aroundReceive$(Actor.scala:512)
      	 at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:96)
      	 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
      	 at akka.actor.ActorCell.invoke(ActorCell.scala:496)
      	 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
      	 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
      	 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
      	 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
      	 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
      	 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
      	 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
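
When a log contains several of these reports, it can help to pull out the blocked thread, the contended lock, and the owning thread from each header line. The following is a minimal sketch, assuming the header always has the shape seen in the single example above; the regular expression and the function name `parse_deadlock_header` are illustrative and not part of infrautils or bgpcep.

```python
import re

# Pattern written against the one "Deadlocked thread stack trace" line in this
# report. Assumption: thread and lock names never contain whitespace.
DEADLOCK_RE = re.compile(
    r"Deadlocked thread stack trace: (?P<thread>\S+) locked on "
    r"(?P<lock>\S+) \(owned by (?P<owner>[^)]+)\)"
)


def parse_deadlock_header(line):
    """Return (blocked thread, lock, owning thread), or None if no match."""
    match = DEADLOCK_RE.search(line)
    if match is None:
        return None
    return match.group("thread"), match.group("lock"), match.group("owner")


# Message part of the second log line above (timestamp and bundle columns omitted).
sample = (
    "Deadlocked thread stack trace: "
    "opendaylight-cluster-data-notification-dispatcher-92 locked on "
    "org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl@43eaef86 "
    "(owned by epollEventLoopGroup-10-7):"
)
print(parse_deadlock_header(sample))
```

For the line above, this identifies the notification dispatcher thread as waiting on the ExportPolicyPeerTrackerImpl monitor held by epollEventLoopGroup-10-7.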
      

      I reproduced this deadlock locally on Nitrogen with the following features installed:

      odl-restconf
      odl-bgpcep-bgp
      

      I have attached scripts for quick configuration.
      They should be run from the tools/fastbgp directory of the integration/test project.
      The first, "configureall.py", sets up BGP to accept connections from our 10 peers (127.0.0.2-127.0.0.11).
      The second, "manypeersibgp.sh", starts play.py itself.
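
For reference, the peer address range mentioned above can be enumerated as follows. This is only a sketch of the address arithmetic; it does not reproduce the attached configureall.py, and the helper name `peer_addresses` is hypothetical.

```python
import ipaddress


def peer_addresses(first="127.0.0.2", count=10):
    """Return `count` consecutive IPv4 addresses starting at `first`."""
    start = ipaddress.IPv4Address(first)
    return [str(start + offset) for offset in range(count)]


peers = peer_addresses()
print(peers[0], peers[-1], len(peers))
```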

      Almost immediately after start, a deadlock occurs, from which ODL doesn't recover even after the script is killed.

      I am attaching deadlock.log from the YourKit profiler and a Wireshark capture of the communication. There is some problem with the OPEN message exchange between the script and ODL.

      I also attached countroutes.sh, a RESTCONF script that returns the number of routes that went through. With it, it is pretty easy to spot when the deadlock occurred.
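
One way to turn repeated route counts into a deadlock indicator is to look for the count freezing below the number of routes the peers are expected to send. The sketch below assumes the counts have already been collected (for example by polling countroutes.sh, which is attached and not reproduced here); the function name `first_stall`, the `window` parameter, and the sample numbers are all hypothetical.

```python
def first_stall(samples, expected_total, window=3):
    """Return the index where the count first freezes below expected_total.

    A freeze is `window` (>= 2) consecutive equal samples. Returns None when
    the count keeps moving or has already reached expected_total.
    """
    run_start = 0  # index of the first sample in the current run of equal values
    for i in range(1, len(samples)):
        if samples[i] != samples[i - 1]:
            run_start = i
        elif samples[i] < expected_total and i - run_start + 1 >= window:
            return run_start
    return None


# Made-up counts: ingest stops at 3400 routes out of an expected 10000.
stalled = [0, 1200, 3400, 3400, 3400, 3400]
print(first_stall(stalled, expected_total=10000))
```

A count that stays flat after reaching the expected total is treated as a finished run rather than a stall, which is why `expected_total` is required.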

        1. configureall.py (2 kB)
        2. countroutes.sh (0.2 kB)
        3. deadlock.log (10 kB)
        4. manypeers_changecount.pcapng (2.29 MB)
        5. manypeersibgp.sh (0.2 kB)

            Assignee: Claudio David Gasparini (cdgasparini)
            Reporter: Tomas Markovic (tomas.markovic)
            Votes: 0
            Watchers: 3
