[BGPCEP-760] Deadlock in manypeers_changecount test Created: 20/Feb/18  Updated: 18/Apr/18  Resolved: 01/Mar/18

Status: Verified
Project: bgpcep
Component/s: BGP
Affects Version/s: Nitrogen, Carbon, Oxygen
Fix Version/s: Nitrogen, Carbon, Oxygen

Type: Bug Priority: Medium
Reporter: Tomas Markovic Assignee: Claudio David Gasparini
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: configureall.py, countroutes.sh, deadlock.log, manypeers_changecount.pcapng, manypeersibgp.sh

 Description   

Temporary link to the sandbox job (runs 1 and 2): https://jenkins.opendaylight.org/sandbox/job/tomas-bgpcep-csit-1node-periodic-bgp-ingest-all-oxygen/

First errors from the sandbox test:

2018-02-20T10:57:38,373 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Oh nose - there are 2 deadlocked threads!! :-(
2018-02-20T10:57:38,377 | ERROR | infrautils.metrics.ThreadsWatcher-0 | ThreadsWatcher                   | 356 - org.opendaylight.infrautils.metrics-impl - 1.3.0.SNAPSHOT | Deadlocked thread stack trace: opendaylight-cluster-data-notification-dispatcher-92 locked on org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl@43eaef86 (owned by epollEventLoopGroup-10-7):
	 at org.opendaylight.protocol.bgp.rib.impl.ExportPolicyPeerTrackerImpl.getPeerGroup(ExportPolicyPeerTrackerImpl.java:114)
	 at org.opendaylight.protocol.bgp.mode.spi.AbstractRouteEntry.getRoutePeerIdRole(AbstractRouteEntry.java:96)
	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.lambda$fillAdjRibsOut$0(BaseAbstractRouteEntry.java:187)
	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry$$Lambda$1418/229233901.accept(Unknown Source)
	 at org.opendaylight.protocol.bgp.rib.impl.PeerExportGroupImpl.forEach(PeerExportGroupImpl.java:48)
	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.fillAdjRibsOut(BaseAbstractRouteEntry.java:186)
	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.addPathToDataStore(BaseAbstractRouteEntry.java:161)
	 at org.opendaylight.protocol.bgp.mode.impl.base.BaseAbstractRouteEntry.updateRoute(BaseAbstractRouteEntry.java:111)
	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.walkThrough(LocRibWriter.java:276)
	 at org.opendaylight.protocol.bgp.rib.impl.LocRibWriter.onDataTreeChanged(LocRibWriter.java:179)
	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67)
	 at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41)
	 at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38)
	 at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:166)
	 at akka.actor.Actor.aroundReceive(Actor.scala:514)
	 at akka.actor.Actor.aroundReceive$(Actor.scala:512)
	 at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:96)
	 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
	 at akka.actor.ActorCell.invoke(ActorCell.scala:496)
	 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	 at akka.dispatch.Mailbox.run(Mailbox.scala:224)
	 at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
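
For readers who want to check for the same condition outside of ODL, the following is a minimal sketch of deadlock detection using the standard JDK ThreadMXBean API. It is not taken from infrautils; the class name DeadlockCheck and the output format (which only imitates the log line above) are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of bgpcep or infrautils.
public class DeadlockCheck {

    /** Returns one description per deadlocked thread; the list is empty if there is no deadlock. */
    public static List<String> describeDeadlocks() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() returns null when no threads are deadlocked.
        long[] ids = bean.findDeadlockedThreads();
        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;
        }
        for (ThreadInfo info : bean.getThreadInfo(ids, Integer.MAX_VALUE)) {
            if (info == null) {
                continue; // the thread terminated between the two calls
            }
            StringBuilder sb = new StringBuilder();
            sb.append(info.getThreadName())
              .append(" locked on ").append(info.getLockName())
              .append(" (owned by ").append(info.getLockOwnerName()).append("):");
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append(System.lineSeparator()).append("\t at ").append(frame);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = describeDeadlocks();
        if (found.isEmpty()) {
            System.out.println("no deadlocked threads");
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

Calling describeDeadlocks() periodically from a scheduled task inside the affected JVM would report both threads of a cycle, including the one that owns the ExportPolicyPeerTrackerImpl monitor.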

Locally, I reproduced this deadlock on Nitrogen with the following features installed:

odl-restconf
odl-bgpcep-bgp

I attached scripts for quick configuration.
They should be run from the tools/fastbgp directory of the integration/test project.
First, "configureall.py" configures BGP to accept connections from our 10 peers (127.0.0.2-127.0.0.11).
Second, "manypeersibgp.sh" starts play.py itself.

The deadlock occurs almost immediately after start, and ODL doesn't recover from it even after the script is killed.

I am attaching deadlock.log from the YourKit profiler and a Wireshark capture of the communication. There's some problem with the OPEN message exchange between the script and ODL.

I also attached countroutes.sh, a RESTCONF script that gets the number of routes that went through. With it, it is pretty easy to spot when the deadlock occurred.



 Comments   
Comment by Claudio David Gasparini [ 26/Feb/18 ]

carbon https://git.opendaylight.org/gerrit/#/c/68665/

nitrogen https://git.opendaylight.org/gerrit/#/c/68667/

master https://git.opendaylight.org/gerrit/#/c/68648/

 

Comment by Claudio David Gasparini [ 26/Feb/18 ]

Hi Tomas, please confirm that the fix works as expected for all versions and close the bug.

 

Regards, 

Comment by Michael Vorburger [ 28/Feb/18 ]

tomas.markovic and cdgasparini I'm glad that INFRAUTILS-21 is adding value for you!

PS FYI also INFRAUTILS-22 and INFRAUTILS-29.
