[CONTROLLER-1745] produce-transactions can get stuck when closing itemProducer Created: 27/Jul/17  Updated: 25/Jul/23  Resolved: 24/Aug/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8898

 Description   

This is a fork of CONTROLLER-1687, as new investigation suggests the cause of this bug is different from the previously fixed CONTROLLER-1687 cause.

Filing against Controller/clustering, even though the guilty project might be Mdsal instead.

This affects one CSIT test case [34], but the probability of failure is less than 10%.

The test calls write-transactions on a shard follower, then isolates that member, then waits some time (115 s) expecting no failure, and then some more time (20 s) expecting a failure (a RequestTimeoutException). If this bug appears, write-transactions does not return anything within a reasonable time.

Copied from CONTROLLER-1687 comments:

> I will try to reproduce with more logging.

Reproduced. This time the karaf.log [39] shows that the progress stops when closing itemProducer. Will try to repeat with even more logs [40].

2017-07-26 17:28:10,129 | TRACE | pool-31-thread-2 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@30b30b0 Entering runFailed.

[34] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/794/log.html.gz#s1-s34-t7-k2-k16-k1-k1
[39] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/18/odl1_karaf.log.gz
[40] https://git.opendaylight.org/gerrit/#/c/60783/1



 Comments   
Comment by Vratko Polak [ 28/Jul/17 ]

> Will try to repeat with even more logs [40].

Finally got that repeated.
The big log [41] will get deleted in half a day, so I am going to paste long segments here.

First, here is an example of transaction submit going right:

2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Inserting item: 88373
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | ProxyHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Allocated proxy member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0 for transaction member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=OPEN} allocated sequence 0
2017-07-28 14:10:35,348 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronizing for null lastTx in transactionSubmitted
2017-07-28 14:10:35,348 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronized in transactionSubmitted, submitting
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | ardedDOMDataTreeWriteTransaction | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | Readying tx SHARDED-DOM-613
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=OPEN} sending request ModifyTransactionRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=1, protocol=READY} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$141/838048079@50466bcd
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | Queue org.opendaylight.controller.cluster.access.client.TransmitQueue$Transmitting@130ebfcf transmitting entry ConnectionEntry{request=ModifyTransactionRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=1, protocol=READY}, enqueuedTicks=503857260922}
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Client history org.opendaylight.controller.cluster.databroker.actors.dds.ClientLocalHistory@5f1578c5 changed state from TX_OPEN to IDLE
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Local history org.opendaylight.controller.cluster.databroker.actors.dds.ClientLocalHistory@5f1578c5 readied transaction member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED} allocated sequence 1
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED} sending request ModifyTransactionRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=0, protocol=SIMPLE} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$144/2040119293@41f2f4ef
2017-07-28 14:10:35,349 | DEBUG | pool-30-thread-1 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | Queue org.opendaylight.controller.cluster.access.client.TransmitQueue$Transmitting@130ebfcf transmitting entry ConnectionEntry{request=ModifyTransactionRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=0, protocol=SIMPLE}, enqueuedTicks=503857373572}
2017-07-28 14:10:35,349 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 submitted in transactionSubmitted, returning
2017-07-28 14:10:35,349 | DEBUG | pool-30-thread-1 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@1c238bc8 New future #5504 allocated

Compare to the final (possible) deadlock:

2017-07-28 14:12:35,377 | DEBUG | pool-30-thread-1 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Inserting item: 216844
2017-07-28 14:12:35,377 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronizing for null lastTx in transactionSubmitted
2017-07-28 14:12:35,377 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronized in transactionSubmitted, submitting
2017-07-28 14:12:35,377 | DEBUG | pool-30-thread-1 | ardedDOMDataTreeWriteTransaction | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | Readying tx SHARDED-DOM-614
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED} directCommit completed
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED}: initiating purge
2017-07-28 14:12:35,377 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-614-0, state=OPEN} sending request ModifyTransactionRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-614-0, sequence=120, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=37, protocol=READY} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction$$Lambda$141/838048079@7768cd2e
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=DONE} allocated sequence 2
2017-07-28 14:12:35,377 | DEBUG | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@1c238bc8 Future #5504 failed in state RUNNING cause org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023840912seconds
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=DONE} enqueing request TransactionPurgeRequest{target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=2, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990]} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$153/1233784439@539a6d2f
2017-07-28 14:12:35,377 | DEBUG | pool-30-thread-1 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-2-frontend-datastore-Shard-id-ints!: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-2-frontend-datastore-Shard-id-ints!, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.124:2550/user/shardmanager-config/member-1-shard-id-ints!-config#-145628703], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=id-ints!, dataTree=absent}} has a successor, not scheduling timer
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Removing the future.
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | The future has been removed.
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Local state RUNNING
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Setting state to FAILED.
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Cancelling writingFuture.
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Calling runFailed
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@1c238bc8 Entering runFailed.
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 entering close
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 close doing compare and set
2017-07-28 14:12:35,378 | TRACE | pool-31-thread-2 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 close set, synchronizing

Here, the threads are:

  • pool-30-thread-1: the writingExecutor thread
  • pool-31-thread-2: the completingExecutor thread
  • lt-dispatcher-26: the actor thread which detects transaction timeouts (I think)

The common part is up to "RemoteProxyTransaction {} sending request".
Then, instead of "TransmitQueue {} transmitting entry", pool-30-thread-1 only did "has a successor, not scheduling timer",
which I believe is called from AbstractClientConnection#commonEnqueue when queue.isEmpty(), and after that, nothing.

ShardedDOMDataTreeProducer@4884f6f8 remained in the synchronized state, which explains why the close method got stuck waiting to synchronize as well.

Any ideas where to put even more logs?

[41] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/81/odl2_karaf.log.gz

Comment by Robert Varga [ 29/Jul/17 ]

This looks like it's pointing towards an AB/BA deadlock. It would be best to obtain a thread dump when this thing is stuck – that way we know which threads are holding locks and what is being waited for...
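Besides a kill -3/jstack dump, the JVM can report such lock cycles programmatically through ThreadMXBean.findDeadlockedThreads(), which uses essentially the same detection that produces the "Found one Java-level deadlock" section of a full thread dump. A standalone sketch (names and structure hypothetical, unrelated to the controller code) that manufactures an AB/BA deadlock on two monitors and then detects it:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDetect {

    // Spawn two daemon threads that take two monitors in opposite order,
    // guaranteeing an AB/BA deadlock, then ask the JVM to report the cycle.
    static int createAndDetect() throws InterruptedException {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        spawn("fake-writer", lockA, lockB, bothHoldFirstLock);
        spawn("fake-timer", lockB, lockA, bothHoldFirstLock);
        bothHoldFirstLock.await(); // each thread now holds its first lock

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids;
        while ((ids = mx.findDeadlockedThreads()) == null) {
            Thread.sleep(50); // poll until the cycle is established and detected
        }
        return ids.length;
    }

    private static void spawn(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException ignored) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor forever
                }
            }
        }, name);
        t.setDaemon(true); // let the JVM exit despite the deadlocked threads
        t.start();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + createAndDetect());
    }
}
```

Running this prints the number of threads in the cycle (2 here); in a Karaf setting the equivalent one-shot check could be exposed over JMX instead of taking a full dump.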

Comment by Vratko Polak [ 31/Jul/17 ]

> obtain a thread dump

Got it: [42]. Karaf log for context: [43].

Copying the deadlock report, once again pool-30-thread-1 is the writingExecutor thread and pool-31-thread-2 is the completingExecutor thread:

Found one Java-level deadlock:
=============================
"pool-31-thread-2":
waiting to lock monitor 0x00007f93880b4d58 (object 0x00000000856472b8, a org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer),
which is held by "pool-30-thread-1"
"pool-30-thread-1":
waiting for ownable synchronizer 0x00000000f5c7f750, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
which is held by "opendaylight-cluster-data-akka.actor.default-dispatcher-3"
"opendaylight-cluster-data-akka.actor.default-dispatcher-3":
waiting for ownable synchronizer 0x00000000854dedf0, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
which is held by "pool-30-thread-1"

Java stack information for the threads listed above:
===================================================
"pool-31-thread-2":
    at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer.close(ShardedDOMDataTreeProducer.java:209)
    - waiting to lock <0x00000000856472b8> (a org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer)
    at org.opendaylight.controller.cluster.sharding.DistributedShardedDOMDataTree$ProxyProducer.close(DistributedShardedDOMDataTree.java:679)
    at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.closeProducer(ProduceTransactionsHandler.java:98)
    at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.runFailed(ProduceTransactionsHandler.java:135)
    at org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler.txFailure(AbstractTransactionHandler.java:201)
    at org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler$1.onFailure(AbstractTransactionHandler.java:136)
    at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

"pool-30-thread-1":
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000f5c7f750> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueOrForward(AbstractClientConnection.java:175)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:257)
    at org.opendaylight.controller.cluster.access.client.SimpleReconnectForwarder.forwardEntry(SimpleReconnectForwarder.java:18)
    at org.opendaylight.controller.cluster.access.client.TransmitQueue.enqueueOrForward(TransmitQueue.java:220)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueOrForward(AbstractClientConnection.java:179)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendEntry(AbstractClientConnection.java:257)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.sendRequest(AbstractClientConnection.java:153)
    at org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory.sendRequest(ProxyHistory.java:449)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.sendRequest(AbstractProxyTransaction.java:327)
    at org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction.sendModification(RemoteProxyTransaction.java:169)
    at org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction.flushBuilder(RemoteProxyTransaction.java:162)
    at org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransaction.sealAndSend(RemoteProxyTransaction.java:274)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.seal(AbstractProxyTransaction.java:339)
    at org.opendaylight.controller.cluster.databroker.actors.dds.ClientTransaction$$Lambda$133/1610149084.accept(Unknown Source)
    at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
    at org.opendaylight.controller.cluster.databroker.actors.dds.ClientTransaction.ready(ClientTransaction.java:103)
    at org.opendaylight.controller.cluster.sharding.DistributedShardModificationContext.ready(DistributedShardModificationContext.java:49)
    at org.opendaylight.controller.cluster.sharding.DistributedShardModification.seal(DistributedShardModification.java:63)
    at org.opendaylight.controller.cluster.sharding.ShardProxyTransaction.ready(ShardProxyTransaction.java:104)
    at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeWriteTransaction.doSubmit(ShardedDOMDataTreeWriteTransaction.java:143)
    at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer.submitTransaction(ShardedDOMDataTreeProducer.java:156)
    at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer.transactionSubmitted(ShardedDOMDataTreeProducer.java:239)
    - locked <0x00000000856472b8> (a org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer)
    at org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeWriteTransaction.submit(ShardedDOMDataTreeWriteTransaction.java:128)
    - locked <0x00000000f5a4fba8> (a org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeWriteTransaction)
    at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.execWrite(ProduceTransactionsHandler.java:129)
    at org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler.runningExecute(AbstractTransactionHandler.java:123)
    at org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler.execute(AbstractTransactionHandler.java:105)
    at org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler$$Lambda$260/1225246231.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

"opendaylight-cluster-data-akka.actor.default-dispatcher-3":
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000854dedf0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueEntry(AbstractClientConnection.java:190)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.enqueueRequest(AbstractClientConnection.java:170)
    at org.opendaylight.controller.cluster.databroker.actors.dds.ProxyHistory.enqueueRequest(ProxyHistory.java:445)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.enqueueRequest(AbstractProxyTransaction.java:322)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.enqueuePurge(AbstractProxyTransaction.java:642)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.enqueuePurge(AbstractProxyTransaction.java:624)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.enqueuePurge(AbstractProxyTransaction.java:619)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction.lambda$directCommit$4(AbstractProxyTransaction.java:516)
    at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$138/500577411.accept(Unknown Source)
    at org.opendaylight.controller.cluster.access.client.ConnectionEntry.complete(ConnectionEntry.java:50)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:415)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.runTimer(AbstractClientConnection.java:346)
    at org.opendaylight.controller.cluster.access.client.AbstractClientConnection$$Lambda$132/2043163015.execute(Unknown Source)
    at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:123)
    at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:46)
    at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:60)
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

[42] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/2/karaf_1_threads_after.log.gz
[43] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/2/odl1_karaf.log.gz

Comment by Robert Varga [ 31/Jul/17 ]

This is a classic AB/BA deadlock scenario involving "pool-30-thread-1" and "opendaylight-cluster-data-akka.actor.default-dispatcher-3", each of which is dealing with a chain of connections.

The former is submitting a request, hence it is taking the locks in order of succession (old to new), which is correct.

The latter is running a timeout check on the last connection in the succession chain and invoking a timeout callback, which is rooted at the head of the chain. That callback ends up trying to take the locks in the same order, but because pool-30-thread-1 is waiting for the tail of the chain, which is already held, we arrive at a deadlock.
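The trace above shows lockedCheckTimeout invoking ConnectionEntry.complete while still holding the connection lock. One common way to break this kind of cycle is to make sure user callbacks never run under that lock: drain the expired entries while holding it, release it, and only then invoke the callbacks. A minimal sketch of that pattern (names hypothetical, not necessarily the actual fix in the linked patch):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class TimeoutQueue {

    static final class Entry {
        final long deadlineTicks;
        final Runnable callback;

        Entry(long deadlineTicks, Runnable callback) {
            this.deadlineTicks = deadlineTicks;
            this.callback = callback;
        }
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final ArrayDeque<Entry> queue = new ArrayDeque<>();

    void enqueue(Entry entry) {
        lock.lock();
        try {
            queue.add(entry);
        } finally {
            lock.unlock();
        }
    }

    // Returns how many expired callbacks were run. Entries are drained while
    // holding the lock, but the callbacks themselves run only after the lock
    // has been released, so they cannot participate in a lock-ordering cycle
    // with threads that hold other locks while waiting for this one.
    int runTimer(long nowTicks) {
        List<Entry> expired = new ArrayList<>();
        lock.lock();
        try {
            while (!queue.isEmpty() && queue.peek().deadlineTicks <= nowTicks) {
                expired.add(queue.poll());
            }
        } finally {
            lock.unlock();
        }
        for (Entry entry : expired) {
            entry.callback.run(); // may take other locks or re-enter enqueue()
        }
        return expired.size();
    }

    public static void main(String[] args) {
        TimeoutQueue connection = new TimeoutQueue();
        connection.enqueue(new Entry(10, () -> System.out.println("txn-1 timed out")));
        connection.enqueue(new Entry(99, () -> System.out.println("txn-2 timed out")));
        System.out.println("expired: " + connection.runTimer(50));
    }
}
```

The trade-off is that the queue can change between draining and callback execution, so the callback code must tolerate seeing slightly stale state; the alternative of imposing a global lock order (always head-to-tail, including from timeout paths) would fix the same inversion.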

Comment by Robert Varga [ 31/Jul/17 ]

https://git.opendaylight.org/gerrit/60934

Generated at Wed Feb 07 19:56:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.