[CONTROLLER-1745] produce-transactions can get stuck when closing itemProducer Created: 27/Jul/17 Updated: 25/Jul/23 Resolved: 24/Aug/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 8898 |
| Description |
|
This is a fork of another issue, filed against Controller/clustering even though the guilty project might be Mdsal instead. This affects one CSIT test case [34], but the probability of failure is less than 10%. The test calls write-transactions on a shard follower, then isolates that member, then waits some time (115 s) expecting no failure, and then some more time (20 s) expecting a failure (a RequestTimeoutException). If this bug appears, write-transactions does not return anything within a reasonable time.

Copied from the original issue:

> I will try to reproduce with more logging.

Reproduced. This time the karaf.log [39] shows that the progress stops when closing itemProducer. Will try to repeat with even more logs [40].

2017-07-26 17:28:10,129 | TRACE | pool-31-thread-2 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler

[34] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/794/log.html.gz#s1-s34-t7-k2-k16-k1-k1 |
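The timing expectation above can be pictured with a plain java.util.concurrent sketch. The Callable below is only a stand-in for the write-transactions RPC (not the real clustering-it-provider API); the 115 s and 20 s windows mirror the test description:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the CSIT expectation: no failure within 115 s, then a timeout failure within 20 s.
public class IsolationTimeoutExpectation {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        // Stand-in for the write-transactions RPC running against the isolated follower;
        // the real backend raises RequestTimeoutException after roughly 120 s.
        Callable<Void> writeTransactions = () -> {
            Thread.sleep(130_000);
            throw new IllegalStateException("stand-in for RequestTimeoutException");
        };
        Future<Void> rpcResult = executor.submit(writeTransactions);

        try {
            rpcResult.get(115, TimeUnit.SECONDS);           // first window: must NOT fail yet
            System.out.println("unexpected: RPC finished before the request timeout");
        } catch (TimeoutException stillRunning) {
            System.out.println("OK: still running after 115 s");
        }

        try {
            rpcResult.get(20, TimeUnit.SECONDS);            // second window: failure expected
            System.out.println("unexpected: RPC succeeded");
        } catch (ExecutionException expected) {
            System.out.println("OK: failed as expected: " + expected.getCause());
        } catch (TimeoutException stuck) {
            System.out.println("BUG: no answer at all, the stuck case this issue describes");
        } finally {
            executor.shutdownNow();
        }
    }
}
```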
| Comments |
| Comment by Vratko Polak [ 28/Jul/17 ] |
|
> Will try to repeat with even more logs [40].

Finally got that repeated. First, here is an example of a transaction submit going right:

2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | ProduceTransactionsHandler | 257 - org.opendaylight.control
2017-07-28 14:10:35,348 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronizing for null lastTx in transactionSubmitted
2017-07-28 14:10:35,348 | TRACE | pool-30-thread-1 | ShardedDOMDataTreeProducer | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@4884f6f8 synchronized in transactionSubmitted, submitting
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | ardedDOMDataTreeWriteTransaction | 147 - org.opendaylight.mdsal.dom-broker - 2.2.2.SNAPSHOT | Readying tx SHARDED-DOM-613
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=OPEN} sending request ModifyTransactionRequest {target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=1, protocol=READY} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTransac, enqueuedTicks=503857260922}
2017-07-28 14:10:35,348 | DEBUG | pool-30-thread-1 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED} sending request ModifyTransactionRequest {target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=0, protocol=SIMPLE} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyT, enqueuedTicks=503857373572}

Compare to the final (possible) deadlock:

2017-07-28 14:12:35,377 | DEBUG | pool-30-thread-1 | ProduceTransactionsHandler | 257 - org.opendaylight.control
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=FLUSHED} : initiating purge sending request ModifyTransactionRequest {target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-614-0, sequence=120, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990], modifications=37, protocol=READY} callback org.opendaylight.controller.cluster.databroker.actors.dds.RemoteProxyTran
2017-07-28 14:12:35,377 | DEBUG | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@1c238bc8 Future #5504 failed in state RUNNING cause org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.023840912 seconds
2017-07-28 14:12:35,377 | DEBUG | lt-dispatcher-26 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy RemoteProxyTransaction{identifier=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, state=DONE} enqueing request TransactionPurgeRequest {target=member-2-datastore-Shard-id-ints!-fe-0-chn-5-txn-613-0, sequence=2, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-2004618990]}, cookie=0, backend=Shar has a successor, not scheduling timer

Here, the threads are:
The common part is up to the "RemoteProxyTransaction{} sending request" line. ShardedDOMDataTreeProducer@4884f6f8 remained inside its synchronized section, which explains why the close method got stuck waiting for the same monitor as well. Any ideas where to put even more logs? |
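The close-versus-submit interaction suspected above can be reproduced with a standalone sketch; the class and method names below are stand-ins chosen for the example, not the real mdsal ShardedDOMDataTreeProducer code:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Generic illustration (NOT the real mdsal ShardedDOMDataTreeProducer): if the submit path
// parks while holding the producer's monitor, a synchronized close() can never acquire it.
public class StuckProducerSketch {
    private final CountDownLatch backendReply = new CountDownLatch(1); // nobody counts this down

    // Stand-in for the submit path: blocks on something else while holding the monitor.
    public synchronized void transactionSubmitted() throws InterruptedException {
        System.out.println("submit: holding the producer monitor, waiting for the backend...");
        backendReply.await();
    }

    // Stand-in for close(): needs the very same monitor.
    public synchronized void close() {
        System.out.println("close: finally acquired the producer monitor");
    }

    public static void main(String[] args) throws Exception {
        StuckProducerSketch producer = new StuckProducerSketch();

        Thread writer = new Thread(() -> {
            try {
                producer.transactionSubmitted();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "writer");
        writer.start();

        TimeUnit.MILLISECONDS.sleep(200);     // let the writer grab the monitor first
        Thread closer = new Thread(producer::close, "closer");
        closer.start();

        closer.join(2_000);
        System.out.println("closer still blocked after 2 s: " + closer.isAlive()); // prints true

        writer.interrupt();                   // release the sketch so the JVM can exit
    }
}
```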
| Comment by Robert Varga [ 29/Jul/17 ] |
|
This looks like it's pointing towards an AB/BA deadlock. It would be best to obtain a thread dump when this thing is stuck – that way we know which threads are holding locks and what is being waited for... |
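One way to capture such a dump is running `jstack` against the Karaf process. The snippet below is a generic java.lang.management sketch (not project code) that produces roughly the same information from inside the JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Dumps all threads together with their locked monitors and ownable synchronizers,
// which is the information needed to see who holds which lock and who is waiting.
public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true, true => include locked monitors and ownable synchronizers (ReentrantLock etc.)
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info); // ThreadInfo.toString() includes a (truncated) stack trace
        }
    }
}
```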
| Comment by Vratko Polak [ 31/Jul/17 ] |
|
> obtain a thread dump

Got it: [42]. Karaf log for context: [43].

Copying the deadlock report; once again, pool-30-thread-1 is the writingExecutor thread and pool-31-thread-2 is the completingExecutor thread:

Found one Java-level deadlock:

Java stack information for the threads listed above:
[42] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/2/karaf_1_threads_after.log.gz |
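The "Found one Java-level deadlock" section in that dump comes from the JVM's built-in monitor-deadlock detector; the same check can be run programmatically, as in this generic sketch (not project code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Reports monitor/ownable-synchronizer deadlock cycles, i.e. the same cycles the
// "Found one Java-level deadlock" section of a thread dump describes.
public class DeadlockCheckSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when no deadlock exists
        if (deadlocked == null) {
            System.out.println("No Java-level deadlock found.");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
            System.out.print(info); // thread name, state, owned and awaited locks, stack
        }
    }
}
```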
| Comment by Robert Varga [ 31/Jul/17 ] |
|
This is a classic AB/BA deadlock scenario involving "pool-30-thread-1" and "opendaylight-cluster-data-akka.actor.default-dispatcher-3", each of which is dealing with a chain of connections. The former is submitting a request, hence it is taking the locks in order of succession (old to new), which is correct. The latter is running a timeout check on the last connection in the succession chain and invoking a timeout callback, which is rooted at the head of the chain. That callback ends up trying to take the locks in the same order, but pool-30-thread-1 already holds an earlier lock in the chain and is itself waiting for the tail lock, which is already held, so we arrive at a deadlock. |
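A minimal, self-contained illustration of the AB/BA pattern described here, using generic locks rather than the actual connection chain:

```java
// Generic AB/BA deadlock demo: two threads end up taking the same two locks in opposite
// order, mirroring the submit path vs. timeout-callback path described above.
public class AbBaDeadlockDemo {
    private static final Object HEAD_LOCK = new Object(); // stand-in for the head connection
    private static final Object TAIL_LOCK = new Object(); // stand-in for the tail connection

    public static void main(String[] args) {
        Thread submitter = new Thread(() -> {
            synchronized (HEAD_LOCK) {             // old-to-new order
                sleep(100);
                synchronized (TAIL_LOCK) {
                    System.out.println("submitter: got both locks");
                }
            }
        }, "submitter");

        Thread timeoutChecker = new Thread(() -> {
            synchronized (TAIL_LOCK) {             // starts while holding the tail...
                sleep(100);
                synchronized (HEAD_LOCK) {         // ...then the callback needs the head
                    System.out.println("timeout checker: got both locks");
                }
            }
        }, "timeout-checker");

        submitter.start();
        timeoutChecker.start();
        // Neither println ever runs; a thread dump of this JVM reports
        // "Found one Java-level deadlock" for these two threads.
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```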
| Comment by Robert Varga [ 31/Jul/17 ] |