[CONTROLLER-1631] Transaction producer aborted when local shard leader removed Created: 11/Apr/17  Updated: 25/Jul/23  Resolved: 15/Apr/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File karaf_log_1.tar.gz     File karaf_log_2.tar.gz     File karaf_log_3.tar.gz     HTML File log.html    
External issue ID: 8205

 Description   

The test scenario follows the steps of DOMDataBroker testing: Clean Leader Shutdown, local leader shutdown (a sketch of the RPC calls follows the error output below):
1) the producer was started with the write-transactions rpc on the node hosting the default config shard leader
2) the default config shard leader was removed via the remove-shard-replica rpc
3) the producer aborted and returned http status 500

{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#-1992099962)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".]]}

at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)
at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:102)
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.handleException(ConcurrentDOMDataBroker.java:222)
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker.access$000(ConcurrentDOMDataBroker.java:46)
at org.opendaylight.controller.cluster.databroker.ConcurrentDOMDataBroker$1.onFailure(ConcurrentDOMDataBroker.java:126)
at com.google.common.util.concurrent.Futures$6.run(Futures.java:1310)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)
at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202)
at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:68)
at org.opendaylight.controller.cluster.datastore.SingleCommitCohortProxy$1.onComplete(SingleCommitCohortProxy.java:58)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#-1992099962)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:745)
"}]}}



 Comments   
Comment by Peter Gubka [ 11/Apr/17 ]

Attachment log.html has been added with description: robot log

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_1.tar.gz has been added with description: node1 log

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_2.tar.gz has been added with description: node2 log

Comment by Peter Gubka [ 11/Apr/17 ]

Attachment karaf_log_3.tar.gz has been added with description: node3 log

Comment by Tom Pantelis [ 11/Apr/17 ]

What's the problem here? Can you provide more details as to the purpose of the test and what was expected? From my understanding of the description, you had a component committing transactions on the shard leader then removed the shard replica on the leader and a transaction failed after (timed out). I would expect that to occur.

Comment by Peter Gubka [ 11/Apr/17 ]

(In reply to Tom Pantelis from comment #4)
> What's the problem here? Can you provide more details as to the purpose of
> the test and what was expected? From my understanding of the description,
> you had a component committing transactions on the shard leader then removed
> the shard replica on the leader and a transaction failed after (timed out).
> I would expect that to occur.

The odl clustering test plan document describes the following:

Clean Leader Shutdown
---------------------
The goal is to ensure that applications do not observe disruption when a shard leader is shut down cleanly. This is performed by having a steady-stream producer execute operations against the shard, then initiating a clean shutdown of the shard leader, and finally shutting the producer down cleanly.

This test is executed in two scenarios:

  • Three-node, with shard leader being local
  • Three-node, with shard leader being remote

Success criteria are:

  • No transaction errors occur
  • Producer shuts down cleanly (i.e. all transactions complete successfully)

Test tool: test-transaction-producer, running at 1K tps

  • Steady, configurable producer started with:
    o A transaction chain
    o Single transactions (note: these cannot overlap)
  • Configurable transaction rate (i.e. transactions-per-second)
  • Single-operation transactions
  • Random mix across 1M entries

My test case starts the transaction producer with the write-transactions rpc and then uses the remove-shard-replica rpc to "shut down" the leader.
The write-transactions rpc then returns 500, but my understanding is that a new leader will be elected, the transactions should finish well, and http status 200 is expected from the write-transactions rpc.
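
For what it's worth, below is a minimal sketch of the producer loop the test plan describes, written against the Carbon-era controller MD-SAL DOM APIs. The namespace/QName and the entryPath()/randomEntry() helpers are hypothetical placeholders standing in for the real tool's 1M-entry key space; this illustrates the transaction-chain pattern, not the actual test-transaction-producer.

import java.util.concurrent.TimeUnit;
import org.opendaylight.controller.md.sal.common.api.data.AsyncTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
import org.opendaylight.controller.md.sal.dom.api.DOMDataBroker;
import org.opendaylight.controller.md.sal.dom.api.DOMDataWriteTransaction;
import org.opendaylight.controller.md.sal.dom.api.DOMTransactionChain;
import org.opendaylight.yangtools.yang.common.QName;
import org.opendaylight.yangtools.yang.data.api.YangInstanceIdentifier;
import org.opendaylight.yangtools.yang.data.api.schema.NormalizedNode;
import org.opendaylight.yangtools.yang.data.impl.schema.ImmutableNodes;

final class SteadyStreamProducer implements TransactionChainListener {
    // Hypothetical QName; the real tool writes into the low-level test model.
    private static final QName ITEM = QName.create("urn:example:lowlevel", "2017-04-11", "item");

    private final DOMDataBroker broker;

    SteadyStreamProducer(final DOMDataBroker broker) {
        this.broker = broker;
    }

    void run(final long transPerSec, final long seconds) throws Exception {
        final DOMTransactionChain chain = broker.createTransactionChain(this);
        for (long i = 0; i < transPerSec * seconds; i++) {
            // Single-operation transaction, as in the test plan.
            final DOMDataWriteTransaction tx = chain.newWriteOnlyTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, entryPath(i), randomEntry(i));
            // Success criterion: every submit completes, even across a leader change.
            tx.submit().checkedGet(30, TimeUnit.SECONDS);
        }
        chain.close();
    }

    // Placeholders for the random mix across 1M entries used by the real producer.
    private static YangInstanceIdentifier entryPath(final long i) {
        return YangInstanceIdentifier.of(ITEM);
    }

    private static NormalizedNode<?, ?> randomEntry(final long i) {
        return ImmutableNodes.leafNode(ITEM, i % 1_000_000);
    }

    @Override
    public void onTransactionChainFailed(final TransactionChain<?, ?> chain,
            final AsyncTransaction<?, ?> transaction, final Throwable cause) {
        // This is the failure mode reported here: the in-flight transaction timed out
        // after the local default-config shard leader was removed.
    }

    @Override
    public void onTransactionChainSuccessful(final TransactionChain<?, ?> chain) {
        // All transactions completed; the producer shut down cleanly.
    }
}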

Comment by Tom Pantelis [ 11/Apr/17 ]

I don't know who wrote the test plan, but there's no guarantee that all transactions will succeed across leader changes. It does retry transactions if there's no current leader, but a transaction already in flight may time out and fail. That's the behavior with the current implementation. With Robert's new front-end work tracked by CONTROLLER-1483, the intention is to retry transactions in all cases, also covering ones in flight. So I think the expectations in the test plan should be consistent with 5280. I believe you can enable the new behavior via the cfg file, but it isn't on by default yet. Is that correct, Robert?
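
For context, a minimal sketch of what that cfg switch would look like, assuming the standard etc/org.opendaylight.controller.cluster.datastore.cfg mechanism (the file and property names are not confirmed in this thread):

# etc/org.opendaylight.controller.cluster.datastore.cfg
# Switch the datastore front-end from the default ask-based protocol to the tell-based one.
use-tell-based-protocol=true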

Comment by Robert Varga [ 11/Apr/17 ]

Peter, is this with tell-based-protocol=true? The AskTimeoutException seems to be indicating otherwise.

Comment by Peter Gubka [ 11/Apr/17 ]

(In reply to Robert Varga from comment #7)
> Peter, is this with tell-based-protocol=true? The AskTimeoutException seems
> to be indicating otherwise.

No. tell-based-protocol was not enabled.
Is this the right way to enable it? https://git.opendaylight.org/gerrit/#/c/54340/

Because I used it here https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/1/ and the particular log message does not appear in the logs.

Comment by Robert Varga [ 12/Apr/17 ]

This scenario assumes the tell-based protocol; https://git.opendaylight.org/gerrit/54848 should fix its activation.

Comment by Robert Varga [ 13/Apr/17 ]

master: https://git.opendaylight.org/gerrit/54956
