[CONTROLLER-1631] Transaction producer aborted when local shard leader removed Created: 11/Apr/17 Updated: 25/Jul/23 Resolved: 15/Apr/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 8205 |
| Description |
|
The test scenario follows the steps of DOMDataBroker testing: Clean Leader Shutdown: local leader shutdown
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"TransactionCommitFailedException {message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#-1992099962)]] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".]]} at org.opendaylight.controller.md.sal.dom.broker.impl.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37) |
| Comments |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment log.html has been added with description: robot log |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_1.tar.gz has been added with description: node1 log |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_2.tar.gz has been added with description: node2 log |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
Attachment karaf_log_3.tar.gz has been added with description: node3 log |
| Comment by Tom Pantelis [ 11/Apr/17 ] |
|
What's the problem here? Can you provide more details as to the purpose of the test and what was expected? From my understanding of the description, you had a component committing transactions on the shard leader then removed the shard replica on the leader and a transaction failed after (timed out). I would expect that to occur. |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
(In reply to Tom Pantelis from comment #4)
The ODL clustering test plan document describes the following:
Clean Leader Shutdown
This test is executed in two scenarios:
Success criteria are:
Test tool: test-transaction-producer, running at 1K tps
My test case starts the transaction producer with the write-transaction rpc and then uses the remove-shard-replica rpc to "shut down" the leader. |
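The two steps described above could be sketched as a pair of RESTCONF RPC requests. This is only an illustration that builds the requests without sending them. The `/restconf/operations/` path prefix, the module name `odl-mdsal-lowlevel-control`, the base URL, and the input field names (`transactions-per-second`, `shard-name`) are assumptions not stated in this ticket and should be checked against the YANG model actually deployed. The RPC names, the 1K tps rate, and the `default` shard name are taken from the text and the error message above.

```python
import json


def build_rpc_request(base_url, module, rpc, rpc_input):
    """Return (url, body) for a RESTCONF RPC POST. Nothing is sent."""
    # Assumption: RPCs are exposed under /restconf/operations/<module>:<rpc>.
    url = f"{base_url.rstrip('/')}/restconf/operations/{module}:{rpc}"
    body = json.dumps({"input": rpc_input}, sort_keys=True)
    return url, body


def clean_leader_shutdown_steps(base_url, module, shard_name, tps):
    """Build the two requests of the scenario, in the order they are issued."""
    return [
        # Step 1: start the transaction producer on the shard leader.
        # The input field name is hypothetical.
        build_rpc_request(base_url, module, "write-transaction",
                          {"transactions-per-second": tps}),
        # Step 2: remove the local replica to "shut down" the leader.
        # The input field name is hypothetical.
        build_rpc_request(base_url, module, "remove-shard-replica",
                          {"shard-name": shard_name}),
    ]


if __name__ == "__main__":
    # Hypothetical base URL and module name, for illustration only.
    steps = clean_leader_shutdown_steps(
        "http://127.0.0.1:8181", "odl-mdsal-lowlevel-control", "default", 1000)
    for url, body in steps:
        print("POST", url, body)
```

The point of the ordering is that the producer is still submitting transactions at the moment the replica is removed, which is what exposes the in-flight commit to the leader change.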
| Comment by Tom Pantelis [ 11/Apr/17 ] |
|
I don't know who wrote the test plan but there's no guarantee that all transactions will succeed across leader changes. It does retry transactions if there's no current leader but a transaction already in flight may time out and fail. Now that's the behavior with the current implementation. With Robert's new front-end stuff tracked by |
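The distinction drawn in this comment (a transaction is retried when no leader is known, but one already in flight can time out) can be illustrated with a toy model. This is not the controller's implementation. Time is reduced to integer ticks, and `AskTimeout`, `leader_up`, and `commit` are hypothetical names invented for the sketch.

```python
class AskTimeout(Exception):
    """Stand-in for akka.pattern.AskTimeoutException in this toy model."""


def leader_up(tick):
    """Toy timeline: old leader until tick 4, no leader for 5..7, new leader from 8."""
    return tick < 5 or tick >= 8


def commit(leader_up, start, latency=2, max_retries=5):
    """Return the tick at which the commit completes, or raise AskTimeout.

    A transaction that finds no leader waits one tick and retries.
    A transaction that was sent to a leader gets its reply only if that
    leader stays up for the whole round trip; otherwise the ask times out.
    """
    tick = start
    for _ in range(max_retries + 1):
        if not leader_up(tick):
            tick += 1  # no current leader: retry on the next tick
            continue
        # A leader is known, so the message is now in flight.
        if all(leader_up(t) for t in range(tick, tick + latency + 1)):
            return tick + latency
        raise AskTimeout(f"no reply for transaction sent at tick {tick}")
    raise AskTimeout("no leader found within the retry budget")


if __name__ == "__main__":
    for start in (0, 4, 5):
        try:
            print(start, "committed at tick", commit(leader_up, start))
        except AskTimeout as exc:
            print(start, "failed:", exc)
```

In this model a transaction submitted at tick 5 (during the leaderless window) is retried and eventually commits, while one submitted at tick 4 is caught mid-flight by the leader removal and fails, which mirrors the failure reported in the description.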
| Comment by Robert Varga [ 11/Apr/17 ] |
|
Peter, is this with tell-based-protocol=true? The AskTimeoutException seems to be indicating otherwise. |
| Comment by Peter Gubka [ 11/Apr/17 ] |
|
(In reply to Robert Varga from comment #7) No, tell-based-protocol was not enabled. Because I used it here https://jenkins.opendaylight.org/sandbox/job/bgpcep-csit-3node-periodic-bgpclustering-only-carbon/1/ and the particular log message is not in the logs. |
| Comment by Robert Varga [ 12/Apr/17 ] |
|
This scenario assumes the tell-based protocol; https://git.opendaylight.org/gerrit/54848 should fix its activation. |
| Comment by Robert Varga [ 13/Apr/17 ] |