[CONTROLLER-1660] C: ddb leader isolation: after node rejoin within transaction timeout, write-transactions returned 500 from all nodes Created: 07/May/17  Updated: 25/Jul/23  Resolved: 08/May/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8393

 Description   

robot:
https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/665/archives/log.html.gz#s1-s50-t1-k2-k18-k1-k1

scenario:
transactions are produced on each node
the leader is isolated
a new leader is elected within a few seconds (less than 10 s)
the isolated node is then rejoined and the test waits for the shards to be in sync on all nodes
the write-transactions RPC should return 200, but 500 is returned from all nodes

logs https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/665/archives/
2017-05-07 18:11:17,081 | INFO | h for user karaf | command | 266 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Healing_Within_Transaction_Timeout

bug may be related to 8372



 Comments   
Comment by Peter Gubka [ 07/May/17 ]

All 3 nodes returned TransactionCommitFailedException caused by an Akka timeout, but the first stack trace comes from
org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction
other 2 come from
org.opendaylight.controller.cluster.datastore.messages.BatchedModifications

Comment by Tom Pantelis [ 08/May/17 ]

This is expected with ask-based protocol. This was the main reason for the new tell-based. I would suggest testing this scenario with tell-based only.

Comment by Peter Gubka [ 08/May/17 ]

(In reply to Tom Pantelis from comment #2)
> This is expected with ask-based protocol. This was the main reason for the
> new tell-based. I would suggest testing this scenario with tell-based only.

It was tested with tell-based=true. Before every test suite, ODL was restarted with a specific tell-based setting (true or false). In this case it was restarted with tell-based=true.

Comment by Tom Pantelis [ 08/May/17 ]

You're saying that it failed with AskTimeout for BatchedModifications and ReadyLocalTransaction. That is the ask-based protocol.
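
As a rough illustration of that check, the sketch below scans a stack trace for the two message classes named in this issue. The helper name ask_based_messages_in and the sample trace text are hypothetical and are not taken from the test logs; the only assumption carried over from the discussion is that these two classes indicate the ask-based protocol.

```python
# Message classes named in this issue. Per the discussion above, a timeout
# on either of them points at the ask-based protocol.
ASK_BASED_MESSAGES = (
    "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction",
    "org.opendaylight.controller.cluster.datastore.messages.BatchedModifications",
)


def ask_based_messages_in(trace):
    """Return short names of the ask-based message classes found in a trace."""
    return [name.rsplit(".", 1)[1] for name in ASK_BASED_MESSAGES if name in trace]


if __name__ == "__main__":
    # Hypothetical trace text, for illustration only.
    sample = (
        "TransactionCommitFailedException ... timed out waiting for "
        "org.opendaylight.controller.cluster.datastore.messages.BatchedModifications"
    )
    print(ask_based_messages_in(sample))
```

If the returned list is non-empty for a run that was supposed to use tell-based=true, the setting most likely did not take effect on that node.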
