[CONTROLLER-1468] [Clustering] Datastore operations failure when leader is down Created: 11/Jan/16 Updated: 19/Oct/17 Resolved: 26/Feb/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Beryllium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Muthukumaran Kothandaraman | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | c1.karaf.rar, c3.karaf.rar |
| External issue ID: | 4923 |
| Description |
|
Steps: Please find attached the logs of c1 and c3. A similar failure log (see the attached files) is seen in the Datastore write operation while attempting to delete 300 flows from follower c1 or c3 (test parameters: dpid: 1, tableId: 1, sourceIp: 2). The attached logs are from the remaining nodes after the erstwhile leader (controller c2) was brought down: c3 (the new leader) and c1 (a follower). |
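For context, one way such per-flow deletes can be issued in Beryllium is through the RESTCONF config datastore API. A minimal sketch follows; the controller address, credentials, node/table/flow identifiers, and the use of RESTCONF itself are illustrative assumptions, not details taken from this report:

    # Hypothetical delete of a single flow from the config datastore via follower c1;
    # each RESTCONF call is committed as its own datastore transaction.
    curl -u admin:admin -X DELETE \
      "http://<c1-ip>:8181/restconf/config/opendaylight-inventory:nodes/node/openflow:1/flow-node-inventory:table/1/flow/<flow-id>"

Repeated once per flow, such calls exercise the remote write path from the follower to the shard leader, which is the path implicated in this report.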
| Comments |
| Comment by Muthukumaran Kothandaraman [ 11/Jan/16 ] |
|
Attachment c1.karaf.rar has been added with description: c1 inventory shard follower log |
| Comment by Muthukumaran Kothandaraman [ 11/Jan/16 ] |
|
Attachment c3.karaf.rar has been added with description: c3 node - new leader after c2 node was brought down |
| Comment by Tom Pantelis [ 19/Jan/16 ] |
|
In c3, it looks like c2 (10.183.181.42) was taken down at this point:

2015-12-14 02:16:01,780 | WARN | lt-dispatcher-49 | ClusterCoreDaemon | 124 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.183.181.43:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.183.181.42:2550, status = Up)]

c3 became leader at:

2015-12-14 02:16:07,452 | INFO | lt-dispatcher-40 | ShardManager | 138 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-inventory-config from Candidate to Leader

Some 2 minutes later a tx timed out:

2015-12-14 02:18:52,127 | WARN | lt-dispatcher-48 | Shard | 135 - org.opendaylight.controller.sal-akka-raft - 1.3.0.SNAPSHOT | member-3-shard-inventory-config: Current transaction member-1-txn-6406 has timed out after 15000 ms - aborting

It's hard to tell why w/o debugging. In step #4, were the 300 deletes batched in 1 tx or done in 300 tx's? It would be helpful to reproduce with debug enabled just prior to doing the deletes. In the new leader (c3), enable org.opendaylight.controller.cluster.datastore.Shard, and in the remote node initiating the deletes (c1), enable org.opendaylight.controller.cluster.RemoteTransactionContext and org.opendaylight.controller.cluster.SingleCommitCohortProxy. |
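For reference, a minimal sketch of raising these logger levels from each node's Karaf console with Karaf's standard log:set command; the logger names are exactly those listed above:

    # On c3, the new leader:
    log:set DEBUG org.opendaylight.controller.cluster.datastore.Shard

    # On c1, the node initiating the deletes:
    log:set DEBUG org.opendaylight.controller.cluster.RemoteTransactionContext
    log:set DEBUG org.opendaylight.controller.cluster.SingleCommitCohortProxy

The overrides can be removed afterwards with log:set DEFAULT <logger>.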
| Comment by Jozef Behran [ 19/Feb/16 ] |
|
|
| Comment by Muthukumaran Kothandaraman [ 25/Feb/16 ] |
|
We do not use real switches, as this is mainly a study of datastore behavior; the Openflowplugin inventory model is just the data model used. With the latest stable/beryllium build we are not able to reproduce this exact scenario. We have tried it once so far and will retry with more repetitions to see whether the problem is really gone. We will come to a conclusion by tomorrow at the latest and close this bug accordingly. |
| Comment by Muthukumaran Kothandaraman [ 26/Feb/16 ] |
|
This is NOT reproducible in the latest stable/beryllium, so closing this bug. |