[CONTROLLER-1468] [Clustering] Datastore operations failure when leader is down Created: 11/Jan/16  Updated: 19/Oct/17  Resolved: 26/Feb/16

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Beryllium
Fix Version/s: None

Type: Bug
Reporter: Muthukumaran Kothandaraman Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File c1.karaf.rar     File c3.karaf.rar    
External issue ID: 4923

 Description   

Test cases:
===========
Topology 3 nodes: Leader restart with 1 switch and 1K flows per DPN (add, modify, and delete when one node is down)

Steps:-
=======
1. In a 3-node cluster (c1, c2, c3), c2 is the leader and the rest are followers.
2. Added 1000 flows via c1 and verified they were replicated across all the nodes.
3. Brought the leader c2 down; c3 became the new leader.
4. Deleted 300 flows from follower c1 (see the datastore-delete sketch after this list).
5. Observed 1000 flows in the remaining nodes c1 and c3 when doing a GET operation on them; the expectation was 700 flows in each node.
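
For context, the delete in step 4 amounts to a config-datastore write transaction against the inventory shard. Below is a minimal sketch of that kind of delete, assuming the Beryllium-era controller MD-SAL binding API; it is not the bulk-o-matic code itself, and the generated-model package path for Flow is an assumption to verify.

import com.google.common.util.concurrent.CheckedFuture;
import java.util.List;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
import org.opendaylight.yang.gen.v1.urn.opendaylight.flow.inventory.rev130819.tables.table.Flow;
import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

public final class FlowDeleteSketch {

    // Deletes the given flow paths in one write transaction. With the former
    // leader down, checkedGet() is where a TransactionCommitFailedException
    // wrapping the AskTimeoutException seen in the attached logs would surface.
    static void deleteFlows(DataBroker broker, List<InstanceIdentifier<Flow>> flowPaths)
            throws TransactionCommitFailedException {
        WriteTransaction tx = broker.newWriteOnlyTransaction();
        for (InstanceIdentifier<Flow> path : flowPaths) {
            tx.delete(LogicalDatastoreType.CONFIGURATION, path);
        }
        CheckedFuture<Void, TransactionCommitFailedException> commit = tx.submit();
        commit.checkedGet();
    }
}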

Please find attached logs of c1 and c3.

A log similar to the following is seen while attempting to delete 300 flows from follower c1 or c3:
=========================================================
2015-12-14 02:32:44,594 | INFO | qtp440143820-841 | FlowTesterConcurrentImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Using Concurrent Flow Tester Impl
2015-12-14 02:32:47,525 | INFO | Pool-13-worker-1 | FlowTesterConcurrentImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Total Flows read: 1001
2015-12-14 02:33:01,577 | INFO | qtp440143820-849 | FlowTesterImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Using Sequential Flow Tester Impl
2015-12-14 02:33:06,602 | WARN | lt-dispatcher-44 | ConcurrentDOMDataBroker | 138 - org.opendaylight.controller.sal-distributed-datastore -
1.3.0.SNAPSHOT | Tx: DOM-9707 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.183.181.43:2550/), Path
(/user/shardmanager-config/member-3-shard-inventory-config)]] after [5000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)[123:com.typesafe.akka.actor:2.3.14]
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)[123:com.typesafe.akka.actor:2.3.14]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
[120:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-b66a39653b]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)[120:org.scala-lang.scala-library:2.10.4.v20140209-180020-VFINAL-
b66a39653b]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)[123:com.typesafe.akka.actor:2.3.14]
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)[123:com.typesafe.akka.actor:2.3.14]
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)[123:com.typesafe.akka.actor:2.3.14]
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)[123:com.typesafe.akka.actor:2.3.14]
at java.lang.Thread.run(Thread.java:745)[:1.7.0_65]
2015-12-14 02:33:06,609 | ERROR | CommitFutures-10 | FlowTesterImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Error: TransactionCommitFailedException

{message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.183.181.43:2550/), Path (/user/shardmanager-config/member-3-shard-inventory-config)]] after [5000 ms]]]}

in Datastore write operation: dpid: 1, tableId: 1, sourceIp: 2
2015-12-14 02:33:07,975 | INFO | qtp440143820-849 | FlowTesterConcurrentImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Using Concurrent Flow Tester Impl
2015-12-14 02:33:10,772 | INFO | Pool-15-worker-1 | FlowTesterConcurrentImpl | 158 - org.opendaylight.openflowplugin.applications.bulk-o-matic -
0.2.0.SNAPSHOT | Total Flows read: 1000

Attaching the logs for the remaining nodes after the erstwhile leader (controller c2) was brought down; the logs are for the c3 (new leader) and c1 (follower) nodes.
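
The 5000 ms in the AskTimeoutException above matches the datastore's per-operation ask timeout, which appears to default to 5 seconds in this release. While reproducing, that value can be checked (or temporarily raised) in etc/org.opendaylight.controller.cluster.datastore.cfg; the property name below is the one believed to control it and should be verified against the file shipped with the distribution.

# etc/org.opendaylight.controller.cluster.datastore.cfg (property name assumed, verify locally)
operation-timeout-in-seconds=5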



 Comments   
Comment by Muthukumaran Kothandaraman [ 11/Jan/16 ]

Attachment c1.karaf.rar has been added with description: c1 inventory shard follower log

Comment by Muthukumaran Kothandaraman [ 11/Jan/16 ]

Attachment c3.karaf.rar has been added with description: c3 node - new leader after c2 node was brought down

Comment by Tom Pantelis [ 19/Jan/16 ]

In c3, it looks like c2 (10.183.181.42) was taken down at this point:

2015-12-14 02:16:01,780 | WARN | lt-dispatcher-49 | ClusterCoreDaemon | 124 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.183.181.43:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.183.181.42:2550, status = Up)]

c3 became leader at:

2015-12-14 02:16:07,452 | INFO | lt-dispatcher-40 | ShardManager | 138 - org.opendaylight.controller.sal-distributed-datastore - 1.3.0.SNAPSHOT | shard-manager-config: Received role changed for member-3-shard-inventory-config from Candidate to Leader

Some 2 minutes later a tx timed out:

2015-12-14 02:18:52,127 | WARN | lt-dispatcher-48 | Shard | 135 - org.opendaylight.controller.sal-akka-raft - 1.3.0.SNAPSHOT | member-3-shard-inventory-config: Current transaction member-1-txn-6406 has timed out after 15000 ms - aborting

It's hard to tell why w/o debugging.

In step #4, were the 300 deletes batched in 1 tx or done in 300 tx's?
Is it reproducible with only 1 flow deleted? Or with 10, etc.?
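
To illustrate the distinction, a per-flow variant of the delete sketched in the description would look roughly like this (same assumed MD-SAL API and types as that sketch): one transaction, and therefore one CAN_COMMIT round against the inventory shard leader, per flow instead of one for the whole batch.

    // Hypothetical per-flow variant: 300 separate commits for 300 deleted flows,
    // versus the single batched transaction in the earlier sketch.
    static void deleteFlowsOneByOne(DataBroker broker, List<InstanceIdentifier<Flow>> flowPaths)
            throws TransactionCommitFailedException {
        for (InstanceIdentifier<Flow> path : flowPaths) {
            WriteTransaction tx = broker.newWriteOnlyTransaction();
            tx.delete(LogicalDatastoreType.CONFIGURATION, path);
            tx.submit().checkedGet();
        }
    }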

It would be helpful to reproduce with debug enabled just prior to doing the deletes. In the new leader (c3), enable org.opendaylight.controller.cluster.datastore.Shard and in the remote node initiating the deletes (c1), enable org.opendaylight.controller.cluster.RemoteTransactionContext and org.opendaylight.controller.cluster.SingleCommitCohortProxy.
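
For reference, one way to enable those loggers is from the Karaf console on the respective nodes (logger names copied from the paragraph above; confirm they match the classes in your build):

On c3 (new leader):
  log:set DEBUG org.opendaylight.controller.cluster.datastore.Shard
On c1 (node initiating the deletes):
  log:set DEBUG org.opendaylight.controller.cluster.RemoteTransactionContext
  log:set DEBUG org.opendaylight.controller.cluster.SingleCommitCohortProxy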

Comment by Jozef Behran [ 19/Feb/16 ]

CONTROLLER-1486 might be an incarnation of this bug. In that case, the leader appears not to have been elected yet when the datastore operation failed.

Comment by Muthukumaran Kothandaraman [ 25/Feb/16 ]

We do not use switches, as this is mainly to study datastore behavior; only the Openflowplugin inventory model is used.

With the latest stable/beryllium build, we are not able to reproduce this exact scenario. Tried it once; we will try more repetitions to see whether it has really gone.

Will come to a conclusion by tomorrow at the latest on whether to close this bug.

Comment by Muthukumaran Kothandaraman [ 26/Feb/16 ]

This is NOT reproducible in the latest stable/beryllium, so closing this bug.
