[CONTROLLER-1463] [Clustering] Datastore: Unrecoverable failure when high volume of write transactions are initiated Created: 21/Dec/15 Updated: 19/Oct/17 Resolved: 21/Mar/16 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Beryllium |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Muthukumaran Kothandaraman | Assignee: | Unassigned |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| Attachments: | HA_LargeDataFailed.zip |
| External issue ID: | 4823 |
| Priority: | High |
| Description |
|
Build used :
Test Type :
Objective of test :
Test Steps :
3. Check if the flows are completely pushed and record various metrics like - rate, state and latency
Controllers (to cross-check logs):
Enclosed Logs:
Observations and issue-summary: |
| Comments |
| Comment by Muthukumaran Kothandaraman [ 21/Dec/15 ] |
|
Attachment HA_LargeDataFailed.zip has been added with description: Zipped logs from all 3 nodes of cluster - c1, c2, c3 |
| Comment by Shuva Jyoti Kar [ 22/Dec/15 ] |
|
Even after the nodes have become reachable, fresh transactions initiated from the follower continue to fail, although fresh transactions from the Leader succeed |
| Comment by Muthukumaran Kothandaraman [ 22/Dec/15 ] |
|
VM Configuration used for this test - VM Configuration per VM (each of 3 nodes of cluster):
VM OS : Ubuntu 14.04 LTS
JVM GC Monitoring |
| Comment by Tom Pantelis [ 22/Dec/15 ] |
|
In c2's log, c1 goes unreachable at 03:55:36,917:

2015-12-17 03:55:36,917 | WARN | lt-dispatcher-61 | ClusterCoreDaemon | 124 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.183.181.42:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.183.181.41:2550, status = Up)]

The ShardManager got notified of this, as evidenced by the PeerDown message to the EntityOwnershipShard shortly thereafter. The ShardManager also sets the leaderAvailable flag to false in the ShardInformation and clears the primaryShardInfoCache. This will cause subsequent transactions to retry/wait for the primary shard.

c1 went reachable at 03:55:37,908:

2015-12-17 03:55:37,908 | INFO | lt-dispatcher-63 | kka://opendaylight-cluster-data) | 124 - com.typesafe.akka.slf4j - 2.3.14 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.183.181.42:2550] - Marking node(s) as REACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.183.181.41:2550, status = Up)]

The ShardManager should've gotten notified of this and set leaderAvailable to true. About 20 sec later we see transaction failures. One thing to note is that setLeaderAvailable doesn't notify callbacks waiting on findPrimary, although it should. Transactions initiated after the reachable event should immediately succeed from findPrimary. The last transaction failure was at 03:55:57,667, which seems to have timed out waiting on findPrimary; that is expected due to setLeaderAvailable not notifying callbacks. This is also the last entry in the log, so it seems that if there were transactions created after that time, they succeeded.

I would suggest turning on ShardManager debug so we can see if it is notified of member reachable and to see what's happening in findPrimary.

The underlying issue is akka temporarily marking nodes as unreachable. This could be a network blip or, more likely, a GC pause. There are tuning settings in akka: akka.cluster.failure-detector.acceptable-heartbeat-pause (default 3 sec) |
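For reference, a sketch of how that failure-detector knob could be raised in the cluster's akka configuration (in Beryllium-era deployments this typically lives in configuration/initial/akka.conf); the 10 s figure is only an example value, not a recommendation from this thread:

    akka {
      cluster {
        failure-detector {
          # default is 3 s; a larger pause tolerates longer GC pauses before a node
          # is marked UNREACHABLE, at the cost of slower detection of real failures
          acceptable-heartbeat-pause = 10 s
        }
      }
    }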
| Comment by Faiz Ilahi Kothari [ 23/Dec/15 ] |
|
Since we are pushing transactions sequentially (on receiving the onSuccess() callback of the previous tx), we stop pushing transactions the moment we hit the onFailure() callback of the previous tx. So no transactions were created after we hit the failure. However, we tried pushing new transactions from c2 and c3 after 20 min but they still failed. After this the cluster goes into a bad state and we cannot do any more transactions from any of the followers. |
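A minimal sketch of the submit-on-success driving loop described above, assuming the Beryllium-era controller binding APIs; the SequentialWriter class name and the reused path/data placeholders are illustrative only, not the actual bulk-o-matic code:

    import com.google.common.util.concurrent.CheckedFuture;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    /** Drives writes one at a time: the next transaction is submitted only from onSuccess(). */
    class SequentialWriter<T extends DataObject> {
        private final DataBroker broker;

        SequentialWriter(DataBroker broker) {
            this.broker = broker;
        }

        void write(final InstanceIdentifier<T> path, final T data, final int remaining) {
            WriteTransaction tx = broker.newWriteOnlyTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, path, data);
            CheckedFuture<Void, TransactionCommitFailedException> future = tx.submit();
            Futures.addCallback(future, new FutureCallback<Void>() {
                @Override
                public void onSuccess(Void result) {
                    if (remaining > 0) {
                        // the next flow would be built here; this placeholder reuses the same path/data
                        write(path, data, remaining - 1);
                    }
                }

                @Override
                public void onFailure(Throwable t) {
                    // stop pushing transactions on the first failure, as described in the comment above
                }
            });
        }
    }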
| Comment by Muthukumaran Kothandaraman [ 23/Dec/15 ] |
|
Tom,

The below failure (UNREACHABLE -> REACHABLE and subsequently, when we again try to push flows, it fails with NoShardLeaderFound) does NOT happen when we try pushing lower volumes of flows (e.g. we tried with 1 switch and up to 100K flows). Even in the case of 1 switch and 100K flows, we occasionally see the UNREACHABLE -> REACHABLE transition, but the transactions do succeed without any issues after the blip. So, I think the functional aspect of handling UNREACHABLE and REACHABLE by the ShardManager still works as expected (because transactions succeed after the blip). The blip rendering the cluster unrecoverable is seen only when we try cases like 10 switches and 1M flows across switches, after about 500K+ transactions have completed. So, it appears that something timing-related is causing this.

Anyway, as you had mentioned, we will collect logs with debug enabled. Yes, we did think about tuning the akka heartbeat; we just reserved it as a last resort since we did not want to mask the side-effects of such blips. |
| Comment by Tom Pantelis [ 23/Dec/15 ] |
|
I thought you said that after the unreachable blip transactions always fail on the follower and don't recover (#5), which I don't see evidence of in the c2 log you posted. I gather from what you're saying that this isn't always the case. Also, it seems the run from which you posted the logs did not exhibit the unrecoverable state. We need an "unrecoverable state" run with ShardManager debug on.

Also take a look at the Shard mbeans on each node in jconsole - do they show raft states as expected? Also take a look at the akka cluster state in jconsole - does it show everything up?

If everything looks right in the mbeans, then try a single manual transaction with debug on and isolate/capture the log output snippet (I do tail -f then hit a bunch of returns to clear the screen prior). Does the output indicate that the ShardManager findPrimary failed and threw the NoShardLeaderException? If the ShardManager succeeded, then the problem lies elsewhere. In that case, enable Shard debug on the follower and leader and try again. |
| Comment by Tom Pantelis [ 28/Dec/15 ] |
| Comment by Muthukumaran Kothandaraman [ 05/Jan/16 ] |
|
Hi Tom,

Saw this in ClusterWrapperImpl.java:

    public void subscribeToMemberEvents(ActorRef actorRef) {
        Preconditions.checkNotNull(actorRef, "actorRef should not be null");
        cluster.subscribe(actorRef, ClusterEvent.initialStateAsEvents(),
                ClusterEvent.MemberEvent.class, ClusterEvent.UnreachableMember.class);
    }

Subscription to ClusterEvent.ReachableMember seems to be missing? |
| Comment by Tom Pantelis [ 05/Jan/16 ] |
|
Yes it is - I missed that - good catch. We can add ReachableMember.class, or change UnreachableMember.class to ReachabilityEvent.class to cover both. Can you make this change locally and try it out? I also submitted https://git.opendaylight.org/gerrit/#/c/31866/. |
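For clarity, a sketch of what the discussed one-line change could look like, mirroring the method quoted above (adding ClusterEvent.ReachableMember.class to the existing subscription); this is an illustration of the idea, not necessarily the exact patch in the gerrit linked above:

    import akka.actor.ActorRef;
    import akka.cluster.ClusterEvent;
    import com.google.common.base.Preconditions;

    // Also subscribe to ReachableMember so the ShardManager learns when a
    // previously unreachable member becomes reachable again.
    public void subscribeToMemberEvents(ActorRef actorRef) {
        Preconditions.checkNotNull(actorRef, "actorRef should not be null");
        cluster.subscribe(actorRef, ClusterEvent.initialStateAsEvents(),
                ClusterEvent.MemberEvent.class,
                ClusterEvent.UnreachableMember.class,
                ClusterEvent.ReachableMember.class);
    }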
| Comment by Muthukumaran Kothandaraman [ 06/Jan/16 ] |
|
Sure Tom. We added ReachableMember.class (and, as expected, the memberReachable logs of ShardManager.java started appearing after the UNREACHABLE to REACHABLE transition) and we are continuing to stress the datastore with these changes. Will update with further observations if, post the REACHABLE state, transactions continue successfully and we do not see the NoShardLeader exception. Perhaps this can uncover the next-level real issue. |
| Comment by Muthukumaran Kothandaraman [ 06/Jan/16 ] |
|
So with the following changes, we see that transactions fail with AskTimeoutException (the ask from follower to leader failed, as can be seen from the log; leader IP - 10.183.181.31) after about 500K txns.

Changes in build:

Exception stack on the node which initiates the transaction (follower of inventory-

in Datastore write operation: dpid: 1, tableId: 0, sourceIp: 770923
2016-01-05 06:47:11,526 | ERROR | CommitFutures-4 | FlowTesterImpl | 179 - org.opendaylight.openflowplugin.applications.bulk-o-matic - 0.2.0.SNAPSHOT | Error: TransactionCommitFailedException {message=canCommit encountered an unexpected failure, errorList=[RpcError [message=canCommit encountered an unexpected failure, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.183.181.31:2550/), Path(/user/shardmanager-config/member-1-shard-inventory-config)]] after [30000 ms]]]}
in Datastore write operation: dpid: 1, tableId: 0, sourceIp: 770923
in Datastore write operation: dpid: 2, tableId: 0, sourceIp: 770893

Important additional observation: after about 5 mins we retried a few hundred transactions to see if this is a transient issue, but the datastore continuously fails. So it looks like a functional issue rather than a transient queue outage.

By any chance, is this rhyming with the following bug ? |
| Comment by Tom Pantelis [ 06/Jan/16 ] |
|
So it seems like you've gotten past NoShardLeader and are now seeing AskTimeoutException, I assume. The throttling should avoid AskTimeoutExceptions. I assume you're using write-only txns. Can you change to read-write and try again? |
| Comment by Tom Pantelis [ 06/Jan/16 ] |
|
Muthu - can you submit your patch for the ClusterWrapperImpl change? |
| Comment by Faiz Ilahi Kothari [ 07/Jan/16 ] |
|
Tom,

> The throttling should avoid AskTimeoutExceptions.

By throttling I assume you mean throttling by TransactionRateLimiter. We haven't touched it. We don't see NoShardLeaderFoundException, but we see many UNREACHABLE-REACHABLE blips from which it successfully recovers, yet it eventually fails with AskTimeout. And then no further Txs succeed, even if we wait for 5 min, 30 min, an hour ...

> I assume you're using write-only txns. Can you change to read-write and try

I'll try with read-write, but is there any particular reason / what effect will this change bring? |
| Comment by Tom Pantelis [ 07/Jan/16 ] |
|
read-write follows a different path - it uses a separate actor to prepare the transaction, as opposed to write-only, which is optimized to prepare directly on the Shard. I want to see if the latter is overloading the Shard actor. |
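To illustrate the difference being discussed, a minimal sketch of switching the test from a write-only to a read-write transaction with the Beryllium-era binding DataBroker; the class name and the path/flow parameters are placeholders:

    import com.google.common.util.concurrent.CheckedFuture;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.ReadWriteTransaction;
    import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    class TxPathComparison {
        // write-only: optimized path, prepared directly on the Shard (per the comment above)
        static <T extends DataObject> CheckedFuture<Void, TransactionCommitFailedException> writeOnly(
                DataBroker broker, InstanceIdentifier<T> path, T flow) {
            WriteTransaction tx = broker.newWriteOnlyTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, path, flow);
            return tx.submit();
        }

        // read-write: goes through a separate transaction actor to prepare the modification
        static <T extends DataObject> CheckedFuture<Void, TransactionCommitFailedException> readWrite(
                DataBroker broker, InstanceIdentifier<T> path, T flow) {
            ReadWriteTransaction tx = broker.newReadWriteTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, path, flow);
            return tx.submit();
        }
    }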
| Comment by Tom Pantelis [ 07/Jan/16 ] |
|
Submitted https://git.opendaylight.org/gerrit/#/c/32219 to subscribe to the ReachableMember event. |
| Comment by Faiz Ilahi Kothari [ 07/Jan/16 ] |
|
AskTimeout is thrown again with read-write transaction. |
| Comment by Muthukumaran Kothandaraman [ 19/Jan/16 ] |
|
Observations based on usage of PingPongBroker and TransactionChain:

TransactionChain - one chain per switch was created. After about ~785K flows across 10 switches, transaction failures were observed with AskTimeoutException. In order to be optimistic, transaction chains were not closed, so that txn-chain contexts are retained indefinitely for each switch.

Observations:

So, there was an improvement in throughput. But the volume of transactions could not be sustained beyond ~785K transactions. Any other specific things to look out for in the logs? |
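For reference, a minimal sketch of the one-chain-per-switch arrangement described above, using BindingTransactionChain from the binding DataBroker (the PingPongBroker wrapping is not shown); the PerSwitchChainWriter name and the listener bodies are placeholders:

    import com.google.common.util.concurrent.CheckedFuture;
    import org.opendaylight.controller.md.sal.binding.api.BindingTransactionChain;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.AsyncTransaction;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionChain;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionChainListener;
    import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    /** One transaction chain per switch; successive flow writes for that switch go through its chain. */
    class PerSwitchChainWriter implements TransactionChainListener {
        private final BindingTransactionChain chain;

        PerSwitchChainWriter(DataBroker broker) {
            // the chain orders this switch's transactions relative to each other
            this.chain = broker.createTransactionChain(this);
        }

        <T extends DataObject> CheckedFuture<Void, TransactionCommitFailedException> writeFlow(
                InstanceIdentifier<T> path, T flow) {
            WriteTransaction tx = chain.newWriteOnlyTransaction();
            tx.put(LogicalDatastoreType.CONFIGURATION, path, flow);
            return tx.submit();
        }

        @Override
        public void onTransactionChainFailed(TransactionChain<?, ?> chain,
                AsyncTransaction<?, ?> transaction, Throwable cause) {
            // a failed transaction poisons the chain; a real app would close and recreate it
        }

        @Override
        public void onTransactionChainSuccessful(TransactionChain<?, ?> chain) {
            // chain closed cleanly
        }
    }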
| Comment by Colin Dixon [ 19/Jan/16 ] |
|
On the MD-SAL interest call today, TomP said that he didn't think this was critical and shouldn't block Beryllium. Muthu, if you agree, should we reclassify it as major? |
| Comment by Muthukumaran Kothandaraman [ 19/Jan/16 ] |
|
Reclassified this as major from critical as it does not block the functional aspect as such |
| Comment by Muthukumaran Kothandaraman [ 02/Feb/16 ] |
|
The issue is reproducible in a single-node controller without clustering also. With about 1.5M flows using the Openflow inventory model, we observe the same issue.

Other observations:

So, in single node we are trying to bump the transaction-timeout to see if FGC is the real cause. Will continue the investigation and record the observations here. |
| Comment by Moiz Raja [ 02/Feb/16 ] |
|
I believe I know what is causing this issue. I was able to reproduce it by adding a yang list with about 2 million items and then deleting the list. This only happens when you have a Data Change Listener for the large data that you are putting into the data store.

When a transaction is committed, the thread which commits the transaction also generates the Data Tree Change notification and the Data Change notification and then responds with a commit success. Generating the DCN takes a very long time - far greater than the 30 second transaction timeout - and this causes the transaction to appear to fail, even though in reality the data commit has not failed. Another observation is that in this scenario heap usage increases as the DCN is generated, because it takes a lot of memory to create the Data Change event.

Possible fixes:

2. Come up with a faster implementation of ResolveDataChangeEventsTask. I'm not sure this is possible, but maybe converting it to use a non-recursive algorithm may help.

3. Make generation of DCN asynchronous. This means though that we respond with a transaction commit success before we generate the DCN, and I'm not sure if that would violate the Data Store contract. Of course this will simply move the problem elsewhere, because now we would have slow DCNs and maybe the DCN generator queue will get backed up. |
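As a rough illustration of fix idea 3 only - not the actual datastore code - a sketch of decoupling notification generation from the committing thread by handing the work to a dedicated executor; the AsyncNotifier name and method are hypothetical:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** Hypothetical sketch: commit is acknowledged first, notification fan-out happens elsewhere. */
    class AsyncNotifier {
        // a single thread preserves notification ordering, but its queue can still back up
        private final ExecutorService notificationExecutor = Executors.newSingleThreadExecutor();

        void onCommitApplied(Runnable notificationWork) {
            // commit success can be returned before this runs, which is exactly
            // the contract question raised in the comment above
            notificationExecutor.execute(notificationWork);
        }
    }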
| Comment by Muthukumaran Kothandaraman [ 02/Feb/16 ] |
|
Ah ... we do have a DCN listener sitting as part of the Openflowplugin app which listens for flow changes and pushes them to switches "if switches are connected". Since we did not have switches connected, I thought this change listener would be pretty benign. Now I hear what you are saying - that building the notification payload for a DCL itself can be a very costly operation, and that it sits in the middle of the transaction-commit thread - that is a good education, Moiz!! Will do a quick check by disabling this listener. |
| Comment by Tony Tkacik [ 03/Feb/16 ] |
|
Which garbage collector are you using? Try the same situation using the G1 garbage collector. |
| Comment by Tom Pantelis [ 04/Feb/16 ] |
|
I wrote a unit test that writes a 2M item list and then deletes it, with both a DCL and a DTCL. Generating the DCN takes many seconds as Moiz noted, while the DTCN takes a few ms. DTCL is definitely the way to go. I don't know at this point what could be done to speed up generating DCNs at scale, or if it's even worth the time/effort to investigate. I think the time would be better spent converting client DCLs to DTCLs. I think for Boron we should deprecate DCL and the associated classes/methods so we start the clock ticking for getting rid of it. |
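For anyone converting a listener, a minimal sketch of registering a DataTreeChangeListener (DTCL) with the Beryllium-era controller binding APIs in place of the legacy registerDataChangeListener() call; the class name and the path parameter are placeholders:

    import java.util.Collection;
    import org.opendaylight.controller.md.sal.binding.api.DataBroker;
    import org.opendaylight.controller.md.sal.binding.api.DataTreeChangeListener;
    import org.opendaylight.controller.md.sal.binding.api.DataTreeIdentifier;
    import org.opendaylight.controller.md.sal.binding.api.DataTreeModification;
    import org.opendaylight.controller.md.sal.common.api.data.LogicalDatastoreType;
    import org.opendaylight.yangtools.concepts.ListenerRegistration;
    import org.opendaylight.yangtools.yang.binding.DataObject;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    /** Registers a DTCL instead of the legacy DataChangeListener. */
    class DtclExample<T extends DataObject> implements DataTreeChangeListener<T> {

        ListenerRegistration<?> register(DataBroker broker, InstanceIdentifier<T> path) {
            // DTCN delivery exposes the tree structure directly and avoids the
            // expensive DCN event-map computation on the commit path
            return broker.registerDataTreeChangeListener(
                    new DataTreeIdentifier<>(LogicalDatastoreType.CONFIGURATION, path), this);
        }

        @Override
        public void onDataTreeChanged(Collection<DataTreeModification<T>> changes) {
            for (DataTreeModification<T> change : changes) {
                // inspect change.getRootPath() / change.getRootNode() as needed
            }
        }
    }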
| Comment by Muthukumaran Kothandaraman [ 04/Feb/16 ] |
|
Tom, we disabled the single DCN listener which was listening for updates of the flows (inventory config shard) and retained the DTCN listener for the shard. Now we see that the issue is postponed to the order of 4.3M flows on single-node, i.e. after 4.3M we again get this issue. We are verifying whether this number reduces in a cluster and, if so, how much degradation is observed.

We also wanted to check the behavior when ALL change-listeners (including DTCN listeners) are disabled: in that case the datastore bails out only due to sheer heap running out of memory and NOT due to transaction timeout. This can be a good datapoint to see how change-notification processing is influenced by the data volume in the datastore (since change-notification computation involves recursive computation based on data already present, the change-computation overhead is unlikely to be O(1)). |
| Comment by Tom Pantelis [ 04/Feb/16 ] |
|
I submitted https://git.opendaylight.org/gerrit/#/c/34096/ to offload generation of DCNs onto another actor. It pushes the problem elsewhere but at least it doesn't block the Shard. Muthu - can you try it out when you get time? |
| Comment by Robert Varga [ 07/Feb/16 ] |
|
DCNs are unfixable, as they do not expose the tree structure to the users. The only thing that could be done would be computing the change Map lazily, which would be a huge explosion in terms of complexity. |
| Comment by Robert Varga [ 08/Feb/16 ] |
|
I have thrown some of this on the whiteboard, and it seems that we need frontend/backend recovery when ask timeouts happen. Essentially this boils down to the fact that we are tracking both the liveness of the remote system and its progress via the same mechanism. The frontend has no means of recovery from an ask timeout, hence it becomes all badness once the working set size gets large; this is the root cause and needs to be fixed.

I think we are missing the mechanism by which the frontend can re-submit a transaction and, if the backend has completed processing it, the backend will know this for a fact and respond with the proper hard-state result of that transaction. This needs to be the case during soft-timeouts (e.g. a long time between submit and the success message), but also for timeouts caused by leadership movement - for example, if the success message gets lost but the effects of the transaction have already been replicated. This should drive what gets replicated (e.g. client transaction IDs), as well as snapshotting rules and the API exposed to clients.

The second part would apply the principles learned from the frontend/backend exercise to communication between components within the backend. I am marking this down as a topic for the upcoming DDF. |
| Comment by Muthukumaran Kothandaraman [ 21/Mar/16 ] |
|
With the following configurations in a 3-node cluster, up to 4M flows went through without AskTimeoutException:

1. Used G1GC and JDK8.

Also, in single node, we could see up to approx 8M flows. So, to conclude, apart from using G1GC of the JDK, the capacity seems to be mainly limited by dimensioning based on the volume of data required to be stored, which is specific to the deployment and specific project requirements. |