|
Test references:
================
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_wait_for_stable_talking_ipv4_topology_2
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_wait_for_stable_talking_ipv4_topology_3
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_talking_ipv4_topology_count_2
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_talking_ipv4_topology_count_3
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_wait_for_stable_ipv4_topology_after_listening_2
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_wait_for_stable_ipv4_topology_after_listening_3
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_for_empty_ipv4_topology_after_listening_2
bgpcep_bgpclustering_txt_010_singlepeer_prefixcount_check_for_empty_ipv4_topology_after_listening_3
|
|
Please supply the full karaf logs.
|
|
And attach to the bug.
(In reply to Tom Pantelis from comment #2)
> Please supply the full karaf logs.
|
|
Is the operational datastore replicated? What are the JVM sizing parameters in this test?
|
|
We have analyzed the problem manifested here. Essentially it boils down to the ShardLeader being too busy to respond to the frontend message before the timeout expires. This opened the question of how state transitions occur in the system, as leadership changes have an impact on running transactions.
This issue has two essential causes:
1) Frontend->Backend messages are not idempotent.
The frontend needs to cover the case where a message gets lost (for whatever reason), hence it places an upper bound on how long it waits for the response to arrive. When that bound is hit, it cannot simply re-submit the message, because the previous message may still be being processed, which could mean running the same transaction twice (this is the case for BatchedModifications; other messages have similar problems).
2) Leader->Follower state is not fully synchronized.
We only persist the results of a transaction, not the intermediate state leading to those results. This works out (mostly) fine for cases where the client and leader are co-located, as all required state is kept on the client. For a client talking to a remote Leader, we have state on both the client and the leader. In the event of a Leader change we end up losing the leader-side state of the transaction, hence a transaction spanning the transition will fail. The client does not keep the state it has communicated to the Leader via BatchedModifications, so it cannot replay it to the new leader. Further problems arise around transaction chain existence (e.g. the follower does not have the corresponding actors), hence these will inherently fail, too.
There is a third problem, reported in BUG-5153, where a client restart does not flush Leader-side state. While that problem has been partially addressed by using wall-time timestamps, we need to solve it again so that we minimize the replicated (and persisted) state we keep about transactions. More specifically, if the backend detects a frontend restart, it should prune any state pertaining to that frontend instance.
The solution involves introducing/formalizing/modifying a few concepts already present in the implementation, describing their semantic meaning, and defining the frontend/backend interactions in terms of those concepts. This will allow us to reason about recovery strategies in the face of frontend/backend communication failures, including issues stemming from leader restarts and leader location changes. That will then drive requirements on what the actual Shard FSM looks like and, by extension, define which events need to be persisted (and replicated via RAFT).
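As an illustration of the idempotency problem in point 1 above, here is a minimal Java sketch of how retransmissions could be made safe once requests carry a per-transaction sequence number: the backend remembers the last response it produced for each transaction and replays it for a duplicate request instead of executing it again. This is illustrative only - the class and method names are hypothetical, not the current controller code.

import java.util.HashMap;
import java.util.Map;

final class IdempotentRequestTracker {
    private static final class LastRequest {
        final long sequence;
        final Object response;

        LastRequest(final long sequence, final Object response) {
            this.sequence = sequence;
            this.response = response;
        }
    }

    private final Map<String, LastRequest> lastRequests = new HashMap<>();

    /**
     * @param txId frontend-assigned transaction identifier
     * @param sequence frontend-assigned, monotonically increasing per transaction
     * @return the previously recorded response if this is a retransmit, otherwise null
     */
    Object checkRetransmit(final String txId, final long sequence) {
        final LastRequest last = lastRequests.get(txId);
        if (last != null && last.sequence == sequence) {
            // Duplicate of an already-executed request: replay the recorded response
            // rather than running the transaction logic a second time.
            return last.response;
        }
        return null;
    }

    void recordResponse(final String txId, final long sequence, final Object response) {
        lastRequests.put(txId, new LastRequest(sequence, response));
    }
}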
I will write up an asciidoc document describing the intent and post a patch here.
|
|
Actually, the front-end does keep the modification state prior to ready, so it can be replayed - I have a prototype draft I did a while back for this.
(In reply to Robert Varga from comment #5)
> The client does not keep the state it has communicated to the Leader
> via BatchedModifications and so it cannot replay it to the new leader.
|
|
I am not saying it cannot – but it currently keeps only the last batch (in RemoteTransactionContext). I also think that keeping around the entire transaction will be problematic for large transactions.
|
|
I changed it to keep the entire trail of transaction operations, but in RemoteTransactionContextSupport, such that it can retry the FindPrimaryShard. RemoteTransactionContextSupport already cached the client transaction operations up to the point where it finds the primary shard - I just extended it to keep caching beyond that point in case the RemoteTransactionContext failed. It was just a prototype; it obviously needed more work to handle duplicate messages on the leader.
Anyway, I see a few scenarios for front-end transaction disruption due to loss of the known leader:
1) The current leader is deposed for some reason but is still running. In this case it will forward all current front-end transaction state, as well as subsequent messages, to the new leader.
2) The current leader is shut down and successfully transfers leadership. It will forward all current front-end transaction state to the new leader (or attempt to). However, subsequent messages will be lost as the front-end continues to send to the prior leader's actor selection.
3) The current leader process dies ungracefully. All cached leader state is lost and all subsequent messages will also be lost.
4) The current leader is still running but the front-end loses connection due to network failure. Subsequent messages could succeed if the messages are retried and the network heals in time.
My prototype patch is an attempt to handle 2, 3 and 4 by finding the new leader and replaying all cached client operations (up to and including "ready"). Interspersed read operations for R/W transactions are problematic, though, as are CanCommit and Commit messages for cross-shard transactions.
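For illustration, a simplified Java sketch of the kind of client-side operation log such a prototype would keep so that the whole sequence can be replayed against a newly located leader. The names and types here are placeholders - the real code records YangInstanceIdentifier/NormalizedNode pairs inside RemoteTransactionContextSupport.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class RecordedWriteOperations {
    enum OpType { PUT, MERGE, DELETE, READY }

    static final class Operation {
        final OpType type;
        final String path;   // stands in for YangInstanceIdentifier
        final Object data;   // stands in for NormalizedNode

        Operation(final OpType type, final String path, final Object data) {
            this.type = type;
            this.path = path;
            this.data = data;
        }
    }

    private final List<Operation> operations = new ArrayList<>();

    // Called as the application issues writes, both before and after the primary shard is found.
    void record(final OpType type, final String path, final Object data) {
        operations.add(new Operation(type, path, data));
    }

    // Replay every recorded operation, in order, against a newly found leader.
    void replayTo(final Consumer<Operation> sendToLeader) {
        operations.forEach(sendToLeader);
    }
}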
Maybe instead of caching and replaying front-end operations on loss of the current leader, we could encapsulate message retries in a custom AskPattern that internally attempts to find the new leader if a message times out. The RemoteTransactionContext would create the pattern instance up-front with the initial actor selection.
No matter how, or whether, we implement front-end retries, the messages would have to be idempotent on the shard.
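A very rough sketch of what that retrying ask could look like, assuming a leader lookup such as the hypothetical findPrimaryShardLeader() below exists. This is not working code, and as noted above it is only safe if the backend treats the retransmitted message idempotently.

import akka.actor.ActorSelection;
import akka.dispatch.Futures;
import akka.dispatch.OnComplete;
import akka.pattern.AskTimeoutException;
import akka.pattern.Patterns;
import akka.util.Timeout;
import scala.concurrent.ExecutionContext;
import scala.concurrent.Future;
import scala.concurrent.Promise;

final class RetryingAsk {
    private RetryingAsk() {
    }

    static Future<Object> ask(final ActorSelection leader, final Object message, final Timeout timeout,
            final int retriesLeft, final ExecutionContext ec) {
        final Promise<Object> promise = Futures.promise();
        Patterns.ask(leader, message, timeout).onComplete(new OnComplete<Object>() {
            @Override
            public void onComplete(final Throwable failure, final Object reply) {
                if (failure == null) {
                    promise.success(reply);
                } else if (failure instanceof AskTimeoutException && retriesLeft > 0) {
                    // Re-resolve the (possibly new) shard leader and retry the same message.
                    promise.completeWith(ask(findPrimaryShardLeader(), message, timeout, retriesLeft - 1, ec));
                } else {
                    promise.failure(failure);
                }
            }
        }, ec);
        return promise.future();
    }

    private static ActorSelection findPrimaryShardLeader() {
        // Placeholder for the real FindPrimaryShard lookup.
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}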
|
|
The custom ask pattern that retries the current message wouldn't handle #3; in that case the front-end would have to replay all prior operations after finding the new leader.
Another possibility would be to replicate intermediate transaction operations to the followers prior to the final commit, so they could pick them up should they become leader. This would be complicated for R/W transactions, where the intermediate operations are handled by a separate transaction actor (we would probably have to eliminate this actor).
|
|
Good points, but I do believe we also have problems around transaction chains (and their associated actors), hence I am writing up the analysis.
For addressing this issue, I want to start with a clean-slate design and then map it to the actual implementation. State management and replication need to be documented so that they can be reviewed and known to be correct – for all interactions and all failure modes.
I know this is not a quick win and it is going to take a lot of effort, hence the approach with well-documented design first.
In any case, the journal will have to contain more events than just DataTreeCandidatePayload, and state related to transactions (like their identifiers) will have to be persisted.
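Purely as an illustration of that last point (this class does not exist today and is only a hypothetical shape), a journal payload that carries the transaction identifier alongside the data tree delta might look roughly like:

import java.io.Serializable;

final class TransactionCommitPayload implements Serializable {
    private static final long serialVersionUID = 1L;

    // e.g. "member-2-datastore-operational-fe-0-chn-29-txn-1157"
    private final String transactionId;
    // serialized DataTreeCandidate, i.e. what DataTreeCandidatePayload carries today
    private final byte[] serializedCandidate;

    TransactionCommitPayload(final String transactionId, final byte[] serializedCandidate) {
        this.transactionId = transactionId;
        this.serializedCandidate = serializedCandidate;
    }

    String getTransactionId() {
        return transactionId;
    }

    byte[] getSerializedCandidate() {
        return serializedCandidate;
    }
}

Persisting the identifier along with the delta would let a new leader rebuild per-transaction bookkeeping during recovery, rather than only the resulting data tree.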
|
|
Yeah, this really needs to be thought through in detail, and documentation would help. There is no quick win for handling all the failure scenarios, but there's time to do it right - no need to rush it.
(In reply to Robert Varga from comment #10)
> Good points, but I do believe we have problems also around transaction
> chains (and their associated actors), hence I am writing up the analysis.
>
> For addressing this issue, I want to start we a clean-slate design and then
> map it to actual implementation. State management and replication needs to
> be documented so that it can be reviewed and known to be correct – for all
> interactions and all failure modes.
>
> I know this is not a quick win and it is going to take a lot of effort,
> hence the approach with well-documented design first.
>
> In any case, the journal will have to contain more events than just
> DataTreeCandidatePayload, and state related to transactions (like their
> identifiers) will have to be persisted.
|
|
initial drop of analysis: https://git.opendaylight.org/gerrit/37544
|
|
The following patch was submitted to the stable/boron branch to fix this bug: https://git.opendaylight.org/gerrit/#/c/45127
To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Boron without it? Is there a workaround such that we can write a release note and fix in future Boron SR1?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?
|
|
Issue still present on Boron:
https://logs.opendaylight.org/sandbox/jenkins091/bgpcep-csit-3node-periodic-bgpclustering-only-boron/9/archives/odl2_karaf.log.gz
2017-02-23 14:55:28,852 | INFO | lt-dispatcher-61 | EntityOwnershipShard | 187 - org.opendaylight.controller.sal-akka-raft - 1.4.3.SNAPSHOT | Peer address for peer member-1-shard-entity-ownership-operational set to akka.tcp://opendaylight-cluster-data@10.29.12.157:2550/user/shardmanager-operational/member-1-shard-entity-ownership-operational
2017-02-23 14:55:32,455 | WARN | on-dispatcher-66 | OperationLimiter | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to acquire operation permit for transaction member-2-datastore-operational-fe-0-chn-30-txn-1150
2017-02-23 14:55:33,766 | WARN | entLoopGroup-9-1 | OperationLimiter | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to acquire operation permit for transaction member-2-datastore-operational-fe-0-chn-29-txn-1157
2017-02-23 14:55:37,455 | WARN | on-dispatcher-66 | OperationLimiter | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to acquire operation permit for transaction member-2-datastore-operational-fe-0-chn-30-txn-1150
2017-02-23 14:55:38,767 | WARN | entLoopGroup-9-1 | OperationLimiter | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to acquire operation permit for transaction member-2-datastore-operational-fe-0-chn-29-txn-1157
2017-02-23 14:55:42,456 | WARN | on-dispatcher-66 | OperationLimiter | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to acquire operation permit for transaction member-2-datastore-operational-fe-0-chn-30-txn-1150
2017-02-23 14:55:43,726 | ERROR | lt-dispatcher-17 | TransactionChainProxy | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Tx: member-2-datastore-operational-fe-0-chn-29-txn-1157 - ready future failed for previous Tx member-2-datastore-operational-fe-0-chn-29-txn-1157
2017-02-23 14:55:43,726 | ERROR | lt-dispatcher-49 | LocalThreePhaseCommitCohort | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Failed to prepare transaction member-2-datastore-operational-fe-0-chn-29-txn-1156 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-default-operational#1626255105)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[174:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[174:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_121]
2017-02-23 14:55:43,727 | WARN | lt-dispatcher-62 | ConcurrentDOMDataBroker | 192 - org.opendaylight.controller.sal-distributed-datastore - 1.4.3.SNAPSHOT | Tx: DOM-CHAIN-28-1156 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-default-operational#1626255105)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[174:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[170:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[174:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[174:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_121]
|
|
Any update on this? I too am seeing this with Boron, even though the SyncStatus for that shard reports 'true' over JMX. The way I reproduced this was by doing two rounds of bringing down the quorum and healing it. All write transactions fail now. Are there any logs you need to help debug this further?
|
|
(In reply to Srini Seetharaman from comment #17)
> Any update on this? I too am seeing this with Boron, even though the
> SyncStatus for that shard reports 'true' over JMX. The way I reproduced this
> was by doing 2 rounds of bring down the quorum and healing. All write
> transactions fail now. Is there any logs you need to help debug this more?
Just to add to my note - my cluster is based on 3 ODL Docker containers running on the same machine. It is running a simple app that monitors the machine's CPU and memory using JMX and does one write transaction to the datastore every second. I'm not stressing the system, but the writes fail with the following commit error on the shard leader:
2017-03-15 02:37:24,099 | WARN | ult-dispatcher-2 | ShardDataTree | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | member-2-shard-infinera-system-operational: Current transaction member-2-datastore-operational-fe-0-txn-5383 has timed out after 15000 ms in state COMMIT_PENDING
2017-03-15 02:37:24,100 | WARN | ult-dispatcher-2 | ShardDataTree | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | member-2-shard-infinera-system-operational: Transaction member-2-datastore-operational-fe-0-txn-5383 is still committing, cannot abort
2017-03-15 02:37:25,690 | ERROR | lt-dispatcher-16 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-2-datastore-operational-fe-0-txn-5824 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-infinera-system-operational#742189228)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[175:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[175:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_111]
2017-03-15 02:37:25,690 | WARN | ult-dispatcher-2 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-6044 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-2-shard-infinera-system-operational#742189228)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[175:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[175:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_111]
|
|
patches merged: https://git.opendaylight.org/gerrit/#/q/message:5280+is:closed
|
|
Thanks! Will this be pushed to Boron-SR3 too?
|
|
This milestone will be merged in Carbon only. It is new development and therefore does not fit the description of a bugfix for service releases.
|
|
Thanks for the note. Without this bug fix, my company's "Powered by OpenDaylight" controller is going to be pretty unstable in a clustered setting, because it just takes two cluster downs to freeze up all subsequent write transactions with the akka timeout errors. What is the rationale for waiting until Carbon and not supporting Boron?
|
|
What do you mean by 'two cluster downs'?
The scope of this issue is a huge rewrite of the frontend/backend communication protocol along with an incompatible journal layout. Therefore I do not think it is appropriate to introduce it in an SR. The soonest this could get into Boron is SR4, which ships after the initial Carbon release...
|
|
(In reply to Robert Varga from comment #23)
> What do you mean by 'two cluster downs'?
In my stable/boron based setup, I do the following and am able to reproduce this timeout issue very easily. Sometimes it takes only one pass through these steps; almost always I hit it within two runs.
1. Bring up three controller instances each within a Docker container.
2. Once the cluster is up and steadily updating the operational store with the memory/CPU usage info, I bring down the cluster by taking down the interfaces of two of the Docker containers.
3. sleep 60 secs
4. Bring the two container interfaces back up and verify the shard members in all three instances. Once the data store sync status says "True" and the log prints "shard-manager-operational: All shard are ready", I look at the logs. I see the following every second after that.
2017-03-15 18:01:47,368 | ERROR | lt-dispatcher-35 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-3-datastore-operational-fe-0-txn-5445 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-3-shard-infinera-system-operational#-949816139)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[175:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[175:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_111]
2017-03-15 18:01:47,368 | WARN | lt-dispatcher-20 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-5445 Error during phase CAN_COMMIT, starting Abort
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-operational/member-3-shard-infinera-system-operational#-949816139)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[175:com.typesafe.akka.actor:2.4.7]
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)[175:com.typesafe.akka.actor:2.4.7]
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)[175:com.typesafe.akka.actor:2.4.7]
at java.lang.Thread.run(Thread.java:745)[:1.8.0_111]
2017-03-15 18:01:47,374 | WARN | r-ControlerTimer | GenericTransactionUtils | 299 - com.infinera.sdn.utils.transaction - 0.1.0.SNAPSHOT | Transaction for add of object ControllerInfoAugment [_controllerRole=Slave, _controllerStatus=Up, _memberName=member-3] failed with error {}
2017-03-15 18:01:57,518 | WARN | lt-dispatcher-33 | ShardDataTree | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | member-3-shard-infinera-system-operational: Current transaction member-3-datastore-operational-fe-0-txn-5380 has timed out after 15000 ms in state COMMIT_PENDING
2017-03-15 18:01:57,519 | WARN | lt-dispatcher-33 | ShardDataTree | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | member-3-shard-infinera-system-operational: Transaction member-3-datastore-operational-fe-0-txn-5380 is still committing, cannot abort
|
|
I wonder whether my akka.conf is incorrect. I pulled the latest from boron-sr2 and am using it. Here it is. Can you comment on whether anything is missing or could potentially trigger this issue? Thanks!
odl-cluster-data {
  akka {
    loglevel = "INFO"
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    logger-startup-timeout = 300s

    remote {
      netty.tcp {
        hostname = "172.17.4.23"
        port = 2550
      }
    }

    cluster {
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@172.17.4.21:2550",
                    "akka.tcp://opendaylight-cluster-data@172.17.4.22:2550",
                    "akka.tcp://opendaylight-cluster-data@172.17.4.23:2550"]
      seed-node-timeout = 12s
      auto-down-unreachable-after = 300s
      roles = ["member-3"]
    }

    persistence {
      journal {
        leveldb {
          # Set native = off to use a Java-only implementation of leveldb.
          # Note that the Java-only version is not currently considered by Akka to be production quality.
          # native = off
        }
      }
    }
  }
}

odl-cluster-rpc {
  akka {
    loglevel = "INFO"
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    logger-startup-timeout = 300s

    remote {
      netty.tcp {
        hostname = "172.17.4.23"
        port = 2551
      }
    }

    cluster {
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@172.17.4.21:2551",
                    "akka.tcp://opendaylight-cluster-data@172.17.4.22:2551",
                    "akka.tcp://opendaylight-cluster-data@172.17.4.23:2551"]
      auto-down-unreachable-after = 300s
    }
  }
}
|
|
This bug is related to timeouts due to high-volume transactions by the BGP plugin. It doesn't seem like yours is the same issue. AskTimeoutExceptions can occur for various reasons.
One thing: you shouldn't set auto-down-unreachable-after. I would suggest re-testing with auto-down-unreachable-after off and, if necessary, with the current Boron-SR3.
(In reply to Srini Seetharaman from comment #25)
> I wonder perhaps if my akka.conf is incorrect. I pulled the latest from
> boron-sr2 and am using it. Here it is. Can you comment on whether anything
> is missing or can potential trigger this issue? Thanks!
>
> odl-cluster-data {
> akka {
> loglevel = "INFO"
> loggers = ["akka.event.slf4j.Slf4jLogger"]
> logger-startup-timeout = 300s
>
> remote {
> netty.tcp {
> hostname = "172.17.4.23"
> port = 2550
> }
> }
>
> cluster {
> seed-nodes = ["akka.tcp://opendaylight-cluster-data@172.17.4.21:2550",
> "akka.tcp://opendaylight-cluster-data@172.17.4.22:2550",
> "akka.tcp://opendaylight-cluster-data@172.17.4.23:2550"]
> seed-node-timeout = 12s
> auto-down-unreachable-after = 300s
> roles = ["member-3"]
> }
>
> persistence {
> journal {
> leveldb {
> # Set native = off to use a Java-only implementation of
> leveldb.
> # Note that the Java-only version is not currently
> considered by Akka to be production quality.
> # native = off
> }
> }
> }
> }
> }
>
> odl-cluster-rpc {
> akka {
> loglevel = "INFO"
> loggers = ["akka.event.slf4j.Slf4jLogger"]
> logger-startup-timeout = 300s
>
> remote {
> netty.tcp {
> hostname = "172.17.4.23"
> port = 2551
> }
> }
>
> cluster {
> seed-nodes = ["akka.tcp://opendaylight-cluster-data@172.17.4.21:2551",
> "akka.tcp://opendaylight-cluster-data@172.17.4.22:2551",
> "akka.tcp://opendaylight-cluster-data@172.17.4.23:2551"]
> auto-down-unreachable-after = 300s
> }
> }
> }
|
|
(In reply to Tom Pantelis from comment #26)
> This bug is related to timeouts due to high-volume transactions by the BGP
> plugin. It doesn't seem like yours is the same issue. AskTimeoutExceptions
> can occur for various reasons.
>
> One thing is you shouldn't set auto-down-unreachable-after. I would suggest
> re-testing with auto-down-unreachable-after off and, if necessary, the
> current Boron-SP3.
Just removing that line didn't help. But reintroducing some of the parameters I used to have pre-Boron solved the issue and made my build stable, with no akka timeouts. Here are the additional settings it had; I'm not sure exactly which one helped.
odl-cluster-data {
+ bounded-mailbox {
+   mailbox-type = "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
+   mailbox-capacity = 1000
+   mailbox-push-timeout-time = 100ms
+ }
+
+ metric-capture-enabled = true
+
  akka {
    loglevel = "INFO"
    loggers = ["akka.event.slf4j.Slf4jLogger"]
    logger-startup-timeout = 300s
+   actor {
+     provider = "akka.cluster.ClusterActorRefProvider"
+     serializers {
+       java = "akka.serialization.JavaSerializer"
+       proto = "akka.remote.serialization.ProtobufSerializer"
+       readylocal = "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransactionSerializer"
+     }
+
+     serialization-bindings {
+       "com.google.protobuf.Message" = proto
+       "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction" = readylocal
+     }
+
+     default-dispatcher {
+       # Setting throughput to 1 makes the dispatcher fair. It processes 1 message from
+       # the mailbox before moving on to the next mailbox
+       throughput = 1
+     }
+
+     default-mailbox {
+       # When not using a BalancingDispatcher it is recommended that we use the SingleConsumerOnlyUnboundedMailbox
+       # as it is the most efficient for multiple producer/single consumer use cases
+       mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
+     }
+   }
    remote {
+     log-remote-lifecycle-events = off
      netty.tcp {
        hostname = "127.0.0.1"
        port = 2550
+       maximum-frame-size = 419430400
+       send-buffer-size = 52428800
+       receive-buffer-size = 52428800
      }
    }
|
|
You don't need all that stuff in akka.conf anymore. All of the necessary settings are in configuration/factory/akka.conf, which gets re-written with updated changes on feature re-install. Take a look at the current configuration/initial/akka.conf - it now contains just the lines needed to set up a cluster - all the rest is inherited from factory/akka.conf:
odl-cluster-data {
  akka {
    remote {
      netty.tcp {
        hostname = "127.0.0.1"
        port = 2550
      }
    }

    cluster {
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@127.0.0.1:2550"]

      roles = [
        "member-1"
      ]
    }
  }
}
Note that the odl-cluster-rpc section is no longer needed nor used.
(In reply to Srini Seetharaman from comment #27)
> (In reply to Tom Pantelis from comment #26)
> > This bug is related to timeouts due to high-volume transactions by the BGP
> > plugin. It doesn't seem like yours is the same issue. AskTimeoutExceptions
> > can occur for various reasons.
> >
> > One thing is you shouldn't set auto-down-unreachable-after. I would suggest
> > re-testing with auto-down-unreachable-after off and, if necessary, the
> > current Boron-SP3.
>
> Just removing that line didn't help. But, reintroducing some of the
> parameters that I used to have from pre-Boron helped solve the issue and
> make my build stable and not have akka-timeouts. Here are the additional
> things it had. Not sure exactly which one helped.
>
> odl-cluster-data {
> + bounded-mailbox {
> + mailbox-type =
> "org.opendaylight.controller.cluster.common.actor.MeteredBoundedMailbox"
> + mailbox-capacity = 1000
> + mailbox-push-timeout-time = 100ms
> + }
> +
> + metric-capture-enabled = true
> +
> akka {
> loglevel = "INFO"
> loggers = ["akka.event.slf4j.Slf4jLogger"]
> logger-startup-timeout = 300s
>
> + actor {
> + provider = "akka.cluster.ClusterActorRefProvider"
> + serializers {
> + java = "akka.serialization.JavaSerializer"
> + proto = "akka.remote.serialization.ProtobufSerializer"
> + readylocal =
> "org.opendaylight.controller.cluster.datastore.messages.
> ReadyLocalTransactionSerializer"
> + }
> +
> + serialization-bindings {
> + "com.google.protobuf.Message" = proto
> +
> "org.opendaylight.controller.cluster.datastore.messages.
> ReadyLocalTransaction" = readylocal
> + }
> +
> + default-dispatcher {
> + # Setting throughput to 1 makes the dispatcher fair. It processes 1
> message from
> + # the mailbox before moving on to the next mailbox
> + throughput = 1
> + }
> +
> + default-mailbox {
> + # When not using a BalancingDispatcher it is recommended that we
> use the SingleConsumerOnlyUnboundedMailbox
> + # as it is the most efficient for multiple producer/single consumer
> use cases
> + mailbox-type="akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
> + }
> + }
> remote {
> + log-remote-lifecycle-events = off
> netty.tcp {
> hostname = "127.0.0.1"
> port = 2550
> + maximum-frame-size = 419430400
> + send-buffer-size = 52428800
> + receive-buffer-size = 52428800
> }
> }
|
|
https://git.opendaylight.org/gerrit/49263 is the last code patch; it is now waiting for review and testing.
|