[CONTROLLER-1626] Add an option to allow CDS FE to start its generation counting from non-zero Created: 03/Apr/17 Updated: 25/Jul/23 Resolved: 03/Nov/19 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | Magnesium, Neon SR3, Sodium SR2 |
| Type: | Improvement | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| Description |
|
Pre-Carbon clustering was allowing this accidentally. The new client-backend datastore implementation generally runs into RetiredGenerationException instead.

2017-04-02 05:06:03,789 | WARN | ote-dispatcher-7 | AbstractShardBackendResolver | 206 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.CompletionException: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 2
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at scala.concurrent.java8.FuturesConvertersImpl$CF.apply(FutureConvertersImpl.scala:21)
    at scala.concurrent.java8.FuturesConvertersImpl$CF.apply(FutureConvertersImpl.scala:18)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
    at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
    at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:988)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:452)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 2
    at org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:390)
    at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:428)
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:296)
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:272)
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
    ... 9 more

2017-04-02 05:06:03,801 | ERROR | ult-dispatcher-2 | ClientActorBehavior | 204 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-Shard-prefix-configuration-shard: failed to resolve shard 0
java.util.concurrent.CompletionException: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 2
    at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
    at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
    at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
    at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
    at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
    at scala.concurrent.java8.FuturesConvertersImpl$CF.apply(FutureConvertersImpl.scala:21)
    at scala.concurrent.java8.FuturesConvertersImpl$CF.apply(FutureConvertersImpl.scala:18)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
    at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
    at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:988)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
    at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:452)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 2
    at org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:390)
    at org.opendaylight.controller.cluster.datastore.Shard.handleConnectClient(Shard.java:428)
    at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:296)
    at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:272)
    at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)
    at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
    at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
    at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
    at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
    at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)
    at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
    at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
    ... 9 more

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/620/archives/log.html.gz#s1-s5-t3-k3-k5-k1-k2-k1-k1-k2-k1-k4-k1 |
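For orientation, the rejection above boils down to a generation comparison: the shard remembers the highest client generation it has seen for a frontend and refuses connections from anything older. A minimal, self-contained sketch of that idea follows; the class and method names are invented for illustration and are not the actual Shard/FrontendMetadata code.

{code:java}
// Illustrative sketch only: simplified, not the real
// org.opendaylight.controller.cluster.datastore.Shard implementation.
import java.util.HashMap;
import java.util.Map;

final class GenerationGuard {
    // Highest generation observed per frontend member, as remembered by the shard leader.
    private final Map<String, Long> lastObservedGeneration = new HashMap<>();

    /**
     * Accept or reject a connecting frontend. A frontend that lost its persisted
     * state restarts with generation 0 and is rejected here, which is the
     * RetiredGenerationException scenario from the log above.
     */
    void connectClient(String memberName, long clientGeneration) {
        final Long known = lastObservedGeneration.get(memberName);
        if (known != null && clientGeneration < known) {
            throw new IllegalStateException(
                "Originating generation was superseded by " + known);
        }
        lastObservedGeneration.put(memberName, clientGeneration);
    }
}
{code}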
| Comments |
| Comment by Vratko Polak [ 04/Apr/17 ] |
|
> See failure in csit [0] when journal and snapshots directories are deleted before re-start.

Unfortunately, Sandbox run [2] shows the suite also fails without deleting the directories. There may be an unrelated bug, and this is just a temporary side effect.

[2] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-all-carbon/1 |
| Comment by Robert Varga [ 11/Apr/17 ] |
|
The only way to fix this is to resync client generations across all member nodes when we bring the frontend up. That is extremely problematic, given that not all nodes may actually be present. While this is worth considering, it certainly lacks any sort of priority. On the integration side: do not wipe the entire persistence, just the backend store bits. |
| Comment by Vratko Polak [ 25/Apr/17 ] |
|
I finally looked at the Raft algorithm. It really only works if persistence is reliable. If a member loses its persisted data on restart, it has to be treated as a brand new cluster member, which includes all the operations for cluster reconfiguration (remove the old incarnation from Raft, add the new incarnation). As the usual "member-2" identifier does not change with a new incarnation, some random nonce should be generated (or restored from persisted data) when a member starts. I feel "how to make Raft work when persisted data can be lost" is a problem somebody has already solved; it is just hard to google for. |
| Comment by Robert Varga [ 25/Apr/17 ] |
|
The key thing to realize here is that you are using 'persistence' as a black box. There are two separate sets of data in akka persistence:
- frontend (client) state, which is private to the node and not replicated
- backend (shard) state, i.e. the replicated journal and snapshots
You cannot assume that you can wipe the entire persistence store, as it also contains non-replicated private data. If you wipe the shard journal only, things should recover. |
| Comment by Tom Pantelis [ 25/Apr/17 ] |
|
(In reply to Vratko Polák from comment #3) Raft works fine if a member is restarted with its persisted data cleaned. On restart the leader will install a snapshot to sync it up and restore its state to match the leader's state. |
| Comment by Tom Pantelis [ 25/Apr/17 ] |
|
(In reply to Robert Varga from comment #4) If you wipe the shard journal, you also have to wipe any snapshots for those shards.

I'm not entirely clear on what the test does, but it seems the problem is that a member is restarted with a clean persistence store and thus restarts with a generation ID of 0, while the remote shard still has cached front-end state with generation 2. Since the new generation ID is lower, it rejects it as retired/stale.

Is it necessary for the test to wipe the persistence store? If so, perhaps it could retain just the front-end generation state, at least for now so it doesn't fail.

Longer term I think we need to revisit the retired generation logic. Does it need to fail if the new generation ID < the cached generation? At what point does the shard purge the cached front-end state? This scenario is valid and will occur in production where a node is restarted clean for some reason. It will also occur when a node is restored from a backup; for the latter, perhaps the front-end generation records should be included in the backup. |
| Comment by Vratko Polak [ 25/Apr/17 ] |
|
> do not wipe the entire persistence

By the way, we have stopped wiping altogether now [3], so this no longer affects existing CSIT results. |
| Comment by Vratko Polak [ 25/Apr/17 ] |
|
> revisit the retired generation logic

Is the generation number just (part of) an identifier of a session between client and backend? Then the client should not be the only party deciding its value. I envision this: |
| Comment by Colin Dixon [ 08/May/17 ] |
|
I agree with Vratko that the generation-ID (if it's required for correct functioning of clustering) really needs to be something that can be learned by clients and is consistently stored in the datastore, so that it doesn't need to be somehow magically recreated and can't accidentally end up with an out-of-range value that hangs everything. |
| Comment by Tom Pantelis [ 08/May/17 ] |
|
No matter who generates/stores the generation ID (FE or BE), it still needs to be persisted, and it will suffer the same problem if the persistent store is wiped clean.

The reason for the generation ID in the first place was to alleviate the issue where a read tx was created with, say, ID 10 on a remote leader, and then the node that created the tx was restarted. If it cycled fast enough, which tends to happen in robot tests, it could generate ID 10 again before the remote leader had timed out and removed it. Hence the generation ID is generated and persisted by the FE.

I think we should just remove the retired generation ID checking in the shard altogether. |
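To make the described collision concrete, here is a small sketch in which transaction identity is scoped by (member, generation, local counter); the record and values below are invented for illustration and do not mirror the actual classes in org.opendaylight.controller.cluster.access.concepts.

{code:java}
// Hypothetical sketch; the real identifiers have a different shape.
record TxId(String member, long generation, long txCounter) { }

class Example {
    public static void main(String[] args) {
        TxId beforeRestart = new TxId("member-2", 0, 10);
        // Without a generation bump, a fast restart can reproduce the same id:
        TxId afterRestartSameGen = new TxId("member-2", 0, 10);
        // With the generation persisted and incremented, the ids stay distinct:
        TxId afterRestartNewGen = new TxId("member-2", 1, 10);

        System.out.println(beforeRestart.equals(afterRestartSameGen)); // true  -> collision
        System.out.println(beforeRestart.equals(afterRestartNewGen));  // false -> safe
    }
}
{code}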
| Comment by Vratko Polak [ 15/Jun/17 ] |
|
> No matter who generates/stores the generation Id (FE or BE), it still needs

The backend becomes active only on the shard leader, and to become the leader, a majority has to agree that the reported state is up to date enough. A reply from the backend may inform the frontend to wait for sync. That would help if everything important is in the Raft committed state (as opposed to just in the locally persisted store).

> I think we should just remove the retired generation Id checking altogether

That would be best (if possible). |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
No, we should not: the check is there to detect a FE inconsistency. If you wipe frontend state, that is a manual and exceptional transition which requires a manual recovery – it must be ensured that the newer FE generation will never (ever) come around. The proper way of addressing this is to add an operator option for the newly-created FE state to start counting not at 0, but at some operator-specified number. |
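A rough sketch of what such an option could look like on the frontend side, assuming a hypothetical system property name; the real option name and wiring would be decided by the actual fix.

{code:java}
// Illustrative only: the property name below is an assumption, not the shipped option.
final class InitialGeneration {
    /**
     * Pick the generation a freshly-created frontend starts with: normally 0,
     * but an operator can raise the floor after wiping persistence so the new
     * incarnation is newer than anything the backends have cached.
     */
    static long initialGeneration() {
        final String override = System.getProperty(
            "org.opendaylight.controller.cluster.datastore.initial-generation", "0");
        return Math.max(0L, Long.parseLong(override));
    }
}
{code}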
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
So if the journal/snapshots are cleaned for DR, the end-user operator has to perform a manual operation to alleviate this issue? IMO, that is not acceptable for those that have to document/explain and support it - currently this scenario is seamless and should remain that way. skitt vorburger what do you guys think from the RH downstream perspective? |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
I think the removal of journal/snapshots needs to be documented to differentiate between frontend and backend state – if differentiating in documentation is not enough, we need to reconfigure akka persistence to use two separate directories. |
| Comment by Stephen Kitt [ 14/Nov/18 ] |
|
I agree with tpantelis; we run into this issue fairly regularly (thanks for the ping, I’m investigating a bug just like this...). Incidentally, what is the manual operation? Does it exist? |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
It does not currently exist – we are still discussing how to fix it. There are three options:
|
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
So the only scenario I know of where this occurs is the case where the journal/snapshots are deleted, and the FE counter is thus reset to 0. In this case it's OK to ignore the inconsistency and clear the stale state. rovarga what other real production scenario(s) do you see where this could occur? |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
The counter guards against the old generation being left behind – for example, a failure of the FE to die completely, such that two generations end up running, or its messages still being somewhere in flight (akka, requeues, whatever) and arriving after the new generation has established contact and become operational. Granted, this is largely hypothetical, but if such a bug ever happens, we could very easily end up with corrupted data, and diagnosing the root cause would be extremely problematic. Wiping frontend state looks exactly like this sort of bug to the backend, and hence the guard kicks in. The solution is either to adjust the FE so it is newer than anything observed by the backend, or to purge all knowledge of all previous generations from the BE. I think the former is a lot easier. |
| Comment by Vratko Polak [ 14/Nov/18 ] |
|
>> 3. implement and document the manual generation bump
> adjust the FE so it is newer than anything observed by the backend

Why does it have to be manual? On startup, the frontend could ask the backend for the latest observed generation and use that+1 as its generation. |
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
Yeah - I think if the backend sees that the new FE counter is 0, it can assume the FE was restarted clean, purge the cached state, and proceed. |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
That would mean that startup needs to be blocked until we have connectivity to all backend shards and they are stable and can give us authoritative information. That communication is also subject to round-trip delays, hence by the time we get that information it is already stale. The authoritative source of its current generation is the frontend – it's eight bytes of information, which simply needs to be restored without relying on normal cluster communication. |
| Comment by Robert Varga [ 14/Nov/18 ] |
|
That won't work: if there are ever two FEs with counter=0, you will mistake them for the same entity, which will break like hell. Furthermore, various shards will observe the resurrected FE at different times, possibly with the FE restarting in between – at which point, when the BE sees the resurrected FE, counter != 0 --> boom. Guys, I have thought about this long and hard, and a locally-maintained monotonic counter is really the only safe option. |
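For illustration, a locally-maintained monotonic counter amounts to reading the last persisted value and writing back its successor before the frontend talks to any shard. A sketch under the assumption of a simple file-backed store; the file layout is invented, and the real CDS frontend keeps its generation in its own persisted state rather than a plain file.

{code:java}
// Sketch only: shows the monotonic-bump idea, not the actual CDS persistence path.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class GenerationStore {
    private final Path file;

    GenerationStore(Path file) {
        this.file = file;
    }

    /** Read the last generation, bump it, and persist it before use. */
    long nextGeneration() throws IOException {
        long last = 0;
        if (Files.exists(file)) {
            last = Long.parseLong(Files.readAllLines(file).get(0).trim());
        }
        final long next = last + 1;
        Files.write(file, List.of(Long.toString(next)));
        return next;
    }
}
{code}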
| Comment by Tom Pantelis [ 14/Nov/18 ] |
|
Then perhaps let the BE throw the exception and have the FE realize that it was just restarted, sync to the BE's generation # and persist it, then re-issue the request. Let's discuss this next Tue. As it stands now, this introduces a failure scenario with the tell-based protocol that wasn't present before, and requiring some manual operator intervention to resolve it is not acceptable/viable in a real production environment. |
| Comment by Robert Varga [ 28/Nov/18 ] |
|
The simplest way I can see out of this is to instruct shards to forget about the FE's existence – i.e. have them remove the client from FrontendMetadata and persist a record of that purge. |
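In sketch form, such a purge would be an explicit, persisted "forget this client" step on the backend; the names below are invented for illustration and are not the actual FrontendMetadata API.

{code:java}
// Illustrative pseudo-implementation of the proposed purge; the real change
// would go through the shard's persisted FrontendMetadata, not a plain map.
import java.util.HashMap;
import java.util.Map;

final class FrontendTracker {
    private final Map<String, Long> lastObservedGeneration = new HashMap<>();

    /** Forget everything known about a frontend and record that we did so. */
    void purgeClient(String memberName) {
        lastObservedGeneration.remove(memberName);
        persistPurgeRecord(memberName);
    }

    private void persistPurgeRecord(String memberName) {
        // Stand-in for persisting a purge record to the shard journal so the
        // forget survives backend restarts.
        System.out.println("persisted purge of " + memberName);
    }
}
{code}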
| Comment by Robert Varga [ 28/Nov/18 ] |
|
Going off for the year, putting this back into the backlog. |
| Comment by Vratko Polak [ 04/Nov/19 ] |
|
> Add an option to allow CDS FE to start its generation counting from non-zero

Ok. We are missing a launcher script that would execute the call from 85373 and use the response to set the correct value for 85397. |