[CONTROLLER-1742] RetiredGenerationException in cluster sanity suite Created: 24/Jul/17  Updated: 25/Jul/23  Resolved: 14/Sep/18

Status: Verified
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: Fluorine, Oxygen SR3

Type: Bug Priority: High
Reporter: Vratko Polak Assignee: Jamo Luhrsen
Resolution: Cannot Reproduce Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1626 Add an option to allow CDS FE to star... Resolved
Relates
relates to NETVIRT-1315 Troubleshooting Controller CSIT In Progress
External issue ID: 8881
Priority: Highest

 Description   

The "cluster sanity suite" has been created to verify standalone Robot calls to cluster-admin and odl-mdsal-lowlevel RPCs do not fail.
Incidentally, the "standalone" calls are called one after another, sometimes (depending on election winner) leading to produce-transactions failing on "Originating generation was superseded by 1" [0]:

{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"TransactionCommitFailedException{message=submit execution failed, errorList=[RpcError [message=submit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1]]}\n\tat org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)\n\tat org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:87)\n\tat org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:37)\n\tat com.google.common.util.concurrent.Futures$MappingCheckedFuture.mapException(Futures.java:1809)\n\tat com.google.common.util.concurrent.AbstractCheckedFuture.checkedGet(AbstractCheckedFuture.java:114)\n\tat org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.start(ProduceTransactionsHandler.java:79)\n\tat org.opendaylight.controller.clustering.it.provider.MdsalLowLevelTestProvider.produceTransactions(MdsalLowLevelTestProvider.java:570)\n\tat org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)\n\tat org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)\n\tat org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)\n\tat org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)\n\tat Proxy09f0c7cc_0210_4c64_938c_26c684cb1dbe.invokeRpc(Unknown Source)\n\tat org.opendaylight.netconf.sal.restconf.impl.BrokerFacade.invokeRpc(BrokerFacade.java:508)\n\tat org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.invokeRpc(RestconfImpl.java:467)\n\tat org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.invokeRpc(StatisticsRestconfServiceWrapper.java:83)\n\tat org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.invokeRpc(RestconfCompositeWrapper.java:64)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)\n\tat com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)\n\tat com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)\n\tat 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)\n\tat com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)\n\tat com.sun.jersey.server.impl.uri.rules.ResourceObjectRule.accept(ResourceObjectRule.java:100)\n\tat com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)\n\tat com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)\n\tat com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)\n\tat com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)\n\tat com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:716)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:668)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)\n\tat org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:256)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.opendaylight.aaa.filterchain.filters.CustomFilterAdapter.doFilter(CustomFilterAdapter.java:83)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:69)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:240)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:75)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat 
org.eclipse.jetty.server.Server.handle(Server.java:370)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)\n\tat org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1\n\tat org.opendaylight.controller.cluster.datastore.Shard.findFrontend(Shard.java:428)\n\tat org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:442)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:517)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:295)\n\tat org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)\n\tat org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)\n\tat akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)\n\tat akka.actor.Actor$class.aroundReceive(Actor.scala:502)\n\tat akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)\n\tat akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)\n\tat akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)\n\tat akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)\n\tat akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)\n\tat akka.actor.ActorCell.invoke(ActorCell.scala:495)\n\tat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)\n\tat akka.dispatch.Mailbox.run(Mailbox.scala:224)\n\tat akka.dispatch.Mailbox.exec(Mailbox.scala:234)\n\tat scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\n"}]}}

The minimal reproducing steps are not clear yet; this is my current hypothesis:

0. The shard leader (prefix-based shard, tell-based protocol) is moved onto member-1.
1. The shard replica on member-1 is removed and re-added (each time verifying that a leader and followers exist).
2. The shard replica on member-1 (now the first follower) is removed and re-added, again.
3. produce-transactions is called on member-1 (which is sometimes mistakenly detected as the leader of an unrelated shard), resulting in a fairly immediate status code 500.

I believe some of the steps are not required, but I have not been able to confirm that (or even to reproduce the failure successfully, due to typos) on the Sandbox just yet. A rough sketch of the hypothesised RPC sequence is included below the footnote.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/791/log.html.gz#s1-s16-t7-k7
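
For illustration, the hypothesised sequence corresponds roughly to the RESTCONF calls sketched below (Python). This is only a sketch, not the actual suite: the invoke_rpc helper, the member address, the credentials and every RPC input body are assumptions of mine; the authoritative leaf names live in the cluster-admin and odl-mdsal-lowlevel-control YANG models.

# Hedged sketch of the suspected reproducing sequence; not the actual Robot suite.
# Member address, credentials and the RPC input bodies are assumptions.
import requests

MEMBER_1 = "http://10.30.170.1:8181"   # hypothetical address of member-1
AUTH = ("admin", "admin")              # assumed default RESTCONF credentials

def invoke_rpc(rpc, body=None):
    """POST to /restconf/operations/<rpc> on member-1 and fail on non-2xx."""
    resp = requests.post(f"{MEMBER_1}/restconf/operations/{rpc}",
                         json=body, auth=AUTH,
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()            # step 3 fails here with HTTP 500
    return resp.json() if resp.text else {}

# 0. move the prefix-based shard leader onto member-1 (input body assumed)
invoke_rpc("odl-mdsal-lowlevel-control:become-prefix-leader",
           {"input": {"prefix": "/odl-mdsal-lowlevel-target:id-ints"}})

# 1. and 2. remove and re-add the member-1 replica, twice; the suite also verifies
# that a leader and followers exist between these calls (omitted here)
for _ in range(2):
    invoke_rpc("cluster-admin:remove-prefix-shard-replica",
               {"input": {"shard-prefix": "/odl-mdsal-lowlevel-target:id-ints",
                          "member-name": "member-1"}})
    invoke_rpc("cluster-admin:add-prefix-shard-replica",
               {"input": {"shard-prefix": "/odl-mdsal-lowlevel-target:id-ints"}})

# 3. produce-transactions on member-1; this is the call that fails with HTTP 500
invoke_rpc("odl-mdsal-lowlevel-control:produce-transactions",
           {"input": {"id": "prefix-0", "seconds": 30,
                      "transactions-per-second": 1000}})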



 Comments   
Comment by Vratko Polak [ 25/Jul/17 ]

> I believe some steps are not required

At the end, I just run the whole sanity suite as is.

The huge log [1] shows the exception below; the test calls produce-transactions right after 15:00:46,566. I still do not know which preceding actions cause this. The Robot log for comparing timestamps: look before [2].

2017-07-25 15:00:46,859 | ERROR | lt-dispatcher-34 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-Shard-id-ints!: current generation ClientIdentifier{frontend=member-1-frontend-datastore-Shard-id-ints!, generation=0} has been superseded
org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1
at org.opendaylight.controller.cluster.datastore.Shard.findFrontend(Shard.java:428)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:442)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:517)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.2.SNAPSHOT]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:295)[199:org.opendaylight.controller.sal-distributed-datastore:1.5.2.SNAPSHOT]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)[193:org.opendaylight.controller.sal-akka-raft:1.5.2.SNAPSHOT]
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)[192:org.opendaylight.controller.sal-clustering-commons:1.5.2.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)[192:org.opendaylight.controller.sal-clustering-commons:1.5.2.SNAPSHOT]
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]

[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-san-only-carbon/9/odl1_karaf.log.gz
[2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-san-only-carbon/9/log.html.gz#s1-s2-t7-k7

Comment by Robert Varga [ 24/Aug/17 ]

Is this still happening?

Comment by Vratko Polak [ 18/Sep/17 ]

> Is this still happening?

Yes, this is still happening: [3].

But it is not part of an official suite, nor otherwise a regression, so I was not monitoring the results closely enough.

[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-all-carbon/444/log.html.gz#s1-s16-t7-k8-k1-k1-k1

Comment by Jamo Luhrsen [ 28/Jun/18 ]

By the way, this is still happening, and as we drive to make clustering more stable, this is the kind of sporadic
failure noise we want to get rid of.

Comment by Tom Pantelis [ 28/Jun/18 ]

This is an issue with tell-based only.

Comment by Robert Varga [ 28/Jun/18 ]

Do you have a newer reference?

Does the test perhaps include wiping local data (i.e. akka persistence)?

Comment by Vratko Polak [ 28/Jun/18 ]

> Do you have a newer reference?

It is present in a recent run: the first Robot failure (a RequestTimeoutException) is [4], and the RetiredGenerationException is visible in the member-1 karaf log [5] at 2018-06-28T02:28:05,676 (i.e. in the subsequent test case). I have not checked the other members.

> Does the test perhaps include wiping local data (i.e. akka persistence?)

No, as far as I can see.

[4] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-all-fluorine/130/robot-plugin/log.html.gz#s1-s16-t7-k8-k1-k1-k1
[5] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-all-fluorine/130/odl1_karaf.log.gz

Comment by Robert Varga [ 28/Jun/18 ]

Let's see what info we can glean from https://git.opendaylight.org/gerrit/73529

Comment by Tom Pantelis [ 26/Jul/18 ]

CONTROLLER-1626 is another bug related to the retired-generation checking, although that one involves wiping the persisted store on restart. While wiping the persisted store is not normally recommended, it can happen in a disaster recovery scenario where a new node instance is spun up fresh, or if the persisted store was corrupted.

Comment by Tom Pantelis [ 26/Jul/18 ]

Looking at the robot log in [4] above, the following steps are performed:

  • Create a new prefix-based shard named "id-ints" on all nodes.
  • Initially the leader is odl1. The Become_Prefix_Leader step promotes odl2 to leader via /restconf/operations/odl-mdsal-lowlevel-control:become-prefix-leader.
  • The Remove_Leader_Prefix_Shard_Replica_And_Add_It_Back step issues /restconf/operations/cluster-admin:remove-prefix-shard-replica to odl2 to remove the "id-ints" shard replica.
  • The shard is re-added on odl2 via /restconf/operations/cluster-admin:add-prefix-shard-replica.
  • At this point odl3 is the "id-ints" shard leader with odl1 and odl2 as followers.
  • Remove_Follower_Prefix_Shard_Replica_And_Add_It_Back picks the first follower, odl1, and removes the shard replica.
  • The shard is re-added on odl1. odl3 remains leader with followers odl1 and odl2, as expected.
  • Produce_Transactions_One_Node_Leader is a little unclear to me at this point. The first part of it extracts the leader, odl2, for the default-config shard; I am not sure why it does this. It then issues the /restconf/operations/odl-mdsal-lowlevel-control:produce-transactions RPC to odl2 to produce 1000 txns per sec for 30 sec (a rough sketch of what that RPC appears to do follows below). The code for that is in MdsalLowLevelTestProvider. It uses the new data tree producer APIs and I think it commits txns to the "id-ints" shard, or at least that would seem to make sense. This RPC blocks and returned the RequestTimeoutException error.
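
A rough Python sketch of what produce-transactions appears to do, per the reading above. The real implementation is Java (ProduceTransactionsHandler) built on the data tree producer API, so the names, pacing and future handling here are illustrative assumptions only:

import time

def produce_transactions(submit_tx, seconds=30, tps=1000):
    """Submit roughly `tps` transactions per second for `seconds`, then block
    until every submitted transaction has committed; the first commit failure
    becomes the RPC's error result (HTTP 500 via RESTCONF)."""
    deadline = time.monotonic() + seconds
    futures = []
    while time.monotonic() < deadline:
        futures.append(submit_tx())   # submit_tx(): write one item under id-ints,
        time.sleep(1.0 / tps)         # returning a Future-like object
    for f in futures:
        f.result()                    # blocks; a backend RetiredGenerationException
                                      # or a request timeout surfaces here

Depending on where the first commit fails, the error can come back almost immediately (as in the description above) or only after the RPC has blocked for a while (as with the RequestTimeoutException in [4]).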

So this test suite is exercising the prefix-based shard and data tree producer functionality, neither of which is currently used to my knowledge, at least not in any upstream project.

So it appears the RetiredGenerationException failure in this test is related to removing and re-adding a dynamically-created prefix-based shard, where the FE generation ID reverts to 0 while the existing shard leader's cached state still contains generation ID 1 from the previous incarnation. Perhaps when a prefix shard replica is removed, the leader needs to flush its cached state, but that's just a guess at this point.
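
A minimal sketch of the check being hit, assuming simplified types; this is not the actual Shard.findFrontend code, only the shape of the comparison: the leader remembers the highest generation it has seen per frontend and rejects any request carrying a lower one.

class RetiredGenerationException(Exception):
    pass

class LeaderFrontendState:
    """Illustrative stand-in for the shard leader's cached per-frontend state."""
    def __init__(self):
        self._last_seen = {}   # frontend id -> highest generation seen so far

    def check_request(self, frontend_id, generation):
        last = self._last_seen.get(frontend_id, -1)
        if generation < last:
            # the situation in the karaf log: generation 0 arrives after the
            # leader has already seen generation 1 for the same frontend
            raise RetiredGenerationException(
                f"Originating generation was superseded by {last}")
        self._last_seen[frontend_id] = generation

leader = LeaderFrontendState()
leader.check_request("member-1-frontend-datastore-Shard-id-ints", 1)  # earlier incarnation
leader.check_request("member-1-frontend-datastore-Shard-id-ints", 0)  # raises

If removing the prefix shard replica also flushed this cached state on the leader, as guessed above, the re-added replica's generation 0 would be accepted again.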

Comment by Jamo Luhrsen [ 30/Jul/18 ]

> So this test suite is exercising the prefix-based shard and data tree producer functionality, neither of which is currently used to my knowledge, at least not in any upstream project.

I know it's probably too simple-minded, but do we need to keep this functionality around if it's not used? If we remove
the functionality and any tests exercising it, we reduce the amount of noise we are dealing with while trying to stabilize
the controller 3node csit.

Comment by Robert Varga [ 30/Jul/18 ]

Splitting it off would be fair; we do not want to stop testing this, though – strategic target and all...

Comment by Jamo Luhrsen [ 30/Jul/18 ]

what does "splitting off" mean?

Comment by Robert Varga [ 30/Jul/18 ]

... into a separate job, I meant to say

Comment by Tom Pantelis [ 30/Jul/18 ]

The functionality may be useful in the future, specifically for scale out use cases

Comment by Jamo Luhrsen [ 30/Jul/18 ]

Well then, maybe it's worth it now, while things are hotter than usual, to actually dig in and see if we
can get a fix? Instead of making a new job that will just go stale, be forgotten, and become a giant
pain in someone's side when they need to look at it.

Comment by Robert Varga [ 02/Aug/18 ]

Based on the discussion in the kernel call, where it was raised that people tend to wipe even frontend persistence (as opposed to backend persistence, which contains the datastore data): if we confirm this is indeed the issue, we need to provide a way for the operator to bump the frontend generation for a particular ODL member. Once that generation is bumped and the node restarted, the problem should be cleared.

Comment by Tom Pantelis [ 02/Aug/18 ]

End users of course aren't aware of frontend vs backend persistence and frankly shouldn't be - that's a technical implementation detail. In a disaster recovery scenario, a node will naturally restart clean. This needs to work seamlessly as it does today. Users/operators/integrators should not have to worry about knowing to somehow bump a generation ID (and per shard) and what value to bump it to, IMO.  I certainly would not want to have to field a support ticket to explain it to an end user and fix up their environment.

That is not the issue in this Jira, but it is a bigger issue we'll have to address before switching to tell-based.

Comment by Jamo Luhrsen [ 25/Aug/18 ]

rovarga, I finally had time to dig through this job looking for what I assume is the smoking
gun (seeing "Originating generation was superseded by 1") in a karaf.log file.

The most recent job was #130, in odl2's log. But the patch above, which you were hoping could help us glean something new,
was merged that same day (Jun 28th). I'm not sure if it was before or after this job ran. Still, that's very curious.

Anyway, no other jobs have seen this message in any of the 3 controller log files since June 28th. I do have 13 examples
on our logs server from before the 28th (jobs 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129).

Is this bug no longer reproducible in this job, or possibly fixed?

Comment by Jamo Luhrsen [ 14/Sep/18 ]

This has not occurred since 6/28 with this job. There have been 80 jobs run since
the last occurrence. Marking this resolved as "cannot reproduce". We can re-open if it's seen
again.
