[CONTROLLER-1742] RetiredGenerationException in cluster sanity suite Created: 24/Jul/17 Updated: 25/Jul/23 Resolved: 14/Sep/18 |
|
| Status: | Verified |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine, Oxygen SR3 |
| Type: | Bug | Priority: | High |
| Reporter: | Vratko Polak | Assignee: | Jamo Luhrsen |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | csit:3node |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Issue Links: | |
| External issue ID: | 8881 |
| Priority: | Highest |
| Description |
|
The "cluster sanity suite" has been created to verify standalone Robot calls to cluster-admin and odl-mdsal-lowlevel RPCs do not fail. {"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Unexpected-exception","error-info":"TransactionCommitFailedException{message=submit execution failed, errorList=[RpcError [message=submit execution failed, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1]]}\n\tat org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)\n\tat org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)\n\tat org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:87)\n\tat org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:37)\n\tat com.google.common.util.concurrent.Futures$MappingCheckedFuture.mapException(Futures.java:1809)\n\tat com.google.common.util.concurrent.AbstractCheckedFuture.checkedGet(AbstractCheckedFuture.java:114)\n\tat org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.start(ProduceTransactionsHandler.java:79)\n\tat org.opendaylight.controller.clustering.it.provider.MdsalLowLevelTestProvider.produceTransactions(MdsalLowLevelTestProvider.java:570)\n\tat org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)\n\tat org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)\n\tat org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)\n\tat org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)\n\tat org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)\n\tat Proxy09f0c7cc_0210_4c64_938c_26c684cb1dbe.invokeRpc(Unknown Source)\n\tat org.opendaylight.netconf.sal.restconf.impl.BrokerFacade.invokeRpc(BrokerFacade.java:508)\n\tat org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.invokeRpc(RestconfImpl.java:467)\n\tat org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.invokeRpc(StatisticsRestconfServiceWrapper.java:83)\n\tat org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.invokeRpc(RestconfCompositeWrapper.java:64)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)\n\tat com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)\n\tat 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)\n\tat com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)\n\tat com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)\n\tat com.sun.jersey.server.impl.uri.rules.ResourceObjectRule.accept(ResourceObjectRule.java:100)\n\tat com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)\n\tat com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)\n\tat com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)\n\tat com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)\n\tat com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)\n\tat com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:716)\n\tat javax.servlet.http.HttpServlet.service(HttpServlet.java:668)\n\tat org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)\n\tat org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)\n\tat org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:256)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.opendaylight.aaa.filterchain.filters.CustomFilterAdapter.doFilter(CustomFilterAdapter.java:83)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:69)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)\n\tat org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:240)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)\n\tat 
org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:75)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:370)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)\n\tat org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)\n\tat org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)\n\tat org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)\n\tat org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)\n\tat org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)\n\tat org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: org.opendaylight.controller.cluster.access.concepts.RetiredGenerationException: Originating generation was superseded by 1\n\tat org.opendaylight.controller.cluster.datastore.Shard.findFrontend(Shard.java:428)\n\tat org.opendaylight.controller.cluster.datastore.Shard.getFrontend(Shard.java:442)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:517)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:295)\n\tat org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)\n\tat org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)\n\tat akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)\n\tat akka.actor.Actor$class.aroundReceive(Actor.scala:502)\n\tat akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)\n\tat akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)\n\tat akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)\n\tat akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)\n\tat akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)\n\tat akka.actor.ActorCell.invoke(ActorCell.scala:495)\n\tat akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)\n\tat akka.dispatch.Mailbox.run(Mailbox.scala:224)\n\tat akka.dispatch.Mailbox.exec(Mailbox.scala:234)\n\tat scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\n"}]}}
The minimal reproducing steps are not clear yet; this is my current hypothesis:

0. The shard leader (prefix-based shard, tell-based protocol) is moved onto member-1.

I believe some steps are not required, but I was not able to confirm (or even successfully reproduce, due to typos) on Sandbox just yet. |
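For anyone trying to reproduce this outside of Robot, the failing call is the produce-transactions RPC visible in the stack trace (ProduceTransactionsHandler / MdsalLowLevelTestProvider). Below is a minimal sketch of an equivalent RESTCONF invocation in Java; the module name, input leaves, host, and credentials are assumptions and may need adjusting to match the actual odl-mdsal-lowlevel-control model and the test environment.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ProduceTransactionsCall {
    public static void main(String[] args) throws Exception {
        // Assumed member address and default credentials; the suite runs against a 3-node cluster.
        String host = "http://127.0.0.1:8181";
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        // Input leaves are an assumption based on the odl-mdsal-lowlevel-control model;
        // adjust names/values to whatever the suite actually sends.
        String body = "{\"input\": {\"id\": \"prefix-1\", \"seconds\": 30, \"transactions-per-second\": 100}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(host
                        + "/restconf/operations/odl-mdsal-lowlevel-control:produce-transactions"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Basic " + auth)
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // A 500 response carrying the error JSON quoted above is the failure this suite catches;
        // a 200 means the RPC ran to completion.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```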
| Comments |
| Comment by Vratko Polak [ 25/Jul/17 ] |
|
> I believe some steps are not required

In the end, I just ran the whole sanity suite as is. The huge log [1] shows the exception below; the test calls produce-transactions right after 15:00:46,566. I still do not know which preceding actions are causing this. The robot log to compare timestamps: look before [2].

2017-07-25 15:00:46,859 | ERROR | lt-dispatcher-34 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-Shard-id-ints!: current generation ClientIdentifier {frontend=member-1-frontend-datastore-Shard-id-ints!, generation=0} has been superseded

[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-san-only-carbon/9/odl1_karaf.log.gz |
| Comment by Robert Varga [ 24/Aug/17 ] |
|
Is this still happening? |
| Comment by Vratko Polak [ 18/Sep/17 ] |
|
> Is this still happening?

Yes, this is still happening: [3]. But it is not part of an official suite, or otherwise a regression, so I was not monitoring the results closely enough. |
| Comment by Jamo Luhrsen [ 28/Jun/18 ] |
|
btw, this is still happening, and as we drive to get clustering more stable, these are the kinds of sporadic failures we need to track down. |
| Comment by Tom Pantelis [ 28/Jun/18 ] |
|
This is an issue with the tell-based protocol only. |
| Comment by Robert Varga [ 28/Jun/18 ] |
|
Do you have a newer reference? Does the test perhaps include wiping local data (i.e. akka persistence)? |
| Comment by Vratko Polak [ 28/Jun/18 ] |
|
> Do you have a newer reference?

It is present in a recent run: the first Robot failure (a RequestTimeoutException) is [4], and the RetiredGenerationException is visible in the member-1 karaf log [5], at 2018-06-28T02:28:05,676 (so in the subsequent test case). I have not checked other members.

> Does the test perhaps include wiping local data (i.e. akka persistence)?

No, as far as I can see.

[4] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/controller-csit-3node-clustering-all-fluorine/130/robot-plugin/log.html.gz#s1-s16-t7-k8-k1-k1-k1 |
| Comment by Robert Varga [ 28/Jun/18 ] |
|
Let's see what info we can glean from https://git.opendaylight.org/gerrit/73529 |
| Comment by Tom Pantelis [ 26/Jul/18 ] |
|
Looking at the robot log in [4] above, the following steps are performed:
So this test suite is exercising the prefix-based shard and data tree producer functionality, neither of which is currently used to my knowledge, at least not in any upstream project. So it appears the RetiredGenerationException failure in this test is related to removing and re-adding a dynamically-created prefix-based shard, where the frontend (FE) generation ID reverts back to 0 while the existing shard leader's cached state contains generation ID 1 from the previous incarnation. Perhaps when a prefix shard replica is removed, the leader needs to flush its cached state, but that's just a guess at this point. |
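To make the suspected mechanism concrete, the following is a minimal, purely illustrative sketch of the generation check; it is not the actual Shard.findFrontend() code, but it shows why a frontend whose generation reverts to 0 is rejected once the leader has already recorded generation 1 for the same frontend identity.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: class and method names are made up, the real logic lives in
// Shard.findFrontend() and the cds-access-client generation tracking.
final class FrontendStateCache {

    static final class RetiredGenerationException extends Exception {
        RetiredGenerationException(final long newerGeneration) {
            super("Originating generation was superseded by " + newerGeneration);
        }
    }

    // Frontend identity (member + datastore + shard name) -> highest generation seen so far.
    private final Map<String, Long> lastSeenGeneration = new HashMap<>();

    void checkRequest(final String frontendId, final long requestGeneration)
            throws RetiredGenerationException {
        final Long cached = lastSeenGeneration.get(frontendId);
        if (cached != null && requestGeneration < cached) {
            // The leader already saw a newer incarnation of this frontend, so a request
            // carrying an older generation (e.g. 0 after the prefix shard was removed
            // and re-added) is treated as coming from a retired client.
            throw new RetiredGenerationException(cached);
        }
        lastSeenGeneration.put(frontendId, requestGeneration);
    }
}
```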
| Comment by Jamo Luhrsen [ 30/Jul/18 ] |
I know it's probably too simple-minded, but do we need to keep this functionality around if it's not used? If we remove it, we would not need to keep chasing failures like this in the test suite. |
| Comment by Robert Varga [ 30/Jul/18 ] |
|
Splitting it off would be fair; we do not want to stop testing this, though – strategic target and all... |
| Comment by Jamo Luhrsen [ 30/Jul/18 ] |
|
what does "splitting off" mean? |
| Comment by Robert Varga [ 30/Jul/18 ] |
|
... into a separate job, I meant to say |
| Comment by Tom Pantelis [ 30/Jul/18 ] |
|
The functionality may be useful in the future, specifically for scale-out use cases. |
| Comment by Jamo Luhrsen [ 30/Jul/18 ] |
|
well then, maybe it's worth it now, while things are hotter than usual, to actually dig in and see if we can get to the bottom of it. |
| Comment by Robert Varga [ 02/Aug/18 ] |
|
Based on the discussion in the kernel call, where it was raised that people tend to wipe even frontend persistence (as opposed to backend persistence, which contains datastore data): if we confirm this indeed is the issue, we need to provide a way for the operator to bump the frontend generation for a particular ODL member. Once that generation is bumped and the node restarted, the problem should be cleared. |
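If such an operator-facing knob is added, its core would presumably be nothing more than choosing a generation strictly above anything still cached on the backend, along the lines of this purely hypothetical sketch (no such tool exists today):

```java
// Purely hypothetical: there is no generation-bump tool or API in the codebase today.
final class GenerationBump {
    static long nextGeneration(final long persistedOnMember, final long highestSeenByLeaders) {
        // Pick a value strictly above both the member's own persisted state and whatever
        // any shard leader may still remember, so the restarted node is never treated
        // as a retired (older) incarnation of the frontend.
        return Math.max(persistedOnMember, highestSeenByLeaders) + 1;
    }
}
```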
| Comment by Tom Pantelis [ 02/Aug/18 ] |
|
End users of course aren't aware of frontend vs backend persistence and frankly shouldn't be - that's a technical implementation detail. In a disaster recovery scenario, a node will naturally restart clean. This needs to work seamlessly as it does today. Users/operators/integrators should not have to worry about knowing to somehow bump a generation ID (and per shard) and what value to bump it to, IMO. I certainly would not want to have to field a support ticket to explain it to an end user and fix up their environment. For this Jira, this is not the issue but it is a bigger issue we'll have to address before switching to tell-based. |
| Comment by Jamo Luhrsen [ 25/Aug/18 ] |
|
rovarga, I finally had time to dig through this job looking for what I assume is the smoking gun. The most recent occurrence was in job #130, in odl2's log. But the patch above, which you were hoping could help us glean something new, has not turned up anything yet. Anyway, no other jobs saw this message in any of the 3 controller log files since June 28th. I do have 13 examples from before that. Is this bug no longer reproducible in this job, or possibly fixed? |
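For repeating this kind of log archaeology, a throwaway scanner along the following lines is enough to flag the "has been superseded" marker across downloaded and unpacked karaf logs; the directory layout and file naming are assumptions based on the logs referenced earlier in this issue.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SupersededScanner {
    public static void main(String[] args) throws IOException {
        // Directory of already-downloaded, already-gunzipped odl*_karaf.log files (assumed layout).
        Path logDir = Path.of(args.length > 0 ? args[0] : ".");
        try (Stream<Path> files = Files.walk(logDir)) {
            files.filter(p -> p.getFileName().toString().contains("karaf.log"))
                 .forEach(SupersededScanner::scan);
        }
    }

    private static void scan(Path file) {
        try (Stream<String> lines = Files.lines(file)) {
            lines.filter(l -> l.contains("has been superseded")
                           || l.contains("RetiredGenerationException"))
                 .forEach(l -> System.out.println(file + ": " + l));
        } catch (IOException | UncheckedIOException e) {
            System.err.println("Skipping " + file + ": " + e);
        }
    }
}
```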
| Comment by Jamo Luhrsen [ 14/Sep/18 ] |
|
This has not occurred since 6/28 with this job. There have been 80 jobs run since then. |