[CONTROLLER-1653] C: UnknownHistoryException: Last known history is null; for prefix based shard Created: 04/May/17  Updated: 25/Jul/23  Resolved: 21/Jun/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Peter Gubka Assignee: Robert Varga
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: HTML File log.html    
Issue Links:
Blocks
is blocked by CONTROLLER-1628 Unknown history for committed transaction Resolved
is blocked by CONTROLLER-1664 C: OutOfOrderRequestException: Expect... Resolved
External issue ID: 8371

 Description   

my job: (1 #592)
used odl: https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/distribution-karaf/0.6.0-SNAPSHOT/distribution-karaf-0.6.0-20170503.152424-68.zip

is built from controller=33/56433/5
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/11/

After the prefix-based shard is created, the produce-transactions RPC is triggered on every node. One of the responses contained:

cause=org.opendaylight.controller.cluster.access.commands.UnknownHistoryException: Last known history is null]]}
at org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:37)
at org.opendaylight.mdsal.dom.broker.TransactionCommitFailedExceptionMapper.newWithCause(TransactionCommitFailedExceptionMapper.java:18)
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:87)
at org.opendaylight.yangtools.util.concurrent.ExceptionMapper.apply(ExceptionMapper.java:37)
at com.google.common.util.concurrent.Futures$MappingCheckedFuture.mapException(Futures.java:1809)
at com.google.common.util.concurrent.AbstractCheckedFuture.checkedGet(AbstractCheckedFuture.java:85)
at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.fillInitialList(ProduceTransactionsHandler.java:149)
at org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler.start(ProduceTransactionsHandler.java:118)
at org.opendaylight.controller.clustering.it.provider.MdsalLowLevelTestProvider.produceTransactions(MdsalLowLevelTestProvider.java:540)
at org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:31)
at org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:52)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:83)
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:70)
at org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:39)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:177)
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102)
at Proxy24478179_b535_4a0a_8f9f_b8d4fc200323.invokeRpc(Unknown Source)
at org.opendaylight.netconf.sal.restconf.impl.BrokerFacade.invokeRpc(BrokerFacade.java:508)
at org.opendaylight.netconf.sal.restconf.impl.RestconfImpl.invokeRpc(RestconfImpl.java:467)
at org.opendaylight.netconf.sal.restconf.impl.StatisticsRestconfServiceWrapper.invokeRpc(StatisticsRestconfServiceWrapper.java:83)
at org.opendaylight.netconf.sal.rest.impl.RestconfCompositeWrapper.invokeRpc(RestconfCompositeWrapper.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.ResourceObjectRule.accept(ResourceObjectRule.java:100)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:716)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1507)
at org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:247)
at org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:210)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:82)
at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:256)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
at org.opendaylight.aaa.filterchain.filters.CustomFilterAdapter.doFilter(CustomFilterAdapter.java:83)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:61)
at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
at org.apache.shiro.web.servlet.AdviceFilter.executeChain(AdviceFilter.java:108)
at org.apache.shiro.web.servlet.AdviceFilter.doFilterInternal(AdviceFilter.java:137)
at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
at org.apache.shiro.web.servlet.ProxiedFilterChain.doFilter(ProxiedFilterChain.java:66)
at org.apache.shiro.web.servlet.AbstractShiroFilter.executeChain(AbstractShiroFilter.java:449)
at org.apache.shiro.web.servlet.AbstractShiroFilter$1.call(AbstractShiroFilter.java:365)
at org.apache.shiro.subject.support.SubjectCallable.doCall(SubjectCallable.java:90)
at org.apache.shiro.subject.support.SubjectCallable.call(SubjectCallable.java:83)
at org.apache.shiro.subject.support.DelegatingSubject.execute(DelegatingSubject.java:383)
at org.apache.shiro.web.servlet.AbstractShiroFilter.doFilterInternal(AbstractShiroFilter.java:362)
at org.apache.shiro.web.servlet.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:125)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1478)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
at org.ops4j.pax.web.service.jetty.internal.HttpServiceServletHandler.doHandle(HttpServiceServletHandler.java:69)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.ops4j.pax.web.service.jetty.internal.HttpServiceContext.doHandle(HttpServiceContext.java:240)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.ops4j.pax.web.service.jetty.internal.JettyServerHandlerCollection.handle(JettyServerHandlerCollection.java:75)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:696)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:53)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.opendaylight.controller.cluster.access.commands.UnknownHistoryException: Last known history is null
at org.opendaylight.controller.cluster.datastore.LeaderFrontendState.handleTransactionRequest(LeaderFrontendState.java:192)
at org.opendaylight.controller.cluster.datastore.Shard.handleRequest(Shard.java:450)
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:283)
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)
at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)
at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
at akka.actor.Actor$class.aroundReceive(Actor.scala:497)
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:664)
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Logs are over 1 GB for each node; debug logging was on.



 Comments   
Comment by Peter Gubka [ 04/May/17 ]

Attachment log.html has been added with description: robot log

Comment by Viera Zelcamova [ 09/May/17 ]

Might already be fixed; keep tracking it in case it appears again. Lowering priority.

Comment by Robert Varga [ 16/May/17 ]

Needs to be tested with 8402 and 8159 patches in place.

Comment by Peter Gubka [ 18/May/17 ]

fixed as a side effect

Comment by Vratko Polak [ 18/May/17 ]

The subject exception appeared again [2], albeit in a scenario (brief isolation of prefix-based shard leader) where other producers [3] failed with CONTROLLER-1674 symptom "Timeout waiting for task".

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/694/archives/log.html.gz#s1-s30-t1-k2-k25-k1-k2-k1-k1-k1
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/694/archives/log.html.gz#s1-s30-t1-k2-k25-k1-k1

Comment by Peter Gubka [ 19/May/17 ]

The exception appeared also here: https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/697/archives/log.html.gz#s1-s30-t1-k2-k26-k1-k1
It is the "Leader isolation within transaction timeout" scenario (for prefix-based shards).

Comment by Vratko Polak [ 22/May/17 ]

> brief isolation of prefix-based shard leader
> where other producers [3] failed with CONTROLLER-1674 symptom

This time [5] the other producers also report "Last known history is null".

[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s30-t1-k2-k26-k1-k1

Comment by Peter Gubka [ 23/May/17 ]

New debug logs in
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/5/

The exception itself: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/5/archives/log.html.gz#s1-s2-t1-k2-k27-k1-k2-k1-k1-k1

The ODL zip was built by https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/

Comment by Robert Varga [ 23/May/17 ]

The new set of logs seems to indicate that this is happening after a follower->leader transition:

2017-05-23 13:05:44,012 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config (Candidate): Starting new election term 3
2017-05-23 13:05:44,015 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config (Candidate): LastApplied index 2 is behind last index 4 - switching to PreLeader
2017-05-23 13:05:44,018 | INFO | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!config (Candidate) : Switching from behavior Candidate to PreLeader, election term: 3
2017-05-23 13:05:44,019 | INFO | ult-dispatcher-4 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-config: Received LeaderStateChanged message: LeaderStateChanged [memberId=member-1-shard-id-ints!-config, leaderId=member-1-shard-id-ints!-config, leaderPayloadVersion=5]
2017-05-23 13:05:44,019 | INFO | ult-dispatcher-5 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | RoleChangeNotifier for member-1-shard-id-ints!-config , received role change from Candidate to PreLeader
2017-05-23 13:05:44,019 | INFO | ult-dispatcher-5 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-config: Received role changed for member-1-shard-id-ints!-config from Candidate to PreLeader
2017-05-23 13:05:44,024 | INFO | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!config (PreLeader) : Switching from behavior PreLeader to Leader, election term: 3
2017-05-23 13:05:44,024 | INFO | ult-dispatcher-4 | RoleChangeNotifier | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | RoleChangeNotifier for member-1-shard-id-ints!-config , received role change from PreLeader to Leader
2017-05-23 13:05:44,024 | INFO | lt-dispatcher-31 | ShardManager | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | shard-manager-config: Received role changed for member-1-shard-id-ints!-config from PreLeader to Leader

the previous leader was:

2017-05-23 13:05:45,525 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-id-ints!config (Leader) : Switching from behavior Leader to Follower, election term: 2

and we have moved its journal back:

2017-05-23 13:05:50,830 | INFO | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config (Leader): handleAppendEntriesReply: follower member-3-shard-id-ints!-config lastIndex 8 is ahead of our lastIndex 5 (snapshotIndex 1) - forcing install snaphot

2017-05-23 13:05:50,833 | INFO | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: Initiating snapshot capture CaptureSnapshot [lastAppliedIndex=5, lastAppliedTerm=3, lastIndex=5, lastTerm=3, installSnapshotInitiated=, replicatedToAllIndex=-1, replicatedToAllTerm=-1, unAppliedEntries size=0] to install on member-3-shard-id-ints!-config
2017-05-23 13:05:50,844 | INFO | lt-dispatcher-24 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: Persisting of snapshot done: Snapshot [lastIndex=5, lastTerm=3, lastAppliedIndex=5, lastAppliedTerm=3, unAppliedEntries size=0, state=org.opendaylight.controller.cluster.datastore.persisted.ShardSnapshotState@6e5621d1, electionTerm=3, electionVotedFor=member-1-shard-id-ints!-config, ServerConfigPayload=null]
2017-05-23 13:05:50,844 | INFO | lt-dispatcher-24 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: Removed in-memory snapshotted entries, adjusted snaphsotIndex: 1 and term: 1
2017-05-23 13:05:50,848 | INFO | lt-dispatcher-24 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: SaveSnapshotSuccess received for snapshot, sequenceNr: 13

and then we have rejected a request from that node:

2017-05-23 13:05:59,683 | WARN | ult-dispatcher-4 | LeaderFrontendState | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: rejecting unknown history request ModifyTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-0-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.208:2550/user/$m#-1415893195], operations=[TransactionWrite{path=/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-int/id-int[{(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id=prefix-3}]/item}], protocol=SIMPLE}

So it looks like we lost the history ID during the transition and it was not replayed for some reason. Part of it may be the locality of the client/frontend; I will dig into the member-3 frontend logs to see what went down.

Comment by Robert Varga [ 23/May/17 ]

So it seems there is no actual attempt to re-create the history during replay, which explains the failure.

As it turns out, we only ever send the create request on the initial enqueue and, once it succeeds, we never replay it. This points towards a problem on the backend, as it should not respond before the appropriate payload has been replicated (and then the history would be known to the follower).

Examining the code, ShardDataTree really does call replicatePayload() without a callback. Given that we want to allocate state properly, this needs some more thought.
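
For illustration only, a minimal sketch of the direction discussed above: reply to the history-create request only from a replication callback, rather than immediately. The type and method names here (Replicator, replicatePayload with a Runnable callback, sendSuccess) are assumptions invented for this sketch, not the controller's actual APIs.

// Sketch only: defer the frontend response until the payload carrying the
// history creation has been replicated. All names below are hypothetical.
final class HistoryCreationSketch {

    interface Replicator {
        // Hypothetical: replicate a payload and invoke the callback once
        // consensus/persistence has completed.
        void replicatePayload(Object identifier, Object payload, Runnable onReplicated);
    }

    interface Frontend {
        void sendSuccess(Object historyId);
    }

    private final Replicator replicator;
    private final Frontend frontend;

    HistoryCreationSketch(final Replicator replicator, final Frontend frontend) {
        this.replicator = replicator;
        this.frontend = frontend;
    }

    void handleCreateHistory(final Object historyId, final Object createHistoryPayload) {
        // Do NOT reply right away: if the leader answers before the payload is
        // replicated, a new leader elected after an isolation may never learn
        // about the history, and later requests fail with UnknownHistoryException.
        replicator.replicatePayload(historyId, createHistoryPayload,
            () -> frontend.sendSuccess(historyId));
    }
}

The point of the sketch is the ordering: the success response is sent from the replication callback, so a follower that later becomes leader has already seen the payload and knows about the history.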

Comment by Robert Varga [ 23/May/17 ]

Incorrect state-keeping on the backend: raising to critical.

Comment by Robert Varga [ 24/May/17 ]

https://git.opendaylight.org/gerrit/57751

Comment by Peter Gubka [ 25/May/17 ]

Debug logs (including akka)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/10/

Odl was built from controller=63/57763/2:99/57699/3
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/43

The bug is still present
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/10/archives/log.html.gz#s1-s2-t1-k2-k27-k1-k2-k1-k1-k1
but the new debug run is in the queue.

Comment by Peter Gubka [ 25/May/17 ]

Debug logs (including akka)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/13/

Odl built from controller=70/57770/4:99/57699/3 at
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/45/

Exception present here:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/13/archives/log.html.gz#s1-s2-t1-k2-k27-k1-k2-k1-k1-k1

Comment by Peter Gubka [ 26/May/17 ]

Testing without akka logs:

Odl built from:
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/49/

Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/17/

Link to exception:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/17/archives/log.html.gz#s1-s2-t1-k2-k27-k1-k2-k1-k1-k1

Comment by Robert Varga [ 29/May/17 ]

The latest logs look like the same problem the patch was designed to address. What I can see is that the leader was local and it was supposed to respond only after replication had completed (hence the history should have been known in the logs). Yet what I am seeing is:

2017-05-29 09:57:05,587 | DEBUG | qtp1225240590-76 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Queue ConnectionEntry{request=CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-7-txn-0-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-908684957], coordinated=false}, enqueuedTicks=489198929633} transmitting entry {}
2017-05-29 09:57:08,270 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Isolating node 3
2017-05-29 09:57:17,527 | WARN | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-2-shard-id-ints!-config, logLastIndex=5, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 12552, lastApplied : 2, commitIndex : 2
2017-05-29 09:57:17,528 | DEBUG | ult-dispatcher-4 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Completing request CreateLocalHistoryRequest{target=LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, history=7, cookie=0}, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-908684957]} with Envelope{sessionId=0, txSequence=0, message=LocalHistorySuccess{target=LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, history=7, cookie=0}, sequence=0}}
2017-05-29 09:57:17,528 | WARN | lt-dispatcher-32 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config (Leader) : handleAppendEntriesReply delayed beyond election timeout, appendEntriesReply : AppendEntriesReply [term=1, success=true, followerId=member-1-shard-id-ints!-config, logLastIndex=5, logLastTerm=1, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], timeSinceLastActivity : 12552, lastApplied : 5, commitIndex : 5
2017-05-29 09:57:17,529 | DEBUG | ult-dispatcher-4 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Completing TransmittedConnectionEntry{request=CreateLocalHistoryRequest{target=LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, history=7, cookie=0}, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-908684957]}, enqueuedTicks=488843276375, sessionId=0, txSequence=0} with Envelope{sessionId=0, txSequence=0, message=LocalHistorySuccess{target=LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, history=7, cookie=0}, sequence=0}}
2017-05-29 09:57:17,529 | DEBUG | ult-dispatcher-4 | AbstractClientHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Create history response LocalHistorySuccess{target=LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, history=7, cookie=0}, sequence=0}
2017-05-29 09:57:23,948 | WARN | ult-dispatcher-5 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Current transaction member-3-datastore-Shard-id-ints!-fe-0-chn-7-txn-0-0 has timed out after 15000 ms in state COMMIT_PENDING
2017-05-29 09:57:23,949 | WARN | ult-dispatcher-5 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Transaction member-3-datastore-Shard-id-ints!-fe-0-chn-7-txn-0-0 is still committing, cannot abort
2017-05-29 09:57:26,118 | WARN | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-car-people-operational: At least 1 followers need to be active, Switching member-3-shard-car-people-operational from Leader to IsolatedLeader
2017-05-29 09:57:26,118 | WARN | ult-dispatcher-3 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-prefix-configuration-shard-config: At least 1 followers need to be active, Switching member-3-shard-prefix-configuration-shard-config from Leader to IsolatedLeader
2017-05-29 09:57:26,118 | WARN | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-inventory-operational: At least 1 followers need to be active, Switching member-3-shard-inventory-operational from Leader to IsolatedLeader
2017-05-29 09:57:26,118 | WARN | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-people-operational: At least 1 followers need to be active, Switching member-3-shard-people-operational from Leader to IsolatedLeader
2017-05-29 09:57:26,119 | WARN | lt-dispatcher-20 | EntityOwnershipShard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-entity-ownership-operational: At least 1 followers need to be active, Switching member-3-shard-entity-ownership-operational from Leader to IsolatedLeader
2017-05-29 09:57:26,119 | WARN | ult-dispatcher-5 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.1.SNAPSHOT | member-3-shard-topology-operational: At least 1 followers need to be active, Switching member-3-shard-topology-operational from Leader to IsolatedLeader

I assume this run included https://git.opendaylight.org/gerrit/57751, which would mean that ShardDataTree saw the data as replicated when in fact it was not (at least not visible on the new leader).

Comment by Robert Varga [ 29/May/17 ]

Tom, is it possible that AppendEntries batching or something around it is playing tricks on us?

I mean, in the case of leader isolation, is there any way we can see a payload as replicated, but the next leader does not see it (which would be a Raft violation, I think)?

Comment by Tom Pantelis [ 29/May/17 ]

(In reply to Robert Varga from comment #16)
> Tom, is it possible that appendentries batching or something around this is
> playing tricks on us?
>
> I mean, in case of going into leader isolation, is there any way we can see
> a payload as replicated, but then the next leader does not see it (which
> would be raft violation, I think) ?

That shouldn't happen. In a 3-node cluster, if a payload was replicated then at least one of the other nodes has at least persisted it, even if it is not yet committed/applied to state. If it was replicated to, say, node2 but not node3, and the leader is then isolated and an election occurs in the other partition, then node2 should become leader because its log is more up-to-date.
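
For reference, a sketch of the Raft "log at least as up-to-date" comparison that this argument relies on (the election restriction from the Raft paper; illustrative only, not the controller's RaftActor code):

// Illustrative only: the vote-granting check from the Raft election restriction.
final class RaftLogComparison {
    // Returns true if the candidate's log is at least as up-to-date as the voter's,
    // which is the condition under which the voter may grant its vote.
    static boolean candidateLogUpToDate(final long candidateLastTerm, final long candidateLastIndex,
            final long voterLastTerm, final long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            // The log ending in the higher term is more up-to-date.
            return candidateLastTerm > voterLastTerm;
        }
        // Same last term: the longer log is more up-to-date.
        return candidateLastIndex >= voterLastIndex;
    }
}

A voter grants its vote only to a candidate whose log passes this check, so a majority cannot form around a candidate that is missing an entry already persisted by one of the voters.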

Comment by Robert Varga [ 29/May/17 ]

Right. Now I am fishing a bit: does that also mean it will get committed to state by odl2?

I think the intent of onPayloadReplicated at shard level is to say that the entry is safely replicated no matter what and I am pretty sure (but will recheck) that we respond to the create request only after persistence comes back.

Comment by Tom Pantelis [ 29/May/17 ]

(In reply to Robert Varga from comment #18)
> Right, now I am fishing a bit, does that also mean it will get committed to
> state by odl2?
>
> I think the intent of onPayloadReplicated at shard level is to say that the
> entry is safely replicated no matter what and I am pretty sure (but will
> recheck) that we respond to the create request only after persistence comes
> back.

I assume you mean Shard#applyState -> ShardDataTree#applyReplicatedPayload. applyState is called when an entry is committed after consensus is reached.

When a node wins an election, it will first try to commit/apply any unapplied entries from the prior term before switching to Leader (PreLeader state). It does this by persisting a NoopPayload with the new term - if that gets consensus then the NoopPayload and all prior entries are applied.
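
A hedged sketch of that PreLeader sequence, with made-up names (ReplicatedLog, appendAndReplicate, applyUpTo) standing in for the real RaftActor machinery:

// Illustrative-only sketch of the PreLeader step described above: after winning an
// election, replicate a no-op entry in the new term; only once that entry reaches
// consensus are it and all prior unapplied entries applied, and the node switches
// to full Leader. All names here are assumptions, not the controller's actual APIs.
final class PreLeaderSketch {

    interface ReplicatedLog {
        long lastApplied();
        long lastIndex();
        // Hypothetical: append an entry in the current term, replicate it, and run
        // the callback once consensus on it has been reached.
        void appendAndReplicate(Object entry, Runnable onConsensus);
        void applyUpTo(long index);
    }

    private final ReplicatedLog log;

    PreLeaderSketch(final ReplicatedLog log) {
        this.log = log;
    }

    void onElectionWon(final Runnable becomeLeader) {
        if (log.lastApplied() < log.lastIndex()) {
            // Unapplied entries from the prior term cannot be committed directly;
            // commit a no-op entry in the new term first (PreLeader state), then
            // apply everything up to and including it.
            log.appendAndReplicate("NoopPayload" /* placeholder for the real payload */, () -> {
                log.applyUpTo(log.lastIndex());
                becomeLeader.run();
            });
        } else {
            becomeLeader.run();
        }
    }
}

Only once the no-op entry from the new term reaches consensus are the prior-term entries applied, which is also the point at which the new leader should have learned about any replicated history-create payloads.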

Comment by Robert Varga [ 29/May/17 ]

Right, so from that perspective it should work. Oh well, back to log-digging.

Comment by Vratko Polak [ 08/Jun/17 ]

Not sure whether there was a regression or this is a heisenbug, but a recent run on RelEng sees the same behavior as in comment #4 (as opposed to #6).

https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/736/log.html.gz#s1-s30-t1-k2-k25-k1-k1

Generated at Wed Feb 07 21:54:06 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.