[CONTROLLER-1653] C: UnknownHistoryException: Last known history is null; for prefix based shard Created: 04/May/17 Updated: 25/Jul/23 Resolved: 21/Jun/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug |
| Reporter: | Peter Gubka | Assignee: | Robert Varga |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Operating System: All |
| Attachments: |
| Issue Links: |
| External issue ID: | 8371 |
| Description |
|
My job (1 #592) was built from controller=33/56433/5. After the prefix-based shard is created, the produce-transactions RPC is triggered on every node. One of the responses contained cause=org.opendaylight.controller.cluster.access.commands.UnknownHistoryException: Last known history is null]]} Logs are over 1 GB for each node; debug was on. |
| Comments |
| Comment by Peter Gubka [ 04/May/17 ] |
|
Attachment log.html has been added with description: robot log |
| Comment by Viera Zelcamova [ 09/May/17 ] |
|
Might be fixed; keep tracking it and only reopen if it appears again. Lowering priority. |
| Comment by Robert Varga [ 16/May/17 ] |
|
Needs to be tested with 8402 and 8159 patches in place. |
| Comment by Peter Gubka [ 18/May/17 ] |
|
fixed as a side effect |
| Comment by Vratko Polak [ 18/May/17 ] |
|
The subject exception appeared again [2], albeit in a scenario (brief isolation of the prefix-based shard leader) where other producers [3] failed.

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/694/archives/log.html.gz#s1-s30-t1-k2-k25-k1-k2-k1-k1-k1 |
| Comment by Peter Gubka [ 19/May/17 ] |
|
The exception also appeared here: https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/697/archives/log.html.gz#s1-s30-t1-k2-k26-k1-k1 |
| Comment by Vratko Polak [ 22/May/17 ] |
|
> brief isolation of prefix-based shard leader

This time [5] the other producers also report "Last known history is null". |
| Comment by Peter Gubka [ 23/May/17 ] |
|
New debug logs are available.
Exception itself: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-3rd/5/archives/log.html.gz#s1-s2-t1-k2-k27-k1-k2-k1-k1-k1
The ODL zip was built by https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/ |
| Comment by Robert Varga [ 23/May/17 ] |
|
The new set of logs seems to indicate that this is happening after a follower->leader transition:

2017-05-23 13:05:44,012 | INFO | lt-dispatcher-23 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config (Candidate): Starting new election term 3

the previous leader was:

2017-05-23 13:05:45,525 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-id-ints!

and we have moved back its journal:

2017-05-23 13:05:50,830 | INFO | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config (Leader): handleAppendEntriesReply: follower member-3-shard-id-ints!-config lastIndex 8 is ahead of our lastIndex 5 (snapshotIndex 1) - forcing install snaphot
2017-05-23 13:05:50,833 | INFO | ult-dispatcher-4 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: Initiating snapshot capture CaptureSnapshot [lastAppliedIndex=5, lastAppliedTerm=3, lastIndex=5, lastTerm=3, installSnapshotInitiated=, replicatedToAllIndex=-1, replicatedToAllTerm=-1, unAppliedEntries size=0] to install on member-3-shard-id-ints!-config

and then we have rejected a request from that node:

2017-05-23 13:05:59,683 | WARN | ult-dispatcher-4 | LeaderFrontendState | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-1-shard-id-ints!-config: rejecting unknown history request ModifyTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-0-0, sequence=0, replyTo=Actor[akka.tcp://opendaylight-cluster-data@10.29.13.208:2550/user/$m#-1415893195], operations=[TransactionWrite{path=/(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id-int/id-int[{(tag:opendaylight.org,2017:controller:yang:lowlevel:target?revision=2017-02-15)id=prefix-3}]/item}], protocol=SIMPLE}

So this looks like we lost the history ID during the transition and it was not replayed for some reason. Part of it may be locality of the client/frontend; I will dig into the member-3 frontend logs to see what went down. |
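The "rejecting unknown history request" message above corresponds to the leader not finding the request's history identifier in its frontend state. A minimal sketch of that failure mode follows, using hypothetical names (FrontendHistoryRegistry, recordHistory, lookup); this is an illustration only, not the actual LeaderFrontendState code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a leader-side registry of known transaction histories.
// The class and method names are hypothetical, not actual ODL classes.
final class FrontendHistoryRegistry {
    private final Map<Long, Object> knownHistories = new HashMap<>();

    // Called once the history-create payload has been replicated and applied.
    void recordHistory(long historyId, Object state) {
        knownHistories.put(historyId, state);
    }

    // Mirrors the failure mode in the logs: a request referencing a history the
    // new leader never learned about is rejected instead of being processed.
    Object lookup(long historyId) {
        Object state = knownHistories.get(historyId);
        if (state == null) {
            throw new IllegalStateException("Last known history is null for id " + historyId);
        }
        return state;
    }
}
```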
| Comment by Robert Varga [ 23/May/17 ] |
|
So it seems that there is no actual attempt to re-create the history in replay, which explains the failure. As it turns out, we only ever send the create request on initial enqueue, and once it succeeds we never replay it. This points towards a problem on the backend, as it should not respond before the appropriate payload has been replicated (at which point the history would be known to the follower). Examining the code, ShardDataTree really does call replicatePayload() without a callback. Given that we want to allocate state properly, this needs some more thought. |
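A minimal sketch of the direction hinted at above: have the backend defer its reply to the history-create request until a replication callback fires, so a subsequent leader is guaranteed to know the history. The Replicator interface and the replicatePayload(identifier, payload, callback) shape below are assumptions for illustration, not the real ShardDataTree API.

```java
import java.util.function.Consumer;

// Illustrative sketch only: defer the frontend reply until the create-history
// payload is known to be replicated. Types and the replicatePayload signature
// are hypothetical, not the real ShardDataTree API.
final class CreateHistoryHandler {
    interface Replicator {
        // Assumed shape: invoke the callback once the payload reaches consensus.
        void replicatePayload(Object identifier, Object payload, Runnable onReplicated);
    }

    private final Replicator replicator;

    CreateHistoryHandler(Replicator replicator) {
        this.replicator = replicator;
    }

    void handleCreateHistory(Object historyId, Object payload, Consumer<Object> replyTo) {
        // Instead of replying immediately (which lets the client proceed before
        // followers have seen the history), reply from the replication callback.
        replicator.replicatePayload(historyId, payload, () -> replyTo.accept(historyId));
    }
}
```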
| Comment by Robert Varga [ 23/May/17 ] |
|
Incorrect state-keeping on the backend: raising to critical. |
| Comment by Robert Varga [ 24/May/17 ] |
| Comment by Peter Gubka [ 25/May/17 ] |
|
Debug logs (including akka) are attached. ODL was built from controller=63/57763/2:99/57699/3. The bug is still present. |
| Comment by Peter Gubka [ 25/May/17 ] |
|
Debug logs (including akka) are attached. ODL was built from controller=70/57770/4:99/57699/3. The exception is present here: |
| Comment by Peter Gubka [ 26/May/17 ] |
|
Testing without akka logs.
ODL built from:
Testing job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-3rd/17/
Link to exception: |
| Comment by Robert Varga [ 29/May/17 ] |
|
The latest logs look like the same problem the patch was designed to address. What I can see is that the leader was local and it was supposed to respond only after replication had completed (hence the history should have been known in the logs). Yet what I am seeing is:

2017-05-29 09:57:05,587 | DEBUG | qtp1225240590-76 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | Queue ConnectionEntry{request=CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-7-txn-0-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#-908684957], coordinated=false}, enqueuedTicks=489198929633} transmitting entry {…, history=7, cookie=0}, sequence=0}

I assume this run included https://git.opendaylight.org/gerrit/57751, which would mean that ShardDataTree has seen the data as replicated, but in fact it wasn't (at least not visible on the new leader). |
| Comment by Robert Varga [ 29/May/17 ] |
|
Tom, is it possible that AppendEntries batching or something around it is playing tricks on us? I mean, in the case of leader isolation, is there any way we can see a payload as replicated, but then the next leader does not see it (which would be a Raft violation, I think)? |
| Comment by Tom Pantelis [ 29/May/17 ] |
|
(In reply to Robert Varga from comment #16) That shouldn't happen. In a 3-node cluster, if a payload was replicated then at least one of the other nodes persisted it, even if it's not yet committed/applied to state. If it was replicated to, say, node2 but not node3 and the leader is then isolated and an election occurs in the other partition, then node2 should become leader because its log is more up-to-date. |
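The "log is more up-to-date" rule Tom invokes here is the standard Raft election restriction: compare the term of the last log entry first, then its index. A minimal, generic sketch of that comparison (not ODL code):

```java
// Standard Raft election restriction (generic sketch, not ODL code): a voter
// grants its vote only if the candidate's log is at least as up-to-date as its
// own, comparing the last entry's term first and its index second.
final class LogComparison {
    static boolean candidateLogIsUpToDate(long candidateLastTerm, long candidateLastIndex,
                                          long voterLastTerm, long voterLastIndex) {
        if (candidateLastTerm != voterLastTerm) {
            return candidateLastTerm > voterLastTerm;
        }
        return candidateLastIndex >= voterLastIndex;
    }
}
```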
| Comment by Robert Varga [ 29/May/17 ] |
|
Right, now I am fishing a bit: does that also mean it will get committed to state by odl2? I think the intent of onPayloadReplicated at the shard level is to say that the entry is safely replicated no matter what, and I am pretty sure (but will recheck) that we respond to the create request only after persistence comes back. |
| Comment by Tom Pantelis [ 29/May/17 ] |
|
(In reply to Robert Varga from comment #18) I assume you mean Shard#applyState -> ShardDataTree#applyReplicatedPayload. applyState is called when an entry is committed after consensus is reached. When a node wins an election, it will first try to commit/apply any unapplied entries from the prior term before switching to Leader (PreLeader state). It does this by persisting a NoopPayload with the new term - if that gets consensus then the NoopPayload and all prior entries are applied. |
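A rough sketch of the PreLeader/NoopPayload idea described above, with illustrative names rather than the actual Shard classes: the newly elected node persists a no-op entry in its own term, and only once that entry reaches consensus does it apply the prior-term entries and start acting as a full leader.

```java
// Rough sketch of the PreLeader idea described above (names are illustrative,
// not the actual ODL classes): after winning an election, persist a no-op
// entry in the new term; once that entry reaches consensus, Raft's commit rule
// allows all earlier uncommitted entries to be applied before acting as leader.
final class PreLeaderSketch {
    interface RaftLog {
        void appendAndReplicate(long term, Object payload, Runnable onConsensus);
        void applyEntriesUpToLatestCommit();
    }

    static void onElectionWon(RaftLog log, long newTerm, Runnable becomeLeader) {
        Object noopPayload = new Object(); // stand-in for NoopPayload
        log.appendAndReplicate(newTerm, noopPayload, () -> {
            // Consensus on the no-op implies all prior-term entries are committed.
            log.applyEntriesUpToLatestCommit();
            becomeLeader.run();
        });
    }
}
```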
| Comment by Robert Varga [ 29/May/17 ] |
|
Right, so from that perspective it should work. Oh well, back to log-digging. |
| Comment by Vratko Polak [ 08/Jun/17 ] |
|
Not sure if there was a regression or this is a heisenbug, but a recent run on RelEng sees the same behavior as in comment #4 (as opposed to comment #6). |