[CONTROLLER-1628] Unknown history for commited transaction Created: 05/Apr/17 Updated: 03/Jun/19 Resolved: 03/Jun/19 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Neon, Fluorine SR1, Oxygen SR4 |
| Fix Version/s: | Sodium, Neon SR2 |
| Type: | Bug | ||
| Reporter: | Guy Sela | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| External issue ID: | 8159 | ||||||||||||||||
| Description |
|
Latest Carbon code. |
| Comments |
| Comment by Guy Sela [ 05/Apr/17 ] |
|
Attachment odl1karaf.zip has been added with description: ODL 1 log |
| Comment by Guy Sela [ 05/Apr/17 ] |
|
Attachment odl2karaf.zip has been added with description: ODL 2 log |
| Comment by Guy Sela [ 05/Apr/17 ] |
|
Attachment odl3karaf.zip has been added with description: ODL 3 log |
| Comment by Robert Varga [ 05/Apr/17 ] |
|
Guy, can you run this with debug enabled on org.opendaylight.controller.cluster.datastore? It looks like we have lost the record of a transaction chain being created somehow |
| Comment by Guy Sela [ 05/Apr/17 ] |
|
(In reply to Robert Varga from comment #3) Sure, anything to take that upset emoji off your face |
| Comment by Guy Sela [ 09/Apr/17 ] |
|
Attachment odl3.tar.gz has been added with description: ODL3 LOG with debug |
| Comment by Guy Sela [ 09/Apr/17 ] |
|
Attachment odl2.tar.gz has been added with description: ODL2 LOG with debug |
| Comment by Guy Sela [ 09/Apr/17 ] |
|
Attachment odl1.tar.gz has been added with description: ODL1 LOG with debug |
| Comment by Robert Varga [ 12/Apr/17 ] |
|
So this is a follower-side of things, but unfortunately the logs are incomplete. The following histories are candidates: member-3-datastore-operational-fe-0-chn-1 But the logs from master (odl3) do not contain the records of those chains starting nor is there overlap between transactions on master and follower. Digging into journal entries, the most recent inconsistency is detected at: 2017-04-09 17:20:40,052 | DEBUG | lt-dispatcher-34 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-topology-operational: Applying state for log index 817 data org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@54aae041 but the logs are missing that part, because the oldest we have on the leader side is: 2017-04-09 17:20:45,030 | DEBUG | lt-dispatcher-40 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-topology-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-2-shard-topology-operational updated: matchIndex: 846, nextIndex: 847 At the very least we need some overlap in the logs related to the what state was exchanged between leader and follower. The best would be to have logs containing the complete transaction chain lifetime (i.e. from its first transaction). |
| Comment by Robert Varga [ 12/Apr/17 ] |
|
patch to improve logging details: https://git.opendaylight.org/gerrit/54884 |
| Comment by Viera Zelcamova [ 18/Apr/17 ] |
|
Hi Guy, |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
We were on holiday until now. I will try to produce the logs today. |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl1.tar.gz has been added with description: ODL 1 Log with Patch |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl2.tar.gz has been added with description: ODL 2 Log with Patch |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl3.tar.gz has been added with description: ODL 3 Log with Patch |
| Comment by Robert Varga [ 20/Apr/17 ] |
|
Thans Guy, but the logs are still incomplete – they seem to be only last 3 log files (.10, .9, .8) for each node. I really do need the complete thing, as the local history is communicated when a transaction chain is first created and on snapshot transfers. None of that is captured in the files |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl1full.tar.gz has been added with description: ODL1 Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl2A.tar.gz has been added with description: ODL2 Part A Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl2B.tar.gz has been added with description: ODL2 Part B Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl3A.tar.gz has been added with description: ODL3 Part A Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl3B.tar.gz has been added with description: ODL3 Part B Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Attachment odl3C.tar.gz has been added with description: ODL3 Part C Full Log |
| Comment by Guy Sela [ 20/Apr/17 ] |
|
Done |
| Comment by Robert Varga [ 20/Apr/17 ] |
|
Thanks, this looks promising |
| Comment by Robert Varga [ 20/Apr/17 ] |
|
The problem is that ShardCommitCoordinator.handleReadyLocalTransaction() invokes ShardDataTree.createReadyCohort() directly rather than going through transaction chain tracking in ShardDataTree. This is a consistency problem in ShardDataTree interfaces, as ShardDataTree.newReadWriteTransaction() actually does the tracking, leaving ShardCommitCoordinator totally ignorant of it. |
| Comment by Robert Varga [ 20/Apr/17 ] |
| Comment by Robert Varga [ 24/Apr/17 ] |
|
Agreed on killing DataTreeCohortActor. I have filed |
| Comment by Robert Varga [ 24/Apr/17 ] |
|
Ugh, that comment should have gone someplace else. |
| Comment by Robert Varga [ 16/May/17 ] |
|
This is still occurring in testing, notably in Logs indicate that the history was known when in leader role: 2017-05-15 12:48:06,725 | DEBUG | lt-dispatcher-22 | AbstractFrontendHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-default-config: allocating new open transaction member-3-datastore-config-fe-0-chn-2-txn-0-0 but after transition to follower: 2017-05-15 12:48:06,828 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3 we get: 2017-05-15 12:49:11,865 | WARN | lt-dispatcher-26 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-default-config: Unknown history for commited transaction member-3-datastore-config-fe-0-chn-2-txn-2-0, ignoring The transaction chain in question is a long-lived one, which survives the leader ship change event. This evidence points to towards leader-side persistence codepath, which does not seem to apply lifecycle events to its own FrontendClientMetadataBuilder. |
| Comment by Robert Varga [ 16/May/17 ] |
| Comment by Robert Varga [ 26/Nov/18 ] |
|
This is still surfacing in Genius 3-node CSIT. |
| Comment by Robert Varga [ 26/Nov/18 ] |
2018-11-26T08:54:04,046 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Created local history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,047 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged transaction member-2-datastore-operational-fe-0-txn-177-0
2018-11-26T08:54:04,050 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Created local history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,050 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Closed history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,051 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,051 | WARN | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for commited transaction member-2-datastore-operational-fe-0-chn-4-txn-0-0, ignoring
2018-11-26T08:54:04,054 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Committed transaction member-1-datastore-operational-fe-0-chn-1-txn-54-0
2018-11-26T08:54:04,058 | WARN | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for purged transaction member-2-datastore-operational-fe-0-chn-4-txn-0-0, ignoring
2018-11-26T08:54:04,060 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Purged transaction member-1-datastore-operational-fe-0-chn-1-txn-54-0
2018-11-26T08:54:04,068 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Committed transaction member-1-datastore-operational-fe-0-txn-115-0
2018-11-26T08:54:04,081 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Closed history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,081 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,082 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged transaction member-1-datastore-operational-fe-0-txn-115-0
2018-11-26T08:54:04,082 | WARN | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for commited transaction member-2-datastore-operational-fe-0-chn-5-txn-0-0, ignoring
2018-11-26T08:54:04,086 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Committed transaction member-1-datastore-operational-fe-0-chn-1-txn-55-0
2018-11-26T08:54:04,089 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Purged transaction member-1-datastore-operational-fe-0-chn-1-txn-55-0
2018-11-26T08:54:04,097 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-214-0
2018-11-26T08:54:04,098 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Committed transaction member-2-datastore-config-fe-0-txn-216-0
2018-11-26T08:54:04,098 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-215-0
2018-11-26T08:54:04,105 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-217-0
2018-11-26T08:54:04,105 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-218-0
2018-11-26T08:54:04,106 | WARN | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for purged transaction member-2-datastore-operational-fe-0-chn-5-txn-0-0, ignoring
|
| Comment by Robert Varga [ 26/Nov/18 ] |
|
This points towards CloseTransactionChain being too eager with purge. We are still tracking an unpurged transaction in the chain, hence we should not be removing it. The above happens if the user does: TransactionChain chain; WriteTransaction tx = chain.newWriteOnlyTransaction(); // some modifications Future<?> future = tx.commit(); chain.close(); Transaction commit will be waiting in the queue, while chain.close() will have an immediate effect – hence when the transaction completes, FrontendClientMetadataBuilder already has no record of the history. Under normal operation this has no ill effects aside from the annoying warning. During failover operations, this may cause the transaction to be aborted, as the new leader will have no memory of the history, preventing the transaction from being re-created. tell-based protocol is unaffected as far as I can tell, as it is using two separate messages {Destroy,Purge}HistoryRequest instead of a single CloseTransactionChain. |
| Comment by Robert Varga [ 27/Nov/18 ] |
|
This will require postponing the purge until after all transactions are done. Tracking them down, though, may be a bit tricky, as they different states are tracked in different places (pending, committing, committed, purged). Since this is only for ask-based, we can probably not care about committed (a.k.a. closed) transactions and need to track down only transactions which are being committed. |