[CONTROLLER-1628] Unknown history for commited transaction Created: 05/Apr/17  Updated: 03/Jun/19  Resolved: 03/Jun/19

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Neon, Fluorine SR1, Oxygen SR4
Fix Version/s: Sodium, Neon SR2

Type: Bug
Reporter: Guy Sela Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File odl1.tar.gz     File odl1.tar.gz     File odl1full.tar.gz     Zip Archive odl1karaf.zip     File odl2.tar.gz     File odl2.tar.gz     File odl2A.tar.gz     File odl2B.tar.gz     Zip Archive odl2karaf.zip     File odl3.tar.gz     File odl3.tar.gz     File odl3A.tar.gz     File odl3B.tar.gz     File odl3C.tar.gz     Zip Archive odl3karaf.zip    
Issue Links:
Blocks
blocks CONTROLLER-1653 C: UnknownHistoryException: Last know... Resolved
Duplicate
is duplicated by CONTROLLER-1638 Spam in karaf logs Resolved
External issue ID: 8159

 Description   

Latest Carbon code.
3-node cluster as part of an openstack solution.
Connecting 200 computes(OVSs).
Log is filled in non-stopping WARNs:
WARN | lt-dispatcher-15 | FrontendClientMetadataBuilder | 214 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | Unknown history for commited transaction member-3-datastore-operational-fe-0-chn-6-txn-24-0, ignoring



 Comments   
Comment by Guy Sela [ 05/Apr/17 ]

Attachment odl1karaf.zip has been added with description: ODL 1 log

Comment by Guy Sela [ 05/Apr/17 ]

Attachment odl2karaf.zip has been added with description: ODL 2 log

Comment by Guy Sela [ 05/Apr/17 ]

Attachment odl3karaf.zip has been added with description: ODL 3 log

Comment by Robert Varga [ 05/Apr/17 ]

Guy, can you run this with debug enabled on org.opendaylight.controller.cluster.datastore?

It looks like we have lost the record of a transaction chain being created somehow

Comment by Guy Sela [ 05/Apr/17 ]

(In reply to Robert Varga from comment #3)
> Guy, can you run this with debug enabled on
> org.opendaylight.controller.cluster.datastore?
>
> It looks like we have lost the record of a transaction chain being created
> somehow

Sure, anything to take that upset emoji off your face
I'll do it tomorrow.

Comment by Guy Sela [ 09/Apr/17 ]

Attachment odl3.tar.gz has been added with description: ODL3 LOG with debug

Comment by Guy Sela [ 09/Apr/17 ]

Attachment odl2.tar.gz has been added with description: ODL2 LOG with debug

Comment by Guy Sela [ 09/Apr/17 ]

Attachment odl1.tar.gz has been added with description: ODL1 LOG with debug

Comment by Robert Varga [ 12/Apr/17 ]

So this is a follower-side of things, but unfortunately the logs are incomplete. The following histories are candidates:

member-3-datastore-operational-fe-0-chn-1
member-3-datastore-operational-fe-0-chn-3
member-3-datastore-operational-fe-0-chn-6
member-3-datastore-operational-fe-0-chn-9
member-3-datastore-operational-fe-0-chn-11
member-3-datastore-operational-fe-0-chn-12

But the logs from master (odl3) do not contain the records of those chains starting nor is there overlap between transactions on master and follower.

Digging into journal entries, the most recent inconsistency is detected at:

2017-04-09 17:20:40,052 | DEBUG | lt-dispatcher-34 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-topology-operational: Applying state for log index 817 data org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload@54aae041

but the logs are missing that part, because the oldest we have on the leader side is:

2017-04-09 17:20:45,030 | DEBUG | lt-dispatcher-40 | Shard | 207 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-3-shard-topology-operational (Leader): handleAppendEntriesReply - FollowerLogInformation for member-2-shard-topology-operational updated: matchIndex: 846, nextIndex: 847

At the very least we need some overlap in the logs related to the what state was exchanged between leader and follower. The best would be to have logs containing the complete transaction chain lifetime (i.e. from its first transaction).

Comment by Robert Varga [ 12/Apr/17 ]

patch to improve logging details: https://git.opendaylight.org/gerrit/54884

Comment by Viera Zelcamova [ 18/Apr/17 ]

Hi Guy,
can you review the last comment from Robert and attach new logs for us?
Thank you.

Comment by Guy Sela [ 20/Apr/17 ]

We were on holiday until now. I will try to produce the logs today.

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl1.tar.gz has been added with description: ODL 1 Log with Patch

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl2.tar.gz has been added with description: ODL 2 Log with Patch

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl3.tar.gz has been added with description: ODL 3 Log with Patch

Comment by Robert Varga [ 20/Apr/17 ]

Thans Guy, but the logs are still incomplete – they seem to be only last 3 log files (.10, .9, .8) for each node. I really do need the complete thing, as the local history is communicated when a transaction chain is first created and on snapshot transfers. None of that is captured in the files

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl1full.tar.gz has been added with description: ODL1 Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl2A.tar.gz has been added with description: ODL2 Part A Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl2B.tar.gz has been added with description: ODL2 Part B Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl3A.tar.gz has been added with description: ODL3 Part A Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl3B.tar.gz has been added with description: ODL3 Part B Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Attachment odl3C.tar.gz has been added with description: ODL3 Part C Full Log

Comment by Guy Sela [ 20/Apr/17 ]

Done

Comment by Robert Varga [ 20/Apr/17 ]

Thanks, this looks promising

Comment by Robert Varga [ 20/Apr/17 ]

The problem is that ShardCommitCoordinator.handleReadyLocalTransaction() invokes ShardDataTree.createReadyCohort() directly rather than going through transaction chain tracking in ShardDataTree.

This is a consistency problem in ShardDataTree interfaces, as ShardDataTree.newReadWriteTransaction() actually does the tracking, leaving ShardCommitCoordinator totally ignorant of it.

Comment by Robert Varga [ 20/Apr/17 ]

carbon: https://git.opendaylight.org/gerrit/55739

Comment by Robert Varga [ 24/Apr/17 ]

Agreed on killing DataTreeCohortActor. I have filed CONTROLLER-1641 to track that.

Comment by Robert Varga [ 24/Apr/17 ]

Ugh, that comment should have gone someplace else.

Comment by Robert Varga [ 16/May/17 ]

This is still occurring in testing, notably in CONTROLLER-1664. It happens on clean leader movement, when the leader transitions to being a follower.

Logs indicate that the history was known when in leader role:

2017-05-15 12:48:06,725 | DEBUG | lt-dispatcher-22 | AbstractFrontendHistory | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-default-config: allocating new open transaction member-3-datastore-config-fe-0-chn-2-txn-0-0

but after transition to follower:

2017-05-15 12:48:06,828 | INFO | lt-dispatcher-22 | Shard | 192 - org.opendaylight.controller.sal-clustering-commons - 1.5.0.SNAPSHOT | member-2-shard-default-config (Leader) :- Switching from behavior Leader to Follower, election term: 3

we get:

2017-05-15 12:49:11,865 | WARN | lt-dispatcher-26 | FrontendClientMetadataBuilder | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.0.SNAPSHOT | member-2-shard-default-config: Unknown history for commited transaction member-3-datastore-config-fe-0-chn-2-txn-2-0, ignoring

The transaction chain in question is a long-lived one, which survives the leader ship change event.

This evidence points to towards leader-side persistence codepath, which does not seem to apply lifecycle events to its own FrontendClientMetadataBuilder.

Comment by Robert Varga [ 16/May/17 ]

https://git.opendaylight.org/gerrit/57138

Comment by Robert Varga [ 26/Nov/18 ]

This is still surfacing in Genius 3-node CSIT.

Comment by Robert Varga [ 26/Nov/18 ]
2018-11-26T08:54:04,046 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Created local history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,047 | DEBUG | opendaylight-cluster-data-shard-dispatcher-20 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged transaction member-2-datastore-operational-fe-0-txn-177-0
2018-11-26T08:54:04,050 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Created local history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,050 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Closed history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,051 | DEBUG | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=4, cookie=0}
2018-11-26T08:54:04,051 | WARN  | opendaylight-cluster-data-shard-dispatcher-44 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for commited transaction member-2-datastore-operational-fe-0-chn-4-txn-0-0, ignoring
2018-11-26T08:54:04,054 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Committed transaction member-1-datastore-operational-fe-0-chn-1-txn-54-0
2018-11-26T08:54:04,058 | WARN  | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for purged transaction member-2-datastore-operational-fe-0-chn-4-txn-0-0, ignoring
2018-11-26T08:54:04,060 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Purged transaction member-1-datastore-operational-fe-0-chn-1-txn-54-0
2018-11-26T08:54:04,068 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Committed transaction member-1-datastore-operational-fe-0-txn-115-0
2018-11-26T08:54:04,081 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Closed history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,081 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged history LocalHistoryIdentifier{client=ClientIdentifier{frontend=member-2-frontend-datastore-operational, generation=0}, history=5, cookie=0}
2018-11-26T08:54:04,082 | DEBUG | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Purged transaction member-1-datastore-operational-fe-0-txn-115-0
2018-11-26T08:54:04,082 | WARN  | opendaylight-cluster-data-shard-dispatcher-43 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for commited transaction member-2-datastore-operational-fe-0-chn-5-txn-0-0, ignoring
2018-11-26T08:54:04,086 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Committed transaction member-1-datastore-operational-fe-0-chn-1-txn-55-0
2018-11-26T08:54:04,089 | DEBUG | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-topology-operational: Purged transaction member-1-datastore-operational-fe-0-chn-1-txn-55-0
2018-11-26T08:54:04,097 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-214-0
2018-11-26T08:54:04,098 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Committed transaction member-2-datastore-config-fe-0-txn-216-0
2018-11-26T08:54:04,098 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-215-0
2018-11-26T08:54:04,105 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-217-0
2018-11-26T08:54:04,105 | DEBUG | opendaylight-cluster-data-shard-dispatcher-45 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-config: Aborted transaction member-2-datastore-config-fe-0-txn-218-0
2018-11-26T08:54:04,106 | WARN  | opendaylight-cluster-data-shard-dispatcher-42 | FrontendClientMetadataBuilder    | 232 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-3-shard-default-operational: Unknown history for purged transaction member-2-datastore-operational-fe-0-chn-5-txn-0-0, ignoring
Comment by Robert Varga [ 26/Nov/18 ]

This points towards CloseTransactionChain being too eager with purge. We are still tracking an unpurged transaction in the chain, hence we should not be removing it. The above happens if the user does:

TransactionChain chain;
WriteTransaction tx = chain.newWriteOnlyTransaction();
// some modifications
Future<?> future = tx.commit();
chain.close();

Transaction commit will be waiting in the queue, while chain.close() will have an immediate effect – hence when the transaction completes, FrontendClientMetadataBuilder already has no record of the history.

Under normal operation this has no ill effects aside from the annoying warning. During failover operations, this may cause the transaction to be aborted, as the new leader will have no memory of the history, preventing the transaction from being re-created.

tell-based protocol is unaffected as far as I can tell, as it is using two separate messages {Destroy,Purge}HistoryRequest instead of a single CloseTransactionChain.

Comment by Robert Varga [ 27/Nov/18 ]

This will require postponing the purge until after all transactions are done. Tracking them down, though, may be a bit tricky, as they different states are tracked in different places (pending, committing, committed, purged). Since this is only for ask-based, we can probably not care about committed (a.k.a. closed) transactions and need to track down only transactions which are being committed.

Generated at Wed Feb 07 19:56:02 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.