Details
Bug
Status: Resolved
Resolution: Done
Post-Helium
None
None
Operating System: All
Platform: All
3206
High
Description
I created a unit test in DistributedDataStoreIntegrationTest that accesses 2 shards using a Tx chain:
// Tx 1 (write-only): creates the top-level containers and lists in both shards.
DOMStoreWriteTransaction writeTx = txChain.newWriteOnlyTransaction();
writeTx.write(CarsModel.BASE_PATH, CarsModel.emptyContainer());
writeTx.write(PeopleModel.BASE_PATH, PeopleModel.emptyContainer());
writeTx.write(CarsModel.CAR_LIST_PATH, CarsModel.newCarMapNode());
writeTx.write(PeopleModel.PERSON_LIST_PATH, PeopleModel.newPersonMapNode());
DOMStoreThreePhaseCommitCohort cohort1 = writeTx.ready();
ListenableFuture<Boolean> canCommit1 = cohort1.canCommit();

// Tx 2 (read-write): adds a car and a person, so it also touches both shards.
DOMStoreReadWriteTransaction readWriteTx = txChain.newReadWriteTransaction();
MapEntryNode car = CarsModel.newCarEntry("optima", BigInteger.valueOf(20000));
YangInstanceIdentifier carPath = CarsModel.newCarPath("optima");
readWriteTx.write(carPath, car);
MapEntryNode person = PeopleModel.newPersonEntry("jack");
YangInstanceIdentifier personPath = PeopleModel.newPersonPath("jack");
readWriteTx.merge(personPath, person);
DOMStoreThreePhaseCommitCohort cohort2 = readWriteTx.ready();
ListenableFuture<Boolean> canCommit2 = cohort2.canCommit();

// Tx 3 (write-only): deletes the person, so it touches a single shard.
writeTx = txChain.newWriteOnlyTransaction();
writeTx.delete(personPath);
DOMStoreThreePhaseCommitCohort cohort3 = writeTx.ready();

// Finish the commits in order; canCommit has not yet been called on cohort3.
doCommit(canCommit1, cohort1);
doCommit(canCommit2, cohort2);
doCommit(cohort3);
The first 2 txns write to both shards and thus are coordinated commits, i.e. they go through the normal 3PC coordinated by the front-end. The 3rd txn only accesses 1 shard and thus does a direct commit, eliding the front-end 3PC coordination.
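The distinction above can be sketched roughly as follows. This is only an illustration of the rule as described here, not the actual front-end code: the class, enum, and method names are hypothetical, and rejecting a txn that touched no shard is an assumption made to keep the sketch total.

```java
import java.util.Set;

final class CommitModeSketch {
    /** Hypothetical labels for the two commit paths described above. */
    enum CommitMode { DIRECT, COORDINATED }

    /**
     * Picks the commit path from the set of shards a txn accessed.
     * One shard: direct commit, no front-end 3PC.
     * More than one shard: 3PC coordinated by the front-end.
     */
    static CommitMode commitModeFor(Set<String> shardsTouched) {
        if (shardsTouched.isEmpty()) {
            // Assumption: the sketch does not model a txn that touched nothing.
            throw new IllegalArgumentException("txn touched no shards");
        }
        return shardsTouched.size() == 1 ? CommitMode.DIRECT : CommitMode.COORDINATED;
    }

    public static void main(String[] args) {
        // Txns 1 and 2 in the test write to both shards; txn 3 only to one.
        System.out.println(commitModeFor(Set.of("cars", "people")));
        System.out.println(commitModeFor(Set.of("people")));
    }
}
```

With the shard names used here, the first call yields COORDINATED and the second DIRECT.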
The test fails with an AskTimeoutException on the "dummy" canCommit for the 3rd txn, which doesn't actually send the CanCommit message but simply waits on the reply to the direct commit ReadyLocalTransaction message.
I found 2 issues causing the AskTimeoutException. When the ReadyLocalTransaction is processed for the 3rd txn, the 2nd txn's commit is still in progress, so the 3rd txn's CohortEntry is queued. When the 2nd txn finishes its commit, the 3rd txn's CohortEntry is dequeued and proceeds to doCanCommit. Since it's an immediate commit, it proceeds right to commit, which calls the RaftActor to persist. In the debug output I saw the call to persistence, but it never got the callback that the persist completed. Thus it never finished the commit and never sent the CommitTransactionReply, which resulted in the AskTimeoutException.
The problem is that the 2nd txn finished its commit via a direct call to applyState from the RaftActor's persist apply callback. The shard then immediately proceeds to the 3rd txn and tries to persist its entry but, apparently, akka's persistence isn't re-entrant, so you can't issue another persist from within a previous persist's callback.
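A toy model of this failure mode may make it easier to follow. It is not Akka or RaftActor code: ToyJournal, ToyShard, and the mailbox queue are hypothetical stand-ins, and the journal is simply assumed to drop a persist issued while one of its callbacks is running, which reproduces the symptom seen in the debug output. The applyViaMessage flag contrasts the direct applyState call with deferring the apply to a queued message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class ReentrantPersistSketch {
    /** Toy journal (assumption): a persist() issued inside a callback is lost. */
    static final class ToyJournal {
        private boolean inCallback;

        void persist(String entry, Runnable onPersisted) {
            if (inCallback) {
                return; // models the symptom: this callback never fires
            }
            inCallback = true;
            try {
                onPersisted.run();
            } finally {
                inCallback = false;
            }
        }
    }

    /** Toy shard with queued txns and a mailbox of deferred messages. */
    static final class ToyShard {
        final ToyJournal journal = new ToyJournal();
        final Queue<Runnable> mailbox = new ArrayDeque<>();
        final Queue<String> pending = new ArrayDeque<>();
        final List<String> committed = new ArrayList<>();
        final boolean applyViaMessage;

        ToyShard(boolean applyViaMessage) {
            this.applyViaMessage = applyViaMessage;
        }

        void commitNext() {
            String tx = pending.poll();
            if (tx == null) {
                return;
            }
            journal.persist(tx, () -> {
                if (applyViaMessage) {
                    mailbox.add(() -> applyState(tx)); // apply after the callback returns
                } else {
                    applyState(tx); // direct call from inside the persist callback
                }
            });
        }

        void applyState(String tx) {
            committed.add(tx);
            commitNext(); // the shard immediately moves on to the next queued txn
        }

        void drainMailbox() {
            Runnable msg;
            while ((msg = mailbox.poll()) != null) {
                msg.run();
            }
        }
    }

    /** Queues tx2 and tx3, starts committing, and returns what was committed. */
    static List<String> run(boolean applyViaMessage) {
        ToyShard shard = new ToyShard(applyViaMessage);
        shard.pending.add("tx2");
        shard.pending.add("tx3");
        shard.commitNext();
        shard.drainMailbox();
        return shard.committed;
    }

    public static void main(String[] args) {
        System.out.println("direct call:  " + run(false));
        System.out.println("via message:  " + run(true));
    }
}
```

In this model the direct call commits only tx2, because tx3's persist is issued from within tx2's callback and is dropped; with the apply deferred to a message, both tx2 and tx3 commit.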
I changed the RaftActor to send the ApplyState message instead of calling applyState directly. However, I still saw the AskTimeoutException. The 2nd issue is that the shard was using the wrong sender when sending the CommitTransactionReply for the 3rd txn. It was using getSender() which, in this case, was the sender for the first txn. After changing it to use the correct sender from txn 3's CohortEntry, the test succeeded.
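The sender mix-up can be illustrated with another toy model, again not the real Shard code. Here currentSender is a hypothetical stand-in for getSender(), which reflects whichever message is being handled at that moment, and the replySender field on this simplified CohortEntry is an assumed name for the sender captured when the txn was readied. The sender names are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class ReplySenderSketch {
    /** Simplified cohort entry that remembers who should get the reply. */
    static final class CohortEntry {
        final String txId;
        final String replySender;

        CohortEntry(String txId, String replySender) {
            this.txId = txId;
            this.replySender = replySender;
        }
    }

    static final class ToyShard {
        String currentSender; // stand-in for getSender(): sender of the message being handled
        final Queue<CohortEntry> queued = new ArrayDeque<>();
        final List<String> replies = new ArrayList<>();
        final boolean useEntrySender;

        ToyShard(boolean useEntrySender) {
            this.useEntrySender = useEntrySender;
        }

        /** A txn is readied while an earlier commit is in progress, so it is queued. */
        void onReady(String sender, String txId) {
            currentSender = sender;
            queued.add(new CohortEntry(txId, sender));
        }

        /** A later message from a different sender finishes the earlier commit. */
        void onCommitFinished(String sender) {
            currentSender = sender;
            CohortEntry next = queued.poll();
            if (next != null) {
                // The queued txn completes now, while another sender's message is current.
                String target = useEntrySender ? next.replySender : currentSender;
                replies.add(next.txId + " -> " + target);
            }
        }
    }

    static List<String> run(boolean useEntrySender) {
        ToyShard shard = new ToyShard(useEntrySender);
        shard.onReady("tx3-sender", "tx3");
        shard.onCommitFinished("tx1-sender");
        return shard.replies;
    }

    public static void main(String[] args) {
        System.out.println("getSender():       " + run(false));
        System.out.println("entry's sender:    " + run(true));
    }
}
```

Replying to the current sender delivers tx3's reply to tx1-sender, so tx3's own sender keeps waiting until it times out; replying to the sender stored in the entry delivers it to tx3-sender.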
However, the test still fails sporadically because, due to timing, the 3rd (direct commit) txn can be committed before the prior coordinated commit. I still need to look into that.