[CONTROLLER-1687] Transaction writer does not fail in the expected time after isolation Created: 17/May/17 Updated: 25/Jul/23 Resolved: 27/Jul/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Robert Varga |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||
| External issue ID: | 8494 | ||||||||
| Description |
|
This has happened once so far; usually the failure happens earlier (

The suite calls write-transactions on a member, then isolates the member, then waits some time (60 seconds right now) expecting no failure, and then some more time (20 s) expecting a failure. This failure (or any other response) did not happen [1] when the target module-based shard has its leader on the same member. Tell-based protocol, simple (as in not chained) transactions.

In karaf.log [2] between 13:46:27,686 and 13:50:38,617 I see only the usual isolation-related messages.

[0] https://git.opendaylight.org/gerrit/57074 |
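For orientation, here is a minimal sketch of the timing check the description outlines. The actual suite is Robot Framework based (the log.html links below are Robot logs); the class and method names here are hypothetical, and only the 60 s and 20 s windows come from the description.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Illustrative sketch only; the real suite is Robot Framework based and only the
// 60 s ("must not fail yet") and 20 s ("must fail") windows come from the description.
final class IsolationTimingCheck {
    static void expectFailureWindow(BooleanSupplier writerHasFailed) throws InterruptedException {
        // Phase 1: for 60 seconds after isolation the writer must not report a failure.
        final long phase1End = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
        while (System.nanoTime() < phase1End) {
            if (writerHasFailed.getAsBoolean()) {
                throw new AssertionError("Writer failed too early");
            }
            Thread.sleep(1000);
        }

        // Phase 2: within the next 20 seconds the writer is expected to report a failure.
        final long phase2End = System.nanoTime() + TimeUnit.SECONDS.toNanos(20);
        while (System.nanoTime() < phase2End) {
            if (writerHasFailed.getAsBoolean()) {
                return; // expected outcome
            }
            Thread.sleep(1000);
        }

        // The symptom reported in this Bug: no failure (or any other response) ever arrives.
        throw new AssertionError("Writer did not fail within the expected window");
    }
}
```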
| Comments |
| Comment by Tomas Cere [ 18/May/17 ] |
|
I think that with the tell-based protocol, the correct time to wait for the transaction to time out is 2 minutes, after: https://git.opendaylight.org/gerrit/#/c/56874/ |
| Comment by Vratko Polak [ 22/May/17 ] |
|
Currently the suite waits for 121 seconds [0], which may be too close. The strange thing is that the responses report different times [1] at which the timeout happened. The smallest is 128 seconds; the other times are 162 s and 174 s. Should we make the suite more tolerant of the timeout timing?

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s28-t3-k2-k2 |
| Comment by Vratko Polak [ 22/May/17 ] |
|
Something similar happened in another scenario. Still a module-based shard with the tell-based protocol, but now only one writer (co-located with the leader). The suite computes the end time from the time the transaction stream started [2], which is less precise than computing it from the isolation time. But the difference is less than 6 seconds, and we wait for the failure for an additional 20 seconds, so I believe it is the same Bug. |
| Comment by Vratko Polak [ 22/May/17 ] |
|
> module-based shard

Prefix-based shards now behave the same way [3]. The interesting fact is that in the last run this Bug always happened when the writer and the leader were on the same member, but never when the writer was isolated on a follower member. |
| Comment by Viera Zelcamova [ 23/May/17 ] |
|
This seems to be a similar problem to #8445. We need debug logs to confirm it is a duplicate. |
| Comment by Peter Gubka [ 23/May/17 ] |
|
New run with debug logs: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/5/

ODL bundle used from https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/ |
| Comment by Peter Gubka [ 25/May/17 ] |
|
Run with debug logs (including akka). ODL built from controller=63/57763/2:99/57699/3 |
| Comment by Peter Gubka [ 25/May/17 ] |
|
New logs (akka included). ODL built from controller=70/57770/4:99/57699/3 at |
| Comment by Vratko Polak [ 29/May/17 ] |
|
With the [11] codebase this sometimes happens (the writer does not fail within 30 seconds) [12] and sometimes does not (the writer fails as expected, in just three seconds) [13].

[11] https://git.opendaylight.org/gerrit/#/c/57822/1 |
| Comment by Robert Varga [ 29/May/17 ] |
|
In those particular runs I do not see anything unexpected. It would seem our timer is not running in ReconnectingConnection (the original timer is stopped when it reaches reconnection). |
| Comment by Robert Varga [ 29/May/17 ] |
|
I have noticed the following in what looks to be a user thread:

2017-05-29 03:13:03,006 | DEBUG | pool-30-thread-1 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Sleeping for 2654453ms

That means the thread is being blocked for almost 45 minutes, and that value came from the ProgressTracker. This is the first 'Sleeping for' message, which indicates we were not exerting backpressure while the connection was connecting. That process finished just a millisecond before:

2017-05-29 03:13:03,005 | INFO | ult-dispatcher-4 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-1620047755], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.15:2550/user/shardmanager-config/member-2-shard-default-config#-20940815], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}}

We clearly need to put an upper bound on sleeping, but we also need to fix the backpressure – the client should have been throttled before that point (i.e. there should have been 'Sleeping for' messages). |
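To illustrate the "upper bound on sleeping" idea, here is a minimal sketch of clamping the backpressure delay. The cap value and all names are hypothetical; this is not the actual cds-access-client code, only the clamping logic is shown.

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch of the "upper bound on sleeping" idea; the cap value and the
// names are hypothetical, not the actual cds-access-client code.
final class BackpressureSleep {
    // Hypothetical cap: never block a user thread for longer than this per request.
    private static final long MAX_SLEEP_NANOS = TimeUnit.SECONDS.toNanos(5);

    static void throttle(final long requestedDelayNanos) throws InterruptedException {
        // Clamp the delay computed by the progress tracker, so that a stale or
        // runaway estimate (such as the 2654453 ms seen above) cannot block the
        // caller for tens of minutes.
        final long delay = Math.min(requestedDelayNanos, MAX_SLEEP_NANOS);
        if (delay > 0) {
            TimeUnit.NANOSECONDS.sleep(delay);
        }
    }
}
```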
| Comment by Robert Varga [ 29/May/17 ] |
|
Note this is the 'reconnecting' case: the connection has previously been connected and there is a forwarder involved. |
| Comment by Robert Varga [ 01/Jun/17 ] |
| Comment by Peter Gubka [ 02/Jun/17 ] |
|
(In reply to Robert Varga from comment #13) The problem is still present. Job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/2/ |
| Comment by Robert Varga [ 03/Jun/17 ] |
| Comment by Vratko Polak [ 06/Jun/17 ] |
|
> https://git.opendaylight.org/gerrit/58194

A build that had the 58194/3 chain with 57699/5 on top still failed [18] on Sandbox.

[18] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/log.html.gz#s1-s16-t1-k2-k16-k1-k1 |
| Comment by Vratko Polak [ 07/Jun/17 ] |
|
> Build that had 58194/3 chain with 57699/5 on top

The same run with some CDS-related debug messages. The karaf.log [20] of the member with the transaction producer and the original prefix shard leader is huge. |
| Comment by Robert Varga [ 08/Jun/17 ] |
|
The victim transaction should be:

2017-06-06 18:51:56,112 | WARN | lt-dispatcher-20 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Current transaction member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0 has timed out after 15000 ms in state COMM

Its request happened here:

2017-06-06 18:50:03,702 | DEBUG | lt-dispatcher-28 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Transaction proxy LocalReadWriteProxyTransaction{identifier=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, state=FLUSHED} sending request CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#531635654], coordinated=false} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$128/1535548548@1a032e87

and the timeout occurred pretty much spot-on:

2017-06-06 18:52:03,732 | DEBUG | lt-dispatcher-27 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-Shard-id-ints!: running timer on ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-id-ints!-config#821446964], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=id-ints!, dataTree present=true}}

2017-06-06 18:52:03,732 | DEBUG | lt-dispatcher-27 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-Shard-id-ints!: Connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-id-ints!-config#821446964], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=id-ints!, dataTree present=true}} timed out entry TransmittedConnectionEntry{request=CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#531635654], coordinated=false}, enqueuedTicks=5430269723299, sessionId=0, txSequence=5706}

and it was routed to the Future:

2017-06-06 18:52:03,815 | DEBUG | lt-dispatcher-27 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, state=FLUSHED} directCommit completed |
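For readers not familiar with the tell-based protocol internals, a simplified sketch of what the "running timer ... timed out entry" lines correspond to: the connection keeps transmitted entries in enqueue order and the periodic timer fails every entry older than the request timeout. All class and field names below are illustrative, not the actual cds-access-client types; the ~2 minute figure comes from the comments above.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Simplified sketch of what the "running timer ... timed out entry" lines show:
// transmitted entries are kept in enqueue order, and the periodic timer fails
// every entry whose enqueue tick is older than the request timeout (roughly
// 2 minutes in the runs above). Names are illustrative, not the actual
// cds-access-client types.
final class ConnectionTimerSketch {
    private static final long REQUEST_TIMEOUT_NANOS = TimeUnit.MINUTES.toNanos(2);

    record Entry(String request, long enqueuedTicks) {
    }

    private final Queue<Entry> pending = new ArrayDeque<>();

    void enqueue(final Entry entry) {
        pending.add(entry);
    }

    void runTimer(final long nowTicks) {
        // Entries are ordered by enqueue time, so the scan can stop at the first
        // entry that has not yet exceeded the timeout.
        while (!pending.isEmpty()) {
            final Entry head = pending.peek();
            if (nowTicks - head.enqueuedTicks() < REQUEST_TIMEOUT_NANOS) {
                break;
            }
            pending.remove();
            // In the real code the entry's callback is completed with a timeout
            // failure; here we just report it.
            System.out.println("Timed out: " + head.request());
        }
    }
}
```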
| Comment by Vratko Polak [ 15/Jun/17 ] |
|
This has recently been happening very reliably, in every test where the client is isolated for a longer time, except for one combination: the transaction writer (module-based shard) using a transaction chain (as opposed to free-standing transactions). That combination has been passing for the last 7 days. It might suggest that the core works correctly but the test driver does not detect the first transaction failure correctly. |
| Comment by Vratko Polak [ 27/Jun/17 ] |
|
Sandbox run with multiple test cases relevant to this Bug, verbose logs and the 59527/1 codebase: [21]

[21] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/5/ |
| Comment by Robert Varga [ 27/Jun/17 ] |
| Comment by Vratko Polak [ 28/Jun/17 ] |
|
The latest Sandbox run, with one test case (waiting time increased to 30 minutes) against codebase [22] (with cherry-pick [23]) and verbose logs, still shows [24] that a transaction writer using simple transactions (not chains) does not report a failure when the node (member-2, containing both the shard leader and the writer) is isolated indefinitely.

[22] https://git.opendaylight.org/gerrit/#/c/59604/1 |
| Comment by Vratko Polak [ 29/Jun/17 ] |
|
> one test case

The same test case is still failing [25] on releng, but the other 6 test cases have passed on releng today. |
| Comment by Vratko Polak [ 30/Jun/17 ] |
|
> other 6 test cases have passed

Today only 5 of them passed. |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
Still present, with two test cases failing.

[27] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/1/log.html.gz#s1-s2-t3-k2-k16-k1-k1 |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
With [29] the message [30] is different:

[29] https://git.opendaylight.org/gerrit/#/c/59895/1 |
| Comment by Vratko Polak [ 17/Jul/17 ] |
|
This week's Sandbox run: [31].

[31] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-ci-only-carbon/2/ |
| Comment by Vratko Polak [ 20/Jul/17 ] |
|
I think I have a fix [32], but it probably needs a rebase and a better commit message. |
| Comment by Vratko Polak [ 24/Jul/17 ] |
|
> [32] https://git.opendaylight.org/gerrit/#/c/60606/2

Not merged yet; the Bug is still present on releng [33]. |
| Comment by Vratko Polak [ 26/Jul/17 ] |
|
[32] is merged, tests are mostly passing, but not 100%. Here is a recent releng failure: [34]. |
| Comment by Vratko Polak [ 26/Jul/17 ] |
|
> recent releng failure: [34].

Oh, that happened for prefix-based shards, but before the fix those tests were passing (only module-based tests were failing on this Bug). So that may be a similar but different Bug. More investigation is under way. |
| Comment by Vratko Polak [ 26/Jul/17 ] |
|
> More investigation

It does not seem to be reproducible on Sandbox so far (runs 4-10 of [35]). I suspect there is some inconsistency in the current implementation of produce-transactions which needs a specific timing to cause this symptom.

[35] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-ci-only-carbon/ |
| Comment by Vratko Polak [ 26/Jul/17 ] |
|
> Does not seem to be reproducible on Sandbox so far (runs 4-10 of [35])

Run 15 succeeded at reproducing the failure [36].

2017-07-26 16:04:02,856 | DEBUG | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler

Note that this means checkComplete has never been called, even though completingFuture has not been cancelled.

[36] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/15/log.html.gz#s1-s2-t3-k2-k16-k1-k1 |
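A minimal sketch of the failure mode inferred here: if a periodically scheduled completion check is never scheduled, or silently stops running, the handler's result is never completed and the test never sees a failure. The names below are illustrative, not the actual clustering-it-provider code.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the inferred failure mode: a handler that relies on a periodically
// scheduled completion check hangs forever if that check is never scheduled or
// silently stops running. Names are illustrative, not the actual
// clustering-it-provider code.
final class CompletionCheckSketch {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private volatile ScheduledFuture<?> checkFuture;
    private volatile boolean done;

    void start() {
        // If this call is skipped on some code path, checkComplete() is never
        // invoked and nobody ever completes (or fails) the RPC result. Note also
        // that scheduleAtFixedRate() suppresses further runs if the task throws,
        // which is another way the check can stop without any log message.
        checkFuture = executor.scheduleAtFixedRate(this::checkComplete, 1, 1, TimeUnit.SECONDS);
    }

    void markDone() {
        done = true;
    }

    private void checkComplete() {
        if (done) {
            checkFuture.cancel(false);
            executor.shutdown();
        }
        // Otherwise: inspect outstanding transactions here and fail the result
        // once the expected timeout has elapsed (omitted in this sketch).
    }
}
```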
| Comment by Vratko Polak [ 26/Jul/17 ] |
|
> I will try to reproduce with more logging.

Reproduced. This time the karaf.log [39] shows that the progress stops when closing itemProducer. Will try to repeat with even more logs [40].

2017-07-26 17:28:10,129 | TRACE | pool-31-thread-2 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler

[39] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/18/odl1_karaf.log.gz |
| Comment by Vratko Polak [ 27/Jul/17 ] |
|
> Will try to repeat with even more logs [40].

No success reproducing with the new trace logs yet.

> progress stops when closing itemProducer

This feels different enough from the previous comments on this Bug, opened |