[CONTROLLER-1687] Transaction writer does not fail expected time after isolation Created: 17/May/17  Updated: 25/Jul/23  Resolved: 27/Jul/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
blocks CONTROLLER-1700 Timeout waiting for task from writer ... Resolved
External issue ID: 8494

 Description   

This has happened only once so far; usually the failure happens earlier (CONTROLLER-1679).
This behavior is expected to change after the fix [0] goes in; we will probably reuse this Bug if the failure still does not happen after that.

The suite calls write-transactions on a member, then isolates the member, then waits some time (60 seconds right now) expecting no failure, and then some more time (20 s) expecting a failure.
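
For reference, a minimal sketch of the timing the suite enforces (the real suite is Robot Framework; this Java helper and its names are invented for illustration):

import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the expected timing, not the actual suite code.
final class IsolationExpectation {
    static final Duration NO_FAILURE_WINDOW = Duration.ofSeconds(60); // no failure expected yet
    static final Duration FAILURE_WINDOW = Duration.ofSeconds(20);    // failure must arrive in this window

    static void check(final Instant isolatedAt, final Instant failedAt) {
        if (failedAt == null) {
            throw new AssertionError("writer never reported a failure after isolation");
        }
        final Duration elapsed = Duration.between(isolatedAt, failedAt);
        if (elapsed.compareTo(NO_FAILURE_WINDOW) < 0) {
            throw new AssertionError("writer failed too early: " + elapsed);
        }
        if (elapsed.compareTo(NO_FAILURE_WINDOW.plus(FAILURE_WINDOW)) > 0) {
            throw new AssertionError("writer failed too late: " + elapsed);
        }
    }
}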

This failure (or any other response) did not happen [1] when the target module-based shard has its leader on the same member. Tell-based protocol, simple (as in not chained) transactions. In karaf.log [2], between 13:46:27,686 and 13:50:38,617, I see only the usual isolation-related messages.

[0] https://git.opendaylight.org/gerrit/57074
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/693/archives/log.html.gz#s1-s32-t3-k2-k17-k1-k1
[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/693/archives/odl2_karaf.log.gz



 Comments   
Comment by Tomas Cere [ 18/May/17 ]

I think that with the tell-based protocol the correct time to wait for the transaction to time out is 2 minutes, after https://git.opendaylight.org/gerrit/#/c/56874/ is merged.

Comment by Vratko Polak [ 22/May/17 ]

Currently, the suite waits for 121 seconds [0], which may be too close. The strange thing is that the responses report different times [1] at which the timeout happened. The smallest time is 128 seconds; the other times are 162 s and 174 s.

Should we make the suite more tolerant of the timeout timing? (One possible shape is sketched after the links below.)

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s28-t3-k2-k2
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s28-t3-k2-k26-k1-k7
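
One possible shape of a more tolerant wait (a hypothetical helper, not the current suite code): instead of a fixed 121-second deadline, poll for the failure up to the largest observed timeout plus a margin.

import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; the constant comes from the timings quoted above.
final class TolerantWait {
    static final long MAX_OBSERVED_TIMEOUT_S = 174; // largest timeout observed so far
    static final long MARGIN_S = 30;                // extra slack on top of it

    static boolean waitForWriterFailure(final BooleanSupplier writerFailed) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(MAX_OBSERVED_TIMEOUT_S + MARGIN_S);
        while (System.nanoTime() < deadline) {
            if (writerFailed.getAsBoolean()) {
                return true;
            }
            TimeUnit.SECONDS.sleep(1); // poll once per second
        }
        return false;
    }
}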

Comment by Vratko Polak [ 22/May/17 ]

Something similar happened in another scenario. Still a module-based shard with the tell-based protocol, but now with only one writer (co-located with the leader).

The suite computes the end time from the time the transaction stream started [2], which is less precise than computing it from the isolation time. But the difference is less than 6 seconds, and we wait for the failure for an additional 20 seconds, so I believe it is the same Bug.

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s32-t3-k2-k12

Comment by Vratko Polak [ 22/May/17 ]

> module-based shard

Prefix-based shards now behave the same way [3].

The interesting fact is that in the last run this Bug always happened when the writer and the leader were on the same member, but never when the writer was isolated on a follower member.

[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/715/archives/log.html.gz#s1-s34-t1-k2-k19-k1-k1

Comment by Viera Zelcamova [ 23/May/17 ]

This seems to be a similar problem to #8445. We need debug logs to confirm it is a duplicate.

Comment by Peter Gubka [ 23/May/17 ]

New run with debug logs https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/5/

Odl bundle used from https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/37/

Comment by Peter Gubka [ 25/May/17 ]

Run with debug logs (including Akka)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/6/

Odl build from controller=63/57763/2:99/57699/3
https://nexus.opendaylight.org/content/repositories/opendaylight.snapshot/org/opendaylight/integration/integration/distribution/distribution-karaf/0.6.0-SNAPSHOT/distribution-karaf-0.6.0-20170524.144642-162.zip

Comment by Peter Gubka [ 25/May/17 ]

New logs (akka included)
https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/7/

Odl built from controller=70/57770/4:99/57699/3 at
https://jenkins.opendaylight.org/releng/view/integration/job/integration-multipatch-test-carbon/45/

RPC that did not fail:
https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-4th/7/archives/log.html.gz#s1-s2-t1-k2-k20-k1-k2-k1

Comment by Vratko Polak [ 29/May/17 ]

With the [11] codebase this sometimes happens (the writer does not fail within 30 seconds) [12] and sometimes does not (the writer fails as expected, in just three seconds) [13].

[11] https://git.opendaylight.org/gerrit/#/c/57822/1
[12] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/archives/log.html.gz#s1-s14-t1-k2-k19-k1-k1
[13] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/archives/log.html.gz#s1-s14-t3-k2-k19

Comment by Robert Varga [ 29/May/17 ]

In those particular runs I do not see anything unexpected. It would seem our timer is not running in ReconnectingConnection (the original timer is stopped when it reaches reconnection).
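
A minimal sketch of the suspected problem (hypothetical names, not the actual cds-access-client code): if the request timer is cancelled when the connection enters the reconnecting state and is never re-armed on the replacement connection, entries already queued can never time out.

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the suspected bug, not the real AbstractClientConnection.
final class ConnectionSketch {
    private ScheduledFuture<?> timeoutTimer;

    void enterReconnecting(final ScheduledExecutorService scheduler) {
        // The original timer is stopped when reconnection starts ...
        if (timeoutTimer != null) {
            timeoutTimer.cancel(false);
        }
        // ... so it has to be re-armed for the reconnecting connection as well,
        // otherwise requests already in the queue sit there forever.
        timeoutTimer = scheduler.schedule(this::expireStaleRequests, 2, TimeUnit.MINUTES);
    }

    private void expireStaleRequests() {
        // fail queued requests whose age exceeds the request timeout
    }
}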

Comment by Robert Varga [ 29/May/17 ]

I have noticed the following in what looks to be a user thread:

2017-05-29 03:13:03,006 | DEBUG | pool-30-thread-1 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | Sleeping for 2654453ms

This means that the thread is being blocked for almost 45 minutes, and that delay came from the ProgressTracker. This is the first 'Sleeping for' message, which indicates we were not exerting backpressure while the connection was connecting. That process finished just a millisecond earlier:

2017-05-29 03:13:03,005 | INFO | ult-dispatcher-4 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.0.SNAPSHOT | member-1-frontend-datastore-config: replaced connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-1620047755], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=true}} with ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka.tcp://opendaylight-cluster-data@10.29.15.15:2550/user/shardmanager-config/member-2-shard-default-config#-20940815], sessionId=1, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree present=false}}

We clearly need to put an upper bound on sleeping, but we also need to fix the backpressure: the client should have been throttled before that point (i.e. there should have been earlier 'Sleeping for' messages).
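
A minimal sketch of the first point, bounding the sleep (hypothetical names and value; the real delay is computed by the ProgressTracker inside cds-access-client):

import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: cap whatever delay the progress tracker computed.
final class BoundedBackpressure {
    // Assumed upper bound on a single backpressure sleep, not the real constant.
    private static final long MAX_SLEEP_NANOS = TimeUnit.SECONDS.toNanos(5);

    static long boundedDelay(final long trackerDelayNanos) {
        // Never block a user thread for tens of minutes just because the queue
        // grew while the connection was reconnecting.
        return Math.min(trackerDelayNanos, MAX_SLEEP_NANOS);
    }
}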

Comment by Robert Varga [ 29/May/17 ]

Note this is the 'reconnecting' case: the connection has previously been connected and there is a forwarder involved.

Comment by Robert Varga [ 01/Jun/17 ]

https://git.opendaylight.org/gerrit/57996

Comment by Peter Gubka [ 02/Jun/17 ]

(In reply to Robert Varga from comment #13)
> https://git.opendaylight.org/gerrit/57996

The problem is still present.

Job: https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-only-carbon-4th/2/
Problem: https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon-4th/2/archives/log.html.gz#s1-s2-t1-k2-k17-k1-k2-k1
Zip built from: controller=96/57996/5:99/57699/5

Comment by Robert Varga [ 03/Jun/17 ]

https://git.opendaylight.org/gerrit/58194

Comment by Vratko Polak [ 06/Jun/17 ]

> https://git.opendaylight.org/gerrit/58194

A build that had the 58194/3 chain with 57699/5 on top still failed [18] on Sandbox.
That run was without debug messages, so karaf.log [19] does not show much between 07:29:35,869 and 07:32:13,384.

[18] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/log.html.gz#s1-s16-t1-k2-k16-k1-k1
[19] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/4/odl1_karaf.log.gz

Comment by Vratko Polak [ 07/Jun/17 ]

> Build that had 58194/3 chain with 57699/5 on top

The same run with some CDS-related debug messages. The karaf.log [20] of the member with the transaction producer and the original prefix shard leader is huge.
The transaction producer starts at 18:49:52,337, the impending isolation is announced at 18:50:02,850, progress stops at 18:50:03,842, and the rejoin (after the test failed) is announced at 18:52:23,138.

[20] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl3_karaf.log.gz

Comment by Robert Varga [ 08/Jun/17 ]

The victim transaction should be:

2017-06-06 18:51:56,112 | WARN | lt-dispatcher-20 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Current transaction member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0 has timed out after 15000 ms in state COMMIT_PENDING
2017-06-06 18:51:56,112 | WARN | lt-dispatcher-20 | ShardDataTree | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | member-3-shard-id-ints!-config: Transaction member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0 is still committing, cannot abort

Its request happened here:

2017-06-06 18:50:03,702 | DEBUG | lt-dispatcher-28 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Transaction proxy LocalReadWriteProxyTransaction{identifier=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, state=FLUSHED} sending request CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#531635654], coordinated=false} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$128/1535548548@1a032e87


and the timeout occurred pretty much spot-on:

2017-06-06 18:52:03,732 | DEBUG | lt-dispatcher-27 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-Shard-id-ints!: running timer on ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-id-ints!-config#821446964], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=id-ints!, dataTree present=true}}
2017-06-06 18:52:03,732 | DEBUG | lt-dispatcher-27 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.1.SNAPSHOT | member-3-frontend-datastore-Shard-id-ints!: Connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-3-frontend-datastore-Shard-id-ints!, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-3-shard-id-ints!-config#821446964], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=id-ints!, dataTree present=true}} timed out entry TransmittedConnectionEntry{request=CommitLocalTransactionRequest{target=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$m#531635654], coordinated=false}, enqueuedTicks=5430269723299, sessionId=0, txSequence=5706}

and it was routed to the Future:

2017-06-06 18:52:03,815 | DEBUG | lt-dispatcher-27 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.1.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-3-datastore-Shard-id-ints!-fe-0-chn-5-txn-2853-0, state=FLUSHED} directCommit completed
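
For reference, the gap between sending the CommitLocalTransactionRequest (18:50:03,702) and the timer expiring the entry (18:52:03,732) is almost exactly the 2-minute request timeout mentioned earlier; a quick check:

// Quick check of the timestamps quoted above.
public final class TimeoutGap {
    public static void main(final String[] args) {
        final java.time.Duration gap = java.time.Duration.between(
            java.time.LocalTime.parse("18:50:03.702"),
            java.time.LocalTime.parse("18:52:03.732"));
        System.out.println(gap); // PT2M0.03S, i.e. 120.030 seconds
    }
}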

Comment by Vratko Polak [ 15/Jun/17 ]

This has recently been happening very reliably, in every test where the client is isolated for a longer time, except for one combination. That combination is a transaction writer (module-based shard) using a transaction chain (as opposed to free-standing transactions). That combination has been passing for the last 7 days.

This might suggest that the core works correctly, but that the test driver does not detect the first transaction failure correctly.
The test drivers are ProduceTransactionsHandler and WriteTransactionsHandler (the latter has a separate codepath for failing the RPC call when a transaction chain fails).
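
A minimal sketch of the difference between the two codepaths (invented names, not the actual clustering-it-provider code): a free-standing writer only learns about a failure from the per-transaction commit future, while a chained writer also gets a direct chain-failure callback.

import java.util.function.Consumer;

// Hypothetical illustration only.
final class WriterSketch {
    interface CommitFuture { void whenFailed(Consumer<Throwable> callback); }
    interface ChainFailureCallback { void onChainFailed(Throwable cause); }

    // Free-standing transactions: the only failure signal is the commit future,
    // which may not fire until the request timeout expires (or, per this Bug, at all).
    void submitFreeStanding(final CommitFuture commitFuture) {
        commitFuture.whenFailed(this::failRpc);
    }

    // Chained transactions: the chain reports the first failure directly,
    // so the RPC can be failed as soon as the chain breaks.
    ChainFailureCallback chainCallback() {
        return this::failRpc;
    }

    private void failRpc(final Throwable cause) {
        // mark the write-transactions / produce-transactions RPC as failed
    }
}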

Comment by Vratko Polak [ 27/Jun/17 ]

Sandbox run with multiple test cases relevant to this Bug, verbose logs and 59527/1 codebase: [21]

[21] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/5/

Comment by Robert Varga [ 27/Jun/17 ]

https://git.opendaylight.org/gerrit/59576

Comment by Vratko Polak [ 28/Jun/17 ]

The latest Sandbox run, one test case (waiting time increased to 30 minutes) against codebase [22] (with cherry-pick [23]) and with verbose logs, still shows [24] that a transaction writer using simple transactions (not chains) does not report a failure when the node (member-2, containing both the shard leader and the writer) is isolated indefinitely.

[22] https://git.opendaylight.org/gerrit/#/c/59604/1
[23] https://git.opendaylight.org/gerrit/#/c/59575/1
[24] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/11/log.html.gz#s1-s2-t1-k2-k16-k1

Comment by Vratko Polak [ 29/Jun/17 ]

> one test case

The same test case is still failing [25] on releng, but the other 6 test cases have passed on releng today.

[25] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/765/log.html.gz#s1-s32-t3-k2-k16-k1-k1

Comment by Vratko Polak [ 30/Jun/17 ]

> other 6 test cases have passed

Today only 5 of them passed.
The additional failing case is [26].

[26] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/767/log.html.gz#s1-s32-t7-k2-k16-k1-k1

Comment by Vratko Polak [ 03/Jul/17 ]

Still present, with two test cases failing.
Sandbox links to Robot expecting, but not seeing, the writer response: [27] [28].

[27] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/1/log.html.gz#s1-s2-t3-k2-k16-k1-k1
[28] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/1/log.html.gz#s1-s2-t7-k2-k16-k1

Comment by Vratko Polak [ 03/Jul/17 ]

With [29] the message [30] is different:
"Backend timeout in state READY after 15000ms" is returned 35 second after the leader movement.

[29] https://git.opendaylight.org/gerrit/#/c/59895/1
[30] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-ddb-expl-lead-movement-longevity-only-carbon/5/log.html.gz#s1-s2-t1-k2-k1-k1-k1-k1-k1-k1-k2-k1-k1-k2-k10

Comment by Vratko Polak [ 17/Jul/17 ]

This week's Sandbox run: [31].

[31] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-ci-only-carbon/2/

Comment by Vratko Polak [ 20/Jul/17 ]

I think I have a fix [32], but it probably needs a rebase and a better commit message.

[32] https://git.opendaylight.org/gerrit/#/c/60606/2

Comment by Vratko Polak [ 24/Jul/17 ]

> [32] https://git.opendaylight.org/gerrit/#/c/60606/2

Not merged yet; the Bug is still present on releng [33].

[33] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/791/log.html.gz#s1-s32-t3-k2-k16-k1-k1

Comment by Vratko Polak [ 26/Jul/17 ]

[32] is merged; tests are mostly passing, but not at 100%.

Here is a recent releng failure: [34].

[34] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/794/log.html.gz#s1-s34-t7-k2-k16-k1-k1

Comment by Vratko Polak [ 26/Jul/17 ]

> recent releng failure: [34].

Oh, that happened for prefix-based shards, but before the fix those tests were passing (only module-based tests were failing on this Bug).

So that may be a similar but different Bug. More investigation is under way.

Comment by Vratko Polak [ 26/Jul/17 ]

> More investigation

This does not seem to be reproducible on Sandbox so far (runs 4-10 of [35]),
and (as usual for this Bug) nothing suspicious is visible at INFO level in the releng logs.

I suspect there is some inconsistency in the current implementation of produce-transactions which needs a specific timing to cause this symptom.

[35] https://jenkins.opendaylight.org/sandbox/job/controller-csit-3node-clustering-ci-only-carbon/

Comment by Vratko Polak [ 26/Jul/17 ]

> Does not seem to be reproducible on Sandbox so far (runs 4-10 of [35])

Run 15 succeeded in reproducing the failure [36].
In the huge karaf.log [37] I do not see anything suspicious before the RequestTimeoutException makes it into AbstractTransactionHandler#txFailure, but then possibly something goes wrong, as there is nothing interesting logged after:

2017-07-26 16:04:02,856 | DEBUG | pool-31-thread-2 | AbstractTransactionHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@bee1b2c Future #5634 failed in state RUNNING cause org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.024877455seconds

Note that this means checkComplete was never called, even though completingFuture has not been cancelled. (A sketch of the expected flow follows the links below.)
I will try to reproduce with more logging.

[36] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/15/log.html.gz#s1-s2-t3-k2-k16-k1-k1
[37] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/15/odl2_karaf.log.gz
[38] https://git.opendaylight.org/gerrit/#/c/60187/15
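
A minimal sketch of the flow that appears to be broken, using only names already mentioned in this comment (the surrounding structure is assumed, not the actual AbstractTransactionHandler code):

// Hypothetical illustration: every terminal callback must reach checkComplete(),
// otherwise completingFuture is never finished and the test driver never reports the failure.
abstract class TransactionHandlerSketch {
    final void txFailure(final Throwable cause) {
        // remember the first failure so the RPC can report it ...
        recordFailure(cause);
        // ... and always fall through to the completion check; if this step is
        // skipped, the RPC future stays pending even though no more work will be done.
        checkComplete();
    }

    abstract void recordFailure(Throwable cause);

    abstract void checkComplete();
}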

Comment by Vratko Polak [ 26/Jul/17 ]

> I will try to reproduce with more logging.

Reproduced. This time the karaf.log [39] shows that the progress stops when closing the itemProducer. I will try to repeat with even more logs [40].

2017-07-26 17:28:10,129 | TRACE | pool-31-thread-2 | ProduceTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | org.opendaylight.controller.clustering.it.provider.impl.ProduceTransactionsHandler@30b30b0 Entering runFailed.

[39] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-ci-only-carbon/18/odl1_karaf.log.gz
[40] https://git.opendaylight.org/gerrit/#/c/60783/1

Comment by Vratko Polak [ 27/Jul/17 ]

> Will try to repeat with even more logs [40].

No success reproducing with new trace logs yet.

> progress stops when closing itemProducer

This feels different enough from the previous comments on this Bug, so I opened CONTROLLER-1745.
