[CONTROLLER-1708] write-transactions does not respond (no evident reason) Created: 07/Jun/17  Updated: 25/Jul/23  Resolved: 24/Jul/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Robert Varga
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8619

 Description   

This is basically CONTROLLER-1672, but that one evolved into issues with datastore and jolokia reads; this one is just a missing RPC response.
CONTROLLER-1665 is basically the same, but it was seen in different scenarios and is perhaps fixed by now, so I am not sure whether this is a duplicate.

This happened on Sandbox, when the original module-based leader member-1 was isolated for a brief time. The unresponsive member was member-2, which stayed a Follower, but it is not clear whether it was considering member-3 as a Leader.
Karaf.log [0] is huge; the test started at 17:52:27,543.

Writers on the other two members have reported [1] success.

[0] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/odl2_karaf.log.gz
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-only-carbon/6/log.html.gz#s1-s10-t1-k2-k25-k1-k1



 Comments   
Comment by Vratko Polak [ 09/Jun/17 ]

This is now happening [2] on RelEng. The unresponsive member-2 is the one which stayed follower.
Still nothing obvious in karaf.log [3] between 06:26:02,391 and 06:35:27,863.

[2] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/log.html.gz#s1-s28-t1-k2-k25-k1-k1
[3] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/737/odl2_karaf.log.gz

Comment by Vratko Polak [ 09/Jun/17 ]

Note that this time Robot waits for the duration the writer should work, plus three 125 s timeouts, plus 10 seconds just to be sure something is really wrong.
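
For concreteness, a minimal sketch of that arithmetic (the 60 s writer duration below is just an illustrative value; only the three 125 s timeouts and the 10 s margin come from the test setup):

import java.time.Duration;

// Rough model of the overall Robot wait described above.
public class RobotWaitEstimate {
    public static void main(String[] args) {
        Duration writerDuration = Duration.ofSeconds(60);      // hypothetical per-test writer duration
        Duration requestTimeouts = Duration.ofSeconds(125).multipliedBy(3);
        Duration safetyMargin = Duration.ofSeconds(10);

        Duration totalWait = writerDuration.plus(requestTimeouts).plus(safetyMargin);
        // With a 60 s writer this comes to 445 s before Robot declares the RPC unresponsive.
        System.out.println("Robot waits up to " + totalWait.getSeconds() + " s");
    }
}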

Comment by Vratko Polak [ 15/Jun/17 ]

This is still happening; it does not matter whether the isolation time is brief [4] or long [5]. Only the member which stays a follower acts this way, and only for module-based shards (only the tell-based protocol is tested).

Prefix-based shard tests either pass, or run into CONTROLLER-1687.

[4] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/749/log.html.gz#s1-s28-t1-k2-k25-k1-k2-k3
[5] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/749/log.html.gz#s1-s28-t3-k2-k25-k1-k9-k2

Comment by Vratko Polak [ 27/Jun/17 ]

Sandbox run [6] with verbose debug logs, on codebase [7], reproducing the first two failures as they happen on releng.

[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/5/
[7] https://git.opendaylight.org/gerrit/#/c/59527/1

Comment by Vratko Polak [ 30/Jun/17 ]

This is still happening, but in the test (3 writers, previous Leader isolated for longer than 2 minutes) where CONTROLLER-1734 appears too. Releng [8] and Sandbox [9] had different members (relative to leader placement) show this, so it would be better to fix CONTROLLER-1734 first. This may even be an extreme case of CONTROLLER-1734 where the RequestTimeout would come only after a very long time.

[8] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/767/log.html.gz#s1-s28-t3-k2-k25-k1-k8
[9] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/17/log.html.gz#s1-s2-t3-k2-k25-k1-k8

Comment by Vratko Polak [ 03/Jul/17 ]

Recent Sandbox occurrence: [10].
I mean on member-1; the other members exhibit CONTROLLER-1734 instead.
Currently that is the only test case showing this Bug.

[10] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/1/log.html.gz#s1-s2-t3-k2-k25-k1-k8

Comment by Vratko Polak [ 07/Jul/17 ]

Based on one Sandbox run, code [11] may be fixing [12] this Bug.

[11] https://git.opendaylight.org/gerrit/#/c/60033/5
[12] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/4/log.html.gz#s1-s2-t3-k2-k25-k1-k8

Comment by Vratko Polak [ 10/Jul/17 ]

> code [11] may be fixing [12] this Bug.

No. This week's Sandbox run on the same .zip shows [13] a read timeout again.

[13] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/1/log.html.gz#s1-s2-t3-k2-k25-k1-k8

Comment by Vratko Polak [ 17/Jul/17 ]

This week's Sandbox run (codebase [14]) fails [15]; huge log [16].

As far as I can see, every future completed successfully, so I have no idea why the writer has not returned success. I will retry with extended logs [17].

[14] https://git.opendaylight.org/gerrit/#/c/59506/32
[15] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/2/log.html.gz#s1-s2-t3-k2-k25-k1-k8
[16] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/2/odl2_karaf.log.gz
[17] https://git.opendaylight.org/gerrit/#/c/60187/2

Comment by Vratko Polak [ 17/Jul/17 ]

> every future completed successfully

That was a case where this Bug hit the member which stayed a follower.

In the next run [18], however, this Bug happened on the old leader after rejoin (following a long isolation).
We may have two different bugs with the same Robot symptom here.

[18] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/4/

Comment by Vratko Polak [ 18/Jul/17 ]

Sandbox keeps reproducing [19] the variant with the old-leader failure.
In the karaf.log [20] I see that the second writer got stuck on line 147 [21], and also that ReconnectingClientConnection has never reconnected after rejoin.

I will keep looking into this some more.

[19] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/8/log.html.gz#s1-s2-t1-k2-k25-k1-k8
[20] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/8/odl3_karaf.log.gz
[21] https://git.opendaylight.org/gerrit/#/c/60187/5/opendaylight/md-sal/samples/clustering-test-app/provider/src/main/java/org/opendaylight/controller/clustering/it/provider/impl/WriteTransactionsHandler.java@146

Comment by Vratko Polak [ 20/Jul/17 ]

> ReconnectingClientConnection has never reconnected after rejoin.

It looks like in lockedCheckTimeout we are completing (failing) connection entries, which leads to a purge request entering the queue and the thread going to sleep.
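
A minimal, self-contained sketch of the suspected interaction (class and method names here are illustrative, not the actual cds-access-client code): the timer path fails a timed-out entry, the entry's callback immediately enqueues a purge request, and because the queue is already at capacity the enqueue puts the actor thread to sleep.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative model only; names do not match the real cds-access-client classes.
public class TimeoutPurgeSketch {
    static final int CAPACITY = 1000;
    final Queue<String> queue = new ArrayDeque<>();

    // Mirrors the lockedCheckTimeout() idea: fail the timed-out entry, which fires its callback.
    void checkTimeout(String timedOutRequest, Consumer<String> failureCallback) {
        failureCallback.accept(timedOutRequest);
    }

    // The failure path initiates a purge, which goes through the same queue.
    void enqueue(String purgeRequest) throws InterruptedException {
        if (queue.size() >= CAPACITY) {
            // Corresponds to "Queue is at capacity, delayed sending of request ..."
            // followed by "Sleeping for 498ms on connection ..." in the log segment below.
            TimeUnit.MILLISECONDS.sleep(498);
        }
        queue.add(purgeRequest);
    }

    public static void main(String[] args) throws InterruptedException {
        TimeoutPurgeSketch sketch = new TimeoutPurgeSketch();
        // Pre-fill the queue to capacity so the purge request hits the sleep path.
        for (int i = 0; i < CAPACITY; i++) {
            sketch.queue.add("request-" + i);
        }
        sketch.checkTimeout("CommitLocalTransactionRequest", request -> {
            try {
                sketch.enqueue("TransactionPurgeRequest for " + request);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Purge enqueued only after sleeping; the actor thread was blocked meanwhile.");
    }
}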

Segment of karaf.log [22]:

2017-07-20 10:38:35,854 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: Connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} timed out entry TransmittedConnectionEntry{request=CommitLocalTransactionRequest{target=member-1-datastore-config-fe-0-chn-2-txn-4402-0, sequence=0, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-451310230], coordinated=false}, enqueuedTicks=443234413299, sessionId=0, txSequence=7862}
2017-07-20 10:38:35,854 | WARN | lt-dispatcher-25 | ConcurrentDOMDataBroker | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Tx: DOM-CHAIN-1-4402 Error during phase CAN_COMMIT, starting Abort
org.opendaylight.controller.cluster.access.client.RequestTimeoutException: Timed out after 120.022340136seconds
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.lockedCheckTimeout(AbstractClientConnection.java:390)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.AbstractClientConnection.runTimer(AbstractClientConnection.java:321)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:123)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.ClientActorBehavior.onReceiveCommand(ClientActorBehavior.java:46)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
at org.opendaylight.controller.cluster.access.client.AbstractClientActor.onReceiveCommand(AbstractClientActor.java:60)[197:org.opendaylight.controller.cds-access-client:1.1.2.SNAPSHOT]
at akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)[173:com.typesafe.akka.actor:2.4.18]
at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:727)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.Eventsourced$class.aroundReceive(Eventsourced.scala:183)[180:com.typesafe.akka.persistence:2.4.18]
at akka.persistence.UntypedPersistentActor.aroundReceive(PersistentActor.scala:168)[180:com.typesafe.akka.persistence:2.4.18]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)[173:com.typesafe.akka.actor:2.4.18]
at akka.actor.ActorCell.invoke(ActorCell.scala:495)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.run(Mailbox.scala:224)[173:com.typesafe.akka.actor:2.4.18]
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[173:com.typesafe.akka.actor:2.4.18]
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[169:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=FLUSHED} directCommit completed
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=FLUSHED}: initiating purge
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=DONE} allocated sequence 1
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractProxyTransaction | 199 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Transaction proxy LocalReadWriteProxyTransaction{identifier=member-1-datastore-config-fe-0-chn-2-txn-4402-0, state=DONE} enqueing request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-4402-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-451310230]} callback org.opendaylight.controller.cluster.databroker.actors.dds.AbstractProxyTransaction$$Lambda$152/951909138@6759f4d3
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: connection ConnectedClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}} has a successor, not scheduling timer
2017-07-20 10:38:35,855 | TRACE | lt-dispatcher-25 | TransmitQueue | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | Queue is at capacity, delayed sending of request TransactionPurgeRequest{target=member-1-datastore-config-fe-0-chn-2-txn-4402-0, sequence=1, replyTo=Actor[akka://opendaylight-cluster-data/user/$a#-451310230]}
2017-07-20 10:38:35,855 | DEBUG | lt-dispatcher-25 | AbstractClientConnection | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: Sleeping for 498ms on connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#879061049], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}

[22] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/13/odl1_karaf.log.gz

Comment by Robert Varga [ 20/Jul/17 ]

The problem now is at TransmitQueue:233, which switches from enqueue to send mechanics. This needs to be investigated more.
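
For readers less familiar with the code, the two modes being referred to look roughly like the sketch below (illustrative names and logic only, not the actual TransmitQueue implementation): while no backend is available requests are only parked, and once a backend appears the queue switches to sending directly, at which point any parked entries have to be drained or they stay stuck.

import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative two-mode queue; not the real TransmitQueue code.
public class TwoModeQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean backendAvailable = false;

    public synchronized void enqueueOrSend(String request) {
        if (!backendAvailable) {
            // Enqueue mode: nothing can be sent yet, so the request is parked.
            pending.add(request);
            return;
        }
        // Send mode: transmit directly.
        transmit(request);
    }

    // Called when the connection resolves a backend; this is the switch-over point.
    public synchronized void backendResolved() {
        backendAvailable = true;
        // Everything parked while in enqueue mode must be drained here, in order,
        // before new requests go out directly; missing this step strands entries.
        String queued;
        while ((queued = pending.poll()) != null) {
            transmit(queued);
        }
    }

    private void transmit(String request) {
        System.out.println("sending " + request);
    }

    public static void main(String[] args) {
        TwoModeQueueSketch queue = new TwoModeQueueSketch();
        queue.enqueueOrSend("TransactionPurgeRequest");        // parked, no backend yet
        queue.backendResolved();                               // drains the parked request
        queue.enqueueOrSend("CommitLocalTransactionRequest");  // sent directly
    }
}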

Comment by Robert Varga [ 20/Jul/17 ]

https://git.opendaylight.org/gerrit/60614

Comment by Vratko Polak [ 21/Jul/17 ]

> https://git.opendaylight.org/gerrit/60614

Merged, but the Bug is still present. Big karaf.log [24] with additional [25] logs.

The scenario starts with transaction writers on each node. Then the Leader is isolated for more than 120 seconds, and its writer fails as expected. Then the member is rejoined and a new writer is started on that member. That is the writer which never returns any RPC result.

Previously that was because it got stuck in the "write only the top list" step (possibly because ReconnectingClientConnection never connected after the rejoin). More details after I take a look into the log.

[24] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/16/odl1_karaf.log.gz
[25] https://git.opendaylight.org/gerrit/#/c/60187/9
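
To illustrate why a writer can fail to return any RPC result at all: the RPC reply is driven by a future that is completed only when the write loop finishes, so if one submit blocks forever the caller never hears back. A minimal sketch of that pattern (illustrative only, not the actual WriteTransactionsHandler code):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of the failure mode: the RPC result future is never completed.
public class RpcResultSketch {
    static CompletableFuture<String> writeTransactions() {
        CompletableFuture<String> rpcResult = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            try {
                // Stands in for a submit/commit that never completes after the rejoin;
                // the complete("success") call below is therefore never reached in time.
                TimeUnit.MINUTES.sleep(10);
                rpcResult.complete("success");
            } catch (InterruptedException e) {
                rpcResult.completeExceptionally(e);
            }
        });
        worker.setDaemon(true);
        worker.start();
        return rpcResult;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> result = writeTransactions();
        try {
            // Robot's HTTP request effectively does the same wait and eventually gives up.
            System.out.println(result.get(5, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            System.out.println("no RPC response");
        }
    }
}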

Comment by Vratko Polak [ 21/Jul/17 ]

> More details after I take a look into the log.

>> ReconnectingClientConnection never connected after the rejoin

Still happens.

Contrary to the previous attempt at 11:02:32,894, the attempt at 11:02:53,934 never failed (nor resolved the new backend), so ReconnectingClientConnection was never replaced by ConnectedClientConnection after the rejoin (which started at 11:03:16,111).

2017-07-21 11:02:53,934 | DEBUG | lt-dispatcher-26 | ClientActorBehavior | 197 - org.opendaylight.controller.cds-access-client - 1.1.2.SNAPSHOT | member-1-frontend-datastore-config: resolving shard 0 connection ReconnectingClientConnection{client=ClientIdentifier{frontend=member-1-frontend-datastore-config, generation=0}, cookie=0, backend=ShardBackendInfo{actor=Actor[akka://opendaylight-cluster-data/user/shardmanager-config/member-1-shard-default-config#-208338531], sessionId=0, version=BORON, maxMessages=1000, cookie=0, shard=default, dataTree=present}}

>> it got stuck in the "write only the top list" step

This time it got stuck after "Submitting list merge".

2017-07-21 11:03:29,669 | TRACE | qtp608302457-332 | WriteTransactionsHandler | 257 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.2.SNAPSHOT | Submitting list merge.

Currently I have no idea why any of this happens; I will experiment some more.
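
For context, the transition that never happened works roughly like the sketch below (illustrative only; the real ClientActorBehavior/AbstractClientConnection logic is actor-based and far more involved): a reconnecting connection keeps trying to resolve the shard backend, and only a successful resolution swaps it for a connected connection, so if resolution never completes the swap never happens and queued requests stay parked.

import java.util.Optional;
import java.util.function.Supplier;

// Illustrative reconnect handling; not the real cds-access-client code.
public class ReconnectSketch {

    // One resolution attempt: the reconnecting connection is replaced only
    // once a backend has actually been resolved.
    static String resolveOnce(String currentConnection, Supplier<Optional<String>> backendResolver) {
        if (currentConnection.startsWith("ConnectedClientConnection")) {
            return currentConnection;
        }
        return backendResolver.get()
                .map(backend -> "ConnectedClientConnection(" + backend + ")")
                .orElse(currentConnection);
    }

    public static void main(String[] args) {
        String connection = "ReconnectingClientConnection";

        // Resolver that never answers, as in the 11:02:53,934 attempt above:
        connection = resolveOnce(connection, Optional::empty);
        System.out.println(connection);  // still ReconnectingClientConnection, requests stay parked

        // Resolver that answers, as a healthy rejoin would:
        connection = resolveOnce(connection, () -> Optional.of("member-1-shard-default-config"));
        System.out.println(connection);  // ConnectedClientConnection(member-1-shard-default-config)
    }
}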

Comment by Vratko Polak [ 21/Jul/17 ]

I have partially good news.

The CONTROLLER-1687 fix [26] makes this Bug disappear as well.
The bad part of the news is that a client application which handles callbacks from a single thread shared with writes (incurring sleeps if the backend is not fast enough) can cause this Bug.

Should a separate bug be opened if the aforementioned fix gets merged?

[26] https://git.opendaylight.org/gerrit/#/c/60606/7
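
To make the "callbacks from a single thread shared with writes" pattern concrete, a minimal sketch (illustrative only, not the actual clustering-test-app code): if commit callbacks are delivered to the same single-threaded executor that submits writes, and a submission blocks because the backend applies backpressure, every callback behind it waits too and nothing completes.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the risky pattern: one thread both submits writes and processes callbacks.
public class SingleThreadCallbackSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService shared = Executors.newSingleThreadExecutor();

        // Commit callback for an earlier transaction, scheduled onto the shared thread.
        shared.execute(() -> System.out.println("earlier commit callback processed"));

        // Write submission on the same thread; if the backend is slow, the client
        // library sleeps here (backpressure), and everything queued behind it waits.
        shared.execute(() -> {
            try {
                TimeUnit.SECONDS.sleep(2);  // stands in for the backpressure sleep
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("write submitted after backpressure");
        });

        // A later commit callback cannot run until the blocked write finishes.
        shared.execute(() -> System.out.println("later commit callback finally processed"));

        shared.shutdown();
        shared.awaitTermination(10, TimeUnit.SECONDS);
    }
}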

Comment by Vratko Polak [ 24/Jul/17 ]

> client application which handles callbacks from a single thread

That should not make a difference.

The test passed [27] today on releng, but the test was not failing every time before, and I believe this Bug is not fixed yet. I will create more runs on Sandbox.

[27] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/791/log.html.gz#s1-s28-t3-k2-k25-k1-k8

Comment by Vratko Polak [ 24/Jul/17 ]

>>> https://git.opendaylight.org/gerrit/60614
>>
>> Merged, but the Bug is still present.

> I believe this Bug is not fixed yet.

Given the weight of the Sandbox passes, I now believe that last week I was testing a different build by mistake, and that [28] has indeed fixed this Bug.

[28] https://git.opendaylight.org/gerrit/#/c/60614/2
