[CONTROLLER-1734] RequestTimeoutException after ~250s after brief isolation Created: 30/Jun/17  Updated: 25/Jul/23  Resolved: 18/Sep/17

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8782

 Description   

The robot symptom is basically the same as in CONTROLLER-1723 but the test scenario is different, so this is perhaps a different bug, especially when the suite when CONTROLLER-1723 appeared first is now passing.

This suite starts transaction writer (duration 180 seconds) on each member, then isolates the leader for 40 seconds, the rejoins it again. It is expected each writer finishes successfully. Two days ago two writers were passing, one failed on CONTROLLER-1708. Yesterday (two releng runs) this bug appeared alongside CONTROLLER-1708, and today only this symptom is there [0]. Two writers failed (161s ans 181s reported), one passed (the one on the new leader), not sure how reliable that is.

Sandbox run with verbose log also sees [1] this bug.

[0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/767/log.html.gz#s1-s28-t1-k2-k25-k1-k1
[1] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/17/log.html.gz#s1-s2-t1-k2-k25-k1-k1



 Comments   
Comment by Vratko Polak [ 03/Jul/17 ]

Recent Sandbox occurrence: [2].

[2] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/1/log.html.gz#s1-s2-t1-k2-k25-k1-k1

Comment by Vratko Polak [ 06/Jul/17 ]

Contrary to most other bugs, this show different symptom with [3] codebase. Instead of RequestTimeoutException, writer responds [4] with "Backend timeout in state READY":

Caused by: java.util.concurrent.TimeoutException: Backend timeout in state READY after 15000ms\n\tat org.opendaylight.controller.cluster.datastore.ShardDataTree.checkForExpiredTransactions(ShardDataTree.java:999)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333)\n\tat org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)\n\tat org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)\n\tat akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)\n\t... 14 more

[3] https://git.opendaylight.org/gerrit/#/c/60032/2
[4] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/3/log.html.gz#s1-s2-t1-k2-k25-k1-k1

Comment by Vratko Polak [ 07/Jul/17 ]

Code [5] does NOT fix [6] this.

[5] https://git.opendaylight.org/gerrit/#/c/60033/5
[6] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/4/log.html.gz#s1-s2-t1-k2-k25-k1-k1

Comment by Vratko Polak [ 10/Jul/17 ]

> Code [5] does NOT fix [6] this.

This weeks run in the same [5] code shows it can cause one writer to change its message.
In this test, each member has a writer, and a leader (member-3) is isolated for a short time needed for the new leader (member-2) to get elected, after which member-3 is rejoined back. Member-2 writer finishes correctly and member-3 writer shows the original symptom of RequestTimeoutException.

But member-1 writer returns [7] the new "Backend timeout in state READY" message. This might help with other bugs predominantly affecting the member which starys follower.

[7] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/1/log.html.gz#s1-s2-t1-k2-k25-k1-k2-k1-k1-k1

Comment by Vratko Polak [ 11/Jul/17 ]

Today, releng showed a different [8] writer response:
{"errors":{"error":[{"error-type":"application","error-tag":"operation-failed","error-message":"Final submit was timed out by the test provider or was interrupted","error-info":"java.util.concurrent.TimeoutException: Collection did not finish in 300 seconds\n\tat org.opendaylight.controller.clustering.it.provider.impl.AbstractTransactionHandler.checkComplete(AbstractTransactionHandler.java:187)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)\n

But looking at karaf.log [9] I believe that is what this bug looks like when "~250s" is longer than 300s.

[8] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/778/log.html.gz#s1-s36-t3-k2-k13-k3-k1-k1
[9] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/778/odl1_karaf.log.gz

Comment by Vratko Polak [ 11/Jul/17 ]

One run on Sandbox with code [10] had the usually failing test pass [11]. Will retry to see it that is reliable.

[10] https://git.opendaylight.org/gerrit/#/c/60137/4
[11] https://logs.opendaylight.org/sandbox/jenkins091/controller-csit-3node-clustering-li-only-carbon/5/log.html.gz#s1-s2-t1-k2-k25-k1-k1

Comment by Robert Varga [ 24/Aug/17 ]

Is this still happening?

Comment by Vratko Polak [ 18/Sep/17 ]

> Is this still happening?

This is no longer happening, marking as fixed.

Generated at Wed Feb 07 19:56:19 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.