[CONTROLLER-1734] RequestTimeoutException after ~250s after brief isolation Created: 30/Jun/17 Updated: 25/Jul/23 Resolved: 18/Sep/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| External issue ID: | 8782 |
| Description |
|
The robot symptom is basically the same as in This suite starts a transaction writer (duration 180 seconds) on each member, then isolates the leader for 40 seconds, then rejoins it. Each writer is expected to finish successfully. Two days ago two writers were passing, one failed on Sandbox run with verbose log also sees [1] this bug. [0] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/767/log.html.gz#s1-s28-t1-k2-k25-k1-k1 |
| Comments |
| Comment by Vratko Polak [ 03/Jul/17 ] |
|
Recent Sandbox occurrence: [2]. |
| Comment by Vratko Polak [ 06/Jul/17 ] |
|
Contrary to most other bugs, this shows a different symptom with the [3] codebase. Instead of RequestTimeoutException, the writer responds [4] with "Backend timeout in state READY": Caused by: java.util.concurrent.TimeoutException: Backend timeout in state READY after 15000ms\n\tat org.opendaylight.controller.cluster.datastore.ShardDataTree.checkForExpiredTransactions(ShardDataTree.java:999)\n\tat org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:333)\n\tat org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270)\n\tat org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveCommand(AbstractUntypedPersistentActor.java:31)\n\tat akka.persistence.UntypedPersistentActor.onReceive(PersistentActor.scala:170)\n\tat org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:104)\n\tat akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)\n\t... 14 more [3] https://git.opendaylight.org/gerrit/#/c/60032/2 |
| Comment by Vratko Polak [ 07/Jul/17 ] |
|
Code [5] does NOT fix [6] this. [5] https://git.opendaylight.org/gerrit/#/c/60033/5 |
| Comment by Vratko Polak [ 10/Jul/17 ] |
|
> Code [5] does NOT fix [6] this. This week's run on the same [5] code shows it can cause one writer to change its message. But the member-1 writer returns [7] the new "Backend timeout in state READY" message. This might help with other bugs predominantly affecting the member which stays follower. |
| Comment by Vratko Polak [ 11/Jul/17 ] |
|
Today, releng showed a different [8] writer response: But looking at karaf.log [9] I believe that is what this bug looks like when "~250s" is longer than 300s. [8] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/778/log.html.gz#s1-s36-t3-k2-k13-k3-k1-k1 |
| Comment by Vratko Polak [ 11/Jul/17 ] |
|
One run on Sandbox with code [10] had the usually failing test pass [11]. Will retry to see if that is reliable. [10] https://git.opendaylight.org/gerrit/#/c/60137/4 |
| Comment by Robert Varga [ 24/Aug/17 ] |
|
Is this still happening? |
| Comment by Vratko Polak [ 18/Sep/17 ] |
|
> Is this still happening? This is no longer happening, marking as fixed. |