[CONTROLLER-1700] Timeout waiting for task from writer started after heal after long isolation Created: 29/May/17 Updated: 25/Jul/23 Resolved: 26/Jun/17 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||||||
| External issue ID: | 8562 | ||||||||||||
| Description |
|
This symptom is very similar to Still three writers (module-based shards, tell-based protocol), and the leader is isolated for longer than the request timeout. The isolated writer fails (with TimeoutException instead of RequestTimeoutException; that might be a separate bug or a cause of this one). Then the member is rejoined, and it is verified that each shard has a Leader and two Followers. At the end of the scenario, we start a writer on the rejoined node and expect it to finish writing without errors. Instead of finishing after 67 seconds, it fails with a TimeoutException. The other two writers finish correctly. In a recent Sandbox test [0] (which failed to upload the archive with logs) the response starts with: |
| Comments |
| Comment by Vratko Polak [ 05/Jun/17 ] |
|
The same behavior is now seen [1] in production Jenkins. Karaf.log [2] from the isolated member-1 shows two suspicious messages. Second, there was a failure from the second writer (started after rejoin). If it is an error in the writer, we can change the suite to use a different list item id.
2017-06-04 14:44:12,349 | WARN | qtp1030155268-77 | WriteTransactionsHandler | 256 - org.opendaylight.controller.samples.clustering-it-provider - 1.5.1.SNAPSHOT | Unable to ensure IdInts list for id: prefix-1 exists.
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-carbon/733/log.html.gz#s1-s28-t3-k2-k25-k1-k1 |
| Comment by Robert Varga [ 05/Jun/17 ] |
|
The test failed because the previous test case ended up not cleaning up properly:
2017-06-04 13:40:57,384 | ERROR | ult-dispatcher-7 | ClusterAdminRpcService | 201 - org.opendaylight.controller.sal-cluster-admin-impl - 1.5.1.SNAPSHOT | Failed to remove replica for shard /(tag:opendaylight.org,2017:controller
Which means the frontend was instantiated multiple times:
2017-06-04 13:40:57,486 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Remove_Follower_Prefix_Shard_Replica_And_Add_It_Back
has been superseded
Further down it led to a failure to bind the producer:
2017-06-04 13:43:20,754 | INFO | h for user karaf | command | 265 - org.apache.karaf.log.command - 3.0.8 | ROBOT MESSAGE: Starting test Produce_Transactions
]} is attached to producer org.opendaylight.mdsal.dom.broker.ShardedDOMDataTreeProducer@517147f9 |
| Comment by Tomas Cere [ 06/Jun/17 ] |
|
(In reply to Robert Varga from comment #2)
> Which means the frontend was instantiated multiple times:
I checked the logs, and I don't see how the frontend can be instantiated twice. Once the removal of the replica fails, the test doesn't attempt to add the replica back. Even if it did, adding a replica of a shard doesn't create a new frontend, since this is completely separate from the layer that handles the interaction with the ShardingService. |
| Comment by Vratko Polak [ 12/Jun/17 ] |
|
A symptom similar to this (except the restconf call did not finish, instead of returning TimeoutException) happened together with |