[GENIUS-166] Genius CSIT Intermittent RESTCONF ReadTimeOut Errors for POST/DELETE requests Created: 31/May/18 Updated: 27/Jun/18 Resolved: 27/Jun/18 |
|
| Status: | Resolved |
| Project: | genius |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | High |
| Reporter: | Faseela K | Assignee: | Tom Pantelis |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node, csit:failures |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Description |
|
Genius Fluorine CSIT has recently been hitting intermittent RESTCONF ReadTimeout errors on some of the DELETE/POST requests: https://jenkins.opendaylight.org/sandbox/job/jamo-genius-csit-1node-gate-all-fluorine/
We see the below exceptions in the karaf log on all failing runs, though it is not certain they are the cause of the failure:
2018-05-30T15:08:06,198 | WARN | opendaylight-cluster-data-shard-dispatcher-88 | ShardDataTree | 240 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-1-shard-default-config: Current transaction member-1-datastore-config-fe-0-txn-1477-0 has timed out after 19233 ms in state CAN_COMMIT_COMPLETE
2018-05-30T15:08:06,198 | WARN | opendaylight-cluster-data-shard-dispatcher-65 | ShardDataTree | 240 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-1-shard-inventory-config: Current transaction member-1-datastore-config-fe-0-txn-1478-0 has timed out after 19234 ms in state READY
2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-88 | Shard | 232 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-1-shard-inventory-config: Cannot canCommit transaction member-1-datastore-config-fe-0-txn-1478-0 - no cohort entry found
2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-65 | Shard | 232 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-1-shard-default-config: Cannot commit transaction member-1-datastore-config-fe-0-txn-1477-0 - no cohort entry found |
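For context, here is a minimal sketch (not the actual CSIT/Robot Framework code) of the kind of RESTCONF call that surfaces as a ReadTimeout in the failing runs; the controller address, credentials, and ITM URL are illustrative assumptions based on the default ODL setup.

```python
# Minimal sketch, not the actual CSIT/Robot Framework code: a RESTCONF DELETE
# with a short read timeout, the kind of request that intermittently fails here.
# Controller address, credentials and the ITM URL are illustrative assumptions.
import requests

BASE = "http://127.0.0.1:8181/restconf/config"
AUTH = ("admin", "admin")  # assumed default ODL credentials
url = BASE + "/itm:transport-zones/transport-zone/TZA"

try:
    resp = requests.delete(url, auth=AUTH, timeout=1.0)  # 1 s read timeout
    resp.raise_for_status()
except requests.exceptions.ReadTimeout:
    # The controller accepted the connection but did not reply within 1 s;
    # in the failing runs this coincides with the shard transaction timeouts above.
    print("RESTCONF request timed out")
```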
| Comments |
| Comment by Michael Vorburger [ 01/Jun/18 ] |
|
Perhaps this is related to & caused by |
| Comment by Michael Vorburger [ 01/Jun/18 ] |
|
tpantelis, in this reply, also pointed to https://git.opendaylight.org/gerrit/#/c/72525/ - is this still happening with that patch? |
| Comment by Faseela K [ 01/Jun/18 ] |
|
Ran with the DEBUG logs requested by tpantelis. With the logs enabled, this particular error showed up in one of the runs. I had to run two suites, as the issue did not occur that frequently when I ran only one suite.
after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.30.170.18', port=8181): Read timed out. (read timeout=1.0)",)': /restconf/config/itm:transport-zones/transport-zone/TZA/
tpantelis, after taking a look at the logs, said that "there's an indirect deadlock scenario in CDS wrt txns that span multiple shards. I'll work on a patch soon. No ETA yet..." |
| Comment by Tom Pantelis [ 04/Jun/18 ] |
| Comment by Robert Varga [ 04/Jun/18 ] |
|
The patch looks good as a stop-gap. In order to allow concurrency across shard-sets, we'd need a per-shard lock somewhere and acquire the locks in order. We can do that later if it proves to be a problem. |
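To illustrate the lock-ordering idea mentioned above (a sketch only, not the actual CDS code): if every transaction that spans several shards acquires the per-shard locks in one canonical order, two transactions touching overlapping shard sets cannot deadlock on each other.

```python
# Sketch only, not the actual CDS implementation: per-shard locks acquired in a
# fixed global order (sorted by shard name) so transactions that span multiple
# shards cannot deadlock each other.
import threading
from contextlib import ExitStack

shard_locks = {name: threading.Lock() for name in ("default", "inventory", "topology")}

def lock_shards(shard_names):
    """Acquire the locks for the given shards in canonical (sorted) order."""
    stack = ExitStack()
    for name in sorted(set(shard_names)):
        stack.enter_context(shard_locks[name])
    return stack

# A transaction spanning the default and inventory shards:
with lock_shards(["inventory", "default"]):
    pass  # work touching both shards happens while both locks are held
```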
| Comment by sathwik boggarapu [ 11/Jun/18 ] |
|
As per tpantelis' latest reply: "The easiest solution to avoid this issue, and probably help with performance, is to just go with one shard like you've been talking about anyway - that's easy enough - just customize the modules-shards config. Otherwise, to optimize for sharding you really have to understand and manage the app's access patterns, but it seems the genius/netvirt patterns are too complex and erratic for that." See the config sketch after this comment.
|
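For reference, a rough sketch of what "going with one shard" could look like, assuming the stock OpenDaylight configuration/initial/modules-shards.conf format; the member names are illustrative, and the matching per-module entries would typically also be removed from modules.conf so that all data falls back to the single default shard.

```
# Rough sketch of a single-shard modules-shards.conf for a three-node cluster
# (illustrative; member names must match akka.conf, and the corresponding
# entries would also be dropped from modules.conf so everything maps to "default").
module-shards = [
    {
        name = "default"
        shards = [
            {
                name = "default"
                replicas = ["member-1", "member-2", "member-3"]
            }
        ]
    }
]
```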
| Comment by Michael Vorburger [ 14/Jun/18 ] |
|
As per tpantelis, k.faseela, can you verify and close this? |
| Comment by Faseela K [ 17/Jun/18 ] |
|
vorburger: c/72874 is not merged yet. Once the patch is merged, we will run the CSIT in a loop to verify whether the issue is fixed. |
| Comment by Jamo Luhrsen [ 21/Jun/18 ] |
|
k.faseela, this sandbox job will run every 20m with the distro from c/72874. I am very curious to see what we get. I think I have stumbled across similar failures in other suites recently. |
| Comment by Faseela K [ 21/Jun/18 ] |
|
jluhrsen, the patch is merged in master. tpantelis: any plans to get this into stable/oxygen? |
| Comment by Faseela K [ 21/Jun/18 ] |
|
jluhrsen, can you edit the sandbox job to run only the Configure_ITM suite? It looks like the newly added suite has some other failures, and that might be a distraction. |
| Comment by Jamo Luhrsen [ 21/Jun/18 ] |
|
k.faseela, the sandbox job is edited now. It will not use the custom distro, since the patch was merged yesterday. (BTW, I thought we [...]) BUT: I did look at all the failures in the sandbox from overnight, and none of them were because of this issue. Seems like we are good - big thanks to tpantelis for the fix. |
| Comment by Faseela K [ 26/Jun/18 ] |
|
jluhrsen: I think the CSIT runs are looking good, at least from this error's perspective. Should we close this Jira now? |
| Comment by Jamo Luhrsen [ 26/Jun/18 ] |
|
agreed |