[GENIUS-166] Genius CSIT Intermittent RESTCONF ReadTimeOut Errors for POST/DELETE requests Created: 31/May/18  Updated: 27/Jun/18  Resolved: 27/Jun/18

Status: Resolved
Project: genius
Component/s: None
Affects Version/s: None
Fix Version/s: Fluorine

Type: Bug Priority: High
Reporter: Faseela K Assignee: Tom Pantelis
Resolution: Done Votes: 0
Labels: csit:3node, csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
is blocked by NETCONF-546 404 returning empty response Resolved
is blocked by CONTROLLER-1836 Deadlock scenario with multi-shard tr... Verified

 Description   

Genius Fluorine CSIT has recently been hitting intermittent RESTCONF ReadTimeout errors on some of the DELETE/POST requests.

https://jenkins.opendaylight.org/sandbox/job/jamo-genius-csit-1node-gate-all-fluorine/

Documentation: Send a DELETE request on the session object found using the
Start / End / Elapsed: 20180531 04:34:26.956 / 20180531 04:34:31.569 / 00:00:04.613
04:34:27.959 WARN Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.30.170.29', port=8181): Read timed out. (read timeout=1.0)",)': /restconf/config/itm:transport-zones/transport-zone/TZA/  
04:34:29.162 WARN Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.30.170.29', port=8181): Read timed out. (read timeout=1.0)",)': /restconf/config/itm:transport-zones/transport-zone/TZA/  
04:34:30.565 WARN Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.30.170.29', port=8181): Read timed out. (read timeout=1.0)",)': /restconf/config/itm:transport-zones/transport-zone/TZA/  
04:34:31.568 FAIL ConnectionError: HTTPConnectionPool(host='10.30.170.29', port=8181): Max retries exceeded with url: /restconf/config/itm:transport-zones/transport-zone/TZA/ (Caused by ReadTimeoutError("HTTPConnectionPool(host='10.30.170.29', port=8181): Read timed out. (read timeout=1.0)",))
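
The requests above appear to be issued through Robot Framework's RequestsLibrary, which sits on top of the Python requests/urllib3 stack that produces the retry messages shown. Below is a minimal client-side sketch of the failing pattern; the host, URL, 1-second read timeout and retry count mirror the log above, but the session wiring itself is an assumption, not the suite's actual code:

    # Sketch only: approximates the client-side behaviour seen in the log,
    # not the suite's actual keywords.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    URL = "http://10.30.170.29:8181/restconf/config/itm:transport-zones/transport-zone/TZA/"

    session = requests.Session()
    session.auth = ("admin", "admin")        # default ODL credentials, assumed
    # 3 retries, matching the Retry(total=2..0) countdown in the log
    session.mount("http://", HTTPAdapter(max_retries=Retry(total=3)))

    try:
        # Read timeout of 1.0 s: if RESTCONF does not answer within a second
        # (e.g. because the backing datastore transaction is stuck), urllib3
        # raises ReadTimeoutError, retries, and finally surfaces the
        # "Max retries exceeded" ConnectionError seen above.
        resp = session.delete(URL, timeout=(5.0, 1.0))
        print(resp.status_code)
    except requests.exceptions.ConnectionError as exc:
        print("FAIL ConnectionError:", exc)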

We see the following exceptions in the karaf log on all failing runs, though it is not clear whether they are the cause of the failure:

     2018-05-30T15:08:06,198 | WARN  | opendaylight-cluster-data-shard-dispatcher-88 | ShardDataTree | 240 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-1-shard-default-config: Current transaction member-1-datastore-config-fe-0-txn-1477-0 has timed out after 19233 ms in state CAN_COMMIT_COMPLETE

     2018-05-30T15:08:06,198 | WARN  | opendaylight-cluster-data-shard-dispatcher-65 | ShardDataTree | 240 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | member-1-shard-inventory-config: Current transaction member-1-datastore-config-fe-0-txn-1478-0 has timed out after 19234 ms in state READY

     2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-88 | Shard | 232 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-1-shard-inventory-config: Cannot canCommit transaction member-1-datastore-config-fe-0-txn-1478-0 - no cohort entry found

2018-05-30T15:08:06,199 | ERROR | opendaylight-cluster-data-shard-dispatcher-65 | Shard                            | 232 - org.opendaylight.controller.sal-clustering-commons - 1.8.0.SNAPSHOT | member-1-shard-default-config: Cannot commit transaction member-1-datastore-config-fe-0-txn-1477-0 - no cohort entry found



 Comments   
Comment by Michael Vorburger [ 01/Jun/18 ]

Perhaps this is related to and caused by NETCONF-546?

Comment by Michael Vorburger [ 01/Jun/18 ]

tpantelis, in this reply, also pointed to https://git.opendaylight.org/gerrit/#/c/72525/ - is this still happening with that patch?

Comment by Faseela K [ 01/Jun/18 ]

Ran with the DEBUG logs requested by tpantelis.

   With the logs enabled, this particular error appeared in one of the runs. I had to run two suites, as the issue was not occurring frequently enough when I ran with only one suite.

 

   after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.30.170.18', port=8181): Read timed out. (read timeout=1.0)",)': /restconf/config/itm:transport-zones/transport-zone/TZA/

 

            https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/87/Karthikagenius-csit-1node-gate-all-fluorine/41/

 

After taking a look at the logs, tpantelis reported that "there's an indirect deadlock scenario in CDS wrt txns that span multiple shards. I'll work on a patch soon. No ETA yet..."
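
To make the "indirect deadlock ... wrt txns that span multiple shards" a bit more concrete, here is a deliberately simplified analogy, with plain Python locks standing in for per-shard commit processing; it illustrates the general lock-ordering hazard only, not the actual CDS code path:

    # Simplified analogy only: two "shards" modelled as locks, two transactions
    # committing across both shards but in opposite order. Each grabs its first
    # shard and then waits for the other, so neither finishes - the RESTCONF
    # client just sees its request time out. Not the real CDS implementation.
    import threading, time

    shard_default = threading.Lock()
    shard_inventory = threading.Lock()

    def commit(txn, first, second):
        with first:
            time.sleep(0.1)                  # both txns now hold one shard each
            if second.acquire(timeout=2):    # real txns also time out (~19 s above)
                second.release()
                print(txn, "committed")
            else:
                print(txn, "timed out waiting for the other shard")

    t1 = threading.Thread(target=commit, args=("txn-1477", shard_default, shard_inventory))
    t2 = threading.Thread(target=commit, args=("txn-1478", shard_inventory, shard_default))
    t1.start(); t2.start(); t1.join(); t2.join()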

Comment by Tom Pantelis [ 04/Jun/18 ]

Submitted https://git.opendaylight.org/gerrit/#/c/72650/

Comment by Robert Varga [ 04/Jun/18 ]

The patch looks good as a stop-gap. In order to allow concurrency across shard-sets, we'd need a per-shard lock somewhere and acquire them in order. We can do that later if it proves to be a problem.
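
The per-shard locks acquired "in order" that Robert mentions are the standard cure for the situation sketched in the previous comment: if every multi-shard transaction takes the shards it touches in one globally consistent order, the circular wait cannot form. A minimal illustration of that ordering discipline, again in plain Python and purely illustrative rather than CDS internals:

    import threading

    shards = {"default": threading.Lock(), "inventory": threading.Lock()}

    # Every transaction acquires the locks of the shards it touches in one
    # global (here: name-sorted) order, so the opposite-order interleaving
    # from the previous sketch can no longer deadlock.
    def commit_ordered(txn, touched):
        locks = [shards[name] for name in sorted(touched)]
        for lock in locks:
            lock.acquire()
        try:
            print(txn, "committed across", ", ".join(sorted(touched)))
        finally:
            for lock in reversed(locks):
                lock.release()

    commit_ordered("txn-1477", ["default", "inventory"])
    commit_ordered("txn-1478", ["inventory", "default"])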

Comment by sathwik boggarapu [ 11/Jun/18 ]

As per tpantelis's latest reply: "The easiest solution to avoid this issue, and probably help with performance, is to just go with one shard like you've been talking about anyway - that's easy enough - just customize the module-shards config. Otherwise, to optimize for sharding you really have to understand and manage the app's access patterns, but it seems the genius/netvirt patterns are too complex and erratic for that."
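
For reference, the single-shard approach would mean trimming the datastore shard layout so that all modules land in the default shard. A hedged sketch of what such a module-shards configuration (the module-shards.conf / modules.conf files under the distribution's configuration/initial/ directory) might look like follows; the file names, member name and exact layout are assumptions about a standard distribution, not something taken from this job:

    # module-shards.conf - hypothetical single-shard layout: only the default
    # shard is defined, and the per-module entries (inventory, topology, ...)
    # would also be dropped from modules.conf so everything maps to "default".
    module-shards = [
        {
            name = "default"
            shards = [
                {
                    name = "default"
                    replicas = ["member-1"]
                }
            ]
        }
    ]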

 

Comment by Michael Vorburger [ 14/Jun/18 ]

According to tpantelis, CONTROLLER-1836 (c/72874) fixes this.

k.faseela, can you verify and close this?

Comment by Faseela K [ 17/Jun/18 ]

vorburger: c/72874 is not merged yet. Once the patch is merged, we will run the CSIT in a loop to verify whether the issue is fixed.

Comment by Jamo Luhrsen [ 21/Jun/18 ]

k.faseela, this sandbox job will run every 20 minutes with the distro from c/72874.

I am very curious to see what we get. I think I have stumbled across similar failures in other suites recently,
so maybe this is happening more often than we think, and not just in genius.

Comment by Faseela K [ 21/Jun/18 ]

jluhrsen, the patch is merged in master. tpantelis: any plans to get this into stable/oxygen?

Comment by Faseela K [ 21/Jun/18 ]

jluhrsen, can you edit the sandbox job to run only the Configure_ITM suite? It looks like the newly added suite has some other failures, and it might distract from this issue.

Comment by Jamo Luhrsen [ 21/Jun/18 ]

k.faseela, the sandbox job is edited now. It will not use the custom distro, since the patch was merged yesterday. (BTW, I thought we
were going to wait for CSIT results before doing that.) I also removed that Configure_ITM suite because, like you said, it creates noise.
You should probably have someone remove that suite and file a Jira to track whatever is making it unstable.

BUT!!!! I did look at all the overnight failures in the sandbox, and none of them were because of this issue. Seems like we are good now.

big thanks to tpantelis for the fix.

Comment by Faseela K [ 26/Jun/18 ]

jluhrsen: I think the CSIT runs are looking good, at least from the perspective of this error. Should we close this Jira now?

Comment by Jamo Luhrsen [ 26/Jun/18 ]

agreed
