[CONTROLLER-1772] Clustering Issues Error writing to datastore and AskTimeoutException Created: 16/Sep/17 Updated: 02/Mar/22 Resolved: 02/Jul/21 |
|
| Status: | Resolved |
| Project: | controller |
| Component/s: | clustering |
| Affects Version/s: | Carbon |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Sai Sindhur Malleni | Assignee: | Oleksii Mozghovyi |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Operating System: All |
| External issue ID: | 9173 |
| Description |
|
Description of problem: When running a Browbeat+Rally scenario (a total of 500 times at varying concurrencies of 8, 16, and 32), we see journal entries piling up in the opendaylightjournal table in the ovs_neutron DB. On inspection, the neutron-server logs show large numbers of HTTP client errors and timeouts. For example, tracing port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, we see:
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal [req-cc3661d4-fb99-47bc-bb85-f64b4d1a496d - - - - -] Error while processing update port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73: ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal Traceback (most recent call last):
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 245, in _sync_entry
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.client.sendjson(method, urlpath, to_send)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 106, in sendjson
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     'body': obj})
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.force_reraise()
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     six.reraise(self.type_, self.value, self.tb)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 98, in sendjson
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.request(method, urlpath, data))
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 91, in request
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     method, url=url, headers=headers, data=data, timeout=self.timeout)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     resp = self.send(prep, **send_kwargs)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     r = adapter.send(request, **kwargs)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 499, in send
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     raise ReadTimeout(e, request=request)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
It looks like networking-odl isn't able to connect to ODL, and on looking at the karaf logs for this port ID on controller-1, we see
1894842:2017-09-14 04:18:43,228 | ERROR | ChangeHandler-16 | MDSALUtil | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Error writing to datastore (path, data) : (KeyedInstanceIdentifier
{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.RouterInterfacesMap, org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.RouterInterfaces[key=RouterInterfacesKey [_routerId=Uuid [_value=4d78c0f1-1db8-42e1-bde8-c48cad4bc351]]], org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces[key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73]]]}
, Interfaces [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, _key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73], augmentation=[]])
Just above this message we also see several AskTimeoutExceptions such as the following:
2017-09-14 04:18:13,208 | ERROR | t-dispatcher-134 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-1-datastore-config-fe-0-txn-175117-0 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#1400650520)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[185:com.typesafe.akka.actor:2.4.18]
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[185:com.typesafe.akka.actor:2.4.18]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[185:com.typesafe.akka.actor:2.4.18]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[185:com.typesafe.akka.actor:2.4.18]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[185:com.typesafe.akka.actor:2.4.18]
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[185:com.typesafe.akka.actor:2.4.18]
    at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]
Due to the state of the ODL cluster, networking-odl has trouble connecting and creating resources, leading to high mysqld usage (because of all the rows piled up) and also very high CPU usage of the karaf process (around 30 cores). We suspect the ODL issues are a likely cause of the high mysqld CPU usage because we were also testing networking-odl with some unmerged patches that delete a journal row as soon as it completes, which gave better performance and lower CPU usage since there were fewer rows to scan.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Logs of karaf on controller-0, 1 and 2: http://8.43.86.1:8088/smalleni/clustering-logs.tar.gz
Port ID that can be used to track the problem: f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73 |
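For reference, the 10-second read timeout in the traceback above is the networking-odl REST client's read timeout. The following is a minimal Python sketch of the same failure mode, not the project's code: the controller address comes from the logs, while the URL path and credentials are placeholder assumptions.

    import requests

    # The controller address comes from the traceback above; the URL path and
    # credentials are placeholders/assumptions for illustration only.
    ODL_URL = "http://172.16.0.17:8081/controller/nb/v2/neutron/ports"
    AUTH = ("admin", "admin")  # placeholder credentials

    try:
        # timeout=(connect, read): networking-odl was using a 10 s read timeout,
        # so any request the controller cannot answer within 10 s raises ReadTimeout.
        resp = requests.get(ODL_URL, auth=AUTH, timeout=(5, 10))
        resp.raise_for_status()
    except requests.exceptions.ReadTimeout as exc:
        # Same exception class as in the neutron-server traceback above.
        print("ODL did not answer within the read timeout: %s" % exc)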
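The journal backlog mentioned in the description can be quantified directly in the ovs_neutron database. This is a hedged sketch only: the database and table names come from this report, the 'state' column is an assumption based on the networking-odl journal schema of that era, and the connection parameters are placeholders.

    import pymysql  # assumed to be available on the controller/neutron host

    # Connection parameters are placeholders; verify the column names against
    # the deployed networking-odl schema before relying on this query.
    conn = pymysql.connect(host="localhost", user="neutron",
                           password="<password>", database="ovs_neutron")
    try:
        with conn.cursor() as cur:
            # Count journal rows by state to see how far processing lags behind.
            cur.execute("SELECT state, COUNT(*) FROM opendaylightjournal GROUP BY state")
            for state, count in cur.fetchall():
                print("%s rows: %d" % (state, count))
    finally:
        conn.close()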
| Comments |
| Comment by Oleksii Mozghovyi [ 07/Apr/21 ] |
|
Hi smalleni@redhat.com, is this problem still happening when integrating with the newer OpenDaylight releases? Could you please point me to the logs (the ticket is pretty old, and the log files link doesn't work anymore)? |
| Comment by Oleksii Mozghovyi [ 07/Apr/21 ] |
|
One more question about the test environment: how much CPU/RAM was allocated to the ODL cluster nodes? |
| Comment by Oleksii Mozghovyi [ 26/May/21 ] |
|
I don't think we can reproduce this kind of issue in our dev/LFN-only environment, so it may be better to close it. |
| Comment by Shibu Vijayakumar [ 02/Mar/22 ] |
|
Hi oleksii.mozghovyi, we are seeing the AskTimeoutException in the current release as well (Phosphorus SR2); see CONTROLLER-2032 (Phosphorus SR1 - AskTimeoutException while trying to mount multiple Netconf devices at a time). Scenario: a 3-node cluster, trying to mount 250 Netconf devices at the same time. The POST requests are processed successfully, but when the real mounting happens, after 40-50 device mounts an AskTimeout error occurs and further mounts are stuck. |
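As a hedged illustration of the scenario above (a workaround sketch, not a fix), the snippet below paces the 250 Netconf mounts in small batches instead of issuing them all at once. The RFC 8040 netconf-topology path and payload fields follow the documented OpenDaylight mount format; the controller address, credentials, device IPs, and batch sizes are placeholder assumptions.

    import json
    import time

    import requests

    # Controller address and credentials are placeholders for illustration.
    BASE = ("http://localhost:8181/rests/data/"
            "network-topology:network-topology/topology=topology-netconf")
    AUTH = ("admin", "admin")
    HEADERS = {"Content-Type": "application/json"}

    def mount(node_id, host, port=830):
        # Payload fields follow the documented OpenDaylight Netconf mount format.
        payload = {"node": [{
            "node-id": node_id,
            "netconf-node-topology:host": host,
            "netconf-node-topology:port": port,
            "netconf-node-topology:username": "admin",  # placeholder
            "netconf-node-topology:password": "admin",  # placeholder
            "netconf-node-topology:tcp-only": False,
        }]}
        url = "%s/node=%s" % (BASE, node_id)
        return requests.put(url, auth=AUTH, headers=HEADERS,
                            data=json.dumps(payload), timeout=30)

    # Issue the 250 mounts in small paced batches rather than all at once, so the
    # config shard is not asked to handle hundreds of device transactions in flight,
    # which is the point at which the ask timeouts are reported in this scenario.
    for i in range(250):
        mount("device-%03d" % i, "10.0.0.%d" % ((i % 250) + 1))
        if i % 10 == 9:
            time.sleep(5)  # arbitrary pause between batches; tune for the environment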