[CONTROLLER-1772] Clustering Issues Error writing to datastore and AskTimeoutException Created: 16/Sep/17  Updated: 02/Mar/22  Resolved: 02/Jul/21

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Sai Sindhur Malleni Assignee: Oleksii Mozghovyi
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 9173

 Description   

Description of problem:
Let me start with describing our setup:
3 OpenStack controllers
3 ODLs clustered
3 Compute nodes

When running a Browbeat+Rally Scenario which does:
1. Create two networks
2. Create two subnets
3. Create router
4. Attach each of these subnets to the router

for a total of 500 iterations at concurrencies of 8, 16, and 32, we see journal entries piling up in the opendaylightjournal table in the ovs_neutron DB. On inspection, the neutron-server log shows a large number of HTTP client errors and timeouts.
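For reference, one iteration of this workflow is roughly the following openstacksdk sketch (the cloud name, resource names, and CIDRs are illustrative assumptions; the actual load is generated by the Browbeat+Rally plugin):

# Minimal sketch of one scenario iteration using openstacksdk (illustrative,
# not the actual Rally plugin code). Cloud name, names and CIDRs are assumed.
import openstack

conn = openstack.connect(cloud="overcloud")  # assumed clouds.yaml entry

router = conn.network.create_router(name="rally-router")
for i, cidr in enumerate(("10.2.1.0/24", "10.2.2.0/24")):
    net = conn.network.create_network(name="rally-net-%d" % i)
    subnet = conn.network.create_subnet(
        network_id=net.id, ip_version=4, cidr=cidr, name="rally-subnet-%d" % i)
    # Attaching the subnet is the router-interfaces update that networking-odl
    # journals and then sends to ODL over HTTP.
    conn.network.add_interface_to_router(router, subnet_id=subnet.id)

Each such iteration produces several journal rows that networking-odl must sync to ODL over its REST client.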

For example, tracing the port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73 through the neutron-server log, we see:

2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal [req-cc3661d4-fb99-47bc-bb85-f64b4d1a496d - - - - -] Error while processing update port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73: ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal Traceback (most recent call last):
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 245, in _sync_entry
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.client.sendjson(method, urlpath, to_send)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 106, in sendjson
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal 'body': obj})
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in _exit_
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.force_reraise()
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal six.reraise(self.type_, self.value, self.tb)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 98, in sendjson
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.request(method, urlpath, data))
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 91, in request
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal method, url=url, headers=headers, data=data, timeout=self.timeout)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal resp = self.send(prep, **send_kwargs)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal r = adapter.send(request, **kwargs)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 499, in send
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal raise ReadTimeout(e, request=request)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
 2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal
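
A quick way to check whether ODL is merely slower than the 10 s client read timeout or is not answering at all is to probe the same northbound endpoint with a longer timeout. A minimal sketch, assuming the default /controller/nb/v2/neutron base path and admin credentials (host and port are taken from the error above):

# Hedged probe of the ODL neutron northbound used by networking-odl.
# URL path and credentials are assumptions (defaults); adjust to the deployment.
import requests
from requests.exceptions import ReadTimeout

url = ("http://172.16.0.17:8081/controller/nb/v2/neutron/ports/"
       "f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73")
try:
    resp = requests.get(url, auth=("admin", "admin"), timeout=60)
    print(resp.status_code, len(resp.content))
except ReadTimeout:
    # Still timing out at 60 s: ODL is effectively not answering,
    # not just exceeding the 10 s timeout networking-odl uses.
    print("read timed out even at 60s")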

It looks like networking-odl isn't able to connect to ODL. Searching the karaf logs on controller-1 for this port ID, we see:

 

1894842:2017-09-14 04:18:43,228 | ERROR | ChangeHandler-16 | MDSALUtil | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Error writing to datastore (path, data) : (KeyedInstanceIdentifier
{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.RouterInterfacesMap, org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.RouterInterfaces[key=RouterInterfacesKey [_routerId=Uuid [_value=4d78c0f1-1db8-42e1-bde8-c48cad4bc351]]], org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces[key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73]]]}
, Interfaces [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, _key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73], augmentation=[]])

 

Just above this message we also see several AskTimeoutExceptions like the following:

2017-09-14 04:18:13,208 | ERROR | t-dispatcher-134 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-1-datastore-config-fe-0-txn-175117-0 on backend
 akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#1400650520)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
 at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[185:com.typesafe.akka.actor:2.4.18]
 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[185:com.typesafe.akka.actor:2.4.18]
 at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
 at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
 at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[185:com.typesafe.akka.actor:2.4.18]
 at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[185:com.typesafe.akka.actor:2.4.18]
 at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[185:com.typesafe.akka.actor:2.4.18]
 at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[185:com.typesafe.akka.actor:2.4.18]
 at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]
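
The 30000 ms in the exception message indicates the default config shard did not acknowledge the commit within 30 s. One way to see what the shard is doing on each member is to read its MBean over Jolokia; a sketch, assuming Jolokia is reachable on the RESTCONF port with admin credentials (attribute names follow the standard ShardStats MBean):

# Sketch: read the raft state, leader and pending-commit queue of the
# default config shard via Jolokia. Port and credentials are assumptions.
import requests

member = "member-1"
mbean = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
         "Category=Shards,name=%s-shard-default-config" % member)
url = "http://172.16.0.17:8181/jolokia/read/" + mbean

value = requests.get(url, auth=("admin", "admin"), timeout=30).json().get("value", {})
for key in ("RaftState", "Leader", "PendingTxCommitQueueSize",
            "CommitIndex", "LastApplied", "FailedTransactionsCount"):
    print(key, "=", value.get(key))

A shard stuck without a leader, or with a growing PendingTxCommitQueueSize, would explain both the AskTimeoutException and the datastore write errors above.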

Due to the state of the ODL cluster, networking-odl has trouble connecting and creating resources. This leads to high mysqld CPU usage (from the rows piling up in the journal table) and very high CPU usage by the karaf process (around 30 cores).

We suspect the ODL issues are behind the high mysqld CPU usage: when we tested networking-odl with some unmerged patches that deleted each journal row as soon as it completed, performance was better and CPU usage was lower because there were fewer rows to scan.
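
To quantify that backlog, one can count journal rows per state directly in the ovs_neutron database; a minimal sketch, assuming local DB access and the standard networking-odl journal schema with a state column:

# Sketch: count opendaylightjournal rows per state to measure the backlog.
# DB credentials are placeholders; the 'state' column is assumed to hold the
# usual networking-odl journal states (pending/processing/completed/failed).
import pymysql

conn = pymysql.connect(host="localhost", user="neutron",
                       password="<password>", database="ovs_neutron")
with conn.cursor() as cur:
    cur.execute("SELECT state, COUNT(*) FROM opendaylightjournal GROUP BY state")
    for state, count in cur.fetchall():
        print(state, count)
conn.close()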

Version-Release number of selected component (if applicable):
OpenStack Pike (12)
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

How reproducible:
Quite consistent

Steps to Reproduce:
1. Deploy OpenStack with ODL as backend
2. Run perf/scale tests that create a large number of neutron resources

Actual results:
Resources aren't being created in ODL; they keep piling up in the opendaylightjournal table due to HTTP connection errors and timeouts against ODL (the default HTTP read timeout is 10s).

Expected results:
No exceptions in the logs, and the resources are created successfully.

Additional info:

Logs of karaf on controller-0, 1, and 2
Logs of neutron-server on controller-0

http://8.43.86.1:8088/smalleni/clustering-logs.tar.gz

Port ID that can be used to trace the problem: f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73



 Comments   
Comment by Oleksii Mozghovyi [ 07/Apr/21 ]

Hi smalleni@redhat.com, is this problem still happening when integrating with newer OpenDaylight releases? Could you please point me to the logs (the ticket is pretty old, and the log file link doesn't work anymore)?

Comment by Oleksii Mozghovyi [ 07/Apr/21 ]

One more question about the test environment: how much CPU/RAM was allocated to the ODL cluster nodes?

Comment by Oleksii Mozghovyi [ 26/May/21 ]

I don't think we can reproduce this kind of issue in our dev/LFN-only environment, so it may be better to close it.

Comment by Shibu Vijayakumar [ 02/Mar/22 ]

Hi oleksii.mozghovyi, we are seeing the AskTimeoutException in the current release as well (Phosphorus SR2); see CONTROLLER-2032 "Phosphorus SR1 - AskTimeoutException while trying to mount multiple Netconf devices at a time".

Scenario: 3-node cluster, trying to mount 250 Netconf devices at the same time. The POST requests are processed successfully, but when the actual mounting happens, after 40-50 device mounts an AskTimeout error occurs and further mounts are stuck.
