Details
Type: Bug
Status: Resolved
Resolution: Cannot Reproduce
Affects Version: Carbon
Operating System: All
Platform: All
9173
Description
Description of problem:
Let me start by describing our setup:
3 OpenStack controllers
3 ODL nodes, clustered
3 compute nodes
When running a Browbeat+Rally scenario that does the following:
1. Create two networks
2. Create two subnets
3. Create router
4. Attach each of these subnets to the router
for a total of 500 iterations at concurrencies of 8, 16, and 32, we see journal entries piling up in the opendaylightjournal table in the ovs_neutron DB. On inspection, the neutron-server logs show a ton of HTTP client errors and timeouts. For example, tracing the port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, we see:
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal [req-cc3661d4-fb99-47bc-bb85-f64b4d1a496d - - - - -] Error while processing update port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73: ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal Traceback (most recent call last):
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 245, in _sync_entry
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.client.sendjson(method, urlpath, to_send)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 106, in sendjson
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     'body': obj})
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.force_reraise()
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     six.reraise(self.type_, self.value, self.tb)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 98, in sendjson
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     self.request(method, urlpath, data))
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 91, in request
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     method, url=url, headers=headers, data=data, timeout=self.timeout)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     resp = self.send(prep, **send_kwargs)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     r = adapter.send(request, **kwargs)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 499, in send
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal     raise ReadTimeout(e, request=request)
2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
It looks like networking-odl isn't able to connect to ODL. Looking at the karaf logs on controller-1 for this port ID, we see:
1894842:2017-09-14 04:18:43,228 | ERROR | ChangeHandler-16 | MDSALUtil | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Error writing to datastore (path, data) : (KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.RouterInterfacesMap, org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.RouterInterfaces[key=RouterInterfacesKey [_routerId=Uuid [_value=4d78c0f1-1db8-42e1-bde8-c48cad4bc351]]], org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces[key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73]]]}, Interfaces [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, _key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73], augmentation=[]])
Just above this message we also see several AskTimeoutExceptions, such as the following:
2017-09-14 04:18:13,208 | ERROR | t-dispatcher-134 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-1-datastore-config-fe-0-txn-175117-0 on backend
akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#1400650520)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[185:com.typesafe.akka.actor:2.4.18]
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[185:com.typesafe.akka.actor:2.4.18]
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[185:com.typesafe.akka.actor:2.4.18]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[185:com.typesafe.akka.actor:2.4.18]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[185:com.typesafe.akka.actor:2.4.18]
	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[185:com.typesafe.akka.actor:2.4.18]
	at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]
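When these Ask timeouts show up, one thing worth checking is whether the default config shard still has a leader on each controller. Below is a minimal Python sketch that reads the shard MBeans through Jolokia; it assumes the odl-jolokia feature is installed and the default admin/admin credentials, and only the first controller IP comes from this report (the others are placeholders).

# Hypothetical diagnostic sketch: read the RaftState/Leader of the default
# config shard on each controller via Jolokia (assumes odl-jolokia is
# installed and admin/admin credentials; only 172.16.0.17 is from this report).
import requests

CONTROLLERS = ["172.16.0.17", "172.16.0.18", "172.16.0.19"]  # placeholder IPs

for member, host in enumerate(CONTROLLERS, start=1):
    mbean = ("org.opendaylight.controller:type=DistributedConfigDatastore,"
             "Category=Shards,name=member-%d-shard-default-config" % member)
    url = "http://%s:8181/jolokia/read/%s" % (host, mbean)
    try:
        attrs = requests.get(url, auth=("admin", "admin"),
                             timeout=10).json().get("value", {})
        print(host, attrs.get("RaftState"), attrs.get("Leader"))
    except requests.RequestException as exc:
        print(host, "unreachable:", exc)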
Due to the state of the ODL cluster, networking-odl has trouble connecting and creating resources, which leads to high mysqld CPU usage (because of all the piled-up rows) and also very high CPU usage of the karaf process (around 30 cores).
We suspect the ODL issues are a possible cause of the high mysqld CPU usage because, when we tested networking-odl with some unmerged patches that delete a journal row as soon as it is completed, we saw better performance and lower CPU usage, since there are fewer rows to scan.
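To confirm that the journal backlog itself is what is keeping mysqld busy, the rows can be counted by state directly in the DB. A minimal sketch, assuming the usual networking-odl journal schema (a state column on opendaylightjournal), PyMySQL, and placeholder credentials:

# Hypothetical sketch: count opendaylightjournal rows by state to gauge the
# backlog (DB credentials are placeholders; the ovs_neutron DB name is from
# this report, the "state" column is assumed from the networking-odl schema).
import pymysql

conn = pymysql.connect(host="localhost", user="neutron",
                       password="secret", db="ovs_neutron")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT state, COUNT(*) FROM opendaylightjournal "
                    "GROUP BY state")
        for state, count in cur.fetchall():
            print(state, count)
finally:
    conn.close()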
Version-Release number of selected component (if applicable):
OpenStack Pike (12)
opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch
How reproducible:
Quite consistent
Steps to Reproduce:
1. Deploy OpenStack with ODL as backend
2. Run perf/scale tests that create large numbers of neutron resources (a rough approximation of one iteration is sketched below)
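For anyone without Browbeat/Rally set up, here is a rough, hypothetical approximation of the workload using openstacksdk; the cloud name "overcloud" and the CIDR scheme are assumptions, and Rally additionally runs the iterations concurrently (8/16/32) and cleans the resources up afterwards.

# Hypothetical approximation of one Rally scenario run: two networks, two
# subnets, one router, both subnets attached, repeated 500 times (assumes
# openstacksdk and an "overcloud" entry in clouds.yaml).
import openstack

conn = openstack.connect(cloud="overcloud")

for i in range(500):
    router = conn.network.create_router(name="perf-router-%d" % i)
    for j in range(2):
        net = conn.network.create_network(name="perf-net-%d-%d" % (i, j))
        subnet = conn.network.create_subnet(
            network_id=net.id, ip_version=4,
            name="perf-subnet-%d-%d" % (i, j),
            cidr="10.%d.%d.0/24" % (i % 250 + 1, j))
        conn.network.add_interface_to_router(router, subnet_id=subnet.id)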
Actual results:
Resources aren't being created in ODL; they keep piling up in the opendaylightjournal table due to HTTP errors and timeouts when talking to ODL (the default timeout for the HTTP connection is 10 s).
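Since these are read timeouts rather than connection refusals, it is also useful to time a trivial request against ODL's neutron northbound with the same 10 s budget the journal thread gets. A minimal sketch; the host and port come from the traceback above, while the URL path and admin/admin credentials are assumptions:

# Hypothetical probe: time a simple GET against ODL's neutron northbound with
# the same 10 s timeout the journal thread uses (host/port from the traceback
# above; the URL path and admin/admin credentials are assumptions).
import time
import requests

url = "http://172.16.0.17:8081/controller/nb/v2/neutron/networks"
start = time.time()
try:
    resp = requests.get(url, auth=("admin", "admin"), timeout=10)
    print("HTTP %d in %.1f s" % (resp.status_code, time.time() - start))
except requests.exceptions.ReadTimeout:
    print("read timed out after %.1f s" % (time.time() - start))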
Expected results:
No exceptions in the logs, and resources are created in ODL.
Additional info:
Karaf logs from controller-0, 1, and 2
neutron-server logs from controller-0
http://8.43.86.1:8088/smalleni/clustering-logs.tar.gz
Port ID that can be used to track the problem: f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73