  controller / CONTROLLER-1772

Clustering Issues: Error writing to datastore and AskTimeoutException


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Cannot Reproduce
    • Affects Version/s: Carbon
    • Fix Version/s: None
    • Component/s: clustering
    • Labels: None
    • Environment: Operating System: All; Platform: All
    • External issue ID: 9173

    Description

      Description of problem:
      Let me start by describing our setup:
      3 OpenStack controllers
      3 ODLs clustered
      3 Compute nodes

      When running a Browbeat+Rally scenario which does:
      1. Create two networks
      2. Create two subnets
      3. Create a router
      4. Attach each of these subnets to the router

      for a total of 500 times at varying concurrencies of 8, 16, and 32, we see journal entries piling up in the opendaylightjournal table in the ovs_neutron DB. On inspection, the neutron-server log shows tons of HTTP client errors and timeouts.
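
      The backlog itself can be quantified straight from the database. The following is a minimal sketch assuming networking-odl's journal schema (a state column holding values such as 'pending' and 'completed'); verify the column names against the installed networking-odl version:

        # Count queued journal rows per state in the ovs_neutron DB.
        # Credentials are placeholders; the schema is assumed from
        # networking-odl's journal model.
        import pymysql

        conn = pymysql.connect(host="localhost", user="neutron",
                               password="...", database="ovs_neutron")
        with conn.cursor() as cur:
            cur.execute("SELECT state, COUNT(*) FROM opendaylightjournal "
                        "GROUP BY state")
            for state, count in cur.fetchall():
                print(state, count)
        conn.close()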

      For example, tracing the port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73 in the neutron-server logs, we see:

      2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal [req-cc3661d4-fb99-47bc-bb85-f64b4d1a496d - - - - -] Error while processing update port f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73: ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal Traceback (most recent call last):
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/journal/journal.py", line 245, in _sync_entry
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.client.sendjson(method, urlpath, to_send)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 106, in sendjson
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal 'body': obj})
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in _exit_
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.force_reraise()
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal six.reraise(self.type_, self.value, self.tb)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 98, in sendjson
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal self.request(method, urlpath, data))
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/networking_odl/common/client.py", line 91, in request
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal method, url=url, headers=headers, data=data, timeout=self.timeout)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal resp = self.send(prep, **send_kwargs)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal r = adapter.send(request, **kwargs)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 499, in send
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal raise ReadTimeout(e, request=request)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='172.16.0.17', port=8081): Read timed out. (read timeout=10)
       2017-09-14 07:50:08.593 202374 ERROR networking_odl.journal.journal
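
      The 10 s in the traceback is networking-odl's default HTTP timeout (the timeout option in the [ml2_odl] section of the ML2 configuration). The failing call boils down to roughly the sketch below; the URL path and credentials are illustrative placeholders, not taken from this deployment:

        # Sketch of the journal thread's REST call: it PUTs the resource to
        # ODL's neutron northbound API, and requests raises ReadTimeout when
        # ODL does not answer within the read timeout.
        import requests

        resp = requests.put(
            "http://172.16.0.17:8081/controller/nb/v2/neutron/ports/"
            "f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73",
            json={"port": {}},    # real payload elided
            auth=("admin", "admin"),
            timeout=10,           # the read timeout that expires above
        )
        resp.raise_for_status()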

      It looks like networking-odl isn't able to connect to ODL, and looking at the karaf logs on controller-1 for this port ID, we see:

      1894842:2017-09-14 04:18:43,228 | ERROR | ChangeHandler-16 | MDSALUtil | 319 - org.opendaylight.genius.mdsalutil-api - 0.2.2.SNAPSHOT | Error writing to datastore (path, data) : (KeyedInstanceIdentifier
      {targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.RouterInterfacesMap, org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.RouterInterfaces[key=RouterInterfacesKey [_routerId=Uuid [_value=4d78c0f1-1db8-42e1-bde8-c48cad4bc351]]], org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.neutronvpn.rev150602.router.interfaces.map.router.interfaces.Interfaces[key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73]]]}
      , Interfaces [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73, _key=InterfacesKey [_interfaceId=f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73], augmentation=[]])

      Just above this message, we also see several AskTimeoutExceptions like the following:

      2017-09-14 04:18:13,208 | ERROR | t-dispatcher-134 | LocalThreePhaseCommitCohort | 211 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Failed to prepare transaction member-1-datastore-config-fe-0-txn-175117-0 on backend
       akka.pattern.AskTimeoutException: Ask timed out on ActorSelection[Anchor(akka://opendaylight-cluster-data/), Path(/user/shardmanager-config/member-1-shard-default-config#1400650520)] after [30000 ms]. Sender[null] sent message of type "org.opendaylight.controller.cluster.datastore.messages.ReadyLocalTransaction".
       at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)[185:com.typesafe.akka.actor:2.4.18]
       at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)[185:com.typesafe.akka.actor:2.4.18]
       at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
       at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
       at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)[181:org.scala-lang.scala-library:2.11.11.v20170413-090219-8a413ba7cc]
       at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)[185:com.typesafe.akka.actor:2.4.18]
       at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)[185:com.typesafe.akka.actor:2.4.18]
       at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)[185:com.typesafe.akka.actor:2.4.18]
       at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)[185:com.typesafe.akka.actor:2.4.18]
       at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]

      Due to the state of the ODL cluster, networking-odl has trouble connecting and creating resources, leading to high mysqld CPU usage (because of all the rows piling up) and very high CPU usage of the karaf process (around 30 cores).

      We suspect the ODL issues are behind the high mysqld CPU usage because, when we tested networking-odl with some unmerged patches that deleted each journal row as soon as it completed, we saw better performance and lower CPU usage, since there were fewer rows to scan.
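
      For illustration, the effect of those unmerged patches is roughly that of eagerly purging finished rows, as in the sketch below (this is not the actual patch code; the 'completed' state value is assumed from networking-odl's journal schema):

        # Purge journal rows whose processing has finished so that the
        # periodic scans over opendaylightjournal stay cheap.
        import pymysql

        conn = pymysql.connect(host="localhost", user="neutron",
                               password="...", database="ovs_neutron")
        with conn.cursor() as cur:
            deleted = cur.execute(
                "DELETE FROM opendaylightjournal WHERE state = 'completed'")
        conn.commit()
        print("purged %d completed rows" % deleted)
        conn.close()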

      Version-Release number of selected component (if applicable):
      OpenStack Pike (12)
      opendaylight-6.2.0-0.1.20170913snap58.el7.noarch
      python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

      How reproducible:
      Quite consistent

      Steps to Reproduce:
      1. Deploy OpenStack with ODL as the backend
      2. Run perf/scale tests that create tons of neutron resources (a minimal sketch of one iteration is shown below)
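
      One iteration of the scenario can be approximated with openstacksdk as follows (the cloud name and CIDRs are placeholders; Browbeat+Rally issues the same API calls, just at much higher concurrency):

        # One iteration: two networks + subnets, one router, attach both
        # subnets. Run many iterations concurrently to build up the backlog.
        import openstack

        conn = openstack.connect(cloud="overcloud")  # assumed clouds.yaml entry
        router = conn.network.create_router(name="repro-router")
        for i, cidr in enumerate(("10.0.0.0/24", "10.0.1.0/24")):
            net = conn.network.create_network(name="repro-net-%d" % i)
            subnet = conn.network.create_subnet(
                network_id=net.id, ip_version=4, cidr=cidr,
                name="repro-subnet-%d" % i)
            # Attaching the subnet triggers the RouterInterfacesMap write
            # that fails in the karaf log above.
            conn.network.add_interface_to_router(router, subnet_id=subnet.id)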

      Actual results:
      Resources aren't being created in ODL; they keep piling up in the opendaylightjournal table due to HTTP connection errors with ODL (the default HTTP connection timeout is 10 s).

      Expected results:
      No exceptions in the logs, and resources are created.

      Additional info:

      Logs of karaf on controller-0, 1, and 2
      Logs of neutron-server on controller-0

      http://8.43.86.1:8088/smalleni/clustering-logs.tar.gz

      Port id that can be used to track problem: f9f307c4-bcc7-48c2-863f-2cbc8b0b8e73



          People

            Assignee: Oleksii Mozghovyi (oleksii.mozghovyi)
            Reporter: Sai Sindhur Malleni (smalleni@redhat.com)
