[NETVIRT-1461] port create times out, instances go to error state Created: 12/Oct/18  Updated: 08/Nov/18  Resolved: 08/Nov/18

Status: Resolved
Project: netvirt
Component/s: General
Affects Version/s: None
Fix Version/s: Fluorine-SR2, Neon, Oxygen-SR4

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Stephen Kitt
Resolution: Done Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to NETVIRT-1460 websocket failing: causes instance cr... Confirmed

 Description   

We have sporadic failures in our 3node (aka clustered) csit jobs for netvirt where
some openstack instances go to the error state instead of active. There seem to be
multiple causes; another, similar issue is NETVIRT-1460.

For this one, it seems that after taking down one node (leaving two active nodes) there
is a communication problem between networking_odl and ODL. This trace is seen in
the [neutron-server log|https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-oxygen/66/control_1/oslogs/neutron-server.log.gz]:

ERROR networking_odl.common.client [None req-9b0893d1-4602-4008-8926-76975987b9a2 None None] REST request ( post ) to url ( ports ) is failed. Request body : [{u'port': {'port_security_enabled': True, 'binding:host_id': '', 'name': '', 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': u'1c164d7f-7231-4cef-a967-c02a79b7aacc', 'tenant_id': u'4c6d58bff60d4f21a95a606edf297214', 'binding:vif_details': {}, 'binding:vnic_type': 'normal', 'binding:vif_type': 'unbound', 'device_owner': '', 'mac_address': 'fa:16:3e:b2:ed:3f', 'binding:profile': '{}', 'project_id': u'4c6d58bff60d4f21a95a606edf297214', 'fixed_ips': [{'subnet_id': u'8131ce39-4062-4cb5-96eb-e5095b48da21', 'ip_address': u'26.0.0.4'}], 'id': '842361ae-3077-4c64-a71a-6a0704723a2e', 'security_groups': [{'id': u'79a43166-0365-4085-be3a-45efc8c4fd6e'}], 'device_id': u'3a2df104-7865-4399-8b52-89fba5c34208'}}] service: ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
ERROR networking_odl.journal.journal [None req-9b0893d1-4602-4008-8926-76975987b9a2 None None] Error while processing (Entry ID: 535) - create port 842361ae-3077-4c64-a71a-6a0704723a2e (Time stamp: 63672772711.3): ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
ERROR networking_odl.journal.journal Traceback (most recent call last):
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/journal/journal.py", line 284, in _sync_entry
ERROR networking_odl.journal.journal     self.client.sendjson(method, urlpath, to_send)
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 106, in sendjson
ERROR networking_odl.journal.journal     'body': obj})
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
ERROR networking_odl.journal.journal     self.force_reraise()
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
ERROR networking_odl.journal.journal     six.reraise(self.type_, self.value, self.tb)
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 98, in sendjson
ERROR networking_odl.journal.journal     self.request(method, urlpath, data))
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 91, in request
ERROR networking_odl.journal.journal     method, url=url, headers=headers, data=data, timeout=self.timeout)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
ERROR networking_odl.journal.journal     resp = self.send(prep, **send_kwargs)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
ERROR networking_odl.journal.journal     r = adapter.send(request, **kwargs)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 521, in send
ERROR networking_odl.journal.journal     raise ReadTimeout(e, request=request)
ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
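For context, the failing call is just a REST POST from networking-odl's journal thread to ODL's neutron northbound, made with a 10 second read timeout. A minimal sketch of that interaction follows; the host, credentials, URL path and payload are placeholders based on the usual networking-odl configuration, not values taken from this job:

import requests
from requests.exceptions import ReadTimeout

# Placeholder endpoint; the path mirrors the usual networking-odl "url" setting.
ODL_URL = "http://10.30.170.112:8181/controller/nb/v2/neutron/ports"
AUTH = ("admin", "admin")  # assumed default ODL credentials
PORT_BODY = {"port": {"network_id": "<network-uuid>", "admin_state_up": True}}  # illustrative payload only

try:
    resp = requests.post(ODL_URL, json=PORT_BODY, auth=AUTH, timeout=10)
    resp.raise_for_status()
    print("port created:", resp.status_code)
except ReadTimeout:
    # This is the failure mode in the trace above: ODL accepted the connection
    # but did not answer within 10 s, so the journal entry fails and is retried.
    print("read timed out after 10 s")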

robot log showing the instance state

ODL3 is the default shard leader after ODL2 was taken down, so I'm assuming ODL3
would be handling the transactions/requests. ODL3 karaf log
There are plenty of exceptions in that log, but I'm not sure which ones are of the
type we are OK with and which are not. There are also some log messages saying "Job
still failed on retry", which seems serious.



 Comments   
Comment by Jamo Luhrsen [ 06/Nov/18 ]

skitt, what can we do to make progress on this one? If you have something for me to dig in to, let me know.

Comment by Jamo Luhrsen [ 08/Nov/18 ]

This specific symptom of network failures seems to have gone away with the recent changes that give haproxy a better healthcheck for the websocket backend.
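For reference, the kind of condition such a healthcheck verifies is that the backend actually completes a WebSocket upgrade rather than merely accepting a TCP connection. A minimal sketch of that check, with host, port and path as placeholders (ODL's real notification stream endpoint depends on the installed features):

import base64
import http.client
import os

HOST, PORT, PATH = "10.30.170.112", 8185, "/"  # placeholders; real host/port/path may differ
key = base64.b64encode(os.urandom(16)).decode()

conn = http.client.HTTPConnection(HOST, PORT, timeout=5)
conn.request("GET", PATH, headers={
    "Connection": "Upgrade",
    "Upgrade": "websocket",
    "Sec-WebSocket-Version": "13",
    "Sec-WebSocket-Key": key,
})
resp = conn.getresponse()
# 101 Switching Protocols means the websocket backend is actually serving upgrades,
# which is a stronger signal than a plain TCP connect check.
print("backend healthy" if resp.status == 101 else "backend unhealthy", resp.status)
conn.close()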
