- Bug
- Resolution: Done
- High
- None
We have sporadic failures in our 3-node (i.e. clustered) CSIT jobs for netvirt, where some OpenStack instances go to ERROR state instead of ACTIVE. There seem to be multiple causes; NETVIRT-1460 tracks a similar issue.
For this one, it appears that after taking down one node (leaving two active nodes) there is a communication problem between networking-odl and ODL. This trace is seen in the [neutron-server log|https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-oxygen/66/control_1/oslogs/neutron-server.log.gz]:
ERROR networking_odl.common.client [None req-9b0893d1-4602-4008-8926-76975987b9a2 None None] REST request ( post ) to url ( ports ) is failed. Request body : [{u'port': {'port_security_enabled': True, 'binding:host_id': '', 'name': '', 'allowed_address_pairs': [], 'admin_state_up': True, 'network_id': u'1c164d7f-7231-4cef-a967-c02a79b7aacc', 'tenant_id': u'4c6d58bff60d4f21a95a606edf297214', 'binding:vif_details': {}, 'binding:vnic_type': 'normal', 'binding:vif_type': 'unbound', 'device_owner': '', 'mac_address': 'fa:16:3e:b2:ed:3f', 'binding:profile': '{}', 'project_id': u'4c6d58bff60d4f21a95a606edf297214', 'fixed_ips': [{'subnet_id': u'8131ce39-4062-4cb5-96eb-e5095b48da21', 'ip_address': u'26.0.0.4'}], 'id': '842361ae-3077-4c64-a71a-6a0704723a2e', 'security_groups': [{'id': u'79a43166-0365-4085-be3a-45efc8c4fd6e'}], 'device_id': u'3a2df104-7865-4399-8b52-89fba5c34208'}}] service: ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
ERROR networking_odl.journal.journal [None req-9b0893d1-4602-4008-8926-76975987b9a2 None None] Error while processing (Entry ID: 535) - create port 842361ae-3077-4c64-a71a-6a0704723a2e (Time stamp: 63672772711.3): ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
ERROR networking_odl.journal.journal Traceback (most recent call last):
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/journal/journal.py", line 284, in _sync_entry
ERROR networking_odl.journal.journal     self.client.sendjson(method, urlpath, to_send)
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 106, in sendjson
ERROR networking_odl.journal.journal     'body': obj})
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
ERROR networking_odl.journal.journal     self.force_reraise()
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
ERROR networking_odl.journal.journal     six.reraise(self.type_, self.value, self.tb)
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 98, in sendjson
ERROR networking_odl.journal.journal     self.request(method, urlpath, data))
ERROR networking_odl.journal.journal   File "/opt/stack/networking-odl/networking_odl/common/client.py", line 91, in request
ERROR networking_odl.journal.journal     method, url=url, headers=headers, data=data, timeout=self.timeout)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 508, in request
ERROR networking_odl.journal.journal     resp = self.send(prep, **send_kwargs)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 618, in send
ERROR networking_odl.journal.journal     r = adapter.send(request, **kwargs)
ERROR networking_odl.journal.journal   File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 521, in send
ERROR networking_odl.journal.journal     raise ReadTimeout(e, request=request)
ERROR networking_odl.journal.journal ReadTimeout: HTTPConnectionPool(host='10.30.170.112', port=8181): Read timed out. (read timeout=10)
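Note that the trace shows a read timeout, not a connection refusal: the controller accepted the request but did not answer within the driver's 10-second limit. As a rough way to separate those two failure modes during the outage window, a TCP-level probe can be run against the REST port (the host and port come from the log above; the helper itself is only an illustrative sketch, not part of any project):

```python
import socket

def rest_port_reachable(host="10.30.170.112", port=8181, timeout=10):
    """Return True if a TCP connection to the ODL REST port completes
    within `timeout` seconds, False otherwise.

    Caveat: the neutron-server trace shows a *read* timeout, so this
    probe can succeed (the port accepts connections) even while the
    controller is too busy to actually answer REST requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns True while REST calls still time out, the problem is in request processing (e.g. shard leadership churn after the node was downed) rather than basic connectivity.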
Robot log showing the instance state.
ODL3 became the default shard leader after ODL2 was taken down, so I'm assuming ODL3 would be handling the transactions/requests. ODL3 karaf log.
There are plenty of exceptions in the log, but I'm not sure which kinds we can tolerate and which we cannot. Some other log entries mention "Job still failed on retry", which seems serious.
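The "Job still failed on retry" messages come from the networking-odl journal, which queues each REST operation and re-attempts it a bounded number of times before marking the entry failed. A minimal sketch of that retry pattern (the retry count, sleep, and return values here are illustrative stand-ins, not the driver's actual defaults):

```python
import time

class RetryableError(Exception):
    """Stand-in for the ReadTimeout seen in the neutron-server log."""

def process_entry(operation, max_retries=5, delay=0.0):
    """Run `operation`; on a retryable failure, re-attempt up to
    `max_retries` times, then give up -- roughly the journal's
    'failed on retry' path."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == max_retries:
                return "failed"   # journal entry marked failed
            time.sleep(delay)     # real driver waits before the next sync
```

In this model, an entry whose REST call recovers after a couple of timeouts still completes; one whose call keeps timing out (a controller that never responds after the node was downed) exhausts its retries, matching the "Job still failed on retry" lines in the karaf log.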
- relates to NETVIRT-1460 "websocket failing: causes instance creation failures"
- Confirmed