[NEUTRON-204] networking-odl gives up on websocket, stops retrying Created: 08/Nov/18  Updated: 10/Jan/19

Status: Open
Project: neutron
Component/s: northbound-api
Affects Version/s: None
Fix Version/s: Fluorine-SR2, Neon

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Josh Hershberg
Resolution: Unresolved Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
is blocked by GENIUS-263 tunnels down after bouncing ODL nodes... Verified

 Description   

After bouncing all three nodes of an ODL cluster, websocket registration fails and networking-odl gives up. From that point forward, the netvirt workflow is broken.



 Comments   
Comment by Jamo Luhrsen [ 08/Nov/18 ]

Is NEUTRON the right project for this?

Comment by Jamo Luhrsen [ 08/Nov/18 ]

Copy/paste from an email:

I think we are really close with everything, but I want to get to the bottom
of another one. The job is here [0].

Essentially, what's happening is that networking-odl just quits trying to
get a websocket. The final error is this:

2018-11-07 19:35:45.116 ERROR networking_odl.common.websocket_client None req-4b2fa066-886c-4c1b-a836-8ea50cafa8ae None None websocket irrecoverable error

which comes from here [1].

I think what's happening is that the /restconf call to make the registration
gets a 40x (a 404 in the log message), which terminates the thread that would
otherwise keep retrying the registration.
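
To make the failure mode concrete, here is a minimal sketch of the loop as I
read it from [1]. The names (register_websocket, RETRY_INTERVAL) and the
requests-based call are mine, not the actual networking-odl code; the point
is just that a 4xx escapes the loop and kills the thread:

    import time

    import requests

    RETRY_INTERVAL = 10  # seconds; illustrative, not the real config option


    def register_websocket(restconf_url, session):
        # Hypothetical paraphrase of the registration loop in [1]; the real
        # code lives in networking_odl/common/websocket_client.py.
        while True:
            try:
                resp = session.post(restconf_url)  # the registration call
                resp.raise_for_status()
                return resp  # registered; go open the websocket
            except requests.HTTPError as e:
                if 400 <= e.response.status_code < 500:
                    # The behavior described above: a 40x (404 in our log)
                    # is treated as irrecoverable, the exception escapes,
                    # the thread logs "websocket irrecoverable error" and
                    # exits, and nothing ever retries the registration.
                    raise
                time.sleep(RETRY_INTERVAL)  # 5xx and the like are retried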

I'm not sure that retrying would help anyway. Something seems
busted at this point, but where? This is after we have
finished the ha_l3 robot suite, which takes all three ODLs down at the
same time and brings them back. The cluster is back and in sync ~12m
before this happens, but in those 12m our haproxy never marks the websocket
backend as UP. It does mark the restconf backend as UP. You can see where
this happens in the haproxy log [2]: 20:21:29 shows the last of the
3 nodes going down for websocket, and after that you only see restconf come
back UP and no more messages about websocket.

Now, in each of the karaf logs [3][4][5] you can see that our haproxy
healthcheck is hitting each node, and each node rejects it because the
registration is not there yet. Remember, this is fluorine, and your
change to pre-register this websocket is not in it. The point is that
haproxy is there, working, and polling each node.
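
If it helps to reason about what haproxy is seeing, here is a hypothetical
stand-in for its httpchk probe in Python. The check path is deliberately a
parameter because I'm not asserting the exact URL from our haproxy config;
the shape is what matters: until the registration exists on a node, the node
answers with a 4xx, so haproxy keeps that websocket backend DOWN:

    import requests


    def websocket_backend_up(node, check_path):
        # Hypothetical stand-in for haproxy's httpchk probe; check_path is
        # whatever stream/registration URL the haproxy config polls, which
        # I'm deliberately not asserting here.
        try:
            resp = requests.get("http://%s%s" % (node, check_path), timeout=2)
        except requests.RequestException:
            return False
        # httpchk treats 2xx/3xx responses as a healthy backend by default
        return resp.status_code < 400

If the registration from [1] never lands, this returns False forever, which
matches the haproxy log in [2]: restconf backends come back UP, websocket
never does.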

So, where is that registration getting lost? The one from [1]. Oh,
here is the full neutron-server log to see it [6]. It seems that the
restconf registration request is being sent by networking-odl, but
it's not working.

That's as far as I got for now. Any ideas to run with?

While I was writing this, one of my oxygen jobs [7] also seems to have
failed in the same way. So my hope that the websocket pre-registration
changes already in oxygen would fix it doesn't seem to have panned out.

Here's a JIRA for this one:
https://jira.opendaylight.org/browse/NEUTRON-204

Thanks,
JamO

[0] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14
[1] http://git.openstack.org/cgit/openstack/networking-odl/tree/networking_odl/common/websocket_client.py?id=38497ef6c0c228c1794ddae3ad971353b3ff64c4#n115
[2] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14/haproxy.log.gz
[3] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14/odl_1/odl1_karaf.log.gz
[4] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14/odl_2/odl2_karaf.log.gz
[5] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14/odl_3/odl3_karaf.log.gz
[6] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/499/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/14/control_1/oslogs/neutron-server.log.gz
[7] https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/500/jamo-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-oxygen/12/

Comment by Jamo Luhrsen [ 13/Dec/18 ]

jhershbe, did we ever get a patch in networking-odl to not actually die in the thread that
retries the websocket? I think that's all we need to mark this specific bug as resolved.
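
For reference, the shape of the fix I'm asking about, sketched against the
hypothetical loop from my earlier comment (again, illustrative names, not an
actual networking-odl patch): treat a 40x as retryable instead of fatal so
the thread survives until the cluster settles.

    import time

    import requests

    RETRY_INTERVAL = 10  # seconds; illustrative


    def register_websocket_forever(restconf_url, session):
        # Hypothetical fixed loop: never let a 40x kill the retry thread.
        while True:
            try:
                resp = session.post(restconf_url)
                resp.raise_for_status()
                return resp  # registered; go open the websocket
            except requests.RequestException:
                # 4xx, 5xx, and connection errors all just wait and retry;
                # the registration may succeed once the cluster settles.
                time.sleep(RETRY_INTERVAL)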
