[NETVIRT-1460] websocket failing: causes instance creation failures Created: 12/Oct/18  Updated: 05/Dec/18

Status: Confirmed
Project: netvirt
Component/s: General
Affects Version/s: None
Fix Version/s: Fluorine-SR2, Neon, Oxygen-SR4

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: Jamo Luhrsen
Resolution: Unresolved Votes: 0
Labels: csit:3node
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Relates
relates to NETVIRT-1461 port create times out, instances go t... Resolved

 Description   

We have sporadic failures in our netvirt 3node (aka clustering) suites where OpenStack
instances go to error state. The reason is shown as "Failed to allocate network(s)".

In this example, the failure happens after all three nodes have been stopped and started.
The cluster is showing as up and operational, in that syncstatus is True and every shard
has a proper Leader and Followers. Approximately 10 minutes after the nodes are started,
the instances that end up in error state are booted.

One of the ODL nodes has a karaf.log with some of these messages:

2018-09-16T10:25:39,274 | ERROR | nioEventLoopGroup-7-1 | WebSocketServerHandler           | 336 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.7.4.SNAPSHOT | Listener for stream with name 'data-change-event-subscription/neutron:neutron/neutron:ports/datastore=OPERATIONAL/scope=SUBTREE' was not found.

In the nova log we can see that an operation has timed out, which may be a consequence
of the failing websocket:

WARNING nova.virt.libvirt.driver [None req-a139053f-fabc-4c64-b658-0e73c1e4ecc5 admin admin] [instance: e115f123-25e9-4e6c-80be-347564d75af1] Timeout waiting for [('network-vif-plugged', u'8d8e28ca-1b96-41f9-8d12-6b041e5300e9')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

The other two ODL nodes do not show this websocket error, but that may only be because
a single ODL is receiving all the requests (via haproxy). Or it may simply not be
broken on the other two nodes.



 Comments   
Comment by Josh Hershberg [ 16/Oct/18 ]

This is almost certainly caused by haproxy. Connecting the websocket consists of two REST calls that go to the REST port (8081 in CSIT, IIRC) and a websocket connection that goes to port 8185. If the websocket lands on a different ODL node than the REST calls, then the above error is emitted.

Comment by Jamo Luhrsen [ 16/Oct/18 ]

Thanks for digging on this one Josh. I'm no expert in haproxy, but I think I can see how this scenario could
happen for us based on our config. Here is the current config:

global
  daemon
  group  haproxy
  log  /dev/log local0
  maxconn  20480
  pidfile  /tmp/haproxy.pid
  ssl-default-bind-ciphers  !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
  ssl-default-bind-options  no-sslv3 no-tlsv10
  stats  socket /var/lib/haproxy/stats mode 600 level user
  stats  timeout 2m
  user  haproxy

defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
  timeout  http-request 10s
  timeout  queue 2m
  timeout  connect 10s
  timeout  client 2m
  timeout  server 2m
  timeout  check 10s

listen opendaylight
  bind 10.30.170.24:8181 transparent
  mode http
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  option httpchk GET /diagstatus
  option httplog
  server opendaylight-rest-1 10.30.170.101:8181 check fall 5 inter 2000 rise 2
  server opendaylight-rest-2 10.30.170.33:8181 check fall 5 inter 2000 rise 2
  server opendaylight-rest-3 10.30.170.134:8181 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 10.30.170.24:8185 transparent
  mode http
  timeout connect 5s
  timeout client 25s
  timeout server 25s
  timeout tunnel 3600s
  server opendaylight-ws-1 10.30.170.101:8185 check fall 5 inter 2000 rise 2
  server opendaylight-ws-2 10.30.170.33:8185 check fall 5 inter 2000 rise 2
  server opendaylight-ws-3 10.30.170.134:8185 check fall 5 inter 2000 rise 2

So, if I understand it right, there are two different load balancers at work here: one for
8181 and one for 8185, and the default algorithm is round robin. My guess is that,
with the above config, there is no guarantee that the requests needed to create a
websocket will all hit the same server. And maybe it's possible (especially when
downing nodes) that the two round-robin rotations get so far out of sync that we
fail to create the websocket forever (or for a really long time).

For now, I'm trying to tweak the haproxy config to put both ports in the same
listener and use source load balancing. This will not spread the load across
all the controllers, so maybe we have something else to try if that's the goal.
But if it works the way we think, then all requests will end up going to the same
ODL and websocket creation won't fail the way we believe it is failing now.

Also, I noticed that our existing config's listener section for 8185 does not
have the /diagstatus healthcheck. So it might be possible that 8185 is
still getting sent to a controller that is not really ready.
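
Something like the below is what I have in mind for both of those tweaks (untested, just
a sketch of the idea; the single combined listener, the server names, and checking health
on 8181 are my assumptions, not the actual CSIT change, and the X-Forwarded-Proto headers
from the current 8181 listener are left out for brevity):

listen opendaylight
  # both ports in one listener so "balance source" sends REST (8181) and
  # websocket (8185) traffic from a given client IP to the same ODL
  bind 10.30.170.24:8181 transparent
  bind 10.30.170.24:8185 transparent
  mode http
  balance source
  option httpchk GET /diagstatus
  option httplog
  timeout tunnel 3600s
  # no port on the server address, so haproxy forwards to whichever port the
  # client connected on; health checks are pinned to 8181 where /diagstatus
  # is actually served
  server opendaylight-1 10.30.170.101 check port 8181 fall 5 inter 2000 rise 2
  server opendaylight-2 10.30.170.33 check port 8181 fall 5 inter 2000 rise 2
  server opendaylight-3 10.30.170.134 check port 8181 fall 5 inter 2000 rise 2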

Comment by Josh Hershberg [ 17/Oct/18 ]

Some more (and some repetitive) info from the mail thread...

 
As Tim said, "IIRC the registration request just needs to hit every ODL, but after that neutron should be able to connect via websocket to any ODL." What this means is that the registration request and the websocket connection do not need to be handled by the same ODL instance. However, the ODL instance that handles the websocket connection must have received a registration request from some (any) networking_odl but not necessarily the one connecting the websocket. In a director deployment there are three neutron controllers and the same number of ODL instances. As such, given that the algorithm is RR there is at least a very high likelihood that a registration call will hit each ODL instance. In CSIT we have just one controller!
 
I think this is very wrong even for director despite the fact that we haven't noticed any issues yet. The reason we have HA is so that we can handle when some nodes croak. If a neutron controller node goes down, we will see this same phenomenon of websocket connections failing. IMHO, this needs to be fixed. Also, it feels like something that "works by accident." I see a few options:
 
1) Configure haproxy so that rest and websocket connections from the same host will always get proxied to the same odl instance. Obviously, this should be done in such a way as to balance the traffic (a rough haproxy sketch for this follows after option 2).
 
2) Modify networking_odl so that the rest calls for registration are not sent via the VIP but rather explicitly sent to each odl node. The websocket connections would continue to use the VIP. At first, this seemed a bit ugly but on second thought I actually do not think this is so bad. The registration calls are not something that happen often or carry any kind of load. It is less than optimal that networking_odl would need to know about the real IPs but all in all this is a rather safe and straightforward fix. 
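
For option 1, one minimal sketch (assuming we keep the two separate listeners; giving
both the same "balance source" and the same /diagstatus health check is my assumption,
and the affinity only holds as long as both listeners agree on which servers are up)
could look like this:

listen opendaylight
  bind 10.30.170.24:8181 transparent
  mode http
  balance source
  option httpchk GET /diagstatus
  option httplog
  # X-Forwarded-Proto handling from the current config omitted for brevity
  server opendaylight-rest-1 10.30.170.101:8181 check fall 5 inter 2000 rise 2
  server opendaylight-rest-2 10.30.170.33:8181 check fall 5 inter 2000 rise 2
  server opendaylight-rest-3 10.30.170.134:8181 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 10.30.170.24:8185 transparent
  mode http
  balance source
  # same /diagstatus check, run against 8181, so both listeners see the same
  # set of live servers and hash a given source IP to the same ODL
  option httpchk GET /diagstatus
  timeout tunnel 3600s
  server opendaylight-ws-1 10.30.170.101:8185 check port 8181 fall 5 inter 2000 rise 2
  server opendaylight-ws-2 10.30.170.33:8185 check port 8181 fall 5 inter 2000 rise 2
  server opendaylight-ws-3 10.30.170.134:8185 check port 8181 fall 5 inter 2000 rise 2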

Comment by Jamo Luhrsen [ 24/Oct/18 ]

this patch seems to mask this issue

Comment by Jamo Luhrsen [ 08/Nov/18 ]

will close this issue when https://git.opendaylight.org/gerrit/#/c/77569/ is merged

Comment by Jamo Luhrsen [ 04/Dec/18 ]

shague, can we keep this open? I know we had one patch merged towards fixing this general problem, but it's still not totally fixed.

robot log showing instances in error state, failing to allocate network
neutron server log

In the neutron log you can see the message below, which I think is indicative of this
problem:

48506:2018-11-16 23:57:13.071 ERROR networking_odl.common.websocket_client None req-582e6941-f0b2-4e84-9be5-c8d7984be1cf None None websocket irrecoverable error

Comment by Sam Hague [ 05/Dec/18 ]

Yeah, that's fine to keep open. I closed it after I saw your comment about closing it once 77569 was merged.
