[NETVIRT-1460] websocket failing: causes instance creation failures Created: 12/Oct/18 Updated: 05/Dec/18 |
|
| Status: | Confirmed |
| Project: | netvirt |
| Component/s: | General |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine-SR2, Neon, Oxygen-SR4 |
| Type: | Bug | Priority: | High |
| Reporter: | Jamo Luhrsen | Assignee: | Jamo Luhrsen |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
we have sporadic failures in our netvirt 3node (aka clustering) suites where openstack instance creation fails. In this example, this is happening after all three nodes have been stopped and started. There is one ODL karaf.log with some of these messages:

2018-09-16T10:25:39,274 | ERROR | nioEventLoopGroup-7-1 | WebSocketServerHandler | 336 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.7.4.SNAPSHOT | Listener for stream with name 'data-change-event-subscription/neutron:neutron/neutron:ports/datastore=OPERATIONAL/scope=SUBTREE' was not found.

In the nova log we can see some operation has timed out, so maybe that is because of this websocket failure:

WARNING nova.virt.libvirt.driver [None req-a139053f-fabc-4c64-b658-0e73c1e4ecc5 admin admin] [instance: e115f123-25e9-4e6c-80be-347564d75af1] Timeout waiting for [('network-vif-plugged', u'8d8e28ca-1b96-41f9-8d12-6b041e5300e9')] for instance with vm_state building and task_state spawning.: Timeout: 300 seconds

The other two ODL nodes do not seem to have this websocket error.
| Comments |
| Comment by Josh Hershberg [ 16/Oct/18 ] |
|
This is almost certainly caused by haproxy. Connecting the websocket consists of two REST calls that go to the REST port (8181 in CSIT) and a websocket connection that goes to port 8185. If the websocket lands on a different ODL node than the REST calls did, then the above error is emitted.
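For reference, a minimal sketch of that subscription flow from the client side (e.g. what networking-odl does), assuming the bierman02 draft02 RESTCONF endpoints that produced the stream name in the karaf.log error; the VIP address, credentials, and JSON field names are illustrative, not taken from the CSIT code:

    # Sketch: subscribe to an ODL data-change stream behind the haproxy VIP.
    import requests
    import websocket  # pip install websocket-client

    HAPROXY = "10.30.170.24"   # LB VIP from the CSIT haproxy config
    AUTH = ("admin", "admin")  # illustrative credentials

    # 1) create the subscription (REST call on 8181)
    create = requests.post(
        "http://%s:8181/restconf/operations/"
        "sal-remote:create-data-change-event-subscription" % HAPROXY,
        auth=AUTH,
        headers={"Content-Type": "application/json"},
        json={"input": {"path": "/neutron:neutron/neutron:ports",
                        "sal-remote-augment:datastore": "OPERATIONAL",
                        "sal-remote-augment:scope": "SUBTREE"}})
    stream_name = create.json()["output"]["stream-name"]

    # 2) resolve the stream (second REST call on 8181); this is the call that
    #    registers the listener on whichever ODL member the LB picked, and the
    #    Location header points at the ws:// URL on 8185
    locate = requests.get(
        "http://%s:8181/restconf/streams/stream/%s" % (HAPROXY, stream_name),
        auth=AUTH)
    ws_url = locate.headers["Location"]

    # 3) open the websocket. Behind haproxy this is a separate TCP connection;
    #    if it lands on a different ODL member than step 2 did, that member
    #    logs "Listener for stream with name ... was not found".
    ws = websocket.create_connection(ws_url)
    print(ws.recv())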
| Comment by Jamo Luhrsen [ 16/Oct/18 ] |
|
Thanks for digging on this one Josh. I'm no expert in haproxy, but I think I can see how this scenario could happen. This is the haproxy config we use in CSIT:

global
    daemon
    group haproxy
    log /dev/log local0
    maxconn 20480
    pidfile /tmp/haproxy.pid
    ssl-default-bind-ciphers !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
    ssl-default-bind-options no-sslv3 no-tlsv10
    stats socket /var/lib/haproxy/stats mode 600 level user
    stats timeout 2m
    user haproxy

defaults
    log global
    maxconn 4096
    mode tcp
    retries 3
    timeout http-request 10s
    timeout queue 2m
    timeout connect 10s
    timeout client 2m
    timeout server 2m
    timeout check 10s

listen opendaylight
    bind 10.30.170.24:8181 transparent
    mode http
    http-request set-header X-Forwarded-Proto https if { ssl_fc }
    http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
    option httpchk GET /diagstatus
    option httplog
    server opendaylight-rest-1 10.30.170.101:8181 check fall 5 inter 2000 rise 2
    server opendaylight-rest-2 10.30.170.33:8181 check fall 5 inter 2000 rise 2
    server opendaylight-rest-3 10.30.170.134:8181 check fall 5 inter 2000 rise 2

listen opendaylight_ws
    bind 10.30.170.24:8185 transparent
    mode http
    timeout connect 5s
    timeout client 25s
    timeout server 25s
    timeout tunnel 3600s
    server opendaylight-ws-1 10.30.170.101:8185 check fall 5 inter 2000 rise 2
    server opendaylight-ws-2 10.30.170.33:8185 check fall 5 inter 2000 rise 2
    server opendaylight-ws-3 10.30.170.134:8185 check fall 5 inter 2000 rise 2

So, if I understand it right, there are two independent load-balancing decisions happening here: one for the REST traffic on 8181 and one for the websocket traffic on 8185, so nothing guarantees both connections from the same client land on the same ODL node.

For now, I'm trying to tweak the haproxy config so that both ports from a given client end up on the same ODL node; one possible approach is sketched below.

Also, I noticed that our existing config's listener section for 8185 does not have the httpchk health check that the 8181 section has.
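A hedged sketch of one way to keep both connections on the same node (not necessarily the change that ended up in the CSIT patch): add "balance source" to both listen sections with the same server order, so a given client IP hashes to the same backend index for 8181 and 8185. Addresses are the ones from the config above.

listen opendaylight
    bind 10.30.170.24:8181 transparent
    mode http
    balance source
    option httpchk GET /diagstatus
    server opendaylight-rest-1 10.30.170.101:8181 check fall 5 inter 2000 rise 2
    server opendaylight-rest-2 10.30.170.33:8181 check fall 5 inter 2000 rise 2
    server opendaylight-rest-3 10.30.170.134:8181 check fall 5 inter 2000 rise 2

listen opendaylight_ws
    bind 10.30.170.24:8185 transparent
    mode http
    balance source
    timeout tunnel 3600s
    server opendaylight-ws-1 10.30.170.101:8185 check fall 5 inter 2000 rise 2
    server opendaylight-ws-2 10.30.170.33:8185 check fall 5 inter 2000 rise 2
    server opendaylight-ws-3 10.30.170.134:8185 check fall 5 inter 2000 rise 2

One caveat with this idea: if a backend is marked down on one port but not the other, the source hash can still send the two connections to different nodes, so it only narrows the window rather than closing it completely.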
| Comment by Josh Hershberg [ 17/Oct/18 ] |
|
Some more (and some repetitive) info from the mail thread... |
| Comment by Jamo Luhrsen [ 24/Oct/18 ] |
|
this patch seems to mask this issue |
| Comment by Jamo Luhrsen [ 08/Nov/18 ] |
|
will close this issue when https://git.opendaylight.org/gerrit/#/c/77569/ is merged |
| Comment by Jamo Luhrsen [ 04/Dec/18 ] |
|
shague, can we keep this open? I know we had one patch merged towards fixing this general problem, but it's still not totally fixed.

The robot log shows instances in error state, failing to allocate network. In the neutron log you can see the below message, which I think is indicative of this same problem:

48506:2018-11-16 23:57:13.071 ERROR networking_odl.common.websocket_client None req-582e6941-f0b2-4e84-9be5-c8d7984be1cf None None websocket irrecoverable error
| Comment by Sam Hague [ 05/Dec/18 ] |
|
yeah, that's fine to keep open. I had closed it after I saw your comment about closing it once 77569 is merged.