<!-- 
RSS generated by JIRA (8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d) at Wed Feb 07 20:24:07 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>OpenDaylight JIRA</title>
    <link>https://jira.opendaylight.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>8.20.10</version>
        <build-number>820010</build-number>
        <build-date>22-06-2022</build-date>
    </build-info>


<item>
            <title>[NETVIRT-1460] websocket failing: causes instance creation failures</title>
                <link>https://jira.opendaylight.org/browse/NETVIRT-1460</link>
                <project id="10144" key="NETVIRT">netvirt</project>
                    <description>&lt;p&gt;We have sporadic failures in our netvirt 3node (aka clustering) suites where OpenStack&lt;br/&gt;
instances go to an error state. The reason is shown as &quot;Failed to allocate network(s)&quot;.&lt;/p&gt;

&lt;p&gt;In this example, this is happening after all three nodes have been stopped and started.&lt;br/&gt;
The cluster is showing as up and operational, in that syncstatus is True and every shard&lt;br/&gt;
has a proper Leader and Followers. Approximately 10 minutes after the nodes are started, the&lt;br/&gt;
instances that end up in error state are booted.&lt;/p&gt;

&lt;p&gt;There is one ODL &lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-oxygen/65/odl_3/odl3_karaf.log.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;karaf.log &lt;/a&gt; with some of these messages:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
2018-09-16T10:25:39,274 | ERROR | nioEventLoopGroup-7-1 | WebSocketServerHandler           | 336 - org.opendaylight.netconf.restconf-nb-bierman02 - 1.7.4.SNAPSHOT | Listener &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; stream with name &lt;span class=&quot;code-quote&quot;&gt;&apos;data-change-event-subscription/neutron:neutron/neutron:ports/datastore=OPERATIONAL/scope=SUBTREE&apos;&lt;/span&gt; was not found.
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the &lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-snat-conntrack-oxygen/65/compute_2/oslogs/nova-compute.log.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;nova log &lt;/a&gt; we can see that an operation has timed out, possibly because&lt;br/&gt;
of the failing websocket:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
WARNING nova.virt.libvirt.driver [[01;36mNone req-a139053f-fabc-4c64-b658-0e73c1e4ecc5 [00;36madmin admin] [01;35m[instance: e115f123-25e9-4e6c-80be-347564d75af1] Timeout waiting &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; [(&lt;span class=&quot;code-quote&quot;&gt;&apos;network-vif-plugged&apos;&lt;/span&gt;, u&lt;span class=&quot;code-quote&quot;&gt;&apos;8d8e28ca-1b96-41f9-8d12-6b041e5300e9&apos;&lt;/span&gt;)] &lt;span class=&quot;code-keyword&quot;&gt;for&lt;/span&gt; instance with vm_state building and task_state spawning.[00m: Timeout: 300 seconds
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The other two ODL nodes do not seem to have this websocket error, but that may&lt;br/&gt;
only be because just the one ODL is being hit with the requests (via haproxy). Or it&lt;br/&gt;
may be that it&apos;s just not broken on the other two nodes.&lt;/p&gt;</description>
                <environment></environment>
        <key id="30882">NETVIRT-1460</key>
            <summary>websocket failing: causes instance creation failures</summary>
                <type id="10104" iconUrl="https://jira.opendaylight.org/secure/viewavatar?size=xsmall&amp;avatarId=10303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="2" iconUrl="https://jira.opendaylight.org/images/icons/priorities/critical.svg">High</priority>
                        <status id="10003" iconUrl="https://jira.opendaylight.org/images/icons/status_generic.gif" description="">Confirmed</status>
                    <statusCategory id="2" key="new" colorName="blue-gray"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="jluhrsen">Jamo Luhrsen</assignee>
                                    <reporter username="jluhrsen">Jamo Luhrsen</reporter>
                        <labels>
                            <label>csit:3node</label>
                    </labels>
                <created>Fri, 12 Oct 2018 23:06:37 +0000</created>
                <updated>Wed, 5 Dec 2018 01:03:41 +0000</updated>
                                                            <fixVersion>Fluorine-SR2</fixVersion>
                    <fixVersion>Neon</fixVersion>
                    <fixVersion>Oxygen-SR4</fixVersion>
                                    <component>General</component>
                        <due></due>
                            <votes>0</votes>
                                    <watches>3</watches>
                                                                                                                <comments>
                            <comment id="65359" author="jhershbe" created="Tue, 16 Oct 2018 10:53:23 +0000"  >&lt;p&gt;This is almost certainly caused by haproxy. Connecting the websocket consists of two rest calls that go to the rest port (8081 in CSIT, IIRC) and a websocket connection that goes to port 8185. If the websocket lands on a different ODL node than the rest calls then the above error is emitted.&#160;&lt;/p&gt;</comment>
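To make that failure mode concrete, here is a toy Python sketch (hypothetical, not from the bug report) of two independent round-robin rotations, one per haproxy listener: once the rotations drift apart, the REST registration and the websocket connection select different backends.

```python
from itertools import cycle

backends = ["odl-1", "odl-2", "odl-3"]

# Each haproxy "listen" section keeps its own round-robin rotation:
rest_rr = cycle(backends)   # port 8181 (RESTCONF registration calls)
ws_rr = cycle(backends)     # port 8185 (websocket connections)

# Unrelated clients' traffic advances the REST rotation independently...
for _ in range(2):
    next(rest_rr)

# ...so when one client registers and then connects its websocket,
# the two listeners can hand it different ODL nodes.
rest_node = next(rest_rr)
ws_node = next(ws_rr)
```

Any unrelated traffic on one listener shifts its rotation, so the mismatch is easy to hit and hard to predict.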
                            <comment id="65362" author="jluhrsen" created="Tue, 16 Oct 2018 18:18:04 +0000"  >&lt;p&gt;Thanks for digging into this one, Josh. I&apos;m no expert in haproxy, but I think I can see how this scenario could&lt;br/&gt;
happen for us given our config. Here is the current config:&lt;/p&gt;

&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;global
  daemon
  group  haproxy
  log  /dev/log local0
  maxconn  20480
  pidfile  /tmp/haproxy.pid
  ssl-&lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt;-bind-ciphers  !SSLv2:kEECDH:kRSA:kEDH:kPSK:+3DES:!aNULL:!eNULL:!MD5:!EXP:!RC4:!SEED:!IDEA:!DES
  ssl-&lt;span class=&quot;code-keyword&quot;&gt;default&lt;/span&gt;-bind-options  no-sslv3 no-tlsv10
  stats  socket /&lt;span class=&quot;code-keyword&quot;&gt;var&lt;/span&gt;/lib/haproxy/stats mode 600 level user
  stats  timeout 2m
  user  haproxy

defaults
  log  global
  maxconn  4096
  mode  tcp
  retries  3
  timeout  http-request 10s
  timeout  queue 2m
  timeout  connect 10s
  timeout  client 2m
  timeout  server 2m
  timeout  check 10s

listen opendaylight
  bind 10.30.170.24:8181 transparent
  mode http
  http-request set-header X-Forwarded-Proto https &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; { ssl_fc }
  http-request set-header X-Forwarded-Proto http &lt;span class=&quot;code-keyword&quot;&gt;if&lt;/span&gt; !{ ssl_fc }
  option httpchk GET /diagstatus
  option httplog
  server opendaylight-&lt;span class=&quot;code-keyword&quot;&gt;rest&lt;/span&gt;-1 10.30.170.101:8181 check fall 5 inter 2000 rise 2
  server opendaylight-&lt;span class=&quot;code-keyword&quot;&gt;rest&lt;/span&gt;-2 10.30.170.33:8181 check fall 5 inter 2000 rise 2
  server opendaylight-&lt;span class=&quot;code-keyword&quot;&gt;rest&lt;/span&gt;-3 10.30.170.134:8181 check fall 5 inter 2000 rise 2

listen opendaylight_ws
  bind 10.30.170.24:8185 transparent
  mode http
  timeout connect 5s
  timeout client 25s
  timeout server 25s
  timeout tunnel 3600s
  server opendaylight-ws-1 10.30.170.101:8185 check fall 5 inter 2000 rise 2
  server opendaylight-ws-2 10.30.170.33:8185 check fall 5 inter 2000 rise 2
  server opendaylight-ws-3 10.30.170.134:8185 check fall 5 inter 2000 rise 2
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;


&lt;p&gt;So, if I understand it right, there are two different LBs happening here: one for&lt;br/&gt;
8181 and one for 8185, and the default algorithm is round robin. My guess is that&lt;br/&gt;
there is no guarantee that both requests needed to create a websocket&lt;br/&gt;
will hit the same server with the above config. And maybe it&apos;s possible&lt;br/&gt;
(especially when downing nodes) that the two round-robin rotations get so far&lt;br/&gt;
out of sync that we fail to create the websocket forever (or for a really&lt;br/&gt;
long time).&lt;/p&gt;

&lt;p&gt;For now, I&apos;m &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/76955&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;trying to tweak &lt;/a&gt; the haproxy config to put both ports in the same&lt;br/&gt;
listener and use source load balancing. This will not spread the load across&lt;br/&gt;
all the controllers, so maybe we have something else to try if that&apos;s the goal.&lt;br/&gt;
But if it works like we think, all requests will end up going to the same&lt;br/&gt;
ODL and websocket creation won&apos;t fail the way we think it currently does.&lt;/p&gt;

&lt;p&gt;Also, I noticed that our existing config&apos;s listener section for 8185 does not&lt;br/&gt;
have the /diagstatus health check, so it might be possible that 8185 traffic is&lt;br/&gt;
still being sent to a controller that is not really ready.&lt;/p&gt;</comment>
                            <comment id="65370" author="jhershbe" created="Wed, 17 Oct 2018 11:33:47 +0000"  >&lt;p&gt;Some more (and some repetitive) info from the mail thread...&lt;/p&gt;

&lt;p&gt;&#160;&lt;br/&gt;
As Tim said, &quot;IIRC the registration request just needs to hit every ODL, but after that neutron should be able to connect via websocket to any ODL.&quot; What this means is that the registration request and the websocket connection do not need to be handled by the same ODL instance. However, the ODL instance that handles the websocket connection &lt;b&gt;must&lt;/b&gt; have received a registration request from some (any) networking_odl but not necessarily the one connecting the websocket. In a director deployment there are three neutron controllers and the same number of ODL instances. As such, given that the algorithm is RR there is at least a very high likelihood that a registration call will hit each ODL instance. In CSIT we have just one controller!&lt;br/&gt;
&#160;&lt;br/&gt;
I think this is very wrong even for director despite the fact that we haven&apos;t noticed any issues yet. The reason we have HA is so that we can handle when some nodes croak. If a neutron controller node goes down, we will see this same phenomenon of websocket connections failing. IMHO, this needs to be fixed. Also, it feels like something that &quot;works by accident.&quot; I see a few options:&lt;br/&gt;
&#160;&lt;br/&gt;
1) Configure haproxy so that rest and websocket connections from the same host will always get proxied to the same odl instance. Obviously, this should be done in such a way as to balance the traffic.&lt;br/&gt;
&#160;&lt;br/&gt;
2) Modify networking_odl so that the rest calls for registration are not sent via the VIP but rather explicitly sent to each odl node. The websocket connections would continue to use the VIP. At first, this seemed a bit ugly but on second thought I actually do not think this is so bad. The registration calls are not something that happen often or carry any kind of load. It is less than optimal that networking_odl would need to know about the real IPs but all in all this is a rather safe and straightforward fix.&#160;&lt;/p&gt;</comment>
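Option 2 could look roughly like the following Python sketch. It assumes the bierman02 sal-remote RPC `create-data-change-event-subscription` (the stream name in the karaf.log above matches that module); the node IPs are taken from the haproxy config earlier in the thread, and no request is actually sent here, only built.

```python
import json
import urllib.request

# Real ODL node IPs (from the CSIT haproxy config), not the VIP:
ODL_NODES = ["10.30.170.101", "10.30.170.33", "10.30.170.134"]
VIP = "10.30.170.24"

def register_stream(host: str, port: int = 8181) -> urllib.request.Request:
    """Build the stream-registration RPC for one ODL node (hypothetical
    sketch of the bierman02 sal-remote API)."""
    body = {
        "input": {
            "path": "/neutron:neutron/neutron:ports",
            "sal-remote-augment:datastore": "OPERATIONAL",
            "sal-remote-augment:scope": "SUBTREE",
        }
    }
    return urllib.request.Request(
        f"http://{host}:{port}/restconf/operations/"
        "sal-remote:create-data-change-event-subscription",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# One registration per node; the websocket itself would still use the VIP.
requests = [register_stream(host) for host in ODL_NODES]
```

Because the registration calls are rare and cheap, bypassing the VIP for just this step keeps the fix small while guaranteeing every node has seen a registration.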
                            <comment id="65415" author="jluhrsen" created="Wed, 24 Oct 2018 23:41:50 +0000"  >&lt;p&gt;&lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/77229/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this patch &lt;/a&gt; seems to mask this issue&lt;/p&gt;</comment>
                            <comment id="65518" author="jluhrsen" created="Thu, 8 Nov 2018 01:30:58 +0000"  >&lt;p&gt;will close this issue when &lt;a href=&quot;https://git.opendaylight.org/gerrit/#/c/77569/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://git.opendaylight.org/gerrit/#/c/77569/&lt;/a&gt; is merged&lt;/p&gt;</comment>
                            <comment id="65879" author="jluhrsen" created="Tue, 4 Dec 2018 19:59:13 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.opendaylight.org/secure/ViewProfile.jspa?name=shague&quot; class=&quot;user-hover&quot; rel=&quot;shague&quot;&gt;shague&lt;/a&gt;, can we keep this open? I know we had one patch merged towards fixing this general problem, but it&apos;s still not totally fixed.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/135/robot-plugin/log_full.html.gz#s1-s6-t21-k3-k1-k3-k1-k2&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;robot log showing instances in error state, failing to allocate network &lt;/a&gt;&lt;br/&gt;
&lt;a href=&quot;https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/135/control_1/oslogs/neutron-server.log.gz&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;neutron server log &lt;/a&gt;&lt;/p&gt;

&lt;p&gt;in the neutron log you can see the below message, which I think is indicative of this&lt;br/&gt;
problem:&lt;/p&gt;
&lt;div class=&quot;code panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;codeContent panelContent&quot;&gt;
&lt;pre class=&quot;code-java&quot;&gt;
48506:2018-11-16 23:57:13.071 sERROR networking_odl.common.websocket_client None req-582e6941-f0b2-4e84-9be5-c8d7984be1cf None None websocket irrecoverable error 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="65883" author="shague@redhat.com" created="Wed, 5 Dec 2018 01:03:41 +0000"  >&lt;p&gt;yeah, that&apos;s fine to keep open. I closed it after I saw your comment about closing when 77569 is merged.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10003">
                    <name>Relates</name>
                                            <outwardlinks description="relates to">
                                        <issuelink>
            <issuekey id="30883">NETVIRT-1461</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                            <customfield id="customfield_11400" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10000" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0|i03jlj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                </customfields>
    </item>
</channel>
</rss>