[NETVIRT-819] An instance doesn't get an IP after deployment Created: 03/Aug/17  Updated: 03/May/18  Resolved: 28/Aug/17

Status: Verified
Project: netvirt
Component/s: General
Affects Version/s: Carbon
Fix Version/s: None

Type: Bug
Reporter: Itzik Brown Assignee: Sridhar Gaddam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: File overcloud_validate.log.txt.gz    
External issue ID: 8926

 Description   

Description of problem
=======================
After deployment, a launched instance doesn't get an IP.
When a second instance is launched on the same node, it gets an IP.
Then, when the first instance is rebooted, it gets an IP.

Version
=======
opendaylight-6.1.0-2.el7ost.noarch

Steps to Reproduce
=====================
1. Deploy an overcloud.
2. Launch an instance.
3. Open the proper security group rules.
4. Verify that the instance doesn't get an IP (this can be checked by pinging from the DHCP namespace).
5. Launch an instance on the same node as the first one and verify that it gets an IP.



 Comments   
Comment by Tim Rozet [ 15/Aug/17 ]

Also seen in TripleO CI. See tripleo_ci attachment. We need to get flow dump and ODL logs, which are missing right now in OOO CI.

Comment by Tim Rozet [ 15/Aug/17 ]

Attachment overcloud_validate.log.txt.gz has been added with description: tripleo_ci_log

Comment by Sridhar Gaddam [ 18/Aug/17 ]

In a fresh multinode deployment with a Controller node running
ODL + dhcp-agent and a Compute node, when we spawn the first VM
on the compute node, the VM does not acquire an IP address.
On debugging, it turns out that the remote broadcast
group entries were not programmed on the Compute node.

Setup details:
1. Multi-node with Controller and a Compute node.
2. Create a tenant neutron network with an IPv4 subnet.
3. Create a neutron router.
4. Associate the ipv4 subnet to the neutron router.

At this stage, you can see that there is no tunnel between
Controller node and Compute node.

5. Now spawn a VM on the Compute node (you can explicitly
specify that VM has to be spawned on the compute node
by passing --availability-zone to nova boot command).

When the VM is spawned, the following is the sequence
of events:

t1: Nova creates a tap interface for the VM. This translates
to an add event for the elanInterface (i.e., ElanInterfaceStateChangeListener
is invoked, and addElanInterface gets processed).
t2: In addElanInterface, elanManager checks if the interface
is part of existingElanDpnInterfaces (i.e., the DpnInterfaces YANG model).
t3: Since it's a new interface, it invokes createElanInterfacesList(),
which updates the DpnInterfaces model. At this stage, the
transaction/information is still not committed to the datastore.
t4: The processing continues to installEntriesForFirstInterfaceonDpn(),
where we try to program the local/remote BC Group entries.
In this API, we have an explicit sleep of (300 + 300) ms, after which we
query getEgressActionsForInterface (an API in GENIUS).
GENIUS returns an empty list with the following reason: "Interface
information not present in oper DS for the tunnel interface".
t5: So the remote BC Group does not include the actions to send
packets over the tunnel interface at this stage.
t6: addElanInterface processing continues further and we commit the
transaction (i.e., the DpnInterfaces model is now updated in the datastore).
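The t1-t6 sequence can be sketched as a minimal, hypothetical Python model. The function names mirror the ones in this comment, but the data structures are simple stand-ins, not the real ODL/GENIUS code:

```python
# Simplified model of t1-t6: the DpnInterfaces update is staged but not
# committed while the remote BC group is being programmed, and the tunnel
# is not yet in the GENIUS oper DS, so the egress actions come back empty.

def get_egress_actions_for_interface(tunnel_oper_ds, tunnel):
    """GENIUS-style lookup: returns [] if the tunnel is not in oper DS yet."""
    return tunnel_oper_ds.get(tunnel, [])

def add_elan_interface(tunnel_oper_ds, datastore):
    # t2/t3: stage the DpnInterfaces update; NOT yet committed.
    staged = {"dpn-interfaces": ["compute-1"]}
    # t4: program the remote BC group; the tunnel is not in oper DS,
    # so GENIUS returns an empty action list.
    actions = get_egress_actions_for_interface(tunnel_oper_ds, "tun-compute-1")
    remote_bc_group = list(actions)   # t5: no tunnel output action included
    datastore.update(staged)          # t6: commit happens only now
    return remote_bc_group

datastore = {}
# Tunnel oper DS is still empty when addElanInterface runs:
remote_bc_group = add_elan_interface({}, datastore)
```

Running this shows `remote_bc_group` ends up empty even though the DpnInterfaces entry is eventually committed, which is exactly the state described at t5/t6.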

While t1 to t6 are going on, the auto-tunnel code in GENIUS
creates the tunnel interfaces in parallel.

A1: A tunnel interface is created on the Compute node. When the tunnel interface
state is up, the TunnelsState YANG model is updated in GENIUS (ItmTunnelStateUpdateHelper).
A2: A notification is received in ElanTunnelInterfaceStateListener, which
is handled in the following API: handleInternalTunnelStateEvent.
A3: In this API, when we query ElanDpnInterfaces, it only includes
the DPNInfo of the Controller and not the Compute node (because of
the delay in updating the model in steps t3-t6 above).
A4: Due to this, handleInternalTunnelStateEvent does not invoke
setupElanBroadcastGroups() to program the Remote Group entries
on the Compute node, and the remote Broadcast Group entries on
the Compute node never get updated.
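The A1-A4 path can also be sketched in a few lines of hypothetical Python; again the names follow this comment, and the membership check is a stand-in for the real ElanDpnInterfaces query:

```python
# Simplified model of A1-A4: the tunnel-state listener fires while the
# compute node's DpnInterfaces entry is still uncommitted, so
# setupElanBroadcastGroups() is never invoked for the compute node.

def handle_internal_tunnel_state_event(elan_dpn_interfaces, tunnel_dpn,
                                       programmed_groups):
    # A3: only DPNs already committed to ElanDpnInterfaces are considered.
    if tunnel_dpn in elan_dpn_interfaces:
        programmed_groups.add(tunnel_dpn)   # setupElanBroadcastGroups()
    # A4: otherwise the remote BC group on that DPN is never updated.

programmed = set()
# Tunnel comes up (A1/A2) before the compute DPN is committed (t3-t6 delay):
handle_internal_tunnel_state_event({"controller"}, "compute-1", programmed)
```

Because "compute-1" is absent from the (stale) ElanDpnInterfaces view, nothing is programmed for it, and no later event retries.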

So the fix is to not delay the update of the DpnInterfaces model
(i.e., until step t6), since this information is used while processing
ElanTunnelInterfaceState.
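A before/after sketch of the fix, in the same hypothetical Python model (the ordering of the two calls is the point; everything else is a stand-in):

```python
# Committing the DpnInterfaces update early lets the tunnel-state listener
# see the compute DPN and program its remote broadcast group.

def run(commit_before_tunnel_event):
    elan_dpn_interfaces = {"controller"}
    programmed = set()

    def commit_dpn_interfaces():
        elan_dpn_interfaces.add("compute-1")    # DpnInterfaces model updated

    def tunnel_event():                         # handleInternalTunnelStateEvent
        if "compute-1" in elan_dpn_interfaces:
            programmed.add("compute-1")         # setupElanBroadcastGroups()

    if commit_before_tunnel_event:
        commit_dpn_interfaces()   # fixed: commit at t3, before the event
        tunnel_event()
    else:
        tunnel_event()            # buggy: event races ahead of the commit
        commit_dpn_interfaces()   # commit delayed until t6
    return programmed

broken = run(False)   # compute node's BC group never programmed
fixed = run(True)     # compute node's BC group programmed
```

With the buggy ordering the compute node's group is never programmed; with the commit moved ahead of the event, it is.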

Comment by Sridhar Gaddam [ 18/Aug/17 ]

The following patch addresses this issue:
https://git.opendaylight.org/gerrit/#/c/61995/

Comment by Sam Hague [ 19/Aug/17 ]

nitrogen and master are merged. Waiting for carbon.

Comment by Sam Hague [ 19/Aug/17 ]

master: https://git.opendaylight.org/gerrit/#/c/62015/1
nitrogen: https://git.opendaylight.org/gerrit/#/c/62014/1

Comment by A H [ 23/Aug/17 ]

A patch was submitted to revert the changes and fix this bug in Carbon SR2:

https://git.opendaylight.org/gerrit/#/c/61995/

To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Regression: Is this bug a regression of functionality/performance/feature compared to Carbon?
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Carbon SR2 without it?
Workaround: Is there a workaround such that we can write a release note instead?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Sridhar Gaddam [ 24/Aug/17 ]

(In reply to A H from comment #7)
> A patch was submitted to revert the changes and fix this bug in Carbon SR2:
>
> https://git.opendaylight.org/gerrit/#/c/61995/
>
> To better assess the impact of this bug and fix, could someone from your
> team please help us identify the following:
> Regression: Is this bug a regression of functionality/performance/feature
> compared to Carbon?
Yes, AFAIK it's a regression.

> Severity: Could you elaborate on the severity of this bug? Is this a
> BLOCKER such that we cannot release Carbon SR2 without it?
Yes, it's a blocker bug for a multi-node setup.

> Workaround: Is there a workaround such that we can write a release note
> instead?
One workaround is to create the tunnels manually and not rely on the auto-tunnel code, but this is not an acceptable solution for many users.

> Testing: Could you also elaborate on the testing of this patch? How
> extensively has this patch been tested? Is it covered by any unit tests or
> system tests?
The issue is seen in TripleO CI and also in our downstream multi-node setup.

> Impact: Does this fix impact any dependent projects?
No

Comment by Tim Rozet [ 24/Aug/17 ]

OPNFV Functest is now passing with the fix:
https://build.opnfv.org/ci/job/apex-verify-master/756/

Comment by Sridhar Gaddam [ 28/Aug/17 ]

(In reply to Timothy Rozet from comment #9)
> OPNFV Functest is now passing with the fix:
> https://build.opnfv.org/ci/job/apex-verify-master/756/

Thanks @Tim for confirming that the fix is working.

Generated at Wed Feb 07 20:22:33 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.