[OPNFLWPLUG-1056] Default tables missing Created: 10/Dec/18  Updated: 09/Jul/19  Resolved: 09/Jul/19

Status: Resolved
Project: OpenFlowPlugin
Component/s: None
Affects Version/s: None
Fix Version/s: Neon, Sodium

Type: Bug Priority: Medium
Reporter: Sam Hague Assignee: Somashekhar Javalagi
Resolution: Done Votes: 0
Labels: csit:failures
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

CSIT job where the default tables are missing, so the suite setup fails.

 

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-gate-stateful-neon/439/robot-plugin/log_01_l2.html#s1-k1-k1-k9-k2-k3-k1-k1-k5



 Comments   
Comment by Faseela K [ 10/Dec/18 ]

shague : Why is this a GENIUS JIRA? The tables missing default flows are 18, 60, and 45. Those are netvirt-programmed tables, and if they were already programmed by netvirt, this is most likely an openflowplugin bug.

Comment by Sam Hague [ 10/Dec/18 ]

Thanks, moved to ofp.

Comment by Somashekhar Javalagi [ 13/Dec/18 ]

shague I have added a patch with debug logs; can you please run the CSIT with it?

Comment by Jamo Luhrsen [ 20/Dec/18 ]

We need to figure out which job sees this most frequently and then try to reproduce it there
with this patch. The job given in the description is a gate job. I've checked the non-gate
job of the same type and this problem hasn't happened in at least the last 30 runs.

I'll see what I can figure out, but if anyone else knows please comment.

Comment by Jamo Luhrsen [ 20/Dec/18 ]

Looks like the 3node cluster jobs see this more frequently than others. This tempest one
in particular failed because of missing default tables 3 times in the past 30 days (runs once a day).

I will create a sandbox job that runs without any test cases, in a loop,
using the distribution from the patch (c/78730) and monitor for any
failures due to missing tables.

Comment by Jamo Luhrsen [ 20/Dec/18 ]

Here's the sandbox job to try and reproduce this. Note that it
will be purged in two days, so if we don't hit the problem before then I'll
have to recreate.

Comment by Jamo Luhrsen [ 21/Dec/18 ]

I was able to recreate this with the distro created in the debug patch.

Here is a link to the robot failure where you can see that table=45 was not found on one of the nodes.

This is a clustered job, so there are three karaf logs to look at:

ODL 1
ODL 2
ODL 3

The node with the missing table=45 was the first compute node, and its OVSDB UUID was 46c9d66c-60a6-4da3-8b58-2c4831689600. I think
the ODL that owned and wrote to it was ODL 2, based on grepping for that UUID in each karaf.log and seeing things like addPatchPort
in ODL 2 and not in the others.

Comment by Arunprakash D [ 06/Feb/19 ]

DeviceContext writes the node information to the operational inventory.

RoleContext is responsible for the device's mastership election and the ownership change callback.

 

FRM registers for the ownership callback and is notified once a master is elected for a device. In some cases RoleContext takes time over the ownership election, but DeviceContext goes ahead and writes the node information to the operational inventory. As a result, applications listening on the node via a DTCL push the default table flows, which may be dropped by FRM because it has not yet received the ownership callback.
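
As a minimal sketch of the race described above (not from this ticket, using toy stand-ins rather than the real DeviceContext/RoleContext/FRM classes; all names below are illustrative), the default-flow push is dropped when the node reaches the operational inventory before the ownership callback fires:

    import java.util.concurrent.CompletableFuture;

    class MastershipRaceSketch {

        // Stand-in for RoleContext: completes once mastership election finishes.
        static final CompletableFuture<Boolean> mastershipElected = new CompletableFuture<>();

        // Stand-in for FRM: drops flow pushes until the ownership callback has fired.
        static void frmAddFlow(String flow) {
            if (!mastershipElected.getNow(false)) {
                System.out.println("FRM: no mastership callback yet, dropping " + flow);
                return;
            }
            System.out.println("FRM: programming " + flow);
        }

        // Stand-in for an application's node DTCL: fires as soon as the node shows
        // up in the operational inventory and immediately pushes default table flows.
        static void onNodeAddedToOperInventory(String nodeId) {
            frmAddFlow("table=45 default flow on " + nodeId);
        }

        public static void main(String[] args) {
            // Race: the node is written to the oper inventory before the election
            // has completed, so the application's flow push is dropped by FRM.
            onNodeAddedToOperInventory("openflow:1");
            mastershipElected.complete(true);   // election completes too late
        }
    }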

The new implementation would have DeviceContext wait for the mastership election to complete and only then write the switch information to the operational inventory. This ensures FRM always has the mastership details when it receives flow information.
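
A correspondingly minimal sketch of the proposed ordering, continuing the toy model above (again illustrative, not the actual openflowplugin implementation): the oper-inventory write is chained after the mastership future, so the listener-driven flow push only happens once FRM already has the ownership callback:

    class MastershipOrderingFixSketch {
        public static void main(String[] args) {
            // Defer the oper-inventory write until mastership election completes.
            MastershipRaceSketch.mastershipElected.thenRun(() ->
                // Only now does the node appear in the oper inventory, so the
                // DTCL-driven flow push is accepted instead of dropped.
                MastershipRaceSketch.onNodeAddedToOperInventory("openflow:1"));

            MastershipRaceSketch.mastershipElected.complete(true);
        }
    }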
