[GENIUS-263] tunnels down after bouncing ODL nodes in netvirt csit 3node HA job Created: 08/Jan/19  Updated: 05/Feb/20  Resolved: 05/Feb/20

Status: Verified
Project: genius
Component/s: ITM
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: High
Reporter: Jamo Luhrsen Assignee: nidhi adhvaryu
Resolution: Cannot Reproduce Votes: 0
Labels: csit, csit:3node, csit:failures
Remaining Estimate: 0 minutes
Time Spent: 1 day
Original Estimate: Not Specified

Issue Links:
Blocks
blocks NEUTRON-204 networking-odl gives up on websocket,... Open

 Description   

In our 3node netvirt jobs it seems the default tunnels all end up showing as 'down'
after running the first suite that bounces ODL instances.

example job

The test code uses odltools to check if all the tunnels are up. The command and its
response:

odltools netvirt analyze tunnels -i 10.30.170.109 -t 8181 -u admin -w admin --path /tmp/07_ha_l3_Suite_Setup

2019-01-03 03:11:41,564 | ERR | common.rest_client   | 0052 | 404 Client Error: Not Found for url: http://10.30.170.109:8181/restconf/config/itm-state:dpn-teps-state
Analysing transport-zone:default-transport-zone
..Interface tun65d79967da1 is down between 10.30.170.163 and 10.30.170.27
..Interface tun8186ae8b8b0 is down between 10.30.170.27 and 10.30.170.163
..Interface tunaddd45e0aa2 is down between 10.30.170.163 and 10.30.170.170
..Interface tun0a682004fbe is down between 10.30.170.27 and 10.30.170.170

But comparing some debug output from earlier in the suite against the debug output
at the failure, I can't find any difference.

Taking one interface, "tun65d79967da1", here is some of the output:

operational/itm-state:tunnels_state

{
    "dst-info": {
        "tep-device-id": "140946245075061",
        "tep-device-type": "itm-state:tep-type-internal",
        "tep-ip": "10.30.170.27"
    },
    "oper-state": "unknown",
    "src-info": {
        "tep-device-id": "62509838011292",
        "tep-device-type": "itm-state:tep-type-internal",
        "tep-ip": "10.30.170.163"
    },
    "transport-type": "odl-interface:tunnel-type-vxlan",
    "tunnel-interface-name": "tun65d79967da1",
    "tunnel-state": false
},
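
For reference, the same check can be scripted directly against the operational datastore
(a minimal sketch, not part of the suite; it assumes the admin/admin credentials and
controller address from the odltools command above, and that the tunnels_state container
holds a "state-tunnel-list" list whose entries look like the fragment shown):

import requests

# Controller address and credentials taken from the odltools invocation above.
URL = "http://10.30.170.109:8181/restconf/operational/itm-state:tunnels_state"

resp = requests.get(URL, auth=("admin", "admin"), timeout=10)
resp.raise_for_status()

# The container key and the "state-tunnel-list" list name are assumptions based on
# the fragment above; each list entry has the same shape as that fragment.
data = resp.json()
container = data.get("tunnels_state") or data.get("itm-state:tunnels_state") or {}
for tunnel in container.get("state-tunnel-list", []):
    if not tunnel.get("tunnel-state"):
        print("{} is down: {} -> {} (oper-state={})".format(
            tunnel["tunnel-interface-name"],
            tunnel["src-info"]["tep-ip"],
            tunnel["dst-info"]["tep-ip"],
            tunnel.get("oper-state")))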

of-ctl show

2(tun65d79967da1): addr:52:fe:71:a9:ad:14
     config:     0
     state:      LIVE
     speed: 0 Mbps now, 0 Mbps max

ovs-vsctl show

Port "tun65d79967da1"
            Interface "tun65d79967da1"
                type: vxlan
                options: {key=flow, local_ip="10.30.170.163", remote_ip="10.30.170.27"}
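
The OVS side can also be cross-checked directly on the compute node. If ITM tunnel
monitoring is BFD-based (a hedged assumption; monitoring may be disabled or LLDP-based
in this setup), these standard OVS commands show whether a BFD session is configured and
up for the interface in question:

ovs-vsctl list Interface tun65d79967da1    (check the "bfd" and "bfd_status" columns)
ovs-appctl bfd/show tun65d79967da1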

It does seem that there may be trouble in at least one ODL instance after it comes back
up, as I see some clustering INFO messages indicating that a follower is out of sync.
For example:

2019-01-03T03:04:38,348 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            | 229 - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): The log is not empty but the prevLogIndex 19042 was not found in it - lastIndex: 17875, snapshotIndex: -1
2019-01-03T03:04:38,348 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            | 229 - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false, followerId=member-3-shard-default-config, logLastIndex=17875, logLastTerm=4, forceInstallSnapshot=false, payloadVersion=9, raftVersion=3]
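
A hedged way to confirm whether that member ever caught back up would be the standard
Jolokia shard MBean (assuming Jolokia is enabled on the controllers; the controller
address below is the one from the odltools command and may not be the member-3 instance,
the shard name is the one from the log lines above):

curl -s -u admin:admin 'http://10.30.170.109:8181/jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=Shards,name=member-3-shard-default-config'

The RaftState and LastLogIndex attributes in the reply should show whether the follower
is back in sync with the shard leader.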


 Comments   
Comment by nidhi adhvaryu [ 29/Jan/20 ]

From this information I can see that one controller is down, which causes the DPNs to get disconnected and leaves the tunnels in the 'unknown' state. But, as you mentioned, the other two controllers are up, so the tunnels should come back up. To investigate further I need logs, and the logs you attached are no longer available.

Should I re-trigger the job on Fluorine?

Comment by Jamo Luhrsen [ 31/Jan/20 ]

Hi enidadh,

This bug is over a year old and I haven't been paying much attention to netvirt CSIT in the past 6 months or so. Logs are purged after 6 months, which
is why the provided logs are no longer available.

You can probably just dig through the recent results of these two jobs:
https://jenkins.opendaylight.org/releng/job/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/
https://jenkins.opendaylight.org/releng/job/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-magnesium/

I'm not sure if the problem is still happening, but those jobs do have lots of failures, and maybe consistent ones.

Comment by nidhi adhvaryu [ 03/Feb/20 ]

Hi jluhrsen,

I have analyzed the latest job and have not observed this failure in it.

I have not seen a similar failure in the later branches either.

Comment by Jamo Luhrsen [ 03/Feb/20 ]

OK, maybe the failures in the links I gave are from some other/new bug then. If so, you can close this as unreproducible and open new bugs for those failures.

Comment by nidhi adhvaryu [ 05/Feb/20 ]

Thanks jluhrsen.

I will close this bug and check the recent failures.

Comment by nidhi adhvaryu [ 05/Feb/20 ]

This bug is not reproducible and is not present in recent branches, so I am closing it.
