[GENIUS-263] tunnels down after bouncing ODL nodes in netvirt csit 3node HA job Created: 08/Jan/19 Updated: 05/Feb/20 Resolved: 05/Feb/20 |
|
| Status: | Verified |
| Project: | genius |
| Component/s: | ITM |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | High |
| Reporter: | Jamo Luhrsen | Assignee: | nidhi adhvaryu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | csit, csit:3node, csit:failures | ||
| Remaining Estimate: | 0 minutes | ||
| Time Spent: | 1 day | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Description |
|
in our 3node netvirt jobs it seems the default tunnels all end up showing as 'down'. The test code is using odltools to check if all the tunnels are up. The command and its output:

odltools netvirt analyze tunnels -i 10.30.170.109 -t 8181 -u admin -w admin --path /tmp/07_ha_l3_Suite_Setup

2019-01-03 03:11:41,564 | ERR | common.rest_client | 0052 | 404 Client Error: Not Found for url: http://10.30.170.109:8181/restconf/config/itm-state:dpn-teps-state
Analysing transport-zone:default-transport-zone
..Interface tun65d79967da1 is down between 10.30.170.163 and 10.30.170.27
..Interface tun8186ae8b8b0 is down between 10.30.170.27 and 10.30.170.163
..Interface tunaddd45e0aa2 is down between 10.30.170.163 and 10.30.170.170
..Interface tun0a682004fbe is down between 10.30.170.27 and 10.30.170.170

but, looking at some debug output from earlier in the suite and comparing, taking one interface "tun65d79967da1" as an example, here is some output:

operational/itm-state:tunnels_state
{
"dst-info": {
"tep-device-id": "140946245075061",
"tep-device-type": "itm-state:tep-type-internal",
"tep-ip": "10.30.170.27"
},
"oper-state": "unknown",
"src-info": {
"tep-device-id": "62509838011292",
"tep-device-type": "itm-state:tep-type-internal",
"tep-ip": "10.30.170.163"
},
"transport-type": "odl-interface:tunnel-type-vxlan",
"tunnel-interface-name": "tun65d79967da1",
"tunnel-state": false
},
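The same tunnel-state records can also be pulled straight from the RESTCONF operational datastore that odltools reads. A minimal sketch, assuming admin/admin credentials and RESTCONF on port 8181 (as in the odltools command above); the "state-tunnel-list" key is taken from the itm-state model and the list_down_tunnels helper is hypothetical, not part of the CSIT suite:

# Hypothetical helper: list ITM tunnels whose tunnel-state is false by
# querying the operational datastore directly.
# Assumes admin/admin credentials and RESTCONF on port 8181; the
# "state-tunnel-list" key is the list name assumed from the itm-state model.
import requests

def list_down_tunnels(controller_ip):
    url = ("http://%s:8181/restconf/operational/itm-state:tunnels_state"
           % controller_ip)
    resp = requests.get(url, auth=("admin", "admin"), timeout=10)
    resp.raise_for_status()
    tunnels = resp.json().get("tunnels_state", {}).get("state-tunnel-list", [])
    down = [t for t in tunnels if not t.get("tunnel-state")]
    for t in down:
        print("%s is down: %s -> %s (oper-state=%s)" % (
            t["tunnel-interface-name"],
            t.get("src-info", {}).get("tep-ip"),
            t.get("dst-info", {}).get("tep-ip"),
            t.get("oper-state")))
    return down

if __name__ == "__main__":
    list_down_tunnels("10.30.170.109")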
of-ctl show
2(tun65d79967da1): addr:52:fe:71:a9:ad:14
config: 0
state: LIVE
speed: 0 Mbps now, 0 Mbps max
ovs-vsctl show

    Port "tun65d79967da1"
        Interface "tun65d79967da1"
            type: vxlan
            options: {key=flow, local_ip="10.30.170.163", remote_ip="10.30.170.27"}

It does seem that there may be trouble in at least one ODL after it comes back up, as seen in these log entries:

2019-01-03T03:04:38,348 | INFO | opendaylight-cluster-data-shard-dispatcher-21 | Shard | 229 - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): The log is not empty but the prevLogIndex 19042 was not found in it - lastIndex: 17875, snapshotIndex: -1
2019-01-03T03:04:38,348 | INFO | opendaylight-cluster-data-shard-dispatcher-21 | Shard | 229 - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false, followerId=member-3-shard-default-config, logLastIndex=17875, logLastTerm=4, forceInstallSnapshot=false, payloadVersion=9, raftVersion=3] |
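For completeness, here is a minimal sketch of how the OVS-side state shown in the description (the ofctl port state and the ovs-vsctl tunnel options) could be collected on a compute node. It assumes local shell access, that the tunnel port is on br-int speaking OpenFlow 1.3, and that check_ovs_tunnel is a hypothetical helper, not something the suite actually runs:

# Hypothetical helper: inspect the OVS side of one tunnel interface, roughly
# reproducing the "of-ctl show" / "ovs-vsctl show" fragments pasted above.
# Assumes local shell access on the compute node and that the tunnel port
# is on br-int with protocols=OpenFlow13 (typical for an ODL-managed bridge).
import subprocess

def check_ovs_tunnel(ifname, bridge="br-int"):
    # Tunnel endpoint configuration: type, key=flow, local_ip, remote_ip.
    opts = subprocess.check_output(
        ["ovs-vsctl", "get", "Interface", ifname, "options"], text=True)
    # link_state is a standard column of the OVSDB Interface table (up/down).
    link = subprocess.check_output(
        ["ovs-vsctl", "get", "Interface", ifname, "link_state"], text=True)
    print("options:    %s" % opts.strip())
    print("link_state: %s" % link.strip())

    # OpenFlow view of the port, i.e. the "state: LIVE" lines shown above.
    of_show = subprocess.check_output(
        ["ovs-ofctl", "-O", "OpenFlow13", "show", bridge], text=True)
    lines = of_show.splitlines()
    for i, line in enumerate(lines):
        if ifname in line:
            # Print the port line plus the config/state lines that follow it.
            print("\n".join(lines[i:i + 4]))
            break

if __name__ == "__main__":
    check_ovs_tunnel("tun65d79967da1")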
| Comments |
| Comment by nidhi adhvaryu [ 29/Jan/20 ] |
|
From this information I can identify that one controller is down, due to which DPNs are getting disconnected and the tunnels will be in the unknown state. But as you mentioned, the other 2 controllers are up, so the tunnels should be up. To investigate this further I need logs, and the logs you attached are no longer available. Should I re-trigger the job on Fluorine? |
| Comment by Jamo Luhrsen [ 31/Jan/20 ] |
|
Hi enidadh, this bug is over a year old and I haven't been paying much attention to netvirt CSIT in the past 6 months or so. Logs are purged after 6 months, but you can probably just dig through the recent results of these two jobs: I'm not sure if the problem is still happening or not, but those jobs do have lots of failures, and maybe consistent ones. |
| Comment by nidhi adhvaryu [ 03/Feb/20 ] |
|
Hi jluhrsen, I have analyzed the latest job, in which I have not observed this failure. I haven't seen a similar failure in later branches either. |
| Comment by Jamo Luhrsen [ 03/Feb/20 ] |
|
OK, maybe the failures in the links I gave are from some other/new bug then. If so, you can close this as unreproducible and open new bugs for those failures. |
| Comment by nidhi adhvaryu [ 05/Feb/20 ] |
|
Thanks jluhrsen. I will close this bug, and I will check the recent failures. |
| Comment by nidhi adhvaryu [ 05/Feb/20 ] |
|
This bug is not reproducible and is not present in recent branches, so I am closing it. |