[OVSDB-444] Port binding failure after rebooting compute node - hostconfig missing Created: 13/Dec/17 Updated: 02/May/19 Resolved: 02/May/19 |
|
| Status: | Resolved |
| Project: | ovsdb |
| Component/s: | Southbound.Open_vSwitch |
| Affects Version/s: | Oxygen-SR4, Fluorine |
| Fix Version/s: | Fluorine |
| Type: | Bug | Priority: | Medium |
| Reporter: | Victor Pickard | Assignee: | Victor Pickard |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
I have a Pike/Carbon TripleO-based setup. I spin up a VM on the compute node and verify it gets an IP and can ping; all is good. Then I delete the VM and reboot the compute node (sudo shutdown -r now). The compute node comes back up and the tunnel between compute and control is re-established. However, when I boot a VM again, port binding fails, as shown below. It looks like networking_odl sees the agent as "dead". Booting the VM again gives the same result (I wondered whether it simply took some time for the agent to be seen as "alive" again). This log is generated:
2017-11-29 23:52:38.233 56464 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-fde69db7-2e68-4d1b-887c-3d16680cf567 74655be1d3634adc8234db3b9a35d083 231c8ce05e234185ac3db9c98e5eb3bd - default default] Refusing to bind port 13029ea7-9821-4796-a1d3-0640a8b14873 to dead pseudo agent: {'binary': u'neutron-odlagent-portbinding', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2017, 11, 29, 23, 46, 18), 'availability_zone': None, 'alive': False, 'topic': u'N/A', 'host': u'overcloud-novacompute-0.opnfvlf.org', 'agent_type': u'ODL L2', 'resource_versions': {}, 'created_at': datetime.datetime(2017, 11, 28, 2, 40, 59), 'started_at': datetime.datetime(2017, 11, 28, 2, 40, 59), 'id': u'dddc4a58-fee2-4788-b560-433be36f64fd', 'configurations': {u'allowed_network_types': [u'local', u'vlan', u'vxlan', u'gre'], u'bridge_mappings': {u'datacentre': u'br-ex'}, u'supported_vnic_types': [{u'vnic_type': u'normal', u'vif_type': u'ovs', u'vif_details': {}}]}}
2017-11-29 23:52:38.233 56464 WARNING networking_odl.ml2.pseudo_agentdb_binding [req-fde69db7-2e68-4d1b-887c-3d16680cf567 74655be1d3634adc8234db3b9a35d083 231c8ce05e234185ac3db9c98e5eb3bd - default default] No ODL hostconfigs for host overcloud-novacompute-0.opnfvlf.org found in agentdb
More debugging, with input from devs, indicates that the hostconfig is missing from ODL.
Here is that query after the reboot of the compute node; note that only the controller host is present and the compute node is missing:
[root@overcloud-controller-0 neutron]# curl -s -u admin:admin -X GET http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/ | python -m json.tool
{
"hostconfigs": {
"hostconfig": [
{
"config": "{ \"supported_vnic_types\": [{ \"vnic_type\": \"normal\", \"vif_type\": \"ovs\", \"vif_details\": {} }], \"allowed_network_types\": [\"local\",\"vlan\",\"vxlan\",\"gre\"], \"bridge_mappings\": {\"da
tacentre\":\"br-ex\"}}",
"host-id": "overcloud-controller-0.opnfvlf.org",
"host-type": "ODL L2"
}
]
}
}
[root@overcloud-controller-0 neutron]#
One last data point: I went to the compute node, did a del-manager followed by a set-manager, and the compute node is now back in the hostconfig, and I can boot VMs on the compute node again.
(overcloud) [heat-admin@overcloud-controller-0 log]$ curl -s -u admin:admin -X GET http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/ | python -m json.tool
From Moshe: another workaround is to restart the opendaylight service.
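A minimal sketch of that workaround, assuming ODL listens for passive OVSDB connections on port 6640 and using ODL_IP as a placeholder for the controller address (both are assumptions; substitute the values from your deployment):
{noformat}
# Run on the affected compute node. Clearing and re-adding the manager forces a
# fresh OVSDB session to ODL, which repopulates the hostconfig entry.
sudo ovs-vsctl del-manager
sudo ovs-vsctl set-manager tcp:${ODL_IP}:6640

# Then confirm the host is back in the hostconfigs (run from a node that can
# reach the controller's RESTCONF endpoint; CIP is the controller IP as above).
curl -s -u admin:admin -X GET \
  "http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/" | python -m json.tool
{noformat}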
Isaku's email describing some relevant behavior: Right now Ran Xiao has reported a bug related to hostconfig. When the compute node is shut down, the hostconfig entry in MD-SAL is deleted as the ovsdb node is removed. Then, suppose the compute node comes back. When port binding occurs, the agentdb is consulted. If the hostconfig in MD-SAL isn't populated after rebooting the compute node, port binding fails. |
| Comments |
| Comment by Ritu Sood [ 14/Dec/17 ] |
|
After the reboot of the compute node, can you also check the ovsdb tables and verify whether they are populated with the hostconfig data? (ovs-vsctl list Open_vSwitch) |
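For reference, a minimal version of that check (a sketch; the exact external_ids keys that networking-odl writes, e.g. odl_os_hostconfig_hostid, are an assumption here, so adjust the grep to whatever your deployment uses):
{noformat}
# On the rebooted compute node: dump the Open_vSwitch table and look for the
# hostconfig data that networking-odl writes into external_ids.
sudo ovs-vsctl list Open_vSwitch
sudo ovs-vsctl get Open_vSwitch . external_ids | grep -i hostconfig
{noformat}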
| Comment by Victor Pickard [ 14/Dec/17 ] |
|
Yes, the tables are populated with the hostconfig after reboot. Here is the dump before reboot:
[stack@control2 vagrant]$ sudo ovs-vsctl list Open_vSwitch
Dump after reboot:
[stack@control2 vagrant]$ sudo ovs-vsctl list Open_vSwitch
|
| Comment by Ritu Sood [ 16/Dec/17 ] |
|
That's good. One other thing to verify is whether, after the reboot, the ovsdb manager is set before the ovsdb tables are updated with the hostconfig information. If the sequence is reversed, that can cause ODL to miss the hostconfig update. |
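A rough way to eyeball that ordering right after the compute node comes back up (a sketch, not a definitive race check; run it as early as possible after boot and repeat a few times):
{noformat}
# If get-manager already shows the ODL target but external_ids has no
# hostconfig data yet, the manager was set before the tables were updated.
sudo ovs-vsctl get-manager
sudo ovs-vsctl --columns=external_ids list Open_vSwitch
{noformat}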
| Comment by Isaku Yamahata [ 20/Dec/17 ] |
|
The ovsdb table looks good. What about the ovsdb entry in the topology in MD-SAL? Does ODL ovsdb handle reboots properly?
|
| Comment by Victor Pickard [ 20/Dec/17 ] |
|
The OVSDB node is not in the topology in MD-SAL after reboot.
Here is the compute node info. I have attached two files, one with the topology before the reboot and one with the topology after the reboot. You will see that this node is not there after the node reboot.
[heat-admin@overcloud-novacompute-0 ~]$ sudo ovs-vsctl show |
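For anyone following along, the MD-SAL ovsdb topology can also be queried directly over RESTCONF (ovsdb:1 is the topology id used by the OVSDB southbound plugin; CIP is the controller IP as in the earlier hostconfig queries):
{noformat}
# Dump the operational ovsdb topology from MD-SAL. After the reboot, the
# compute node's ovsdb://uuid/... node should reappear here, but does not.
curl -s -u admin:admin -X GET \
  "http://${CIP}:8181/restconf/operational/network-topology:network-topology/topology/ovsdb:1" \
  | python -m json.tool
{noformat}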
| Comment by Isaku Yamahata [ 20/Dec/17 ] |
|
I'm asking about the value in the MD-SAL ovsdb model, not the ovsdb table in OVS.
|
| Comment by Victor Pickard [ 03/Jan/18 ] |
|
Isaku, look at my earlier comment from 20/Dec/17. There are two links in that comment, which are the MD-SAL dumps. You can see that the OVSDB node was not in the topology after reboot. So this appears to be an OVSDB bug; I'll change the bug accordingly. |
| Comment by Victor Pickard [ 22/Aug/18 ] |
|
I'm still seeing this with queens/oxygen.
Some karaf logs that may be of interest.... Will keep digging.
{noformat}
2018-08-22T19:39:23,469 | ERROR | OVSDBConnNotifSer-9 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Old connection still hanging for Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/922f7894-5e8f-4b34-8284-8029f3934312}]}
2018-08-22T19:39:23,469 | INFO | OVSDBConnNotifSer-9 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Library disconnected PASSIVE from /10.8.125.231:49846 to /10.8.125.230:6640. Cleaning up the operational data store
2018-08-22T19:39:23,469 | INFO | OVSDBConnNotifSer-9 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Library disconnected ConnectionInfo [_localIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.230]], _localPort=PortNumber [_value=6640], _remoteIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.231]], _remotePort=PortNumber [_value=49846], augmentation=[]] this controller does not have ownership
2018-08-22T19:42:09,781 | ERROR | nioEventLoopGroup-5-5 | ExceptionHandler | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Exception occurred while processing connection pipeline io.netty.handler.timeout.ReadTimeoutException: null
2018-08-22T19:42:09,784 | INFO | nioEventLoopGroup-5-5 | ExceptionHandler | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Closing channel to ovsdb [id: 0x51eb2fe8, L:/10.8.125.230:6640 - R:/10.8.125.231:53276]
2018-08-22T19:42:09,784 | INFO | nioEventLoopGroup-5-5 | OvsdbConnectionService | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Connection closed ConnectionInfo [Remote-address=10.8.125.231, Remote-port=53276, Local-address10.8.125.230, Local-port=6640, type=PASSIVE]
2018-08-22T19:42:09,785 | INFO | nioEventLoopGroup-5-5 | HwvtepConnectionManager | 388 - org.opendaylight.ovsdb.hwvtepsouthbound-impl - 1.6.4.SNAPSHOT | Library disconnected PASSIVE from /10.8.125.231:53276 to /10.8.125.230:6640. Cleaning up the operational data store
2018-08-22T19:42:09,785 | WARN | nioEventLoopGroup-5-5 | HwvtepConnectionManager | 388 - org.opendaylight.ovsdb.hwvtepsouthbound-impl - 1.6.4.SNAPSHOT | HWVTEP disconnected event did not find connection instance for ConnectionInfo [_localIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.230]], _localPort=PortNumber [_value=6640], _remoteIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.231]], _remotePort=PortNumber [_value=53276], augmentation=[]]
2018-08-22T19:42:09,786 | INFO | nioEventLoopGroup-5-5 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Library disconnected PASSIVE from /10.8.125.231:53276 to /10.8.125.230:6640. Cleaning up the operational data store
2018-08-22T19:42:09,786 | INFO | nioEventLoopGroup-5-5 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Library disconnected ConnectionInfo [_localIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.230]], _localPort=PortNumber [_value=6640], _remoteIp=IpAddress [_ipv4Address=Ipv4Address [_value=10.8.125.231]], _remotePort=PortNumber [_value=53276], augmentation=[]] this controller instance has ownership
2018-08-22T19:42:09,787 | INFO | nioEventLoopGroup-5-5 | StalePassiveConnectionService | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Client disconnected ConnectionInfo [Remote-address=10.8.125.231, Remote-port=53276, Local-address10.8.125.230, Local-port=6640, type=PASSIVE]
2018-08-22T19:42:09,789 | WARN | opendaylight-cluster-data-notification-dispatcher-729 | OvsdbNodeListener | 255 - org.opendaylight.genius.itm-impl - 0.4.4.SNAPSHOT | processBridgeUpdate: bridge br-int removal case when Switch is disconnected. Hence, Ovsdb Node could not be fetched from Oper DS.
2018-08-22T19:42:09,790 | WARN | opendaylight-cluster-data-notification-dispatcher-729 | OvsdbNodeListener | 255 - org.opendaylight.genius.itm-impl - 0.4.4.SNAPSHOT | processBridgeUpdate: bridge br-ex removal case when Switch is disconnected. Hence, Ovsdb Node could not be fetched from Oper DS.
2018-08-22T19:42:09,798 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-4 | OvsdbConnectionManager | 393 - org.opendaylight.ovsdb.southbound-impl - 1.6.4.SNAPSHOT | Entity{type='ovsdb', id=/(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)network-topology/topology/topology[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)topology-id=ovsdb:1}]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/922f7894-5e8f-4b34-8284-8029f3934312}]} has no owner, cleaning up the operational data store
{noformat} |
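A quick way to pull just these signatures out of the log (a sketch; the karaf.log path below assumes a default OpenDaylight layout and will differ in containerized/TripleO deployments):
{noformat}
# Look for the stale-connection and read-timeout symptoms around the reboot window.
grep -E "Old connection still hanging|ReadTimeoutException|has no owner" \
  /opt/opendaylight/data/log/karaf.log
{noformat}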
| Comment by Victor Pickard [ 22/Aug/18 ] |
|
Here is the hostconfig before rebooting compute node compute1 (sudo shutdown -r now). Notice that compute1 is missing from the hostconfigs in the second query (it is there initially, but it takes some time before it is removed, roughly a minute or so). The workaround is to do a del-manager followed by a set-manager on the compute node (until this bug is resolved).
[Vic@rippleRider$]$ curl -s -u admin:admin -X GET http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/ | python -m json.tool
[Vic@rippleRider$]$ curl -s -u admin:admin -X GET http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/ | python -m json.tool |
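A small sketch for timing that removal (it just polls the hostconfigs until the given host disappears; HOST is a placeholder for the rebooted compute node and CIP is the controller IP as above):
{noformat}
# Report when the compute node's hostconfig entry is removed after the reboot.
HOST=overcloud-novacompute-0.opnfvlf.org
while curl -s -u admin:admin \
    "http://${CIP}:8181/restconf/operational/neutron:neutron/hostconfigs/" \
    | grep -q "$HOST"; do
  sleep 10
done
echo "$HOST removed from hostconfigs at $(date)"
{noformat}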
| Comment by Victor Pickard [ 23/Aug/18 ] |
|
Here is my analysis of this issue:
1. switch/compute node is rebooted

Proposed Fix

New connection attempt from switch just after reboot. Testing of old connection – fails in 1 sec, clients are notified.
2018-08-22T22:57:22,428 | INFO | OVSDB-Lib-Future-Reaper-1 | StalePassiveConnectionService | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Sending notification for client ConnectionInfo [Remote-address=10.8.125.231, Remote-port=34176, Local-address10.8.125.230, Local-port=6640, type=PASSIVE]
Switch connects to controller after reboot, new connection. This will register the entity for ownership, registerEntityForOwnership(), which fails... next logs. Old connection still there, so new connection is rejected, and ovsdb closes the new connection....
]/node/node[{(urn:TBD:params:xml:ns:yang:network-topology?revision=2013-10-21)node-id=ovsdb://uuid/922f7894-5e8f-4b34-8284-8029f3934312}]}
Read timeout exception (from echo/ping failure of old connection). Fails, closes connection, which ends up removing from hostconfig.
2018-08-22T23:00:09,548 | ERROR | nioEventLoopGroup-5-7 | ExceptionHandler | 389 - org.opendaylight.ovsdb.library - 1.6.4.SNAPSHOT | Exception occurred while processing connection pipeline
Connection is really there between Controller and OVS switch:
[stack@control log]$ netstat -punta|grep 6640
Last login: Thu Aug 23 00:45:32 2018 from 192.168.121.1 |
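To back up the "connection is really there" point, a hedged pair of checks from both ends (nothing here is specific to this bug; it just confirms the TCP session state that ODL's stale-connection handling is disagreeing with):
{noformat}
# On the controller: is there an ESTABLISHED session from the switch to the
# OVSDB listener on port 6640?
sudo netstat -punta | grep 6640

# On the compute node: what manager target is configured, and does the switch
# side believe the session is up (is_connected: true in the Manager record)?
sudo ovs-vsctl get-manager
sudo ovs-vsctl list Manager
{noformat}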
| Comment by Victor Pickard [ 23/Aug/18 ] |
| Comment by Victor Pickard [ 23/Aug/18 ] |
|
For reference, it looks like a similar issue was reported in https://jira.opendaylight.org/browse/OVSDB-439
Corresponding patch: |