[NETVIRT-1312] Upgrade: Missing Remote FIB Entry After new ODL Boots Created: 14/Jun/18 Updated: 05/Jul/18 Resolved: 05/Jul/18 |
|
| Status: | Resolved |
| Project: | netvirt |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Medium |
| Reporter: | Josh Hershberg | Assignee: | Josh Hershberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | netvirt | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
This was discovered in the course of developing the upgrade CSIT. After the "new" ODL is booted various pings are attempted, one of them from the DHCP namespace on the controller node to a VM on one of the compute nodes. The ping fails. An interesting fact that requires explanation is that if a sleep is introduced between the neutron full-sync and the set-manager calls this does not occur. However, if the sleep is not present the error will occur and seems to not fix itself even if tried multiple times over 90 seconds. This bug seems to occur due to the following exception:
2018-06-13T11:08:46,672 | DEBUG | jobcoordinator-main-task-5 | VrfEntryListener | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | createremotefibentry: adding route 91.0.0.8/32 for rd ae388818-b19f-4164-8d3f-366a196116a0 on remoteDpnId 44711998493085
2018-06-13T11:08:46,672 | DEBUG | jobcoordinator-main-task-5 | BaseVrfEntryHandler | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | NextHop IP for destination 91.0.0.8/32 is 10.30.170.181
2018-06-13T11:08:46,673 | DEBUG | jobcoordinator-main-task-5 | NexthopManager | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | Trying to fetch tunnel interface name for source dpn 44711998493085 destIp [1, 0, ., 3, 0, ., 1, 7, 0, ., 1, 8, 1] tunType org.opendaylight.yang.gen.v1.urn.opendaylight.genius.interfacemanager.rev160406.TunnelTypeVxlan
2018-06-13T11:08:46,674 | DEBUG | jobcoordinator-main-task-5 | BaseVrfEntryHandler | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | adding set tunnel id action for label 100005
2018-06-13T11:08:46,674 | DEBUG | jobcoordinator-main-task-5 | FibUtil | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | getAdjacencyIdentifierOp vpninterface fec3aa4e-08da-4f0c-806e-b9126e7b39a5 vpn ae388818-b19f-4164-8d3f-366a196116a0 ip 91.0.0.8/32
2018-06-13T11:08:46,678 | DEBUG | org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.interfaces.rev140508.interfaces.state.Interface_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NatInterfaceStateChangeListener | 369 - org.opendaylight.netvirt.natservice-impl - 0.7.0.SNAPSHOT | add : Interface tun052d1681e4f is not Vlan Interface.Ignoring
2018-06-13T11:08:46,678 | DEBUG | opendaylight-cluster-data-notification-dispatcher-47 | FlowBasedServicesInterfaceStateListener | 265 - org.opendaylight.genius.interfacemanager-impl - 0.5.0.SNAPSHOT | Received interface state add event for tun052d1681e4f
2018-06-13T11:08:46,677 | DEBUG | opendaylight-cluster-data-notification-dispatcher-76 | InterfaceStateListener | 268 - org.opendaylight.genius.itm-impl - 0.5.0.SNAPSHOT | Interface of type Tunnel added: tun052d1681e4f
2018-06-13T11:08:46,677 | DEBUG | opendaylight-cluster-data-notification-dispatcher-62 | FlowBasedServicesInterfaceStateListener | 265 - org.opendaylight.genius.interfacemanager-impl - 0.5.0.SNAPSHOT | Received interface state add event for tun052d1681e4f
2018-06-13T11:08:46,678 | DEBUG | org.opendaylight.yang.gen.v1.urn.ietf.params.xml.ns.yang.ietf.interfaces.rev140508.interfaces.state.Interface_AsyncClusteredDataTreeChangeListenerBase-DataTreeChangeHandler-0 | InterfaceStateListener | 265 - org.opendaylight.genius.interfacemanager-impl - 0.5.0.SNAPSHOT | Received Tunnel state add event for tun052d1681e4f
2018-06-13T11:08:46,675 | DEBUG | jobcoordinator-main-task-5 | InterfaceManagerRpcService | 265 - org.opendaylight.genius.interfacemanager-impl - 0.5.0.SNAPSHOT | RPC getEgressActionsForInterface() failed; input = GetEgressActionsForInterfaceInput{_intfName=tun052d1681e4f, augmentation=[]}
java.lang.NullPointerException: Interface information not present in oper DS for tun052d1681e4f
at org.opendaylight.genius.interfacemanager.IfmUtil.getEgressActionInfosForInterface(IfmUtil.java:279) ~[?:?]
at org.opendaylight.genius.interfacemanager.IfmUtil.getEgressActionsForInterface(IfmUtil.java:220) ~[?:?]
at org.opendaylight.genius.interfacemanager.rpcservice.InterfaceManagerServiceImpl.getEgressActionsForInterface(InterfaceManagerServiceImpl.java:193) ~[?:?]
at org.opendaylight.genius.interfacemanager.rpcservice.InterfaceManagerRpcService.lambda$getEgressActionsForInterface$8(InterfaceManagerRpcService.java:105) ~[?:?]
at org.opendaylight.genius.tools.mdsal.rpc.FutureRpcResults$FutureRpcResultBuilder.build(FutureRpcResults.java:213) ~[?:?]
at org.opendaylight.genius.interfacemanager.rpcservice.InterfaceManagerRpcService.getEgressActionsForInterface(InterfaceManagerRpcService.java:106) ~[?:?]
at org.opendaylight.yangtools.yang.binding.util.RpcMethodInvokerWithInput.invokeOn(RpcMethodInvokerWithInput.java:32) ~[?:?]
at org.opendaylight.yangtools.yang.binding.util.AbstractMappedRpcInvoker.invokeRpc(AbstractMappedRpcInvoker.java:56) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invoke(BindingDOMRpcImplementationAdapter.java:85) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.BindingDOMRpcImplementationAdapter.invokeRpc(BindingDOMRpcImplementationAdapter.java:72) ~[?:?]
at org.opendaylight.controller.md.sal.dom.broker.impl.GlobalDOMRpcRoutingTableEntry.invokeRpc(GlobalDOMRpcRoutingTableEntry.java:41) ~[?:?]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:178) ~[?:?]
at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102) ~[?:?]
at Proxyeea38403_30f0_4d03_ad92_aec4c9bb32e0.invokeRpc(Unknown Source) ~[?:?]
at Proxy9e441361_e307_45d2_a189_b25dc74407e3.invokeRpc(Unknown Source) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke0(RpcServiceAdapter.java:68) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.access$000(RpcServiceAdapter.java:46) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter$RpcInvocationStrategy.invoke(RpcServiceAdapter.java:165) ~[?:?]
at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke(RpcServiceAdapter.java:99) ~[?:?]
at com.sun.proxy.$Proxy183.getEgressActionsForInterface(Unknown Source) ~[?:?]
at org.opendaylight.netvirt.fibmanager.NexthopManager.getEgressActionsForInterface(NexthopManager.java:267) ~[?:?]
at org.opendaylight.netvirt.fibmanager.BaseVrfEntryHandler.programRemoteFib(BaseVrfEntryHandler.java:409) ~[?:?]
at org.opendaylight.netvirt.fibmanager.VrfEntryListener.createRemoteFibEntry(VrfEntryListener.java:1176) ~[?:?]
at org.opendaylight.netvirt.fibmanager.VrfEntryListener.lambda$null$5(VrfEntryListener.java:405) ~[?:?]
at org.opendaylight.genius.infra.ManagedNewTransactionRunnerImpl.callWithNewWriteOnlyTransactionAndSubmit(ManagedNewTransactionRunnerImpl.java:45) ~[?:?]
at org.opendaylight.netvirt.fibmanager.VrfEntryListener.lambda$createFibEntries$6(VrfEntryListener.java:394) ~[?:?]
at org.opendaylight.infrautils.jobcoordinator.internal.JobCoordinatorImpl$MainTask.runWithUncheckedExceptionLogging(JobCoordinatorImpl.java:404) [291:org.opendaylight.infrautils.jobcoordinator-impl:1.4.0.SNAPSHOT]
at org.opendaylight.infrautils.utils.concurrent.LoggingUncaughtThreadDeathContextRunnable.run(LoggingUncaughtThreadDeathContextRunnable.java:60) [296:org.opendaylight.infrautils.util:1.4.0.SNAPSHOT]
at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402) [?:?]
2018-06-13T11:08:46,678 | INFO | ResourceBatchingManager-0 | ResourceBatchingManager | 271 - org.opendaylight.genius.mdsalutil-api - 0.5.0.SNAPSHOT | Total taken ##time = 4ms for resourceList of size 6 for resourceType INTERFACEMGR-DEFAULT-OPERATIONAL
2018-06-13T11:08:46,692 | ERROR | jobcoordinator-main-task-5 | NexthopManager | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | RPC Call to Get egress vm actions for interface tun052d1681e4f returned with Errors [RpcError [message=Interface information not present in oper DS for tun052d1681e4f, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=java.lang.NullPointerException: Interface information not present in oper DS for tun052d1681e4f]]
2018-06-13T11:08:46,693 | ERROR | jobcoordinator-main-task-5 | BaseVrfEntryHandler | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | Failed to retrieve egress action for prefix 91.0.0.8/32 route-paths [RoutePaths\{getLabel=100005, getNexthopAddress=10.30.170.181, augmentations={}}] interface tun052d1681e4f. Aborting remote FIB entry creation.
2018-06-13T11:08:46,693 | DEBUG | jobcoordinator-main-task-5 | VrfEntryListener | 362 - org.opendaylight.netvirt.fibmanager-impl - 0.7.0.SNAPSHOT | Successfully added FIB entry for prefix 91.0.0.8/32 in vpnId 100000
(Note: the ironic last line
|
| Comments |
| Comment by Josh Hershberg [ 14/Jun/18 ] |
|
The issue is that normally the tunnel ports are created WAY before the tap ports so by the time route flows are being configured the operational interface state can be assumed to exist. However, in the upgrade case the taps and tuns appear at about the same time. From the model dumps in the "log_full" attachment it is clear that the interface state for the tun does eventually show up so It's clearly a timing issue. It is as of yet unclear why a sleep before set-manager would prevent this from happening...but this is clearly what is happening and it's clearly an issue. |
| Comment by daya kamath [ 15/Jun/18 ] |
|
josh, i remember you had fixed a similar race in another part of the code for a similar issue, using the watcher. do you have a review for that? kiran could evaluate the same and apply it for the specific exception? delaying openflow until tunnels are built could be non-productive in a scaled setup. however, there is a new feature added in oxygen (off by default) where we have improved on the time it takes to build all tunnels out, by decoupling tunnels from interface manager. it could make this issue go away, however this is not a good fix for this issue, i think. we should address it in code, with a delayed listener which implements the necessary actions when the tunnel interface is up. |
| Comment by Josh Hershberg [ 19/Jun/18 ] |
|
Ah, I'm already working on a patch. |
| Comment by Josh Hershberg [ 20/Jun/18 ] |