[OPNFLWPLUG-1028] Table Miss Entry failed to program in 3 node netvirt CSIT Created: 17/Jul/18 Updated: 20/Aug/18 Resolved: 20/Aug/18 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | clustering |
| Affects Version/s: | None |
| Fix Version/s: | Oxygen, Fluorine |
| Type: | Bug | Priority: | Highest |
| Reporter: | Faseela K | Assignee: | Somashekhar Javalagi |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit, csit:3node | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Epic Link: | Clustering Stability | ||||||||
| Description |
|
L2 suites in netvirt are failing randomly in 3 node, and in one such failed instance I noticed that Table Miss entry for Table 43 is not programmed on switch. I can see the flow being present in config/opendaylight-inventory, but flow is missing on the switch. Problematic node id - openflow:80006780316000 Model Dumps can be found at the Tear Down Dumps of the first failing TC "Check If VMs got IP Address". Let me know if any more information is needed. |
| Comments |
| Comment by Jamo Luhrsen [ 17/Jul/18 ] |
|
k.faseela, could you try to find a log line in karaf.log (I guess it needs to be on the leader) for when |
| Comment by Faseela K [ 17/Jul/18 ] |
|
jluhrsen : This issue came only once out of 20 runs. With INFO in karaf logs, there is nothing to track to see when exactly ELAN pushed this flow. But I see that ELAN has pushed the flow by checking the same in the datastore dump. |
| Comment by Faseela K [ 23/Jul/18 ] |
|
Avishnoi : Any one from ofp who can take this up? This issue is showing up in several of our 3 node CSIT runs. |
| Comment by Faseela K [ 24/Jul/18 ] |
|
Another instance, where table miss flow in Table 50 was present initially, and all of a suddent went off :
|
| Comment by Arunprakash D [ 25/Jul/18 ] |
|
Could you please enable the following package logging and provide the logs? log:set DEBUG org.opendaylight.openflowplugin.impl |
| Comment by Faseela K [ 26/Jul/18 ] |
|
Arunprakash : Here are the logs with DEBUG enabled. Table 18 entry is missing in this case. It took several runs to reproduce the issue when DEBUG was enabled :
|
| Comment by Jamo Luhrsen [ 26/Jul/18 ] |
|
k.faseela do we want to assign this to Arunprakash? thanks for looking at this guys... |
| Comment by Faseela K [ 26/Jul/18 ] |
|
Arunprakash will analyze the logs and let us know whom to assign the bug to. |
| Comment by Faseela K [ 26/Jul/18 ] |
|
d.arunprakash@ericsson.com : Taking at the logs, i can see the below log in https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/260/faseela-l2-netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-oxygen/18/odl_1/odl1_karaf.log.gz 10 mins before the test suite starts : 2018-07-25T23:00:41,688 | INFO | pool-241-thread-1 | WeightedCentralizedSwitchScheduler | 361 - org.opendaylight.netvirt.natservice-impl - 0.6.3.SNAPSHOT | addSwitch: Adding 167740445697660 dpnId to switchWeightsMap 2018-07-25T23:00:42,274 | DEBUG | epollEventLoopGroup-9-1 | SalFlowServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.3.SNAPSHOT | Flow add with id=DHCPTableMissFlowForExternalTunnel finished without error But the flow is not present on the switch when the suite starts. We have 3 OVS in the test, but i see only this flow-id log for 2 switches. Anyways the compute in question, is 167740445697660, so let us assume that the flow was getting pushed for the same, but doesn't show up on switch. But there are different failures showing up in the logs including the below one. Let me know if this could impact:
2018-07-25T23:01:09,328 | ERROR | opendaylight-cluster-data-notification-dispatcher-97 | DefaultConfigPusher | 378 - org.opendaylight.openflowplugin.applications.of-switch-config-pusher - 0.6.3.SNAPSHOT | Future (eventually) failed: addFlow org.opendaylight.controller.md.sal.dom.api.DOMRpcImplementationNotAvailableException: No implementation of RPC AbsoluteSchemaPath{path=[(urn:opendaylight:module:config?revision=2014-10-15)set-config]} available at org.opendaylight.controller.md.sal.dom.broker.impl.RoutedDOMRpcRoutingTableEntry.invokeRpc(RoutedDOMRpcRoutingTableEntry.java:85) [229:org.opendaylight.controller.sal-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRoutingTable.invokeRpc(DOMRpcRoutingTable.java:178) [229:org.opendaylight.controller.sal-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.md.sal.dom.broker.impl.DOMRpcRouter.invokeRpc(DOMRpcRouter.java:102) [229:org.opendaylight.controller.sal-broker-impl:1.7.3.SNAPSHOT] at Proxybb8ecbf1_eb9d_4480_a18e_4a2007f00a3c.invokeRpc(Unknown Source) [?:?] at Proxy40f45215_a726_4be9_ac6b_649862163662.invokeRpc(Unknown Source) [?:?] at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke0(RpcServiceAdapter.java:68) [226:org.opendaylight.controller.sal-binding-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.access$000(RpcServiceAdapter.java:46) [226:org.opendaylight.controller.sal-binding-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter$RpcInvocationStrategy.invoke(RpcServiceAdapter.java:165) [226:org.opendaylight.controller.sal-binding-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.md.sal.binding.impl.RpcServiceAdapter.invoke(RpcServiceAdapter.java:99) [226:org.opendaylight.controller.sal-binding-broker-impl:1.7.3.SNAPSHOT] at com.sun.proxy.$Proxy113.setConfig(Unknown Source) [388:org.opendaylight.openflowplugin.model.flow-service:0.6.3.SNAPSHOT] at org.opendaylight.openflowplugin.openflow.ofswitch.config.DefaultConfigPusher.onDataTreeChanged(DefaultConfigPusher.java:84) [378:org.opendaylight.openflowplugin.applications.of-switch-config-pusher:0.6.3.SNAPSHOT] at org.opendaylight.controller.md.sal.binding.impl.BindingDOMDataTreeChangeListenerAdapter.onDataTreeChanged(BindingDOMDataTreeChangeListenerAdapter.java:41) [226:org.opendaylight.controller.sal-binding-broker-impl:1.7.3.SNAPSHOT] at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.dataChanged(DataTreeChangeListenerActor.java:67) [239:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT] at org.opendaylight.controller.cluster.datastore.DataTreeChangeListenerActor.handleReceive(DataTreeChangeListenerActor.java:41) [239:org.opendaylight.controller.sal-distributed-datastore:1.7.3.SNAPSHOT] at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(AbstractUntypedActor.java:38) [232:org.opendaylight.controller.sal-clustering-commons:1.7.3.SNAPSHOT] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) [42:com.typesafe.akka.actor:2.5.11] at akka.actor.Actor.aroundReceive(Actor.scala:517) [42:com.typesafe.akka.actor:2.5.11] at akka.actor.Actor.aroundReceive$(Actor.scala:515) [42:com.typesafe.akka.actor:2.5.11] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97) [42:com.typesafe.akka.actor:2.5.11] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:590) [42:com.typesafe.akka.actor:2.5.11] at akka.actor.ActorCell.invoke(ActorCell.scala:559) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [42:com.typesafe.akka.actor:2.5.11] at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [42:com.typesafe.akka.actor:2.5.11] |
| Comment by Faseela K [ 26/Jul/18 ] |
|
2018-07-25T23:01:00,416 | INFO | pool-93-thread-1 | SalRoleServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.3.SNAPSHOT | SetRole called with input:SetRoleInput [_controllerRole=BECOMESLAVE, _node=NodeRef [_value=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:167740445697660]]]]}], augmentation=[]] 2018-07-25T23:01:00,417 | INFO | pool-93-thread-1 | SalRoleServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.3.SNAPSHOT | Requesting state change to BECOMESLAVE 2018-07-25T23:01:00,417 | INFO | pool-93-thread-1 | SalRoleServiceImpl | 385 - org.opendaylight.openflowplugin.impl - 0.6.3.SNAPSHOT | RoleChangeTask called on device:openflow:167740445697660 OFPRole:BECOMESLAVE 2018-07-25T23:01:00,418 | INFO | pool-93-thread-1 | RoleService | 385 - org.opendaylight.openflowplugin.impl - 0.6.3.SNAPSHOT | getGenerationIdFromDevice called for device: openflow:167740445697660 2018-07-25T23:01:00,427 | WARN | pool-93-thread-1 | OutboundQueueEntry | 394 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.3.SNAPSHOT | Can't commit a completed message. 2018-07-25T23:01:00,428 | WARN | pool-93-thread-1 | HashedWheelTimer | 62 - io.netty.common - 4.1.22.Final | An exception was thrown by TimerTask. java.lang.NullPointerException: null at org.opendaylight.openflowplugin.impl.services.AbstractRequestCallback.onFailure(AbstractRequestCallback.java:72) ~[?:?] at org.opendaylight.openflowjava.protocol.impl.core.connection.OutboundQueueEntry.commit(OutboundQueueEntry.java:57) ~[394:org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl:0.6.3.SNAPSHOT] at org.opendaylight.openflowjava.protocol.impl.core.connection.StackedOutboundQueue.commitEntry(StackedOutboundQueue.java:36) ~[394:org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl:0.6.3.SNAPSHOT] at org.opendaylight.openflowjava.protocol.impl.core.connection.AbstractStackedOutboundQueue.commitEntry(AbstractStackedOutboundQueue.java:61) ~[394:org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl:0.6.3.SNAPSHOT] at org.opendaylight.openflowplugin.impl.connection.OutboundQueueProviderImpl.commitEntry(OutboundQueueProviderImpl.java:81) ~[?:?] at org.opendaylight.openflowplugin.impl.services.AbstractService.handleServiceCall(AbstractService.java:165) ~[?:?] at org.opendaylight.openflowplugin.impl.services.AbstractService.handleServiceCall(AbstractService.java:110) ~[?:?] at org.opendaylight.openflowplugin.impl.services.RoleService.getGenerationIdFromDevice(RoleService.java:69) ~[?:?] at org.opendaylight.openflowplugin.impl.services.sal.SalRoleServiceImpl.tryToChangeRole(SalRoleServiceImpl.java:84) ~[?:?] at org.opendaylight.openflowplugin.impl.services.sal.SalRoleServiceImpl.setRole(SalRoleServiceImpl.java:78) ~[?:?] at org.opendaylight.openflowplugin.impl.role.RoleContextImpl.sendRoleChangeToDevice(RoleContextImpl.java:165) ~[?:?] at org.opendaylight.openflowplugin.impl.role.RoleContextImpl.makeDeviceSlave(RoleContextImpl.java:144) ~[?:?] at org.opendaylight.openflowplugin.impl.role.RoleContextImpl.lambda$new$0(RoleContextImpl.java:70) ~[?:?] at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:663) [62:io.netty.common:4.1.22.Final] at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:738) [62:io.netty.common:4.1.22.Final] at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:466) [62:io.netty.common:4.1.22.Final] at java.lang.Thread.run(Thread.java:748) [?:?] 2018-07-25T23:01:08,916 | INFO | nioEventLoopGroup-4-1 | LoggingHandler | 62 - io.netty.common - 4.1.22.Final | [id: 0x8b315f38, L:/0:0:0:0:0:0:0:0:6640] READ: [id: 0x0aaeab0b, L:/10.30.170.60:6640 - R:/10.30.170.84:60506] |
| Comment by Faseela K [ 31/Jul/18 ] |
|
Latest log run, with wireshark capture and inventory dumps:
|
| Comment by Faseela K [ 31/Jul/18 ] |
|
Somashekhar : Here are the logs with DEBUGs enabled :
|
| Comment by Somashekhar Javalagi [ 01/Aug/18 ] |
|
k.faseela, I have added few logs for debugging, can you please run the csit with this. |
| Comment by Faseela K [ 01/Aug/18 ] |
|
Here are the logs Somashekhar :
|
| Comment by Arunprakash D [ 03/Aug/18 ] |
|
k.faseela, can you run the csit with FRM debug logs enabled and with Som's review?
Package: org.opendaylight.openflowplugin.applications.frm
https://git.opendaylight.org/gerrit/#/c/74715/
|
| Comment by Faseela K [ 03/Aug/18 ] |
|
Arunprakash, Somashekhar : Running this right now.. Will update once we get a failure log.
jluhrsen : Do you think we should mark this as a blocker for Fluorine? This issue comes only in 3 node CSIT though. |
| Comment by Jamo Luhrsen [ 03/Aug/18 ] |
|
no, I don't think we are close enough with cluster stability to start marking CSIT failures in 3node jobs as blockers. especially sporadic ones. But, I'm crossing my fingers that we will be able to do this in Neon |
| Comment by Sam Hague [ 06/Aug/18 ] |
|
Found an issue in a single node job where table 45 did not have the default flow: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-1node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/130/ |
| Comment by Somashekhar Javalagi [ 06/Aug/18 ] |
|
Openflowplugin-impl is deciding the mastership of the device and communicating same to the forwardingrules-manager through yang notification. But this yang notification is taking more time(around 15 to 25 msecond) to reach FRM, after which mastership will be set to the device. By the time yang notification reaches FRM, there will be flows start coming for the reconciliation. As the device is not yet mastered, these flows get rejected.
There is no direct way to fix this, as the fix may involve lot of design changes. |
| Comment by Sam Hague [ 06/Aug/18 ] |
|
Are there any current workarounds? Understood that this is difficult but this is looking like a very bad issue. Seems like you can hit this at any time. Restarting isn't a good workaround though. |
| Comment by Arunprakash D [ 06/Aug/18 ] |
|
shague, we are working on a simplified fix which can be used for the current situation and move to a better solution later in neon. Somashekhar will raise a review tomorrow with the probable fix and k.faseela can run CSIT and confirm if it works. |
| Comment by Arunprakash D [ 07/Aug/18 ] |
|
k.faseela, could you please provide the latest CSIT run logs, we want to verify the logs for the sequence of events. |
| Comment by Arunprakash D [ 07/Aug/18 ] |
|
k.faseela, thanks for the link via email.
We are able to check the sequence of events from openflowplugin to FRM and the owner is being set now before any default flows being programmed by applications. Somashekhar, you can work on formalizing the review. |