[OPNFLWPLUG-1039] Not able to set SLAVE role on device openflow:97702827111235, reason: Was not able to propagate SLAVE role on device. Error: org.opendaylight.openflowplugin.impl.role.RoleChangeException: Can't commit a completed message. Created: 08/Oct/18 Updated: 07/Jan/19 Resolved: 07/Jan/19 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | openflowplugin-impl |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine-SR2, Neon |
| Type: | Bug | Priority: | Medium |
| Reporter: | Sam Hague | Assignee: | Somashekhar Javalagi |
| Resolution: | Done | Votes: | 0 |
| Labels: | csit:3node, csit:exception, csit:failures | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | | ||
| Issue Links: | | ||
| Epic Link: | Clustering Stability | ||
| Description |
|
The exception and errors below are repeated numerous times as OVS nodes are initially connected to ODL. This appears to prevent the initial default flows from being programmed on the switches.

2018-10-07T21:49:21,087 | INFO | pool-83-thread-1 | SalRoleServiceImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | SetRole called with input:SetRoleInput [_controllerRole=BECOMESLAVE, _node=NodeRef [_value=KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:97702827111235]]]]}], augmentation=[]]
2018-10-07T21:49:21,088 | INFO | pool-83-thread-1 | SalRoleServiceImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Requesting state change to BECOMESLAVE
2018-10-07T21:49:21,088 | INFO | pool-83-thread-1 | SalRoleServiceImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | RoleChangeTask called on device:openflow:97702827111235 OFPRole:BECOMESLAVE
2018-10-07T21:49:21,088 | INFO | pool-83-thread-1 | RoleService | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | getGenerationIdFromDevice called for device: openflow:97702827111235
2018-10-07T21:49:21,093 | WARN | pool-83-thread-1 | OutboundQueueEntry | 391 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.4 | Can't commit a completed message.
2018-10-07T21:49:21,096 | ERROR | pool-83-thread-1 | RoleService | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | getGenerationIdFromDevice RPC error null
2018-10-07T21:49:21,099 | WARN | pool-83-thread-1 | ContextChainHolderImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Not able to set SLAVE role on device openflow:97702827111235, reason: Was not able to propagate SLAVE role on device. Error: org.opendaylight.openflowplugin.impl.role.RoleChangeException: Can't commit a completed message.
2018-10-07T21:49:21,100 | INFO | pool-83-thread-1 | ReconciliationManagerImpl | 376 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.6.4 | Stopping reconciliation for node Uri [_value=openflow:97702827111235]
2018-10-07T21:49:21,105 | INFO | pool-83-thread-1 | ContextChainImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Closed clustering services registration for node openflow:97702827111235
2018-10-07T21:49:21,106 | INFO | pool-83-thread-1 | GuardedContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Terminating DeviceContextImpl[NEW] service for node openflow:97702827111235
2018-10-07T21:49:21,106 | INFO | pool-83-thread-1 | GuardedContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Terminating RpcContextImpl[NEW] service for node openflow:97702827111235
2018-10-07T21:49:21,107 | INFO | pool-83-thread-1 | GuardedContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Terminating StatisticsContextImpl[NEW] service for node openflow:97702827111235
2018-10-07T21:49:21,107 | INFO | pool-83-thread-1 | StatisticsContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Stopping running statistics gathering for node openflow:97702827111235
2018-10-07T21:49:21,113 | INFO | pool-83-thread-1 | GuardedContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Terminating RoleContextImpl[NEW] service for node openflow:97702827111235
2018-10-07T21:49:21,190 | INFO | epollEventLoopGroup-9-15 | AbstractConnectionAdapter | 391 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.4 | The channel outbound queue size:1024
2018-10-07T21:49:21,191 | INFO | epollEventLoopGroup-9-15 | ConnectionAdapterImpl | 391 - org.opendaylight.openflowplugin.openflowjava.openflow-protocol-impl - 0.6.4 | Hello received
2018-10-07T21:49:21,196 | INFO | epollEventLoopGroup-9-15 | ContextChainHolderImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Device openflow:97702827111235 connected.
2018-10-07T21:49:21,197 | INFO | epollEventLoopGroup-9-15 | ContextChainHolderImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | No context chain found for device: openflow:97702827111235, creating new.
2018-10-07T21:49:21,197 | INFO | epollEventLoopGroup-9-15 | DeviceManagerImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | ConnectionEvent: Device connected to controller, Device:/10.30.170.126:57622, NodeId:Uri [_value=openflow:97702827111235]
2018-10-07T21:49:21,198 | INFO | epollEventLoopGroup-9-15 | RoleContextImpl | 382 - org.opendaylight.openflowplugin.impl - 0.6.4 | Started timer for setting SLAVE role on device openflow:97702827111235 if no role will be set in 20s.
|
| Comments |
| Comment by Jamo Luhrsen [ 12/Oct/18 ] |
|
Looks like we have this here too: Avishnoi, as we continue to dig into our cluster problems, I'm wondering if we have someone in OFP |
| Comment by Vishal Thapar [ 30/Oct/18 ] |
|
Looking at the logs at https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-oxygen/95/ :
1. control-1 is repeatedly connecting to 2 of the ODLs.
2. The control-1 pcap on 6653 shows ODL sending echo request/reply; after it goes 5 seconds without sending an echo request, it sends a FIN. We see 6 such reconnects to each.
3. The same pattern repeats on compute-0, but to different controllers and with about 56 reconnects each.
4. It looks like the ODLs that don't become leaders/masters of that device are disconnecting the devices.
5. We're seeing repeated role change messages in the ODL logs in tune with the disconnects. One of them is the cause, the other is the effect. No idea which is which. |
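Editor's note: the "echo silence for ~5 seconds, then FIN" pattern described above is the kind of behaviour an idle-timeout handler in a Netty pipeline produces when it closes a quiet connection. A minimal sketch follows, assuming a plain Netty handler paired with an IdleStateHandler(5, 0, 0) earlier in the pipeline; the handler name and the 5-second window are illustrative, not the actual openflowjava code.

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;

// Hypothetical handler: closes the connection (which appears as a TCP FIN on
// the wire) when nothing has been read from the peer for the reader-idle
// window configured in an upstream IdleStateHandler. This only illustrates
// how an idle-based close could account for the pcap pattern above.
public class IdleCloseHandler extends ChannelInboundHandlerAdapter {
    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
            ctx.close(); // the close is what shows up as the FIN in the capture
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}
```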
| Comment by Somashekhar Javalagi [ 31/Oct/18 ] |
|
In all the cases, I see this error appearing most of the time a role request is sent to the switch: although the role request is triggered from the plugin, the corresponding OpenFlow message is never sent out on the channel. In a few cases I can see in the Wireshark traces that the role request is sent successfully, and then this error does not appear.
So we have to see why many of the role requests triggered by openflowplugin are never actually sent. For this I have added debug logs in review https://git.opendaylight.org/gerrit/#/c/77372/. I have also triggered the test-openflowplugin-cluster-netvirt CSIT job on this review. shague, is this job going to reproduce the scenario? If not, can you please suggest which job to run on this openflowplugin review? |
| Comment by Jamo Luhrsen [ 31/Oct/18 ] |
|
Somashekhar, yeah, the patch test you ran has this "Was not able to propagate SLAVE" in its karaf.log. I think we can find a faster job if that helps, but it's there. Did your debug logs work? |
| Comment by Somashekhar Javalagi [ 06/Nov/18 ] |
|
Apart from the first time, every SLAVE role proposal to the switch is failing. The proposal first tries to get the generation id by sending a dummy role request. The request may be failing because the old entry in the outbound queue is not reset and is wrongly left marked as completed. I will add changes to reset the outbound queue before the role request is sent and will run the job on the same. |
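Editor's note: a minimal sketch of the failure mode described above, using a hypothetical, simplified queue slot. This is not the real OutboundQueueEntry API; it only illustrates why a stale "completed" flag would reject the next commit (the "Can't commit a completed message" warning) and why resetting the slot before reuse avoids it.

```java
// Illustrative only: a simplified, hypothetical queue slot, not the real
// OutboundQueueEntry from openflowjava.
final class QueueSlot {
    private Object message;    // the OpenFlow message currently occupying this slot
    private boolean completed; // true once a reply (or barrier) finished the slot

    // Commit a new request into this slot. If the slot was used before and never
    // reset, 'completed' is still true and the commit is rejected -- this is the
    // "Can't commit a completed message" situation seen in the logs.
    boolean commit(Object newMessage) {
        if (completed) {
            return false; // stale state left over from the previous request
        }
        this.message = newMessage;
        return true;
    }

    // The change proposed above: explicitly reset the slot before it is reused
    // for the next role request, so the stale 'completed' flag cannot reject it.
    void reset() {
        this.message = null;
        this.completed = false;
    }

    // Called when the device's reply (or a barrier) finishes this slot.
    void complete() {
        this.completed = true;
    }

    Object message() {
        return message;
    }
}
```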
| Comment by Sam Hague [ 09/Nov/18 ] |
|
That has been fixed now: there was an infra issue where the VM images were deleted, and it has since been resolved. I restarted the CSIT on the debug patch. |
| Comment by Somashekhar Javalagi [ 28/Nov/18 ] |
|
Avishnoi, after adding the switch idle handler registration at the end of the TCP channel initialization, I am no longer able to see this issue; the SLAVE role is being set properly. shague, can you please confirm whether the issue is still there with this change by running the CSIT job? |
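Editor's note: for readers unfamiliar with the change being described, here is a minimal sketch of what "registering the idle handler at the end of the TCP channel initialization" looks like in a Netty ChannelInitializer. The class and handler names are illustrative, not the actual openflowjava TcpChannelInitializer; the point is only the ordering, with idle detection added after the other pipeline handlers.

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.logging.LoggingHandler;
import io.netty.handler.timeout.IdleStateHandler;

import java.util.concurrent.TimeUnit;

// Hypothetical initializer illustrating the ordering only: the protocol
// handlers go in first (a LoggingHandler stands in for the real OpenFlow
// codec/handshake handlers), and switch idle detection is registered as the
// last step of channel initialization, once the rest of the pipeline exists.
public class SwitchChannelInitializer extends ChannelInitializer<SocketChannel> {
    private static final long READ_IDLE_MILLIS = 5000; // illustrative value

    @Override
    protected void initChannel(SocketChannel ch) {
        ChannelPipeline pipeline = ch.pipeline();
        pipeline.addLast("protocolHandlers", new LoggingHandler());
        // Idle detection registered last, mirroring the change described above.
        pipeline.addLast("idleHandler",
                new IdleStateHandler(READ_IDLE_MILLIS, 0, 0, TimeUnit.MILLISECONDS));
    }
}
```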
| Comment by Arunprakash D [ 05/Dec/18 ] |
|
jluhrsen / shague, could you please confirm whether the error is gone with the review applied? |
| Comment by Sam Hague [ 06/Dec/18 ] |
|
The infra is down again, so we can't stack the jobs. We need to fix that so we can run the tests for this patch and verify the fix. I am fine with merging the patch if it passes the OFP tests and looks to solve the problem. |
| Comment by Arunprakash D [ 19/Dec/18 ] |
|
Is the issue resolved? We want to make sure before merging the patch. |
| Comment by Jamo Luhrsen [ 19/Dec/18 ] |
|
Arunprakash, shague, so the issue of default flows not being there seems to still be around, and we still see this in the log:

org.opendaylight.openflowplugin.impl - 0.6.5 | Not able to set SLAVE role on device openflow:102142247889978, reason: Was not able to propagate SLAVE role on device. Error: org.opendaylight.openflowplugin.impl.role.RoleChangeException: Can't commit a completed message.

So, potentially, the WARN message is not even related to the default flows missing. |
| Comment by Arunprakash D [ 20/Dec/18 ] |
|
jluhrsen, we are debugging 2 issues here. One is the exception tracked with this Jira, "Not able to set SLAVE role on device". For this we have raised the patch https://git.opendaylight.org/gerrit/#/c/78199/ and wanted to run CSIT for it and confirm that the error is no longer seen before merging the review. The other one is default flows missing, which is tracked by
|
| Comment by Jamo Luhrsen [ 20/Dec/18 ] |
OK, it looks like you already used our patch test to run CSIT. This is the job you triggered; I put a +1 on that patch because of this. Feel free to run patch-test again if needed.
OK, let's let this 1039 be just for the WARN message, and it looks like that might be resolved. Thanks Arunprakash |