[OPNFLWPLUG-920] Cluster member gets in bad state if 2 instances disconnect just after the 3rd is recovered Created: 06/Jul/17  Updated: 27/Sep/21  Resolved: 20/Sep/17

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez Assignee: Luis Gomez
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8805

 Description   

This happens in stable/carbon. To reproduce:

1) Connect single switch to cluster
sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101:6633" "tcp:192.168.0.102:6633" "tcp:192.168.0.103:6633"

2) Wait until all connections get stablished, master and slave after ~20 secs.

3) Break connection Master to switch:
sudo iptables -A INPUT -s 192.168.0.103 -p tcp --sport 6633 -j DROP

4) Wait until new master is elected and then restore the connection:
sudo iptables -D INPUT -s 192.168.0.103 -p tcp --sport 6633 -j DROP

5) Now that connection is waiting for slave break new master and remaining connection:
sudo iptables -A INPUT -s 192.168.0.101 -p tcp --sport 6633 -j DROP
sudo iptables -A INPUT -s 192.168.0.102 -p tcp --sport 6633 -j DROP

6) Following is observed in the remaining node (102), after that only solution is to reboot the node:

2017-07-06 01:06:06,352 | INFO | ofppool-0 | LifecycleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Starting clustering services for node openflow:1
2017-07-06 01:06:06,352 | INFO | ofppool-0 | DeviceContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Starting device context cluster services for node openflow:1
2017-07-06 01:06:10,319 | INFO | entLoopGroup-7-2 | SystemNotificationsListenerImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | ConnectionEvent: Connection closed by device, Device:/192.168.0.24:49423, NodeId:openflow:1
2017-07-06 01:06:10,319 | INFO | entLoopGroup-7-2 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Device openflow:1 disconnected.
2017-07-06 01:06:10,319 | INFO | entLoopGroup-7-2 | LifecycleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Closing clustering services registration for node openflow:1
2017-07-06 01:06:10,323 | WARN | entLoopGroup-7-2 | OF13DeviceInitializer | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Failed to write node openflow:1 to DS because we failed to gather device info.
2017-07-06 01:06:10,323 | ERROR | entLoopGroup-7-2 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:10,324 | WARN | entLoopGroup-7-2 | OF13DeviceInitializer | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Failed to retrieve static node OFPMPMETERFEATURES info: Outbound queue wasn't able to reserve XID.
2017-07-06 01:06:10,324 | ERROR | entLoopGroup-7-2 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:10,324 | WARN | entLoopGroup-7-2 | OF13DeviceInitializer | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Failed to retrieve static node OFPMPGROUPFEATURES info: Outbound queue wasn't able to reserve XID.
2017-07-06 01:06:10,325 | ERROR | entLoopGroup-7-2 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:10,325 | WARN | entLoopGroup-7-2 | OF13DeviceInitializer | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Failed to retrieve static node OFPMPPORTDESC info: Outbound queue wasn't able to reserve XID.
2017-07-06 01:06:10,325 | INFO | entLoopGroup-7-2 | OF13DeviceInitializer | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Static node openflow:1 successfully finished collecting
2017-07-06 01:06:10,325 | INFO | entLoopGroup-7-2 | StatisticsContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Stopping running statistics gathering for node openflow:1
2017-07-06 01:06:10,326 | INFO | ofppool-0 | SalRoleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | SetRole called with input:SetRoleInput [_controllerRole=BECOMEMASTER, _node=NodeRef [_value=KeyedInstanceIdentifier

{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:1]]]]}

], augmentation=[]]
2017-07-06 01:06:10,326 | INFO | ofppool-0 | SalRoleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Device NodeRef [_value=KeyedInstanceIdentifier

{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:1]]]]}

] has been disconnected
2017-07-06 01:06:10,326 | ERROR | ofppool-0 | SalRoleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | SetRoleService set Role BECOMEMASTER for Node: KeyedInstanceIdentifier

{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:1]]]]}

fail . Reason java.lang.Exception: Device connection doesn't exist anymore. Primary connection status : RIP
2017-07-06 01:06:10,326 | WARN | ofppool-0 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Not able to set MASTER role on device openflow:1, reason: Was not able to set MASTER role on device
2017-07-06 01:06:10,332 | INFO | ult-dispatcher-5 | LifecycleServiceImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Closing clustering services for node openflow:1
2017-07-06 01:06:10,334 | INFO | ofppool-0 | StatisticsContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Starting statistics context cluster services for node openflow:1
2017-07-06 01:06:10,335 | WARN | ofppool-0 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Not able to set MASTER role on device openflow:1, reason: Initial gathering statistics unsuccessful.
2017-07-06 01:06:10,335 | INFO | ofppool-0 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Starting rpc context cluster services for node openflow:1
2017-07-06 01:06:10,336 | INFO | lt-dispatcher-29 | ContextChainHolderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Removing device Uri [_value=openflow:1] from operational DS
2017-07-06 01:06:11,380 | ERROR | n-dispatcher-590 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:11,381 | WARN | n-dispatcher-590 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:11,380 | ERROR | n-dispatcher-558 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:11,381 | WARN | n-dispatcher-558 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:11,382 | ERROR | n-dispatcher-589 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:11,384 | WARN | n-dispatcher-589 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:11,384 | ERROR | n-dispatcher-589 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:11,384 | WARN | n-dispatcher-589 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:12,377 | ERROR | ult-dispatcher-3 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:12,378 | WARN | ult-dispatcher-3 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:14,901 | ERROR | pool-26-thread-1 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:14,902 | WARN | pool-26-thread-1 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1
2017-07-06 01:06:14,902 | ERROR | pool-26-thread-1 | OutboundQueueProviderImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | No queue present, failing request
2017-07-06 01:06:14,902 | WARN | pool-26-thread-1 | RpcContextImpl | 220 - org.opendaylight.openflowplugin.impl - 0.4.1.SNAPSHOT | Xid cannot be reserved for new RequestContext, node:openflow:1



 Comments   
Comment by Tomas Slusny [ 10/Jul/17 ]

Can you try it again on this patch Luis? https://git.opendaylight.org/gerrit/#/c/60075/

I was not able to reproduce the error on that patch.

Comment by Tomas Slusny [ 10/Jul/17 ]

Here is same patch for stable/carbon: https://git.opendaylight.org/gerrit/#/c/60144/

Comment by Luis Gomez [ 10/Jul/17 ]

Lowering priority to critical as this is kind of corner case.

Comment by Thanh Ha (zxiiro) [ 11/Jul/17 ]

(In reply to Luis Gomez from comment #3)
> Lowering priority to critical as this is kind of corner case.

Do we need a respin for this patch then? if not we should remove it from blocking bugs on the tracking spreadsheet.

Comment by Abhijit Kumbhare [ 18/Sep/17 ]

Can we confirm this is OK? (We think this should be OK now)

Comment by Luis Gomez [ 20/Sep/17 ]

Yes, I confirm this is fixed.

Generated at Wed Feb 07 20:33:44 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.