[OPNFLWPLUG-758] Li plugin: Controller rejects switch connections Created: 26/Aug/16  Updated: 27/Sep/21  Resolved: 08/Sep/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Luis Gomez
Assignee: Jozef Bacigal
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 6554

 Description   

As this test shows:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-boron/

The list of candidates varies between 1, 2 and 3 entries even though the switch connects to only 1 or 2 members.
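
A quick way to inspect this is to dump the entity-owners operational data on each member. A minimal sketch, assuming the default RESTCONF port and credentials (8181, admin:admin) and example member addresses; adjust both to the actual CSIT setup:

#!/bin/bash
# Dump the entity-owners view of every cluster member.
# Member IPs below are examples only; replace them with the real ones.
for member in 192.168.0.101 192.168.0.102 192.168.0.103; do
  echo "=== entity-owners as seen by ${member} ==="
  curl -s -u admin:admin \
    "http://${member}:8181/restconf/operational/entity-owners:entity-owners" | python -m json.tool
done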

BR/Luis



 Comments   
Comment by Luis Gomez [ 30/Aug/16 ]

I changed the title of the bug because I think the entity-owner instability is caused by a quick switch disconnect + connect transition. Looking at this log:

2016-08-30 08:47:41,330 | INFO | entLoopGroup-5-1 | SystemNotificationsListenerImpl | 207 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | ConnectionEvent: Connection closed by device, Device:/192.168.0.20:56480, NodeId:null
2016-08-30 08:47:41,447 | INFO | entLoopGroup-5-2 | ConnectionAdapterImpl | 197 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.0.Boron-RC1 | Hello received / branch
2016-08-30 08:47:41,449 | WARN | entLoopGroup-5-2 | DeviceManagerImpl | 207 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Node org.opendaylight.openflowplugin.impl.connection.ConnectionContextImpl$DeviceInfoImpl@a774dfb8 already connected disconnecting device. Rejecting connection
2016-08-30 08:47:41,449 | WARN | entLoopGroup-5-2 | DeviceManagerImpl | 207 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Node openflow:1 context state not in TERMINATION state.
2016-08-30 08:47:41,449 | INFO | entLoopGroup-5-2 | ConnectionContextImpl | 207 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Unregister outbound queue successful.

A hello packet is received before the device is fully removed from the controller. The hello packet interrupts the device unregistration and, at the same time, the switch connection gets rejected because the device is still registered. This produces an unstable switch connection.
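
The sequence can be confirmed directly in the controller log: the "Connection closed by device", "Hello received" and "Rejecting connection" / "not in TERMINATION state" entries all land within a fraction of a second. A minimal sketch for spotting it, assuming the default Karaf log location inside the distribution; adjust the path as needed:

#!/bin/bash
# Filter the karaf log for the disconnect/reconnect race described above.
# The log path is the default one in the ODL distribution (an assumption here).
grep -E "Connection closed by device|Hello received|Rejecting connection|TERMINATION state" \
  data/log/karaf.log | tail -n 20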

Comment by Luis Gomez [ 31/Aug/16 ]

This same issue is seen often during the cluster test, and the result is always an unstable controller with stale entity-owner, inventory and topology entries for a switch that in reality no longer exists.

2016-08-31 00:21:20,553 | INFO | entLoopGroup-7-4 | ConnectionAdapterImpl | 270 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.0.Boron-RC1 | Hello received / branch
2016-08-31 00:21:20,559 | WARN | entLoopGroup-7-4 | DeviceManagerImpl | 280 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Node org.opendaylight.openflowplugin.impl.connection.ConnectionContextImpl$DeviceInfoImpl@1f8d8b33 already connected disconnecting device. Rejecting connection
2016-08-31 00:21:20,559 | INFO | entLoopGroup-7-4 | ConnectionContextImpl | 280 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Unregister outbound queue successful.
2016-08-31 00:21:22,554 | INFO | entLoopGroup-7-1 | ConnectionAdapterImpl | 270 - org.opendaylight.openflowjava.openflow-protocol-impl - 0.8.0.Boron-RC1 | Hello received / branch
2016-08-31 00:21:22,558 | WARN | entLoopGroup-7-1 | DeviceManagerImpl | 280 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Node org.opendaylight.openflowplugin.impl.connection.ConnectionContextImpl$DeviceInfoImpl@1f8d8b33 already connected disconnecting device. Rejecting connection
2016-08-31 00:21:22,559 | INFO | entLoopGroup-7-1 | ConnectionContextImpl | 280 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Unregister outbound queue successful.
2016-08-31 00:21:22,722 | WARN | pool-31-thread-1 | StatisticsManagerImpl | 280 - org.opendaylight.openflowplugin.impl - 0.3.0.Boron-RC1 | Statistics gathering for single node was not successful: Device connection doesn't exist anymore. Primary connection status : RIP

Comment by Luis Gomez [ 31/Aug/16 ]

Raising to blocker; we cannot afford to have an unstable OpenFlow cluster in Boron.

Comment by Luis Gomez [ 01/Sep/16 ]

The easiest way to reproduce this issue is to stop and start mininet with no delay a few times. After a while you will see stale entries in entity-owner, topology and inventory.
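
A minimal reproduction sketch along these lines, assuming a remote controller at 192.168.0.101 (any cluster member) and an arbitrary small topology; both are assumptions, adjust to the actual setup:

#!/bin/bash
# Start and stop mininet repeatedly with no delay between iterations.
# Controller IP and topology are examples only.
for i in $(seq 1 5); do
  sudo mn --controller=remote,ip=192.168.0.101 --topo linear,3 --test pingall
  sudo mn -c
done

After the loop, check entity-owner, topology and inventory for stale entries.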

Comment by Luis Gomez [ 01/Sep/16 ]

The ERRORs described in this bug show up very often in two scenarios:

1) OpenFlow Cluster test:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-boron/

That is why the test suite is not stable.

2) Switch scalability test single node:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-boron/plot/Inventory%20Scalability/

That is why we do not see good switch scalability.

In general this issue is more reproducible:

  • In a cluster environment
  • When flapping switch connections
  • In switch scalability scenarios

Comment by A H [ 02/Sep/16 ]

Is there an ETA for this bug, and is someone assigned to fix it?

Comment by Colin Dixon [ 02/Sep/16 ]

Do we think this bug is related to (or caused by) BUG-6540?

Comment by Luis Gomez [ 02/Sep/16 ]

No, this bug occurs without cluster member isolation and even in the single-instance test. The bug you point to seems to be related to https://bugs.opendaylight.org/show_bug.cgi?id=6177, which is not a blocker (major bug).

Comment by Luis Gomez [ 05/Sep/16 ]

I also see this ERROR very often prior to the device reject:

2016-09-03 20:03:06,703 | WARN | pool-32-thread-1 | RoleContextImpl | 183 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | New role BECOMESLAVE was not propagated to device openflow:160 during 10 sec
2016-09-03 20:03:06,703 | ERROR | pool-32-thread-1 | SalRoleServiceImpl | 183 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | SetRoleService set Role BECOMESLAVE for Node: KeyedInstanceIdentifier{targetType=interface org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node, path=[org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.Nodes, org.opendaylight.yang.gen.v1.urn.opendaylight.inventory.rev130819.nodes.Node[key=NodeKey [_id=Uri [_value=openflow:160]]]]} fail . Reason java.util.concurrent.CancellationException: Task was cancelled.
2016-09-03 20:03:06,703 | WARN | pool-32-thread-1 | RoleManagerImpl | 183 - org.opendaylight.openflowplugin.impl - 0.3.0.SNAPSHOT | Was not able to set role SLAVE to device on node openflow:160

The above ERROR appears quite often in single-node switch scalability and longevity tests.

In the cluster test, there is an easy way to reproduce the above:

1) Start mininet with 1 switch (s1) pointing to any of the instances (e.g. 192.168.0.101)
2) Run the following script on the mininet system to bounce the connection quickly:

#!/bin/bash
sudo ovs-vsctl del-controller s1
sleep 0.2
sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101"
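
To keep the connection flapping for a while, the script above can be wrapped in a loop; the iteration count and the extra sleep below are arbitrary choices for illustration:

#!/bin/bash
# Repeatedly bounce the s1 controller connection (values are illustrative).
for i in $(seq 1 20); do
  sudo ovs-vsctl del-controller s1
  sleep 0.2
  sudo ovs-vsctl set-controller s1 "tcp:192.168.0.101"
  sleep 1
done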

Comment by Jozef Bacigal [ 06/Sep/16 ]

https://git.opendaylight.org/gerrit/#/c/44664/6

Comment by A H [ 06/Sep/16 ]

To better assess the impact of this bug and fix, could someone from your team please help us identify the following:
Severity: Could you elaborate on the severity of this bug? Is this a BLOCKER such that we cannot release Boron without it? Is there a workaround such that we can write a release note and fix it in a future Boron SR1?
Testing: Could you also elaborate on the testing of this patch? How extensively has this patch been tested? Is it covered by any unit tests or system tests?
Impact: Does this fix impact any dependent projects?

Comment by Luis Gomez [ 06/Sep/16 ]

After this merge I see a more stable cluster test, but still some sporadic issues in the switch scalability test:

https://logs.opendaylight.org/releng/jenkins092/openflowplugin-csit-1node-periodic-scalability-daily-only-boron/242/archives/karaf.log.gz

So far we cannot close this issue, but we can maybe reduce its importance if the cluster behaves more stably (we need a few more runs to confirm).

BR/Luis

Comment by Luis Gomez [ 07/Sep/16 ]

OK, after a day of testing I downgraded the importance from Blocker to Major. The reason is that the cluster test seems stable now:

https://jenkins.opendaylight.org/releng/view/CSIT-3node/job/openflowplugin-csit-3node-clustering-only-boron/

The switch scalability test, even though it still shows the issue sometimes, has stabilized at the max switch number:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-periodic-scalability-daily-only-boron/plot/Inventory%20Scalability/

BR/Luis

Comment by Colin Dixon [ 07/Sep/16 ]

So, can we note the maximum number of switch connections somewhere? I guess that would be part of the performance report, which we'll hopefully have time to update...

Comment by A H [ 08/Sep/16 ]

Has this bug been verified as fixed in the latest Boron RC 3.1 Build?

Comment by Jozef Bacigal [ 08/Sep/16 ]

Closing this as solved and creating a new one for the switch scalability test.

Jozef
