[OPNFLWPLUG-1086] Reconciliation framework failure when starting cbench tool for the first time Created: 25/Feb/20  Updated: 27/Feb/20  Resolved: 27/Feb/20

Status: Verified
Project: OpenFlowPlugin
Component/s: reconciliation-framework
Affects Version/s: None
Fix Version/s: Magnesium

Type: Bug Priority: Highest
Reporter: Jamo Luhrsen Assignee: Gobinath Suganthan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File karaf.log    
Issue Links:
Blocks
blocks INTTEST-72 openflowplugin-csit-1node-cbench-only... Verified

 Description   

this issue was noticed because an old CSIT job using cbench had a consistent failure that is not
seen in sodium (technically a regression in that sense)

The cbench tool fails on starting with a "closed connection ... exiting" message.

snippet from this karaf.log :

2020-02-22T03:05:59,721 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-6 | GuardedContextImpl               | 310 - org.opendaylight.openflowplugin.impl - 0.10.0.SNAPSHOT | Starting RpcContextImpl[NEW] service for node openflow:0
2020-02-22T03:05:59,721 | WARN  | ofppool-7        | ContextChainHolderImpl           | 310 - org.opendaylight.openflowplugin.impl - 0.10.0.SNAPSHOT | Reconciliation framework failure.
2020-02-22T03:05:59,722 | INFO  | ofppool-7        | ReconciliationManagerImpl        | 302 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.10.0.SNAPSHOT | Stopping reconciliation for node Uri{_value=openflow:10}
2020-02-22T03:05:59,726 | WARN  | ofppool-0        | ContextChainHolderImpl           | 310 - org.opendaylight.openflowplugin.impl - 0.10.0.SNAPSHOT | Reconciliation framework failure.
2020-02-22T03:05:59,726 | INFO  | ofppool-0        | ReconciliationManagerImpl        | 302 - org.opendaylight.openflowplugin.applications.reconciliation-framework - 0.10.0.SNAPSHOT | Stopping reconciliation for node Uri{_value=openflow:5}
2020-02-22T03:05:59,738 | INFO  | ListenableFutureAdapter-thread-0 | ContextChainHolderImpl           | 310 - org.opendaylight.openflowplugin.impl - 0.10.0.SNAPSHOT | Device openflow:3 connection is enabled by reconciliation framework.

Only the first try fails though, and additional cbench tests are ok.

We should figure out the root cause of the failure before we talk about the few workarounds
I know we can do:

  1. start a dummy cbench test first to get past the initial failure
  2. remove the cbench tests altogether if they no longer provide any value, although I do check from time to time the graphs that those jobs produce


 Comments   
Comment by Jamo Luhrsen [ 25/Feb/20 ]

Attached is karaf.log from a fresh boot with TRACE enabled for 'org.opendaylight.openflowplugin.impl'

the cbench command used was:

cbench -c 10.1.154.174 -m 3000 -M 1 -s 2 -l 2 -D 3000 -p 6633

notice the -s 2 for a switch count of 2. The CSIT job uses much more, but I was trying to reduce the number
of variables involved and when I only used a single (-s 1) switch the issue did not show up. Increasing it to
2 seems to hit it every time. hopefully that is a worthwhile clue.

Comment by Arunprakash D [ 26/Feb/20 ]

gobinath@ericsson.com will get back on the analysis by tomorrow

Comment by Gobinath Suganthan [ 26/Feb/20 ]

jluhrsen I want to add some logging in openflowplugin project and re-run this cbench test for debugging. Any way I could start this job on a openflowplugin review?

Comment by Gobinath Suganthan [ 26/Feb/20 ]

Also I could see the following errors in the karaf log:

"2020-02-11T12:46:26,913 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-37 | RoleContextImpl | 310 - org.opendaylight.openflowplugin.impl - 0.10.0 | Device: openflow:6 with version: 1 does not support role BECOMEMASTER"

What is the openflow version used by the switches in cbench?

Comment by Jamo Luhrsen [ 26/Feb/20 ]

Once this patch is merged, you can add the comment:

test-openflowplugin-cbench

to the patch you are adding debug logs to and it should give you what you want. You'll want to make sure your
debugs are all INFO level though. If you want a job that runs with higher debug levels for some module, we can
do that to, but it'll be a more manual process that I can do in the sandbox.

Comment by Jamo Luhrsen [ 26/Feb/20 ]

yes, IIRC, cbench got stuck on OF1.1 and never made it to 1.3.

Comment by Jamo Luhrsen [ 26/Feb/20 ]

thanks.

Comment by Gobinath Suganthan [ 27/Feb/20 ]

thanks a lot jluhrsen

Comment by Gobinath Suganthan [ 27/Feb/20 ]

I have raised a review & triggered the job but the job seem to be waiting with no available executor for too long.

https://git.opendaylight.org/gerrit/c/openflowplugin/+/88084

Job link:

https://jenkins.opendaylight.org/releng/job/openflowplugin-patch-test-cbench-magnesium/1/

Alternately could you please send the steps to set up cbench in local. (couldn't find complete working steps anywhere)

 

Comment by Gobinath Suganthan [ 27/Feb/20 ]

I had added some logging and rerun the tests. From the logs, it was found that the "flownodereconciliation" service was throwing an uncaught exception (ConcurrentModificationException) which had resulted in the reconciliation framework.

 

Logs:

2020-02-27T11:48:31,071 | ERROR | ofppool-2 | ContextChainHolderImpl | 310 - org.opendaylight.openflowplugin.impl - 0.10.0.SNAPSHOT | Reconciliation framework failure for device openflow:1 with error
java.util.ConcurrentModificationException: null
at java.util.HashMap.computeIfAbsent(HashMap.java:1134) ~[?:?]
at org.opendaylight.openflowplugin.applications.frm.impl.FlowNodeReconciliationImpl.startReconciliation(FlowNodeReconciliationImpl.java:302) ~[?:?]
at org.opendaylight.openflowplugin.applications.reconciliation.impl.ReconciliationManagerImpl.lambda$reconcileServices$4(ReconciliationManagerImpl.java:135) ~[?:?] 

 

I have fixed this and added some logs to debug similar issues.

https://git.opendaylight.org/gerrit/c/openflowplugin/+/88084

Note:

The "flownodereconciliation" task completes fast when the switch connects initially (no flows are present in inventory config for the node). So the race condition would be more probable in this case and we see might be seeing the cbench failures only during initial connection.

 

Comment by Jamo Luhrsen [ 27/Feb/20 ]

Great debugging and thanks for the quick fix, gobinath. the patch test job is passing now and I have merged the stable/magnesium patch.

does this need to be cherry-picked to sodium?

I assume it should be in master, so I already made that cherry pick

Comment by Gobinath Suganthan [ 27/Feb/20 ]

Yes. This has to be cherry-picked to Sodium too. Cherry-picked here https://git.opendaylight.org/gerrit/c/openflowplugin/+/87965

It is surprising that the cbench is passing in Sodium however.

Comment by Jamo Luhrsen [ 27/Feb/20 ]

actually, now that I think about it, I think I was able to reproduce this locally with sodium. Just didn't spend much time double checking. Maybe upstream CSIT is slightly slow enough in sodium + that shared cloud environment compared to my local laptop that the race condition isn't hit.

Generated at Wed Feb 07 20:34:09 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.