[OPNFLWPLUG-607] [Clustering]: Unrecoverable cluster flow provisioning failure with 30 switches, 1000 flows/switch. (Tried He design only) Created: 27/Jan/16  Updated: 27/Sep/21  Resolved: 16/Feb/16

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Saibal Roy Assignee: Anil Vishnoi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Attachments: Zip Archive logs.zip     Zip Archive logs.zip    
External issue ID: 5114
Priority: Highest

 Description   

Build used :
===================
Karaf distro from latest ODL Beryllium code

Unrecoverable flow provisioning failure in ODL cluster with 30 switches and 1000 flows per switch.

Test Type :
===================
Flow provisioning fails after connecting 30 switches across 3 cluster nodes, with 1000 flows per switch (30,000 flows in total).

Test setup :
====================
Build used: Beryllium
OF-Plugin Used: Helium
OF-HA used: NO

Objective of test :
===================
Verify whether flow provisioning is stable when we scale the number of switches per controller.

Test Steps :
============
1. Bring up a healthy 3-node cluster, say c1, c2 and c3.
2. Bring up mininet with the commands listed under Commands below and connect it to controllers c1 (10.183.181.41), c2 (10.183.181.42) and c3 (10.183.181.43). c2 is the leader of inventory-config-shard.
Each controller is connected to 10 switches, so there are 30 switches across the cluster overall.
3. Provision 30000 flows from c1 (a follower).

Note: We provision flows via the binding-aware API of the OpenFlow inventory model.
4. Check whether 30000 flows (1000 flows per switch) have been provisioned across the 30 switches.

Commands
========
To connect 10 switches per controller, I run a mininet command with a custom topology once for each controller (c1, c2, c3):
sudo mn --custom /home/mininet/mininet/custom/mytopo.py --topo mytopo --controller remote,ip=10.183.181.41 --switch ovsk,protocols=OpenFlow13
sudo mn --custom /home/mininet/mininet/custom/mytopo.py --topo mytopo --controller remote,ip=10.183.181.42 --switch ovsk,protocols=OpenFlow13
sudo mn --custom /home/mininet/mininet/custom/mytopo.py --topo mytopo --controller remote,ip=10.183.181.43 --switch ovsk,protocols=OpenFlow13

Note: mytopo.py is attached for quick reference. Please modify mytopo.py as needed.

We use the command below to check the number of flows provisioned per controller.
dpctl dump-aggregate -O OpenFlow13

Test Results:
=============
1. I could not see 30000 flows (1000 flows per switch), and I got the exception below (the full stack trace is in the logs):
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [ActorSelection[Anchor(akka.tcp://opendaylight-cluster-data@10.183.181.41:2550/), Path(/user/shardmanager-operational/member-1-shard-inventory-operational/shard-member-2-chn-157-txn-1#1338207586)]] after [5000 ms]

2. After the above condition, no further flow provisioning via the config DS works, and the system requires a complete cluster reboot to restore normalcy.

Attaching all the karaf logs of 3 controller nodes.
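
When going through the attached karaf logs, it can help to pull the shard name and timeout out of each AskTimeoutException line. The following is a minimal sketch, assuming the message format is exactly the one quoted above; the function names and the path layout (user / shard manager / shard / transaction actor) are assumptions made for illustration and are not part of any ODL tool.

```python
import re
from collections import Counter

# Matches the AskTimeoutException message format quoted in this report.
ASK_TIMEOUT_RE = re.compile(
    r"AskTimeoutException: Ask timed out on \[ActorSelection\[Anchor\((?P<anchor>[^)]*)\), "
    r"Path\((?P<path>[^)]*)\)\]\] after \[(?P<timeout_ms>\d+) ms\]"
)


def parse_ask_timeout(line):
    """Return details of one AskTimeoutException log line, or None if no match."""
    m = ASK_TIMEOUT_RE.search(line)
    if m is None:
        return None
    parts = m.group("path").strip("/").split("/")
    # Assumed path layout: user/<shard manager>/<shard>/<transaction actor>
    return {
        "anchor": m.group("anchor"),
        "shard_manager": parts[1] if len(parts) > 1 else None,
        "shard": parts[2] if len(parts) > 2 else None,
        "timeout_ms": int(m.group("timeout_ms")),
    }


def count_timeouts_by_shard(lines):
    """Count AskTimeoutException occurrences per shard over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        info = parse_ask_timeout(line)
        if info is not None:
            counts[info["shard"]] += 1
    return counts
```

For example, count_timeouts_by_shard(open("karaf.log")) would give a per-shard tally, where "karaf.log" is a placeholder for one of the extracted log files.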

Thanks & Regards,
Saibal Roy.



 Comments   
Comment by Saibal Roy [ 27/Jan/16 ]

Attachment logs.zip has been added with description: Unrecoverable flow provisioning failure in ODL cluster with 30 switches and 1000 flows per switch

Comment by Muthukumaran Kothandaraman [ 27/Jan/16 ]

The binding-aware stub app follows the typical ODL pattern of
Config DS update --> FRM --> OpenflowPlugin --> Flow provisioning --> Stats --> Oper DS Update.

Statistics collection was enabled, as the evaluation is for "as is" functionality.

Comment by Muthukumaran Kothandaraman [ 27/Jan/16 ]

From the logs, it appears that the AskTimeoutException originates from extensive Oper DS updates from the plugin. Eventually, this can start having a ripple effect on Config DS updates too.

Comment by Saibal Roy [ 01/Feb/16 ]

Hi,

I connected 30 switches (10 switches per controller) and then pushed 750 flows per switch from follower c2. Note that my leader is c1.
I could see 30*750=22500 flows across the cluster.

When I increased the count to 850 flows per switch and provisioned from follower c2, I was not able to see all the flows on the switches across the cluster. It gives AskTimeoutException.

Thanks & Regards,
Saibal Roy.

Comment by Abhijit Kumbhare [ 01/Feb/16 ]

Muthu will provide an update on this by Feb 3 or 4.

Comment by Saibal Roy [ 03/Feb/16 ]

Hi,

In continuation of further troubleshooting, the following exercise was carried out.

Objective:
==========
To identify whether the AskTimeoutException surfaces only when both Config DS and Oper DS are being updated concurrently during flow provisioning.

I did 2 flavors of testing with 10 switches per controller (30 switches in total), pushing 1000 flows per switch. Below are my observations.

Test Case 1:
========
1. c1, c2 and c3 are the controllers, where c1 is the leader.
2. Connected 10 switches per controller (30 switches in total).
3. Provisioned 1000 flows per switch (total 30*1000=30000 flows) from follower c2.
4. About 760 flows per switch were provisioned, and then I got the AskTimeoutException.

Test Case 2:
========
1. c1, c2 and c3 are the controllers, where c1 is the leader.
2. Provisioned 1000 flows per switch (total 30*1000=30000 flows) from follower c2.
3. I could see 30000 flows in the config DS.
4. Then I connected 10 switches per controller (30 switches in total).
5. I could see 1000 flows per switch (30000 flows in total).

Observation :
=============
It appears that standard flow provisioning (which updates Config DS and Oper DS concurrently) results in the AskTimeoutException, whereas in the reconciliation case the Config DS update (30K flows) and the Oper DS updates (flow updates based on statistics) are staggered, so we perhaps do not see this issue.

In Test Case 1, pushing 30K flows actually results in 2x the transactions at the same time, whereas in Test Case 2 the 30K flows are first pushed to the config DS and only then are the switches connected, at which point another 30K transactions are pushed. Is this the reason for the failure of Test Case 1?

Thanks & Regards,
Saibal Roy.

Comment by Saibal Roy [ 03/Feb/16 ]

Hi,
Note: We are using the latest build and the bug still exists, as Test Case 1 fails to provision 30K flows across the cluster.

Thanks & Regards,
Saibal Roy

Comment by Anil Vishnoi [ 11/Feb/16 ]

Hi Saibal,

I tried the same test with the latest stable/beryllium and things look good to me.

The patch https://git.opendaylight.org/gerrit/#/c/34115/ gets rid of the routed RPC use and uses local RPC registration instead. I didn't install an equal number of flows on each switch, but the distribution is approximately even. This is what I did:

1) Started a 3-node cluster (used feature: odl-openflowplugin-flow-service-rest)
2) Connected 10 switches to each controller
3) Installed 50K flows across all the connected switches using the flow_config_blaster.py script from the integration/test repo.
sudo ./flow_config_blaster.py --host 10.0.0.3 --threads 5 --flows 10000 --auth --no-delete --fpr 200 --nodes 30

4) Flow installation was done through the inventory-config shard follower.
5) I can see that all 50K flows were installed successfully. The distribution of flows across the switches is as follows:

# ./get-total-found.sh
    Switch s1: 1600 flows
    Switch s2: 800 flows
    Switch s3: 2200 flows
    Switch s4: 2200 flows
    Switch s5: 1200 flows
    Switch s6: 1800 flows
    Switch s7: 1400 flows
    Switch s8: 1600 flows
    Switch s9: 1800 flows
    Switch s10: 600 flows
    Switch s11: 2200 flows
    Switch s12: 1400 flows
    Switch s13: 1200 flows
    Switch s14: 2200 flows
    Switch s15: 2200 flows
    Switch s16: 1400 flows
    Switch s17: 2000 flows
    Switch s18: 1000 flows
    Switch s19: 1800 flows
    Switch s20: 2400 flows
    Switch s21: 2000 flows
    Switch s22: 2800 flows
    Switch s23: 1800 flows
    Switch s24: 1400 flows
    Switch s25: 1200 flows
    Switch s26: 2000 flows
    Switch s27: 1200 flows
    Switch s28: 1800 flows
    Switch s29: 1000 flows
    Switch s30: 1800 flows

Total: 50000
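
The distribution above can be checked mechanically. The following is a minimal sketch, assuming the listing keeps the "Switch sN: X flows" line format shown above; the function names are hypothetical and not part of get-total-found.sh or the integration/test tools.

```python
import re
from statistics import mean

# One line per switch, e.g. "Switch s22: 2800 flows".
FLOW_LINE_RE = re.compile(r"Switch (s\d+): (\d+) flows")


def parse_distribution(text):
    """Map switch name -> flow count from 'Switch sN: X flows' lines."""
    return {name: int(count) for name, count in FLOW_LINE_RE.findall(text)}


def summarize(distribution):
    """Return basic statistics for a switch -> flow count mapping."""
    counts = list(distribution.values())
    return {
        "switches": len(counts),
        "total": sum(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
    }


if __name__ == "__main__":
    # Three lines copied from the listing above, used only as a small demo input.
    sample = "Switch s1: 1600 flows\nSwitch s2: 800 flows\nSwitch s3: 2200 flows\n"
    print(summarize(parse_distribution(sample)))
```

Run over the full 30-line listing, the total comes to 50000, with per-switch counts ranging from 600 (s10) to 2800 (s22).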

Can you please test it again in your environment and update the bug?

Comment by Saibal Roy [ 11/Feb/16 ]

Attachment logs.zip has been added with description: logs

Comment by Saibal Roy [ 11/Feb/16 ]

Hi Anil,
In continuation of further testing, the following exercise was carried out.

1. c1, c2 and c3 are the controllers, where c2 is the leader.
2. Connected 10 switches per controller (30 switches in total).
3. Provisioned 1000 flows per switch (total 30*1000=30000 flows) from follower c1.

I could see 30000 flows in the config DS, but not 1000 flows on every switch.
A total of 29953 flows were provisioned on the switches across the cluster.

root@mininet-vm:/home/mininet/integration/test/tools/odl-mdsal-clustering-tests/clustering-performance-test# ./inventory_crawler.py --auth --host 10.183.181.42 --datastore config
Crawling 'http://10.183.181.42:8181/restconf/config/opendaylight-inventory:nodes'

Totals:
Nodes: 30
Reported flows: 0
Found flows: 30000

root@mininet-vm:/home/mininet/integration/test/tools/odl-mdsal-clustering-tests/clustering-performance-test# ./inventory_crawler.py --auth --host 10.183.181.42 --datastore operational
Crawling 'http://10.183.181.42:8181/restconf/operational/opendaylight-inventory:nodes'

Totals:
Nodes: 30
Reported flows: 30163
Found flows: 29953

mininet> dpctl dump-aggregate -O OpenFlow13

*** s1 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997
*** s2 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997
*** s3 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s4 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=999
*** s5 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997
*** s6 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=999
*** s7 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=998
*** s8 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=998
*** s9 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=998
*** s10 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=999

mininet> dpctl dump-aggregate -O OpenFlow13

*** s11 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s12 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s13 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s14 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s15 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s16 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s17 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s18 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s19 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s20 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000

mininet> dpctl dump-aggregate -O OpenFlow13

*** s21 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s22 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=1000
*** s23 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=996
*** s24 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=996
*** s25 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=996
*** s26 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997
*** s27 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=996
*** s28 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=996
*** s29 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997
*** s30 ------------------------------------------------------------------------
OFPST_AGGREGATE reply (OF1.3) (xid=0x2): packet_count=0 byte_count=0 flow_count=997

The above shows that the bug still exists, as we are not able to see 1000 flows on every switch.

There has been a definite improvement in the flow provisioning behaviour, as we are now able to push 29953 of the 30000 flows to the switches.
Thanks for your help!!

Attaching the logs.
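
The per-switch counts in the dpctl dump-aggregate output above can be tallied with a small script instead of by eye. This is a minimal sketch, assuming each switch header line starts with "***" (or a bullet character, as in some renderings of this report) followed by the switch name and a run of dashes; the function names and the expected count of 1000 are illustrative assumptions.

```python
import re

# Header line per switch, e.g. "*** s1 -----" (a bullet prefix is also accepted).
SWITCH_RE = re.compile(r"^\s*(?:\*\*\*|•)\s+(s\d+)\s+-+\s*$")
# Aggregate reply line carries the flow count for the preceding switch header.
COUNT_RE = re.compile(r"flow_count=(\d+)")


def tally_flow_counts(output):
    """Map switch name -> flow_count parsed from dpctl dump-aggregate output."""
    counts = {}
    current = None
    for line in output.splitlines():
        header = SWITCH_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = COUNT_RE.search(line)
        if found and current is not None:
            counts[current] = int(found.group(1))
    return counts


def shortfalls(counts, expected=1000):
    """Return switch -> number of missing flows for switches below the expected count."""
    return {sw: expected - n for sw, n in counts.items() if n < expected}
```

Summing the 30 values listed above gives 29953, which matches the "Found flows" figure reported by the crawler for the operational datastore.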

Thanks & Regards,
Saibal Roy.

Comment by Anil Vishnoi [ 11/Feb/16 ]

Hi Saibal,

How long do you wait before dumping all the flows on the switches? Also, what is the size of the VM that is running the controller? I just want to make sure that the controller has enough computing power. Can you give some more details about how your application writes flows?

I tested just now with 50K flows and things are working fine; all the flows are installed properly.

Comment by Saibal Roy [ 12/Feb/16 ]

Hi Anil,
Below are the details.

1. How long do you wait before dumping all the flows in the switches?

Once I connect the switches and confirm that all of them are up on their respective controllers, I wait for a minute or so and then push 30K flows.

Once I have pushed 30K flows from the follower, I check how many flows were provisioned in the config DS, then on the switches, and then in the Oper DS.

Once all the flows (30K) are provisioned in the config DS, I keep checking for the next 30-35 minutes, so that all the flows have time to be provisioned in the Oper DS, and I check the switches as well.

2. Also, what's the size of your VM that's running the controller?

Each VM is of 20GB size.
We are using 6 vCPUs for each controller.
The heap size allotted is 8GB.

3. Can you give some more details about how your application is writing flows?

The application used for writing flows pushes the flows directly to the config DS instead of going through RESTCONF.
One thread per dpn is used and the flows are pushed sequentially in each thread.
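
The write pattern described here (one thread per switch, flows written one after another within each thread) can be modelled as follows. This is only an illustrative Python sketch: the actual application is not shown in this report, and InMemoryStore, push_flows, provision and the "openflow:N" node ids are hypothetical stand-ins, not ODL APIs.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the config datastore; not an ODL API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.flows = {}

    def write(self, node_id, flow_id):
        # Serialize access so concurrent writer threads do not corrupt the map.
        with self._lock:
            self.flows.setdefault(node_id, []).append(flow_id)


def push_flows(node_id, flows_per_node, write_flow):
    """Write the flows for one switch sequentially, as a single worker thread."""
    for flow_id in range(1, flows_per_node + 1):
        write_flow(node_id, flow_id)


def provision(node_ids, flows_per_node, write_flow):
    """Start one writer thread per switch and wait for all of them to finish."""
    threads = [
        threading.Thread(target=push_flows, args=(node_id, flows_per_node, write_flow))
        for node_id in node_ids
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    store = InMemoryStore()
    nodes = ["openflow:%d" % i for i in range(1, 31)]  # 30 switches, names assumed
    provision(nodes, 1000, store.write)
    print(sum(len(flows) for flows in store.flows.values()))
```

With 30 switches and 1000 flows each, this issues 30 concurrent streams of sequential writes, which mirrors the load shape of the test but none of the datastore behaviour.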

Comment by Muthukumaran Kothandaraman [ 12/Feb/16 ]

Flows are written to the Config DS using the binding-aware stub application. It uses the standard DataBroker and WriteTransaction with put calls.

Comment by Saibal Roy [ 16/Feb/16 ]

Hi,
With further testing and detailed logging enabled, we didn't see the issue of the cluster datastore going into an unrecoverable state.

However, the flow provisioning issue still exists. Hence we are closing this bug and raising another one.
Below are the details of the new bug.

https://bugs.opendaylight.org/show_bug.cgi?id=5364

Thanks & Regards,
Saibal Roy.

Generated at Wed Feb 07 20:32:55 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.