[OPNFLWPLUG-962] Multiple "expired" flows take up the memory resource of CONFIG DS which leads to Controller shutdown. Created: 30/Nov/17  Updated: 16/Jan/18  Resolved: 16/Jan/18

Status: Resolved
Project: OpenFlowPlugin
Component/s: None
Affects Version/s: None
Fix Version/s: Carbon-SR3, Nitrogen-SR2, Oxygen

Type: Bug
Priority: High
Reporter: Vaibhav Hemant Dixit
Assignee: Anil Vishnoi
Resolution: Won't Do
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File karaf.log    

 Description   

#security-status: confirmed

Please Note: This issue is a possible security vulnerability, do not discuss outside of this Jira or stage any patches on gerrit until the embargo process reaches that stage.

I am sending this to you in advance to give us some lead time to triage...

ISSUE: Multiple "expired" flows take up the memory resource of CONFIG DS
which leads to CONTROLLER shutdown.

STEPS TO REPRODUCE:

1. Start the controller.
2. Connect OpenFlow switches (Mininet can be used).
3. Send multiple different flows with "idle-timeout" and "hard-timeout"
set to config store:
(http://<CONTROLLER-IP>:8181/restconf/config/opendaylight-inventory:nodes/node/openflow:1/table/0)

4. Verify that the OPENFLOWPLUGIN is working fine: Flows are
transferred to network and to OPERATIONAL datastore.
5. Depending on JVM size, the expired flows are bound to crash the
controller
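
For illustration, a flow of the kind described in step 3 might look like the following. This is a minimal sketch that only builds the request URL and JSON body; it sends nothing. The helper name `build_flow_request`, the `flow-node-inventory:flow` wrapper, the `/flow/<id>` suffix, and every field other than `idle-timeout` and `hard-timeout` are assumptions based on the OpenFlowPlugin flow model, not details given in this report.

```python
import json

# Table URL as given in step 3; <CONTROLLER-IP> is left as a placeholder.
TABLE_URL = ("http://<CONTROLLER-IP>:8181/restconf/config/"
             "opendaylight-inventory:nodes/node/openflow:1/table/0")


def build_flow_request(flow_id, idle_timeout, hard_timeout):
    """Return (url, json_body) for one flow with both timeouts set.

    Assumed payload layout; only the two timeout fields come from the report.
    """
    body = {
        "flow-node-inventory:flow": [{
            "id": str(flow_id),
            "table_id": 0,
            "priority": 100,
            "idle-timeout": idle_timeout,   # seconds without matching traffic
            "hard-timeout": hard_timeout,   # seconds since installation
            "match": {},                    # empty match, for illustration only
        }]
    }
    return f"{TABLE_URL}/flow/{flow_id}", json.dumps(body)


url, body = build_flow_request(1, idle_timeout=10, hard_timeout=30)
print(url)
print(body)
```

Once either timeout fires, the switch drops the flow, but an entry written this way stays in the CONFIG DS until someone deletes it.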

Karaf crash LOGS are attached

OBSERVATION:

Although the installed flows (with timeouts set) are removed from the network (and thus also from the controller's operational DS), the expired entries are still present in the CONFIG DS. This may be consistent with the design goals of the CONFIG datastore, but it leaves the controller open to dangerous attacks.

The attack can originate from either NORTH or SOUTH. The description above is for a northbound attack. A southbound attack can arise when an attacker attempts a flow-flooding attack: because the flows come with timeouts, the flooding itself does not succeed, but the attacker does succeed in a CONTROLLER overflow attack (resource consumption). This is more severe and dangerous than the actual flow-table-flooding attack.

Although the network (actual flow tables) and the operational DS are only about 1% occupied, the controller runs out of resources. This happens because the installed flows are removed from the network upon timeout, while their entries remain in the CONFIG DS.

The error is not recoverable and shuts down the controller.

MITIGATION:

The expired flows should be removed from controller's CONFIG datastore.

If the design goal is to keep the flow entries persistent, there should be a threshold which should be calculated truly based on JVM's heap size.

Another thought: it makes sense for the operational DS (active state of the network) to contain only as many tables as are present in the network, while the CONFIG DS (desired state) can have a different size that is scaled up and down depending on resource usage.



 Comments   
Comment by Tom Pantelis [ 30/Nov/17 ]

This is an openflowplugin issue and should be moved to that project.

Comment by Luke Hinds [ 04/Dec/17 ]

abhijit2511 could you triage this please (i.e. confirm it's a security issue)? After that we need to decide whether it requires a code fix or an architecture recommendation (run on a secluded network, behind a rate-limiting proxy, etc.).

Comment by Abhijit Kumbhare [ 05/Dec/17 ]

Avishnoi - can you please take a look at this?

Comment by Luke Hinds [ 07/Dec/17 ]

Avishnoi could you start by triaging (confirm it's an issue), so we can then set the security status to `confirmed`?

 

This is what I posted to the security list:

 
I think we could cover both issues with an advisory that the plugin's API should only be listening on an internal network and/or be coupled with a rate-limiting proxy or web application firewall. I have found that once attempts are made to resolve DoS attacks within the app itself, it can become a case of going down the rabbit hole, and it sets an expectation of inherent robust capabilities always being present.

The plugin PTL is yet to triage, but I would be OK with a nofix, with us sending out details of the exploits and a recommendation to put a security tier around the API.

Comment by Anil Vishnoi [ 08/Dec/17 ]

lukehinds I will look at the issue tomorrow.

Comment by Anil Vishnoi [ 09/Dec/17 ]

lukehinds Keeping the configuration in the config data store is by design in the OpenDaylight controller: data in the config data store is user configuration, and the controller should not remove it on its own, because that breaks the contract with its consumers. Removing this configuration is the user's responsibility. It is a valid scenario that this can be leveraged for a DoS attack by exhausting memory resources. But the problem is not specific to the openflowplugin project; it is a general issue with the base ODL datastore. Exhausting memory through flows is one possible scenario. A user can push millions of flows with a timeout value and that can also cause the crash. So the root of the vulnerability is that a user can exhaust data store memory by pushing a lot of configuration, and some preventive mechanism needs to be in place. This needs to be handled by the Controller project, as it owns the data store component. Can you please move the issue to the Controller project for their attention?

Comment by Luke Hinds [ 10/Dec/17 ]

Avishnoi this was originally lodged under the controller project, but rovarga stated it was an openflow-plugin issue and not the controller's, so we moved the Jira. With that in mind I don't want to switch it back again. Instead we should settle its rightful home in this comment thread, and it can be moved or remain once a consensus is reached. I cannot be the arbiter here as I don't know both projects well enough, so I will ask skitt and anyone else from the SRT to help mediate where the issue should be fixed.

Comment by Anil Vishnoi [ 12/Dec/17 ]

tpantelis rovarga Tom/Robert, the attack here is basically exhausting the data store by pushing a lot of data (this can be openflowplugin data or any other data from other southbound plugins, applications, etc.). So in my opinion the preventive mechanism is required at the data store level and not at the individual consumer level. Please let me know your thoughts.

Comment by Stephen Kitt [ 12/Dec/17 ]

I don’t think we can ever find effective mitigation measures inside OpenDaylight against DoS attacks in general. Given that ODL is only supposed to be deployed in segregated networks anyway (at least, only have its admin accessible from an admin network), I’m not sure we really need to do anything more (beyond documenting this better perhaps).

Comment by Robert Varga [ 13/Dec/17 ]

At the high level, the deployment needs to be sized to support the expected workload – the datastore can hardly make that decision based on what it observes inside the JVM, simply because it does not know what exactly runs in the JVM.

From the datastore perspective, it is asked to hold some state – and it does so. The datastore does not have a concept of data expiring – this is something that is specific to the OFP model. It therefore falls to either the user or OFP to prune this expired state. I believe this can be easily achieved in FRM during reconciliation – it has to understand that a flow is already expired and should not be pushed to the switch anyway.
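
The pruning idea above can be sketched roughly as follows. This is an illustration in Python rather than FRM code: `ConfigFlow`, `installed_at`, and `split_expired` are hypothetical names, and it assumes the controller records when each flow was pushed, which this thread does not confirm. Only `hard-timeout` is used, because idle expiry depends on traffic seen by the switch; a value of 0 is treated as "never expires", as in OpenFlow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigFlow:
    flow_id: str
    hard_timeout: int    # seconds; 0 means no hard timeout
    installed_at: float  # assumed: time the flow was pushed (epoch seconds)


def is_expired(flow, now):
    """A flow with a hard timeout is expired once that many seconds have passed."""
    if flow.hard_timeout == 0:
        return False
    return now - flow.installed_at >= flow.hard_timeout


def split_expired(flows, now):
    """Return (to_push, to_prune) for a reconciliation pass."""
    to_push = [f for f in flows if not is_expired(f, now)]
    to_prune = [f for f in flows if is_expired(f, now)]
    return to_push, to_prune


flows = [
    ConfigFlow("a", hard_timeout=30, installed_at=0.0),
    ConfigFlow("b", hard_timeout=0, installed_at=0.0),
    ConfigFlow("c", hard_timeout=300, installed_at=0.0),
]
to_push, to_prune = split_expired(flows, now=60.0)
print([f.flow_id for f in to_push])   # ['b', 'c']
print([f.flow_id for f in to_prune])  # ['a']
```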

Comment by Anil Vishnoi [ 13/Dec/17 ]

rovarga You are assuming the only possible scenario is one where some expired data is present in the data store. But there are very likely scenarios where all the data in the data store is valid configuration. This also holds true not only for openflowplugin but for all consumers (southbound, northbound, applications) of the data store.

From the openflowplugin perspective, FRM reconciliation only triggers for a specific switch when that switch connects to the controller, so it is not really a good trigger point for pruning, because a switch might not disconnect for a long time. Moreover, we can't remove user configuration just because it was removed from the switch for valid reasons. Openflowplugin uses the operational data store to notify the user that configuration was removed from the switch, so it is up to the application to prune the data store.

Just for the sake of argument, assume that openflowplugin does some pruning. What if the user is running openflowplugin alongside the bgp/pcep plugin and that starts consuming memory for valid reasons? In that scenario OFP pruning won't help, because the controller can still hit the same issue and OFP has no control over it.

Given that this vulnerability can be exploited by dumping a lot of data into the data store through REST, one simple mechanism the controller could implement is to expose a configuration parameter where the user can set a heap utilization percentage at which the data store stops accepting transactions and throws an exception to the application. That can also be a deterministic trigger point for applications to do any pruning they can.
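
A minimal sketch of that threshold check, written in Python for illustration only. `HeapGuard`, `DatastoreFullError`, and the `read_heap` callback are hypothetical; as far as this thread shows, no such configuration parameter exists in the datastore. In a JVM the used and maximum heap figures could come from `Runtime.totalMemory() - Runtime.freeMemory()` and `Runtime.maxMemory()`.

```python
class DatastoreFullError(Exception):
    """Raised when a write is refused because heap use is over the limit."""


class HeapGuard:
    def __init__(self, max_heap_percent, read_heap):
        # read_heap() returns (used_bytes, max_bytes); injected so it can be stubbed.
        if not 0 < max_heap_percent <= 100:
            raise ValueError("max_heap_percent must be in (0, 100]")
        self.max_heap_percent = max_heap_percent
        self.read_heap = read_heap

    def check(self):
        """Raise DatastoreFullError if heap utilization is at or above the limit."""
        used, maximum = self.read_heap()
        percent = 100.0 * used / maximum
        if percent >= self.max_heap_percent:
            raise DatastoreFullError(
                f"heap at {percent:.0f}%, limit is {self.max_heap_percent}%")

    def submit(self, apply_write):
        """Run apply_write() only if the heap check passes."""
        self.check()
        return apply_write()


store = []
guard = HeapGuard(80, read_heap=lambda: (500, 1000))  # stub: 50% used
guard.submit(lambda: store.append("flow-1"))
print(store)  # ['flow-1']
```

An application that catches the exception knows the write was not accepted and can use that as its cue to prune.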

 

Overall, I can see only two possible ways to address this vulnerability: either we do what Stephen suggested above and make it the user's responsibility, or the controller implements some mechanism along the lines of the example I provided above.

Comment by Luke Hinds [ 14/Dec/17 ]

So my thoughts here are that any rate limiting / flow control should take place in the 'front' regions of an application's architecture, rather than in a back end such as a database, meta store, etc.

So with that I think we should notify operators of the vulnerability and make two recommendations as well:

  1. All REST interfaces should be on private / trusted internal networks.
  2. The above if desired can be further bolstered by implementation of a rate limiting proxy (repose, haproxy, nginx, mod_ratelimit etc) or web application firewall.
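
As an illustration of what such a rate-limiting tier does, here is a minimal token-bucket sketch in Python. It is not how any of the proxies named above are configured; `TokenBucket` and its parameters are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Return True if a request arriving at time `now` may pass."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=2, capacity=3)
# Five requests at t=0: the first three use up the burst, the rest are refused.
print([bucket.allow(0.0) for _ in range(5)])  # [True, True, True, False, False]
# One second later two tokens have been refilled.
print([bucket.allow(1.0) for _ in range(3)])  # [True, True, False]
```

A real deployment would read the time from a monotonic clock and keep one bucket per client, so that a single sender cannot fill the CONFIG DS faster than operators can react.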

If no objections I will put a draft together, and we can review in here before posting?

Comment by Luke Hinds [ 15/Dec/17 ]

ok all, I have the following. Once there is an OK from one srt member and Avishnoi I will send this note out to downstream stakeholders.

Multiple "expired" flows take up the memory resource of CONFIG DS which leads to Controller shutdown.

The following issue was discovered and reported by Vaibhav Hemant Dixit.

Summary

Multiple "expired" flows take up the memory resource of CONFIG DATASTORE which leads to CONTROLLER shutdown.

Affected Services / Software

OpenFlow Plugin and OpenDaylight Controller.

Versions: Nitrogen, Carbon, Boron   rovarga, Avishnoi <- please verify versions affected (back to deprecated releases).

Discussion

If multiple different flows with "idle-timeout" and "hard-timeout" are sent to the OpenFlow Plugin REST API, the expired flows will eventually crash the controller once the resources allowed by its JVM size are exhausted.

Although the installed flows (with timeouts set) are removed from the network (and thus also from the controller's operational DS), the expired entries are still present in the CONFIG DS.

The attack can originate from either NORTH or SOUTH. The above description is for a northbound attack. A southbound attack can arise when an attacker attempts a flow-flooding attack: because the flows come with timeouts, the flooding itself does not succeed, but the attacker does succeed in a CONTROLLER overflow attack (resource consumption).

Although the network (actual flow tables) and the operational DS are only about 1% occupied, the controller runs out of resources. This happens because the installed flows are removed from the network upon timeout, while their entries remain in the CONFIG DS.

Recommended Actions

Management APIs within OpenDaylight should only ever be deployed within a segregated private network and never exposed to public networks; this includes the OpenFlowPlugin. Further protection can be added by deploying a rate-limiting proxy (such as OpenRepose, HAProxy, nginx, mod_ratelimit, etc.) or a web application firewall.

 

Comment by Luke Hinds [ 19/Dec/17 ]

ok, three days have passed so I will take the radio silence as consensus to go with the above. This will be sent to downstream stakeholders on Tuesday the 9th of January to allow operators to get past the seasonal shutdown some will have.

Comment by Luke Hinds [ 16/Jan/18 ]

This is now open with notification sent to stakeholder and public lists. vhd please close now.

Comment by Vaibhav Hemant Dixit [ 16/Jan/18 ]

Marking the bug as closed.

Generated at Wed Feb 07 20:33:50 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.