[OPNFLWPLUG-962] Multiple "expired" flows take up the memory resource of CONFIG DS which leads to Controller shutdown. Created: 30/Nov/17 Updated: 16/Jan/18 Resolved: 16/Jan/18 |
|
| Status: | Resolved |
| Project: | OpenFlowPlugin |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Carbon-SR3, Nitrogen-SR2, Oxygen |
| Type: | Bug | Priority: | High |
| Reporter: | Vaibhav Hemant Dixit | Assignee: | Anil Vishnoi |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
#security-status: confirmed
Please Note: This issue is a possible security vulnerability; do not discuss it outside of this Jira or stage any patches on Gerrit until the embargo process reaches that stage. I am sending this to you in advance to give us some lead time to triage.
ISSUE: Multiple "expired" flows take up the memory resource of the CONFIG DS.
STEPS TO REPRODUCE:
1. Start the controller.
4. Verify that the OpenFlowPlugin is working fine. Karaf crash logs are attached.
OBSERVATION: Although the installed flows (with timeouts set) are removed from the network (and thus also from the controller's operational DS), the expired entries are still present in the CONFIG DS. This may adhere to the design goals of the CONFIG datastore, but it leads to unbounded memory growth. The attack can originate from either NORTH or SOUTH. The above description is for a northbound attack. A southbound attack can originate when an attacker attempts a flow-flooding attack: since the flows come with timeouts, the flooding itself is not successful, but the attacker still succeeds in a CONTROLLER overflow (resource consumption) attack. Although the network (actual flow tables) and operational DS are only ~1% occupied, the controller runs out of resources. This happens because the installed flows are removed from the network upon timeout while their entries remain in the CONFIG DS. The error is not recoverable and shuts down the controller.
MITIGATION: The expired flows should be removed from the controller's CONFIG datastore. If the design goal is to keep the flow entries persistent, there should be a threshold, calculated from the JVM's heap size. Another thought: it makes sense for the operational DS (active state of the network) to contain only as many tables as are present in the network; the CONFIG DS (desired state) can have a different size, which can be scaled up and down depending on resource usage. |
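The northbound reproduction path described above can be sketched as follows. This is a minimal illustration, not the reporter's actual script: the RESTCONF base URL, node id, credentials, and match fields are assumptions chosen for the example.

```python
# Hypothetical sketch of the northbound attack path: many distinct flows,
# each carrying idle/hard timeouts, are PUT into the controller's *config*
# datastore. URL, node id, and credentials below are illustrative.
RESTCONF = "http://controller:8181/restconf/config"  # assumed endpoint

def build_flow(flow_id: int, idle_timeout: int = 10, hard_timeout: int = 30) -> dict:
    """Build a minimal OpenFlow flow body with both timeouts set."""
    return {
        "flow": [{
            "id": str(flow_id),
            "table_id": 0,
            "priority": 2,
            "idle-timeout": idle_timeout,   # flow expires in the switch...
            "hard-timeout": hard_timeout,   # ...but the config entry remains
            "match": {"in-port": "1"},
            "instructions": {"instruction": []},
        }]
    }

def flow_url(node: str, table: int, flow_id: int) -> str:
    """Config-datastore URL for one flow entry (illustrative path layout)."""
    return (f"{RESTCONF}/opendaylight-inventory:nodes/node/{node}"
            f"/flow-node-inventory:table/{table}/flow/{flow_id}")

# An attacker would loop, e.g. with the 'requests' library:
#   for i in range(1_000_000):
#       requests.put(flow_url("openflow:1", 0, i),
#                    json=build_flow(i), auth=("admin", "admin"))
# Each flow disappears from the switch (and operational DS) on timeout,
# while its config-DS entry stays behind and accumulates.
```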
| Comments |
| Comment by Tom Pantelis [ 30/Nov/17 ] |
|
This is an openflowplugin issue and should be moved to that project. |
| Comment by Luke Hinds [ 04/Dec/17 ] |
|
abhijit2511 could you triage this please (i.e. confirm it's a security issue)? After that we need to decide whether it requires a code fix or an architecture recommendation (run on a secluded network, behind a rate-limiting proxy, etc.). |
| Comment by Abhijit Kumbhare [ 05/Dec/17 ] |
|
Avishnoi - can you please take a look at this? |
| Comment by Luke Hinds [ 07/Dec/17 ] |
|
Avishnoi could you start by triaging (confirming it's an issue), so we can then set the security status to `confirmed`.
This is what I posted to the security list: |
| Comment by Anil Vishnoi [ 08/Dec/17 ] |
|
lukehinds I will look at the issue tomorrow. |
| Comment by Anil Vishnoi [ 09/Dec/17 ] |
|
lukehinds Keeping the configuration in the config data store is by design for the OpenDaylight controller: data present in the config data store is user configuration, and the controller should not remove that configuration on its own, because doing so breaks the contract with its consumers. Removing this configuration is the user's responsibility. That said, it is a valid scenario that this can be leveraged for a DoS attack by exhausting memory resources. But this problem is not specific to the openflowplugin project; it is a general issue with the base ODL datastore. Exhausting memory through flows is just one possible scenario: a user can push millions of flows with timeout values set, and that can also cause the crash. So the root of the vulnerability is that a user can exhaust data store memory by pushing a lot of configuration, and some preventive mechanism needs to be in place. This is something that needs to be handled by the Controller project, as it owns the data store component. Can you please move the issue to the Controller project for their attention. |
| Comment by Luke Hinds [ 10/Dec/17 ] |
|
Avishnoi this was originally lodged under the controller project, but rovarga stated it was an openflow-plugin issue and not the controller's, so we moved the Jira. With that in mind I don't want to switch it back again. Instead we should resolve its rightful home in this comment thread, and it can be moved or remain once consensus is reached. I cannot be arbiter here as I don't know both projects well enough, so I will ask skitt and anyone else from the SRT to help mediate where the issue should be fixed. |
| Comment by Anil Vishnoi [ 12/Dec/17 ] |
|
tpantelis rovarga Tom/Robert, the attack here is basically exhausting the data store by pushing a lot of data (this can be openflowplugin data, or any other data from other southbound plugins, applications, etc.). So in my opinion the preventive mechanism is required at the data store level, not at the individual consumer level. Please let me know your thoughts. |
| Comment by Stephen Kitt [ 12/Dec/17 ] |
|
I don’t think we can ever find effective mitigation measures inside OpenDaylight against DoS attacks in general. Given that ODL is only supposed to be deployed in segregated networks anyway (at least, only have its admin accessible from an admin network), I’m not sure we really need to do anything more (beyond documenting this better perhaps). |
| Comment by Robert Varga [ 13/Dec/17 ] |
|
At the high level, the deployment needs to be sized to support the expected workload – the datastore can hardly make that decision based on what it observes inside the JVM, simply because it does not know what exactly runs in the JVM. From the datastore perspective, it is asked to hold some state – and it does so. The datastore does not have a concept of data expiring – this is something that is specific to the OFP model. It therefore falls to either the user or OFP to prune this expired state. I believe this can be easily achieved in FRM during reconciliation – it has to understand that a flow is already expired and should not be pushed to the switch anyway. |
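Robert's suggestion, that FRM could recognise already-expired flows during reconciliation and skip or prune them, can be sketched as below. This is a hedged illustration only: the `installed_at` bookkeeping is an assumption (the real config model does not record an install timestamp), and field names are simplified.

```python
# Sketch of pruning expired flows at reconciliation time: flows whose
# hard-timeout has elapsed relative to a (hypothetical) install timestamp
# are set aside for removal instead of being re-pushed to the switch.
import time
from dataclasses import dataclass

@dataclass
class ConfigFlow:
    flow_id: str
    hard_timeout: int      # seconds; 0 means "never expires"
    installed_at: float    # epoch seconds (assumed bookkeeping, not in the real model)

def prune_expired(flows, now=None):
    """Split flows into (to_reconcile, expired); only the former are re-pushed."""
    now = time.time() if now is None else now
    keep, expired = [], []
    for f in flows:
        if f.hard_timeout and now - f.installed_at >= f.hard_timeout:
            expired.append(f)   # candidate for removal from the config DS
        else:
            keep.append(f)
    return keep, expired
```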
| Comment by Anil Vishnoi [ 13/Dec/17 ] |
|
rovarga You are assuming the only scenario is one where expired data is present in the data store, but there are very likely scenarios where all the data present in the data store is valid configuration. This holds true not only for the openflowplugin but for all consumers (southbound, northbound, applications) of the data store. From the openflowplugin perspective, FRM reconciliation is only triggered for a specific switch when that switch connects to the controller, so it is not really a good trigger point for pruning, because a switch might not disconnect for a long duration. Moreover, we can't just remove user configuration because that configuration was removed from the switch for valid reasons. The openflowplugin uses the operational data store to notify the user about removal of configuration from the switch, so it is up to the application to prune the data store. For the sake of argument, assume the openflowplugin does some pruning: what if the user is running the openflowplugin alongside the bgp/pcep plugin, and that starts consuming memory for valid reasons? In that scenario OFP pruning won't help, because the controller can still hit the same issue, and OFP has no control over it. Given that this vulnerability can be exploited by dumping a lot of data into the data store through REST, one simple mechanism the controller could implement is to expose a configuration parameter where the user sets a percentage of heap utilization at which the data store stops accepting transactions and throws an exception to the application. That would also be a deterministic trigger point for applications to do whatever pruning they can.
Overall, to address this vulnerability I can see only two possible solutions: either we do what Stephen suggested above and make it the user's responsibility, OR the controller implements a mechanism along the lines of the example I provided above. |
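The configurable heap-utilization cut-off Anil proposes can be sketched as follows. This is an assumption-laden illustration, not the controller's API: the memory probe is injected as a callback here, whereas a JVM datastore would consult something like `MemoryMXBean` instead.

```python
# Sketch of a write guard: once memory use crosses a user-configured
# percentage, the store refuses new writes with an exception that gives
# applications a deterministic trigger point to start pruning.
class DatastoreOverloadError(Exception):
    """Raised when the configured heap-utilization threshold is exceeded."""

class GuardedStore:
    def __init__(self, max_heap_pct: float, used_pct_probe):
        self.max_heap_pct = max_heap_pct      # e.g. 85.0 (% of heap)
        self.used_pct_probe = used_pct_probe  # callable returning current used %
        self.data = {}

    def write(self, key, value):
        if self.used_pct_probe() >= self.max_heap_pct:
            # Reject the transaction instead of letting the JVM die later.
            raise DatastoreOverloadError("heap utilization threshold exceeded")
        self.data[key] = value
```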
| Comment by Luke Hinds [ 14/Dec/17 ] |
|
So my thoughts here are that any rate limiting / flow control should take place in the 'front' regions of an application's architecture, rather than in any back end such as a database, metadata store, etc. With that in mind, I think we should notify operators of the vulnerability and make two recommendations as well:
If no objections I will put a draft together, and we can review in here before posting? |
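The front-end rate limiting Luke describes is typically done by a proxy (nginx, HAProxy, etc.) rather than in the controller, but the underlying mechanism can be illustrated with a minimal token bucket; the numbers are illustrative only.

```python
# Minimal token-bucket sketch of front-end rate limiting: requests drain
# tokens, tokens refill at a fixed rate, and requests beyond the budget
# are rejected (a proxy would answer HTTP 429).
import time

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s        # refill rate, tokens per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0        # admit this request
            return True
        return False                  # reject this request
```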
| Comment by Luke Hinds [ 15/Dec/17 ] |
|
ok all, I have the following. Once there is an OK from one SRT member and Avishnoi, I will send this note out to downstream stakeholders.
Multiple "expired" flows take up the memory resource of CONFIG DS which leads to Controller shutdown. The following issue was discovered and reported by Vaibhav Hemant Dixit.
Summary: Multiple "expired" flows take up the memory resource of the CONFIG DATASTORE, which leads to CONTROLLER shutdown.
Affected Services / Software: OpenFlow Plugin and OpenDaylight Controller. Versions: Nitrogen, Carbon, Boron. rovarga, Avishnoi: please verify the versions affected (back to deprecated releases).
Discussion: If multiple different flows with "idle-timeout" and "hard-timeout" are sent to the OpenFlow Plugin REST API, the expired flows will eventually crash the controller once the resource allocations set by the JVM size are exceeded. Although the installed flows (with timeouts set) are removed from the network (and thus also from the controller's operational DS), the expired entries are still present in the CONFIG DS. The attack can originate from either NORTH or SOUTH. The above description is for a northbound attack. A southbound attack can originate when an attacker attempts a flow-flooding attack; since the flows come with timeouts, that attack is not successful, but the attacker will now succeed in a CONTROLLER overflow attack (resource consumption). Although the network (actual flow tables) and operational DS are only ~1% occupied, the controller runs out of resources, because the installed flows are removed from the network upon timeout while their config entries remain.
Recommended Actions: Management APIs within OpenDaylight should only ever be deployed within a segregated private network and never exposed to public networks; this includes the OpenFlowPlugin. Further protection can be implemented by deploying a rate-limiting proxy (such as OpenRepose, HAProxy, nginx, mod_ratelimit, etc.) or a web application firewall.
|
| Comment by Luke Hinds [ 19/Dec/17 ] |
|
ok, three days have passed, so I will take the radio silence as consensus to go with the above. This will be sent to downstream stakeholders on Tuesday the 9th of January, to allow operators to get past the seasonal shutdown some will have. |
| Comment by Luke Hinds [ 16/Jan/18 ] |
|
This is now open, with the notification sent to the stakeholder and public lists. vhd please close now. |
| Comment by Vaibhav Hemant Dixit [ 16/Jan/18 ] |
|
Marking the bug as closed. |