[OPNFLWPLUG-971] Node reconciliation installs old, expired flows: switch conquers flows for infinite time Created: 09/Jan/18  Updated: 10/Sep/18

Status: Open
Project: OpenFlowPlugin
Component/s: openflowplugin
Affects Version/s: Nitrogen, Carbon
Fix Version/s: None

Type: Bug Priority: Medium
Reporter: Vaibhav Hemant Dixit Assignee: Anil Vishnoi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by OPNFLWPLUG-530 [FLOW RECONCILIATION]Expirable flows ... Resolved

 Description   

As part of the node reconciliation process in OpenDaylight, all flows in the config datastore are installed back on the switch upon reconnection. This includes flows that are still active as well as flows that have already expired.

This is also due to the fact that flows persist in the config datastore regardless of whether they are active or expired.

Severity:

  • An Advanced Persistent Threat from an app leading to a table-overflow attack on the switch.
  • A malicious switch circumventing the controller's control, allowing communication for an indefinite period.

Exploit:

Northbound Attack (Advanced Persistent Threat [APT]):

Attacker: An application with covert intent

  1. An application with covert intentions keeps installing legitimate timed (idle/hard timeout) flows on a switch with varying parameters/priorities.
  2. At any point during the switch's active cycle, the switch may disconnect from and reconnect to the controller.
  3. On every such status change, all flows stored in the config datastore are installed on the switch as part of the plugin's reconciliation process.
  4. The installed flows include both active (new) flows and expired (old) flows. Essentially, every single flow ever configured for this switch is installed again.
  5. The less severe problem is that unexpected and unwanted communication will be allowed in the network.
  6. The security-critical issue is that this Advanced Persistent Threat will, at some point, succeed in a switch table-overflow attack.

A table-overflow attack has its own consequences; denial of service (DoS) to authorized hosts is one of them.

Moreover, a security application on the controller can fail to detect this, since it takes into account only active flows and not expired ones.
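
For illustration, a minimal plain-Java sketch of this accumulation, assuming a simplified flow record with a hypothetical install timestamp (the real config model keeps no such field, which is exactly why expiry cannot be checked); it only models why a single reconciliation replays expired entries alongside active ones:

{code:java}
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the APT pattern described above: the config datastore
// keeps every timed flow ever written, so one reconciliation replays expired
// flows alongside the few that are still active.
public class AptFlowAccumulation {

    // Simplified stand-in for a flow entry; installedAt is an assumed field,
    // not something the real config datastore records.
    record FlowConfig(String id, int priority, int hardTimeoutSeconds, Instant installedAt) {}

    public static void main(String[] args) {
        List<FlowConfig> configDatastore = new ArrayList<>();
        Instant now = Instant.now();

        // The covert app keeps adding "legitimate" timed flows with varying priorities;
        // older entries were written longer ago, so only the most recent few are live.
        for (int i = 0; i < 1000; i++) {
            configDatastore.add(new FlowConfig("flow-" + i, i % 64, 60, now.minusSeconds(i * 10L)));
        }

        // On reconnection, reconciliation pushes back every stored flow.
        long expired = configDatastore.stream()
                .filter(f -> f.installedAt().plusSeconds(f.hardTimeoutSeconds()).isBefore(now))
                .count();
        System.out.println("Flows replayed to switch: " + configDatastore.size()
                + " (of which already expired: " + expired + ")");
    }
}
{code}

With these assumed numbers the sketch replays 1000 flows, of which 993 are already expired; this is the table pressure the description refers to.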

 

Southbound Attack (Uncontrolled communication):

Attacker: A malicious switch together with host A

  1. The first packet of a traffic flow from host A to any host B goes to the controller.
  2. The controller installs a flow rule allowing traffic between the hosts for only X minutes.
  3. However, the malicious switch reconnects to the controller every X-Y minutes, just before the rule expires.
  4. On every reconnection, the controller installs the same old flow, allowing communication for another X minutes.
  5. The process repeats indefinitely, and the first packet of an otherwise new flow never reaches the controller.
  6. Ideally, the controller should not reinstall the same expired flow; the flaw in the reconciliation workflow turns this into a vulnerability.

 

This means malicious host A can communicate with another host B for an indefinite amount of time, circumventing any present and future security mechanisms on the controller (such as new ACLs).
The controller believes the communication lasts only X minutes, whereas it actually continues for the entire life cycle.
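
For illustration, a minimal sketch of this reconnect loop with assumed numbers (X = 5 minutes, reconnect every 4 minutes); the only point is that whenever the reconnect period is shorter than the timeout, the remaining lifetime is reset before it ever reaches zero:

{code:java}
import java.time.Duration;

// Timeline sketch of the southbound attack: the switch reconnects just before
// the rule would expire, and reconciliation resets the timer to a fresh X minutes.
public class ReconnectLoopSketch {
    public static void main(String[] args) {
        Duration hardTimeout = Duration.ofMinutes(5);     // X: intended flow lifetime
        Duration reconnectEvery = Duration.ofMinutes(4);  // X-Y: malicious reconnect period

        Duration elapsed = Duration.ZERO;
        Duration remaining = hardTimeout;
        for (int cycle = 1; cycle <= 5; cycle++) {
            elapsed = elapsed.plus(reconnectEvery);
            // Reconciliation reinstalls the flow from config with the timer reset,
            // instead of carrying over the remaining lifetime.
            remaining = hardTimeout;
            System.out.printf("after %d min: remaining lifetime reset to %d min%n",
                    elapsed.toMinutes(), remaining.toMinutes());
        }
        // As long as reconnectEvery < hardTimeout, "remaining" never reaches zero.
    }
}
{code}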

Mitigation:

  • Change/fix the node reconciliation workflow.
  • Do not leave the deletion of old and expired flows from the config datastore to an app. Instead, openflowplugin or MD-SAL should do this.


 Comments   
Comment by Luke Hinds [ 09/Jan/18 ]

Avishnoi please triage

agrimberg please could you add Vaibhav Hemant Dixit as the reporter.

Reminder to all, this is yet to be triaged, so treat it as private until further notice.

Comment by Andrew Grimberg [ 09/Jan/18 ]

lukehinds, I've added vhd as the reporter as requested.

Comment by Luke Hinds [ 16/Jan/18 ]

Avishnoi, please triage this issue.

Comment by Luke Hinds [ 06/Feb/18 ]

Avishnoi, any progress on this issue? Has it been confirmed yet?

Comment by Anil Vishnoi [ 13/Feb/18 ]

Configuration present in the config datastore is the application's configuration, and the plugin must not modify it. This is by design for most of the southbound plugins in OpenDaylight. The OpenFlow plugin provides the operational status of the switch to the application through the operational datastore. So if a flow is evicted from the switch because its timeout expired, it will be removed from the operational datastore. The application should monitor the operational status of the switch, and when it sees that the flow has been removed from the operational datastore, it should remove the flow from the config datastore.

If the switch gets disconnected from the controller, the OpenFlow plugin will remove the switch from the operational datastore, and the application should remove from the config datastore any configuration related to that node that it does not want reconciled if the switch reconnects.

The OpenFlow plugin removing application configuration breaks the programming model, because the same configuration would be modified by two entities (the application and the OpenFlow plugin) that do not share the application context. The application and the plugin might create race conditions, with the application writing a flow to the config datastore while the plugin is deleting it; this shared responsibility for managing application configuration is a risky path, and that is why it was intentionally designed the way it is now.
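
For illustration, a minimal sketch of that application-side contract, using hypothetical listener/store interfaces rather than the actual MD-SAL data-change listener APIs:

{code:java}
import java.util.function.Consumer;

// Hypothetical sketch of the responsibility described above: when a flow
// disappears from the operational datastore (e.g. evicted by the switch on
// timeout), the application deletes the matching config entry so it will not
// be replayed by reconciliation. Interface names are illustrative only.
interface OperationalFlowEvents {
    void onFlowRemoved(Consumer<String> handler); // handler receives the flow id
}

interface ConfigFlowStore {
    void delete(String flowId);
}

class FlowCleanupApp {
    FlowCleanupApp(OperationalFlowEvents operEvents, ConfigFlowStore configStore) {
        // Mirror operational removals into config deletions so stale timed
        // flows do not linger in the config datastore.
        operEvents.onFlowRemoved(configStore::delete);
    }
}
{code}

The design point being argued is that only the application, which owns the context, decides to delete; the plugin never touches the config datastore.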

Comment by Vaibhav Hemant Dixit [ 13/Feb/18 ]

Thanks for the insights, Avishnoi.
I agree that this can be prevented if the application that modifies the configuration also manages and owns its outcomes later.

However, that is where the vulnerability lies in the first place. There is no inherent functionality in ODL that verifies what the application is doing. A simple case like switch reconciliation can lead to unauthorized/unwanted traffic in the network. This can happen with an application with or without adversarial intent.

Why the OpenFlow plugin can help:

The argument that it is risky for the same configuration to be modified by two different entities is valid for modification (write/delete). In the case of reconciliation, however, the plugin only needs to read (which it does) and ignore (which it does not) the expired flow rules. It makes sense for the OpenFlow plugin to ignore configuration that has expired. An application that put configuration into the config tree intended it only for the duration specified in that configuration (timeouts, in the case of flow rules). The application can take its own time to clean up the config installed in its context. Meanwhile, when the OpenFlow plugin (which works independently) comes to check the current configuration, it should simply ignore the expired entries. This way we still do not violate the design principles.

One reason for the inconsistency that I can think of is the absence of a state variable in the ODL datastores reflecting whether a piece of configuration is active or expired. This is not the case with other controllers on the market. Furthermore, not doing any simple validation or sanity checks on the picked-up configuration affects the entire network, as described in the bug description.
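
For illustration, a minimal sketch of the proposed "read and ignore" filter, assuming a per-flow install timestamp were available (the missing state variable noted above):

{code:java}
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the "read and ignore" idea: reconciliation skips entries whose
// timeout has already elapsed, without ever modifying the config datastore.
// The installedAt field is an assumption; ODL does not currently record it.
class ReconciliationFilter {

    record TimedFlow(String id, int hardTimeoutSeconds, Instant installedAt) {
        boolean isExpired(Instant now) {
            return installedAt.plusSeconds(hardTimeoutSeconds).isBefore(now);
        }
    }

    static List<TimedFlow> flowsToReinstall(List<TimedFlow> configFlows, Instant now) {
        return configFlows.stream()
                .filter(f -> !f.isExpired(now))   // ignore expired entries, do not delete them
                .collect(Collectors.toList());
    }
}
{code}

Because the filter only skips entries and never writes, the config tree still belongs solely to the application.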

Comment by Vaibhav Hemant Dixit [ 13/Feb/18 ]

Another important concern is that the plugin restores the static, not the dynamic, state of the switch upon reconnection.

Reconciliation, if it exists, should only restore the last known operating state of the switch, by definition.

There is a problem, though. As Anil already mentioned, any data structures related to a disconnecting switch are removed from the operational tree. Since there is no backup of the last known state, the configuration can only be picked up from the CONFIG store. A simple solution is to take a snapshot of the operating state of the switch before flushing the data structures, or just to retain the operational state for some default safe time.

This ensures that the configuration lives only for the time it was intended to. An application may want the config to be installed back on the switches upon reconnection and therefore leaves it there. The OpenFlow plugin then takes the configuration from the store but resets the timers: a flow rule with a 60 s timeout gets another life of 60 s, when it should have lived only for its remaining time.

If an application had to continuously monitor the operational state of the switch to maintain the corresponding timers (and other state) in the config tree, we would not need an operational datastore at all.
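
For illustration, a sketch of the remaining-lifetime idea under the same assumption that a snapshot of the last known install time is retained across the disconnect:

{code:java}
import java.time.Duration;
import java.time.Instant;

// Sketch of installing a reconciled flow with only its remaining lifetime
// rather than a full timer reset. The install-time snapshot is an assumption;
// it is exactly the state the current workflow discards on disconnect.
class RemainingLifetime {

    static int remainingTimeoutSeconds(int configuredTimeoutSeconds,
                                       Instant installedAt,
                                       Instant now) {
        long elapsed = Duration.between(installedAt, now).getSeconds();
        long remaining = configuredTimeoutSeconds - elapsed;
        return (int) Math.max(0, remaining); // 0 means: do not reinstall at all
    }
}
{code}

A 60 s flow reconciled 30 s after installation would then be pushed with a 30 s timeout, and a flow whose remaining time is zero would simply be skipped.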

Comment by Robert Varga [ 21/Feb/18 ]

I think the key unanswered question is: when does the timer start to tick?

Since the timer is persisted, the answer to this question needs to consider the controller completely dying just after the flow config is stored and similar scenarios.

Once we have that answer we can discuss who does what and when.

Comment by Robert Varga [ 21/Feb/18 ]

I will also note that our state-keeping is based on the IETF topology model, which allows for derived state, scratch spaces and similar – effectively supporting a superset of the operating model described in https://tools.ietf.org/html/draft-ietf-netmod-revised-datastores . Taking advantage of this, though, will require taking a step back from the singular approach (e.g. there is exactly one config and exactly one oper piece of state for each switch).

Comment by Kurt Seifried [ 21/Feb/18 ]

Not to confuse the matter, but assuming this is a security issue and needs a CVE, then my questions, from the CVE perspective, would be:

1) What is the documented behavior/intentions around this? Are there any claimed security properties?

2) For flows that have expired: the config datastore has enough data about the flow to restore it, but does not provide enough data for the switch to figure out that a particular flow might be expired? Is my understanding correct?

Comment by Vaibhav Hemant Dixit [ 21/Feb/18 ]

1) What is the documented behavior/intentions around this? Are there any claimed security properties?

The adversarial intent of an attacker can originate from the north and also from the south.
This is described with actual experiments in more detail in the bug description. Security properties claimed:
Integrity:
Essentially, installing flows with a longer life than intended allows unwanted traffic and communication in the network through the affected switches. This can be an adversarial intent (detailed in the description), and even in the natural course of network operation this vulnerability can potentially allow undesired communication (wrong flow rules). Either way, there will be chaos in the network.
Availability:
If multiple applications are installing flow rules for a switch, then at some point, when the switch restarts or reconnects, the vulnerability can potentially lead to flow-table overflow (as all the active and old configuration in the config tree will be reconciled). Individual applications responsible for these configurations may have their own cleanup cycles, while the reconciliation process takes the configuration from all contributing applications and installs it without any checks.

 

2) For flows that have expired: the config datastore has enough data about the flow to restore it, but does not provide enough data for the switch to figure out that a particular flow might be expired? Is my understanding correct?
Yes. To reiterate, in SDN the switches really don't have to figure out anything. It is the responsibility of the services/applications in the controller to determine whether the config is expired or active. If it is active, it is important to determine the new active period for the config, instead of fetching the static timeouts from the config store.

Comment by Kurt Seifried [ 21/Feb/18 ]

Can you provide a URL to those above claims? Thanks.

As for #2, it sounds like the config datastore should maybe be manipulating the data in order to not just cache it but process it (e.g. expiring the flow when the timeout hits). I'm not sure what other options there are.

Comment by Vaibhav Hemant Dixit [ 21/Feb/18 ]

I am not sure exactly what I can provide as logs. It is a series of events in different planes (control and data) that leads to a flaw like this.
Maybe a video when I perform this experiment again.

Comment by Vaibhav Hemant Dixit [ 21/Feb/18 ]

The steps to reproduce are mentioned in the bug description, however.

Comment by Kurt Seifried [ 21/Feb/18 ]

Sorry, to be clearer: I'm talking about documentation from the OpenDaylight project, e.g. are specific security claims made around this issue? Is there documentation that states explicitly that expired flows should always expire, regardless of what happens, etc.?

Comment by Vaibhav Hemant Dixit [ 21/Feb/18 ]

I have not come across any documentation specific to what happens with expired configuration. As mentioned in Anil's comments above, it is the application's responsibility to update the config store.

However, the problem discussed here is not limited to expired flows. It also entails the fact that a timed configuration which has not yet expired but has lived for some time in the network should be put back upon reconciliation with its remaining time, not its entire life again.

Exploiting exactly this vulnerability, a malicious switch can keep a timed flow rule for an infinite amount of time by initiating the reconciliation workflow just before the flow rule actually expires.

The application responsible for the config can be fooled, as the config never actually reaches its expiry time and is reset before that.

Some useful links:

I see some work happening to address similar issues in Nitrogen (but there is still no tracking of elapsed time, or maintenance of dynamic state, for reconciliation):
http://docs.opendaylight.org/en/stable-nitrogen/submodules/openflowplugin/docs/specs/reconciliation-framework.html

https://drive.google.com/file/d/0B0idHBqccTJybEFwVG93ZUFxY1k/view

https://wiki.opendaylight.org/view/OpenDaylight_OpenFlow_Plugin:Backlog:Node_Status_Reconciliation

 

Comment by Kurt Seifried [ 21/Feb/18 ]

Thanks, so basically the question is "is a trust boundary violated?", and I think the answer here is yes.

1) The flow has expired; as such, re-instantiating the flow is clearly not expected or intended (the timer exists for a reason!).

2) This is definitely exploitable by an attacker: even if they cannot trigger it directly, they can wait for it to occur and then exploit it.

3) There is a clear impact: traffic that should NOT be allowed is allowed. If a firewall did this there would be no question as to whether this is a security vulnerability; I think the same general logic applies here.

Luke: can I assign a CVE here?

Comment by Anil Vishnoi [ 21/Feb/18 ]

vhd The application is the one managing the traffic and programming the network for it. So the application should be responsible for making sure the network is in the state it expects; it cannot rely on the plugin to take action on its behalf. The application is the one that has context about the configuration; the plugin just pushes it to the device. Given that the plugin notifies the application whenever an expired flow (or any flow) is removed from the switch (through an operational datastore notification), the application should act on it. If the application doesn't take action, that might be the application's intention, and the plugin should not interfere with it.

Now assume that we implement removal of expired flows; what should the plugin do in the following scenarios?

(1) An application installs a flow with a 5-minute hard_timeout. Someone deletes the flow 3 minutes after installation, and the application is not listening for any kind of network change. Should the plugin remove the configuration at that time, or should it wait for the 5 minutes to expire and then remove it? In either case the application is unaware that the flow was removed from the switch. This is another security vulnerability, in my opinion.

(2) Application A installs a flow with a 5-minute expiry time. Another application comes along and updates the flow 3 minutes after application A installed it, so the plugin resets the timer and removes the flow after 8 minutes. Is that okay for application A?

(3) Assume the application has logic that takes care of cleaning up expired flows. In this scenario it will conflict with the plugin, because both of them will try to remove the config from the datastore, which can lead to a race condition.

There are a few more scenarios I could list where an application not watching the network state can lead to a security vulnerability. Given that there is an immediate feedback loop available to the application that notifies it about all network changes, it is the application's responsibility to react to network events. If it is not taking any action, that means it wants to keep the existing configuration.

The current programming model is very simple: applications configure flows/groups, they can get notified about the operational state of those flows/groups (add/update/delete), and they can take action on state changes of those resources. If we move part of that functionality into config handling, it breaks the programming model, because now applications have to be aware that the plugin will remove configuration related to expired flows. So if they want to install a flow again, they need to make sure it is first removed from the config datastore before adding it; otherwise they can run into various race conditions. In my personal opinion this breaks the programming model and makes application development very fuzzy. Given that this vulnerability can be taken care of at the application level (and applications already have to take care of many other possible security vulnerabilities), it does not make sense to break the current programming model of the openflowplugin.

Comment by Vaibhav Hemant Dixit [ 21/Feb/18 ]

Thanks Avishnoi.

This issue is not limited to expired flows or such an event. The definition of an "expired flow" depends on the maintained state of the timers; that is where the problem is. And in my understanding it cannot, and should not, be handled by an application.

In the scenarios mentioned above, an application can register for events (add/update/delete), but the state (time) update does not generate any update event to handle. There are no events generated for every clock tick in the operational tree.

Moreover, even if an application were to probe the operational datastore every second for the status, it would cause huge performance overhead. Done by multiple applications, this can cause significant computation and unnecessary resource consumption. You can throw more light on this.

The design principles and the programming model are broken only if the OpenFlow plugin modifies the data in the config tree, not if some logic verifies the remaining life or expired state of the config and decides merely to ignore it.

There should be some check (not modification) done by trusted services in OpenDaylight: one badly (or maliciously) programmed application should not be able to cause chaos in an entire enterprise network. SDN is an event-driven framework: this means one misconfiguration causes a chain of unintended events.

Comment by Kurt Seifried [ 22/Feb/18 ]

If this isn't an OpenDaylight issue, then can we please make this issue public so that stakeholders can be made aware and fix their applications?

Comment by Luke Hinds [ 22/Feb/18 ]

I have no issue with going public if another SRT member +1's the action.

 

So from what I am gathering, this is another case of ensuring applications correctly handle expired flows, or of operators putting a rate-limiting solution in place should an application act in a way that causes the above issue to occur?

 

 

Comment by Vaibhav Hemant Dixit [ 22/Feb/18 ]

Even if applications handle the flow-rule events (add/update/delete), the issue persists.

There is no event for applications to listen to for time ticks (and it is a bad idea to continuously probe the datastores). If an application is only listening for the "expired -> deleted" event, the primary problem of the switch retaining the flow rule for its lifetime still haunts the network.

An application listening for such events can be easily fooled because the config never reaches expiry (reconnection just before expiry leads to a faulty reset of the timers). Thus, the application never receives any such event.

Comment by Anil Vishnoi [ 22/Feb/18 ]

vhd The definition of an "expired flow" depends on its state in the switch and not on any external timers. A user can proactively install a flow with a hard_timeout in the openflowplugin config datastore, but it will only be installed on the switch once the switch connects. Openflowplugin cannot rely on its internal timers to decide whether the flow is gone from the switch, because that can result in false positives. The only source of truth here is the state of the flow in the switch. In the current implementation, whenever the switch evicts a flow because its hard_timeout expired, the operational datastore is updated and the application gets notified about the removal from the switch. So with the current mechanism, applications don't have to do any probing; they will be notified whenever a flow expires and is removed from the switch.

Can you please explain a bit what "events generated for every clock tick" really means here, and why an application would want to receive an update for each clock tick?

The plugin's contract with the application is very clear here. It assumes that the configuration present in the config datastore is the application's intended configuration, and the plugin provides the operational status of that intended configuration through operational updates.

 

Regarding

"There should be some check(not modification) done by trusted services in OpenDayLight: one badly(or malicious) programmed application cannot be accountable to cause chaos in entire enterprise network. SDN is an event driven framework: this means one mis-configuration causes chain of unintended events."

The OpenFlow plugin has no context to figure out what bad configuration is. As long as the configuration pushed to the plugin is syntactically correct, it will try to program it on the switch; if not, it will throw an error. There is no way the plugin has the insight to understand whether any specific flow/group constitutes malicious configuration that will cause chaos in the network; that is the application's responsibility, because the plugin is not the one managing the network, it is just programming the network based on the configuration pushed by the application.

In the above comment you made a really good point that one misconfiguration can cause a chain of unintended events in the network. That is the very reason applications need to be proactive in monitoring the network state, to figure out whether they see any suspicious configuration in the network, and to keep the right configuration in the config datastore, rather than relying on the plugin to make assumptions and take action on their behalf.

Comment by Anil Vishnoi [ 22/Feb/18 ]

Given that the application will get a notification about the switch disconnection as well, are you assuming that the application will not take any action? Doesn't this assumption itself open a security concern here?

Comment by Vaibhav Hemant Dixit [ 22/Feb/18 ]

Those are some interesting insights.

Can you please explain a bit what "events generated for every clock tick" really means here, and why an application would want to receive an update for each clock tick?

As far as flow-rule expiry from the switch is concerned, I am convinced that the application can register for an event and take appropriate action on the config datastore.

That still does not address the premise of the bug: an illusion is created such that the controller never knows the flow rule has expired.

An application installs a flow with a 60-second timeout. The sequence of events that will take place:

  1. The flow lives in the network for 30 seconds.
  2. At the 30th second, the switch initiates a reconnection.
  3. The OpenFlow plugin takes the flow rule from the config datastore and installs it with the timers reset to 0.
  4. The flow rule is installed again for 60 seconds. Go back to step 1.

This way, there is never an event for flow-rule expiry. An application listening for such an event will never receive it and in turn never updates the config datastore.

Between step 2 and step 3, if an application was listening for the switch-disconnection event, what should happen?
The application does not know what the runtime state of the switch was: all the state is deleted from the operational datastore upon disconnection, and it received no prior event prompting it to take a snapshot. If the application deletes the flow rule, the configuration is lost, when it should still have been retained for its remaining life.

Comment by Anil Vishnoi [ 23/Feb/18 ]

When a switch disconnects, the application has no clue whether the switch will connect back or not; even so, if the application does not want to remove the configuration from the datastore, that is the application's decision. I would be surprised if any network service took no action when a switch disconnects from the controller, because that opens up an opportunity for traffic black-holing and network disruption. Node disconnection is a major event for an application, for the very reason you mentioned above: the application (or, for that matter, even the plugin) has no insight into the state of the switch.

Comment by Luke Hinds [ 05/Mar/18 ]

OK, folks - I will make this bug public one week from today (unless any objections are made).

Comment by Luke Hinds [ 12/Mar/18 ]

Bug is now open.

Comment by Kurt Seifried [ 12/Mar/18 ]

So a fundamental issue here, going back to the first point:

  1. An application with covert intentions keeps installing legitimate timed (idle/hard timeout) flows on a switch with varying parameters/priorities.

What is the purpose of these timeouts if the application is supposed to be tracking and enforcing them? It appears to me that there is an implied contract of "you set up a flow with a timeout, and it will time out", and failing at this opens a security vulnerability that appears CVE-worthy. So I guess my simple question is: how is this not a security vulnerability when we have behavior that results in a security impact, especially as it can be fixed?

Comment by Luke Hinds [ 12/Mar/18 ]

I have no objections to raising a CVE, but as this is not a core/mature project (and therefore not security managed) and no fix is being offered, I think being public now is the right state.

Comment by Kurt Seifried [ 14/Mar/18 ]

This has been assigned CVE-2018-1078.

Comment by Kurt Seifried [ 14/Mar/18 ]

Stupid question, but can't the timer have an associated START_TIME so you can more easily expire it? (e.g. when loading the config, check whether START_TIME + TIMER is less than or equal to the CURRENT TIME and, if so, don't load it.)
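
For illustration, a minimal sketch of that check, with START_TIME and TIMER as hypothetical per-flow fields:

{code:java}
import java.time.Instant;

// Sketch of the START_TIME suggestion above: skip loading any flow whose
// start time plus timeout is already in the past. Field names are illustrative.
class StartTimeCheck {
    static boolean shouldLoad(Instant startTime, long timeoutSeconds, Instant now) {
        // Load only if the flow has not yet reached START_TIME + TIMER.
        return startTime.plusSeconds(timeoutSeconds).isAfter(now);
    }
}
{code}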
