[CONTROLLER-1487] entity structures are kept even when the entity is removed. can be used as DOS attack Created: 20/Feb/16  Updated: 25/Jul/23  Resolved: 22/May/18

Status: Resolved
Project: controller
Component/s: clustering
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Jamo Luhrsen Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Duplicate
is duplicated by OVSDB-297 openflow entity are not removed after... Resolved
External issue ID: 5397

 Description   

When a node connects to ovsdb southbound, an entity-owner node is
created. For a single node, it looks as below. However, when the device
disconnects, the entity still remains. It does correctly reflect that there is no
ownership, but it should be removed entirely.

when node connected:

{
  "entity-owners": {
    "entity-type": [
      {
        "entity": [
          {
            "candidate": [
              { "name": "member-1" }
            ],
            "id": "/network-topology:network-topology/network-topology:topology[network-topology:topology-id='ovsdb:1']/network-topology:node[network-topology:node-id='ovsdb://uuid/bc330598-4581-4d9f-b932-e362b452137b']",
            "owner": "member-1"
          }
        ],
        "type": "ovsdb"
      },
      {
        "entity": [
          {
            "candidate": [
              { "name": "member-1" }
            ],
            "id": "/general-entity:entity[general-entity:name='ovsdb-southbound-provider']",
            "owner": "member-1"
          }
        ],
        "type": "ovsdb-southbound-provider"
      }
    ]
  }
}

when node disconnected:

{
  "entity-owners": {
    "entity-type": [
      {
        "entity": [
          { "id": "/network-topology:network-topology/network-topology:topology[network-topology:topology-id='ovsdb:1']/network-topology:node[network-topology:node-id='ovsdb://uuid/bc330598-4581-4d9f-b932-e362b452137b']", "owner": "" }
        ],
        "type": "ovsdb"
      },
      {
        "entity": [
          {
            "candidate": [
              { "name": "member-1" }
            ],
            "id": "/general-entity:entity[general-entity:name='ovsdb-southbound-provider']",
            "owner": "member-1"
          }
        ],
        "type": "ovsdb-southbound-provider"
      }
    ]
  }
}
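
A minimal sketch of how the leftover entries could be spotted in output like the "disconnected" case above: it walks the entity-owners structure and reports entities that have neither an owner nor any candidate. The function name stale_entities and the shortened id "ovsdb-node-1" are made up for illustration; only the key names come from the output shown in this report.

```python
import json


def stale_entities(entity_owners):
    """Return (type, id) pairs for entities with no owner and no candidates."""
    stale = []
    for etype in entity_owners.get("entity-type", []):
        for entity in etype.get("entity", []):
            if not entity.get("owner") and not entity.get("candidate"):
                stale.append((etype.get("type"), entity.get("id")))
    return stale


# Trimmed-down copy of the "disconnected" output; ids shortened (assumption).
sample = json.loads("""
{
  "entity-owners": {
    "entity-type": [
      {"type": "ovsdb",
       "entity": [{"id": "ovsdb-node-1", "owner": ""}]},
      {"type": "ovsdb-southbound-provider",
       "entity": [{"id": "ovsdb-southbound-provider",
                   "candidate": [{"name": "member-1"}],
                   "owner": "member-1"}]}
    ]
  }
}
""")

result = stale_entities(sample["entity-owners"])
print(result)  # [('ovsdb', 'ovsdb-node-1')]
```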



 Comments   
Comment by Anil Vishnoi [ 21/Feb/16 ]

Hi Jamo,

As per the discussion on following thread, this is expected behavior.

https://lists.opendaylight.org/pipermail/integration-dev/2016-February/005957.html

Comment by Jamo Luhrsen [ 23/Feb/16 ]

The root issue here is that the entity owner service (EOS) does not clean up the entry when it goes "candidateless". So when a lot of new/unique
entities are learned and removed, we continue to use resources, and
if this is repeated there will eventually be an OutOfMemory exception and a crash.

one quick way to see this is to start an ovs bridge:

sudo ovs-vsctl add-br memLeakerBridge01

connect it to openflowplugin:

sudo ovs-vsctl set-controller memLeakerBridge01 tcp:${ODL_IP}:6633

(NOTE: feature installed is openflowplugin-flow-services-rest)

cycle through mac addresses:

sudo ovs-vsctl set bridge memLeakerBridge01 other-config:hwaddr=00:00:00:00:00:01
sudo ovs-vsctl set bridge memLeakerBridge01 other-config:hwaddr=00:00:00:00:00:02
sudo ovs-vsctl set bridge memLeakerBridge01 other-config:hwaddr=00:00:00:00:00:03
... and so on

here's a hack of a python script to do the cycling:

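import subprocess
import time

# Cycle the last three octets of the bridge MAC address.
for i in range(0x00, 0xFF):
    for j in range(0x00, 0xFF):
        for k in range(0x00, 0xFF):
            mac = "00:00:00:" + format(i, '02x') + ":" + format(j, '02x') + ":" + format(k, '02x')
            cmd = "sudo ovs-vsctl set bridge memLeakerBridge01 other-config:hwaddr=" + mac
            print(cmd)
            subprocess.call(cmd, shell=True)  # execution step reconstructed; the pasted script was garbled here
            time.sleep(2)  # without this pause, it doesn't work. I did not investigate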

Comment by Robert Varga [ 24/Feb/16 ]

As indicated, entities do not have a complete lifecycle (as usual, removal is missing) and this is a bug.

Comment by Tom Pantelis [ 24/Feb/16 ]

I've prototyped it but I'm afraid that removing an entity when "candidateless" will introduce latent timing bugs. I'm seeing timing-related failures even in unit tests. Even if we delay removal via timer (say 15 sec) and recheck that there are no current candidates, there may be a candidate add transaction inflight, in which case the delete would remove it afterwards. I don't see any way to alleviate this potential issue with the way the in-memory data tree works.

I'm inclined to leave it as is. The memory footprint for an empty entity node is pretty small so it would likely take millions to run OOM (depending on how much memory is allocated) and, at least with current use cases, complete removal of an entity should be infrequent.

Wrt the DOS attack, any config yang list is a potential DOS attack. E.g., one could easily keep putting nodes into the inventory node list via restconf and run it OOM.

Comment by Jamo Luhrsen [ 24/Feb/16 ]

(In reply to Tom Pantelis from comment #4)
> I've prototyped it but I'm afraid that removing an entity when
> "candidateless" will introduce latent timing bugs. I'm seeing timing-related
> failures even in unit tests. Even if we delay removal via timer (say 15 sec)
> and recheck that there are no current candidates, there may be a candidate
> add transaction inflight, in which case the delete would remove it
> afterwards. I don't see any way to alleviate this potential issue with the
> way the in-memory data tree works.
>
> I'm inclined to leave it as is. The memory footprint for an empty entity
> node is pretty small so it would likely take millions to run OOM (depending
> on how much memory is allocated) and, at least with current use cases,
> complete removal of an entity should be infrequent.

I did not specifically keep track of the count, but I think I ran it OOM
with ovsdb entries numbering in the thousands and maybe slightly less
when doing it with openflow entries. If it's important to get a specific
number, I can. Just wanted to give my observation since it seemed like
a lot less than millions.

> Wrt to DOS attack, any config yang list is a potential DOS attack. Eg, one
> could easily keep putting nodes to the inventory node list via restconf and
> run it OOM.

fair point, but at least in this specific case we can also issue a
delete via restconf. I'm not sure the right way to get rid of the
stale entities, except by restarting.

Comment by Tom Pantelis [ 24/Feb/16 ]

The number will depend on how much memory you allocate to the JVM - I think the default is 2G so thousands would probably do it. But the same issue would occur if you actually had thousands of valid entities.

Also doesn't OVSDB/OF leave the inventory node for a bit of time before purging? If so that's accounting for memory as well and will be affected by your blaster script.

If you run OOM, the process would likely be hosed and you likely wouldn't be able to issue a delete. And on restart you would likely run OOM again when it restored from persistence, unless you bumped the memory. At least with entities they don't come back on restart since they're operational.

I'm not saying we shouldn't remove the entries - I just don't see a safe/foolproof way to do it automatically w/o introducing potential race conditions that would result in difficult sporadic bugs to track down. Definitely with removing them immediately. Using a timer would be safer and reduce the chance for a race condition. How long is safe? Probably some minutes to hours we could assume the entity won't come back, although it would require more memory for the bookkeeping to track which ones are eligible to purge. But that still wouldn't stop a DOS attack like your script which would hammer it in seconds.
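
To make the bookkeeping cost of the timer idea concrete, here is a rough sketch in plain Python. It is not EOS code: PurgeTracker, its method names and the 600 second grace period are all hypothetical. It keeps one timestamp per candidateless entity and reports those that have stayed empty for the whole grace period. It does not address the in-flight transaction race, and a fast blaster script would still fill memory long before anything became eligible.

```python
class PurgeTracker:
    """Tracks when each entity lost its last candidate (hypothetical helper)."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.empty_since = {}  # entity id -> time it became candidateless

    def candidate_removed(self, entity_id, remaining_candidates, now):
        # Only start the clock when the last candidate goes away.
        if remaining_candidates == 0:
            self.empty_since.setdefault(entity_id, now)

    def candidate_added(self, entity_id):
        # The entity came back, so it is no longer eligible for purging.
        self.empty_since.pop(entity_id, None)

    def eligible(self, now):
        # Entities that have been empty for at least the grace period.
        return sorted(e for e, t in self.empty_since.items()
                      if now - t >= self.grace_seconds)


tracker = PurgeTracker(grace_seconds=600)
tracker.candidate_removed("entity1", 0, now=0)
tracker.candidate_removed("entity2", 0, now=100)
tracker.candidate_added("entity2")            # entity2 is re-registered
tracker.candidate_removed("entity3", 0, now=500)
print(tracker.eligible(now=700))  # ['entity1']
```

The extra dictionary entry per empty entity is the additional memory mentioned above.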

Comment by Robert Varga [ 24/Feb/16 ]

I do not pretend to understand the EOS implementation, but EntityOwnershipShard is based on Shard, hence it inherently contains an internal DataTree and can control what data is inside that.

Since entities are stored in a list, the DataTree does not remove them automatically (as it does for non-presence containers), but it is certainly in the realm of possibility for EntityOwnershipShard to make this list appear and disappear as candidates are added or deleted.

This would require a non-trivial amount of surgery, I suspect, probably increasing coupling between Shard and EntityOwnershipShard. Since we have splitting out EOS on our plate for Boron, I suggest we tackle this problem once the split is done, make that squeaky-clean and then consider a backport to Beryllium.

Comment by Tom Pantelis [ 24/Feb/16 ]

The EntityOwnershipShard has full control over the data.

I'll provide an example race condition:

  • node1 registers a candidate for entity1
  • sometime later node1 unregisters its candidate and submits a tx to remove it
  • the EOS leader commits the tx
  • the candidate list DTCL gets triggered and starts processing the change
  • in the meantime, node2 registers a candidate for entity1 and submits tx1 to add it
  • the candidate list DTCL finishes its processing and sends a CandidateRemoved message
  • the leader pre-commits tx1 and replicates
  • the leader receives the CandidateRemoved and sees that there are no more candidates
    and submits tx2 to remove entity1. However tx1 is in progress so tx2 is queued
  • consensus is received for tx1 and it is committed - node2 is now in the candidate
    list
  • however the leader then commits tx2 which deletes entity1

So we end up with a strange condition where a node thinks it has a candidate but the entity is gone. It would also try to write node2 as the owner but that would fail.
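
The sequence above can be replayed with a toy model. This is a simplified sketch in plain Python, not the actual Shard or DTCL code: the dict, the queue and the helper names are assumptions, and replication/consensus is reduced to "transactions commit strictly in queue order".

```python
from collections import deque

entities = {"entity1": {"node1"}}  # entity id -> set of candidate names
queue = deque()                    # transactions commit strictly in queue order


def remove_candidate(entity, node):
    return lambda: entities[entity].discard(node)


def add_candidate(entity, node):
    return lambda: entities.setdefault(entity, set()).add(node)


def delete_entity(entity):
    return lambda: entities.pop(entity, None)


def commit_next():
    queue.popleft()()


# node1 unregisters; the leader commits the removal.
queue.append(remove_candidate("entity1", "node1"))
commit_next()

# The listener looks at the candidate list now and finds it empty.
seen_empty = not entities["entity1"]

# Meanwhile node2's add (tx1) has been submitted and is in progress.
queue.append(add_candidate("entity1", "node2"))

# The leader acts on the stale observation and queues the delete (tx2) behind tx1.
if seen_empty:
    queue.append(delete_entity("entity1"))

commit_next()                # tx1 commits: node2 is now a candidate
after_tx1 = dict(entities)
commit_next()                # tx2 commits: entity1 is deleted anyway

print(after_tx1)  # {'entity1': {'node2'}}
print(entities)   # {}
```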

This is why we didn't previously deal with entity removal. Maybe there's some trickiness we can do to alleviate this scenario with more EOS/Shard coupling as Robert mentioned. It's possible the EOS could hook into the can-commit/pre-commit (via Tony's new commit cohort stuff) and inspect the modification and abort an entity delete if it has a candidate.
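
As a rough illustration of that last idea, the check could look something like the following. This is a plain Python model, not the commit cohort API: can_commit_delete and try_delete are hypothetical names, and the only point is that the decision is made against the state at commit time rather than the state seen when the delete was submitted.

```python
def can_commit_delete(entities, entity_id):
    """Hypothetical can-commit check: allow the delete only if no candidates remain."""
    return not entities.get(entity_id)


# State at commit time: entity1 has regained a candidate, entity2 is still empty.
entities = {"entity1": {"node2"}, "entity2": set()}


def try_delete(entity_id):
    if can_commit_delete(entities, entity_id):
        entities.pop(entity_id, None)
        return "committed"
    return "aborted"


first = try_delete("entity1")
second = try_delete("entity2")
print(first, second)  # aborted committed
print(entities)       # {'entity1': {'node2'}}
```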

Comment by viru [ 11/Aug/16 ]

(In reply to Anil Vishnoi from comment #9)
> *** OPNFLWPLUG-618 has been marked as a duplicate of this bug. ***

Hi,

Can you please confirm whether OPNFLWPLUG-618 has been resolved or not?

Generated at Wed Feb 07 19:55:40 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.