[OPNFLWPLUG-322] Need to close the ODL Denial of Service interface Created: 24/Nov/14  Updated: 27/Sep/21  Resolved: 13/Oct/15

Status: Resolved
Project: OpenFlowPlugin
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Jim West Assignee: Michal Rehak
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 2429

 Description   

In some of my testing I've discovered an external message and code path that will cause ODL to fail. If left unfixed, I am concerned that this could be used for a Denial of Service attack.

Internally, we use 'monit' to ensure all core processes are up. The following line was added last month to our monit configuration:
if failed host 127.0.0.1 with port 6653 type TCP for 3 times within 3 cycles then restart

This line causes monit to connect to ODL (via the loop-back interface) and then disconnect. After this line was added, we discovered that our ODL installations would fail after a few days of running.
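
For context, a minimal sketch of what that monit line does at the socket level (an illustration only, assuming plain bash and a netcat build that supports -z; the loop count is arbitrary):

# Hypothetical stand-in for monit's TCP health check: connect to the
# OpenFlow port, then disconnect immediately without ever sending a HELLO.
for (( i=0; i<3; i++ )); do
    nc -z 127.0.0.1 6653
    sleep 1
done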

I opened https://bugs.opendaylight.org/show_bug.cgi?id=2394 to track a specific problem associated with this (OFHandshake threads building up).

When an external entity opens an OpenFlow connection, but does not complete the handshake, I see log messages along the lines of:

2014-11-24 15:17:42.200 UTC [OFHandshake-7-0] WARN o.o.o.o.m.c.ErrorHandlerSimpleImpl - exception -> FIRST HELLO sending failed because of connection issue., session -> null
org.opendaylight.openflowplugin.ConnectionException: FIRST HELLO sending failed because of connection issue.
at org.opendaylight.openflowplugin.openflow.md.core.HandshakeManagerImpl.sendHelloMessage(HandshakeManagerImpl.java:297) ~[bundlefile:na]
at org.opendaylight.openflowplugin.openflow.md.core.HandshakeManagerImpl.shake(HandshakeManagerImpl.java:95) ~[bundlefile:na]
at org.opendaylight.openflowplugin.openflow.md.core.HandshakeStepWrapper.run(HandshakeStepWrapper.java:47) [bundlefile:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]

Resources are not freed and ODL eventually locks up.
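
A quick way to confirm the buildup (a sketch, assuming jstack is on the PATH and $ODL_PID holds the controller's PID) is to count the OFHandshake threads named in the log prefix above:

# Each leaked handshake worker leaves an OFHandshake-* thread behind;
# a count that climbs without bound after aborted connections is the leak.
jstack $ODL_PID | grep -c OFHandshake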

I've patched my code with my patch for 2394, but I am concerned that there are other code paths that I haven't investigated.



 Comments   
Comment by Abhijit Kumbhare [ 24/Feb/15 ]

Jim,

Do you have a patch for this? In case not then we will need to assign this bug to someone.

Thanks,
Abhijit

Comment by Jim West [ 24/Feb/15 ]

Hi Abhijit

I have a patch associated with 2394. With this patch, I don't see the problem any longer. I'm not sure that the patch for 2394 covers all the code paths (that's the reason I opened this bug).

(In reply to Abhijit Kumbhare from comment #1)
> Jim,
>
> Do you have a patch for this? In case not then we will need to assign this
> bug to someone.
>
> Thanks,
> Abhijit

Comment by Abhijit Kumbhare [ 06/May/15 ]

Can we lower this from critical to major or normal?

Comment by Abhijit Kumbhare [ 06/May/15 ]

This needs a retest, as the handshake thread model was changed in the Li codebase, and this bug has to do with close and cleanup after handshake failure.

Comment by Jim West [ 06/May/15 ]

What's the reasoning for lowering the Importance? Is it because we think it's fixed?

(In reply to Abhijit Kumbhare from comment #3)
> Can we lower this from critical to major or normal?

Comment by Abhijit Kumbhare [ 06/May/15 ]

(In reply to Jim West from comment #5)
> What's the reasoning for lowering the Importance? Is it because we think
> it's fixed?
>
> (In reply to Abhijit Kumbhare from comment #3)
> > Can we lower this from critical to major or normal?

Reason: you no longer see the issue with your patch, and the developers may need to concentrate on the other higher-priority bugs at this moment (during the bug-fix cycle). In any case, we need a retest of this (see my previous comment) - can you please retest?

Comment by Jamo Luhrsen [ 07/May/15 ]

This is an easy issue to reproduce with the "nc" Linux tool.

steps:
1. while true; do nc -w 1 172.17.7.22 6633; done
2. get coffee

With a profiler (e.g. jvisualvm) you can watch the thread count rise, or this Linux CLI will work too:

watch -n1 ps -o nlwp $ODL_PID

I've recreated the thread count growth on the Helium release bits, as well as using this distro:
distribution-karaf-0.3.0-20150506.171255-1469.zip

At approximately 30k worker threads, there may be some trouble with OOM or file descriptors. In every case, the controller proves to be useless.
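
To watch the file descriptor side as well (a sketch, Linux-only, assuming $ODL_PID is set as above):

# Poll the controller's open file descriptor count alongside the thread count;
# /proc/<pid>/fd holds one entry per open descriptor.
watch -n1 "ls /proc/$ODL_PID/fd | wc -l"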

Comment by Abhijit Kumbhare [ 09/May/15 ]

We should not reduce the priority of this.

Comment by Jamo Luhrsen [ 11/May/15 ]

This issue is NOT seen in Helium SR3, but is seen in all Helium releases prior to it. I have manually checked this on Helium, Helium-SR1 and Helium-SR2.

Comment by Anil Vishnoi [ 11/May/15 ]

Yes, this issue was fixed in Helium SR3, so this is expected. It means this is a regression on the master branch.

Comment by Jozef Gloncak [ 12/May/15 ]

This patch should fix the problem in the Li codebase:
https://git.opendaylight.org/gerrit/#/c/20097/1

Comment by Michal Rehak [ 14/May/15 ]

Please retest and eventually mark resolved.

Comment by Jamo Luhrsen [ 14/May/15 ]

This is resolved, and automation is there to show it. Search the
flow-services robot suites for the test case titled "Bug Validation.2429".
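
(If you want to locate it, a sketch assuming a local checkout of the integration/test repo; the exact path is my guess:)

# Search the CSIT suites for the validation test case by its title.
grep -rn 'Bug Validation.2429' csit/suites/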

Comment by Peter Gubka [ 25/Jun/15 ]

The test created for this bug started failing for the Lithium redesigned OpenFlow plugin:

in stable/lithium
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cds-flow-services-lithium-redesign-only-stable-lithium/47

and master
https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-1node-cds-flow-services-lithium-redesign-only-master/417

Comment by Michal Rehak [ 25/Jun/15 ]

https://git.opendaylight.org/gerrit/#/c/23279/

Comment by Jamo Luhrsen [ 26/Jun/15 ]

I've fixed the test now:

https://git.opendaylight.org/gerrit/#/c/23285/

Lithium (the release distro) does not fail this test case, and I was able to use the sandbox
to verify the bug was there in a distro from 6/18 (before the fix for OPNFLWPLUG-507 was put in
on 6/19)

marking this RESOLVED/FIXED.

Comment by Jamo Luhrsen [ 25/Sep/15 ]

This issue still seems to be there. It's not showing up in CI for some reason,
but it did happen recently when the releng folks were trying to upgrade their
CI robot VM to CentOS 7. With the CentOS 6 VM this test is passing, so my
guess is that something is broken in that test VM such that this bug is not being
caught by CSIT.

I've reproduced this with the robot suite locally. It has failed with both
the Lithium release as well as the latest Beryllium distro from 9/22.

Comment by Luis Gomez [ 25/Sep/15 ]

Very good, Jamo.

Comment by Abhijit Kumbhare [ 09/Oct/15 ]

Michal,

Can you look at this one before the next SR? I think the next SR is in a month or two.

Abhijit

Comment by Abhijit Kumbhare [ 09/Oct/15 ]

Luis says this fails in both the He & Li designs - is this related to OF Java, Michal Polkorab?

Comment by Michal Rehak [ 12/Oct/15 ]

Hi all,
I just tested stable/lithium HEAD locally, and with both {He,Li}-designs I got no issues:

for (( i=0; i<1000; i++ )); do
    nc -w 1 localhost 6633 | hexdump -C &
    sleep 0.1
    pkill -f 'nc -w'
done

netstat -pn | grep -E '66[35]3' | nl

The controller used roughly 150 threads; while running this cycle, the thread count always stayed below 160.
Both designs behaved the same.
With the sleep line commented out, I got many open TCP connections waiting for close.
When the sleep line was in effect, the connections were closed correctly.
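
For anyone re-running this, a sketch for counting the leftover sockets (same ports as the commands above; I'd expect leaked server-side connections to sit in CLOSE_WAIT):

# Count accepted connections that were never cleaned up by the controller.
netstat -tan | grep -E '66[35]3' | grep -c CLOSE_WAIT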

Comment by Jamo Luhrsen [ 13/Oct/15 ]

(In reply to Michal Rehak from comment #22)
> Hi all,
> I just tested stable/lithium HEAD locally, and with both {He,Li}-designs I
> got no issues:
>
> for (( i=0; i<1000; i++ )); do
> nc -w 1 localhost 6633 | hexdump -C &
> sleep 0.1
> pkill -f 'nc -w'
> done
>
>
> netstat -pn | grep -E '66[35]3' | nl
>
> The controller used roughly 150 threads; while running this cycle, the
> thread count always stayed below 160.
> Both designs behaved the same.
> With the sleep line commented out, I got many open TCP connections waiting
> for close.
> When the sleep line was in effect, the connections were closed correctly.

Yes, you are right. The test was checking quickly enough that it accidentally
counted the still-closing sockets and marked that as a failure. I've pushed a
fix to the test:

https://git.opendaylight.org/gerrit/28356

I verified manually that this new test code will fail on Helium SR2 where
the bug was known to exist, but the test will pass with Lithium SR2.

I also ran extra manual tests on Lithium SR2 to ensure I don't see this
issue, so re-closing the bug now.
