[CONTROLLER-969] Multiple CLOSE_WAIT connections after unhandled memory exception during stress test Created: 29/Oct/14  Updated: 19/Oct/17  Resolved: 19/May/16

Status: Resolved
Project: controller
Component/s: adsal
Affects Version/s: Helium
Fix Version/s: None

Type: Bug
Reporter: EPAMEINONDAS KONTOTHANASIS Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: PC


External issue ID: 2270
Priority: High

 Description   

After a stress test scenario with kc-cbench tool, with a few thousands of
datapaths, a file descriptor leak is taking place due to closed_wait tcp connections. Problem is accumulative until JVM restart.

Steps to reproduce

1) Start only the sdn controller using run.sh
2) run a tool e.g. cbench -s 20000 (kc-cbench)to create thousands of tcp
connections in port 6633 until you see the following error :

2014-10-29 10:46:20.146 EET [ControllerI/O Thread] ERROR o.o.c.l.b.i.UncaughtExceptionPolicy - Thread Thread[ControllerI/O Thread,5,main] died because of an uncaught exception
java.lang.OutOfMemoryError: Direct buffer memory

NOTE:
-The number of connections limit for the error to happen should vary depending on set JVM memory limit or could even not happen if e.g. osgi is not stressed a lot.
Set JVM limit to "-Xmx1G" to avoid this.
-The way kc-cbench sdn performance tool initiates the connections could play a role in this scenario so it is recommended for reproducing it.

3) check with netstat -anp | grep 6633
Many connections stay in close wait state waiting for the controller
to close them but due to the uncaught exception they remain open leading to system resources leak (file descriptors).

tcp6 1 0 127.0.0.1:6633 127.0.0.1:33466 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:50346 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:38301 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:48533 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:37408 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:53192 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:52902 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:34564 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:49069 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:56437 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:58956 CLOSE_WAIT
tcp6 1 0 127.0.0.1:6633 127.0.0.1:47915 CLOSE_WAIT

******************************************************************************
...MANY SIMILAR MESSAGES FOR EACH CONNECTION
******************************************************************************
2014-10-29 10:46:19.650 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:60968 is connected to the Controller
2014-10-29 10:46:19.652 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:58972 is connected to the Controller
2014-10-29 10:46:19.653 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:51264 is connected to the Controller
2014-10-29 10:46:19.655 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:54264 is connected to the Controller
2014-10-29 10:46:19.656 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:36014 is connected to the Controller
2014-10-29 10:46:19.658 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:42616 is connected to the Controller
2014-10-29 10:46:19.659 EET [ControllerI/O Thread] INFO o.o.c.p.o.core.internal.Controller - Switch:127.0.0.1:50015 is connected to the Controller
2014-10-29 10:46:20.146 EET [ControllerI/O Thread] ERROR o.o.c.l.b.i.UncaughtExceptionPolicy - Thread Thread[ControllerI/O Thread,5,main] died because of an uncaught exception
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_67]
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_67]
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_67]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.MessageReadWriteService.<init>(MessageReadWriteService.java:47) ~[na:na]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.SwitchHandler.getMessageReadWriteService(SwitchHandler.java:826) ~[na:na]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.SwitchHandler.setupCommChannel(SwitchHandler.java:812) ~[na:na]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.SwitchHandler.start(SwitchHandler.java:149) ~[na:na]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.Controller.handleNewConnection(Controller.java:247) ~[na:na]
at org.opendaylight.controller.protocol_plugin.openflow.core.internal.ControllerIO$1.run(ControllerIO.java:90) ~[na:na]
at java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_67]

Possible solution :
catch memory exception and close possible connections.
At closed_wait state client should have already sent FIN to controller but due to the exception, some connections are staying in this state since the controller did not have the time to close them and send FIN.



 Comments   
Comment by Carol Sanders [ 04/May/15 ]

This bug is part of the project to Move all ADSAL associated component bugs to ADSAL

Comment by Robert Varga [ 19/May/16 ]

AD-SAL has been removed, this issue will not be fixed.

Generated at Wed Feb 07 19:54:20 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.