[NETCONF-422] java.lang.OutOfMemoryError: GC overhead limit exceeded generates "Master is down" exception
Created: 15/May/17  Updated: 15/Mar/19  Resolved: 16/May/17

Status: Resolved
Project: netconf
Component/s: netconf
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Matej Perina
Assignee: Unassigned
Resolution: Cannot Reproduce
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


External issue ID: 8454

 Description   

I'm running a clustered 3-node ODL with 5 active netconf sessions to a Honeycomb netconf server. After some time, Honeycomb logs:
2017-05-15 13:04:04.423 UTC [netconf-netty-1] WARN o.o.n.i.u.DeserializerExceptionHandler - An exception occurred during message handling
io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: GC overhead limit exceeded
and ODL periodically logs:
Caused by: NetconfDocumentedException{message=RemoteDevice{overcloud-controller-2.opnfv.org}:Master is down.

Please contact me (mperina@cisco.com) for the logs, since their size is over 1 GB.



 Comments   
Comment by A H [ 15/May/17 ]

Is this a blocker bug for Carbon? If so, is there an ETA for when a fix can be completed? If not, could someone from the NETCONF team please retarget the bug for Nitrogen or Carbon SR1?

Comment by Andrej Mak [ 16/May/17 ]

I've found this in master node logs:

2017-05-15 12:29:51,897 | TRACE | lt-dispatcher-31 | NetconfDeviceCommunicator | 299 - org.opendaylight.netconf.sal-netconf-connector - 1.5.0.Carbon | RemoteDevice{overcloud-controller-0.opnfv.org}: Sending message <rpc message-id="m-3327" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
2017-05-15 12:29:51,898 | TRACE | oupCloseable-3-1 | NetconfDeviceCommunicator | 299 - org.opendaylight.netconf.sal-netconf-connector - 1.5.0.Carbon | Finished sending request <rpc message-id="m-3327" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
2017-05-15 12:30:04,211 | DEBUG | oupCloseable-3-1 | NetconfDeviceCommunicator | 299 - org.opendaylight.netconf.sal-netconf-connector - 1.5.0.Carbon | RemoteDevice{overcloud-controller-0.opnfv.org}: Message received <rpc-reply xmlns="urn:ietf:params:xml:ns:netconf:base:1.0" message-id="m-3327">
2017-05-15 12:30:04,213 | TRACE | oupCloseable-3-1 | NetconfDeviceCommunicator | 299 - org.opendaylight.netconf.sal-netconf-connector - 1.5.0.Carbon | RemoteDevice{overcloud-controller-0.opnfv.org}: Matched request: <rpc message-id="m-3327" xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">

The master node sends an rpc to the device. Probably due to "GC overhead limit exceeded" on the device, the rpc-reply reaches ODL about 12s later (request sent at 12:29:51, reply received at 12:30:04). The default ask timeout in odl-netconf-clustered-topology is 5s, hence the "Master is down." message.
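
To make the timing argument concrete, here is a minimal sketch that computes the reply latency from the two log timestamps quoted above and compares it with the 5s ask timeout mentioned in this comment. The function and constant names are hypothetical and only serve the illustration; this is not ODL code.

```python
from datetime import datetime

# Timestamps copied from the master node log above (message-id m-3327).
SENT = "2017-05-15 12:29:51,898"      # "Finished sending request"
RECEIVED = "2017-05-15 12:30:04,211"  # "Message received"

# Default ask timeout as stated in the comment (taken from the thread, not verified here).
ASK_TIMEOUT_S = 5.0

# Karaf-style timestamp with comma-separated milliseconds.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def reply_latency_seconds(sent: str, received: str) -> float:
    """Return the time between sending a request and receiving its reply."""
    sent_at = datetime.strptime(sent, LOG_TIME_FORMAT)
    received_at = datetime.strptime(received, LOG_TIME_FORMAT)
    return (received_at - sent_at).total_seconds()


latency = reply_latency_seconds(SENT, RECEIVED)
timed_out = latency > ASK_TIMEOUT_S
print(f"reply latency: {latency:.3f}s, exceeds ask timeout: {timed_out}")
```

Any reply slower than the ask timeout is reported by the cluster as a failure of the master, even though the master itself is healthy and the device is simply slow.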

The ask timeout can be set when the mount point is created, via the parameter "actor-response-wait-time"; see netconf-node-topology.yang.
In my opinion, you need to somehow throttle reads/writes via netconf, so that honeycomb manages to process messages in time.
Maybe the "concurrent-rpc-limit" parameter from netconf-node-topology.yang could be used. It limits how many rpcs can be sent before the reply to the first one arrives. If the limit is exceeded, it throws an exception.
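
The behaviour described for "concurrent-rpc-limit" (reject a new rpc when too many are still awaiting replies) can be sketched as follows. This is only an illustration of the idea under that description; the class and exception names are hypothetical and it is not how ODL implements the parameter.

```python
import threading


class RpcLimitExceeded(Exception):
    """Raised when too many requests are already awaiting replies."""


class InFlightLimiter:
    """Caps the number of requests sent but not yet answered."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self._slots = threading.BoundedSemaphore(limit)

    def on_send(self) -> None:
        # Fail fast instead of queueing, mirroring "it throws an exception".
        if not self._slots.acquire(blocking=False):
            raise RpcLimitExceeded("too many outstanding requests")

    def on_reply(self) -> None:
        # A reply frees one slot for the next request.
        self._slots.release()


limiter = InFlightLimiter(limit=2)
limiter.on_send()
limiter.on_send()

rejected = False
try:
    limiter.on_send()  # third request while two are still outstanding
except RpcLimitExceeded:
    rejected = True

limiter.on_reply()  # one reply arrives
limiter.on_send()   # a new request is accepted again
```

Failing fast pushes the back-pressure onto the caller, which then has to retry or slow down; that is the throttling effect suggested above for giving honeycomb time to process messages.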

Comment by Robert Varga [ 16/May/17 ]

This is a problem on the SB device, which is running Boron-SR3 code and is probably not sized properly. Lowering priority and targeting Boron-SR4.
