[NETCONF-418] Failed to process scaled data response from the netconf enabled device Created: 10/May/17 Updated: 06/Aug/19 Resolved: 06/Aug/19 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | None |
| Fix Version/s: | Fluorine SR1 |
| Type: | Bug | ||
| Reporter: | BABU K T | Assignee: | Jeff Hartley |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Attachments: |
|
| External issue ID: | 8413 |
| Description |
|
Version: Boron Controller is not able to process scaled data . Xml output is about 145000 lines POST /restconf/operations/network-topology:network-topology/topology/topology-netconf/node/bag2r/yang-ext:mount/show-route:get-route-information HTTP/1.1 Body: Karaf Log: 017-05-10 14:12:12,658 | WARN | oupCloseable-3-7 | NetconfXMLToMessageDecoder | 98 - org.opendaylight.netconf.netty-util - 1.1.1.Boron-SR1 | XML message with unwanted leading bytes detected. Discarded the 1 leading byte(s): '0a' java.util.concurrent.CancellationException: Task was cancelled. at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:392)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Futures$6.run(Futures.java:1308)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.cancel(AbstractFuture.java:134)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Futures$ChainingListenableFuture.cancel(Futures.java:871)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ForwardingFuture.cancel(ForwardingFuture.java:48)[38:com.google.guava:18.0.0] at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade$RequestTimeoutTask.run(KeepaliveSalFacade.java:254)[263:org.opendaylight.netconf.sal-netconf-connector:1.4.1.Boron-SR1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_121] at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_121] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)[:1.8.0_121] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)[:1.8.0_121] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_121] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_121] at java.lang.Thread.run(Thread.java:745)[:1.8.0_121] Caused by: java.util.concurrent.CancellationException: Future.cancel() was called. at com.google.common.util.concurrent.AbstractFuture$Sync.complete(AbstractFuture.java:378)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.cancel(AbstractFuture.java:355)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.cancel(AbstractFuture.java:131)[38:com.google.guava:18.0.0] ... 10 more 2017-05-10 14:14:04,468 | ERROR | ssing-executor-8 | ExecutionList | 38 - com.google.guava - 18.0.0 | RuntimeException while executing runnable com.google.common.util.concurrent.Futures$6@5523530d with executor INSTANCE java.lang.IllegalStateException: RemoteDevice{bag2r} : Unable to reconnect, session listener is missing java.util.concurrent.CancellationException: Task was cancelled. at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:392)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:63)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Futures$6.run(Futures.java:1308)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:457)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.cancel(AbstractFuture.java:134)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.Futures$ChainingListenableFuture.cancel(Futures.java:871)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.ForwardingFuture.cancel(ForwardingFuture.java:48)[38:com.google.guava:18.0.0] at org.opendaylight.netconf.sal.connect.netconf.sal.KeepaliveSalFacade$RequestTimeoutTask.run(KeepaliveSalFacade.java:254)[263:org.opendaylight.netconf.sal-netconf-connector:1.4.1.Boron-SR1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)[:1.8.0_121] at java.util.concurrent.FutureTask.run(FutureTask.java:266)[:1.8.0_121] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)[:1.8.0_121] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)[:1.8.0_121] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)[:1.8.0_121] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)[:1.8.0_121] at java.lang.Thread.run(Thread.java:745)[:1.8.0_121] Caused by: java.util.concurrent.CancellationException: Future.cancel() was called. at com.google.common.util.concurrent.AbstractFuture$Sync.complete(AbstractFuture.java:378)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture$Sync.cancel(AbstractFuture.java:355)[38:com.google.guava:18.0.0] at com.google.common.util.concurrent.AbstractFuture.cancel(AbstractFuture.java:131)[38:com.google.guava:18.0.0] ... 10 more 2017-05-10 14:23:27,708 | ERROR | ssing-executor-4 | ExecutionList | 38 - com.google.guava - 18.0.0 | RuntimeException while executing runnable com.google.common.util.concurrent.Futures$6@19b1fcfc with executor INSTANCE java.lang.IllegalStateException: RemoteDevice{bag2r} : Unable to reconnect, session listener is missing |
| Comments |
| Comment by BABU K T [ 10/May/17 ] |
|
XML Response for the POST query mentioned in Description <errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf"> |
| Comment by Ryan Goulding [ 13/Jun/17 ] |
|
Is keepalive enabled? This is doing a get-config to keep the ssh session alive... so if it is big you may want to disable keepalive (set to 0). |
| Comment by Colin Dixon [ 13/Jun/17 ] |
|
Robert Varga also asked if we have data on how long the response took to come back--rather than just it's size. It would also be good to know if it all came back as one slug of data or trickled in. |
| Comment by BABU K T [ 19/Jun/17 ] |
|
(In reply to Ryan Goulding from comment #2) yes. Keep Alive Enabled. <keepalive-delay xmlns="urn:opendaylight:netconf-node-topology">120</keepalive-delay> And response was sent immediately for post request. |
| Comment by BABU K T [ 19/Jun/17 ] |
|
(In reply to Colin Dixon from comment #3) Netconf device sent response immediately in one shot |
| Comment by Colin Dixon [ 28/Jun/17 ] |
|
(In reply to BABU K T from comment #4) Have you tried setting keepalive to 0 to see if that fixes it? |
| Comment by Tomas Cere [ 29/Jun/17 ] |
|
The fact that odl got suck makes me think it started getting low on memory, I would try increasing the heap and see if you can get it to swallow such a large message. |
| Comment by BABU K T [ 30/Jun/17 ] |
|
(In reply to Tomas Cere from comment #7) Can I do configure something on controller to increase size ? or it should be fixed via code ? |
| Comment by BABU K T [ 02/Jul/17 ] |
|
(In reply to Colin Dixon from comment #6) Setting keepalive to 0 not fixed the issue |
| Comment by BABU K T [ 02/Jul/17 ] |
|
(In reply to BABU K T from comment #9) Tried again by setting keepalive to 0, For 50K lines lines of XML, controller took ~300 secs to process response. For 20 Million lines it took 60 mins. |
| Comment by Sai MarapaReddy [ 18/Jul/17 ] |
|
Questions from kernel call :- |
| Comment by BABU K T [ 11/Aug/17 ] |
|
Heap size data as requested in comment 11 |
| Comment by BABU K T [ 11/Aug/17 ] |
|
Attachment Bug_8413.txt has been added with description: Heap size data as requested in comment 11 |
| Comment by BABU K T [ 11/Aug/17 ] |
|
(In reply to Sai MarapaReddy from comment #11) Please check attachment added in comment 12 . log has been taken with carbon release |
| Comment by Atul Gosain [ 11/Sep/17 ] |
|
Hi As per file uploaded by Babu, i see this information regarding heap. [root@nms5-odl-a ~]# jstat -gcutil 1294 uintx HeapSizePerGCThread = 87241520 {product} uintx InitialHeapSize := 528482304 {product}uintx LargePageHeapSizeThreshold = 134217728 {product} uintx MaxHeapSize := 8434745344 {product}openjdk version "1.8.0_141" using thread-local object allocation. Heap Configuration: |
| Comment by JH [ 22/Sep/17 ] |
|
Blocking issue for the netconf connector team: Long parsing jobs appear to be creating a "head of line blocking" situation here; the core/thread handling the netconf-restconf parsing/translation appears to be preventing keepalives (or other netconf transactions) from happening for the mounted device. Disabling keepalives in a production network is undesirable. Babu has some additional findings on Carbon-SR1: Here's the error message when a keepalive fails due to translation taking a long time to complete: <errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf"> |
| Comment by Tomas Cere [ 04/Oct/17 ] |
|
Let me clarify, we need a performance snapshot from a profiler(for example yourkit) of the whole test run, and a memory snapshot from the moment that odl seems to get stuck. But if this will show that theres no memory leak and just odl running out of memory due to parsing everything in memory you will have to increase the heap size. To increase the heap size you need to edit the karaf setenv script in karaf/bin to something larger than 2gbs. |
| Comment by Jeff Hartley [ 26/Oct/17 ] |
|
Latest testing with Carbon SR-1, JAVA_MAX_MEM=64G... We're at the point that adding more cores and RAM does not improve performance. The earlier discussion helped. The "head of line blocking" is understood to be "as-designed" now, based on previous threads/discussions read. We've done some performance tests comparing a standalone Python Netconf client (PyEz) to the same network element (running JunOS 17.4), against the same RPC (rpc:get_interface_queue_information – essentially show all queue info). PyEz netconf + OpenSSH_7.4p1 w/default settings, 202K lines of xml in ~30sec. Perhaps the best solution for these "very large payload" operational RPCs is to work on tuning the Apache Mina SSH configuration for better performance? Can this be improved via tuning buffer sizes, TCP settings, etc., or does porting a newer version of Mina solve performance issues? Thanks |
| Comment by Robert Varga [ 28/Aug/18 ] |
|
malloc_failed how does Oxygen SR3 look like? Also, can you provide representative data input, so we can provide CPU/mem usage during the request? |
| Comment by Jeff Hartley [ 06/Aug/19 ] |
|
The performance issues have been resolved by a combination of activities since Carbon: |