[NETCONF-584] Netconf connection is very slow in Neon Created: 24/Nov/18 Updated: 31/Jan/19 Resolved: 31/Jan/19 |
|
| Status: | Resolved |
| Project: | netconf |
| Component/s: | netconf |
| Affects Version/s: | Neon |
| Fix Version/s: | Neon |
| Type: | Bug | Priority: | High |
| Reporter: | Luis Gomez | Assignee: | Jakub Morvay |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
Cluster test is currently failing in Neon: The reason is the netconf connection to the test tool takes very long time to get established: From a karaf log in Neon, it takes 35 sec to establish a connection:
2018-11-23T09:56:25,668 | INFO | opendaylight-cluster-data-notification-dispatcher-58 | NetconfTopologyUtils | 300 - org.opendaylight.netconf.topology-singleton - 1.6.0 | RemoteDevice{netconf-test-device} : using the default directory cache/schema
2018-11-23T09:56:25,720 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | NetconfTopologyContext | 300 - org.opendaylight.netconf.topology-singleton - 1.6.0 | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=10.30.170.185}}
2018-11-23T09:56:25,893 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | RemoteDeviceConnectorImpl | 300 - org.opendaylight.netconf.topology-singleton - 1.6.0 | RemoteDevice{netconf-test-device}: Concurrent rpc limit is smaller than 1, no limit will be enforced.
2018-11-23T09:56:43,476 | WARN | sshd-SshClient[37a6a20]-nio2-thread-3 | AcceptAllServerKeyVerifier | 140 - org.apache.sshd.core - 2.0.0 | Server at /10.30.170.185:17830 presented unverified RSA key: SHA256:XjJQHnRvql9JMskfq1z3r4PkLnsid3BfkIudC80RE0k
2018-11-23T09:57:00,299 | INFO | remote-connector-processing-executor-11 | MasterSalFacade | 300 - org.opendaylight.netconf.topology-singleton - 1.6.0 | Device RemoteDevice{netconf-test-device} connected - registering master mount point
2018-11-23T09:57:00,316 | INFO | remote-connector-processing-executor-11 | NetconfDevice | 296 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0 | RemoteDevice{netconf-test-device}: Netconf connector initialized successfully
Same test is Fluorine takes 5 secs:
2018-11-23T21:08:11,758 | INFO | opendaylight-cluster-data-notification-dispatcher-57 | NetconfTopologyUtils | 287 - org.opendaylight.netconf.topology-singleton - 1.5.2 | RemoteDevice{netconf-test-device} : using the default directory cache/schema
2018-11-23T21:08:11,878 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-38 | NetconfTopologyContext | 287 - org.opendaylight.netconf.topology-singleton - 1.5.2 | Master was selected: IpAddress{_ipv4Address=Ipv4Address{_value=10.30.171.2}}
2018-11-23T21:08:12,170 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-38 | RemoteDeviceConnectorImpl | 287 - org.opendaylight.netconf.topology-singleton - 1.5.2 | RemoteDevice{netconf-test-device}: Concurrent rpc limit is smaller than 1, no limit will be enforced.
2018-11-23T21:08:14,221 | WARN | sshd-SshClient[7bacd182]-nio2-thread-4 | AcceptAllServerKeyVerifier | 134 - org.apache.sshd.core - 2.0.0 | Server at /10.30.171.2:17830 presented unverified RSA key: SHA256:jweQ/hbCJY/oBn8B7y8NPGPLTcH1EYgy+dabW5X46gs
2018-11-23T21:08:16,362 | INFO | remote-connector-processing-executor-11 | MasterSalFacade | 287 - org.opendaylight.netconf.topology-singleton - 1.5.2 | Device RemoteDevice{netconf-test-device} connected - registering master mount point
2018-11-23T21:08:16,391 | INFO | remote-connector-processing-executor-11 | NetconfDevice | 282 - org.opendaylight.netconf.sal-netconf-connector - 1.8.2 | RemoteDevice{netconf-test-device}: Netconf connector initialized successfully
Single instance netconf connection in Neon also takes more time than usual (18 secs):
2018-11-20T03:28:50,648 | INFO | opendaylight-cluster-data-notification-dispatcher-43 | AbstractNetconfTopology | 83 - netconf-topology-config - 1.6.0 | Connecting RemoteDevice{Uri{_value=netconf-test-device}} , with config Node{getNodeId=Uri{_value=netconf-test-device}, augmentations={interface org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode=NetconfNode{getActorResponseWaitTime=5, getBetweenAttemptsTimeoutMillis=2000, getConcurrentRpcLimit=0, getConnectionTimeoutMillis=20000, getCredentials=LoginPassword{getPassword=topsecret, getUsername=admin, augmentations={}}, getDefaultRequestTimeoutMillis=60000, getHost=Host{_ipAddress=IpAddress{_ipv4Address=Ipv4Address{_value=10.30.170.145}}}, getKeepaliveDelay=0, getMaxConnectionAttempts=0, getPort=PortNumber{_value=17830}, getSchemaCacheDirectory=schema, getSleepFactor=1.5, isReconnectOnChangedSchema=false, isSchemaless=false, isTcpOnly=false}}}
2018-11-20T03:28:50,651 | INFO | opendaylight-cluster-data-notification-dispatcher-43 | AbstractNetconfTopology | 83 - netconf-topology-config - 1.6.0 | Concurrent rpc limit is smaller than 1, no limit will be enforced for device RemoteDevice{netconf-test-device}
2018-11-20T03:28:52,379 | WARN | sshd-SshClient[78608917]-nio2-thread-7 | AcceptAllServerKeyVerifier | 149 - org.apache.sshd.core - 2.0.0 | Server at /10.30.170.145:17830 presented unverified RSA key: SHA256:TfybKihnH82uOinnT8MzbdILu8oZU2ku//sClM2e67c
2018-11-20T03:29:08,612 | INFO | remote-connector-processing-executor-1 | NetconfDevice | 407 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0 | RemoteDevice{netconf-test-device}: Netconf connector initialized successfully
Finally, I changed testool version in Neon test to Fluorine but this did not help so the issue has to be in the controller itself. |
| Comments |
| Comment by Robert Varga [ 29/Nov/18 ] |
|
JMorvay can this be downgraded to "High"? |
| Comment by Jakub Morvay [ 29/Nov/18 ] |
|
rovarga Hi sorry I missed couple of last minutes of TSC call where you discussed this. Yeah I am downgrading this to "High" priority because connecting a netconf devices is still possible although it is slower than it used to be. It can be related to neon-MRI bump, mostly about bumping SSHD and other dependencies used in netconf protocol SB stack. |
| Comment by Robert Varga [ 24/Jan/19 ] |
|
JMorvay any progress on this? |
| Comment by Jakub Morvay [ 24/Jan/19 ] |
|
Yeah, so I don't think anymore that this has something to do with bumping SSHD, Netty etc. Neon should use the same versions as Fluorine and I didn't see this issue on Fluorine last time I had checked. The strange thing is that there are almost no changes in the netconf SB in neon.. Need to investigate this further and specify the bottleneck component here. Also, I don't think this is related just to cluster-aware netconf topology. I can see in CSIT tests, that connection time went from ~2 seconds to something like ~17 seconds also in 1-node tests. |
| Comment by Jakub Morvay [ 24/Jan/19 ] |
|
From the following logs: 2019-01-24T20:57:46,627 | INFO | opendaylight-cluster-data-notification-dispatcher-58 | AbstractNetconfTopology | 122 - netconf-topology-config - 1.6.0.SNAPSHOT | Connecting RemoteDevice{Uri{_value=17830-sim-device}} , with config Node{getNodeId=Uri{_value=17830-sim-device}, augmentations={interface org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode=NetconfNode{getActorResponseWaitTime=5, getBetweenAttemptsTimeoutMillis=2000, getConcurrentRpcLimit=0, getConnectionTimeoutMillis=20000, getCredentials=LoginPassword{getPassword=admin, getUsername=admin, augmentations={}}, getDefaultRequestTimeoutMillis=60000, getHost=Host{_ipAddress=IpAddress{_ipv4Address=Ipv4Address{_value=127.0.0.1}}}, getKeepaliveDelay=0, getMaxConnectionAttempts=0, getPort=PortNumber{_value=17830}, getSchemaCacheDirectory=schema, getSleepFactor=1.5, isReconnectOnChangedSchema=false, isSchemaless=false, isTcpOnly=false}}}
2019-01-24T20:57:47,038 | INFO | opendaylight-cluster-data-notification-dispatcher-58 | AbstractNetconfTopology | 122 - netconf-topology-config - 1.6.0.SNAPSHOT | Concurrent rpc limit is smaller than 1, no limit will be enforced for device RemoteDevice{17830-sim-device}
2019-01-24T20:58:17,738 | WARN | sshd-SshClient[27e0bd6b]-nio2-thread-3 | AcceptAllServerKeyVerifier | 145 - org.apache.sshd.core - 2.0.0 | Server at /127.0.0.1:17830 presented unverified RSA key: SHA256:ctJTXX3gL4XrZYlWN8pTcPHnzxWTDDHEfAnUe8BI6sc
2019-01-24T20:58:37,866 | DEBUG | nioEventLoopGroupCloseable-3-1 | NetconfDeviceCommunicator | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: Session established
2019-01-24T20:58:37,939 | DEBUG | nioEventLoopGroupCloseable-3-1 | NetconfDevice | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: Session to remote device established with NetconfSessionPreferences{capabilities={urn:ietf:params:netconf:capability:exi:1.0=DeviceAdvertised, urn:ietf:params:netconf:capability:candidate:1.0=DeviceAdvertised, urn:ietf:params:netconf:base:1.1=DeviceAdvertised, urn:ietf:params:netconf:base:1.0=DeviceAdvertised}, moduleBasedCapabilities={(urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring-extension?revision=2013-12-10)ietf-netconf-monitoring-extension=DeviceAdvertised, (urn:ietf:params:xml:ns:yang:ietf-inet-types?revision=2013-07-15)ietf-inet-types=DeviceAdvertised, (urn:ietf:params:xml:ns:yang:ietf-netconf-monitoring?revision=2010-10-04)ietf-netconf-monitoring=DeviceAdvertised, (urn:ietf:params:xml:ns:yang:ietf-yang-types?revision=2013-07-15)ietf-yang-types=DeviceAdvertised}, rollback=false, monitoring=true, candidate=true, writableRunning=false}
...
2019-01-24T20:58:39,896 | DEBUG | nioEventLoopGroupCloseable-3-1 | NetconfRemoteSchemaYangSourceProvider | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: YANG Schema successfully retrieved for ietf-netconf-monitoring:Optional[2010-10-04]
2019-01-24T20:58:39,903 | DEBUG | remote-connector-processing-executor-11 | NetconfDevice | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: Unable to map any source identifiers to a capability reported by device : []
2019-01-24T20:58:40,074 | DEBUG | remote-connector-processing-executor-11 | NetconfDevice | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: Schema context built successfully from [RevisionSourceIdentifier [name=ietf-netconf-monitoring-extension@2013-12-10], RevisionSourceIdentifier [name=ietf-inet-types@2013-07-15], RevisionSourceIdentifier [name=ietf-netconf-monitoring@2010-10-04], RevisionSourceIdentifier [name=ietf-yang-types@2013-07-15]]
2019-01-24T20:58:40,131 | DEBUG | remote-connector-processing-executor-11 | NetconfDeviceSalProvider | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: TOPOLOGY Mountpoint exposed into MD-SAL {closed=0, instance=org.opendaylight.mdsal.dom.spi.SimpleDOMMountPoint@4f1a9a0c}
2019-01-24T20:58:40,184 | INFO | remote-connector-processing-executor-11 | NetconfDevice | 290 - org.opendaylight.netconf.sal-netconf-connector - 1.9.0.SNAPSHOT | RemoteDevice{17830-sim-device}: Netconf connector initialized successfully
we can see that the is problem in establishing connection/session to a device. Schema download and schema context resolution should be fine.. I will have a deeper look into it tomorrow . |
| Comment by Jakub Morvay [ 25/Jan/19 ] |
|
I identified that the following patch seems to cause this problem https://git.opendaylight.org/gerrit/#/c/35136/. ecelgp can you please try to run some CSIT netconf tests wit the above mentioned patch reverted? Here is the revert https://git.opendaylight.org/gerrit/#/c/79910/ |
| Comment by Luis Gomez [ 25/Jan/19 ] |
|
Hi JMorvay, I just launched a test on the revert but unfortunately I still see long time waiting for netconf connection: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-gate-userfeatures-all-neon/7/robot-plugin/log.html.gz |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
ecelgp Well that is strange because it certainly helped me when testing locally. Run without revert applied: 15:12:42.801 INFO [opendaylight-cluster-data-notification-dispatcher-53] Connecting RemoteDevice{Uri{_value=17830-sim-device}} , with config Node{getNodeId=Uri{_value=17830-sim-device}, augmentations={interface org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode=NetconfNode{getActorResponseWaitTime=5, getBetweenAttemptsTimeoutMillis=2000, getConcurrentRpcLimit=0, getConnectionTimeoutMillis=20000, getCredentials=LoginPassword{getPassword=admin, getUsername=admin, augmentations={}}, getDefaultRequestTimeoutMillis=60000, getHost=Host{_ipAddress=IpAddress{_ipv4Address=Ipv4Address{_value=127.0.0.1}}}, getKeepaliveDelay=0, getMaxConnectionAttempts=0, getPort=PortNumber{_value=17830}, getSchemaCacheDirectory=schema, getSleepFactor=1.5, isReconnectOnChangedSchema=false, isSchemaless=false, isTcpOnly=false}}}
15:12:43.545 INFO [opendaylight-cluster-data-notification-dispatcher-53] Concurrent rpc limit is smaller than 1, no limit will be enforced for device RemoteDevice{17830-sim-device}
15:13:40.619 INFO [remote-connector-processing-executor-11] RemoteDevice{17830-sim-device}: Netconf connector initialized successfully
with revert applied: 15:40:56.428 INFO [opendaylight-cluster-data-notification-dispatcher-53] Connecting RemoteDevice{Uri{_value=17830-sim-device}} , with config Node{getNodeId=Uri{_value=17830-sim-device}, augmentations={interface org.opendaylight.yang.gen.v1.urn.opendaylight.netconf.node.topology.rev150114.NetconfNode=NetconfNode{getActorResponseWaitTime=5, getBetweenAttemptsTimeoutMillis=2000, getConcurrentRpcLimit=0, getConnectionTimeoutMillis=20000, getCredentials=LoginPassword{getPassword=admin, getUsername=admin, augmentations={}}, getDefaultRequestTimeoutMillis=60000, getHost=Host{_ipAddress=IpAddress{_ipv4Address=Ipv4Address{_value=127.0.0.1}}}, getKeepaliveDelay=0, getMaxConnectionAttempts=0, getPort=PortNumber{_value=17830}, getSchemaCacheDirectory=schema, getSleepFactor=1.5, isReconnectOnChangedSchema=false, isSchemaless=false, isTcpOnly=false}}}
15:40:57.428 INFO [opendaylight-cluster-data-notification-dispatcher-53] Concurrent rpc limit is smaller than 1, no limit will be enforced for device RemoteDevice{17830-sim-device}
15:41:15.300 INFO [remote-connector-processing-executor-11] RemoteDevice{17830-sim-device}: Netconf connector initialized successfully
The problem seems to be with starting exi communitication on the server side (in this case it's test tool). Where do the CSIT jobs get netconf test-tool from? Maybe the test was run against the testtool without this revert.. But it is strange because you said that you tried this also against Fluorine testtool and it didn't help. So maybe the revert is significant also for netconf client (here it is the ODL). |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
CSIT uses distribution included test-tool. Also according to the CSIT logs, both existing master + revert patch spend ~18 secs in setting the netconf connection for the first time (further deconfigure + configure works fine in test), while the same test in Fluorine takes only ~2 secs. Do you also get ~2 secs when you try locally with the revert patch? |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
ecelgp No I cannot get close to ~2 seconds, but that is because currently I have really slow environment. That means I cannot get close to that time also on Fluorine. But I did get however significant speedup when trying locally with reverted patch. You can see it in the log snippets I provided in my last comment. It's like 1 minute vs 20 seconds. So when you tried to run that CSIT test on the above mentioned revert also distro was built against that patch and test tool as well, right? > netconf connection for the first time (further deconfigure + configure works fine in test) I am not sure if I can observe that in my local environment, I will try that and I will let you know. After all maybe I am wrong and the patch I reverted is not related to this issue. |
| Comment by Robert Varga [ 28/Jan/19 ] |
|
This looks very suspicious. Is this perhaps some sort of key generation? Can you enable debugs on the ODL side, perphas? |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
What DEBUG should I enable for this? |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
ecelgp Just one last shot at netconf-exi capability. Can you please try to run the test on https://git.opendaylight.org/gerrit/#/c/79968/? |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
At first I also suspected it to be something like this. But then it is strange why do not we see it on Fluorine. SSHD version and configuration (at least if it is not configured somewhere else than netconf project) should be the same as on Fluorine. But since that the issue is present just for the first connection maybe some initialization of some EXI communication components that are reused afterwards can be issue here.. |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
This the resulting test: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-gate-userfeatures-all-neon/8/robot-plugin/log.html.gz#s1-s4-s2 If you expand the CRUD suite, you see the first time test-tool connects takes 17 secs, similar thing in CRUD-RPC suite. |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
Yeah, but from https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netconf-csit-1node-gate-userfeatures-all-neon/8/testtool--netconf-gate-userfeatures-txt-CRUD-CRUD.1548709082.46.log.gz I can see that the test-tool still wants to use exi capability even though it should not (the patch I provided, turns EXI off by default). 20:58:03.557 [main] DEBUG o.o.netconf.test.tool.Main - Trying to start netconf test-tool with parameters TesttoolParameters{editContent='null',
async='false',
threadAmount='1',
throttle='5000',
auth='null',
controllerDestination='null',
schemasDir='./schemas',
deviceCount='1',
devicesPerPort='1',
startingPort='17830',
generateConfigsTimeout='1800000',
generateConfigsAddress='127.0.0.1',
distroFolder='null',
generateConfigBatchSize='1',
ssh='true',
exi='true',
debug='true',
notificationFile='null',
mdSal='true',
initialConfigXMLFile='null',
timeOut='20',
ip='0.0.0.0',
threadPoolSize='8',
rpcConfig='null'}
So do you configure EXI option to be true when starting test-tool in the CSIT? Because I cannot see it anywhere, but on the other hand I am not really experienced in robot framework. |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
Actually I can see in the log of the test-tool, exi='true', and this is because test-tool is downloaded from Nexus vs copying from distribution. Let me see if I can change that. |
| Comment by Jakub Morvay [ 28/Jan/19 ] |
|
Yup, if you will be able to change that, please try directly the reverted patch. |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
FYI I tested quick patch that disables EXI in test-tool and it works. Unfortunately current distribution does not contain the test-tool so I cannot deploy from there. Only solution I see to test the patch in CI is merging the patch, after that CSIT will pick the new test-tool artifact from Nexus. |
| Comment by Luis Gomez [ 28/Jan/19 ] |
|
So just to be clear, this patch https://git.opendaylight.org/gerrit/#/c/35136 changes: 1) controller behavior, 2) test-tool behavior, or 3) both? if 1) patch verification test should have worked, if 2) testing with Florine test-tool should have worked, if 3) then we have to merge the revert and trigger a CSIT after to verify. |
| Comment by Jakub Morvay [ 29/Jan/19 ] |
|
I would say that option 3) is most true. Definitely the revert will change the test-tool behavior and that alone should bring some significant performance. But I believe that this issue is also present in netconf SB plugin, that means if we are initializing EXI communication we can see some performance regression comparing to Fluorine. So the revert should help to fix also this issue and we should see similar performance as on Fluorine. But I don't think we have to merge the revert and actually try that. I guess we know where the problem is now. We can discuss that in kernel projects call |
| Comment by Tomas Cere [ 29/Jan/19 ] |
|
I was able to take a look at this and it seems the issue is that testtool hangs while the session is negotiated in EXIParameters.fromXmlElement(). This issue seems very hit and miss in my environment but when i look at this with a profiler it seems That would explain why its only during the first connection attempted since these are only initialized once. |
| Comment by Jakub Morvay [ 29/Jan/19 ] |
|
Yeah I came to the same exact class an method call when debugging this. |