[ODLPARENT-113] After hard reset, Robot fails to establish SSH connection to karaf Created: 24/Aug/17  Updated: 25/Jun/20  Resolved: 25/Jun/20

Status: Resolved
Project: odlparent
Component/s: General
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Bug
Reporter: Vratko Polak Assignee: Stephen Kitt
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: All
Platform: All


Issue Links:
Blocks
is blocked by ODLPARENT-86 Milestore: upgrade karaf to 4.1.2 or ... Verified
Duplicate
is duplicated by NETVIRT-874 CSIT 3node failures - karaf connectio... Resolved
External issue ID: 9044

 Description   

Instead of Odlparent, some other project can be the offender.

This is a regression from Carbon, ant it and affects CSIT. If a workaround is found, severity would be lower. Currently, this is a blocker for Nitrogen release.

I was not able to reproduce this manually in local environment (single node), so this might be something specific to CSIT machines, or Robot framework SSH library.

Hard reset seems to be a necessary condition for this Bug to appear, I have not seen this on 1node CSIT yet. Although it is possible that the difference is just in reset suite not waiting for ODL to finish booting for as long as the initial deploy script does.

So far I have see two robot symptoms:
+ "SSHException: Signature verification (ssh-dss) failed." [0]
+ "SSHException: No existing session" [1]

As the reset suite also connects to karaf ssh (to log a message), we see the first connection works:
2017-08-24 05:36:18,027 | INFO | 6]-nio2-thread-1 | ServerSession | 129 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:36962
2017-08-24 05:36:18,032 | INFO | 6]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 129 - org.apache.sshd.core - 0.14.0 | Generating host key...
2017-08-24 05:36:42,338 | INFO | 6]-nio2-thread-2 | ServerSession | 129 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:36968
2017-08-24 05:36:42,496 | INFO | 6]-nio2-thread-2 | LogAuditLoginModule | 109 - org.apache.karaf.jaas.modules - 4.0.9 | Authentication attempt - karaf
2017-08-24 05:36:42,498 | INFO | 6]-nio2-thread-2 | LogAuditLoginModule | 109 - org.apache.karaf.jaas.modules - 4.0.9 | Authentication succeeded - karaf
2017-08-24 05:36:42,503 | INFO | 6]-nio2-thread-2 | ServerUserAuthService | 129 - org.apache.sshd.core - 0.14.0 | Session karaf@/10.29.4.11:36968 authenticated
2017-08-24 05:36:42,771 | INFO | nsole user karaf | core | 112 - org.apache.karaf.log.core - 4.0.9 | ROBOT MESSAGE: Starting suite /w/workspace/controller-csit-3node-clustering-only-nitrogen/test/csit/suites/controller/dom_data_broker/restart_odl_with_tell_based_false.robot

After reset, it seems that it is the client who decides to refuse the connection [2]:
2017-08-22 21:09:43,294 | INFO | e]-nio2-thread-1 | ServerSession | 183 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:58260
2017-08-22 21:09:43,301 | WARN | e]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 183 - org.apache.sshd.core - 0.14.0 | Unable to read key /tmp/karaf-0.7.0/etc/host.key: java.lang.ClassNotFoundException: org.bouncycastle.jcajce.provider.asymmetric.rsa.BCRSAPrivateCrtKey cannot be found by org.apache.sshd.core_0.14.0
2017-08-22 21:09:43,304 | INFO | e]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 183 - org.apache.sshd.core - 0.14.0 | Generating host key...
2017-08-22 21:09:51,923 | ERROR | e]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 183 - org.apache.sshd.core - 0.14.0 | Overwriting key (host.key) is disabled: using throwaway java.security.KeyPair@17c31afb
2017-08-22 21:10:11,347 | INFO | e]-nio2-thread-3 | ServerSession | 183 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:58266
2017-08-22 21:10:12,668 | INFO | e]-nio2-thread-7 | ServerSession | 183 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:58274
2017-08-22 21:10:13,765 | INFO | e]-nio2-thread-2 | ServerSession | 183 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:58276
2017-08-22 21:10:13,823 | WARN | e]-nio2-thread-5 | ServerSession | 183 - org.apache.sshd.core - 0.14.0 | Exception caught
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)[:1.8.0_141]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)[:1.8.0_141]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)[:1.8.0_141]
at sun.nio.ch.IOUtil.read(IOUtil.java:197)[:1.8.0_141]
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(UnixAsynchronousSocketChannelImpl.java:387)[:1.8.0_141]
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(UnixAsynchronousSocketChannelImpl.java:191)[:1.8.0_141]
at sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(UnixAsynchronousSocketChannelImpl.java:213)[:1.8.0_141]
at sun.nio.ch.EPollPort$EventHandlerTask.run(EPollPort.java:293)[:1.8.0_141]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]

The reset suite kills all ODL members, deletes several directories (including data, but preserving karaf.log), starts members and waits for jolokia to confirm all shards have their leaders elected.

[0] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-nitrogen/123/log.html.gz#s1-s2-k1-k1-k2-k3-k1-k1-k1-k1-k10
[1] https://logs.opendaylight.org/releng/jenkins092/controller-csit-3node-clustering-only-nitrogen/123/log.html.gz#s1-s2-k1-k1-k2-k3-k1-k1-k1-k1-k10
[2] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-nitrogen/123/odl1_karaf.log.gz



 Comments   
Comment by Vratko Polak [ 24/Aug/17 ]

Sandbox shows [3] this happens also in 1-node tests after reset.

Internet search found only one similar report [4], but it is not clear what went wrong.

[3] https://logs.opendaylight.org/sandbox/jenkins091/test-csit-1node-freeze-only-nitrogen/2/log.html.gz#s1-s5-t1-k1-k2-k1-k1-k1-k3-k10
[4] https://github.com/paramiko/paramiko/issues/657

Comment by Sam Hague [ 24/Aug/17 ]

In NetVirt CSIT we see a slightly different exception and some others. Any idea if they are related to this bug?

https://logs.opendaylight.org/releng/jenkins092/netvirt-csit-3node-openstack-ocata-gate-stateful-carbon/18/odl1_karaf.log.gz

2017-08-24 14:05:42,010 | INFO | 4]-nio2-thread-1 | ServerSession | 30 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:53378
2017-08-24 14:05:42,014 | INFO | 4]-nio2-thread-1 | SimpleGeneratorHostKeyProvider | 30 - org.apache.sshd.core - 0.14.0 | Generating host key...
2017-08-24 14:06:17,105 | WARN | 4]-nio2-thread-3 | ServerSession | 30 - org.apache.sshd.core - 0.14.0 | Exception caught
java.lang.IllegalStateException: Unsupported command 90
at org.apache.sshd.common.session.AbstractSession.doHandleMessage(AbstractSession.java:434)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.session.AbstractSession.handleMessage(AbstractSession.java:326)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.session.AbstractSession.decode(AbstractSession.java:780)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.session.AbstractSession.messageReceived(AbstractSession.java:308)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.AbstractSessionIoHandler.messageReceived(AbstractSessionIoHandler.java:54)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.io.nio2.Nio2Session$1.onCompleted(Nio2Session.java:184)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.io.nio2.Nio2Session$1.onCompleted(Nio2Session.java:170)[30:org.apache.sshd.core:0.14.0]
at org.apache.sshd.common.io.nio2.Nio2CompletionHandler$1.run(Nio2CompletionHandler.java:32)
at java.security.AccessController.doPrivileged(Native Method)[:1.8.0_141]
at org.apache.sshd.common.io.nio2.Nio2CompletionHandler.completed(Nio2CompletionHandler.java:30)[30:org.apache.sshd.core:0.14.0]
at sun.nio.ch.Invoker.invokeUnchecked(Invoker.java:126)[:1.8.0_141]
at sun.nio.ch.Invoker$2.run(Invoker.java:218)[:1.8.0_141]
at sun.nio.ch.AsynchronousChannelGroupImpl$1.run(AsynchronousChannelGroupImpl.java:112)[:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)[:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)[:1.8.0_141]
at java.lang.Thread.run(Thread.java:748)[:1.8.0_141]

2017-08-24 14:25:37,735 | WARN | lt-dispatcher-20 | NettyTransport | 178 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to /10.29.12.106:2550 failed with java.io.IOException: Connection reset by peer
2017-08-24 14:25:37,737 | WARN | ult-dispatcher-5 | ReliableDeliverySupervisor | 178 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.12.106:2550] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-08-24 14:25:41,647 | WARN | lt-dispatcher-20 | ClusterCoreDaemon | 178 - com.typesafe.akka.slf4j - 2.4.18 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.29.14.0:2550] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://opendaylight-cluster-data@10.29.12.106:2550, status = Up)]. Node roles [member-1]
2017-08-24 14:25:41,648 | INFO | rd-dispatcher-44 | ShardManager | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName

{name=member-3}, address: akka.tcp://opendaylight-cluster-data@10.29.12.106:2550
2017-08-24 14:25:41,648 | INFO | rd-dispatcher-44 | ShardManager | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-3}

, address: akka.tcp://opendaylight-cluster-data@10.29.12.106:2550
2017-08-24 14:25:41,648 | INFO | rd-dispatcher-32 | EntityOwnershipShard | 196 - org.opendaylight.controller.sal-clustering-commons - 1.5.2.SNAPSHOT | member-1-shard-entity-ownership-operational: onPeerDown: PeerDown [memberName=member-3, peerId=member-3-shard-entity-ownership-operational]
2017-08-24 14:25:42,763 | WARN | lt-dispatcher-36 | NettyTransport | 178 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to null failed with java.net.ConnectException: Connection refused: /10.29.12.106:2550
2017-08-24 14:25:42,764 | WARN | lt-dispatcher-36 | ReliableDeliverySupervisor | 178 - com.typesafe.akka.slf4j - 2.4.18 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.29.12.106:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.29.12.106:2550]] Caused by: [Connection refused: /10.29.12.106:2550]

Comment by Thanh Ha (zxiiro) [ 28/Aug/17 ]

Is this still a blocker for Nitrogen?

Comment by Robert Varga [ 30/Aug/17 ]

According to https://tools.ietf.org/html/rfc4254#section-10, message 90 is SSH_MSG_CHANNEL_OPEN.

Comment by Vratko Polak [ 05/Sep/17 ]

> java.lang.IllegalStateException: Unsupported command 90

I have seen this before, but so far it has never affected test results (if a test failed, the cause turned out to be something else).

Comment by Vratko Polak [ 06/Sep/17 ]

Yes, this is still a blocker.

We are investigating a workaround on Integration/Test (and/or Releng/Builder side) as the connection is perhaps too slow to establish (similarly to ODLPARENT-112) and just retrying few times might help (which would reduce severity).

Comment by Kit Lou [ 08/Sep/17 ]

Vratko, Have you been able to try the workaround? Thanks!

Comment by Vratko Polak [ 11/Sep/17 ]

> just retrying few times might help

It does not help.
The Sandbox run has been deleted over the weekend,
but here is robot console output:

16:35:17 netconf-cluster-stress.txt.Ready.Netconfready :: netconf-connector readines...
16:35:17 ==============================================================================
16:40:26 Check_Whether_Netconf_Topology_Is_Ready :: Checks netconf readiness. | FAIL |
16:40:26 Parent suite setup failed:
16:40:26 Keyword 'KarafKeywords.Open_Controller_Karaf_Console_On_Background' failed after retrying 200 times. The last error was: SSHException: Signature verification (ssh-dss) failed.

Comment by Vratko Polak [ 13/Sep/17 ]

I have found few related facts.

Openssh 7.0 disables ssh-dss [6], making it hard to use "ssh" command when testing manualy. Robot libraries behavior may differ, but I expect it would stop cooperating with karaf 4.0.9 ssh server in future.

Here [7] is a discussion on what to do to make ssh use ssh-dss.

Karaf's own ssh client located in bin/client works on first boot, and stops working after restart.

> Unable to read key /tmp/karaf-0.7.0/etc/host.key

Testing manually, deleting the host.key file before re-starting ODL avoids this Bug symptom. This also works for (current) Robot ssh libraries.

That means we have a workaround to avoid CSIT failures: [8].

This is still a regression from Carbon usability, but severity can be lowered once the new behavior is documented in release notes.

[6] https://www.gentoo.org/support/news-items/2015-08-13-openssh-weak-keys.html
[7] https://community.openhab.org/t/karaf-console/7982/17
[8] https://git.opendaylight.org/gerrit/63035

Comment by Vratko Polak [ 13/Sep/17 ]

> a workaround to avoid CSIT failures: [8].

And this [9] fixes the offline job.

Is there any other test that needs fixing?
If not, severity can be reduced after the two changes are merged.

Proper fix for Karaf SSH server behavior will probably come in the from of Odlparent upgrading to newer Karaf version.

[9] https://git.opendaylight.org/gerrit/62903

Comment by Vratko Polak [ 14/Sep/17 ]

>> a workaround to avoid CSIT failures: [8].
>
> And this [9] fixes the offline job.

Both workarounds merged, reducing severity.

Comment by Michael Vorburger [ 08/Dec/17 ]

For the problem re. ssh-dss there is another workaround mentioned on [3] ;
 
ssh -oHostKeyAlgorithms=+ssh-dss -p 8101 karaf@localhost
 
and a similar configuraiton change can be done in ~/.ssh/config according to ENTESB-4427.

https://github.com/dfarrell07/vagrant-opendaylight/issues/29 uses HostKeyAlgorithms as well.

Comment by Robert Varga [ 25/Jun/20 ]

jluhrsen is this still happening?

Comment by Jamo Luhrsen [ 25/Jun/20 ]

no, I have not seen this in anything I've dug around in. Let's close this.

Generated at Wed Feb 07 20:27:40 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.