[ODLPARENT-113] After hard reset, Robot fails to establish SSH connection to karaf Created: 24/Aug/17 Updated: 25/Jun/20 Resolved: 25/Jun/20 |
|
| Status: | Resolved |
| Project: | odlparent |
| Component/s: | General |
| Affects Version/s: | 2.0.5 |
| Fix Version/s: | None |
| Type: | Bug | ||
| Reporter: | Vratko Polak | Assignee: | Stephen Kitt |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Operating System: All |
||
| Issue Links: |
|
||||||||||||||||
| External issue ID: | 9044 | ||||||||||||||||
| Description |
|
Some project other than Odlparent may be the offender.

This is a regression from Carbon, and it affects CSIT. If a workaround is found, severity would be lower. Currently, this is a blocker for the Nitrogen release.

I was not able to reproduce this manually in a local environment (single node), so this might be something specific to CSIT machines or to the Robot Framework SSH library. Hard reset seems to be a necessary condition for this Bug to appear; I have not seen this on 1node CSIT yet. It is possible, though, that the only difference is that the reset suite does not wait for ODL to finish booting for as long as the initial deploy script does.

So far I have seen two Robot symptoms: As the reset suite also connects to karaf ssh (to log a message), we see the first connection works: After reset, it seems that it is the client who decides to refuse the connection [2]:

The reset suite kills all ODL members, deletes several directories (including data, but preserving karaf.log), starts the members and waits for jolokia to confirm all shards have their leaders elected.

[0] https://logs.opendaylight.org/releng/jenkins092/bgpcep-csit-3node-periodic-bgpclustering-only-nitrogen/123/log.html.gz#s1-s2-k1-k1-k2-k3-k1-k1-k1-k1-k10 |
| Comments |
| Comment by Vratko Polak [ 24/Aug/17 ] |
|
Sandbox shows [3] this happens also in 1-node tests after reset. Internet search found only one similar report [4], but it is not clear what went wrong. [3] https://logs.opendaylight.org/sandbox/jenkins091/test-csit-1node-freeze-only-nitrogen/2/log.html.gz#s1-s5-t1-k1-k2-k1-k1-k1-k3-k10 |
| Comment by Sam Hague [ 24/Aug/17 ] |
|
In NetVirt CSIT we see a slightly different exception and some others. Any idea if they are related to this bug?

2017-08-24 14:05:42,010 | INFO | 4]-nio2-thread-1 | ServerSession | 30 - org.apache.sshd.core - 0.14.0 | Server session created from /10.29.4.11:53378
2017-08-24 14:25:37,735 | WARN | lt-dispatcher-20 | NettyTransport | 178 - com.typesafe.akka.slf4j - 2.4.18 | Remote connection to /10.29.12.106:2550 failed with java.io.IOException: Connection reset by peer
2017-08-24 14:25:41,648 | INFO | rd-dispatcher-44 | ShardManager | 203 - org.opendaylight.controller.sal-distributed-datastore - 1.5.2.SNAPSHOT | Received UnreachableMember: memberName MemberName{name=member-3} , address: akka.tcp://opendaylight-cluster-data@10.29.12.106:2550 |
| Comment by Thanh Ha (zxiiro) [ 28/Aug/17 ] |
|
Is this still a blocker for Nitrogen? |
| Comment by Robert Varga [ 30/Aug/17 ] |
|
According to https://tools.ietf.org/html/rfc4254#section-10, message 90 is SSH_MSG_CHANNEL_OPEN. |
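For reference, the connection-protocol message numbers listed in RFC 4254 fit in a small lookup table. The following is a minimal Python sketch for decoding a number such as 90 when it shows up in a log; the helper name `describe_ssh_message` is hypothetical and not part of any ODL or Karaf code.

```python
# SSH connection-protocol message numbers as listed in RFC 4254.
SSH_CONNECTION_MESSAGES = {
    80: "SSH_MSG_GLOBAL_REQUEST",
    81: "SSH_MSG_REQUEST_SUCCESS",
    82: "SSH_MSG_REQUEST_FAILURE",
    90: "SSH_MSG_CHANNEL_OPEN",
    91: "SSH_MSG_CHANNEL_OPEN_CONFIRMATION",
    92: "SSH_MSG_CHANNEL_OPEN_FAILURE",
    93: "SSH_MSG_CHANNEL_WINDOW_ADJUST",
    94: "SSH_MSG_CHANNEL_DATA",
    95: "SSH_MSG_CHANNEL_EXTENDED_DATA",
    96: "SSH_MSG_CHANNEL_EOF",
    97: "SSH_MSG_CHANNEL_CLOSE",
    98: "SSH_MSG_CHANNEL_REQUEST",
    99: "SSH_MSG_CHANNEL_SUCCESS",
    100: "SSH_MSG_CHANNEL_FAILURE",
}


def describe_ssh_message(number: int) -> str:
    """Return the RFC 4254 name for a message number (hypothetical helper)."""
    # Numbers outside this table belong to other SSH protocol layers.
    return SSH_CONNECTION_MESSAGES.get(
        number, f"not a connection-protocol message ({number})"
    )


print(describe_ssh_message(90))
```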
| Comment by Vratko Polak [ 05/Sep/17 ] |
|
> java.lang.IllegalStateException: Unsupported command 90

I have seen this before, but so far it has never affected test results (if a test failed, the cause turned out to be something else). |
| Comment by Vratko Polak [ 06/Sep/17 ] |
|
Yes, this is still a blocker. We are investigating a workaround on Integration/Test (and/or Releng/Builder side) as the connection is perhaps too slow to establish (similarly to |
| Comment by Kit Lou [ 08/Sep/17 ] |
|
Vratko, Have you been able to try the workaround? Thanks! |
| Comment by Vratko Polak [ 11/Sep/17 ] |
|
> just retrying few times might help

It does not help.

16:35:17 netconf-cluster-stress.txt.Ready.Netconfready :: netconf-connector readines... |
| Comment by Vratko Polak [ 13/Sep/17 ] |
|
I have found a few related facts.

OpenSSH 7.0 disables ssh-dss [6], making it hard to use the "ssh" command when testing manually. The behavior of Robot libraries may differ, but I expect they would stop cooperating with the karaf 4.0.9 ssh server in the future. Here [7] is a discussion on what to do to make ssh use ssh-dss.

Karaf's own ssh client, located in bin/client, works on first boot and stops working after restart.

> Unable to read key /tmp/karaf-0.7.0/etc/host.key

Testing manually, deleting the host.key file before restarting ODL avoids this Bug symptom. This also works for (current) Robot ssh libraries. That means we have a workaround to avoid CSIT failures: [8].

This is still a regression from Carbon usability, but severity can be lowered once the new behavior is documented in release notes.

[6] https://www.gentoo.org/support/news-items/2015-08-13-openssh-weak-keys.html |
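The host.key workaround described in the comment above can be sketched in Python as follows. This is only an illustration, not the change referenced as [8]: the function name `remove_karaf_host_key` is hypothetical, and the sketch assumes the key lives at `etc/host.key` under the Karaf home directory, as the quoted error message suggests.

```python
from pathlib import Path


def remove_karaf_host_key(karaf_home: str) -> bool:
    """Delete etc/host.key under karaf_home (hypothetical helper).

    Returns True if a key file was removed, False if none was present.
    Intended to run while ODL is stopped, before it is started again.
    """
    host_key = Path(karaf_home) / "etc" / "host.key"
    if host_key.is_file():
        host_key.unlink()
        return True
    return False


# Example call (path taken from the quoted error message; adjust as needed):
# remove_karaf_host_key("/tmp/karaf-0.7.0")
```

Returning a boolean instead of raising on a missing file keeps the step safe to run on a first boot, when no key exists yet.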
| Comment by Vratko Polak [ 13/Sep/17 ] |
|
> a workaround to avoid CSIT failures: [8].

And this [9] fixes the offline job. Is there any other test that needs fixing?

A proper fix for the Karaf SSH server behavior will probably come in the form of Odlparent upgrading to a newer Karaf version. |
| Comment by Vratko Polak [ 14/Sep/17 ] |
|
>> a workaround to avoid CSIT failures: [8].

Both workarounds merged, reducing severity. |
| Comment by Michael Vorburger [ 08/Dec/17 ] |
|
For the problem re. ssh-dss there is another workaround mentioned on [3]; https://github.com/dfarrell07/vagrant-opendaylight/issues/29 uses HostKeyAlgorithms as well. |
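As a rough illustration of the HostKeyAlgorithms approach, the Python sketch below builds an OpenSSH client command line that re-enables ssh-dss for a single connection. The defaults are assumptions, not values taken from this issue: port 8101 and user `karaf` are the usual Karaf console defaults, the function name `karaf_ssh_command` is hypothetical, and the option only has an effect on OpenSSH builds that still include DSA support.

```python
import shlex


def karaf_ssh_command(host="localhost", port=8101, user="karaf"):
    """Build an ssh argument list that allows ssh-dss host keys.

    Host, port and user defaults are assumptions (typical Karaf defaults).
    """
    return [
        "ssh",
        "-o", "HostKeyAlgorithms=+ssh-dss",  # append ssh-dss to the accepted set
        "-p", str(port),
        f"{user}@{host}",
    ]


# Print the command as a single shell-ready string.
print(shlex.join(karaf_ssh_command()))
```

The argument list can be passed directly to `subprocess.run` if the command should be executed rather than printed.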
| Comment by Robert Varga [ 25/Jun/20 ] |
|
jluhrsen is this still happening? |
| Comment by Jamo Luhrsen [ 25/Jun/20 ] |
|
no, I have not seen this in anything I've dug around in. Let's close this. |