[ODLPARENT-49] Karaf ssh EOFError (is it due to low entropy, due to Java's default use of blocking /dev/random instead of /dev/urandom?) Created: 23/Sep/16  Updated: 24/Jan/18  Resolved: 18/Jan/17

Status: Resolved
Project: odlparent
Component/s: General
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Michael Vorburger Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Operating System: Linux
Platform: All


Attachments: Text File karaf.jstack.txt    
External issue ID: 6790

 Description   

JamO & others (incl. dfarrell) report fairly frequent EOFErrors in CSIT Robot Suites when ssh-ing into Karaf (i.e. the Karaf SSH console, not the OS-level sshd).



 Comments   
Comment by Michael Vorburger [ 23/Sep/16 ]

https://lists.opendaylight.org/pipermail/dev/2016-September/002704.html : One theory is that this could be due to low entropy. One solution would be to make the JVM running Karaf's SSH server (which, apparently, we have configured to use Bouncy Castle) use the non-blocking /dev/urandom instead of the blocking /dev/random.

https://brooklyn.apache.org/documentation/increase-entropy.html describes similar issues in another Java-based system, so this may be related.

https://lists.opendaylight.org/pipermail/dev/2016-September/002727.html => https://git.opendaylight.org/gerrit/#/c/45749/
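As a minimal sketch of the "use /dev/urandom" idea at the Java API level (an illustration only, not how Karaf or its SSH server actually obtains randomness), code that creates its own SecureRandom can request the SUN provider's NativePRNGNonBlocking algorithm, which on Unix-like systems reads /dev/urandom for both nextBytes() and generateSeed(). It exists from Java 8 on; the fallback to new SecureRandom() on other platforms, and the class name NonBlockingRandom, are assumptions for illustration.

```java
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

public class NonBlockingRandom {

    /**
     * Hypothetical helper: returns a SecureRandom backed by /dev/urandom when the
     * SUN provider offers "NativePRNGNonBlocking" (Java 8+, Unix-like OS), and the
     * platform default SecureRandom otherwise.
     */
    static SecureRandom nonBlocking() {
        try {
            return SecureRandom.getInstance("NativePRNGNonBlocking");
        } catch (NoSuchAlgorithmException e) {
            // e.g. on Windows, where the NativePRNG family is not available
            return new SecureRandom();
        }
    }

    public static void main(String[] args) {
        SecureRandom random = nonBlocking();
        byte[] key = new byte[32];
        random.nextBytes(key); // should not wait for the kernel entropy estimate
        System.out.println(random.getAlgorithm() + " produced " + key.length + " bytes");
    }
}
```

This only helps code that constructs its own SecureRandom; libraries that call new SecureRandom() internally still follow the JVM-wide configuration, which is why the ticket pursues a JVM option instead.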

Comment by Michael Vorburger [ 23/Sep/16 ]

https://lists.opendaylight.org/pipermail/dev/2016-September/002734.html Ryan Goulding confirms that: "In our downstream internal CI, I had to
make adjustments to seed from /dev/urandom instead as we were experiencing
hanging tests (especially for netconf through sshd-core when mounting
several devices)."

Comment by Michael Vorburger [ 23/Sep/16 ]

https://lists.opendaylight.org/pipermail/dev/2016-September/002785.html Jamo Luhrsen reports entropy was 168 in /proc/sys/kernel/random/entropy_avail when this happened.
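The entropy figure above comes from a Linux-only kernel file. A small sketch for reading it from Java, for example to log it next to a failing test, could look like this; the class and method names are hypothetical, and -1 signals that the file could not be read (e.g. on a non-Linux host).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EntropyCheck {

    // Linux kernel's current estimate of the entropy pool, in bits.
    static final Path ENTROPY_AVAIL = Paths.get("/proc/sys/kernel/random/entropy_avail");

    /** Returns the kernel's entropy estimate, or -1 if it cannot be read or parsed. */
    static int entropyAvailable() {
        // /proc files report a size of 0, so read line-wise instead of by file size.
        try (BufferedReader reader = Files.newBufferedReader(ENTROPY_AVAIL, StandardCharsets.US_ASCII)) {
            String line = reader.readLine();
            return line == null ? -1 : Integer.parseInt(line.trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int bits = entropyAvailable();
        System.out.println(bits < 0
                ? "entropy_avail is not readable on this system"
                : "entropy_avail = " + bits + " bits");
    }
}
```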

https://git.opendaylight.org/gerrit/#/c/45760/ is a change to use -Djava.security.egd=file:/dev/./urandom on start up (based on http://stackoverflow.com/a/2325109/421602).
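To check which entropy source a JVM is actually configured with, a sketch like the following prints the relevant settings. According to the comments in the JDK's java.security file, the java.security.egd system property overrides the securerandom.source security property. The class name EgdReport is hypothetical, and the program reports only on its own JVM, so it would have to run with the same options as Karaf to say anything about Karaf.

```java
import java.security.SecureRandom;
import java.security.Security;

public class EgdReport {

    /** The -Djava.security.egd system property, if set, takes precedence over securerandom.source. */
    static String effectiveSource() {
        String egd = System.getProperty("java.security.egd");
        return egd != null ? egd : Security.getProperty("securerandom.source");
    }

    public static void main(String[] args) {
        System.out.println("java.security.egd    = " + System.getProperty("java.security.egd"));
        System.out.println("securerandom.source  = " + Security.getProperty("securerandom.source"));
        System.out.println("effective source     = " + effectiveSource());
        System.out.println("default SecureRandom = " + new SecureRandom().getAlgorithm());
    }
}
```

Running it as java -Djava.security.egd=file:/dev/./urandom EgdReport should show file:/dev/./urandom as the effective source; the unusual /dev/./ spelling follows the Stack Overflow answer linked above.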

https://lists.opendaylight.org/pipermail/dev/2016-September/002786.html disagrees that low entropy could be the root cause of this problem.

http://www.2uo.de/myths-about-urandom/ is an interesting read on this topic.

Comment by Stephen Kitt [ 23/Sep/16 ]

(In reply to Michael Vorburger from comment #3)
> https://lists.opendaylight.org/pipermail/dev/2016-September/002786.html
> disagrees that low entropy could be the root cause of this problem.

But that might well be wrong, given that OpenSSH uses /dev/urandom anyway.

Comment by Michael Vorburger [ 23/Sep/16 ]

FTR: https://bugs.openjdk.java.net/browse/JDK-4705093 = http://bugs.java.com/view_bug.do?bug_id=4705093

https://docs.oracle.com/cd/E13209_01/wlcp/wlss30/configwlss/jvmrand.html

Comment by Michael Vorburger [ 23/Sep/16 ]

Once we have confirmation that with the merge of the https://git.opendaylight.org/gerrit/#/c/45760/ (now cleaned up/refined, thanks Stephen Kitt!) this problem disappears, we should also:

A. Open a bug + pull request (skitt) on upstream Karaf (4) to do the same as we do

B. Open a bug on https://bugzilla.redhat.com to suggest that perhaps OpenJDK RPM packages could "Change the default for java.security in $JAVA_HOME/jre/lib/security/java.security from file:/dev/random to file:/dev/urandom", some time. "If deemed too risky a change for a Java 8 security patch, perhaps consider this for Java 9 packages? If this is controversial, perhaps it's time to re-raise this upstream on OpenJDK?"

Comment by Michael Vorburger [ 11/Jan/17 ]

https://git.opendaylight.org/gerrit/#/c/50327/ fixes SingleFeatureTest in this regard.

Attached a jstack of the Karaf stuck due to this.

Comment by Michael Vorburger [ 11/Jan/17 ]

Attachment karaf.jstack.txt has been added with description: jstack of Karaf stuck in netconf SSH server init due to low entropy

Comment by Vratko Polak [ 12/Jan/17 ]

This is still affecting CSIT. See comment [0].

Either releng/builder manages to prepare machines with sufficient entropy, or we should switch to testing Karaf with the "-Djava.security.egd=file:/dev/./urandom" option added.

[0] https://git.opendaylight.org/gerrit/#/c/50362/3

Comment by Vratko Polak [ 17/Jan/17 ]

More info from https://lists.opendaylight.org/pipermail/integration-dev/2017-January/008955.html

> [3] https://git.opendaylight.org/gerrit/#/c/45760

Oh, that was merged a long time ago,
I see karaf started with
-Djava.security.egd=file:/dev/./urandom
already, so something is not working right.
Are we sure Karaf console ssh server takes this option into account?

If the option worked, we would not need more entropy on ODL_SYSTEM.

Vratko.

Comment by Vratko Polak [ 18/Jan/17 ]

https://git.opendaylight.org/gerrit/50594 claims to fix the CSIT failures.

If that is true, this error was not about entropy after all.
The Karaf SSH server was just slow for some reason, and Robot SSHLibrary, with its default timeout, was not giving a helpful failure message.

There is still some chance that the Karaf SSH server is slow because it blocks on /dev/random, but that would need closer examination to decide.
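One way to start that closer examination is to time how long seeding takes on a CI machine. The sketch below (hypothetical class name, arbitrary 5-second timeout) runs SecureRandom.generateSeed on a daemon thread and reports whether it finished in time. Which device backs generateSeed depends on the JDK version, the selected algorithm, and the egd/securerandom.source settings, so it should be run with the same JVM options as Karaf to be meaningful.

```java
import java.security.SecureRandom;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SeedBlockingProbe {

    /** Returns milliseconds taken by generateSeed(numBytes), or -1 if it did not finish in time. */
    static long timeSeed(SecureRandom random, int numBytes, long timeoutMillis) {
        // Daemon thread: a read blocked on /dev/random usually ignores interrupts,
        // and must not keep the JVM alive after we give up waiting.
        ExecutorService executor = Executors.newSingleThreadExecutor(task -> {
            Thread thread = new Thread(task, "seed-probe");
            thread.setDaemon(true);
            return thread;
        });
        try {
            long start = System.nanoTime();
            Future<byte[]> seed = executor.submit(() -> random.generateSeed(numBytes));
            try {
                seed.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            } catch (TimeoutException e) {
                seed.cancel(true);
                return -1;
            } catch (ExecutionException e) {
                throw new IllegalStateException("generateSeed failed", e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the caller's interrupt status
                return -1;
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long millis = timeSeed(new SecureRandom(), 32, 5_000);
        System.out.println(millis < 0
                ? "generateSeed(32) did not finish within 5 s"
                : "generateSeed(32) took " + millis + " ms");
    }
}
```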

Generated at Wed Feb 07 20:27:29 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.