[INFRAUTILS-54] Genius CSIT 3 Node fails following infrautils changes in diagstatus Created: 27/Sep/18 Updated: 03/Oct/18 Resolved: 02/Oct/18 |
|
| Status: | Resolved |
| Project: | infrautils |
| Component/s: | None |
| Affects Version/s: | Neon |
| Fix Version/s: | Neon |
| Type: | Bug | Priority: | Highest |
| Reporter: | nidhi adhvaryu | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Description |
|
Genius CSIT 3 node is failing on master in ShowSVCStatus. Due to this, https://git.opendaylight.org/gerrit/#/c/76345/ changes it is failing which is regarding Diagstatus and Cluster. Please Find Logs here, https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/builder-copy-sandbox-logs/434/nidhigenius-csit-3node-gate-all-neon/10/ |
| Comments |
| Comment by Michael Vorburger [ 28/Sep/18 ] |
|
I'll look more into and sort this out (fix it) on Monday, I promise! |
| Comment by nidhi adhvaryu [ 28/Sep/18 ] |
|
Thank you Michael. |
| Comment by Michael Vorburger [ 01/Oct/18 ] |
|
enidadh I just looked more into it now, but I don't see the actual error yet... the odl_1-3/odl1-3_karaf.log.gz do not contain any ERROR (which seems related to this) and the Robot Log shows: "Keyword 'ClusterManagement.Check Status Of Services Is OPERATIONAL' failed after retrying for 1 minute. The last error was: '[?1l>[?2004l[?1h=[?2004hopendaylight-user@root>' does not contain 'ACTIVE'" but does not show what the output of the "showSvcStatus -n 10.30.170.65" it seems to do actually is - or I'm just not seeing it? That output should contain the failure, hopefully with exception, which I need to understand what broke here... can someone help find that? jluhrsen perhaps CSIT output could be improved? |
| Comment by Michael Vorburger [ 01/Oct/18 ] |
|
Just FTR, a freshly built local genius without clustering is OK, so this is something particular in a clustered CSIT: opendaylight-user@root>showSvcStatus Timestamp: Mon Oct 01 12:29:52 CEST 2018 Node IP Address: 127.0.0.1 System is operational: true System ready state: ACTIVE OPENFLOW : OPERATIONAL (OF::PORTS:: 6653 and 6633 are up.) IFM : OPERATIONAL ITM : OPERATIONAL OVSDB : OPERATIONAL (OVSDB initialization complete) DATASTORE : OPERATIONAL I need to see the output like above from the clustered CSIT to understand what's failing... |
| Comment by Jamo Luhrsen [ 01/Oct/18 ] |
|
that's the exact problem. there is no output coming back. It's just coming back with the maybe you can try jhershbe's cluster tool to pop up a local cluster to reproduce it? |
| Comment by Faseela K [ 02/Oct/18 ] |
|
vorburger csit does log the showsvcstatus output, may be it is not priniting anything. Does the karaf log help? |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
> maybe you can try jhershbe's cluster tool to pop up a local cluster to reproduce it? https://github.com/jhershberg/netvirt-ha-docker sounds useful here - lemme see if I can get this up and running |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
I've just tried it! The basics seem to work just fine.. here are my exact steps (using netvirt instead of genius, but is most probably the same): sudo setenforce 0 but showSvcStatus -n indeed seems to have somehow broken, I'll dig into this: ./karaf-client 1 PS: I initially got confused until I realized that ./karaf-client 1 showSvcStatus -n 172.28.5.1 does not work because of this problem, which is totally unrelated to what we are chasing here. (PPS: Also stumbled upon |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
Looking at the code of DiagStatusCommand it sees to be virtually impossible for it to return without printing anything - it should either at the very least have "Timestamp: ..." OR an Exception - unless somehow it's possible that there is nothing in the log (I checked) and shell:stack-traces-print is broken?! I can try a custom build with an extra catch, but this is weird... |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
> try a custom build with an extra catch with https://git.opendaylight.org/gerrit/#/c/76562/, I'm finally seeing this: java.lang.IncompatibleClassChangeError: Found interface org.opendaylight.infrautils.diagstatus.ClusterMemberInfo, but class was expected at Proxy566b334a_4fa8_428a_a448_4444fecf36c0.isLocalAddress(Unknown Source) at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.getNodeSpecificStatus(DiagStatusCommand.java:118) at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.execute2(DiagStatusCommand.java:66) at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.execute(DiagStatusCommand.java:50) at org.apache.karaf.shell.impl.action.command.ActionCommand.execute(ActionCommand.java:84) at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:68) at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:86) at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:571) at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:497) at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:386) at org.apache.felix.gogo.runtime.Pipe.doCall(Pipe.java:417) at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:229) at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:59) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) PS: I'm also seeing that during boot there are a lot of new exceptions while we are not yet ready... |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
> java.lang.IncompatibleClassChangeError: Found interface org.opendaylight.infrautils.diagstatus.ClusterMemberInfo, but class was expected similar to but not exactly https://issues.apache.org/jira/browse/KARAF-5086 https://git.opendaylight.org/gerrit/#/c/76564/ fixes this. |
| Comment by Michael Vorburger [ 02/Oct/18 ] |
|
> PS: I'm also seeing that during boot there are a lot of new exceptions while we are not yet ready... That's those below, but that is just because the ./wait-for-active script from https://github.com/jhershberg/netvirt-ha-docker uses showSvcStatus "too soon". If it used /diagstatus instead, then this wouldn't happen. This therefore also should not happen in CSIT and real world, so ignoring this here: 2018-10-02T14:49:48,323 | INFO | pipe-showsvcstatus | DiagStatusCommand | 265 - org.opendaylight.infrautils.diagstatus-shell - 1.
5.0.SNAPSHOT | invokeRemoteJMXOperation() JMX service URL: service:jmx:rmi://172.28.5.2/jndi/rmi://172.28.5.2:6886/server
2018-10-02T14:49:48,323 | INFO | pipe-showsvcstatus | DiagStatusCommand | 265 - org.opendaylight.infrautils.diagstatus-shell - 1.
5.0.SNAPSHOT | fetching status summary for node : 172.28.5.2
2018-10-02T14:49:48,331 | ERROR | pipe-showsvcstatus | DiagStatusCommand | 265 - org.opendaylight.infrautils.diagstatus-shell - 1.
5.0.SNAPSHOT | Exception while reaching Host 172.28.5.2
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Co
nnection refused to host: 172.28.5.2; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:369) [?:?]
at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270) [?:?]
at org.opendaylight.infrautils.diagstatus.MBeanUtils.invokeRemoteMBeanOperation(MBeanUtils.java:159) [263:org.opendaylight.infrautils.diagstatus-api:1.5.0.SNAPSHOT]
at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.getRemoteStatusSummary(DiagStatusCommand.java:104) [265:org.opendaylight.infrautils.diagstatus-shell:1.5.0.SNAPSHOT]
at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.execute2(DiagStatusCommand.java:76) [265:org.opendaylight.infrautils.diagstatus-shell:1.5.0.SNAPSHOT]
at org.opendaylight.infrautils.diagstatus.shell.DiagStatusCommand.execute(DiagStatusCommand.java:50) [265:org.opendaylight.infrautils.diagstatus-shell:1.5.0.SNAPSHOT]
at org.apache.karaf.shell.impl.action.command.ActionCommand.execute(ActionCommand.java:84) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:68) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:86) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:571) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:497) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:386) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Pipe.doCall(Pipe.java:417) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:229) [126:org.apache.karaf.shell.core:4.1.5]
at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:59) [126:org.apache.karaf.shell.core:4.1.5]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
at java.lang.Thread.run(Thread.java:748) [?:?]
Caused by: javax.naming.ServiceUnavailableException
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:136) ~[?:?]
at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:205) ~[?:?]
at javax.naming.InitialContext.lookup(InitialContext.java:417) ~[?:?]
at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1955) ~[?:?]
at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1922) ~[?:?]
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287) ~[?:?]
... 18 more
Caused by: java.rmi.ConnectException: Connection refused to host: 172.28.5.2; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:619) ~[?:?]
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216) ~[?:?]
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202) ~[?:?]
at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:338) ~[?:?]
at sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:112) ~[?:?]
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:132) ~[?:?]
at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:205) ~[?:?]
at javax.naming.InitialContext.lookup(InitialContext.java:417) ~[?:?]
at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1955) ~[?:?]
at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1922) ~[?:?]
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:287) ~[?:?]
... 18 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:?]
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:?]
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
at java.net.Socket.connect(Socket.java:589) ~[?:?]
at java.net.Socket.connect(Socket.java:538) ~[?:?]
at java.net.Socket.<init>(Socket.java:434) ~[?:?]
at java.net.Socket.<init>(Socket.java:211) ~[?:?]
at sun.rmi.transport.proxy.RMIDirectSocketFactory.createSocket(RMIDirectSocketFactory.java:40) ~[?:?]
at sun.rmi.transport.proxy.RMIMasterSocketFactory.createSocket(RMIMasterSocketFactory.java:148) ~[?:?]
at sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:613) ~[?:?]
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:216) ~[?:?]
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202) ~[?:?]
...
2018-10-02T14:49:48,341 | INFO | pipe-showsvcstatus | DiagStatusCommand | 265 - org.opendaylight.infrautils.diagstatus-shell - 1.5.0.SNAPSHOT | invokeRemoteJMXOperation() JMX service URL: service:jmx:rmi://172.28.5.3/jndi/rmi://172.28.5.3:6886/server
2018-10-02T14:49:48,342 | INFO | pipe-showsvcstatus | DiagStatusCommand | 265 - org.opendaylight.infrautils.diagstatus-shell - 1.5.0.SNAPSHOT | fetching status summary for node : 172.28.5.3
> with https://git.opendaylight.org/gerrit/#/c/76562/, I'm finally seeing this: let's discuss this further in |
| Comment by nidhi adhvaryu [ 03/Oct/18 ] |
|
Thank you Michael. |