[INFRAUTILS-33] Expose ready and/or diagstatus via a non-authenticated URL Created: 11/Apr/18  Updated: 22/Aug/18  Resolved: 02/Jul/18

Status: Resolved
Project: infrautils
Component/s: diagstatus
Affects Version/s: None
Fix Version/s: Oxygen-SR3, Fluorine

Type: Improvement Priority: Medium
Reporter: Michael Vorburger Assignee: Michael Vorburger
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Blocks
blocks INTTEST-45 use REST for diagstatus instead of ch... Open
blocks INFRAUTILS-45 Use /diagstatus JSON to fetch remote ... Verified
Relates
relates to INFRAUTILS-47 diagstatus HTTP GET differs in respon... Resolved
relates to INFRAUTILS-46 diagstatus error code is 418 instead ... Resolved

 Description   

https://bugzilla.redhat.com/show_bug.cgi?id=1549218 and related internal discussion have revealed that it would be useful for the TripleO installation orchestrator to health check a non-authenticated URL for ready and/or diagstatus status.

They cannot use the jolokia JMX HTTP bridge URL, which requires authentication, at the point where they need to make that check.

It should not be very hard to write a Servlet which exposes information similar to that of the diag CLI command, with an HTTP status code (like 200 vs 503) and the body containing output similar to that CLI command.

This will not break the existing diagstatus JMX bean exposed via (authenticated) /jolokia, but will be in addition to that.



 Comments   
Comment by Michael Vorburger [ 16/Apr/18 ]

jluhrsen, trozet, JankiChhatbar (and FYI k.faseela) I've started looking into this, and want to clarify 2 points before coding:

  • the /diagstatus would return a JSON; not human readable text - makes sense, right? FYI the format would be as in org.opendaylight.infrautils.diagstatus.internal.DiagStatusServiceMBeanImpl.acquireServiceStatusAsJSON(), or close to it.  (There are just two minor things there to clarify which don't make sense to me, yet.)
  • contrary to the diagstatus:showSvcStatus CLI command, I do not want all that JMX stuff we have in that code related to querying cluster nodes in this feature; my view is that whatever external system queries for such status has to already know about, or have an external way to find, the cluster members, and query them individually; that model seems more robust than one node (which could be down!) itself querying others. This was a personal view I've previously held, but seeing that is how e.g. Prometheus.io goes about the same for metrics, seems to confirm this is the right approach.

Are both OK for all of you?

Comment by Jamo Luhrsen [ 16/Apr/18 ]

Jamo Luhrsen, Tim Rozet, Janki Chhatbar (and FYI Faseela K) I've started looking into this, and want to clarify 2 points before coding:

the /diagstatus would return a JSON; not human readable text - makes sense, right? FYI the format would be as in org.opendaylight.infrautils.diagstatus.internal.DiagStatusServiceMBeanImpl.acquireServiceStatusAsJSON(), or close to it. (There are just two minor things there to clarify which don't make sense to me, yet.)

yeah, json response is fine. The healthcheck might/could just be as dumb as sending a curl to this endpoint and marking things
healthy if we get a 200 response. Maybe it wouldn't even worry about the body of the response. I mean, it could and that would
probably be smarter, but it doesn't have to.

contrary to the diagstatus:showSvcStatus CLI command, I do not want all that JMX stuff we have in that code related to querying cluster nodes in this feature; my view is that whatever external system queries for such status has to already know about, or have an external way to find, the cluster members, and query them individually; that model seems more robust than one node (which could be down!) itself querying others. This was a personal view I've previously held, but seeing that is how e.g. Prometheus.io goes about the same for metrics, seems to confirm this is the right approach.

Are both OK for all of you?

I'm not totally following what you are getting at here. I think this healthcheck endpoint is on a container by container basis
and there doesn't need to be any broader knowledge of anything other than if that specific ODL container is healthy or
not. But, maybe you are getting at a more intelligent healthcheck that not only tells us if that one container is healthy, but
also if that container is also running in a healthy cluster?

Comment by Faseela K [ 17/Apr/18 ]

Agree with vorburger, the CLI is just an example implementation of a client to show how status can be retrieved from all nodes. The orchestrator will definitely know about all 3 nodes in a cluster, so it should be easy for the orchestrator to fetch the status from all nodes by constructing the REST URLs specific to the node IPs.

Comment by Janki Chhatbar [ 17/Apr/18 ]

The healthcheck just curls the specified URL and says the container is healthy if the exit code of the command is 0. This is run from inside the container, so each ODL container in a cluster will check its own health. Hence there is no need to worry about the cluster setup. They won't even know whether they are in a cluster.

Comment by Muthukumaran Kothandaraman [ 17/Apr/18 ]

A minor clarification. Does the orchestrator use a static address list and curl the URL on each address?

When one of the cluster nodes is down, I assume it would be treated as a timeout on curl for the corresponding node if the list is static.

Comment by Jamo Luhrsen [ 17/Apr/18 ]

@everyone, I think we should ignore anything to do with clustering in this jira

@janki, I don't think healthcheck has to be a simple curl, only. We can do whatever crazy bash-fu we want to do. So, in the end we could do curl to get the full response
body and pipe it to some script that does anything we want and that script can set the exit code to 0 if healthy, or not.
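A minimal sketch of such a script, written in Python rather than bash, assuming the response body is JSON with a boolean isOperational field (the format under discussion in this issue). The function names and the choice of 1 as the unhealthy exit code are hypothetical:

```python
import json
import sys


def exit_code_for(body_text):
    """Return 0 if the body reports isOperational == true, otherwise 1."""
    try:
        status = json.loads(body_text)
    except ValueError:
        # Empty or non-JSON body (e.g. an error page): treat as unhealthy.
        return 1
    if not isinstance(status, dict):
        return 1
    return 0 if status.get("isOperational") is True else 1


def main(stream=None):
    """Read the response body from the given stream (stdin by default)."""
    stream = sys.stdin if stream is None else stream
    return exit_code_for(stream.read())


# To use this as a healthcheck script, end the file with: sys.exit(main())
```

The response body would then be piped in, for example from `curl -s` against the diagstatus URL, and the script's exit code decides whether the container is marked healthy.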

Comment by Michael Vorburger [ 20/Apr/18 ]

Completed implementation, see 3 linked Gerrit reviews (c/70987, c/71168 and c/71172).

With this, the odl-infrautils-diagstatus feature will expose this on http://localhost:8181/diagstatus/ :

{
  "timeStamp": "Fri Apr 20 16:41:26 CEST 2018",
  "isOperational": true,
  "systemReadyState": "ACTIVE",
  "statusSummary": []
}

in infrautils/common/karaf with no other features, or e.g. with odl-netvirt-openstack like this:

{
  "timeStamp": "Fri Apr 20 17:09:33 CEST 2018",
  "isOperational": true,
  "systemReadyState": "ACTIVE",
  "statusSummary": [
    {
      "serviceName": "OPENFLOW",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "switch connections started",
      "statusTimestamp": "2018-04-20T15:08:29.713Z"
    },
    {
      "serviceName": "IFM",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "Service started",
      "statusTimestamp": "2018-04-20T15:08:17.268Z"
    },
    {
      "serviceName": "ITM",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "Service started",
      "statusTimestamp": "2018-04-20T15:08:19.617Z"
    },
    {
      "serviceName": "ELAN",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "Service started",
      "statusTimestamp": "2018-04-20T15:08:20.191Z"
    },
    {
      "serviceName": "DATASTORE",
      "effectiveStatus": "OPERATIONAL",
      "reportedStatusDescription": "OPERATIONAL",
      "statusTimestamp": "2018-04-20T15:09:33.509Z"
    }
  ]
}
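A minimal sketch of how a client could pick out problem services from a body like the ones above, assuming that any effectiveStatus other than "OPERATIONAL" counts as a problem. The function name is hypothetical, and only the "OPERATIONAL" value is taken from the samples:

```python
import json


def failing_services(body_text):
    """Return the names of services whose effectiveStatus is not OPERATIONAL."""
    status = json.loads(body_text)
    return [
        entry.get("serviceName", "<unnamed>")
        for entry in status.get("statusSummary", [])
        # Assumption: every value other than "OPERATIONAL" is a problem.
        if entry.get("effectiveStatus") != "OPERATIONAL"
    ]
```

For the two sample bodies above this returns an empty list, since every listed service is OPERATIONAL.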

BTW jluhrsen's idea in INFRAUTILS-31 would be quite a nice complementary addition here. Anyone willing to wrap that up?

FYI http://localhost:8181/diagstatus/ will return the following HTTP status codes:

  • 418 / 503 (see INFRAUTILS-46) if isOperational = false, with details in the JSON; either systemReadyState not ACTIVE (=technical OSGi problem) OR systemReadyState = ACTIVE but one of the entries in statusSummary NOK (= more of a "functional" problem)
  • 401 Problem accessing /diagstatus/. Reason: Unauthorized indicates, confusingly, NOT that authentication is missing, but that there is nothing at that URL - i.e. infrautils has not even started (then things are REALLY screwed)
  • 500 if infrautils is up but has hit an internal error (details should be in karaf.log)
  • 200 OK if isOperational = true and all is well (will also have details in the JSON)
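
The list above can be sketched as a small lookup that a healthcheck client might use. The wording of each meaning is paraphrased from the list, and the names and the fallback text are hypothetical:

```python
# Meanings paraphrased from the status code list above; label strings are illustrative.
DIAGSTATUS_MEANINGS = {
    200: "operational (details in the JSON body)",
    418: "not operational (details in the JSON body)",
    503: "not operational (details in the JSON body)",
    401: "nothing at /diagstatus/ (infrautils has not started)",
    500: "infrautils internal error (details should be in karaf.log)",
}


def interpret_status(http_status):
    """Map an HTTP status code from /diagstatus/ to a short explanation."""
    return DIAGSTATUS_MEANINGS.get(http_status, "unexpected status code")


def is_healthy(http_status):
    """Only 200 means isOperational = true."""
    return http_status == 200
```

A status-only healthcheck needs nothing more than `is_healthy`; `interpret_status` is only useful for logging why a check failed.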

HTH.

Comment by Michael Vorburger [ 25/Apr/18 ]

This is all completely finished, done and dusted now - from my side.

Comment by Michael Vorburger [ 31/May/18 ]

jluhrsen and thapar have internally suggested that this would be good to have not only on master for Fluorine but also on stable/oxygen already. I'm therefore reopening this issue and will look into doing the back-port, hopefully some time next week.

Comment by Michael Vorburger [ 01/Jun/18 ]

jluhrsen is now using jolokia/exec/org.opendaylight.infrautils.diagstatus/... instead of /diagstatus, so my understanding is that there is no immediate need / request anymore to back-port this to stable/oxygen after all, so closing it again.

Comment by Michael Vorburger [ 20/Jun/18 ]

Re-opening this for backporting to Oxygen SR3, so that GENIUS-138 is really useful...

Generated at Wed Feb 07 20:02:05 UTC 2024 using Jira 8.20.10#820010-sha1:ace47f9899e9ee25d7157d59aa17ab06aee30d3d.