[GENIUS-138] Improve Datastore Cluster diagstatus to indicate if node is currently an isolated leader
Created: 14/May/18 Updated: 02/Jul/18 Resolved: 02/Jul/18 |
|
| Status: | Resolved |
| Project: | genius |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Oxygen-SR3, Fluorine |
| Type: | Improvement | Priority: | Medium |
| Reporter: | Michael Vorburger | Assignee: | Michael Vorburger |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Description |
|
trozet in private internal email asked this re. new /diagstatus from
I'm personally not familiar enough with clustering implementation details yet, but assuming that this information is available somewhere in controller, and hopefully already exposed by its JMX bean/s, then it may not be hard to improve our org.opendaylight.genius.mdsalutil.diagstatus.internal.DatastoreServiceStatusProvider to report a ServiceDescriptor in ServiceState.ERROR if it's an isolated leader? |
| Comments |
| Comment by Muthukumaran Kothandaraman [ 14/May/18 ] |
|
We typically use the following curl commands on all 3 nodes of an ODL cluster to collect the shard details. The output is the JMX output of all shards, so we need to parse the "RaftState" attribute for each shard. RaftState can carry the values Leader and Follower as well as "Isolated Leader". The output also provides other useful information such as involuntary leader moves, log index, repl index etc. We typically run the curl commands below from an external script (shell, python etc.), either for polling or for on-demand dumping via the Jolokia endpoint.

For DS to report ERROR status, we can follow simple logic as below: if the RaftState of any shard has a value other than Leader or Follower, then declare the entire DS (across shards) to be in ERROR state. Of course, that may not sound like a purist definition, but for all practical purposes it should be good enough.

For all Operational Shards
=====================
curl -u $jolokia_username:$jolokia_password "http://$cluster_node_ip_address:$jolokia_port/jolokia/read/org.opendaylight.controller:type=DistributedOperationalDatastore,Category=Shards,name=*"

For all Config Shards
================
curl -u $jolokia_username:$jolokia_password "http://$cluster_node_ip_address:$jolokia_port/jolokia/read/org.opendaylight.controller:type=DistributedConfigDatastore,Category=Shards,name=*"
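
The "any shard other than Leader or Follower means ERROR" rule above could be sketched in Java roughly as follows. This is only an illustration, not the code that was eventually merged: the class `ShardStateCheck` and its methods are hypothetical names, the input is assumed to be a map from shard name to the RaftState string already extracted from the JMX/Jolokia output, and the exact spelling of the isolated-leader value (for example "IsolatedLeader" versus "Isolated Leader") is an assumption that does not affect the logic.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper illustrating the suggested rule; not the actual genius implementation. */
final class ShardStateCheck {

    // Raft states treated as healthy, per the rule suggested in the comment above.
    private static final Set<String> HEALTHY_STATES = Set.of("Leader", "Follower");

    private ShardStateCheck() {
    }

    /** Returns the shards (name to state) whose RaftState is neither Leader nor Follower. */
    static Map<String, String> unhealthyShards(Map<String, String> raftStateByShard) {
        Map<String, String> unhealthy = new TreeMap<>();
        for (Map.Entry<String, String> entry : raftStateByShard.entrySet()) {
            String state = entry.getValue();
            // A missing state is treated as unhealthy as well (an assumption of this sketch).
            if (state == null || !HEALTHY_STATES.contains(state)) {
                unhealthy.put(entry.getKey(), state);
            }
        }
        return unhealthy;
    }

    /** True if at least one shard is in a state other than Leader or Follower. */
    static boolean isDatastoreInError(Map<String, String> raftStateByShard) {
        return !unhealthyShards(raftStateByShard).isEmpty();
    }
}
```

Returning the offending shards, and not just a boolean, lets a caller put the details into a status description.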
|
| Comment by Tim Rozet [ 15/May/18 ] |
|
Yeah, I was more specifically looking for an HTTP error code if the node is isolated leader, rather than parsing HTTP response output. Thought maybe that could be exposed with the new diagstatus. |
| Comment by Michael Vorburger [ 06/Jun/18 ] |
|
Could I suggest that instead of piling on more untraceable black magic with typeless querying of internal MBeans from one project to another we use this issue as the opportunity to do this right and strongly typed? Either entirely move org.opendaylight.genius.mdsalutil.diagstatus.internal.DatastoreServiceStatusProvider into controller (with a dependency from controller to infrautils.diagstatus; would that be acceptable tpantelis and rovarga ?) OR create a real public OSGi Service API to query the required information in controller, and have genius...DatastoreServiceStatusProvider invoke that. |
| Comment by Robert Varga [ 06/Jun/18 ] |
|
JMX is available locally (https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/management/ManagementFactory.html#getPlatformMBeanServer()), no need to connect all over the place. It does not really make sense to monitor remote nodes (i.e. if the cluster does not work, why do you think you can still access that node?). As for diagstatus, I don't particularly care as long as there is equivalent information available from a single source and the framework we are going to (diagstatus) is clearly superior to the one we are leaving (JMX). If anyone disagrees with that point, I will underline the difference in management of your typical router via CLI and SNMP – this difference comes from having two disjoint management infrastructures, where SNMP is a second class citizen (because admins interact with the router via CLI, not through SNMP). |
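
As a rough illustration of the local JMX access mentioned above, the sketch below reads the RaftState attribute of every shard MBean through an in-process MBeanServer, without HTTP or Jolokia. The ObjectName domain and key properties are taken from the Jolokia URLs quoted earlier in this issue; the class name `LocalShardStates` and its method are hypothetical, and this is not the code that was merged for this issue.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper; reads shard Raft states from an in-process MBeanServer. */
final class LocalShardStates {

    private LocalShardStates() {
    }

    /**
     * Returns shard name to RaftState for one datastore type, for example
     * "DistributedConfigDatastore" or "DistributedOperationalDatastore".
     */
    static Map<String, String> read(MBeanServer server, String datastoreType) throws JMException {
        // Same MBean pattern as in the Jolokia URLs above; name=* matches every shard.
        ObjectName pattern = new ObjectName(
                "org.opendaylight.controller:type=" + datastoreType + ",Category=Shards,name=*");
        Map<String, String> states = new TreeMap<>();
        for (ObjectName name : server.queryNames(pattern, null)) {
            Object state = server.getAttribute(name, "RaftState");
            states.put(name.getKeyProperty("name"), String.valueOf(state));
        }
        return states;
    }

    /** Convenience overload using the JVM's platform MBeanServer. */
    static Map<String, String> read(String datastoreType) throws JMException {
        return read(ManagementFactory.getPlatformMBeanServer(), datastoreType);
    }
}
```

Outside a running controller no MBean matches the pattern, so the returned map is simply empty.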
| Comment by Michael Vorburger [ 07/Jun/18 ] |
|
https://lists.opendaylight.org/pipermail/controller-dev/2018-June/014404.html |
| Comment by Michael Vorburger [ 08/Jun/18 ] |
|
MuthukumaranK thanks for tips above, that was very helpful, and is now codified in c/72797. rovarga thank you for your feedback above; we'll keep DatastoreServiceStatusProvider in genius; no move to controller planned (by me). |
| Comment by Michael Vorburger [ 08/Jun/18 ] |
|
> instead of piling on more untraceable black magic with typeless querying of internal MBeans from one project to another

72789 addresses this in a way that satisfies me. |
| Comment by Michael Vorburger [ 14/Jun/18 ] |
|
This is now fully done, and just merged: diagstatus (whether via its CLI or /diagstatus or /jolokia URLs) will now report OPERATIONAL for the DATASTORE only if all Shards are Leaders or Followers; if they are in other states (which presumably includes an isolated leader; the original requirement above), the status will be ERROR and the Description will provide details. Took a couple of things to get this, see topic:GENIUS-138. |
| Comment by Michael Vorburger [ 20/Jun/18 ] |
|
Re-opening this for backporting to Oxygen SR3 (requested by trozet)... |
| Comment by Michael Vorburger [ 20/Jun/18 ] |
|
> Re-opening this for backporting to Oxygen SR3 (requested by trozet)...

which would only really be useful if we also back-port |
| Comment by Faseela K [ 02/Jul/18 ] |
|
vorburger : Merged all the patches under the topic. Please double check if everything is in, and if so we can close the JIRA |