Service Status Dashboard
Open, Normal, Public

Assigned To

None

Authored By

	vanmeeuwen
	Apr 20 2015, 5:02 AM

Description

We want to provide publicly available “service status” information for our customers and for monitoring service quality ourselves. For this we need a dedicated page (which does not necessarily need to be part of the HKCCP) to collect and display status information.

Ideally, the page is updated automatically, without human intervention, so that it always reflects the actual system status.

Some examples of such pages are:

The specific design for the page will be provided by the same designers who designed our websites.

Implementation Design Considerations

These are existing environment details:

System status is monitored, and alerts are raised, through Nagios, which therefore holds all performance data.
Trending happens through Munin, and the data is stored in RRD databases.
Puppet is used to manage the configuration of systems during their lifetime.

The goal might therefore include:

Current system health is derived from Nagios status indicators,
Certain aspects of current system health are contained within Munin RRD database files (disk latency for example),
Puppet catalogs with pending actions can give us an indication of 'future stability', since every pending configuration change naturally carries a level of risk.

The following links already exist:

A node in an environment has a configuration management resource applied to it, such as 'imap::frontend',
We therefore know that, provided the domain name space and environment stage match, the node participates in providing the service,
It becomes a host group,
The number of functional members of the host group divided by the total number of members of the host group gives us a ratio.

The latter doesn't precisely indicate the level of "functionality" that remains, since the "service" as such may still be "fully functional". However, it is relatively easy to achieve and possibly a good first milestone. It means reading Nagios configuration (using shinken.objects.config.Config) and status.dat.

This could result in something similar to:

{
  "Development::Hypervisors": {
    "host_health": 95.83,
    "hosts": {
      "81aebb894c2003fa4068b83148112343": 95.83,
      "f1654b11144ac890ec1826f20fbcd421": 95.83
    },
    "last_check": "1456474751",
    "last_state_change": "1455153473",
    "service_health": 95.83,
    "services": {
      "Alive": 100.0,
      "Disk": 100.0,
      "Kernel Update": 50.0,
      "Load Average": 100.0,
      "Munin Process": 100.0,
      "NRPE Process": 100.0,
      "NTP": 100.0,
      "Puppet Memory": 100.0,
      "Puppet Process": 100.0,
      "SSH Service": 100.0,
      "Total Processes": 100.0,
      "Zombie Processes": 100.0
    }
  },
  "host_health": 99.6,
  "service_health": 99.6
}
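As a rough sketch of the ratio described above (functional members of a host group divided by the total number of members), the following Python reads host states from status.dat-style text. It is an illustration only: the sample data, the host names hv1 and hv2, the function names, and the HOSTGROUPS mapping are all assumptions. Real host group membership would come from the Nagios configuration rather than a hard-coded dictionary.

```python
def parse_status(text):
    """Parse status.dat-style text into a list of (block_type, fields) tuples."""
    blocks, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if current is None and line.endswith("{"):
            # A line such as "hoststatus {" opens a new block.
            current = (line[:-1].strip(), {})
        elif current is not None and line == "}":
            blocks.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[1][key] = value
    return blocks


def group_health(blocks, hostgroups):
    """Return the percentage of functional members for each host group."""
    # In Nagios, a host current_state of 0 means UP.
    up = {
        fields.get("host_name")
        for kind, fields in blocks
        if kind == "hoststatus" and fields.get("current_state") == "0"
    }
    health = {}
    for group, members in hostgroups.items():
        if members:  # skip empty groups to avoid dividing by zero
            functional = sum(1 for member in members if member in up)
            health[group] = round(100.0 * functional / len(members), 2)
    return health


# Hypothetical sample data, not taken from a real system.
SAMPLE = """
hoststatus {
    host_name=hv1
    current_state=0
    }
hoststatus {
    host_name=hv2
    current_state=1
    }
"""
HOSTGROUPS = {"Development::Hypervisors": ["hv1", "hv2"]}

print(group_health(parse_status(SAMPLE), HOSTGROUPS))
```

With the sample above, one of the two members is up, so the group is reported at 50.0. The same approach could be extended to servicestatus blocks to fill in the per-service figures.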

Details

Ticket Type: Epic

Related Objects

Status	Assigned	Task
Open	None	T31 Service Status Dashboard
		Restricted Maniphest Task
		Restricted Maniphest Task
Spite	vanmeeuwen	T1069 Status Board based on existing Nagios state.

Event Timeline

vanmeeuwen created this task. Apr 20 2015, 5:02 AM

vanmeeuwen raised the priority of this task from to 60.

vanmeeuwen updated the task description.

vanmeeuwen added projects: Architecture & Design, Product Owners.

vanmeeuwen changed Ticket Type from Task to Epic.

vanmeeuwen subscribed.

grote moved this task from Incoming to In Triage on the Product Owners board. Apr 20 2015, 5:06 AM

There are unanswered questions about the sources of data, metrics calculation and status granularity to display.

We have collected the requirements for the status dashboard in a dedicated wiki page.

Please specify what questions are left unanswered.

They're a good set of indications of what metrics could be included, but they do not describe how the data should be retrieved in real life:

A remote host can be used to determine latency, but cannot be made to poll each individual participant in providing a service unless a VPN is also created.

A local host (in the current infrastructure) can pull information from existing monitoring and trending, but is likely to suffer from issues in that infrastructure just as other local systems would, potentially to the point of becoming dysfunctional itself.

What level of integration with existing monitoring (Nagios) and trending (Munin) should the status board provide? This point distinguishes writing new software from extending existing software (with monitoring plugins).

Should this dashboard include status messaging? Should it include service window scheduling?

What status should an unreported service result in (gray, stale)?

vanmeeuwen moved this task from Backlog to Elaboration on the Architecture & Design board. Apr 27 2015, 9:44 PM

vanmeeuwen merged a task: Restricted Maniphest Task. May 26 2015, 2:39 PM

vanmeeuwen added a subscriber: petersen.

petersen added a project: Restricted Project. May 26 2015, 2:50 PM

petersen added a subscriber: • seigo.

Cachet HQ is a Free Software dashboard that we might want to consider using for this. There is a public demo as well.

grote raised the priority of this task from 60 to High. Jun 17 2015, 2:41 PM

grote lowered the priority of this task from High to 60. Jul 1 2015, 2:29 PM

vanmeeuwen removed a project: Architecture & Design. Dec 8 2015, 9:24 AM

vanmeeuwen edited projects, added Architecture & Design; removed Restricted Project, Product Owners. Feb 27 2016, 6:56 PM

vanmeeuwen moved this task from Elaboration to Inception on the Architecture & Design board.

vanmeeuwen updated the task description. Feb 28 2016, 3:03 PM

vanmeeuwen closed subtask Restricted Maniphest Task as Invalid.

vanmeeuwen awarded a token. Feb 28 2016, 3:32 PM

A feature with high marketing value. Something that has been requested a lot.

petersen awarded a token. Feb 28 2016, 3:42 PM

Given that our current service health percentage is 99.71%, some services ("Kernel update installed, running an older kernel") may need to be excluded; this could be achieved by recognizing the service type.
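A minimal sketch of such an exclusion, assuming the overall figure is a plain average of per-service percentages and that services are matched by their description. The EXCLUDED set and the service_health function are hypothetical names; matching on an actual service type would need whatever type information the monitoring configuration exposes.

```python
# Hypothetical exclusion list; "Kernel Update" is the service name used in
# the example output in the task description.
EXCLUDED = frozenset({"Kernel Update"})


def service_health(services, excluded=EXCLUDED):
    """Average the per-service percentages, skipping excluded services."""
    considered = [pct for name, pct in services.items() if name not in excluded]
    if not considered:
        return None  # nothing left to report on
    return round(sum(considered) / len(considered), 2)


# Illustrative values only.
services = {"Alive": 100.0, "Disk": 100.0, "Kernel Update": 50.0}

print(service_health(services, excluded=frozenset()))  # all services counted
print(service_health(services))                        # "Kernel Update" skipped
```

Keeping the exclusion list as data rather than logic means it could later be generated from the service definitions instead of being maintained by hand.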

vanmeeuwen moved this task from Inception to Construction on the Architecture & Design board. Feb 28 2016, 4:00 PM

vanmeeuwen created subtask T1069: Status Board based on existing Nagios state. Feb 28 2016, 4:03 PM

vanmeeuwen closed subtask T1069: Status Board based on existing Nagios state. as Resolved. Mar 2 2016, 3:59 PM

vanmeeuwen changed the status of subtask T1069: Status Board based on existing Nagios state. from Resolved to Spite. Aug 23 2016, 2:27 PM

pasik subscribed. Nov 25 2017, 2:38 PM

vanmeeuwen lowered the priority of this task from 60 to Normal. Mar 28 2019, 8:13 AM
