
Service Status Dashboard
Open, Normal, Public

Description

We want to provide publicly available “service status” information for our customers and for monitoring service quality ourselves. For this we need a dedicated page (which does not necessarily need to be part of the HKCCP) to collect and display status information.

Ideally, the page updates automatically, without human intervention, so that it always reflects the actual system status.

Some examples of such pages are:

The specific design for the page will be provided by the same designers that also designed our websites.

Implementation Design Considerations

These are existing environment details:

  • Nagios monitors the status of systems and sends alerts, and is therefore the holder of all performance data.
  • Trending is handled by Munin, which stores its data in RRD databases.
  • Puppet is used to manage the configuration of systems during their lifetime.

The goal might therefore include:

  1. Current system health is derived from Nagios status indicators.
  2. Certain aspects of current system health are contained within Munin RRD database files (disk latency, for example).
  3. Puppet catalogs with pending actions can give us an indication of 'future stability', as every pending configuration change naturally carries some level of risk.

The following links already exist:

  • A node in an environment has a configuration management resource applied to it, such as 'imap::frontend',
  • We therefore know that, provided the domain name space and environment stage match, the node participates in providing the service,
  • It becomes a host group,
  • The number of functional members of the host group divided by the total number of members of the host group gives us a ratio.

The latter doesn't precisely indicate the level of "functionality" that remains, since the "service" as such may still be "fully functional". However, it is relatively easy to achieve and possibly a good first milestone. It means reading Nagios configuration (using shinken.objects.config.Config) and status.dat.
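The ratio described above can be sketched roughly as follows. This is a minimal sketch, not the intended implementation: it assumes the usual status.dat layout of `hoststatus { key=value ... }` blocks in which `current_state=0` means the host is up, it takes host group membership as a plain list instead of reading it through shinken.objects.config.Config, and the function names and host names (`hv01`, `hv02`) are hypothetical.

```python
import re

# Matches one "hoststatus { ... }" block of a Nagios status.dat file.
HOSTSTATUS_RE = re.compile(r"^hoststatus\s*\{(.*?)^\s*\}", re.DOTALL | re.MULTILINE)


def parse_host_states(status_text):
    """Return {host_name: current_state} for every hoststatus block."""
    states = {}
    for block in HOSTSTATUS_RE.findall(status_text):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        if "host_name" in fields and "current_state" in fields:
            states[fields["host_name"]] = int(fields["current_state"])
    return states


def host_group_health(members, states):
    """Percentage of group members whose state is 0 (UP); None for an empty group."""
    if not members:
        return None
    functional = sum(1 for host in members if states.get(host) == 0)
    return round(100.0 * functional / len(members), 2)


# Hypothetical sample data: one host up, one host down.
SAMPLE = """
hoststatus {
    host_name=hv01
    current_state=0
    }

hoststatus {
    host_name=hv02
    current_state=1
    }
"""

states = parse_host_states(SAMPLE)
health = host_group_health(["hv01", "hv02"], states)
print(health)  # 50.0
```

A host that is a member of the group but missing from status.dat counts as non-functional here; whether that is the desired behaviour is an open design choice.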

This could result in something similar to:

{
  "Development::Hypervisors": {
    "host_health": 95.83,
    "hosts": {
      "81aebb894c2003fa4068b83148112343": 95.83,
      "f1654b11144ac890ec1826f20fbcd421": 95.83
    },
    "last_check": "1456474751",
    "last_state_change": "1455153473",
    "service_health": 95.83,
    "services": {
      "Alive": 100.0,
      "Disk": 100.0,
      "Kernel Update": 50.0,
      "Load Average": 100.0,
      "Munin Process": 100.0,
      "NRPE Process": 100.0,
      "NTP": 100.0,
      "Puppet Memory": 100.0,
      "Puppet Process": 100.0,
      "SSH Service": 100.0,
      "Total Processes": 100.0,
      "Zombie Processes": 100.0
    }
  },
  "host_health": 99.6,
  "service_health": 99.6
}
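As an arithmetic check on the example above: the group-level "service_health" of 95.83 equals the unweighted mean of the twelve per-service percentages. Whether the value is actually meant to be computed this way is an assumption; the sketch below only shows that the numbers are consistent with it, and the function name is hypothetical.

```python
# Per-service percentages copied from the example output.
services = {
    "Alive": 100.0,
    "Disk": 100.0,
    "Kernel Update": 50.0,
    "Load Average": 100.0,
    "Munin Process": 100.0,
    "NRPE Process": 100.0,
    "NTP": 100.0,
    "Puppet Memory": 100.0,
    "Puppet Process": 100.0,
    "SSH Service": 100.0,
    "Total Processes": 100.0,
    "Zombie Processes": 100.0,
}


def mean_health(percentages):
    """Unweighted mean of the per-service percentages, rounded to two decimals."""
    values = list(percentages.values())
    return round(sum(values) / len(values), 2)


print(mean_health(services))  # 95.83
```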

Details

Ticket Type
Epic

Related Objects

Status    Assigned      Task
Open      None
Spite     vanmeeuwen

Event Timeline

vanmeeuwen raised the priority of this task from to 60.
vanmeeuwen updated the task description.
vanmeeuwen changed Ticket Type from Task to Epic.
vanmeeuwen subscribed.

There are unanswered questions about the sources of data, metrics calculation and status granularity to display.

We have collected the requirements for the status dashboard in a dedicated wiki page.

Please specify what questions are left unanswered.

The requirements are a good indication of which metrics could be included, but they do not describe how the data should be retrieved in practice:

A remote host can be used to determine latency, but cannot poll each individual participant in providing a service unless a VPN is also set up.

A local host (in the current infrastructure) can pull information from existing monitoring and trending, but is likely to be affected by problems in that infrastructure just as other local systems would be, potentially up to and including becoming dysfunctional itself.

What level of integration with existing monitoring (Nagios) and trending (Munin) should the status board provide? This point distinguishes writing new software from extending existing software (with monitoring plugins).

Should this dashboard include status messaging? Should it include service window scheduling?

What status should an unreported service result in (gray, stale)?
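One possible way to approach the last question, sketched under assumptions: treat an entry as stale when its "last_check" timestamp (an epoch value, as in the example output in the description) is older than some threshold, and as unreported when there is no timestamp at all. The status names, the function name and the 15-minute threshold are hypothetical, not decisions made in this ticket.

```python
import time

STALE_AFTER_SECONDS = 15 * 60  # hypothetical threshold


def report_status(last_check, now=None, stale_after=STALE_AFTER_SECONDS):
    """Classify a check result by the age of its last_check epoch timestamp."""
    if last_check is None:
        return "unreported"
    now = time.time() if now is None else now
    if now - int(last_check) > stale_after:
        return "stale"
    return "current"


last_check = "1456474751"  # value taken from the example output
print(report_status(last_check, now=1456474751 + 60))    # current
print(report_status(last_check, now=1456474751 + 3600))  # stale
print(report_status(None))                               # unreported
```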

vanmeeuwen merged a task: Restricted Maniphest Task. (May 26 2015, 2:39 PM)
vanmeeuwen added a subscriber: petersen.
petersen added a project: Restricted Project. (May 26 2015, 2:50 PM)
petersen added a subscriber: seigo.

Cachet HQ is a Free Software status dashboard that we might want to consider using for this. There is a public demo as well.

grote raised the priority of this task from 60 to High. (Jun 17 2015, 2:41 PM)
grote lowered the priority of this task from High to 60. (Jul 1 2015, 2:29 PM)
vanmeeuwen moved this task from Elaboration to Inception on the Architecture & Design board.
vanmeeuwen closed subtask Restricted Maniphest Task as Invalid.
vanmeeuwen closed subtask Restricted Maniphest Task as Invalid.

This feature has high marketing value and has been requested a lot.

Given that our current service health percentage is 99.71%, some services (such as "Kernel update installed, running an older kernel") may need to be excluded; this could be achieved by recognizing the service type.
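A minimal sketch of such an exclusion, assuming services are matched by name as a stand-in for "service type". The exclusion set, the function name and the shortened sample data are hypothetical.

```python
# Hypothetical set of services that should not count towards service health.
EXCLUDED_SERVICES = frozenset({"Kernel Update"})


def service_health(services, excluded=EXCLUDED_SERVICES):
    """Mean percentage over all services that are not excluded; None if none remain."""
    counted = [pct for name, pct in services.items() if name not in excluded]
    if not counted:
        return None
    return round(sum(counted) / len(counted), 2)


sample = {"Alive": 100.0, "Disk": 100.0, "Kernel Update": 50.0, "NTP": 100.0}

print(service_health(sample, excluded=frozenset()))  # 87.5
print(service_health(sample))                        # 100.0
```

With the pending kernel update left out, the hosts in this sample count as fully healthy, which matches the intent of not reporting an outage for a merely outdated kernel.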

vanmeeuwen lowered the priority of this task from 60 to Normal. (Mar 28 2019, 8:13 AM)