We want to provide publicly available “service status” information for our customers and for monitoring service quality ourselves. For this we need a dedicated page (which does not necessarily need to be part of the HKCCP) to collect and display status information.
Ideally, the information on that page is updated automatically, without human intervention, so that it always reflects the actual system status.
Some examples of such pages are:
* https://www.infomaniak.com/en/network-status
* https://dev.twitter.com/overview/status
* http://azure.microsoft.com/en-us/status/
* http://www.google.com/appsstatus#hl=de&v=status
* https://status.github.com/
The specific design for the page will be provided by the same designers that also designed our websites.
## Implementation Design Considerations
These are existing environment details:
* System status is monitored through Nagios, which also sends the alerts and is therefore the holder of all status and performance data.
* Trending happens through Munin, and is stored in RRD databases.
* Puppet is used to manage the configuration of systems during their lifetime.
The goals might therefore include the following:
# Current system health is derived from Nagios status indicators,
# Certain aspects of current system health are read from the Munin RRD database files (disk latency, for example),
# Puppet catalogs with pending actions give us an indication of 'future stability', since every pending configuration change carries a level of risk.
The following links already exist:
* A node in an environment has a configuration management resource applied to it, such as 'imap::frontend',
* We therefore know that, provided the domain name space and environment stage match, the node participates in providing the service,
* The set of such nodes becomes a host group,
* The number of functional members of the host group divided by the total number of members of the host group gives us a ratio.
This ratio does not precisely indicate how much "functionality" remains, since the "service" as such may still be "fully functional" even when some members are down. However, it is relatively easy to compute and possibly a good first milestone. It requires reading the Nagios configuration (using `shinken.objects.config.Config`) and `status.dat`.
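The ratio described above can be sketched as follows. This is a minimal sketch and not a definitive implementation: it assumes the usual `status.dat` layout of `blocktype {` ... `}` blocks containing `key=value` lines, and it assumes that host group membership has already been obtained from the configuration and is passed in as a plain list. The function names `parse_status_dat` and `host_group_health` and the sample host names are hypothetical.

```python
def parse_status_dat(text):
    """Parse status.dat-style text into a list of (block_type, fields) tuples."""
    blocks = []
    block_type, fields = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if block_type is None:
            if line.endswith("{"):
                # e.g. "hoststatus {" opens a new block
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            blocks.append((block_type, fields))
            block_type, fields = None, None
        elif "=" in line:
            key, value = line.split("=", 1)
            fields[key] = value
    return blocks


def host_group_health(blocks, members):
    """Percentage of `members` whose hoststatus current_state is 0 (UP)."""
    if not members:
        return None
    states = {
        fields["host_name"]: fields.get("current_state")
        for kind, fields in blocks
        if kind == "hoststatus" and "host_name" in fields
    }
    # A host missing from status.dat is counted as not functional.
    up = sum(1 for host in members if states.get(host) == "0")
    return round(100.0 * up / len(members), 2)


# Hypothetical sample data: one host UP (0), one host DOWN (1).
SAMPLE = """
hoststatus {
    host_name=hv01.example.org
    current_state=0
    }

hoststatus {
    host_name=hv02.example.org
    current_state=1
    }
"""

blocks = parse_status_dat(SAMPLE)
print(host_group_health(blocks, ["hv01.example.org", "hv02.example.org"]))  # 50.0
```

Counting a host that is absent from `status.dat` as not functional is a deliberately pessimistic choice for this sketch; a real implementation might prefer to report such hosts separately.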
This could result in something similar to:
```
{
    "Development::Hypervisors": {
        "host_health": 95.83,
        "hosts": {
            "81aebb894c2003fa4068b83148112343": 95.83,
            "f1654b11144ac890ec1826f20fbcd421": 95.83
        },
        "last_check": "1456474751",
        "last_state_change": "1455153473",
        "service_health": 95.83,
        "services": {
            "Alive": 100.0,
            "Disk": 100.0,
            "Kernel Update": 50.0,
            "Load Average": 100.0,
            "Munin Process": 100.0,
            "NRPE Process": 100.0,
            "NTP": 100.0,
            "Puppet Memory": 100.0,
            "Puppet Process": 100.0,
            "SSH Service": 100.0,
            "Total Processes": 100.0,
            "Zombie Processes": 100.0
        }
    },
    "host_health": 99.6,
    "service_health": 99.6
}
```
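One possible way to arrive at the per-service figures in such a structure is sketched below. The aggregation rule is an assumption: each service's value is the percentage of group members reporting state 0 (OK) for it, and `service_health` is the mean of those per-service percentages. The value 95.83 in the example is at least consistent with that rule, since (11 × 100 + 50) / 12 ≈ 95.83, but the example alone does not confirm it. The function name `summarize_services` and the sample data are hypothetical.

```python
import json
from collections import defaultdict


def summarize_services(service_states, members):
    """Summarize service health for a host group.

    service_states: iterable of (host_name, service_description, current_state)
    members: host names that belong to the host group
    """
    wanted = set(members)
    ok = defaultdict(int)
    total = defaultdict(int)
    for host, service, state in service_states:
        if host not in wanted:
            continue  # ignore hosts outside this host group
        total[service] += 1
        if state == 0:  # 0 is the OK state
            ok[service] += 1

    # Unrounded percentage of OK results per service.
    raw = {name: 100.0 * ok[name] / total[name] for name in sorted(total)}
    overall = round(sum(raw.values()) / len(raw), 2) if raw else None
    return {
        "service_health": overall,
        "services": {name: round(value, 2) for name, value in raw.items()},
    }


# Hypothetical sample: "Kernel Update" is in a non-OK state on one of two hosts.
states = [
    ("hv01.example.org", "Alive", 0),
    ("hv02.example.org", "Alive", 0),
    ("hv01.example.org", "Kernel Update", 0),
    ("hv02.example.org", "Kernel Update", 1),
]
summary = summarize_services(states, ["hv01.example.org", "hv02.example.org"])
print(json.dumps(summary, indent=4, sort_keys=True))
```

With the sample data, "Alive" comes out at 100.0, "Kernel Update" at 50.0, and the overall `service_health` at 75.0.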