2014-02-18 18:05:18 -08:00
|
|
|
---
|
|
|
|
layout: "docs"
|
|
|
|
page_title: "Check Definition"
|
|
|
|
sidebar_current: "docs-agent-checks"
|
|
|
|
---
|
|
|
|
|
|
|
|
# Checks
|
|
|
|
|
|
|
|
One of the primary roles of the agent is the management of system and
|
|
|
|
application level health checks. A health check is considered to be application
|
|
|
|
level if it associated with a service. A check is defined in a configuration file,
|
|
|
|
or added at runtime over the HTTP interface.
|
|
|
|
|
|
|
|
There are two different kinds of checks:
|
|
|
|
|
|
|
|
* Script + Interval - These checks depend on invoking an external application
|
|
|
|
which does the health check and exits with an appropriate exit code, potentially
|
|
|
|
generating some output. A script is paired with an invocation interval (e.g.
|
|
|
|
every 30 seconds). This is similar to the Nagios plugin system.
|
|
|
|
|
|
|
|
* TTL - These checks retain their last known state for a given TTL. The state
|
2014-04-30 15:26:07 -04:00
|
|
|
of the check must be updated periodically over the HTTP interface. If an
|
2014-02-18 18:05:18 -08:00
|
|
|
external system fails to update the status within a given TTL, the check is
|
|
|
|
set to the failed state. This mechanism is used to allow an application to
|
2014-05-03 18:23:16 -04:00
|
|
|
directly report its health. For example, a web app can periodically curl the
|
2014-02-18 18:05:18 -08:00
|
|
|
endpoint, and if the app fails, then the TTL will expire and the health check
|
2014-04-09 11:06:27 -07:00
|
|
|
enters a critical state. This is conceptually similar to a dead man's switch.
|
2014-02-18 18:05:18 -08:00
|
|
|
|
|
|
|
## Check Definition
|
|
|
|
|
|
|
|
A check definition that is a script looks like:
|
|
|
|
|
|
|
|
{
|
|
|
|
"check": {
|
|
|
|
"id": "mem-util",
|
|
|
|
"name": "Memory utilization",
|
|
|
|
"script": "/usr/local/bin/check_mem.py",
|
|
|
|
"interval": "10s"
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
A TTL based check is very similar:
|
|
|
|
|
|
|
|
{
|
|
|
|
"check": {
|
|
|
|
"id": "web-app",
|
|
|
|
"name": "Web App Status",
|
|
|
|
"notes": "Web app does a curl internally every 10 seconds",
|
|
|
|
"ttl": "30s"
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Both types of definitions must include a `name`, and may optionally
|
|
|
|
provide an `id` and `notes` field. The `id` is set to the `name` if not
|
2014-06-08 17:31:44 -07:00
|
|
|
provided. It is required that all checks have a unique ID per node, so if names
|
|
|
|
might conflict then unique ID's should be provided.
|
2014-02-18 18:05:18 -08:00
|
|
|
|
|
|
|
The `notes` field is opaque to Consul, but may be used for human
|
|
|
|
readable descriptions. The field is set to any output that a script
|
|
|
|
generates, and similarly the TTL update hooks can update the `notes`
|
|
|
|
as well.
|
|
|
|
|
2014-02-22 18:53:31 -08:00
|
|
|
To configure a check, either provide it as a `-config-file` option to the
|
|
|
|
agent, or place it inside the `-config-dir` of the agent. The file must
|
|
|
|
end in the ".json" extension to be loaded by Consul. Check definitions can
|
|
|
|
also be updated by sending a `SIGHUP` to the agent. Alternatively, the
|
|
|
|
check can be registered dynamically using the [HTTP API](/docs/agent/http.html).
|
|
|
|
|
2014-02-19 12:05:18 -08:00
|
|
|
## Check Scripts
|
|
|
|
|
|
|
|
A check script is generally free to do anything to determine the status
|
|
|
|
of the check. The only limitations placed are the exit codes must convey
|
|
|
|
a specific meaning. Specifically:
|
|
|
|
|
|
|
|
* Exit code 0 - Check is passing
|
|
|
|
* Exit code 1 - Check is warning
|
|
|
|
* Any other code - Check is failing
|
|
|
|
|
|
|
|
This is the only convention that Consul depends on. Any output of the script
|
|
|
|
will be captured and stored in the `notes` field so that it can be viewed
|
|
|
|
by human operators.
|