Merge pull request #13932 from hashicorp/docs/crossref-maint-mode-from-health-checks

docs: improve health check related docs
Jared Kirschner 2022-08-25 16:56:30 -04:00 committed by GitHub
commit 21bc0add9a
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 365 additions and 246 deletions

@ -6,7 +6,10 @@ description: The /agent/check endpoints interact with checks on the local agent
# Check - Agent HTTP API # Check - Agent HTTP API
The `/agent/check` endpoints interact with checks on the local agent in Consul. Consul's health check capabilities are described in the
[health checks overview](/docs/discovery/checks).
The `/agent/check` endpoints interact with health checks
managed by the local agent in Consul.
These should not be confused with checks in the catalog. These should not be confused with checks in the catalog.
## List Checks ## List Checks
@ -418,6 +421,10 @@ $ curl \
This endpoint is used with a TTL type check to set the status of the check to This endpoint is used with a TTL type check to set the status of the check to
`critical` and to reset the TTL clock. `critical` and to reset the TTL clock.
If you want to manually mark a service as unhealthy,
use [maintenance mode](/api-docs/agent#enable-maintenance-mode)
instead of defining a TTL health check and using this endpoint.
| Method | Path | Produces | | Method | Path | Produces |
| ------ | ----------------------------- | ------------------ | | ------ | ----------------------------- | ------------------ |
| `PUT` | `/agent/check/fail/:check_id` | `application/json` | | `PUT` | `/agent/check/fail/:check_id` | `application/json` |
@ -456,6 +463,10 @@ $ curl \
This endpoint is used with a TTL type check to set the status of the check and This endpoint is used with a TTL type check to set the status of the check and
to reset the TTL clock. to reset the TTL clock.
If you want to manually mark a service as unhealthy,
use [maintenance mode](/api-docs/agent#enable-maintenance-mode)
instead of defining a TTL health check and using this endpoint.
| Method | Path | Produces | | Method | Path | Produces |
| ------ | ------------------------------- | ------------------ | | ------ | ------------------------------- | ------------------ |
| `PUT` | `/agent/check/update/:check_id` | `application/json` | | `PUT` | `/agent/check/update/:check_id` | `application/json` |

@ -14,6 +14,9 @@ optional health checking mechanisms. Additionally, some of the query results
from the health endpoints are filtered while the catalog endpoints provide the from the health endpoints are filtered while the catalog endpoints provide the
raw entries. raw entries.
To modify health check registration or information,
use the [`/agent/check`](/api-docs/agent/check) endpoints.
## List Checks for Node ## List Checks for Node
This endpoint returns the checks specific to the node provided on the path. This endpoint returns the checks specific to the node provided on the path.

@ -13,144 +13,72 @@ description: >-
One of the primary roles of the agent is management of system-level and application-level health One of the primary roles of the agent is management of system-level and application-level health
checks. A health check is considered to be application-level if it is associated with a checks. A health check is considered to be application-level if it is associated with a
service. If not associated with a service, the check monitors the health of the entire node. service. If not associated with a service, the check monitors the health of the entire node.
Review the [health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks) to get a more complete example on how to leverage health check capabilities in Consul.
A check is defined in a configuration file or added at runtime over the HTTP interface. Checks Review the [service health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks)
created via the HTTP interface persist with that node. to get a more complete example on how to leverage health check capabilities in Consul.
There are several different kinds of checks: ## Registering a health check
- Script + Interval - These checks depend on invoking an external application There are three ways to register a service with health checks:
that performs the health check, exits with an appropriate exit code, and potentially
generates some output. A script is paired with an invocation interval (e.g.
every 30 seconds). This is similar to the Nagios plugin system. The output of
a script check is limited to 4KB. Output larger than this will be truncated.
By default, Script checks will be configured with a timeout equal to 30 seconds.
It is possible to configure a custom Script check timeout value by specifying the
`timeout` field in the check definition. When the timeout is reached on Windows,
Consul will wait for any child processes spawned by the script to finish. For any
other system, Consul will attempt to force-kill the script and any child processes
it has spawned once the timeout has passed.
In Consul 0.9.0 and later, script checks are not enabled by default. To use them you
can use either:
- [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks): 1. Start or reload a Consul agent with a service definition file in the
enable script checks defined in local config files. Script checks defined via the HTTP [agent's configuration directory](/docs/agent#configuring-consul-agents).
API will not be allowed. 1. Call the
- [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks): enable [`/agent/service/register`](/api-docs/agent/service#register-service)
script checks regardless of how they are defined. HTTP API endpoint to register the service.
1. Use the
[`consul services register`](/commands/services/register)
CLI command to register the service.
~> **Security Warning:** Enabling script checks in some configurations may When a service is registered using the HTTP API endpoint or CLI command,
introduce a remote execution vulnerability which is known to be targeted by the checks persist in the Consul data folder across Consul agent restarts.
malware. We strongly recommend `enable_local_script_checks` instead. See [this
blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations)
for more details.
- `HTTP + Interval` - These checks make an HTTP `GET` request to the specified URL, ## Types of checks
waiting the specified `interval` amount of time between requests (eg. 30 seconds).
The status of the service depends on the HTTP response code: any `2xx` code is
considered passing, a `429 Too Many Requests` is a warning, and anything else is
a failure. This type of check
should be preferred over a script that uses `curl` or another external process
to check a simple HTTP operation. By default, HTTP checks are `GET` requests
unless the `method` field specifies a different method. Additional header
fields can be set through the `header` field which is a map of lists of
strings, e.g. `{"x-foo": ["bar", "baz"]}`. By default, HTTP checks will be
configured with a request timeout equal to 10 seconds.
It is possible to configure a custom HTTP check timeout value by This section describes the available types of health checks you can use to
specifying the `timeout` field in the check definition. The output of the automatically monitor the health of a service instance or node.
check is limited to roughly 4KB. Responses larger than this will be truncated.
HTTP checks also support TLS. By default, a valid TLS certificate is expected.
Certificate verification can be turned off by setting the `tls_skip_verify`
field to `true` in the check definition. When using TLS, the SNI will be set
automatically from the URL if it uses a hostname (as opposed to an IP address);
the value can be overridden by setting `tls_server_name`.
Consul follows HTTP redirects by default. Set the `disable_redirects` field to -> **To manually mark a service unhealthy:** Use the maintenance mode
`true` to disable redirects. [CLI command](/commands/maint) or
[HTTP API endpoint](/api-docs/agent#enable-maintenance-mode)
to temporarily remove one or all service instances on a node
from service discovery DNS and HTTP API query results.
- `TCP + Interval` - These checks make a TCP connection attempt to the specified ### Script check ((#script-interval))
IP/hostname and port, waiting `interval` amount of time between attempts
(e.g. 30 seconds). If no hostname
is specified, it defaults to "localhost". The status of the service depends on
whether the connection attempt is successful (ie - the port is currently
accepting connections). If the connection is accepted, the status is
`success`, otherwise the status is `critical`. In the case of a hostname that
resolves to both IPv4 and IPv6 addresses, an attempt will be made to both
addresses, and the first successful connection attempt will result in a
successful check. This type of check should be preferred over a script that
uses `netcat` or another external process to check a simple socket operation.
By default, TCP checks will be configured with a request timeout of 10 seconds.
It is possible to configure a custom TCP check timeout value by specifying the
`timeout` field in the check definition.
- `UDP + Interval` - These checks direct the client to periodically send UDP datagrams Script checks periodically invoke an external application that performs the health check,
to the specified IP/hostname and port. The duration specified in the `interval` field sets the amount of time exits with an appropriate exit code, and potentially generates some output.
between attempts, such as `30s` to indicate 30 seconds. The check is logged as healthy if any response from the UDP server is received. Any other result sets the status to `critical`. The specified `interval` determines the time between check invocations.
The default timeout for UDP checks is `10s`, but you can configure a custom UDP check timeout value by specifying the The output of a script check is limited to 4KB.
`timeout` field in the check definition. If any timeout on read exists, the check is still considered healthy. Larger outputs are truncated.
- `Time to Live (TTL)` ((#ttl)) - These checks retain their last known state By default, script checks are configured with a timeout equal to 30 seconds.
for a given TTL. The state of the check must be updated periodically over the HTTP To configure a custom script check timeout value,
interface. If an external system fails to update the status within a given TTL, specify the `timeout` field in the check definition.
the check is set to the failed state. This mechanism, conceptually similar to a After reaching the timeout on a Windows system,
dead man's switch, relies on the application to directly report its health. For Consul waits for any child processes spawned by the script to finish.
example, a healthy app can periodically `PUT` a status update to the HTTP endpoint; After reaching the timeout on other systems,
if the app fails, the TTL will expire and the health check enters a critical state. Consul attempts to force-kill the script and any child processes it spawned.
The endpoints used to update health information for a given check are: [pass](/api-docs/agent/check#ttl-check-pass),
[warn](/api-docs/agent/check#ttl-check-warn), [fail](/api-docs/agent/check#ttl-check-fail),
and [update](/api-docs/agent/check#ttl-check-update). TTL checks also persist their
last known status to disk. This allows the Consul agent to restore the last known
status of the check across restarts. Persisted check status is valid through the
end of the TTL from the time of the last check.
- `Docker + Interval` - These checks depend on invoking an external application which Script checks are not enabled by default.
is packaged within a Docker Container. The application is triggered within the running To enable a Consul agent to perform script checks,
container via the Docker Exec API. We expect that the Consul agent user has access use one of the following agent configuration options:
to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
determine the Docker API endpoint. The application is expected to run, perform a health
check of the service running inside the container, and exit with an appropriate exit code.
The check should be paired with an invocation interval. The shell on which the check
has to be performed is configurable which makes it possible to run containers which
have different shells on the same host. Check output for Docker is limited to
4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent
must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
set to `true` in order to enable Docker health checks.
- `gRPC + Interval` - These checks are intended for applications that support the standard - [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks):
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md). Enable script checks defined in local config files.
The state of the check will be updated by probing the configured endpoint, waiting `interval` Script checks registered using the HTTP API are not allowed.
amount of time between probes (eg. 30 seconds). By default, gRPC checks will be configured - [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks):
with a default timeout of 10 seconds. Enable script checks no matter how they are registered.
It is possible to configure a custom timeout value by specifying the `timeout` field in
the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by
setting `grpc_use_tls` in the check definition. If TLS is enabled, then by default, a valid
TLS certificate is expected. Certificate verification can be turned off by setting the
`tls_skip_verify` field to `true` in the check definition.
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`.
- `H2ping + Interval` - These checks test an endpoint that uses http2 ~> **Security Warning:**
by connecting to the endpoint and sending a ping frame. TLS is assumed to be configured by default. Enabling non-local script checks in some configurations may introduce
To disable TLS and use h2c, set `h2ping_use_tls` to `false`. If the ping is successful a remote execution vulnerability known to be targeted by malware.
within a specified timeout, then the check is updated as passing. We strongly recommend `enable_local_script_checks` instead.
The timeout defaults to 10 seconds, but is configurable using the `timeout` field. If TLS is enabled a valid For more information, refer to
certificate is required, unless `tls_skip_verify` is set to `true`. [this blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations).
The check will be run on the interval specified by the `interval` field.
- `Alias` - These checks alias the health state of another registered The following service definition file snippet is an example
node or service. The state of the check will be updated asynchronously, but is of a script check definition:
nearly instant. For aliased services on the same agent, the local state is monitored
and no additional network resources are consumed. For other services and nodes,
the check maintains a blocking query over the agent's connection with a current
server and allows stale requests. If there are any errors in watching the aliased
node or service, the check state will be critical. For the blocking query, the
check will use the ACL token set on the service or check definition or otherwise
will fall back to the default ACL token set with the agent (`acl_token`).
## Check Definition
A script check:
<CodeTabs heading="Script Check"> <CodeTabs heading="Script Check">
@ -162,7 +90,6 @@ check = {
interval = "10s" interval = "10s"
timeout = "1s" timeout = "1s"
} }
``` ```
```json ```json
@ -179,7 +106,47 @@ check = {
</CodeTabs> </CodeTabs>
A HTTP check: #### Check script conventions
A check script's exit code is used to determine the health check status:
- Exit code 0 - Check is passing
- Exit code 1 - Check is warning
- Any other code - Check is failing
Any output of the script is captured and made available in the
`Output` field of checks included in HTTP API responses,
as in this example from the [local service health endpoint](/api-docs/agent/service#by-name-json).
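The exit code convention above can be sketched in a few lines of Python. This is an illustration only, not Consul's implementation: the helper names `status_from_exit_code` and `run_check_script` are hypothetical, and treating a timeout as `critical` and concatenating stdout and stderr are assumptions made for the sketch.

```python
import subprocess
import sys

# Script check output is limited to 4KB; larger outputs are truncated.
MAX_OUTPUT_BYTES = 4 * 1024


def status_from_exit_code(exit_code):
    """Map a check script's exit code to a health check status."""
    if exit_code == 0:
        return "passing"
    if exit_code == 1:
        return "warning"
    return "critical"  # any other exit code means the check is failing


def run_check_script(args, timeout=30.0):
    """Run a check script and return (status, captured output).

    Hypothetical helper. The 30 second default mirrors the documented
    default script check timeout.
    """
    try:
        result = subprocess.run(args, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption for this sketch: a timed-out script counts as failing.
        return "critical", "check timed out"
    # Assumption: stdout and stderr are combined before truncation.
    output = (result.stdout + result.stderr)[:MAX_OUTPUT_BYTES]
    return status_from_exit_code(result.returncode), output.decode(errors="replace")


if __name__ == "__main__":
    # A stand-in "script" that prints a message and exits with code 1.
    demo = [sys.executable, "-c", "import sys; print('disk at 85%'); sys.exit(1)"]
    print(run_check_script(demo))
```

A script that exits with code 1, as in the demo, would be reported as `warning`, and its printed text is what would appear in the `Output` field.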
### HTTP check ((#http-interval))
HTTP checks periodically make an HTTP `GET` request to the specified URL,
waiting the specified `interval` amount of time between requests.
The status of the service depends on the HTTP response code: any `2xx` code is
considered passing, a `429 Too Many Requests` is a warning, and anything else is
a failure. This type of check
should be preferred over a script that uses `curl` or another external process
to check a simple HTTP operation. By default, HTTP checks are `GET` requests
unless the `method` field specifies a different method. Additional request
headers can be set through the `header` field which is a map of lists of
strings, such as `{"x-foo": ["bar", "baz"]}`.
By default, HTTP checks are configured with a request timeout equal to 10 seconds.
To configure a custom HTTP check timeout value,
specify the `timeout` field in the check definition.
The output of an HTTP check is limited to approximately 4KB.
Larger outputs are truncated.
HTTP checks also support TLS. By default, a valid TLS certificate is expected.
Certificate verification can be turned off by setting the `tls_skip_verify`
field to `true` in the check definition. When using TLS, the SNI is implicitly
determined from the URL if it uses a hostname instead of an IP address.
You can explicitly set the SNI value by setting `tls_server_name`.
Consul follows HTTP redirects by default.
To disable redirects, set the `disable_redirects` field to `true`.
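The response code rules described above can be summarized in a short Python sketch. The function name `status_from_http_code` is hypothetical and the code is an illustration of the documented mapping, not Consul's implementation.

```python
def status_from_http_code(status_code):
    """Map an HTTP response code to a health check status."""
    if 200 <= status_code <= 299:
        return "passing"  # any 2xx code is considered passing
    if status_code == 429:
        return "warning"  # 429 Too Many Requests is a warning
    return "critical"  # anything else is a failure
```

Because redirects are followed by default, the code being classified is the one returned after any redirects, unless `disable_redirects` is set to `true`.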
The following service definition file snippet is an example
of an HTTP check definition:
<CodeTabs heading="HTTP Check"> <CodeTabs heading="HTTP Check">
@ -220,7 +187,23 @@ check = {
</CodeTabs> </CodeTabs>
A TCP check: ### TCP check ((#tcp-interval))
TCP checks periodically make a TCP connection attempt to the specified IP/hostname and port, waiting `interval` amount of time between attempts.
If no hostname is specified, it defaults to "localhost".
The health check status is `success` if the target host accepts the connection attempt,
otherwise the status is `critical`. In the case of a hostname that
resolves to both IPv4 and IPv6 addresses, an attempt is made to both
addresses, and the first successful connection attempt results in a
successful check. This type of check should be preferred over a script that
uses `netcat` or another external process to check a simple socket operation.
By default, TCP checks are configured with a request timeout equal to 10 seconds.
To configure a custom TCP check timeout value,
specify the `timeout` field in the check definition.
The following service definition file snippet is an example
of a TCP check definition:
<CodeTabs heading="TCP Check"> <CodeTabs heading="TCP Check">
@ -232,7 +215,6 @@ check = {
interval = "10s" interval = "10s"
timeout = "1s" timeout = "1s"
} }
``` ```
```json ```json
@ -249,7 +231,21 @@ check = {
</CodeTabs> </CodeTabs>
A UDP check: ### UDP check ((#udp-interval))
UDP checks periodically direct the Consul agent to send UDP datagrams
to the specified IP/hostname and port,
waiting `interval` amount of time between attempts.
The check status is set to `success` if any response is received from the targeted UDP server.
Any other result sets the status to `critical`.
By default, UDP checks are configured with a request timeout equal to 10 seconds.
To configure a custom UDP check timeout value,
specify the `timeout` field in the check definition.
If any timeout on read exists, the check is still considered healthy.
The following service definition file snippet is an example
of a UDP check definition:
<CodeTabs heading="UDP Check"> <CodeTabs heading="UDP Check">
@ -261,7 +257,6 @@ check = {
interval = "10s" interval = "10s"
timeout = "1s" timeout = "1s"
} }
``` ```
```json ```json
@ -278,7 +273,32 @@ check = {
</CodeTabs> </CodeTabs>
A TTL check: ### Time to live (TTL) check ((#ttl))
TTL checks retain their last known state for the specified `ttl` duration.
If the `ttl` duration elapses before a new check update
is provided over the HTTP interface,
the check is set to `critical` state.
This mechanism relies on the application to directly report its health.
For example, a healthy app can periodically `PUT` a status update to the HTTP endpoint.
Then, if the app is disrupted and unable to perform this update
before the TTL expires, the health check enters the `critical` state.
The endpoints used to update health information for a given check are: [pass](/api-docs/agent/check#ttl-check-pass),
[warn](/api-docs/agent/check#ttl-check-warn), [fail](/api-docs/agent/check#ttl-check-fail),
and [update](/api-docs/agent/check#ttl-check-update). TTL checks also persist their
last known status to disk. This persistence allows the Consul agent to restore the last known
status of the check across agent restarts. Persisted check status is valid through the
end of the TTL from the time of the last check.
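As a minimal sketch of how an application might report its own health to a TTL check, the following Python builds a `PUT` request for the update endpoint. Several details are assumptions for illustration: the agent address `http://127.0.0.1:8500`, the `/v1` API prefix, the check ID `web-ttl`, and the helper names `build_ttl_update` and `send_ttl_update`. No ACL token is sent, which only works if the agent allows it.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the local agent's HTTP API listens on the default address.
AGENT_ADDR = "http://127.0.0.1:8500"


def build_ttl_update(check_id, status, output="", agent_addr=AGENT_ADDR):
    """Build a PUT request for /agent/check/update/:check_id (hypothetical helper)."""
    path = "/v1/agent/check/update/" + urllib.parse.quote(check_id, safe="")
    body = json.dumps({"Status": status, "Output": output}).encode("utf-8")
    return urllib.request.Request(
        agent_addr + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def send_ttl_update(check_id, status, output=""):
    """Send the update; this also resets the TTL clock for the check."""
    request = build_ttl_update(check_id, status, output)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

An application would call something like `send_ttl_update("web-ttl", "passing", "heartbeat ok")` more often than the configured `ttl`, so that a missed update moves the check to `critical`.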
To manually mark a service unhealthy,
it is far more convenient to use the maintenance mode
[CLI command](/commands/maint) or
[HTTP API endpoint](/api-docs/agent#enable-maintenance-mode)
rather than a TTL health check with arbitrarily high `ttl`.
The following service definition file snippet is an example
of a TTL check definition:
<CodeTabs heading="TTL Check"> <CodeTabs heading="TTL Check">
@ -304,7 +324,24 @@ check = {
</CodeTabs> </CodeTabs>
A Docker check: ### Docker check ((#docker-interval))
These checks depend on periodically invoking an external application that
is packaged within a Docker Container. The application is triggered within the running
container through the Docker Exec API. We expect that the Consul agent user has access
to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
determine the Docker API endpoint. The application is expected to run, perform a health
check of the service running inside the container, and exit with an appropriate exit code.
The check should be paired with an invocation interval. The shell on which the check
has to be performed is configurable, making it possible to run containers which
have different shells on the same host.
The output of a Docker check is limited to 4KB.
Larger outputs are truncated.
The agent must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
set to `true` in order to enable Docker health checks.
The following service definition file snippet is an example
of a Docker check definition:
<CodeTabs heading="Docker Check"> <CodeTabs heading="Docker Check">
@ -334,7 +371,26 @@ check = {
</CodeTabs> </CodeTabs>
A gRPC check for the whole application: ### gRPC check ((#grpc-interval))
gRPC checks are intended for applications that support the standard
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
The state of the check will be updated by periodically probing the configured endpoint,
waiting `interval` amount of time between attempts.
By default, gRPC checks are configured with a timeout equal to 10 seconds.
To configure a custom gRPC check timeout value,
specify the `timeout` field in the check definition.
gRPC checks default to not using TLS.
To enable TLS, set `grpc_use_tls` in the check definition.
If TLS is enabled, then by default, a valid TLS certificate is expected.
Certificate verification can be turned off by setting the
`tls_skip_verify` field to `true` in the check definition.
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`.
The following service definition file snippet is an example
of a gRPC check for a whole application:
<CodeTabs heading="gRPC Check"> <CodeTabs heading="gRPC Check">
@ -362,7 +418,8 @@ check = {
</CodeTabs> </CodeTabs>
A gRPC check for the specific `my_service` service: The following service definition file snippet is an example
of a gRPC check for the specific `my_service` service:
<CodeTabs heading="gRPC Specific Service Check"> <CodeTabs heading="gRPC Specific Service Check">
@ -390,7 +447,23 @@ check = {
</CodeTabs> </CodeTabs>
A h2ping check: ### H2ping check ((#h2ping-interval))
H2ping checks test an endpoint that uses HTTP/2 by connecting to the endpoint
and sending a ping frame, waiting `interval` amount of time between attempts.
If the ping is successful within a specified timeout,
then the check status is set to `success`.
By default, h2ping checks are configured with a request timeout equal to 10 seconds.
To configure a custom h2ping check timeout value,
specify the `timeout` field in the check definition.
TLS is enabled by default.
To disable TLS and use h2c, set `h2ping_use_tls` to `false`.
If TLS is not disabled, a valid certificate is required unless `tls_skip_verify` is set to `true`.
The following service definition file snippet is an example
of an h2ping check definition:
<CodeTabs heading="H2ping Check"> <CodeTabs heading="H2ping Check">
@ -418,7 +491,29 @@ check = {
</CodeTabs> </CodeTabs>
An alias check for a local service: ### Alias check
These checks alias the health state of another registered
node or service. The state of the check updates asynchronously, but is
nearly instant. For aliased services on the same agent, the local state is monitored
and no additional network resources are consumed. For other services and nodes,
the check maintains a blocking query over the agent's connection with a current
server and allows stale requests. If there are any errors in watching the aliased
node or service, the check state is set to `critical`.
For the blocking query, the check uses the ACL token set on the service or check definition.
If no ACL token is set in the service or check definition,
the blocking query uses the agent's default ACL token
([`acl.tokens.default`](/docs/agent/config/config-files#acl_tokens_default)).
~> **Configuration info**: The alias check configuration expects the alias to be
registered on the same agent as the one you are aliasing. If the service is
not registered with the same agent, `"alias_node": "<node_id>"` must also be
specified. When using `alias_node`, if no service is specified, the check will
alias the health of the node. If a service is specified, the check will alias
the specified service on this particular node.
The following service definition file snippet is an example
of an alias check for a local service:
<CodeTabs heading="Alias Check"> <CodeTabs heading="Alias Check">
@ -440,72 +535,137 @@ check = {
</CodeTabs> </CodeTabs>
~> Configuration info: The alias check configuration expects the alias to be ## Check definition
registered on the same agent as the one you are aliasing. If the service is
not registered with the same agent, `"alias_node": "<node_id>"` must also be
specified. When using `alias_node`, if no service is specified, the check will
alias the health of the node. If a service is specified, the check will alias
the specified service on this particular node.
Each type of definition must include a `name` and may optionally provide an This section covers some of the most common options for check definitions.
`id` and `notes` field. The `id` must be unique per _agent_ otherwise only the For a complete list of all check options, refer to the
last defined check with that `id` will be registered. If the `id` is not set [Register Check HTTP API endpoint documentation](/api-docs/agent/check#json-request-body-schema).
and the check is embedded within a service definition a unique check id is
generated. Otherwise, `id` will be set to `name`. If names might conflict,
unique IDs should be provided.
The `notes` field is opaque to Consul but can be used to provide a human-readable -> **Casing for check options:**
description of the current state of the check. Similarly, an external process The correct casing for an option depends on whether the check is defined in
updating a TTL check via the HTTP interface can set the `notes` value. a service definition file or an HTTP API JSON request body.
For example, the option `deregister_critical_service_after` in a service
definition file is instead named `DeregisterCriticalServiceAfter` in an
HTTP API JSON request body.
Checks may also contain a `token` field to provide an ACL token. This token is #### General options
used for any interaction with the catalog for the check, including
[anti-entropy syncs](/docs/architecture/anti-entropy) and deregistration.
For Alias checks, this token is used if a remote blocking query is necessary
to watch the state of the aliased node or service.
Script, TCP, UDP, HTTP, Docker, and gRPC checks must include an `interval` field. This - `name` `(string: <required>)` - Specifies the name of the check.
field is parsed by Go's `time` package, and has the following
[formatting specification](https://golang.org/pkg/time/#ParseDuration):
> A duration string is a possibly signed sequence of decimal numbers, each with - `id` `(string: "")` - Specifies a unique ID for this check on this node.
> optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
> Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
In Consul 0.7 and later, checks that are associated with a service may also contain
an optional `deregister_critical_service_after` field, which is a timeout in the
same Go time format as `interval` and `ttl`. If a check is in the critical state
for more than this configured value, then its associated service (and all of its
associated checks) will automatically be deregistered. The minimum timeout is 1
minute, and the process that reaps critical services runs every 30 seconds, so it
may take slightly longer than the configured timeout to trigger the deregistration.
This should generally be configured with a timeout that's much, much longer than
any expected recoverable outage for the given service.

To configure a check, either provide it as a `-config-file` option to the
agent or place it inside the `-config-dir` of the agent. The file must
end in a ".json" or ".hcl" extension to be loaded by Consul. Check definitions
can also be updated by sending a `SIGHUP` to the agent. Alternatively, the
check can be registered dynamically using the [HTTP API](/api).

## Check Scripts

A check script is generally free to do anything to determine the status
of the check. The only limitations placed are that the exit codes must obey
this convention:

- Exit code 0 - Check is passing
- Exit code 1 - Check is warning
- Any other code - Check is failing

This is the only convention that Consul depends on. Any output of the script
will be captured and stored in the `output` field.

In Consul 0.9.0 and later, the agent must be configured with
[`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks) set to `true`
in order to enable script checks.

  If unspecified, Consul defines the check id by:
  - If the check definition is embedded within a service definition file,
    a unique check id is auto-generated.
  - Otherwise, the `id` is set to the value of `name`.
    If names might conflict, you must provide unique IDs to avoid
    overwriting existing checks with the same id on this node.

- `interval` `(string: <required for interval-based checks>)` - Specifies
  the frequency at which to run this check.
  Required for all check types except TTL and alias checks.

  The value is parsed by Go's `time` package, and has the following
  [formatting specification](https://golang.org/pkg/time/#ParseDuration):
  > A duration string is a possibly signed sequence of decimal numbers, each with
  > optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
  > Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

- `service_id` `(string: <required for service health checks>)` - Specifies
  the ID of a service instance to associate this check with.
  That service instance must be on this node.
  If not specified, this check is treated as a node-level check.
  For more information, refer to the
  [service-bound checks](#service-bound-checks) section.

- `status` `(string: "")` - Specifies the initial status of the health check as
  "critical" (default), "warning", or "passing". For more details, refer to
  the [initial health check status](#initial-health-check-status) section.

  -> **Health defaults to critical:** If health status is not initially specified,
  it defaults to "critical" to protect against including a service
  in discovery results before it is ready.

- `deregister_critical_service_after` `(string: "")` - If specified,
  the associated service and all its checks are deregistered
  after this check is in the critical state for more than the specified value.
  The value has the same formatting specification as the [`interval`](#interval) field.

  The minimum timeout is 1 minute,
  and the process that reaps critical services runs every 30 seconds,
  so it may take slightly longer than the configured timeout to trigger the deregistration.
  This field should generally be configured with a timeout that's significantly longer than
  any expected recoverable outage for the given service.

- `notes` `(string: "")` - Provides a human-readable description of the check.
  This field is opaque to Consul and can be used in whatever way is useful to the user.
  For example, it could be used to describe the current state of the check.

- `token` `(string: "")` - Specifies an ACL token used for any interaction
  with the catalog for the check, including
  [anti-entropy syncs](/docs/architecture/anti-entropy) and deregistration.

  For alias checks, this token is used if a remote blocking query is necessary to watch the state of the aliased node or service.
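
As a rough illustration of how the fields above fit together, the following sketch defines a service-bound TCP check. The check `id`, `name`, and `notes` values are hypothetical, and it assumes a service instance with the ID `web-app` is already registered on the same node. Treat it as a starting point under those assumptions, not a recommended configuration.

```hcl
check = {
  # Hypothetical identifiers; an explicit id avoids overwriting another
  # check on this node that happens to share the same name.
  id   = "web-app-tcp"
  name = "Web app TCP on port 80"

  # Binds the check to a service instance on this node (assumed to exist).
  # Omitting service_id would make this a node-level check.
  service_id = "web-app"

  tcp      = "localhost:80"
  interval = "10s" # Go duration string; required for interval-based checks
  timeout  = "1s"

  # status is left unset here, so the check starts out as "critical".

  # Must be at least 1 minute; chosen to be far longer than an
  # expected recoverable outage.
  deregister_critical_service_after = "90m"

  notes = "Verifies that the web app accepts TCP connections."
}
```

Because the reaper process runs every 30 seconds, deregistration in this sketch could happen slightly after the 90 minute mark.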
#### Success/failures before passing/warning/critical
To prevent flapping health checks and limit the load they cause on the cluster,
a health check may be configured to become passing/warning/critical only after a
specified number of consecutive checks return as passing/critical.
The status does not transition states until the configured threshold is reached.
- `success_before_passing` - Number of consecutive successful results required
before check status transitions to passing. Defaults to `0`. Added in Consul 1.7.0.
- `failures_before_warning` - Number of consecutive unsuccessful results required
before check status transitions to warning. Defaults to the same value as that of
`failures_before_critical` to maintain the expected behavior of not changing the
status of service checks to `warning` before `critical` unless configured to do so.
Values higher than `failures_before_critical` are invalid. Added in Consul 1.11.0.
- `failures_before_critical` - Number of consecutive unsuccessful results required
before check status transitions to critical. Defaults to `0`. Added in Consul 1.7.0.
This feature is available for all check types except TTL and alias checks.
By default, both passing and critical thresholds are set to 0 so the check
status always reflects the last check result.
<CodeTabs heading="Flapping Prevention Example">
```hcl
checks = [
{
name = "HTTP TCP on port 80"
tcp = "localhost:80"
interval = "10s"
timeout = "1s"
success_before_passing = 3
failures_before_warning = 1
failures_before_critical = 3
}
]
```
```json
{
"checks": [
{
"name": "HTTP TCP on port 80",
"tcp": "localhost:80",
"interval": "10s",
"timeout": "1s",
"success_before_passing": 3,
"failures_before_warning": 1,
"failures_before_critical": 3
}
]
}
```
</CodeTabs>
## Initial health check status
By default, when checks are registered against a Consul agent, the state is set
immediately to "critical". This is useful to prevent services from being
@ -576,13 +736,13 @@ In the above configuration, if the web-app health check begins failing, it will
only affect the availability of the web-app service. All other services
provided by the node will remain unchanged.
## Agent certificates for TLS checks
The [enable_agent_tls_for_checks](/docs/agent/config/config-files#enable_agent_tls_for_checks)
agent configuration option can be used to have HTTP or gRPC health checks
use the agent's credentials when configured for TLS.
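
A minimal sketch of what this might look like in an agent configuration file follows. It assumes the agent's own TLS certificates are configured elsewhere, and the check `id`, `name`, and URL are hypothetical placeholders.

```hcl
# Agent configuration option described above.
enable_agent_tls_for_checks = true

# Hypothetical HTTPS check; with the option above enabled, the agent
# would present its own TLS credentials when running it.
check = {
  id       = "web-app-https"
  name     = "Web app HTTPS status"
  http     = "https://localhost:8443/health"
  interval = "10s"
  timeout  = "1s"
}
```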
## Multiple check definitions
Multiple check definitions can be defined using the `checks` (plural)
key in your configuration file.
@ -640,58 +800,3 @@ checks = [
``` ```
</CodeTabs> </CodeTabs>
## Success/Failures before passing/warning/critical
To prevent flapping health checks, and limit the load they cause on the cluster,
a health check may be configured to become passing/warning/critical only after a
specified number of consecutive checks return passing/critical.
The status will not transition states until the configured threshold is reached.
- `success_before_passing` - Number of consecutive successful results required
before check status transitions to passing. Defaults to `0`. Added in Consul 1.7.0.
- `failures_before_warning` - Number of consecutive unsuccessful results required
before check status transitions to warning. Defaults to the same value as that of
`failures_before_critical` to maintain the expected behavior of not changing the
status of service checks to `warning` before `critical` unless configured to do so.
Values higher than `failures_before_critical` are invalid. Added in Consul 1.11.0.
- `failures_before_critical` - Number of consecutive unsuccessful results required
before check status transitions to critical. Defaults to `0`. Added in Consul 1.7.0.
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
By default, both passing and critical thresholds will be set to 0 so the check
status will always reflect the last check result.
<CodeTabs heading="Flapping Prevention Example">
```hcl
checks = [
{
name = "HTTP TCP on port 80"
tcp = "localhost:80"
interval = "10s"
timeout = "1s"
success_before_passing = 3
failures_before_warning = 1
failures_before_critical = 3
}
]
```
```json
{
"checks": [
{
"name": "HTTP TCP on port 80",
"tcp": "localhost:80",
"interval": "10s",
"timeout": "1s",
"success_before_passing": 3,
"failures_before_warning": 1,
"failures_before_critical": 3
}
]
}
```
</CodeTabs>