Major updates and reorganizing of checks.mdx (#15806)

* Major updates and reorganizing of checks.mdx

* Update checks.mdx

Additional suggestion for clarity around gRPC `:/service-identifier` example

Signed-off-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* Update website/content/docs/discovery/checks.mdx

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

Signed-off-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
am-ak 2023-01-19 16:55:13 +01:00 committed by GitHub
parent 794277371f
commit ff477c44d6
1 changed files with 54 additions and 145 deletions


@ -10,149 +10,44 @@ description: >-
One of the primary roles of the agent is management of system-level and application-level health One of the primary roles of the agent is management of system-level and application-level health
checks. A health check is considered to be application-level if it is associated with a checks. A health check is considered to be application-level if it is associated with a
service. If not associated with a service, the check monitors the health of the entire node. service. If not associated with a service, the check monitors the health of the entire node.
Review the [health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks) Review the [health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks)
to get a more complete example on how to leverage health check capabilities in Consul. to get a more complete example on how to leverage health check capabilities in Consul.
A check is defined in a configuration file or added at runtime over the HTTP interface. Checks A check is defined in a configuration file or added at runtime over the HTTP interface. Checks
created via the HTTP interface persist with that node. created via the HTTP interface persist with that node.
There are several different kinds of checks: There are several types of checks:
- Script + Interval - These checks depend on invoking an external application - [`Script + Interval`](#script-check) - These checks invoke an external application
that performs the health check, exits with an appropriate exit code, and potentially that performs the health check.
generates some output. A script is paired with an invocation interval (e.g.
every 30 seconds). This is similar to the Nagios plugin system. The output of
a script check is limited to 4KB. Output larger than this will be truncated.
By default, Script checks will be configured with a timeout equal to 30 seconds.
It is possible to configure a custom Script check timeout value by specifying the
`timeout` field in the check definition. When the timeout is reached on Windows,
Consul will wait for any child processes spawned by the script to finish. For any
other system, Consul will attempt to force-kill the script and any child processes
it has spawned once the timeout has passed.
In Consul 0.9.0 and later, script checks are not enabled by default. To use them you
can either use :
- [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks): - [`HTTP + Interval`](#http-check) - These checks make an HTTP `GET` request to the specified URL
enable script checks defined in local config files. Script checks defined via the HTTP in the health check definition.
API will not be allowed.
- [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks): enable
script checks regardless of how they are defined.
~> **Security Warning:** Enabling script checks in some configurations may - [`TCP + Interval`](#tcp-check) - These checks attempt a TCP connection to the specified
introduce a remote execution vulnerability which is known to be targeted by address and port in the health check definition.
malware. We strongly recommend `enable_local_script_checks` instead. See [this
blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations)
for more details.
- `HTTP + Interval` - These checks make an HTTP `GET` request to the specified URL, - [`UDP + Interval`](#udp-check) - These checks direct the client to periodically send UDP datagrams
waiting the specified `interval` amount of time between requests (eg. 30 seconds). to the specified address and port in the health check definition.
The status of the service depends on the HTTP response code: any `2xx` code is
considered passing, a `429 Too ManyRequests` is a warning, and anything else is
a failure. This type of check
should be preferred over a script that uses `curl` or another external process
to check a simple HTTP operation. By default, HTTP checks are `GET` requests
unless the `method` field specifies a different method. Additional header
fields can be set through the `header` field which is a map of lists of
strings, e.g. `{"x-foo": ["bar", "baz"]}`. By default, HTTP checks will be
configured with a request timeout equal to 10 seconds.
It is possible to configure a custom HTTP check timeout value by - [`OSService + Interval`](#osservice-check) - These checks periodically direct the Consul agent to monitor
specifying the `timeout` field in the check definition. The output of the the health of a service running on the host operating system.
check is limited to roughly 4KB. Responses larger than this will be truncated.
HTTP checks also support TLS. By default, a valid TLS certificate is expected.
Certificate verification can be turned off by setting the `tls_skip_verify`
field to `true` in the check definition. When using TLS, the SNI will be set
automatically from the URL if it uses a hostname (as opposed to an IP address);
the value can be overridden by setting `tls_server_name`.
Consul follows HTTP redirects by default. Set the `disable_redirects` field to - [`Time to Live (TTL)`](#time-to-live-ttl-check) - These checks attempt an HTTP connection after a given TTL elapses.
`true` to disable redirects.
- `TCP + Interval` - These checks make a TCP connection attempt to the specified - [`Docker + Interval`](#docker-check) - These checks invoke an external application that
IP/hostname and port, waiting `interval` amount of time between attempts is packaged within a Docker container.
(e.g. 30 seconds). If no hostname
is specified, it defaults to "localhost". The status of the service depends on
whether the connection attempt is successful (ie - the port is currently
accepting connections). If the connection is accepted, the status is
`success`, otherwise the status is `critical`. In the case of a hostname that
resolves to both IPv4 and IPv6 addresses, an attempt will be made to both
addresses, and the first successful connection attempt will result in a
successful check. This type of check should be preferred over a script that
uses `netcat` or another external process to check a simple socket operation.
By default, TCP checks will be configured with a request timeout of 10 seconds.
It is possible to configure a custom TCP check timeout value by specifying the
`timeout` field in the check definition.
- `UDP + Interval` - These checks direct the client to periodically send UDP datagrams - [`gRPC + Interval`](#grpc-check) - These checks are intended for applications that support the standard
to the specified IP/hostname and port. The duration specified in the `interval` field sets the amount of time
between attempts, such as `30s` to indicate 30 seconds. The check is logged as healthy if any response from the UDP server is received. Any other result sets the status to `critical`.
For UDP checks, the default value for the `timeout` field is `10s`, but you can configure a custom timeout by specifying the
`timeout` field in the check definition. If any timeout on read exists, the check is still considered healthy.
- `Time to Live (TTL)` ((#ttl)) - These checks retain their last known state
for a given TTL. The state of the check must be updated periodically over the HTTP
interface. If an external system fails to update the status within a given TTL,
the check is set to the failed state. This mechanism, conceptually similar to a
dead man's switch, relies on the application to directly report its health. For
example, a healthy app can periodically `PUT` a status update to the HTTP endpoint;
if the app fails, the TTL will expire and the health check enters a critical state.
The endpoints used to update health information for a given check are: [pass](/api-docs/agent/check#ttl-check-pass),
[warn](/api-docs/agent/check#ttl-check-warn), [fail](/api-docs/agent/check#ttl-check-fail),
and [update](/api-docs/agent/check#ttl-check-update). TTL checks also persist their
last known status to disk. This allows the Consul agent to restore the last known
status of the check across restarts. Persisted check status is valid through the
end of the TTL from the time of the last check.
- `Docker + Interval` - These checks depend on invoking an external application which
is packaged within a Docker Container. The application is triggered within the running
container via the Docker Exec API. We expect that the Consul agent user has access
to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
determine the Docker API endpoint. The application is expected to run, perform a health
check of the service running inside the container, and exit with an appropriate exit code.
The check should be paired with an invocation interval. The shell on which the check
has to be performed is configurable which makes it possible to run containers which
have different shells on the same host. Check output for Docker is limited to
4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent
must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
set to `true` in order to enable Docker health checks.
- `gRPC + Interval` - These checks are intended for applications that support the standard
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md). [gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
The state of the check will be updated by probing the configured endpoint, waiting `interval`
amount of time between probes (eg. 30 seconds). By default, gRPC checks will be configured
with a default timeout of 10 seconds.
It is possible to configure a custom timeout value by specifying the `timeout` field in
the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by
setting `grpc_use_tls` in the check definition. If TLS is enabled, then by default, a valid
TLS certificate is expected. Certificate verification can be turned off by setting the
`tls_skip_verify` field to `true` in the check definition.
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`.
- `H2ping + Interval` - These checks test an endpoint that uses http2 - [`H2ping + Interval`](#h2ping-check) - These checks test an endpoint that uses HTTP/2
by connecting to the endpoint and sending a ping frame. TLS is assumed to be configured by default. by connecting to the endpoint and sending a ping frame.
To disable TLS and use h2c, set `h2ping_use_tls` to `false`. If the ping is successful
within a specified timeout, then the check is updated as passing.
The timeout defaults to 10 seconds, but is configurable using the `timeout` field. If TLS is enabled a valid
certificate is required, unless `tls_skip_verify` is set to `true`.
The check will be run on the interval specified by the `interval` field.
- `Alias` - These checks alias the health state of another registered - [`Alias`](#alias-check) - These checks alias the health state of another registered
node or service. The state of the check will be updated asynchronously, but is node or service.
nearly instant. For aliased services on the same agent, the local state is monitored
and no additional network resources are consumed. For other services and nodes,
the check maintains a blocking query over the agent's connection with a current
server and allows stale requests. If there are any errors in watching the aliased
node or service, the check state will be critical. For the blocking query, the
check will use the ACL token set on the service or check definition or otherwise
will fall back to the default ACL token set with the agent (`acl_token`).
## Check Definition
A script check:
=======
Review the [service health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks)
to get a more complete example on how to leverage health check capabilities in Consul.
## Registering a health check ## Registering a health check
@ -181,7 +76,7 @@ automatically monitor the health of a service instance or node.
to temporarily remove one or all service instances on a node to temporarily remove one or all service instances on a node
from service discovery DNS and HTTP API query results. from service discovery DNS and HTTP API query results.
### Script check ((#script-interval)) ### Script check
Script checks periodically invoke an external application that performs the health check, Script checks periodically invoke an external application that performs the health check,
exits with an appropriate exit code, and potentially generates some output. exits with an appropriate exit code, and potentially generates some output.
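For illustration only, a script check matching this description might look like the sketch below. The script path, `id`, and `name` values are hypothetical, and the `args` field name is assumed from Consul's check definition syntax rather than taken from this diff.

```hcl
# Hypothetical script check: runs a local script on a fixed interval.
check = {
  id       = "mem-util"                                          # hypothetical check ID
  name     = "Memory utilization"                                # hypothetical display name
  args     = ["/usr/local/bin/check_mem.py", "-limit", "256MB"]  # hypothetical script and arguments
  interval = "10s"                                               # time between invocations
  timeout  = "1s"                                                # overrides the 30 second default
}
```

The agent only runs a check like this when script checks are enabled in its configuration.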
@ -255,7 +150,7 @@ Any output of the script is captured and made available in the
`Output` field of checks included in HTTP API responses, `Output` field of checks included in HTTP API responses,
as in this example from the [local service health endpoint](/api-docs/agent/service#by-name-json). as in this example from the [local service health endpoint](/api-docs/agent/service#by-name-json).
### HTTP check ((#http-interval)) ### HTTP check
HTTP checks periodically make an HTTP `GET` request to the specified URL, HTTP checks periodically make an HTTP `GET` request to the specified URL,
waiting the specified `interval` amount of time between requests. waiting the specified `interval` amount of time between requests.
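A minimal sketch of an HTTP check follows, assuming the `http` field carries the URL. The URL, `id`, and `name` are hypothetical. The `header`, `timeout`, `tls_skip_verify`, and `disable_redirects` fields are the ones named in the text removed by this commit.

```hcl
# Hypothetical HTTP check: sends a GET request every 10 seconds.
check = {
  id                = "api"                            # hypothetical check ID
  name              = "HTTP API on port 5000"          # hypothetical display name
  http              = "https://localhost:5000/health"  # hypothetical URL
  header            = { "x-foo" = ["bar", "baz"] }     # map of lists of strings
  interval          = "10s"                            # time between requests
  timeout           = "1s"                             # overrides the 10 second default
  tls_skip_verify   = false                            # keep certificate verification on
  disable_redirects = true                             # do not follow HTTP redirects
}
```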
@ -324,7 +219,7 @@ check = {
</CodeTabs> </CodeTabs>
### TCP check ((#tcp-interval)) ### TCP check
TCP checks periodically make a TCP connection attempt to the specified IP/hostname and port, waiting `interval` amount of time between attempts. TCP checks periodically make a TCP connection attempt to the specified IP/hostname and port, waiting `interval` amount of time between attempts.
If no hostname is specified, it defaults to "localhost". If no hostname is specified, it defaults to "localhost".
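As a sketch under the same assumptions, a TCP check could be declared as shown here. The address, `id`, and `name` are hypothetical, and the `tcp` field name is assumed from Consul's check definition syntax.

```hcl
# Hypothetical TCP check: attempts a connection every 10 seconds.
check = {
  id       = "ssh"                 # hypothetical check ID
  name     = "SSH TCP on port 22"  # hypothetical display name
  tcp      = "localhost:22"        # hypothetical address and port
  interval = "10s"                 # time between connection attempts
  timeout  = "1s"                  # overrides the 10 second default
}
```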
@ -368,7 +263,7 @@ check = {
</CodeTabs> </CodeTabs>
### UDP check ((#udp-interval)) ### UDP check
UDP checks periodically direct the Consul agent to send UDP datagrams UDP checks periodically direct the Consul agent to send UDP datagrams
to the specified IP/hostname and port, to the specified IP/hostname and port,
@ -416,7 +311,8 @@ OSService checks periodically direct the Consul agent to monitor the health of a
the host operating system as either a Windows service (Windows) or a SystemD service (Unix). the host operating system as either a Windows service (Windows) or a SystemD service (Unix).
The check is logged as `healthy` if the service is running. The check is logged as `healthy` if the service is running.
If it is stopped or not running, the status is `critical`. All other results set If it is stopped or not running, the status is `critical`. All other results set
the status to `warning`, which indicates that the check is not reliable because an issue is preventing the check from determining the health of the service. the status to `warning`, which indicates that the check is not reliable because
an issue is preventing the check from determining the health of the service.
The following service definition file snippet is an example The following service definition file snippet is an example
of an OSService check definition: of an OSService check definition:
@ -447,9 +343,10 @@ check = {
</CodeTabs> </CodeTabs>
### Time to live (TTL) check ((#ttl)) ### Time to live (TTL) check
TTL checks retain their last known state for the specified `ttl` duration. TTL checks retain their last known state for the specified `ttl` duration.
The state of the check updates periodically over the HTTP interface.
If the `ttl` duration elapses before a new check update If the `ttl` duration elapses before a new check update
is provided over the HTTP interface, is provided over the HTTP interface,
the check is set to `critical` state. the check is set to `critical` state.
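The behavior described above can be sketched with a definition like the following. The `id`, `name`, and `notes` values are hypothetical. Only the `ttl` field comes from the text itself.

```hcl
# Hypothetical TTL check: the application must report its status
# over the HTTP interface at least once every 30 seconds.
check = {
  id    = "web-app"                              # hypothetical check ID
  name  = "Web App Status"                       # hypothetical display name
  notes = "Web app reports in every 10 seconds"  # hypothetical free-form note
  ttl   = "30s"                                  # check becomes critical if no update arrives in time
}
```

Setting the `ttl` comfortably longer than the application's reporting period leaves room for a delayed update before the check turns `critical`.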
@ -498,7 +395,7 @@ check = {
</CodeTabs> </CodeTabs>
### Docker check ((#docker-interval)) ### Docker check
These checks depend on periodically invoking an external application that These checks depend on periodically invoking an external application that
is packaged within a Docker Container. The application is triggered within the running is packaged within a Docker Container. The application is triggered within the running
@ -511,8 +408,21 @@ has to be performed is configurable, making it possible to run containers which
have different shells on the same host. have different shells on the same host.
The output of a Docker check is limited to 4KB. The output of a Docker check is limited to 4KB.
Larger outputs are truncated. Larger outputs are truncated.
The agent must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
set to `true` in order to enable Docker health checks. Docker checks are not enabled by default.
To enable a Consul agent to perform Docker checks,
use one of the following agent configuration options:
- [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks):
Enable script checks defined in local config files.
Script checks registered using the HTTP API are not allowed.
- [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks):
Enable script checks no matter how they are registered.
!> **Security Warning:**
We recommend using `enable_local_script_checks` instead of `enable_script_checks` in production
environments, as remote script checks are more vulnerable to malware attacks. Learn more about how [script checks can be exploited](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations#how-script-checks-can-be-exploited).
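As a minimal sketch, the more restrictive option could be set in an agent configuration file like this, assuming the agent reads HCL configuration and that the key matches the option name linked above.

```hcl
# Agent configuration fragment (assumed file format: HCL).
# Allows script checks only when they are defined in local config files;
# script checks registered through the HTTP API are rejected.
enable_local_script_checks = true
```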
The following service definition file snippet is an example The following service definition file snippet is an example
of a Docker check definition: of a Docker check definition:
@ -545,7 +455,7 @@ check = {
</CodeTabs> </CodeTabs>
### gRPC check ((##grpc-interval)) ### gRPC check
gRPC checks are intended for applications that support the standard gRPC checks are intended for applications that support the standard
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md). [gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
@ -561,10 +471,10 @@ To enable TLS, set `grpc_use_tls` in the check definition.
If TLS is enabled, then by default, a valid TLS certificate is expected. If TLS is enabled, then by default, a valid TLS certificate is expected.
Certificate verification can be turned off by setting the Certificate verification can be turned off by setting the
`tls_skip_verify` field to `true` in the check definition. `tls_skip_verify` field to `true` in the check definition.
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`. To check on a specific service instead of the whole gRPC server,
add the service identifier after the `gRPC` check's endpoint.
The following service definition file snippet is an example The following example shows a gRPC check for a whole application:
of a gRPC check for a whole application:
<CodeTabs heading="gRPC Check"> <CodeTabs heading="gRPC Check">
@ -592,8 +502,7 @@ check = {
</CodeTabs> </CodeTabs>
The following service definition file snippet is an example The following example shows a gRPC check for the specific `my_service` service:
of a gRPC check for the specific `my_service` service
<CodeTabs heading="gRPC Specific Service Check"> <CodeTabs heading="gRPC Specific Service Check">
@ -621,7 +530,7 @@ check = {
</CodeTabs> </CodeTabs>
### H2ping check ((#h2ping-interval)) ### H2ping check
H2ping checks test an endpoint that uses http2 by connecting to the endpoint H2ping checks test an endpoint that uses http2 by connecting to the endpoint
and sending a ping frame, waiting `interval` amount of time between attempts. and sending a ping frame, waiting `interval` amount of time between attempts.
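A sketch of an H2ping check follows, for illustration only. The address, `id`, and `name` are hypothetical, and the `h2ping` field name is assumed from Consul's check definition syntax. The `h2ping_use_tls`, `timeout`, and `interval` fields are the ones named in the text removed by this commit.

```hcl
# Hypothetical H2ping check: sends an HTTP/2 ping frame every 10 seconds.
check = {
  id             = "h2ping-check"     # hypothetical check ID
  name           = "h2ping"           # hypothetical display name
  h2ping         = "localhost:22222"  # hypothetical address and port
  interval       = "10s"              # time between ping attempts
  timeout        = "1s"               # overrides the 10 second default
  h2ping_use_tls = false              # use h2c instead of the TLS default
}
```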