mirror of https://github.com/status-im/consul.git
Updated autopilot guide (#4997)
* Updated the autopilot guide to prepare for the migration to the learn platform. * Correcting typos
This commit is contained in:
parent
7d32989e52
commit
8881d352aa
|
@ -8,7 +8,7 @@ description: |-
|
|||
|
||||
# Autopilot
|
||||
|
||||
Autopilot is a set of new features added in Consul 0.8 to allow for automatic
|
||||
Autopilot features allow for automatic,
|
||||
operator-friendly management of Consul servers. It includes cleanup of dead
|
||||
servers, monitoring the state of the Raft cluster, and stable server introduction.
|
||||
|
||||
|
@ -19,11 +19,25 @@ the Agent configuration must be set to 3 or higher on all servers. In Consul
|
|||
information, see the [Version Upgrade section](/docs/upgrade-specific.html#raft_protocol)
|
||||
on Raft Protocol versions.
|
||||
|
||||
## Configuration
|
||||
In this guide we will learn more about Autopilot's features.
|
||||
|
||||
* Dead server cleanup
|
||||
* Server Stabilization
|
||||
* Redundancy zone tags
|
||||
* Upgrade migration
|
||||
|
||||
Finally, we will review how to ensure Autopilot is healthy.
|
||||
|
||||
Note, in this guide we are using examples from a Consul 1.4 cluster, we
|
||||
are starting with Autopilot enabled by default.
|
||||
|
||||
## Default Configuration
|
||||
|
||||
The configuration of Autopilot is loaded by the leader from the agent's
|
||||
[Autopilot settings](/docs/agent/options.html#autopilot) when initially
|
||||
bootstrapping the cluster:
|
||||
bootstrapping the cluster. Since Autopilot and it's features are already
|
||||
enabled, we only need to update the configuration to disable them. The
|
||||
following are the defaults.
|
||||
|
||||
```
|
||||
{
|
||||
|
@ -31,16 +45,22 @@ bootstrapping the cluster:
|
|||
"last_contact_threshold": "200ms",
|
||||
"max_trailing_logs": 250,
|
||||
"server_stabilization_time": "10s",
|
||||
"redundancy_zone_tag": "az",
|
||||
"redundancy_zone_tag": "",
|
||||
"disable_upgrade_migration": false,
|
||||
"upgrade_version_tag": ""
|
||||
}
|
||||
```
|
||||
|
||||
All Consul servers should have Autopilot and its features either enabled
|
||||
or disabled to ensure consistency accross servers in case of a failure. Additionally,
|
||||
Autopilot must be enabled to use any of the features, but the features themselves
|
||||
can be configured independently. Meaning you can enable or disable any of the features
|
||||
separately, at any time.
|
||||
|
||||
After bootstrapping, the configuration can be viewed or modified either via the
|
||||
[`operator autopilot`](/docs/commands/operator/autopilot.html) subcommand or the
|
||||
[`/v1/operator/autopilot/configuration`](/api/operator.html#autopilot-configuration)
|
||||
HTTP endpoint:
|
||||
HTTP endpoint.
|
||||
|
||||
```
|
||||
$ consul operator autopilot get-config
|
||||
|
@ -51,7 +71,29 @@ ServerStabilizationTime = 10s
|
|||
RedundancyZoneTag = ""
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
In the example above, we used the `operator autopilot get-config` subcommand to check
|
||||
the autopilot configuration. You can see we still have all the defaults.
|
||||
|
||||
## Dead Server Cleanup
|
||||
|
||||
If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped
|
||||
or an operator had to script a `consul force-leave`. If another server failure occurred
|
||||
it could jeopardize the quorum, even if the failed Consul server had been automatically
|
||||
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
|
||||
servers as soon as a replacement Consul server comes online. When servers are removed
|
||||
by the cleanup process they will enter the "left" state.
|
||||
|
||||
With Autopilot's dead server cleanup enabled, dead servers will periodically be
|
||||
cleaned up and removed from the Raft peer set to prevent them from interfering with
|
||||
the quorum size and leader elections. The cleanup process will also be automatically
|
||||
triggered whenever a new server is successfully added to the cluster.
|
||||
|
||||
To update the dead server cleanup feature use `consul operator autopilot set-config`
|
||||
with the `-cleanup-dead-servers` flag.
|
||||
|
||||
```sh
|
||||
$ consul operator autopilot set-config -cleanup-dead-servers=false
|
||||
Configuration updated!
|
||||
|
||||
|
@ -65,34 +107,143 @@ DisableUpgradeMigration = false
|
|||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
## Dead Server Cleanup
|
||||
We have disabled dead server cleanup, but sill have all the other Autopilot defaults.
|
||||
|
||||
Dead servers will periodically be cleaned up and removed from the Raft peer
|
||||
set, to prevent them from interfering with the quorum size and leader elections.
|
||||
This cleanup will also happen whenever a new server is successfully added to the
|
||||
cluster.
|
||||
## Server Stabalization
|
||||
|
||||
Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped,
|
||||
or operators had to script a `consul force-leave`. If another server failure occurred,
|
||||
it could jeopardize the quorum, even if the failed Consul server had been automatically
|
||||
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
|
||||
servers as soon as a replacement Consul server comes online. When servers are removed
|
||||
by the cleanup process they will enter the "left" state.
|
||||
When a new server is added to the cluster, there is a waiting period where it
|
||||
must be healthy and stable for a certain amount of time before being promoted
|
||||
to a full, voting member. This can be configured via the `ServerStabilizationTime`
|
||||
setting.
|
||||
|
||||
This option can be disabled by running `consul operator autopilot set-config`
|
||||
with the `-cleanup-dead-servers=false` option.
|
||||
```sh
|
||||
consul operator autopilot set-config -server-stabalization-time=5s
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = ""
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds.
|
||||
When a new server is added to our cluster, it will only need to be healthy and stable for
|
||||
5 seconds.
|
||||
|
||||
## Redundancy Zones
|
||||
|
||||
Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
|
||||
isolated failure domains such as AWS Availability Zones; users would be forced to either
|
||||
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
|
||||
deploying just one server in each.
|
||||
|
||||
If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
|
||||
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
|
||||
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
|
||||
is used when starting a server, that server's redundancy zone will be `east1a`.
|
||||
|
||||
```
|
||||
$ consul operator autopilot set-config -redundancy-zone-tag=uswest1
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "uswest1"
|
||||
DisableUpgradeMigration = false
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
For our Autopilot features, we now have disabled dead server cleanup, server stabilization time to 5 seconds, and
|
||||
the redundancy zone tag is uswest1.
|
||||
|
||||
Consul will then use these values to partition the servers by redundancy zone, and will
|
||||
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
|
||||
on standby to be promoted if the active voter leaves or dies.
|
||||
|
||||
## Upgrade Migrations
|
||||
|
||||
Autopilot in Consul *Enterprise* supports upgrade migrations by default. To disable this
|
||||
functionality, set `DisableUpgradeMigration` to true.
|
||||
|
||||
```sh
|
||||
$ consul operator autopilot set-config -disable-upgrade-migration=true
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "uswest1"
|
||||
DisableUpgradeMigration = true
|
||||
UpgradeVersionTag = ""
|
||||
```
|
||||
|
||||
With upgrade migration enabled, when a new server is added and Autopilot detects that
|
||||
its Consul version is newer than that of the existing servers, Autopilot will avoid
|
||||
promoting the new server until enough newer-versioned servers have been added to the
|
||||
cluster. When the count of new servers equals or exceeds that of the old servers,
|
||||
Autopilot will begin promoting the new servers to voters and demoting the old servers.
|
||||
After this is finished, the old servers can be safely removed from the cluster.
|
||||
|
||||
To check the consul version of the servers, you can either use the [autopilot health]
|
||||
(/api/operator.html#autopilot-health) endpoint or the `consul members`
|
||||
command.
|
||||
|
||||
```
|
||||
$ consul members
|
||||
Node Address Status Type Build Protocol DC Segment
|
||||
node1 127.0.0.1:8301 alive server 1.4.0 2 dc1 <all>
|
||||
node2 127.0.0.1:8703 alive server 1.4.0 2 dc1 <all>
|
||||
node3 127.0.0.1:8803 alive server 1.4.0 2 dc1 <all>
|
||||
node4 127.0.0.1:8203 alive server 1.3.0 2 dc1 <all>
|
||||
```
|
||||
|
||||
### Migrations Without a Consul Version Change
|
||||
|
||||
The `UpgradeVersionTag` can be used to override the version information used during
|
||||
a migration, so that the migration logic can be used for updating the cluster when
|
||||
changing configuration.
|
||||
|
||||
If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
|
||||
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
|
||||
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
|
||||
is used when starting a server, that server's version will be `0.0.2` when considered in
|
||||
a migration. The upgrade logic will follow semantic versioning and the version string
|
||||
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
|
||||
|
||||
```sh
|
||||
$ consul operator autopilot set-config -upgrade-version-tag=1.4.0
|
||||
Configuration updated!
|
||||
|
||||
$ consul operator autopilot get-config
|
||||
CleanupDeadServers = false
|
||||
LastContactThreshold = 200ms
|
||||
MaxTrailingLogs = 250
|
||||
ServerStabilizationTime = 5s
|
||||
RedundancyZoneTag = "uswest1"
|
||||
DisableUpgradeMigration = true
|
||||
UpgradeVersionTag = "1.4.0"
|
||||
```
|
||||
|
||||
## Server Health Checking
|
||||
|
||||
An internal health check runs on the leader to track the stability of servers.
|
||||
<br>A server is considered healthy if all of the following conditions are true:
|
||||
<br>A server is considered healthy if all of the following conditions are true.
|
||||
|
||||
- It has a SerfHealth status of 'Alive'
|
||||
- It has a SerfHealth status of 'Alive'.
|
||||
- The time since its last contact with the current leader is below
|
||||
`LastContactThreshold`
|
||||
- Its latest Raft term matches the leader's term
|
||||
`LastContactThreshold`.
|
||||
- Its latest Raft term matches the leader's term.
|
||||
- The number of Raft log entries it trails the leader by does not exceed
|
||||
`MaxTrailingLogs`
|
||||
`MaxTrailingLogs`.
|
||||
|
||||
The status of these health checks can be viewed through the [`/v1/operator/autopilot/health`]
|
||||
(/api/operator.html#autopilot-health) HTTP endpoint, with a top level
|
||||
|
@ -136,83 +287,13 @@ $ curl localhost:8500/v1/operator/autopilot/health
|
|||
}
|
||||
```
|
||||
|
||||
## Stable Server Introduction
|
||||
## Summary
|
||||
|
||||
When a new server is added to the cluster, there is a waiting period where it
|
||||
must be healthy and stable for a certain amount of time before being promoted
|
||||
to a full, voting member. This can be configured via the `ServerStabilizationTime`
|
||||
setting.
|
||||
In this guide we configured most of the Autopilot features; dead server cleanup, server
|
||||
stabalization, redunancy zone tags, upgrade migration, and upgrade version tag.
|
||||
|
||||
---
|
||||
|
||||
~> The following Autopilot features are available only in
|
||||
[Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.8.0 and later.
|
||||
|
||||
## Server Read Scaling
|
||||
|
||||
With the [`-non-voting-server`](/docs/agent/options.html#_non_voting_server) option, a
|
||||
server can be explicitly marked as a non-voter and will never be promoted to a voting
|
||||
member. This can be useful when more read scaling is needed; being a non-voter means
|
||||
that the server will still have data replicated to it, but it will not be part of the
|
||||
quorum that the leader must wait for before committing log entries.
|
||||
|
||||
## Redundancy Zones
|
||||
|
||||
Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
|
||||
isolated failure domains such as AWS Availability Zones; users would be forced to either
|
||||
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
|
||||
deploying just one server in each.
|
||||
|
||||
If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
|
||||
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
|
||||
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
|
||||
is used when starting a server, that server's redundancy zone will be `east1a`.
|
||||
|
||||
Here's an example showing how to configure this:
|
||||
|
||||
```
|
||||
$ consul operator autopilot set-config -redundancy-zone-tag=zone
|
||||
Configuration updated!
|
||||
```
|
||||
|
||||
Consul will then use these values to partition the servers by redundancy zone, and will
|
||||
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
|
||||
on standby to be promoted if the active voter leaves or dies.
|
||||
|
||||
## Upgrade Migrations
|
||||
|
||||
Autopilot in Consul Enterprise supports upgrade migrations by default. To disable this
|
||||
functionality, set `DisableUpgradeMigration` to true.
|
||||
|
||||
When a new server is added and Autopilot detects that its Consul version is newer than
|
||||
that of the existing servers, Autopilot will avoid promoting the new server until enough
|
||||
newer-versioned servers have been added to the cluster. When the count of new servers
|
||||
equals or exceeds that of the old servers, Autopilot will begin promoting the new servers
|
||||
to voters and demoting the old servers. After this is finished, the old servers can be
|
||||
safely removed from the cluster.
|
||||
|
||||
To check the consul version of the servers, either the [autopilot health]
|
||||
(/api/operator.html#autopilot-health) endpoint or the `consul members`
|
||||
command can be used:
|
||||
|
||||
```
|
||||
$ consul members
|
||||
Node Address Status Type Build Protocol DC
|
||||
node1 127.0.0.1:8301 alive server 0.7.5 2 dc1
|
||||
node2 127.0.0.1:8703 alive server 0.7.5 2 dc1
|
||||
node3 127.0.0.1:8803 alive server 0.7.5 2 dc1
|
||||
node4 127.0.0.1:8203 alive server 0.8.0 2 dc1
|
||||
```
|
||||
|
||||
### Migrations Without a Consul Version Change
|
||||
|
||||
The `UpgradeVersionTag` can be used to override the version information used during
|
||||
a migration, so that the migration logic can be used for updating the cluster when
|
||||
changing configuration.
|
||||
|
||||
If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
|
||||
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
|
||||
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
|
||||
is used when starting a server, that server's version will be `0.0.2` when considered in
|
||||
a migration. The upgrade logic will follow semantic versioning and the version string
|
||||
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
|
||||
To learn more about the Autopilot settings we did not configure,
|
||||
[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold)
|
||||
and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs),
|
||||
either read the agent configuration documenation or use the help flag with the
|
||||
opertor autopilot `consul operator autopilot set-config -h`.
|
||||
|
|
Loading…
Reference in New Issue