Updated autopilot guide (#4997)

* Updated the autopilot guide to prepare for the migration to the learn platform.

* Correcting typos
This commit is contained in:
kaitlincarter-hc 2018-11-27 09:10:44 -08:00 committed by GitHub
parent 7d32989e52
commit 8881d352aa
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 182 additions and 101 deletions

View File

@ -8,7 +8,7 @@ description: |-
# Autopilot
Autopilot is a set of new features added in Consul 0.8 to allow for automatic
Autopilot features allow for automatic,
operator-friendly management of Consul servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server introduction.
@ -19,11 +19,25 @@ the Agent configuration must be set to 3 or higher on all servers. In Consul
information, see the [Version Upgrade section](/docs/upgrade-specific.html#raft_protocol)
on Raft Protocol versions.
## Configuration
In this guide we will learn more about Autopilot's features.
* Dead server cleanup
* Server Stabilization
* Redundancy zone tags
* Upgrade migration
Finally, we will review how to ensure Autopilot is healthy.
Note, in this guide we are using examples from a Consul 1.4 cluster, we
are starting with Autopilot enabled by default.
## Default Configuration
The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/options.html#autopilot) when initially
bootstrapping the cluster:
bootstrapping the cluster. Since Autopilot and it's features are already
enabled, we only need to update the configuration to disable them. The
following are the defaults.
```
{
@ -31,16 +45,22 @@ bootstrapping the cluster:
"last_contact_threshold": "200ms",
"max_trailing_logs": 250,
"server_stabilization_time": "10s",
"redundancy_zone_tag": "az",
"redundancy_zone_tag": "",
"disable_upgrade_migration": false,
"upgrade_version_tag": ""
}
```
All Consul servers should have Autopilot and its features either enabled
or disabled to ensure consistency accross servers in case of a failure. Additionally,
Autopilot must be enabled to use any of the features, but the features themselves
can be configured independently. Meaning you can enable or disable any of the features
separately, at any time.
After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator/autopilot.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#autopilot-configuration)
HTTP endpoint:
HTTP endpoint.
```
$ consul operator autopilot get-config
@ -51,7 +71,29 @@ ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```
In the example above, we used the `operator autopilot get-config` subcommand to check
the autopilot configuration. You can see we still have all the defaults.
## Dead Server Cleanup
If Autopilot is disabled, it will take 72 hours for dead servers to be automatically reaped
or an operator had to script a `consul force-leave`. If another server failure occurred
it could jeopardize the quorum, even if the failed Consul server had been automatically
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Consul server comes online. When servers are removed
by the cleanup process they will enter the "left" state.
With Autopilot's dead server cleanup enabled, dead servers will periodically be
cleaned up and removed from the Raft peer set to prevent them from interfering with
the quorum size and leader elections. The cleanup process will also be automatically
triggered whenever a new server is successfully added to the cluster.
To update the dead server cleanup feature use `consul operator autopilot set-config`
with the `-cleanup-dead-servers` flag.
```sh
$ consul operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!
@ -65,34 +107,143 @@ DisableUpgradeMigration = false
UpgradeVersionTag = ""
```
## Dead Server Cleanup
We have disabled dead server cleanup, but sill have all the other Autopilot defaults.
Dead servers will periodically be cleaned up and removed from the Raft peer
set, to prevent them from interfering with the quorum size and leader elections.
This cleanup will also happen whenever a new server is successfully added to the
cluster.
## Server Stabalization
Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped,
or operators had to script a `consul force-leave`. If another server failure occurred,
it could jeopardize the quorum, even if the failed Consul server had been automatically
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Consul server comes online. When servers are removed
by the cleanup process they will enter the "left" state.
When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.
This option can be disabled by running `consul operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.
```sh
consul operator autopilot set-config -server-stabalization-time=5s
Configuration updated!
$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```
Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds.
When a new server is added to our cluster, it will only need to be healthy and stable for
5 seconds.
## Redundancy Zones
Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.
If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
is used when starting a server, that server's redundancy zone will be `east1a`.
```
$ consul operator autopilot set-config -redundancy-zone-tag=uswest1
Configuration updated!
$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```
For our Autopilot features, we now have disabled dead server cleanup, server stabilization time to 5 seconds, and
the redundancy zone tag is uswest1.
Consul will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.
## Upgrade Migrations
Autopilot in Consul *Enterprise* supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.
```sh
$ consul operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!
$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = ""
```
With upgrade migration enabled, when a new server is added and Autopilot detects that
its Consul version is newer than that of the existing servers, Autopilot will avoid
promoting the new server until enough newer-versioned servers have been added to the
cluster. When the count of new servers equals or exceeds that of the old servers,
Autopilot will begin promoting the new servers to voters and demoting the old servers.
After this is finished, the old servers can be safely removed from the cluster.
To check the consul version of the servers, you can either use the [autopilot health]
(/api/operator.html#autopilot-health) endpoint or the `consul members`
command.
```
$ consul members
Node Address Status Type Build Protocol DC Segment
node1 127.0.0.1:8301 alive server 1.4.0 2 dc1 <all>
node2 127.0.0.1:8703 alive server 1.4.0 2 dc1 <all>
node3 127.0.0.1:8803 alive server 1.4.0 2 dc1 <all>
node4 127.0.0.1:8203 alive server 1.3.0 2 dc1 <all>
```
### Migrations Without a Consul Version Change
The `UpgradeVersionTag` can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.
If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
is used when starting a server, that server's version will be `0.0.2` when considered in
a migration. The upgrade logic will follow semantic versioning and the version string
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
```sh
$ consul operator autopilot set-config -upgrade-version-tag=1.4.0
Configuration updated!
$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = "1.4.0"
```
## Server Health Checking
An internal health check runs on the leader to track the stability of servers.
<br>A server is considered healthy if all of the following conditions are true:
<br>A server is considered healthy if all of the following conditions are true.
- It has a SerfHealth status of 'Alive'
- It has a SerfHealth status of 'Alive'.
- The time since its last contact with the current leader is below
`LastContactThreshold`
- Its latest Raft term matches the leader's term
`LastContactThreshold`.
- Its latest Raft term matches the leader's term.
- The number of Raft log entries it trails the leader by does not exceed
`MaxTrailingLogs`
`MaxTrailingLogs`.
The status of these health checks can be viewed through the [`/v1/operator/autopilot/health`]
(/api/operator.html#autopilot-health) HTTP endpoint, with a top level
@ -136,83 +287,13 @@ $ curl localhost:8500/v1/operator/autopilot/health
}
```
## Stable Server Introduction
## Summary
When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.
In this guide we configured most of the Autopilot features; dead server cleanup, server
stabalization, redunancy zone tags, upgrade migration, and upgrade version tag.
---
~> The following Autopilot features are available only in
[Consul Enterprise](https://www.hashicorp.com/products/consul/) version 0.8.0 and later.
## Server Read Scaling
With the [`-non-voting-server`](/docs/agent/options.html#_non_voting_server) option, a
server can be explicitly marked as a non-voter and will never be promoted to a voting
member. This can be useful when more read scaling is needed; being a non-voter means
that the server will still have data replicated to it, but it will not be part of the
quorum that the leader must wait for before committing log entries.
## Redundancy Zones
Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.
If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
is used when starting a server, that server's redundancy zone will be `east1a`.
Here's an example showing how to configure this:
```
$ consul operator autopilot set-config -redundancy-zone-tag=zone
Configuration updated!
```
Consul will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.
## Upgrade Migrations
Autopilot in Consul Enterprise supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.
When a new server is added and Autopilot detects that its Consul version is newer than
that of the existing servers, Autopilot will avoid promoting the new server until enough
newer-versioned servers have been added to the cluster. When the count of new servers
equals or exceeds that of the old servers, Autopilot will begin promoting the new servers
to voters and demoting the old servers. After this is finished, the old servers can be
safely removed from the cluster.
To check the consul version of the servers, either the [autopilot health]
(/api/operator.html#autopilot-health) endpoint or the `consul members`
command can be used:
```
$ consul members
Node Address Status Type Build Protocol DC
node1 127.0.0.1:8301 alive server 0.7.5 2 dc1
node2 127.0.0.1:8703 alive server 0.7.5 2 dc1
node3 127.0.0.1:8803 alive server 0.7.5 2 dc1
node4 127.0.0.1:8203 alive server 0.8.0 2 dc1
```
### Migrations Without a Consul Version Change
The `UpgradeVersionTag` can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.
If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
is used when starting a server, that server's version will be `0.0.2` when considered in
a migration. The upgrade logic will follow semantic versioning and the version string
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
To learn more about the Autopilot settings we did not configure,
[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold)
and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs),
either read the agent configuration documenation or use the help flag with the
opertor autopilot `consul operator autopilot set-config -h`.