Outage Recover Guide Updates. (#5003)

* Adding examples and a summary. Minor structure updates.

* Added a link to the deployment guide, but needed to remove a sentence referring to a guide that's not published yet.

* fixed typo
This commit is contained in:
kaitlincarter-hc 2018-12-11 14:51:24 -06:00 committed by GitHub
parent cc57615445
commit 6d81117544
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

View File

@ -15,9 +15,19 @@ Depending on your
may take only a single server failure for cluster unavailability. Recovery may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward. requires an operator to intervene, but the process is straightforward.
~> This guide is for recovery from a Consul outage due to a majority This guide is for recovery from a Consul outage due to a majority
of server nodes in a datacenter being lost. If you are just looking to of server nodes in a datacenter being lost. There are several types
add or remove a server, [see this guide](/docs/guides/servers.html). of outages, depending on the number of server nodes and number of failed
server nodes. We will outline how to recover from:
* Failure of a Single Server Cluster. This is when you have a single Consul
server and it fails.
* Failure of a Server in a Multi-Server Cluster. This is when one server fails,
the Consul cluster has 3 or more servers.
* Failure of Multiple Servers in a Multi-Server Cluster. This when more than one
Consul server fails in a cluster of 3 or more servers. This scenario is potentially
the most serious, because it can result in data loss.
## Failure of a Single Server Cluster ## Failure of a Single Server Cluster
@ -25,13 +35,18 @@ If you had only a single server and it has failed, simply restart it. A
single server configuration requires the single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or [`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect=1`](/docs/agent/options.html#_bootstrap_expect) [`-bootstrap-expect=1`](/docs/agent/options.html#_bootstrap_expect)
flag. If the server cannot be recovered, you need to bring up a new flag.
server. See the [bootstrapping guide](/docs/guides/bootstrapping.html)
for more detail.
In the case of an unrecoverable server failure in a single server cluster, data ```sh
loss is inevitable since data was not replicated to any other servers. This is consul agent -bootstrap-expect=1
why a single server deploy is **never** recommended. ```
If the server cannot be recovered, you need to bring up a new
server using the [deployment guide](https://www.consul.io/docs/guides/deployment-guide.html).
In the case of an unrecoverable server failure in a single server cluster and
no backup procedure, data loss is inevitable since data was not replicated
to any other servers. This is why a single server deploy is **never** recommended.
Any services registered with agents will be re-populated when the new server Any services registered with agents will be re-populated when the new server
comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html). comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html).
@ -46,12 +61,20 @@ Keep in mind that the rebuilt server needs to have the same IP address as the fa
server. Again, once this server is online and has rejoined, the cluster will return server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state. to a fully healthy state.
```sh
consul agent -bootstrap-expect=3 -bind=192.172.2.4 -auto-rejoin=192.172.2.3
```
Both of these strategies involve a potentially lengthy time to reboot or rebuild Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed
server if it's still a member of the cluster. server if it's still a member of the cluster.
```sh
consul force-leave <node.name.consul>
```
If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the
server, you have two methods available to remove it, depending on your version of Consul: server, you have two methods available to remove it, depending on your version of Consul:
@ -67,10 +90,10 @@ command to inspect the Raft configuration:
``` ```
$ consul operator raft list-peers $ consul operator raft list-peers
Node ID Address State Voter Node ID Address State Voter RaftProtocol
alice 10.0.1.8:8300 10.0.1.8:8300 follower true alice 10.0.1.8:8300 10.0.1.8:8300 follower true 3
bob 10.0.1.6:8300 10.0.1.6:8300 leader true bob 10.0.1.6:8300 10.0.1.6:8300 leader true 3
carol 10.0.1.7:8300 10.0.1.7:8300 follower true carol 10.0.1.7:8300 10.0.1.7:8300 follower true 3
``` ```
## Failure of Multiple Servers in a Multi-Server Cluster ## Failure of Multiple Servers in a Multi-Server Cluster
@ -82,7 +105,7 @@ servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure. it's also possible to commit data that was uncommitted before the failure.
See the [section below](#peers.json) for details of the recovery procedure. You See the section below on manual recovery using peers.json for details of the recovery procedure. You
simply include just the remaining servers in the `raft/peers.json` recovery file. simply include just the remaining servers in the `raft/peers.json` recovery file.
The cluster should be able to elect a leader once the remaining servers are all The cluster should be able to elect a leader once the remaining servers are all
restarted with an identical `raft/peers.json` configuration. restarted with an identical `raft/peers.json` configuration.
@ -90,6 +113,10 @@ restarted with an identical `raft/peers.json` configuration.
Any new servers you introduce later can be fresh with totally clean data directories Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command. and joined using Consul's `join` command.
```sh
consul agent -join=192.172.2.3
```
In extreme cases, it should be possible to recover with just a single remaining In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file. `raft/peers.json` recovery file.
@ -103,7 +130,7 @@ all Raft log entries, so should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available. should allow recovery from any situation where there's some cluster data available.
<a name="peers.json"></a> <a name="peers.json"></a>
## Manual Recovery Using peers.json ### Manual Recovery Using peers.json
To begin, stop all remaining servers. You can attempt a graceful leave, To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an but it will not work in most cases. Do not worry if the leave exits with an
@ -217,7 +244,14 @@ command to inspect the Raft configuration:
``` ```
$ consul operator raft list-peers $ consul operator raft list-peers
Node ID Address State Voter RaftProtocol Node ID Address State Voter RaftProtocol
alice 10.0.1.8:8300 10.0.1.8:8300 follower true 2 alice 10.0.1.8:8300 10.0.1.8:8300 follower true 3
bob 10.0.1.6:8300 10.0.1.6:8300 leader true 2 bob 10.0.1.6:8300 10.0.1.6:8300 leader true 3
carol 10.0.1.7:8300 10.0.1.7:8300 follower true 2 carol 10.0.1.7:8300 10.0.1.7:8300 follower true 3
``` ```
## Summary
In this guided we reviewed how to recover from a Consul server outage. Depending on the
quorum size and number of failed servers, the recovery process will vary. In the event of
complete failure it is beneficial to have a
[backup process](https://www.consul.io/docs/guides/deployment-guide.html#backups).