Website: rework docs/guides/outage.html to cover cases where recovery might be easier than manual removal of failed nodes from peers.json.

Ryan Breen 2015-03-01 18:21:33 -05:00
parent 3f971e694b
commit aa9a22df9a
2 changed files with 22 additions and 9 deletions

docs/guides/outage.html

@@ -8,12 +8,14 @@ description: |-
# Outage Recovery
-Don't panic! This is a critical first step. Depending on your
-[deployment configuration](/docs/internals/consensus.html#toc_4), it may
-take only a single server failure for cluster unavailability. Recovery
+Don't panic! This is a critical first step.
+Depending on your
+[deployment configuration](/docs/internals/consensus.html#deployment_table), it
+may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.
-~> This page covers recovery from Consul becoming unavailable due to a majority
+~> This guide is for recovery from a Consul outage due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server, [see this guide](/docs/guides/servers.html).
@@ -28,15 +30,26 @@ See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
-why a single server deploy is never recommended.
+why a single server deploy is **never** recommended.
Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.
## Failure of a Server in a Multi-Server Cluster
-In a multi-server deploy, there are at least N remaining servers. The first
-step is to simply stop all the servers. You can attempt a graceful leave,
+If you think the failed server is recoverable, the easiest option is to bring
+it back online and have it rejoin the cluster, returning the cluster to a fully
+healthy state. Similarly, even if you need to rebuild a new Consul server to
+replace the failed node, you may wish to do that immediately. Keep in mind that
+the rebuilt server needs to have the same IP as the failed server. Again, once
+this server is online, the cluster will return to a fully healthy state.
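
As a sketch of how the rejoin might look (an illustrative aside, not part of
this diff; the surviving server address `10.0.1.6` is a placeholder), the
recovered or rebuilt server can be pointed back at the cluster with
`consul join`:

```text
$ consul join 10.0.1.6
Successfully joined cluster by contacting 1 nodes.
```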
+
+Both of these strategies involve a potentially lengthy time to reboot or rebuild
+a failed server. If this is impractical, if building a new server with the same
+IP isn't an option, or if your failed server is unrecoverable, you need to remove
+the failed server from the `raft/peers.json` file on all remaining servers.
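
For reference (an illustrative sketch, not part of this diff), `raft/peers.json`
in this era of Consul is a JSON array holding the `ip:port` of every Raft peer;
the addresses below are placeholders. Removing the failed server means deleting
its entry from this array on each remaining server:

```text
[
  "10.0.1.5:8300",
  "10.0.1.6:8300",
  "10.0.1.7:8300"
]
```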
+
+To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
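
A graceful leave attempt is simply the CLI's `leave` command run on each server
(shown here as a sketch, not part of this diff); if it fails, the error can be
ignored for the reason given above:

```text
$ consul leave
Graceful leave complete
```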
@@ -76,7 +89,7 @@ nodes should claim leadership and emit a log like:
[INFO] consul: cluster leadership acquired
```
-Additional, the [`info`](/docs/commands/info.html) command can be a useful
+Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:
```text

docs/internals/consensus.html

@@ -164,7 +164,7 @@ The three read modes are:
For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
-## Deployment Table
+## <a name="deployment_table"></a>Deployment Table
Below is a table that shows for the number of servers how large the
quorum is, as well as how many node failures can be tolerated. The