Website: GH-730 and cleanup for docs/guides/outage.html

Ryan Breen 2015-02-28 23:36:25 -05:00
parent 53ee3ffba2
commit c1e4eb2f2c
1 changed file with 35 additions and 25 deletions


@@ -3,38 +3,47 @@ layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
    Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but the process is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step. Depending on your
[deployment configuration](/docs/internals/consensus.html#toc_4), it may
take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

~> This page covers recovery from Consul becoming unavailable due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server, [see this guide](/docs/guides/servers.html).

## Failure of a Single Server Cluster
If you had only a single server and it has failed, simply restart it.
Note that a single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect 1`](/docs/agent/options.html#_bootstrap_expect) flag. If
the server cannot be recovered, you need to bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
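How the agent gets relaunched depends on how it was started in the first place. Purely as a sketch (the data directory and configuration paths below are assumptions, not taken from this guide), restarting a single-server agent might look like:

```text
# Illustrative paths; reuse whatever flags the server was originally started with.
$ consul agent -server -bootstrap-expect 1 \
    -data-dir /var/consul -config-dir /etc/consul.d
```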
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deploy is never recommended.

Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.

## Failure of a Server in a Multi-Server Cluster
In a multi-server deploy, there are at least N remaining servers. The first
step is to simply stop all the servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
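As a concrete sketch, the graceful-leave attempt is just the `leave` command run against each server's local agent; an error at this step usually just means the cluster has already lost its leader:

```text
# Run on each server; an error here is expected and safe to ignore.
$ consul leave
```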
The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to edit the `raft/peers.json` file. It should look
something like:
```javascript
[
@@ -45,29 +54,30 @@ the `raft/peers.json` file. It should be something like:
```
Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
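As an illustration only (the addresses below are hypothetical, and the real file contents are elided in this diff), a `raft/peers.json` for three surviving servers is just a JSON array of their `address:port` entries on the server RPC port (8300 by default), with the failed servers' entries removed:

```javascript
[
  "10.0.1.6:8300",
  "10.0.1.7:8300",
  "10.0.1.8:8300"
]
```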
At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the [`join`](/docs/commands/join.html) command:
```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```
It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.
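If you need to look up the address of a live member to join through, one option is the `members` command against any running agent, which lists the nodes and addresses it knows about:

```text
# Any reachable agent will do; pick the address of a live node to pass to `consul join`.
$ consul members
```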
At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:
```text
[INFO] consul: cluster leadership acquired
```
Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:
```text
$ consul info
@@ -86,7 +96,7 @@ raft:
...
```
You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
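As a rough guide to what to look for (the values here are illustrative, and the exact set of fields varies by Consul version), the `raft` section on the leader of a three-server cluster would include lines like:

```text
raft:
    num_peers = 2
    state = Leader
```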