Website: GH-730 and cleanup for docs/guides/outage.html

Ryan Breen 2015-02-28 23:36:25 -05:00
parent 53ee3ffba2
commit c1e4eb2f2c

@ -3,38 +3,47 @@ layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
Do not panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but is straightforward.
Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but recovery is straightforward.
---
# Outage Recovery
Do not panic! This is a critical first step. Depending on your
Don't panic! This is a critical first step. Depending on your
[deployment configuration](/docs/internals/consensus.html#toc_4), it may
take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but is straightforward.
requires an operator to intervene, but the process is straightforward.
~> This page covers recovery from Consul becoming unavailable due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server [see this page](/docs/guides/servers.html).
add or remove a server, [see this guide](/docs/guides/servers.html).
## Failure of a Single Server Cluster
If you had only a single server and it has failed, simply restart it.
Note that a single server configuration requires the `-bootstrap` or
`-bootstrap-expect 1` flag. If that server cannot be recovered, you need to
bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html). Data loss
is inevitable, since data was not replicated to any other servers. This
is why a single server deploy is never recommended. Any services registered
with agents will be re-populated when the new server comes online, as
agents perform anti-entropy.
Note that a single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect 1`](/docs/agent/options.html#_bootstrap_expect) flag. If
the server cannot be recovered, you need to bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
In a multi-server deploy, there are at least N remaining servers. The first step
is to simply stop all the servers. You can attempt a graceful leave, but
it will not work in most cases. Do not worry if the leave exits with an
error, since the cluster is in an unhealthy state.
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deploy is never recommended.
The next step is to go to the `-data-dir` of each Consul server. Inside
that directory, there will be a `raft/` sub-directory. We need to edit
the `raft/peers.json` file. It should be something like:
Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.
## Failure of a Server in a Multi-Server Cluster
In a multi-server deploy, there are at least N remaining servers. The first
step is to simply stop all the servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to edit the `raft/peers.json` file. It should look
something like:
```javascript
[
@ -45,29 +54,30 @@ the `raft/peers.json` file. It should be something like:
```
Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed, and will not later rejoin the cluster.
those servers have indeed failed and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
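The edit to `raft/peers.json` described above can be sketched in Python. This is a minimal illustration, not part of Consul: it assumes the file is a JSON array of `"address:port"` strings (the full example is elided in this diff), and the function names `remaining_peers` and `prune_peers_file` are hypothetical. Run it only while the servers are stopped, and only for servers you have confirmed are gone for good.

```python
import json
from pathlib import Path


def remaining_peers(peers, failed):
    """Return the peer entries that are not in `failed`, preserving order."""
    failed_set = set(failed)
    return [peer for peer in peers if peer not in failed_set]


def prune_peers_file(peers_path, failed):
    """Rewrite a peers.json file without the failed servers.

    Assumes the file holds a JSON array of "address:port" strings.
    Returns the list that was written back.
    """
    path = Path(peers_path)
    peers = json.loads(path.read_text())
    kept = remaining_peers(peers, failed)
    path.write_text(json.dumps(kept, indent=2) + "\n")
    return kept
```

Applying the same call with the same `failed` list on every remaining server is one way to keep the file identical across nodes.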
At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the `join` command:
the cluster using the [`join`](/docs/commands/join.html) command:
```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```
It should be noted that any existing member can be used to rejoin the cluster,
It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.
At this point the cluster should be in an operable state again. One of the
At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:
```text
[INFO] consul: cluster leadership acquired
```
Additionally, the `info` command can be a useful debugging tool:
Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:
```text
$ consul info
@ -86,7 +96,7 @@ raft:
...
```
You should verify that one server claims to be the `Leader`, and all the
You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
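
The verification described above (exactly one `Leader`, all others `Follower`, and a peer count of N-1 on every server) can be expressed as a small check. This is a sketch under stated assumptions: the input is a hand-collected mapping of server name to the raft state and peer count read from each server's `consul info` output, and the keys `state` and `num_peers` as well as the function name `check_raft_state` are hypothetical choices for this example.

```python
def check_raft_state(servers):
    """Return a list of problems found in the collected raft status.

    `servers` maps a server name to a dict with two assumed keys:
    "state" ("Leader" or "Follower") and "num_peers" (int).
    An empty list means the cluster looks consistent.
    """
    problems = []
    expected_peers = len(servers) - 1  # a server does not count itself

    leaders = [name for name, info in servers.items()
               if info["state"] == "Leader"]
    if len(leaders) != 1:
        problems.append(f"expected exactly 1 leader, found {len(leaders)}")

    for name, info in servers.items():
        if info["state"] not in ("Leader", "Follower"):
            problems.append(f"{name}: unexpected state {info['state']}")
        if info["num_peers"] != expected_peers:
            problems.append(
                f"{name}: num_peers is {info['num_peers']}, "
                f"expected {expected_peers}"
            )
    return problems
```

For a three-server cluster, each server would be expected to report two peers.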