mirror of https://github.com/status-im/consul.git
Website: GH-730 and cleanup for docs/guides/outage.html
parent 53ee3ffba2
commit c1e4eb2f2c
---
layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but the process is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step. Depending on your
[deployment configuration](/docs/internals/consensus.html#toc_4), it may
take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

~> This page covers recovery from Consul becoming unavailable due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server, [see this guide](/docs/guides/servers.html).

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it.
Note that a single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect 1`](/docs/agent/options.html#_bootstrap_expect) flag. If
the server cannot be recovered, you need to bring up a new server.
See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
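
As a rough sketch of bringing up a replacement (the data and config paths below are illustrative assumptions, not values from this guide), a new single-server agent might be started with:

```text
$ consul agent -server -bootstrap-expect 1 \
    -data-dir /var/consul -config-dir /etc/consul.d
```

With `-bootstrap-expect 1`, the new server can elect itself leader as soon as it starts.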

In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deploy is never recommended.

Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.

## Failure of a Server in a Multi-Server Cluster

In a multi-server deploy, there are at least N remaining servers. The first
step is to simply stop all the servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
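
How the agents are stopped depends on how they are run, which this guide does not specify; as a minimal sketch (both commands below are assumptions), it might look like:

```text
$ consul leave     # attempt a graceful leave; an error here is expected
$ pkill consul     # otherwise, stop the agent process directly
```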

The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to edit the `raft/peers.json` file. It should look
something like:

```javascript
[
  ...
]
```

Simply delete the entries for all the failed servers. You must confirm
those servers have indeed failed and will not later rejoin the cluster.
Ensure that this file is the same across all remaining server nodes.
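
For illustration only (the addresses and data directory below are placeholders, not values from this guide), if a three-server cluster lost one member, each remaining node would be left with a file like:

```text
$ cat /var/consul/raft/peers.json
[
  "10.0.1.5:8300",
  "10.0.1.6:8300"
]
```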

At this point, you can restart all the remaining servers. If any servers
managed to perform a graceful leave, you may need to have them rejoin
the cluster using the [`join`](/docs/commands/join.html) command:

```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.
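
To double-check that the servers have rediscovered each other, `consul members` on any node lists the cluster as that node sees it. A rough sketch of the kind of output to expect (node names and addresses are placeholders, and the exact columns vary by version):

```text
$ consul members
Node     Address          Status  Type    ...
node-a   10.0.1.5:8301    alive   server  ...
node-b   10.0.1.6:8301    alive   server  ...
```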

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] consul: cluster leadership acquired
```

Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:

```text
$ consul info
...
```

You should verify that one server claims to be the `Leader` and all the
others should be in the `Follower` state. All the nodes should agree on the
peer count as well. This count is (N-1), since a server does not count itself
as a peer.
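
As a sketch of what to look for (the values are placeholders, and the exact fields vary by Consul version), the `raft` section of `consul info` on the leader of a three-server cluster would include lines like:

```text
raft:
    num_peers = 2
    state = Leader
```

Here `num_peers = 2` is the (N-1) count for N = 3 servers; the two followers should report `state = Follower` with the same peer count.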