mirror of https://github.com/status-im/consul.git
Website: rework docs/guides/outage.html to cover cases where recovery might be easier than manual removal of failed nodes from peers.json.
parent 3f971e694b
commit aa9a22df9a
@@ -8,12 +8,14 @@ description: |-
 
 # Outage Recovery
 
-Don't panic! This is a critical first step. Depending on your
-[deployment configuration](/docs/internals/consensus.html#toc_4), it may
-take only a single server failure for cluster unavailability. Recovery
+Don't panic! This is a critical first step.
+
+Depending on your
+[deployment configuration](/docs/internals/consensus.html#deployment_table), it
+may take only a single server failure for cluster unavailability. Recovery
 requires an operator to intervene, but the process is straightforward.
 
-~> This page covers recovery from Consul becoming unavailable due to a majority
+~> This guide is for recovery from a Consul outage due to a majority
 of server nodes in a datacenter being lost. If you are just looking to
 add or remove a server, [see this guide](/docs/guides/servers.html).
 
@@ -28,15 +30,26 @@ See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
 
 In the case of an unrecoverable server failure in a single server cluster, data
 loss is inevitable since data was not replicated to any other servers. This is
-why a single server deploy is never recommended.
+why a single server deploy is **never** recommended.
 
 Any services registered with agents will be re-populated when the new server
 comes online as agents perform anti-entropy.
 
 ## Failure of a Server in a Multi-Server Cluster
 
-In a multi-server deploy, there are at least N remaining servers. The first
-step is to simply stop all the servers. You can attempt a graceful leave,
+If you think the failed server is recoverable, the easiest option is to bring
+it back online and have it rejoin the cluster, returning the cluster to a fully
+healthy state. Similarly, even if you need to rebuild a new Consul server to
+replace the failed node, you may wish to do that immediately. Keep in mind that
+the rebuilt server needs to have the same IP as the failed server. Again, once
+this server is online, the cluster will return to a fully healthy state.
+
+Both of these strategies involve a potentially lengthy time to reboot or rebuild
+a failed server. If this is impractical, if building a new server with the same
+IP isn't an option, or if your failed server is unrecoverable, you need to remove
+the failed server from the `raft/peers.json` file on all remaining servers.
+
+To begin, stop all remaining servers. You can attempt a graceful leave,
 but it will not work in most cases. Do not worry if the leave exits with an
 error. The cluster is in an unhealthy state, so this is expected.
 
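The new text in the hunk above tells the operator to remove the failed server from `raft/peers.json` on every remaining server, after those servers have been stopped. A minimal sketch of that edit follows. It assumes the file holds a JSON array of `ip:port` strings; this diff does not show the file layout, so treat that as an assumption and check a real file first. The function name `remove_failed_peer` and the sample addresses are hypothetical.

```python
import json


def remove_failed_peer(peers_json: str, failed_addr: str) -> str:
    """Return new peers.json text with failed_addr removed.

    Assumption: the file is a JSON array of "ip:port" strings.
    """
    peers = json.loads(peers_json)
    if not isinstance(peers, list):
        raise ValueError("expected a JSON array of peer addresses")

    remaining = [peer for peer in peers if peer != failed_addr]
    if len(remaining) == len(peers):
        # Refuse to write the file back unchanged: a typo in the address
        # should be noticed before the servers are restarted.
        raise ValueError(f"{failed_addr} is not listed as a peer")

    return json.dumps(remaining, indent=2) + "\n"


if __name__ == "__main__":
    # Hypothetical three-server cluster in which 10.0.1.7 has failed.
    before = '["10.0.1.8:8300", "10.0.1.6:8300", "10.0.1.7:8300"]'
    print(remove_failed_peer(before, "10.0.1.7:8300"))
```

The function works on text rather than on a path so that it can be tried against a copy of the file before anything under the data directory is touched. The same result would need to be written on each remaining server.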
@@ -76,7 +89,7 @@ nodes should claim leadership and emit a log like:
 [INFO] consul: cluster leadership acquired
 ```
 
-Additional, the [`info`](/docs/commands/info.html) command can be a useful
+Additionally, the [`info`](/docs/commands/info.html) command can be a useful
 debugging tool:
 
 ```text
@@ -164,7 +164,7 @@ The three read modes are:
 
 For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
 
-## Deployment Table
+## <a name="deployment_table"></a>Deployment Table
 
 Below is a table that shows for the number of servers how large the
 quorum is, as well as how many node failures can be tolerated. The
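The deployment table that the last hunk anchors relates server count to quorum size and tolerated failures. The table itself is cut off by the hunk boundary, so the sketch below only shows the standard majority arithmetic such a table rests on: quorum is a strict majority of the servers, and the failure tolerance is whatever is left over. The function names are illustrative and not part of Consul.

```python
def quorum_size(servers: int) -> int:
    """Smallest strict majority of `servers` voting members."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """Number of servers that can be lost while a majority remains."""
    return servers - quorum_size(servers)


if __name__ == "__main__":
    print("servers  quorum  tolerated failures")
    for n in range(1, 8):
        print(f"{n:>7}  {quorum_size(n):>6}  {failure_tolerance(n):>18}")
```

This is why the rewritten outage guide warns that a single server failure can be enough for unavailability: with one or two servers the tolerance is zero, and going from three servers to four raises the quorum without raising the tolerance.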