Website: rework docs/guides/outage.html to cover cases where recovery might be easier than manual removal of failed nodes from peers.json.

Ryan Breen 2015-03-01 18:21:33 -05:00
parent 3f971e694b
commit aa9a22df9a
2 changed files with 22 additions and 9 deletions

docs/guides/outage.html

@@ -8,12 +8,14 @@ description: |-
# Outage Recovery
-Don't panic! This is a critical first step. Depending on your
-[deployment configuration](/docs/internals/consensus.html#toc_4), it may
-take only a single server failure for cluster unavailability. Recovery
+Don't panic! This is a critical first step.
+Depending on your
+[deployment configuration](/docs/internals/consensus.html#deployment_table), it
+may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.
-~> This page covers recovery from Consul becoming unavailable due to a majority
+~> This guide is for recovery from a Consul outage due to a majority
of server nodes in a datacenter being lost. If you are just looking to
add or remove a server, [see this guide](/docs/guides/servers.html).
@@ -28,15 +30,26 @@ See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
-why a single server deploy is never recommended.
+why a single server deploy is **never** recommended.
Any services registered with agents will be re-populated when the new server
comes online as agents perform anti-entropy.
## Failure of a Server in a Multi-Server Cluster
-In a multi-server deploy, there are at least N remaining servers. The first
-step is to simply stop all the servers. You can attempt a graceful leave,
+If you think the failed server is recoverable, the easiest option is to bring
+it back online and have it rejoin the cluster, returning the cluster to a fully
+healthy state. Similarly, even if you need to rebuild a new Consul server to
+replace the failed node, you may wish to do that immediately. Keep in mind that
+the rebuilt server needs to have the same IP as the failed server. Again, once
+this server is online, the cluster will return to a fully healthy state.
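
As a sketch of how the rejoin might look (an illustrative aside, not part of
this diff; the surviving server address `10.0.1.6` is a placeholder), the
recovered or rebuilt server can be pointed back at the cluster with
`consul join`:

```text
$ consul join 10.0.1.6
Successfully joined cluster by contacting 1 nodes.
```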
+
+Both of these strategies involve a potentially lengthy time to reboot or rebuild
+a failed server. If this is impractical, if building a new server with the same
+IP isn't an option, or if your failed server is unrecoverable, you need to remove
+the failed server from the `raft/peers.json` file on all remaining servers.
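
For reference (an illustrative sketch, not part of this diff), `raft/peers.json`
in this era of Consul is a JSON array holding the `ip:port` of every Raft peer;
the addresses below are placeholders. Removing the failed server means deleting
its entry from this array on each remaining server:

```text
[
  "10.0.1.5:8300",
  "10.0.1.6:8300",
  "10.0.1.7:8300"
]
```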
+
+To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
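
A graceful leave attempt is simply the CLI's `leave` command run on each server
(shown here as a sketch, not part of this diff); if it fails, the error can be
ignored for the reason given above:

```text
$ consul leave
Graceful leave complete
```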
@@ -76,7 +89,7 @@ nodes should claim leadership and emit a log like:
[INFO] consul: cluster leadership acquired
```
-Additional, the [`info`](/docs/commands/info.html) command can be a useful
+Additionally, the [`info`](/docs/commands/info.html) command can be a useful
debugging tool:
```text

docs/internals/consensus.html

@@ -164,7 +164,7 @@ The three read modes are:
For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
-## Deployment Table
+## <a name="deployment_table"></a>Deployment Table
Below is a table that shows for the number of servers how large the
quorum is, as well as how many node failures can be tolerated. The