Website: rework docs/guides/outage.html to cover cases where recovery might be easier than manual removal of failed nodes from peers.json.

2015-03-01 18:21:33 -05:00 · 2015-03-01 18:21:33 -05:00 · aa9a22df9a
parent 3f971e694b
commit aa9a22df9a
2 changed files with 22 additions and 9 deletions
--- a/website/source/docs/guides/outage.html.markdown
+++ b/website/source/docs/guides/outage.html.markdown
@@ -8,12 +8,14 @@ description: |-
 # Outage Recovery
-Don't panic! This is a critical first step. Depending on your
-[deployment configuration](/docs/internals/consensus.html#toc_4), it may
-take only a single server failure for cluster unavailability. Recovery
+Don't panic! This is a critical first step.
+
+Depending on your
+[deployment configuration](/docs/internals/consensus.html#deployment_table), it
+may take only a single server failure for cluster unavailability. Recovery
 requires an operator to intervene, but the process is straightforward.
-~>  This page covers recovery from Consul becoming unavailable due to a majority
+~>  This guide is for recovery from a Consul outage due to a majority
 of server nodes in a datacenter being lost. If you are just looking to
 add or remove a server, [see this guide](/docs/guides/servers.html).
@@ -28,15 +30,26 @@ See the [bootstrapping guide](/docs/guides/bootstrapping.html) for more detail.
 In the case of an unrecoverable server failure in a single server cluster, data
 loss is inevitable since data was not replicated to any other servers. This is
-why a single server deploy is never recommended.
+why a single server deploy is **never** recommended.
 Any services registered with agents will be re-populated when the new server
 comes online as agents perform anti-entropy.
 ## Failure of a Server in a Multi-Server Cluster
-In a multi-server deploy, there are at least N remaining servers. The first
-step is to simply stop all the servers. You can attempt a graceful leave,
+If you think the failed server is recoverable, the easiest option is to bring
+it back online and have it rejoin the cluster, returning the cluster to a fully
+healthy state. Similarly, even if you need to rebuild a new Consul server to
+replace the failed node, you may wish to do that immediately. Keep in mind that
+the rebuilt server needs to have the same IP as the failed server. Again, once
+this server is online, the cluster will return to a fully healthy state.
+
+Both of these strategies involve a potentially lengthy time to reboot or rebuild
+a failed server. If this is impractical, if building a new server with the same
+IP isn't an option, or if your failed server is unrecoverable, you need to remove
+the failed server from the `raft/peers.json` file on all remaining servers.
+
+To begin, stop all remaining servers. You can attempt a graceful leave,
 but it will not work in most cases. Do not worry if the leave exits with an
 error. The cluster is in an unhealthy state, so this is expected.
@@ -76,7 +89,7 @@ nodes should claim leadership and emit a log like:
 [INFO] consul: cluster leadership acquired
 ```
-Additional, the [`info`](/docs/commands/info.html) command can be a useful
+Additionally, the [`info`](/docs/commands/info.html) command can be a useful
 debugging tool:
 ```text
--- a/website/source/docs/internals/consensus.html.markdown
+++ b/website/source/docs/internals/consensus.html.markdown
@@ -164,7 +164,7 @@ The three read modes are:
 For more documentation about using these various modes, see the [HTTP API](/docs/agent/http.html).
-## Deployment Table
+## <a name="deployment_table"></a>Deployment Table
 Below is a table that shows for the number of servers how large the
 quorum is, as well as how many node failures can be tolerated. The