From de51780eb87bf060fcebc87a862af667c6acbefc Mon Sep 17 00:00:00 2001
From: Jared Kirschner
Date: Fri, 29 Apr 2022 10:58:03 -0700
Subject: [PATCH] docs: add guidance on improving Consul resilience

Discuss available strategies for improving server-level and
infrastructure-level fault tolerance in Consul.
---
 .../improving-consul-resilience.mdx | 139 ++++++++++++++++++
 website/data/docs-nav-data.json     |   4 +
 2 files changed, 143 insertions(+)
 create mode 100644 website/content/docs/architecture/improving-consul-resilience.mdx

diff --git a/website/content/docs/architecture/improving-consul-resilience.mdx b/website/content/docs/architecture/improving-consul-resilience.mdx
new file mode 100644
index 0000000000..e45feb6169
--- /dev/null
+++ b/website/content/docs/architecture/improving-consul-resilience.mdx
@@ -0,0 +1,139 @@
+---
+layout: docs
+page_title: Improving Consul Resilience
+description: >-
+  Fault tolerance is the ability of a system to continue operating without interruption
+  despite the failure of one or more components. Consul's resilience, or fault tolerance,
+  is determined by the configuration of its voting server agents. Recommended strategies for
+  increasing Consul's fault tolerance include using 3 or 5 voting server agents, spreading
+  server agents across infrastructure availability zones, and using Consul Enterprise
+  redundancy zones to enable backup voting servers to automatically replace lost voters.
+---
+
+# Improving Consul Resilience
+
+Fault tolerance is the ability of a system to continue operating without interruption
+despite the failure of one or more components.
+The most basic production deployment of Consul has 3 server agents and can lose a single
+server without interruption.
+
+As you continue to use Consul, your circumstances may change.
+Perhaps a datacenter becomes more business critical or risk management policies change,
+necessitating an increase in fault tolerance.
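The arithmetic behind claims like "3 servers can lose a single server" can be sketched as follows. This is an illustrative snippet only, not Consul code; the function names are hypothetical:

```python
# Illustrative sketch (not Consul code): how quorum size and fault
# tolerance follow from the number of voting servers under
# majority-based consensus.

def quorum_size(voters: int) -> int:
    """A majority of voting servers must agree on every state change."""
    return voters // 2 + 1

def fault_tolerance(voters: int) -> int:
    """How many voting servers can fail while a majority remains."""
    return voters - quorum_size(voters)

# 3 servers -> quorum 2, tolerates 1 failure;
# 5 servers -> quorum 3, tolerates 2 failures.
for n in (3, 5, 7):
    print(f"{n} servers: quorum {quorum_size(n)}, tolerates {fault_tolerance(n)} failures")
```

Note that even server counts add no tolerance over the next-lower odd count, which is one reason odd cluster sizes are conventional.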
+The sections below discuss options for how to improve Consul's fault tolerance.
+
+## Fault Tolerance in Consul
+
+Consul's fault tolerance is determined by the configuration of its voting server agents.
+
+Each Consul datacenter depends on a set of Consul voting server agents.
+The voting servers ensure Consul has a consistent, fault-tolerant state
+by requiring a majority of voting servers, known as a quorum, to agree upon any state changes.
+Examples of state changes include: adding or removing services,
+adding or removing nodes, and changes in service or node health status.
+
+Without a quorum, Consul experiences an outage:
+it cannot provide most of its capabilities because they rely on
+the availability of this state information.
+If Consul has an outage, normal operation can be restored by following the
+[outage recovery guide](https://learn.hashicorp.com/tutorials/consul/recovery-outage).
+
+If Consul is deployed with 3 servers, the quorum size is 2. The deployment can lose 1
+server and still maintain quorum, so it has a fault tolerance of 1.
+If Consul is instead deployed with 5 servers, the quorum size increases to 3, so
+the fault tolerance increases to 2.
+To learn more about the relationship between the
+number of servers, quorum, and fault tolerance, refer to the
+[consensus protocol documentation](/docs/architecture/consensus#deployment_table).
+
+Effectively mitigating your risk is more nuanced than just increasing the fault tolerance
+metric described above. You must consider:
+
+### Correlated Risks
+
+Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.
+
+### Mitigation Costs
+
+What are the costs of the mitigation?
+Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.
+
+## Strategies to Increase Fault Tolerance
+
+The following sections explore several options for increasing Consul's fault tolerance.
+
+HashiCorp recommends all production deployments consider:
+- [Spreading Consul servers across availability zones](#spread-servers-across-infrastructure-availability-zones)
+- Using backup voting servers to replace lost voters
+
+### Spread Servers Across Infrastructure Availability Zones
+
+The cloud or on-premises infrastructure underlying your [Consul datacenter](/docs/install/glossary#datacenter)
+may be split into several "availability zones".
+An availability zone is meant to share no points of failure with other zones by:
+- Having power, cooling, and networking systems independent from other zones
+- Being physically distant enough from other zones so that large-scale disruptions
+  such as natural disasters (flooding, earthquakes) are very unlikely to affect multiple zones
+
+Availability zones are available in the regions of most cloud providers and in some on-premises installations.
+If possible, spread your Consul voting servers across 3 availability zones
+to protect your Consul datacenter from a single zone-level failure.
+For example, if deploying 5 Consul servers across 3 availability zones, place no more than 2 servers in each zone.
+If one zone fails, at most 2 servers are lost and quorum will be maintained by the 3 remaining servers.
+
+To distribute your Consul servers across availability zones, modify your infrastructure configuration with your infrastructure provider. No change is needed to your Consul server's agent configuration.
+
+Additionally, you should leverage resources that can automatically restore your compute instance,
+such as autoscaling groups, virtual machine scale sets, or compute engine autoscalers.
+The autoscaling resources can be customized to redeploy servers into specific availability zones
+and ensure the desired number of servers is available at all times.
+
+### Add More Voting Servers
+
+For most production use cases, we recommend using either 3 or 5 voting servers,
+yielding a server-level fault tolerance of 1 or 2 respectively.
+
+Even though it would improve fault tolerance,
+adding voting servers beyond 5 is **not recommended** because it decreases Consul's performance:
+it requires Consul to involve more servers in every state change or consistent read.
+
+Consul Enterprise provides a way to improve fault tolerance without this performance penalty:
+[using backup voting servers to replace lost voters](#use-backup-voting-servers-to-replace-lost-voters).
+
+### Use Backup Voting Servers to Replace Lost Voters
+
+Consul Enterprise [redundancy zones](/docs/enterprise/redundancy)
+can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.
+
+Each redundancy zone should be assigned 2 or more Consul servers.
+If all servers are healthy, only one server per redundancy zone will be an active voter;
+all other servers will be backup voters.
+If a zone's voter is lost, it will be replaced by:
+- A backup voter within the same zone, if any. Otherwise,
+- A backup voter within another zone, if any.
+
+Consul can replace lost voters with backup voters within 30 seconds in most cases.
+Because this replacement process is not instantaneous,
+redundancy zones do not improve immediate fault tolerance:
+the number of healthy voting servers that can fail at once without causing an outage.
+Instead, redundancy zones improve optimistic fault tolerance:
+the number of healthy active and backup voting servers that can fail gradually without causing an outage.
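The distinction between the two kinds of fault tolerance can be sketched numerically. This is an illustrative snippet only, not Consul code, and it assumes one active voter per redundancy zone with every remaining server acting as a backup voter:

```python
# Illustrative sketch (not Consul code): immediate vs. optimistic
# fault tolerance with redundancy zones, assuming one active voter
# per zone and all other servers acting as backup voters.

def immediate_fault_tolerance(zones: int) -> int:
    voters = zones               # one active voter per redundancy zone
    quorum = voters // 2 + 1     # majority of active voters
    return voters - quorum       # voters that can fail at once

def optimistic_fault_tolerance(zones: int, servers_per_zone: int) -> int:
    backups = zones * (servers_per_zone - 1)
    return immediate_fault_tolerance(zones) + backups

# 3 zones with 2 servers each: 3 voters (quorum 2), immediate fault
# tolerance 1, plus 3 backup voters -> optimistic fault tolerance 4.
print(immediate_fault_tolerance(3), optimistic_fault_tolerance(3, 2))
```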
+
+The relationship between these two types of fault tolerance is:
+
+_Optimistic fault tolerance = immediate fault tolerance + the number of healthy backup voters_
+
+For example, consider a Consul datacenter with 3 redundancy zones and 2 servers per zone.
+There will be 3 voting servers (1 per zone), meaning a quorum size of 2 and an immediate fault tolerance of 1.
+There will also be 3 backup voters (1 per zone), each of which increases the optimistic fault tolerance by one.
+Therefore, the optimistic fault tolerance is 4.
+This provides performance similar to a 3-server setup with fault tolerance similar to a 7-server setup.
+
+We recommend associating each Consul redundancy zone with an infrastructure availability zone
+to also gain the infrastructure-level fault tolerance benefits provided by availability zones.
+However, Consul redundancy zones can be used even without the backing of infrastructure availability zones.
+
+For more information on redundancy zones, refer to:
+- [Redundancy zone documentation](/docs/enterprise/redundancy)
+  for a more detailed explanation
+- [Redundancy zone tutorial](https://learn.hashicorp.com/tutorials/consul/redundancy-zones?in=consul/enterprise)
+  to learn how to use them
diff --git a/website/data/docs-nav-data.json b/website/data/docs-nav-data.json
index 0251df2549..e2e770419b 100644
--- a/website/data/docs-nav-data.json
+++ b/website/data/docs-nav-data.json
@@ -1086,6 +1086,10 @@
       "title": "Overview",
       "path": "architecture"
     },
+    {
+      "title": "Improving Consul Resilience",
+      "path": "architecture/improving-consul-resilience"
+    },
     {
       "title": "Anti-Entropy",
       "path": "architecture/anti-entropy"