mirror of https://github.com/status-im/consul.git
140 lines
7.9 KiB
Plaintext
140 lines
7.9 KiB
Plaintext
---
|
|
layout: docs
|
|
page_title: Improving Consul Resilience
|
|
description: >-
|
|
Fault tolerance is the ability of a system to continue operating without interruption
|
|
despite the failure of one or more components. Consul's resilience, or fault tolerance,
|
|
is determined by the configuring of its voting server agents. Recommended strategies for
|
|
increasing Consul's fault tolerance include using 3 or 5 voting server agents, spreading
|
|
server agents across infrastructure availability zones, and using Consul Enterprise
|
|
redundancy zones to enable backup voting servers to automatically replace lost voters.
|
|
---
|
|
|
|
# Improving Consul Resilience
|
|
|
|
Fault tolerance is the ability of a system to continue operating without interruption
|
|
despite the failure of one or more components.
|
|
The most basic production deployment of Consul has 3 server agents and can lose a single
|
|
server without interruption.
|
|
|
|
As you continue to use Consul, your circumstances may change.
|
|
Perhaps a datacenter becomes more business critical or risk management policies change,
|
|
necessitating an increase in fault tolerance.
|
|
The sections below discuss options for how to improve Consul's fault tolerance.
|
|
|
|
## Fault Tolerance in Consul
|
|
|
|
Consul's fault tolerance is determined by the configuration of its voting server agents.
|
|
|
|
Each Consul datacenter depends on a set of Consul voting server agents.
|
|
The voting servers ensure Consul has a consistent, fault-tolerant state
|
|
by requiring a majority of voting servers, known as a quorum, to agree upon any state changes.
|
|
Examples of state changes include: adding or removing services,
|
|
adding or removing nodes, and changes in service or node health status.
|
|
|
|
Without a quorum, Consul experiences an outage:
|
|
it cannot provide most of its capabilities because they rely on
|
|
the availability of this state information.
|
|
If Consul has an outage, normal operation can be restored by following the
|
|
[outage recovery guide](https://learn.hashicorp.com/tutorials/consul/recovery-outage).
|
|
|
|
If Consul is deployed with 3 servers, the quorum size is 2. The deployment can lose 1
|
|
server and still maintain quorum, so it has a fault tolerance of 1.
|
|
If Consul is instead deployed with 5 servers, the quorum size increases to 3, so
|
|
the fault tolerance increases to 2.
|
|
To learn more about the relationship between the
|
|
number of servers, quorum, and fault tolerance, refer to the
|
|
[consensus protocol documentation](/docs/architecture/consensus#deployment_table).
|
|
|
|
Effectively mitigating your risk is more nuanced than just increasing the fault tolerance
|
|
metric described above. You must consider:
|
|
|
|
### Correlated Risks
|
|
|
|
Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.
|
|
|
|
### Mitigation Costs
|
|
|
|
What are the costs of the mitigation? Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.
|
|
|
|
## Strategies to Increase Fault Tolerance
|
|
|
|
The following sections explore several options for increasing Consul's fault tolerance.
|
|
|
|
HashiCorp recommends all production deployments consider:
|
|
- [Spreading Consul servers across availability zones](#spread-servers-across-infrastructure-availability-zones)
|
|
- <EnterpriseAlert inline /><a href="#use-backup-voting-servers-to-replace-lost-voters">Using backup voting servers to replace lost voters</a>
|
|
|
|
### Spread Servers Across Infrastructure Availability Zones
|
|
|
|
The cloud or on-premise infrastructure underlying your [Consul datacenter](/docs/install/glossary#datacenter)
|
|
may be split into several "availability zones".
|
|
An availability zone is meant to share no points of failure with other zones by:
|
|
- Having power, cooling, and networking systems independent from other zones
|
|
- Being physically distant enough from other zones so that large-scale disruptions
|
|
such as natural disasters (flooding, earthquakes) are very unlikely to affect multiple zones
|
|
|
|
Availability zones are available in the regions of most cloud providers and in some on-premise installations.
|
|
If possible, spread your Consul voting servers across 3 availability zones
|
|
to protect your Consul datacenter from a single zone-level failure.
|
|
For example, if deploying 5 Consul servers across 3 availability zones, place no more than 2 servers in each zone.
|
|
If one zone fails, at most 2 servers are lost and quorum will be maintained by the 3 remaining servers.
|
|
|
|
To distribute your Consul servers across availability zones, modify your infrastructure configuration with your infrastructure provider. No change is needed to your Consul server's agent configuration.
|
|
|
|
Additionally, you should leverage resources that can automatically restore your compute instance,
|
|
such as autoscaling groups, virtual machine scale sets, or compute engine autoscaler.
|
|
The autoscaling resources can be customized to re-deploy servers into specific availability zones
|
|
and ensure the desired numbers of servers are available at all time.
|
|
|
|
### Add More Voting Servers
|
|
|
|
For most production use cases, we recommend using either 3 or 5 voting servers,
|
|
yielding a server-level fault tolerance of 1 or 2 respectively.
|
|
|
|
Even though it would improve fault tolerance,
|
|
adding voting servers beyond 5 is **not recommended** because it decreases Consul's performance—
|
|
it requires Consul to involve more servers in every state change or consistent read.
|
|
|
|
Consul Enterprise provides a way to improve fault tolerance without this performance penalty:
|
|
[using backup voting servers to replace lost voters](#use-backup-voting-servers-to-replace-lost-voters).
|
|
|
|
### <EnterpriseAlert inline /> Use Backup Voting Servers to Replace Lost Voters
|
|
|
|
Consul Enterprise [redundancy zones](/docs/enterprise/redundancy)
|
|
can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.
|
|
|
|
Each redundancy zone should be assigned 2 or more Consul servers.
|
|
If all servers are healthy, only one server per redundancy zone will be an active voter;
|
|
all other servers will be backup voters.
|
|
If a zone's voter is lost, it will be replaced by:
|
|
- A backup voter within the same zone, if any. Otherwise,
|
|
- A backup voter within another zone, if any.
|
|
|
|
Consul can replace lost voters with backup voters within 30 seconds in most cases.
|
|
Because this replacement process is not instantaneous,
|
|
redundancy zones do not improve immediate fault tolerance—
|
|
the number of healthy voting servers that can fail at once without causing an outage.
|
|
Instead, redundancy zones improve optimistic fault tolerance:
|
|
the number of healthy active and back-up voting servers that can fail gradually without causing an outage.
|
|
|
|
The relationship between these two types of fault tolerance is:
|
|
|
|
_Optimistic fault tolerance = immediate fault tolerance + the number of healthy backup voters_
|
|
|
|
For example, consider a Consul datacenter with 3 redundancy zones and 2 servers per zone.
|
|
There will be 3 voting servers (1 per zone), meaning a quorum size of 2 and an immediate fault tolerance of 1.
|
|
There will also be 3 backup voters (1 per zone), each of which increase the optimistic fault tolerance.
|
|
Therefore, the optimistic fault tolerance is 4.
|
|
This provides performance similar to a 3 server setup with fault tolerance similar to a 7 server setup.
|
|
|
|
We recommend associating each Consul redundancy zone with an infrastructure availability zone
|
|
to also gain the infrastructure-level fault tolerance benefits provided by availability zones.
|
|
However, Consul redundancy zones can be used even without the backing of infrastructure availability zones.
|
|
|
|
For more information on redundancy zones, refer to:
|
|
- [Redundancy zone documentation](/docs/enterprise/redundancy)
|
|
for a more detailed explanation
|
|
- [Redundancy zone tutorial](https://learn.hashicorp.com/tutorials/consul/redundancy-zones?in=consul/enterprise)
|
|
to learn how to use them
|