mirror of https://github.com/status-im/consul.git
commit b3087aceb2 (parent bcc533cada)

website: documenting the internals
@@ -82,9 +82,9 @@ slower as more machines are added. However, there is no limit to the number of c
 and they can easily scale into the thousands or tens of thousands.
 
 All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
-This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
+This means there is a gossip pool that contains all the nodes for a given datacenter. This serves
 a few purposes: first, there is no need to configure clients with the addresses of servers,
-that discovery is done automatically using Serf. Second, the work of detecting node failures
+discovery is done automatically. Second, the work of detecting node failures
 is not placed on the servers but is distributed. This makes the failure detection much more
 scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
 when important events such as leader election take place.
@@ -6,98 +6,177 @@ sidebar_current: "docs-internals-consensus"
 
 # Consensus Protocol
 
-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
-["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
+to provide [Consistency and Availability](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
+This page documents the details of this internal protocol. The consensus protocol is based on
+["Raft: In Search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
 
 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## SWIM Protocol Overview
-
-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
-
-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
-
-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
-
-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
-
-## SWIM Modifications
-
-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
-
-The changes from SWIM are noted here:
-
-* Serf does a full state sync over TCP periodically. SWIM only propagates
-changes over gossip. While both are eventually consistent, Serf is able to
-more quickly reach convergence, as well as gracefully recover from network
-partitions.
-
-* Serf has a dedicated gossip layer separate from the failure detection
-protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-Serf uses piggybacking along with dedicated gossip messages. This
-feature lets you have a higher gossip rate (for example once per 200ms)
-and a slower failure detection rate (such as once per second), resulting
-in overall faster convergence rates and data propagation speeds.
-
-* Serf keeps the state of dead nodes around for a set amount of time,
-so that when full syncs are requested, the requester also receives information
-about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-state immediately upon learning that the node is dead. This change again helps
-the cluster converge more quickly.
-
-## Serf-Specific Messages
-
-On top of the SWIM-based gossip layer, Serf sends some custom message types.
-
-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
-
-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
-
-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
-
-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
+## Raft Protocol Overview
+
+Raft is a relatively new consensus algorithm that is based on Paxos,
+but is designed to have fewer states and a simpler, more understandable
+algorithm. There are a few key terms to know when discussing Raft:
+
+* Log - The primary unit of work in a Raft system is a log entry. The problem
+of consistency can be decomposed into a *replicated log*. A log is an ordered
+sequence of entries. We consider the log consistent if all members agree on
+the entries and their order.
+
+* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
+An FSM is a collection of finite states with transitions between them. As new logs
+are applied, the FSM is allowed to transition between states. Application of the
+same sequence of logs must result in the same state, meaning non-deterministic
+behavior is not permitted.
+
+* Peer set - The peer set is the set of all members participating in log replication.
+For Consul's purposes, all server nodes are in the peer set of the local datacenter.
+
+* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1.
+For example, if there are 5 members in the peer set, we would need 3 nodes
+to form a quorum. If a quorum of nodes is unavailable for any reason, then the
+cluster becomes *unavailable*, and no new logs can be committed.
+
+* Committed Entry - An entry is considered *committed* when it is durably stored
+on a quorum of nodes. Once an entry is committed, it can be applied.
+
+* Leader - At any given time, the peer set elects a single node to be the leader.
+The leader is responsible for ingesting new log entries, replicating to followers,
+and managing when an entry is considered committed.
+
+Raft is a complex protocol and will not be covered here in detail. For the full
+specification, we recommend reading the paper. We will attempt to provide a
+high-level description, which may be useful for building a mental picture.
+
+Raft nodes are always in one of three states: follower, candidate, or leader. All
+nodes initially start out as a follower. In this state, nodes can accept log entries
+from a leader and cast votes. If no entries are received for some time, nodes
+self-promote to the candidate state. In the candidate state, nodes request votes from
+their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
+The leader must accept new log entries and replicate to all the other followers.
+In addition, if stale reads are not acceptable, all queries must also be performed on
+the leader.
+
+Once a cluster has a leader, it is able to accept new log entries. A client can
+request that a leader append a new log entry, which is an opaque binary blob to
+Raft. The leader then writes the entry to durable storage and attempts to replicate
+to a quorum of followers. Once the log entry is considered *committed*, it can be
+*applied* to a finite state machine. The finite state machine is application specific,
+and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
+
+An obvious question relates to the unbounded nature of a replicated log. Raft provides
+a mechanism by which the current state is snapshotted and the log is compacted. Because
+of the FSM abstraction, restoring the state of the FSM must result in the same state
+as a replay of old logs. This allows Raft to capture the FSM state at a point in time
+and then remove all the logs that were used to reach that state. This is performed
+automatically without user intervention, preventing unbounded disk usage while also
+minimizing time spent replaying logs. One of the advantages of using LMDB is that it
+allows Consul to continue accepting new transactions even while old state is being
+snapshotted, preventing any availability issues.
+
+Lastly, there is the issue of updating the peer set when new servers are joining
+or existing servers are leaving. As long as a quorum of nodes is available, this
+is not an issue, as Raft provides mechanisms to dynamically update the peer set.
+If a quorum of nodes is unavailable, then this becomes a very challenging issue.
+For example, suppose there are only 2 peers: A and B. The quorum size is also
+2, meaning both nodes must agree to commit a log entry. If either A or B fails,
+it is now impossible to reach quorum. This means the cluster is unable to add
+or remove a node, or commit any additional log entries. This results in *unavailability*.
+At this point, manual intervention would be required to remove either A or B
+and to restart the remaining node in bootstrap mode.
+
+A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
+of 5 can tolerate 2 node failures. The recommended configuration is to run
+either 3 or 5 Consul servers per datacenter. This maximizes availability without
+greatly sacrificing performance. See below for a deployment table.
+
+In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
+committing a log entry requires a single round trip to half of the cluster.
+Thus performance is bound by disk I/O and network latency. Although Consul is
+not designed to be a high-throughput write system, it should handle on the order
+of hundreds to thousands of transactions per second depending on network and
+hardware configuration.
+
+## Raft in Consul
+
+Only Consul server nodes participate in Raft and are part of the peer set. All
+client nodes forward requests to servers. Part of the reason for this design is
+that as more members are added to the peer set, the size of the quorum also increases.
+This introduces performance problems as you may be waiting for hundreds of machines
+to agree on an entry instead of a handful.
+
+When getting started, a single Consul server is put into "bootstrap" mode. This mode
+allows it to self-elect as a leader. Once a leader is elected, other servers can be
+added to the peer set in a way that preserves consistency and safety. Eventually,
+once the first few servers are added, bootstrap mode can be disabled.
+
+Since all servers participate as part of the peer set, they all know the current
+leader. When an RPC request arrives at a non-leader server, the request is
+forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
+then the leader generates the result based on the current state of the FSM. If
+the RPC is a *transaction* type, meaning it modifies state, then the leader
+generates a new log entry and applies it using Raft. Once the log entry is committed
+and applied to the FSM, the transaction is complete.
+
+Because of the nature of Raft's replication, performance is sensitive to network
+latency. For this reason, each datacenter elects an independent leader and maintains
+a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
+only for data in its own datacenter. When a request is received for a remote datacenter,
+the request is forwarded to the correct leader. This design allows for lower-latency
+transactions and higher availability without sacrificing consistency.
+
+## Deployment Table
+
+Below is a table that shows, for a given number of servers, how large the
+quorum is, as well as how many node failures can be tolerated. The
+recommended deployment is either 3 or 5 servers.
+
+<table class="table table-bordered table-striped">
+<tr>
+<th>Servers</th>
+<th>Quorum Size</th>
+<th>Failure Tolerance</th>
+</tr>
+<tr>
+<td>1</td>
+<td>1</td>
+<td>0</td>
+</tr>
+<tr>
+<td>2</td>
+<td>2</td>
+<td>0</td>
+</tr>
+<tr>
+<td><b>3</b></td>
+<td>2</td>
+<td>1</td>
+</tr>
+<tr>
+<td>4</td>
+<td>3</td>
+<td>1</td>
+</tr>
+<tr>
+<td><b>5</b></td>
+<td>3</td>
+<td>2</td>
+</tr>
+<tr>
+<td>6</td>
+<td>4</td>
+<td>2</td>
+</tr>
+<tr>
+<td>7</td>
+<td>4</td>
+<td>3</td>
+</tr>
+</table>
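The quorum and failure-tolerance columns in the deployment table above follow directly from the (n/2)+1 majority rule. A minimal sketch of that arithmetic (illustrative only, not Consul's actual code):

```go
package main

import "fmt"

// quorumSize returns the majority threshold (n/2)+1 for a peer set of n servers.
func quorumSize(n int) int {
	return n/2 + 1
}

// failureTolerance is how many servers can fail while a quorum remains available.
func failureTolerance(n int) int {
	return n - quorumSize(n)
}

func main() {
	for _, servers := range []int{1, 2, 3, 4, 5, 6, 7} {
		fmt.Printf("servers=%d quorum=%d tolerance=%d\n",
			servers, quorumSize(servers), failureTolerance(servers))
	}
}
```

Note that integer division makes even-sized clusters a poor deal: going from 3 to 4 servers raises the quorum size without raising the failure tolerance, which is why the recommended deployments are the odd sizes 3 and 5.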
@@ -6,98 +6,39 @@ sidebar_current: "docs-internals-gossip"
 
 # Gossip Protocol
 
-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
+Consul uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
+to manage membership and broadcast messages to the cluster. All of this is provided
+through the use of the [Serf library](http://www.serfdom.io/). The gossip protocol
+used by Serf is based on
 ["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+with a few minor adaptations. There are more details about [Serf's protocol here](http://www.serfdom.io/docs/internals/gossip.html).
 
 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## SWIM Protocol Overview
-
-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
-
-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
-
-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
-
-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
-
-## SWIM Modifications
-
-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
-
-The changes from SWIM are noted here:
-
-* Serf does a full state sync over TCP periodically. SWIM only propagates
-changes over gossip. While both are eventually consistent, Serf is able to
-more quickly reach convergence, as well as gracefully recover from network
-partitions.
-
-* Serf has a dedicated gossip layer separate from the failure detection
-protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-Serf uses piggybacking along with dedicated gossip messages. This
-feature lets you have a higher gossip rate (for example once per 200ms)
-and a slower failure detection rate (such as once per second), resulting
-in overall faster convergence rates and data propagation speeds.
-
-* Serf keeps the state of dead nodes around for a set amount of time,
-so that when full syncs are requested, the requester also receives information
-about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-state immediately upon learning that the node is dead. This change again helps
-the cluster converge more quickly.
-
-## Serf-Specific Messages
-
-On top of the SWIM-based gossip layer, Serf sends some custom message types.
-
-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
-
-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
-
-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
-
-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
+## Gossip in Consul
+
+Consul makes use of two different gossip pools. We refer to each pool as the
+LAN or WAN pool respectively. Each datacenter Consul operates in has a LAN gossip pool
+containing all members of the datacenter, both clients and servers. The LAN pool is
+used for a few purposes. Membership information allows clients to automatically discover
+servers, reducing the amount of configuration needed. The distributed failure detection
+allows the work of failure detection to be shared by the entire cluster instead of
+concentrated on a few servers. Lastly, the gossip pool allows for reliable and fast
+event broadcasts for events like leader election.
+
+The WAN pool is globally unique, as all servers should participate in the WAN pool
+regardless of datacenter. Membership information provided by the WAN pool allows
+servers to perform cross-datacenter requests. The integrated failure detection
+allows Consul to gracefully handle an entire datacenter losing connectivity, or just
+a single server in a remote datacenter.
+
+All of these features are provided by leveraging [Serf](http://www.serfdom.io/). It
+is used as an embedded library to provide these features. From a user perspective,
+this is not important, since the abstraction should be masked by Consul. It can be
+useful, however, as a developer to understand how this library is leveraged.
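The Serf text above notes that every gossip message carries a lamport clock time so that out-of-order joins and leaves can still be ordered. The witness rule Serf uses (increment on send; on receive, jump past any timestamp that is ahead of us) is simple enough to sketch. This is an illustration of the idea, not Consul's or Serf's actual implementation:

```go
package main

import "fmt"

// LamportClock orders messages in an eventually consistent system:
// increment before sending, and on receipt jump past any observed
// timestamp that is ahead of the local clock.
type LamportClock struct {
	time uint64
}

// Send increments the clock and returns the time to stamp on an outgoing message.
func (c *LamportClock) Send() uint64 {
	c.time++
	return c.time
}

// Receive witnesses the timestamp carried by an incoming message.
func (c *LamportClock) Receive(msg uint64) {
	if msg >= c.time {
		c.time = msg + 1
	}
}

func main() {
	var a, b LamportClock
	t := a.Send() // a's clock becomes 1
	b.Receive(t)  // b jumps to 2
	t = b.Send()  // b's clock becomes 3
	a.Receive(t)  // a jumps to 4
	fmt.Println(a.time, b.time)
}
```

Because any message that causally depends on another must carry a strictly larger timestamp, a node can tell that a leave intent stamped after a join intent supersedes it, even if the two arrive in the opposite order.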
@@ -6,93 +6,43 @@ sidebar_current: "docs-internals-security"
 
 # Security Model
 
-Serf uses a symmetric key, or shared secret, cryptosystem to provide
-[confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
+Consul relies on both a lightweight gossip mechanism and an RPC system
+to provide various features. Both of the systems have different security
+mechanisms that stem from their independent designs. However, the goals
+of Consul's security are to provide [confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
 
-This means Serf communication is protected against eavesdropping, tampering,
-or attempts to generate fake events. This makes it possible to run Serf over
-untrusted networks such as EC2 and other shared hosting providers.
+The [gossip protocol](/docs/internals/gossip.html) is powered by Serf,
+which uses a symmetric key, or shared secret, cryptosystem. There are more
+details on the security of [Serf here](http://www.serfdom.io/docs/internals/security.html).
+
+The RPC system supports using end-to-end TLS, with optional client authentication.
+[TLS](http://en.wikipedia.org/wiki/Transport_Layer_Security) is a widely deployed asymmetric
+cryptosystem, and is the foundation of security on the Internet.
+
+This means Consul communication is protected against eavesdropping, tampering,
+or spoofing. This makes it possible to run Consul over untrusted networks such
+as EC2 and other shared hosting providers.
 
 <div class="alert alert-block alert-warning">
 <strong>Advanced Topic!</strong> This page covers the technical details of
-the security model of Serf. You don't need to know these details to
-operate and use Serf. These details are documented here for those who wish
+the security model of Consul. You don't need to know these details to
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## Security Primitives
-
-The Serf security model is built on around a symmetric key, or shared secret system.
-All members of the Serf cluster must be provided the shared secret ahead of time.
-This places the burden of key distribution on the user.
-
-To support confidentiality, all messages are encrypted using the
-[AES-128 standard](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard). The
-AES standard is considered one of the most secure and modern encryption standards.
-Additionally, it is a fast algorithm, and modern CPUs provide hardware instructions to
-make encryption and decryption very lightweight.
-
-AES is used with the [Galois Counter Mode (GCM)](http://en.wikipedia.org/wiki/Galois/Counter_Mode),
-using a randomly generated nonce. The use of GCM provides message integrity,
-as the ciphertext is suffixed with a 'tag' that is used to verify integrity.
-
-## Message Format
-
-In the previous section we described the crypto primitives that are used. In this
-section we cover how messages are framed on the wire and interpretted.
-
-### UDP Message Format
-
-UDP messages do not require any framing since they are packet oriented. This
-allows the message to be simple and saves space. The format is as follows:
-
-    -------------------------------------------------------------------
-    | Version (byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes) |
-    -------------------------------------------------------------------
-
-The UDP message has an overhead of 29 bytes per message.
-Tampering or bit corruption will cause the GCM tag verification to fail.
-
-Once we receive a packet, we first verify the GCM tag, and only on verification,
-decrypt the payload. The version byte is provided to allow future versions to
-change the algorithm they use. It is currently always set to 0.
-
-### TCP Message Format
-
-TCP provides a stream abstraction and therefor we must provide our own framing.
-This intoduces a potential attack vector since we cannot verify the tag
-until the entire message is received, and the message length must be in plaintext.
-Our current strategy is to limit the maximum size of a framed message to prevent
-a malicious attacker from being able to send enough data to cause a Denial of Service.
-
-The TCP format is similar to the UDP format, but prepends the message with
-a message type byte (similar to other Serf messages). It also adds a 4 byte length
-field, encoded in Big Endian format. This increases its maximum overhead to 33 bytes.
-
-When we first receive a TCP encrypted message, we check the message type. If any
-party has encryption enabled, the other party must as well. Otherwise we are vulnerable
-to a downgrade attack where one side can force the other into a non-encrypted mode of
-operation.
-
-Once this is verified, we determine the message length and check that it is less
-than our limit. After the entire message is received, the tag is used to verify
-the entire message.
-
 ## Threat Model
 
 The following are the various parts of our threat model:
 
-* Non-members getting access to events
+* Non-members getting access to data
 * Cluster state manipulation due to malicious messages
-* Fake event generation due to malicious messages
-* Tampering of messages causing state corruption
+* Fake data generation due to malicious messages
+* Tampering causing state corruption
 * Denial of Service against a node
 
-We are specifically not concerned about replay attacks, as the gossip
-protocol is designed to handle that due to the nature of its broadcast mechanism.
-
 Additionally, we recognize that an attacker that can observe network
 traffic for an extended period of time may infer the cluster members.
-The gossip mechanism used by Serf relies on sending messages to random
+The gossip mechanism used by Consul relies on sending messages to random
 members, so an attacker can record all destinations and determine all
 members of the cluster.
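The UDP message format described above (Version, 12-byte nonce, ciphertext, 16-byte GCM tag, for 29 bytes of overhead) can be reproduced with Go's standard crypto packages. This is a hedged sketch of that framing, not Serf's or Consul's actual wire code:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

const versionByte = 0 // currently always 0; reserved for future algorithm changes

// seal encrypts a payload with AES-128-GCM and frames it as:
// Version (1 byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes).
func seal(key, payload []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // a 16-byte key selects AES-128
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block) // standard 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	out := []byte{versionByte}
	out = append(out, nonce...)
	// Seal appends the ciphertext followed by the GCM authentication tag.
	return gcm.Seal(out, nonce, payload, nil), nil
}

func main() {
	key := make([]byte, 16) // demo only: a real shared secret must be random
	msg, err := seal(key, []byte("ping"))
	if err != nil {
		panic(err)
	}
	// Overhead is 1 (version) + 12 (nonce) + 16 (tag) = 29 bytes.
	fmt.Println(len(msg) - len("ping")) // 29
}
```

On receipt, the same GCM open operation verifies the tag and decrypts in one step, so a tampered or corrupted packet fails verification before any plaintext is produced, matching the behavior the removed section describes.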
@@ -101,13 +51,3 @@ Our goal is not to protect top secret data but to provide a "reasonable"
 level of security that would require an attacker to commit a considerable
 amount of resources to defeat.
-
-## Future Roadmap
-
-Eventually, Serf will be able to use the versioning byte to support
-different encryption algorithms. These could be configured at the
-start time of the agent.
-
-Additionally, we need to support key rotation so that it is possible
-for network administrators to periodically change keys to ensure
-perfect forward security.