mirror of https://github.com/status-im/consul.git
commit b3087aceb2 (parent bcc533cada)

website: documenting the internals
@@ -82,9 +82,9 @@ slower as more machines are added. However, there is no limit to the number of c
 and they can easily scale into the thousands or tens of thousands.
 
 All the nodes that are in a datacenter participate in a [gossip protocol](/docs/internals/gossip.html).
-This means is there is a Serf cluster that contains all the nodes for a given datacenter. This serves
+This means there is a gossip pool that contains all the nodes for a given datacenter. This serves
 a few purposes: first, there is no need to configure clients with the addresses of servers,
-that discovery is done automatically using Serf. Second, the work of detecting node failures
+discovery is done automatically. Second, the work of detecting node failures
 is not placed on the servers but is distributed. This makes the failure detection much more
 scalable than naive heartbeating schemes. Thirdly, it is used as a messaging layer to notify
 when important events such as leader election take place.
@@ -6,98 +6,177 @@ sidebar_current: "docs-internals-consensus"
 
 # Consensus Protocol
 
-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
-["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+Consul uses a [consensus protocol](http://en.wikipedia.org/wiki/Consensus_(computer_science))
+to provide [Consistency and Availability](http://en.wikipedia.org/wiki/CAP_theorem) as defined by CAP.
+This page documents the details of this internal protocol. The consensus protocol is based on
+["Raft: In Search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
 
 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## SWIM Protocol Overview
-
-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
-
-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
-
-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
-
-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
-
-## SWIM Modifications
-
-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
-
-The changes from SWIM are noted here:
-
-* Serf does a full state sync over TCP periodically. SWIM only propagates
-changes over gossip. While both are eventually consistent, Serf is able to
-more quickly reach convergence, as well as gracefully recover from network
-partitions.
-
-* Serf has a dedicated gossip layer separate from the failure detection
-protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-Serf uses piggybacking along with dedicated gossip messages. This
-feature lets you have a higher gossip rate (for example once per 200ms)
-and a slower failure detection rate (such as once per second), resulting
-in overall faster convergence rates and data propagation speeds.
-
-* Serf keeps the state of dead nodes around for a set amount of time,
-so that when full syncs are requested, the requester also receives information
-about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-state immediately upon learning that the node is dead. This change again helps
-the cluster converge more quickly.
-
-## Serf-Specific Messages
-
-On top of the SWIM-based gossip layer, Serf sends some custom message types.
-
-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
-
-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
-
-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
-
-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
+## Raft Protocol Overview
+
+Raft is a relatively new consensus algorithm that is based on Paxos,
+but is designed to have fewer states and a simpler, more understandable
+algorithm. There are a few key terms to know when discussing Raft:
+
+* Log - The primary unit of work in a Raft system is a log entry. The problem
+of consistency can be decomposed into a *replicated log*. A log is an ordered
+sequence of entries. We consider the log consistent if all members agree on
+the entries and their order.
+
+* FSM - [Finite State Machine](http://en.wikipedia.org/wiki/Finite-state_machine).
+An FSM is a collection of finite states with transitions between them. As new logs
+are applied, the FSM is allowed to transition between states. Application of the
+same sequence of logs must result in the same state, meaning non-deterministic
+behavior is not permitted.
+
+* Peer set - The peer set is the set of all members participating in log replication.
+For Consul's purposes, all server nodes are in the peer set of the local datacenter.
+
+* Quorum - A quorum is a majority of members from a peer set, or (n/2)+1.
+For example, if there are 5 members in the peer set, we would need 3 nodes
+to form a quorum. If a quorum of nodes is unavailable for any reason, then the
+cluster becomes *unavailable*, and no new logs can be committed.
+
+* Committed Entry - An entry is considered *committed* when it is durably stored
+on a quorum of nodes. Once an entry is committed, it can be applied.
+
+* Leader - At any given time, the peer set elects a single node to be the leader.
+The leader is responsible for ingesting new log entries, replicating to followers,
+and managing when an entry is considered committed.
+
+Raft is a complex protocol and will not be covered here in detail. For the full
+specification, we recommend reading the paper. We will attempt to provide a
+high-level description, which may be useful for building a mental picture.
+
+Raft nodes are always in one of three states: follower, candidate, or leader. All
+nodes initially start out as a follower. In this state, nodes can accept log entries
+from a leader and cast votes. If no entries are received for some time, nodes
+self-promote to the candidate state. In the candidate state, nodes request votes from
+their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
+The leader must accept new log entries and replicate to all the other followers.
+In addition, if stale reads are not acceptable, all queries must also be performed on
+the leader.
+
+Once a cluster has a leader, it is able to accept new log entries. A client can
+request that a leader append a new log entry, which is an opaque binary blob to
+Raft. The leader then writes the entry to durable storage and attempts to replicate
+to a quorum of followers. Once the log entry is considered *committed*, it can be
+*applied* to a finite state machine. The finite state machine is application specific,
+and in Consul's case, we use [LMDB](http://symas.com/mdb/) to maintain cluster state.
+
+An obvious question relates to the unbounded nature of a replicated log. Raft provides
+a mechanism by which the current state is snapshotted and the log is compacted. Because
+of the FSM abstraction, restoring the state of the FSM must result in the same state
+as a replay of old logs. This allows Raft to capture the FSM state at a point in time
+and then remove all the logs that were used to reach that state. This is performed
+automatically without user intervention, preventing unbounded disk usage while also
+minimizing time spent replaying logs. One of the advantages of using LMDB is that it
+allows Consul to continue accepting new transactions even while old state is being
+snapshotted, preventing any availability issues.
+
+Lastly, there is the issue of updating the peer set when new servers are joining
+or existing servers are leaving. As long as a quorum of nodes is available, this
+is not an issue, as Raft provides mechanisms to dynamically update the peer set.
+If a quorum of nodes is unavailable, then this becomes a very challenging issue.
+For example, suppose there are only 2 peers: A and B. The quorum size is also
+2, meaning both nodes must agree to commit a log entry. If either A or B fails,
+it is now impossible to reach quorum. This means the cluster is unable to add
+or remove a node, or commit any additional log entries. This results in *unavailability*.
+At this point, manual intervention would be required to remove either A or B
+and to restart the remaining node in bootstrap mode.
+
+A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
+of 5 can tolerate 2 node failures. The recommended configuration is to run
+either 3 or 5 Consul servers per datacenter. This maximizes availability without
+greatly sacrificing performance. See below for a deployment table.
+
+In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
+committing a log entry requires a single round trip to half of the cluster.
+Thus performance is bound by disk I/O and network latency. Although Consul is
+not designed to be a high-throughput write system, it should handle on the order
+of hundreds to thousands of transactions per second depending on network and
+hardware configuration.
+
+## Raft in Consul
+
+Only Consul server nodes participate in Raft and are part of the peer set. All
+client nodes forward requests to servers. Part of the reason for this design is
+that as more members are added to the peer set, the size of the quorum also increases.
+This introduces performance problems as you may be waiting for hundreds of machines
+to agree on an entry instead of a handful.
+
+When getting started, a single Consul server is put into "bootstrap" mode. This mode
+allows it to self-elect as a leader. Once a leader is elected, other servers can be
+added to the peer set in a way that preserves consistency and safety. Eventually,
+once the first few servers are added, bootstrap mode can be disabled.
+
+Since all servers participate as part of the peer set, they all know the current
+leader. When an RPC request arrives at a non-leader server, the request is
+forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
+then the leader generates the result based on the current state of the FSM. If
+the RPC is a *transaction* type, meaning it modifies state, then the leader
+generates a new log entry and applies it using Raft. Once the log entry is committed
+and applied to the FSM, the transaction is complete.
+
+Because of the nature of Raft's replication, performance is sensitive to network
+latency. For this reason, each datacenter elects an independent leader and maintains
+a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
+only for data in its own datacenter. When a request is received for a remote datacenter,
+the request is forwarded to the correct leader. This design allows for lower-latency
+transactions and higher availability without sacrificing consistency.
+
+## Deployment Table
+
+Below is a table that shows, for a given number of servers, how large the
+quorum is, as well as how many node failures can be tolerated. The
+recommended deployment is either 3 or 5 servers.
+
+<table class="table table-bordered table-striped">
+<tr>
+<th>Servers</th>
+<th>Quorum Size</th>
+<th>Failure Tolerance</th>
+</tr>
+<tr>
+<td>1</td>
+<td>1</td>
+<td>0</td>
+</tr>
+<tr>
+<td>2</td>
+<td>2</td>
+<td>0</td>
+</tr>
+<tr>
+<td><b>3</b></td>
+<td>2</td>
+<td>1</td>
+</tr>
+<tr>
+<td>4</td>
+<td>3</td>
+<td>1</td>
+</tr>
+<tr>
+<td><b>5</b></td>
+<td>3</td>
+<td>2</td>
+</tr>
+<tr>
+<td>6</td>
+<td>4</td>
+<td>2</td>
+</tr>
+<tr>
+<td>7</td>
+<td>4</td>
+<td>3</td>
+</tr>
+</table>
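The quorum and failure-tolerance columns in the deployment table above follow directly from the (n/2)+1 majority rule. A minimal sketch of that arithmetic (illustrative only, not Consul's actual code):

```go
package main

import "fmt"

// quorumSize returns the majority threshold (n/2)+1 for a peer set of n servers.
func quorumSize(n int) int {
	return n/2 + 1
}

// failureTolerance is how many servers can fail while a quorum remains available.
func failureTolerance(n int) int {
	return n - quorumSize(n)
}

func main() {
	for _, servers := range []int{1, 2, 3, 4, 5, 6, 7} {
		fmt.Printf("servers=%d quorum=%d tolerance=%d\n",
			servers, quorumSize(servers), failureTolerance(servers))
	}
}
```

Note that integer division makes even-sized clusters a poor deal: going from 3 to 4 servers raises the quorum size without raising the failure tolerance, which is why the recommended deployments are the odd sizes 3 and 5.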
@@ -6,98 +6,39 @@ sidebar_current: "docs-internals-gossip"
 
 # Gossip Protocol
 
-Serf uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
-to broadcast messages to the cluster. This page documents the details of
-this internal protocol. The gossip protocol is based on
+Consul uses a [gossip protocol](http://en.wikipedia.org/wiki/Gossip_protocol)
+to manage membership and broadcast messages to the cluster. All of this is provided
+through the use of the [Serf library](http://www.serfdom.io/). The gossip protocol
+used by Serf is based on
 ["SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol"](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf),
-with a few minor adaptations, mostly to increase propagation speed
-and convergence rate.
+with a few minor adaptations. There are more details about [Serf's protocol here](http://www.serfdom.io/docs/internals/gossip.html).
 
 <div class="alert alert-block alert-warning">
-<strong>Advanced Topic!</strong> This page covers the technical details of
-the internals of Serf. You don't need to know these details to effectively
-operate and use Serf. These details are documented here for those who wish
+<strong>Advanced Topic!</strong> This page covers technical details of
+the internals of Consul. You don't need to know these details to effectively
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## SWIM Protocol Overview
-
-Serf begins by joining an existing cluster or starting a new
-cluster. If starting a new cluster, additional nodes are expected to join
-it. New nodes in an existing cluster must be given the address of at
-least one existing member in order to join the cluster. The new member
-does a full state sync with the existing member over TCP and begins gossiping its
-existence to the cluster.
-
-Gossip is done over UDP with a configurable but fixed fanout and interval.
-This ensures that network usage is constant with regards to number of nodes.
-Complete state exchanges with a random node are done periodically over
-TCP, but much less often than gossip messages. This increases the likelihood
-that the membership list converges properly since the full state is exchanged
-and merged. The interval between full state exchanges is configurable or can
-be disabled entirely.
-
-Failure detection is done by periodic random probing using a configurable interval.
-If the node fails to ack within a reasonable time (typically some multiple
-of RTT), then an indirect probe is attempted. An indirect probe asks a
-configurable number of random nodes to probe the same node, in case there
-are network issues causing our own node to fail the probe. If both our
-probe and the indirect probes fail within a reasonable time, then the
-node is marked "suspicious" and this knowledge is gossiped to the cluster.
-A suspicious node is still considered a member of cluster. If the suspect member
-of the cluster does not dispute the suspicion within a configurable period of
-time, the node is finally considered dead, and this state is then gossiped
-to the cluster.
-
-This is a brief and incomplete description of the protocol. For a better idea,
-please read the
-[SWIM paper](http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
-in its entirety, along with the Serf source code.
-
-## SWIM Modifications
-
-As mentioned earlier, the gossip protocol is based on SWIM but includes
-minor changes, mostly to increase propogation speed and convergence rates.
-
-The changes from SWIM are noted here:
-
-* Serf does a full state sync over TCP periodically. SWIM only propagates
-changes over gossip. While both are eventually consistent, Serf is able to
-more quickly reach convergence, as well as gracefully recover from network
-partitions.
-
-* Serf has a dedicated gossip layer separate from the failure detection
-protocol. SWIM only piggybacks gossip messages on top of probe/ack messages.
-Serf uses piggybacking along with dedicated gossip messages. This
-feature lets you have a higher gossip rate (for example once per 200ms)
-and a slower failure detection rate (such as once per second), resulting
-in overall faster convergence rates and data propagation speeds.
-
-* Serf keeps the state of dead nodes around for a set amount of time,
-so that when full syncs are requested, the requester also receives information
-about dead nodes. Because SWIM doesn't do full syncs, SWIM deletes dead node
-state immediately upon learning that the node is dead. This change again helps
-the cluster converge more quickly.
-
-## Serf-Specific Messages
-
-On top of the SWIM-based gossip layer, Serf sends some custom message types.
-
-Serf makes heavy use of [lamport clocks](http://en.wikipedia.org/wiki/Lamport_timestamps)
-to maintain some notion of message ordering despite being eventually
-consistent. Every message sent by Serf contains a lamport clock time.
-
-When a node gracefully leaves the cluster, Serf sends a _leave intent_ through
-the gossip layer. Because the underlying gossip layer makes no differentiation
-between a node leaving the cluster and a node being detected as failed, this
-allows the higher level Serf layer to detect a failure versus a graceful
-leave.
-
-When a node joins the cluster, Serf sends a _join intent_. The purpose
-of this intent is solely to attach a lamport clock time to a join so that
-it can be ordered properly in case a leave comes out of order.
-
-For custom events, Serf sends a _user event_ message. This message contains
-a lamport time, event name, and event payload. Because user events are sent
-along the gossip layer, which uses UDP, the payload and entire message framing
-must fit within a single UDP packet.
+## Gossip in Consul
+
+Consul makes use of two different gossip pools. We refer to each pool as the
+LAN or WAN pool respectively. Each datacenter Consul operates in has a LAN gossip pool
+containing all members of the datacenter, both clients and servers. The LAN pool is
+used for a few purposes. Membership information allows clients to automatically discover
+servers, reducing the amount of configuration needed. The distributed failure detection
+allows the work of failure detection to be shared by the entire cluster instead of
+concentrated on a few servers. Lastly, the gossip pool allows for reliable and fast
+event broadcasts for events like leader election.
+
+The WAN pool is globally unique, as all servers should participate in the WAN pool
+regardless of datacenter. Membership information provided by the WAN pool allows
+servers to perform cross-datacenter requests. The integrated failure detection
+allows Consul to gracefully handle an entire datacenter losing connectivity, or just
+a single server in a remote datacenter.
+
+All of these features are provided by leveraging [Serf](http://www.serfdom.io/). It
+is used as an embedded library to provide these features. From a user perspective,
+this is not important, since the abstraction should be masked by Consul. It can be
+useful, however, as a developer to understand how this library is leveraged.
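The Serf text above notes that every gossip message carries a lamport clock time so that out-of-order joins and leaves can still be ordered. The witness rule Serf uses (increment on send; on receive, jump past any timestamp that is ahead of us) is simple enough to sketch. This is an illustration of the idea, not Consul's or Serf's actual implementation:

```go
package main

import "fmt"

// LamportClock orders messages in an eventually consistent system:
// increment before sending, and on receipt jump past any observed
// timestamp that is ahead of the local clock.
type LamportClock struct {
	time uint64
}

// Send increments the clock and returns the time to stamp on an outgoing message.
func (c *LamportClock) Send() uint64 {
	c.time++
	return c.time
}

// Receive witnesses the timestamp carried by an incoming message.
func (c *LamportClock) Receive(msg uint64) {
	if msg >= c.time {
		c.time = msg + 1
	}
}

func main() {
	var a, b LamportClock
	t := a.Send() // a's clock becomes 1
	b.Receive(t)  // b jumps to 2
	t = b.Send()  // b's clock becomes 3
	a.Receive(t)  // a jumps to 4
	fmt.Println(a.time, b.time)
}
```

Because any message that causally depends on another must carry a strictly larger timestamp, a node can tell that a leave intent stamped after a join intent supersedes it, even if the two arrive in the opposite order.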
@@ -6,93 +6,43 @@ sidebar_current: "docs-internals-security"
 
 # Security Model
 
-Serf uses a symmetric key, or shared secret, cryptosystem to provide
-[confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
+Consul relies on both a lightweight gossip mechanism and an RPC system
+to provide various features. Both of the systems have different security
+mechanisms that stem from their independent designs. However, the goals
+of Consul's security are to provide [confidentiality, integrity and authentication](http://en.wikipedia.org/wiki/Information_security).
 
-This means Serf communication is protected against eavesdropping, tampering,
-or attempts to generate fake events. This makes it possible to run Serf over
-untrusted networks such as EC2 and other shared hosting providers.
+The [gossip protocol](/docs/internals/gossip.html) is powered by Serf,
+which uses a symmetric key, or shared secret, cryptosystem. There are more
+details on the security of [Serf here](http://www.serfdom.io/docs/internals/security.html).
+
+The RPC system supports using end-to-end TLS, with optional client authentication.
+[TLS](http://en.wikipedia.org/wiki/Transport_Layer_Security) is a widely deployed asymmetric
+cryptosystem, and is the foundation of security on the Internet.
+
+This means Consul communication is protected against eavesdropping, tampering,
+or spoofing. This makes it possible to run Consul over untrusted networks such
+as EC2 and other shared hosting providers.
 
 <div class="alert alert-block alert-warning">
 <strong>Advanced Topic!</strong> This page covers the technical details of
-the security model of Serf. You don't need to know these details to
-operate and use Serf. These details are documented here for those who wish
+the security model of Consul. You don't need to know these details to
+operate and use Consul. These details are documented here for those who wish
 to learn about them without having to go spelunking through the source code.
 </div>
 
-## Security Primitives
-
-The Serf security model is built on around a symmetric key, or shared secret system.
-All members of the Serf cluster must be provided the shared secret ahead of time.
-This places the burden of key distribution on the user.
-
-To support confidentiality, all messages are encrypted using the
-[AES-128 standard](http://en.wikipedia.org/wiki/Advanced_Encryption_Standard). The
-AES standard is considered one of the most secure and modern encryption standards.
-Additionally, it is a fast algorithm, and modern CPUs provide hardware instructions to
-make encryption and decryption very lightweight.
-
-AES is used with the [Galois Counter Mode (GCM)](http://en.wikipedia.org/wiki/Galois/Counter_Mode),
-using a randomly generated nonce. The use of GCM provides message integrity,
-as the ciphertext is suffixed with a 'tag' that is used to verify integrity.
-
-## Message Format
-
-In the previous section we described the crypto primitives that are used. In this
-section we cover how messages are framed on the wire and interpretted.
-
-### UDP Message Format
-
-UDP messages do not require any framing since they are packet oriented. This
-allows the message to be simple and saves space. The format is as follows:
-
-    -------------------------------------------------------------------
-    | Version (byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes) |
-    -------------------------------------------------------------------
-
-The UDP message has an overhead of 29 bytes per message.
-Tampering or bit corruption will cause the GCM tag verification to fail.
-
-Once we receive a packet, we first verify the GCM tag, and only on verification,
-decrypt the payload. The version byte is provided to allow future versions to
-change the algorithm they use. It is currently always set to 0.
-
-### TCP Message Format
-
-TCP provides a stream abstraction and therefor we must provide our own framing.
-This intoduces a potential attack vector since we cannot verify the tag
-until the entire message is received, and the message length must be in plaintext.
-Our current strategy is to limit the maximum size of a framed message to prevent
-a malicious attacker from being able to send enough data to cause a Denial of Service.
-
-The TCP format is similar to the UDP format, but prepends the message with
-a message type byte (similar to other Serf messages). It also adds a 4 byte length
-field, encoded in Big Endian format. This increases its maximum overhead to 33 bytes.
-
-When we first receive a TCP encrypted message, we check the message type. If any
-party has encryption enabled, the other party must as well. Otherwise we are vulnerable
-to a downgrade attack where one side can force the other into a non-encrypted mode of
-operation.
-
-Once this is verified, we determine the message length and check that it is less
-than our limit. After the entire message is received, the tag is used to verify
-the entire message.
-
 ## Threat Model
 
 The following are the various parts of our threat model:
 
-* Non-members getting access to events
+* Non-members getting access to data
 * Cluster state manipulation due to malicious messages
-* Fake event generation due to malicious messages
-* Tampering of messages causing state corruption
+* Fake data generation due to malicious messages
+* Tampering causing state corruption
 * Denial of Service against a node
 
-We are specifically not concerned about replay attacks, as the gossip
-protocol is designed to handle that due to the nature of its broadcast mechanism.
-
 Additionally, we recognize that an attacker that can observe network
 traffic for an extended period of time may infer the cluster members.
-The gossip mechanism used by Serf relies on sending messages to random
+The gossip mechanism used by Consul relies on sending messages to random
 members, so an attacker can record all destinations and determine all
 members of the cluster.
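The UDP message format described above (Version, 12-byte nonce, ciphertext, 16-byte GCM tag, for 29 bytes of overhead) can be reproduced with Go's standard crypto packages. This is a hedged sketch of that framing, not Serf's or Consul's actual wire code:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

const versionByte = 0 // currently always 0; reserved for future algorithm changes

// seal encrypts a payload with AES-128-GCM and frames it as:
// Version (1 byte) | Nonce (12 bytes) | CipherText | Tag (16 bytes).
func seal(key, payload []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // a 16-byte key selects AES-128
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block) // standard 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	out := []byte{versionByte}
	out = append(out, nonce...)
	// Seal appends the ciphertext followed by the GCM authentication tag.
	return gcm.Seal(out, nonce, payload, nil), nil
}

func main() {
	key := make([]byte, 16) // demo only: a real shared secret must be random
	msg, err := seal(key, []byte("ping"))
	if err != nil {
		panic(err)
	}
	// Overhead is 1 (version) + 12 (nonce) + 16 (tag) = 29 bytes.
	fmt.Println(len(msg) - len("ping")) // 29
}
```

On receipt, the same GCM open operation verifies the tag and decrypts in one step, so a tampered or corrupted packet fails verification before any plaintext is produced, matching the behavior the removed section describes.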
@@ -101,13 +51,3 @@ Our goal is not to protect top secret data but to provide a "reasonable"
 level of security that would require an attacker to commit a considerable
 amount of resources to defeat.
-
-## Future Roadmap
-
-Eventually, Serf will be able to use the versioning byte to support
-different encryption algorithms. These could be configured at the
-start time of the agent.
-
-Additionally, we need to support key rotation so that it is possible
-for network administrators to periodically change keys to ensure
-perfect forward security.