diff --git a/website/pages/docs/agent/index.mdx b/website/pages/docs/agent/index.mdx
index 7d27e71aa2..25ab38354e 100644
--- a/website/pages/docs/agent/index.mdx
+++ b/website/pages/docs/agent/index.mdx
@@ -15,23 +15,28 @@ information, registers services, runs checks, responds to queries, and more.

 The agent must run on every node that is part of a Consul cluster. Any agent
 may run in one of two modes: client or server. A server
-node takes on the additional responsibility of being part of the [consensus quorum](/docs/internals/consensus).
+node takes on the additional responsibility of being part of the
+[consensus quorum](/docs/internals/consensus).
 These nodes take part in Raft and provide strong consistency and availability in
-the case of failure. The higher burden on the server nodes means that usually they
-should be run on dedicated instances -- they are more resource intensive than a client
-node. Client nodes make up the majority of the cluster, and they are very lightweight
-as they interface with the server nodes for most operations and maintain very little state
-of their own.
+the case of failure. The higher burden on the server nodes means that usually
+they should be run on dedicated instances -- they are more resource intensive
+than a client node. Client nodes make up the majority of the cluster, and they
+are very lightweight as they interface with the server nodes for most
+operations and maintain very little state of their own.

 ## Running an Agent

-The agent is started with the [`consul agent`](/docs/commands/agent) command. This
-command blocks, running forever or until told to quit. You can test a local agent by following the [Getting Started guides](https://learn.hashicorp.com/consul/getting-started/install?utm_source=consul.io&utm_medium=docs).
+The agent is started with the [`consul agent`](/docs/commands/agent) command.
+This command blocks, running forever or until told to quit. You can test a
+local agent by following the
+[Getting Started guides](https://learn.hashicorp.com/consul/getting-started/install?utm_source=consul.io&utm_medium=docs).

-The agent command takes a variety
-of [`configuration options`](/docs/agent/options#command-line-options), but most have sane defaults.
+The agent command takes a variety of
+[`configuration options`](/docs/agent/options#command-line-options), but most
+have sane defaults.

-When running [`consul agent`](/docs/commands/agent), you should see output similar to this:
+When running [`consul agent`](/docs/commands/agent), you should see output
+similar to this:

 ```shell-session
 $ consul agent -data-dir=/tmp/consul
@@ -49,33 +54,38 @@ $ consul agent -data-dir=/tmp/consul
 ...
 ```

-There are several important messages that [`consul agent`](/docs/commands/agent) outputs:
+There are several important messages that
+[`consul agent`](/docs/commands/agent) outputs:

 - **Node name**: This is a unique name for the agent. By default, this
   is the hostname of the machine, but you may customize it using the
   [`-node`](/docs/agent/options#_node) flag.

-- **Datacenter**: This is the datacenter in which the agent is configured to run.
-  Consul has first-class support for multiple datacenters; however, to work efficiently,
-  each node must be configured to report its datacenter. The [`-datacenter`](/docs/agent/options#_datacenter)
-  flag can be used to set the datacenter. For single-DC configurations, the agent
-  will default to "dc1".
+- **Datacenter**: This is the datacenter in which the agent is configured to
+  run. Consul has first-class support for multiple datacenters; however, to
+  work efficiently, each node must be configured to report its datacenter.
+  The [`-datacenter`](/docs/agent/options#_datacenter) flag can be used to set
+  the datacenter. For single-DC configurations, the agent will default to
+  "dc1". An example of setting these flags is shown after this list.

-- **Server**: This indicates whether the agent is running in server or client mode.
+- **Server**: This indicates whether the agent is running in server or client
+  mode.
   Server nodes have the extra burden of participating in the consensus quorum,
   storing cluster state, and handling queries. Additionally, a server may be in
   ["bootstrap"](/docs/agent/options#_bootstrap_expect) mode. Multiple servers
-  cannot be in bootstrap mode as that would put the cluster in an inconsistent state.
+  cannot be in bootstrap mode as that would put the cluster in an inconsistent
+  state.

 - **Client Addr**: This is the address used for client interfaces to the agent.
-  This includes the ports for the HTTP and DNS interfaces. By default, this binds only
-  to localhost. If you change this address or port, you'll have to specify a `-http-addr`
-  whenever you run commands such as [`consul members`](/docs/commands/members) to
-  indicate how to reach the agent. Other applications can also use the HTTP address and port
+  This includes the ports for the HTTP and DNS interfaces. By default, this
+  binds only to localhost. If you change this address or port, you'll have to
+  specify a `-http-addr` whenever you run commands such as
+  [`consul members`](/docs/commands/members) to indicate how to reach the
+  agent. Other applications can also use the HTTP address and port
   [to control Consul](/api).

-- **Cluster Addr**: This is the address and set of ports used for communication between
-  Consul agents in a cluster. Not all Consul agents in a cluster have to
+- **Cluster Addr**: This is the address and set of ports used for communication
+  between Consul agents in a cluster. Not all Consul agents in a cluster have to
   use the same port, but this address **MUST** be reachable by all other nodes.
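+
+For example, a minimal invocation that sets the node name and datacenter
+explicitly might look like this (the `node-a` name is illustrative; `dc1` is
+simply the default made explicit):
+
+```shell-session
+$ consul agent -data-dir=/tmp/consul -node=node-a -datacenter=dc1
+```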

 When running under `systemd` on Linux, Consul notifies systemd by sending
@@ -85,44 +95,62 @@ service definition file has to have `Type=notify` set.

 ## Stopping an Agent

-An agent can be stopped in two ways: gracefully or forcefully. To gracefully
-halt an agent, send the process an interrupt signal (usually
-`Ctrl-C` from a terminal or running `kill -INT consul_pid` ). When gracefully exiting, the agent first notifies
-the cluster it intends to leave the cluster. This way, other cluster members
-notify the cluster that the node has _left_.
+An agent can be stopped in two ways: gracefully or forcefully. Servers and
+clients behave differently depending on the type of leave that is performed.
+There are two potential states a process can be in after a system signal is
+sent: _left_ and _failed_.

-Alternatively, you can force kill the agent by sending it a kill signal.
-When force killed, the agent ends immediately. The rest of the cluster will
-eventually (usually within seconds) detect that the node has died and
-notify the cluster that the node has _failed_.

+To gracefully halt an agent, send the process an _interrupt signal_ (usually
+`Ctrl-C` from a terminal, or running `kill -INT consul_pid`). For more
+information on the different signals sent by the `kill` command, see
+[here](https://www.linux.org/threads/kill-signals-and-commands-revised.11625/).

-It is especially important that a server node be allowed to leave gracefully
-so that there will be a minimal impact on availability as the server leaves
-the consensus quorum.

+When a client is gracefully exited, the agent first notifies the cluster that
+it intends to leave. This way, other cluster members are notified that the
+node has _left_.
+
+When a server is gracefully exited, the server will not be marked as _left_.
+This is to minimize the impact on the consensus quorum. Instead, the server
+will be marked as _failed_. To remove a server from the cluster, the
+[`force-leave`](/docs/commands/force-leave) command is used. Using
+`force-leave` will put the server instance in a _left_ state so long as the
+server agent is no longer alive.
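+
+As a sketch of this server shutdown flow (the node name `node-b` and the
+`pgrep` lookup are illustrative, and a single local Consul process is assumed):
+
+```shell-session
+$ kill -INT $(pgrep -x consul)   # graceful stop; the server is marked as failed
+$ consul force-leave node-b      # transition the stopped server to left
+```
+
+As described above, `force-leave` only completes the transition to _left_
+because the stopped agent is no longer alive.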
+
+Alternatively, you can forcibly stop an agent by sending it a kill signal, for
+example `kill -KILL consul_pid`. This will stop any agent immediately. The rest
+of the cluster will eventually (usually within seconds) detect that the node
+has died and notify the cluster that the node has _failed_.

 For client agents, the difference between a node _failing_ and a node _leaving_
 may not be important for your use case. For example, for a web server and load
 balancer setup, both result in the same outcome: the web node is removed
 from the load balancer pool.

+The [`skip_leave_on_interrupt`](/docs/agent/options#skip_leave_on_interrupt) and
+[`leave_on_terminate`](/docs/agent/options#leave_on_terminate) configuration
+options allow you to adjust this behavior.
+
 ## Lifecycle

 Every agent in the Consul cluster goes through a lifecycle. Understanding
 this lifecycle is useful for building a mental model of an agent's interactions
 with a cluster and how the cluster treats a node.

-When an agent is first started, it does not know about any other node in the cluster.
+When an agent is first started, it does not know about any other node in the
+cluster.
 To discover its peers, it must _join_ the cluster. This is done with the [`join`](/docs/commands/join)
 command or by providing the proper configuration to auto-join on start. Once a node
-joins, this information is gossiped to the entire cluster, meaning all nodes will
-eventually be aware of each other. If the agent is a server, existing servers will
-begin replicating to the new node.
+joins, this information is gossiped to the entire cluster, meaning all
+nodes will eventually be aware of each other. If the agent is a server,
+existing servers will begin replicating to the new node.
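+
+For example, assuming a second agent is already running and reachable at the
+illustrative address `10.1.10.12`, the local agent could join it with:
+
+```shell-session
+$ consul join 10.1.10.12
+```
+
+Supplying the same address in the agent's join configuration at startup has
+the same effect.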

 In the case of a network failure, some nodes may be unreachable by other nodes.
-In this case, unreachable nodes are marked as _failed_. It is impossible to distinguish
-between a network failure and an agent crash, so both cases are handled the same.
-Once a node is marked as failed, this information is updated in the service catalog.
+In this case, unreachable nodes are marked as _failed_. It is impossible to
+distinguish between a network failure and an agent crash, so both cases are
+handled the same.
+Once a node is marked as failed, this information is updated in the service
+catalog.

-> **Note:** There is some nuance here since this update is only possible if the servers can still [form a quorum](/docs/internals/consensus). Once the network recovers or a crashed agent restarts, the cluster will repair itself and unmark a node as failed. The health check in the catalog will also be updated to reflect this.
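+
+You can observe these state transitions from any agent with
+[`consul members`](/docs/commands/members); a node's `Status` column reflects
+whether it is currently `alive`, `failed`, or `left`. For example:
+
+```shell-session
+$ consul members
+```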