---
layout: docs
page_title: Consul on AWS Elastic Container Service (ECS) Architecture
description: >-
  Consul's architecture supports Amazon Web Services ECS deployments. Learn about how the two work together, including the order in which tasks and containers start up and shut down, as well as requirements for the AWS IAM auth method, the ACL controller and tokens, and health check syncing.
---

# Consul on AWS Elastic Container Service (ECS) Architecture

The following diagram shows the main components of the Consul architecture when deployed to an ECS cluster:

![Consul on ECS Architecture](/img/consul-ecs-arch.png)

1. **Consul servers:** Production-ready Consul server cluster.
1. **Application tasks:** Run user application containers along with two helper containers:
   1. **Consul client:** The Consul client container runs Consul. The Consul client communicates
      with the Consul server and configures the Envoy proxy sidecar. This communication
      is called _control plane_ communication.
   1. **Sidecar proxy:** The sidecar proxy container runs [Envoy](https://envoyproxy.io/). All requests
      to and from the application container(s) run through the sidecar proxy. This communication
      is called _data plane_ communication.
1. **Mesh Init:** Each task runs a short-lived container, called `mesh-init`, which sets up initial configuration
   for Consul and Envoy.
1. **Health Syncing:** Optionally, an additional `health-sync` container can be included in a task to sync health statuses
   from ECS into Consul.
1. **ACL Controller:** The ACL controller is responsible for automating configuration and cleanup in the Consul servers.
   The ACL controller automatically configures the [AWS IAM Auth Method](/consul/docs/security/acl/auth-methods/aws-iam) and cleans up
   unused ACL tokens from Consul. When using Consul Enterprise namespaces, the ACL controller also automatically creates Consul
   namespaces for ECS tasks.

For more information about how Consul works in general, see Consul's [Architecture Overview](/consul/docs/architecture).
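
In practice, these components are typically wired together with the `mesh-task` Terraform module discussed later in this document. The following is a minimal, illustrative sketch; inputs such as `family`, `port`, `container_definitions`, and `retry_join` reflect common `mesh-task` inputs, but check the module documentation for the exact interface of your consul-ecs version.

```hcl
module "my_task" {
  source = "hashicorp/consul-ecs/aws//modules/mesh-task"
  # Pin a module version appropriate to your Consul version.

  # Each task family corresponds to one Consul service.
  family = "my-app"

  # Port the application container listens on.
  port = 9090

  # Your application container(s); the module appends the
  # consul-client, sidecar-proxy, and mesh-init containers.
  container_definitions = [
    {
      name      = "my-app"
      image     = "my-app:latest"
      essential = true
    }
  ]

  # How the consul-client container finds the Consul servers.
  retry_join = ["<consul server address>"]
}
```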

## Task Startup

This diagram shows the timeline of a task starting up and all its containers:

<ImageConfig width={400}>

![Task Startup Timeline](/img/ecs-task-startup.svg)

</ImageConfig>

- **T0:** ECS starts the task. The `consul-client` and `mesh-init` containers start:
  - `consul-client` does the following:
    - If ACLs are enabled, a startup script runs a `consul login` command to obtain a
      token from the AWS IAM auth method for the Consul client. This token has `node:write`
      permissions.
    - It uses the `retry-join` option to join the Consul cluster.
  - `mesh-init` does the following:
    - If ACLs are enabled, `mesh-init` runs a `consul login` command to obtain a token from
      the AWS IAM auth method for the service registration. This token has `service:write`
      permissions for the service and its sidecar proxy. This token is written to a shared
      volume for use by the `health-sync` container.
    - It registers the service for the current task and its sidecar proxy with Consul.
    - It runs `consul connect envoy -bootstrap` to generate Envoy's bootstrap JSON file and
      writes it to a shared volume.
- **T1:** The following containers start:
  - `sidecar-proxy` starts using a custom entrypoint command, `consul-ecs envoy-entrypoint`.
    The entrypoint command starts Envoy by running `envoy -c <path-to-bootstrap-json>`.
  - `health-sync` starts if ECS health checks are defined or if ACLs are enabled. It syncs health
    checks from ECS to Consul (see [ECS Health Check Syncing](#ecs-health-check-syncing)).
- **T2:** ECS marks the `sidecar-proxy` container as healthy, using a health check that
  detects whether its public listener port is open. At this time, your application containers start,
  since all of the Consul machinery is ready to service requests.
## Task Shutdown

This diagram shows an example timeline of a task shutting down:

<ImageConfig width={400}>

![Task Shutdown Timeline](/img/ecs-task-shutdown.svg)

</ImageConfig>

- **T0**: ECS sends a TERM signal to all containers. Each container reacts to the TERM signal:
  - `consul-client` begins to gracefully leave the Consul cluster.
  - `health-sync` stops syncing health status from ECS into Consul checks.
  - `sidecar-proxy` ignores the TERM signal and continues running until the `user-app` container
    exits. The custom entrypoint command, `consul-ecs envoy-entrypoint`, monitors the local ECS task
    metadata. It waits until the `user-app` container has exited before terminating Envoy. This
    enables the application to continue making outgoing requests through the proxy to the mesh for
    a graceful shutdown.
  - `user-app` exits if it is not configured to ignore the TERM signal. The `user-app` container
    continues running if it is configured to ignore the TERM signal.
- **T1**:
  - `health-sync` does the following:
    - It updates its Consul checks to critical status and exits. This ensures this service instance is marked unhealthy.
    - If ACLs are enabled, it runs `consul logout` for the two tokens created by the `consul-client` and `mesh-init` containers.
      This removes those tokens from Consul. If `consul logout` fails for some reason, the ACL controller removes the tokens
      after the task has stopped.
  - `sidecar-proxy` notices the `user-app` container has stopped and exits.
- **T2**: `consul-client` finishes gracefully leaving the Consul datacenter and exits.
- **T3**:
  - ECS notices that all containers have exited and soon changes the task status to `STOPPED`.
  - Updates about this task have reached the rest of the Consul cluster, so downstream proxies have been updated to stop sending traffic to this task.
- **T4**: At this point, task shutdown should be complete. Otherwise, ECS sends a KILL signal to any containers still running. The KILL signal cannot be ignored and forcefully stops containers. This interrupts in-progress operations and may cause errors.
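
The window between TERM (T0) and KILL (T4) is controlled by ECS, not by Consul. As a hedged sketch, the standard ECS `stopTimeout` container parameter can be raised to give slow-draining applications more time before the KILL signal; the values below are illustrative.

```hcl
# Sketch: raise the TERM-to-KILL window for a slow-draining app.
# This fragment belongs in a container_definitions input (for example,
# of the mesh-task module). stopTimeout is a standard ECS container
# parameter: the seconds ECS waits after TERM before sending KILL.
# The 120-second value is an arbitrary example.
container_definitions = [
  {
    name        = "user-app"
    image       = "my-app:latest"
    essential   = true
    stopTimeout = 120
  }
]
```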

## ACL Tokens

Two types of ACL tokens are required by ECS tasks:

* **Client tokens:** used by the `consul-client` containers to join the Consul cluster
* **Service tokens:** used by sidecar containers for service registration and health syncing

With Consul on ECS, these tokens are obtained dynamically when a task starts up by logging
in via Consul's AWS IAM auth method.

### Consul Client Token

Consul client tokens require `node:write` for any node name, which is necessary because the Consul node
names on ECS are not known until runtime.
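
For reference, a Consul ACL policy granting `node:write` for any node name looks like the following. Consul ACL policies are written in HCL; this is a sketch of the kind of policy the ACL controller creates for client tokens, not necessarily its exact contents.

```hcl
# Grants node:write on every node name, since ECS node names
# are only known at runtime.
node_prefix "" {
  policy = "write"
}
```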

### Service Token

Service tokens are associated with a [service identity](/consul/docs/security/acl#service-identities).
The service identity includes `service:write` permissions for the service and sidecar proxy.
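
As a sketch, a service identity for a service named `my-app` is roughly equivalent to the following policy; service identities also grant limited read permissions for service discovery.

```hcl
# Approximate policy synthesized by a service identity for "my-app".
service "my-app" {
  policy = "write"
}
service "my-app-sidecar-proxy" {
  policy = "write"
}

# Service identities also include read access for discovery:
service_prefix "" {
  policy = "read"
}
node_prefix "" {
  policy = "read"
}
```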

## AWS IAM Auth Method

Consul's [AWS IAM Auth Method](/consul/docs/security/acl/auth-methods/aws-iam) is used by ECS tasks to
automatically obtain Consul ACL tokens. When a service mesh task on ECS starts up, it runs two
`consul login` commands to obtain a client token and a service token via the auth method. When the
task stops, it attempts two `consul logout` commands to destroy these tokens.

During a `consul login`, the [task's IAM
role](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html) is presented
to the AWS IAM auth method on the Consul servers. The role is validated with AWS. If the role is
valid, and if the auth method trusts the IAM role, then the role is permitted to log in. A new Consul
ACL token is created and [Binding Rules](/consul/docs/security/acl/auth-methods#binding-rules) associate
permissions with the newly created token. These permissions are mapped to the token based on the IAM
role details. For example, tags on the IAM role are used to specify the service name and the
Consul Enterprise namespace to be associated with a service token that is created by a successful
login to the auth method.

### Task IAM Role

The task IAM role requires the following configuration to be compatible with the auth method. When
using Terraform, the `mesh-task` module creates the task role with this configuration by default. A
sketch of such a role follows this list.

* A scoped `iam:GetRole` permission must be included on the IAM role, enabling the role to fetch
  details about itself.
* A `consul.hashicorp.com.service-name` tag on the IAM role must be set to the Consul service name.
* <EnterpriseAlert inline /> A <code>consul.hashicorp.com.namespace</code> tag must be set on the
  IAM role to the Consul Enterprise namespace of the Consul service for the task.

Task IAM roles typically should not be shared across task families. Because a task family represents a
single Consul service, and the task role must include the Consul service name, one task role
is required for each task family when using the auth method.
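
The following Terraform sketch illustrates these requirements. The role name and service name are illustrative, and the `mesh-task` module normally creates an equivalent role for you.

```hcl
resource "aws_iam_role" "task" {
  name = "my-app-task-role"

  # Role path matching the default BoundIAMPrincipalARNs pattern
  # used by the ACL controller (see below).
  path = "/consul-ecs/"

  # Only ECS tasks may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })

  # The auth method reads this tag to determine the Consul service name.
  tags = {
    "consul.hashicorp.com.service-name" = "my-app"
  }
}

# The role must be able to fetch details about itself.
resource "aws_iam_role_policy" "get_role" {
  name = "get-role"
  role = aws_iam_role.task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "iam:GetRole"
      Resource = aws_iam_role.task.arn
    }]
  })
}
```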

### Security

The auth method relies on the configuration of AWS resources, such as IAM roles, IAM policies, and
ECS tasks. If these AWS resources are misconfigured or if the account has loose access controls,
then the security of your service mesh may be at risk.

Any entity in your AWS account with the ability to obtain credentials for an IAM role could potentially
obtain a Consul ACL token and impersonate a Consul service. The `mesh-task` Terraform module
mitigates this concern by creating the task role with an `AssumeRolePolicyDocument` that
allows only the AWS ECS service to assume the task role. By default, other entities are unable
to obtain credentials for task roles and are unable to abuse the AWS IAM auth method to obtain
Consul ACL tokens.

However, other entities in your AWS account with the ability to create or modify IAM roles can
potentially circumvent this. For example, if they are able to create an IAM role with the correct
tags, they can obtain a Consul ACL token for any service. Or, if they can pass a role to an ECS task
and start an ECS task, they can use the task to obtain a Consul ACL token via the auth method.

The IAM policy actions `iam:CreateRole`, `iam:TagRole`, `iam:PassRole`, and `sts:AssumeRole` can be
used to restrict these capabilities in your AWS account and improve security when using the AWS IAM
auth method. See the [AWS
documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html) to learn how
to restrict these permissions in your AWS account.
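
As one illustrative sketch (not an exhaustive security policy), the following statement limits where a principal can pass the `/consul-ecs/` task roles; the account ID is an example, and you should adapt the resources and conditions to your account's conventions.

```hcl
# Illustrative: restrict iam:PassRole so the consul-ecs task roles can
# only be passed to the ECS tasks service. Attach to principals that
# deploy ECS tasks. This is a sketch, not a complete security policy.
data "aws_iam_policy_document" "restrict_pass_role" {
  statement {
    effect    = "Allow"
    actions   = ["iam:PassRole"]
    resources = ["arn:aws:iam::123456789012:role/consul-ecs/*"]

    condition {
      test     = "StringEquals"
      variable = "iam:PassedToService"
      values   = ["ecs-tasks.amazonaws.com"]
    }
  }
}
```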

## ACL Controller

The ACL controller performs the following operations on the Consul servers:

* Configures the Consul AWS IAM auth method.
* Monitors tasks in the ECS cluster where the controller is running.
* Cleans up unused Consul ACL tokens created by tasks in this cluster.
* <EnterpriseAlert inline /> Manages Consul admin partitions and namespaces.

### Auth Method Configuration

The ACL controller is responsible for configuring the AWS IAM auth method. The following resources
are created by the ACL controller when it starts up:

* **Client role**: The controller creates the Consul (not IAM) role and policy used for client
  tokens if these do not exist. This policy has `node:write` permissions to enable Consul clients to
  join the Consul cluster.
* **Auth method for client tokens**: One instance of the AWS IAM auth method is created for client
  tokens, if it does not exist. A binding rule is configured that attaches the Consul client role to each
  token created during a successful login to this auth method instance.
* **Auth method for service tokens**: One instance of the AWS IAM auth method is created for service
  tokens, if it does not exist:
  * A binding rule is configured to attach a [service identity](/consul/docs/security/acl#service-identities)
    to each token created during a successful login to this auth method instance. The service name for
    this service identity is taken from the tag, `consul.hashicorp.com.service-name`, on the IAM role
    used to log in.
  * <EnterpriseAlert inline /> A namespace binding rule is configured to create service tokens in
    the namespace specified by the tag, <code>consul.hashicorp.com.namespace</code>, on the IAM
    role used to log in.
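
The controller creates these binding rules automatically; you do not configure them yourself. Purely for illustration, a service-token binding rule expressed with the Terraform Consul provider might look roughly like the following. The auth method name and the `entity_tags.*` template field are assumptions here; verify both against your Consul and consul-ecs versions.

```hcl
# Illustrative only: the ACL controller manages this automatically.
# Assumes an auth method named "iam-ecs-service-token" and that the
# aws-iam auth method exposes IAM role tags as entity_tags.* fields.
resource "consul_acl_binding_rule" "service_token" {
  auth_method = "iam-ecs-service-token"
  description = "Bind a service identity from the IAM role's service-name tag"
  bind_type   = "service"
  bind_name   = "$${entity_tags.consul.hashicorp.com.service-name}"
}
```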

The ACL controller configures both instances of the auth method to permit only certain IAM roles to log in,
by setting the [`BoundIAMPrincipalARNs`](/consul/docs/security/acl/auth-methods/aws-iam#boundiamprincipalarns)
field of the AWS IAM auth method as follows:

* By default, the only IAM roles permitted to log in must have an ARN matching the pattern
  `arn:aws:iam::<ACCOUNT>:role/consul-ecs/*`. This permits only IAM roles at the role path `/consul-ecs/`,
  and only those in the same AWS account where the ACL controller is running, to log in.
* The role path can be changed by setting the `iam_role_path` input variable for the `mesh-task` and
  `acl-controller` modules, or by passing the `-iam-role-path` flag to the `consul-ecs
  acl-controller` command.
* Each instance of the auth method is shared by ACL controllers in the same Consul datacenter. Each
  controller updates the auth method, if necessary, to include additional entries in the
  `BoundIAMPrincipalARNs` list. This enables the use of the auth method with ECS clusters in
  different AWS accounts, for example. This does not apply when using Consul Enterprise admin
  partitions, because auth method instances are not shared by multiple controllers in that case.
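
For example, to place IAM roles under a custom path, the same path would be set on both modules so the `BoundIAMPrincipalARNs` pattern matches the roles that log in. A sketch, using the `iam_role_path` input named above; other required module inputs are omitted.

```hcl
# Sketch: use a custom IAM role path consistently across both modules.
module "acl_controller" {
  source = "hashicorp/consul-ecs/aws//modules/acl-controller"

  iam_role_path = "/my-mesh/"
  # ... other required inputs ...
}

module "my_task" {
  source = "hashicorp/consul-ecs/aws//modules/mesh-task"

  iam_role_path = "/my-mesh/"
  # ... other required inputs ...
}
```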

### Task Monitoring

After startup, the ACL controller monitors tasks in the same ECS cluster where the ACL controller is
running to discover newly running tasks and tasks that have stopped.

The ACL controller cleans up tokens created by `consul login` for tasks that are no longer running.
Normally, each task attempts `consul logout` commands when the task stops to destroy its tokens.
However, in unstable conditions the `consul logout` command may fail to clean up a token.
The ACL controller runs continually to ensure those unused tokens are soon removed.

### Admin Partitions and Namespaces<EnterpriseAlert inline />

When [admin partitions and namespaces](/consul/docs/ecs/enterprise#admin-partitions-and-namespaces) are enabled,
the ACL controller is assigned to its configured admin partition. Only one ACL controller instance is supported
per ECS cluster, resulting in an architecture with one admin partition per ECS cluster.

When admin partitions and namespaces are enabled, the ACL controller performs the following
additional actions:

* At startup, creates its assigned admin partition if it does not exist.
* Inspects task tags for new ECS tasks to discover the task's intended partition
  and namespace. The ACL controller ignores tasks with a partition tag that does not match the
  controller's assigned partition.
* Creates namespaces when tasks start up. Namespaces are only created if they do not exist.
* Creates auth method instances for client and service tokens in the controller's assigned admin partition.
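
Purely as an illustration of the tag inspection described above, a task might carry partition and namespace tags along the following lines. The exact tag keys are set by your consul-ecs tooling, so treat these as assumptions and verify them against your version.

```hcl
# Assumed tag keys for illustration; verify against your consul-ecs
# version. This fragment belongs on the ECS task (or task definition).
tags = {
  "consul.hashicorp.com.partition" = "ecs-cluster-a"
  "consul.hashicorp.com.namespace" = "my-namespace"
}
```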

## ECS Health Check Syncing

ECS health checks automatically sync with Consul health checks for all application containers that meet the following conditions:

* The container is marked as `essential`
* The container has ECS `healthChecks`
* The container is not configured with native Consul health checks

The `mesh-init` container creates a TTL health check for every container that fits these criteria,
and the `health-sync` container ensures that the ECS and Consul health checks remain in sync.
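
As a sketch, a container that qualifies for syncing might be defined as follows; the check command and timings are example values, and `healthCheck` here is the standard ECS container health check structure.

```hcl
module "my_task" {
  source = "hashicorp/consul-ecs/aws//modules/mesh-task"
  # ... other inputs as shown earlier ...

  container_definitions = [
    {
      # Essential, has an ECS healthCheck, and defines no native
      # Consul checks, so its status is synced into Consul.
      name      = "my-app"
      image     = "my-app:latest"
      essential = true

      healthCheck = {
        command  = ["CMD-SHELL", "curl -f http://localhost:9090/health || exit 1"]
        interval = 30 # seconds between checks
        timeout  = 5  # seconds before a check attempt fails
        retries  = 3  # failures before the container is unhealthy
      }
    }
  ]
}
```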