docs: Well Architected Framework content migration (#21099)

* Migration

* move page
Jeff Boruszak 2024-05-20 14:04:10 -07:00 committed by GitHub
parent f12ba3f2a5
commit 1c0f6e5597
5 changed files with 435 additions and 2 deletions


@ -0,0 +1,83 @@
---
layout: docs
page_title: Consul monitoring and alerts recommendations
description: >-
  Apply best practices for Consul monitoring and alerts.
---
# Consul monitoring and alerts recommendations
This document will guide you through which host resources to monitor and how monitoring tools can help you set up alerts to notify you when your Consul cluster exceeds its limits. By monitoring Consul and setting up alerts, you can ensure Consul works as expected for all your service discovery and service mesh needs.
## Instance level monitoring
While each host environment and Consul deployment is unique, these recommendations can serve as a starting point for you to reference to meet the unique needs of your deployment.
A Consul datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations like service discovery or service mesh. A datacenter contains at least one Consul server agent, but a real-world deployment contains three or five server agents and several Consul client agents.
Consul server agents store all state information, including service and node IP addresses, health checks, and configuration. Consul clients report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter. If you have Kubernetes workloads, you can also run Consul with an alternate service mesh configuration that deploys Envoy proxies but not client agents. Refer to [Simplified service mesh with Consul dataplanes](/consul/docs/connect/dataplane) for more information.
We recommend monitoring the following parameters for Consul agent health:
- Disk space and file handles
- [RAM utilization](/consul/docs/agent/telemetry#memory-usage)
- CPU utilization
- Network activity and utilization
We recommend using an [application performance monitoring (APM) system](#monitoring-tools) to track these metrics. For a full list of key metrics, visit the [Key metrics](/consul/docs/agent/telemetry#key-metrics) section of Telemetry documentation.
## Recommendations for host-level alerts
We recommend starting with a small cluster for most initial production deployments or for testing environments. For production environments with a consistently high workload, we recommend large clusters. Refer to the [Consul capacity planning](/well-architected-framework/reliability/reliability-consul-capacity-planning#minimum-hardware-requirements) article for more information.
When collecting metrics, it is important to establish a baseline. This baseline ensures your Consul deployment is healthy, and serves as a reference point when troubleshooting abnormal cluster behavior. Complete the [Monitor Consul datacenter health](/consul/tutorials/day-2-operations/monitor-datacenter-health#how-to-collect-metrics) tutorial to learn how to collect metrics.
Once you have established a baseline for your metrics, use them and the following recommendations to configure reasonable alerts for your Consul agent.
### Memory alert recommendations
Consul uses RAM as the primary storage for data on its leader node, while periodically flushing it to disk. Refer to the [Memory usage](/consul/docs/agent/telemetry#memory-usage) section of the Telemetry documentation for more details. The recommended instance type depends on your hosting provider. Refer to [Hardware sizing for Consul servers](/consul/tutorials/production-deploy/reference-architecture#hardware-sizing-for-consul-servers) for recommended instance types for most cloud providers along with other up-to-date hardware recommendations.
When determining how much RAM you should allocate, we recommend enough RAM for your server agents to contain 2 to 4 times the working set size. You can determine the working set size by noting the value of `consul.runtime.alloc_bytes` in the telemetry data.
Set up an alert if your RAM usage exceeds a reasonable threshold (for example, 90% of your allocated RAM).
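The sizing rule and the example threshold above can be expressed as simple arithmetic. The following Python sketch is illustrative only: the function names and the sample readings are hypothetical, and the 90% threshold is the example value from this section rather than a required setting.

```python
from typing import Tuple


def recommended_ram_bytes(working_set_bytes: int) -> Tuple[int, int]:
    """Return the (low, high) RAM range: 2x to 4x the working set size."""
    return 2 * working_set_bytes, 4 * working_set_bytes


def ram_alert(used_bytes: float, allocated_bytes: float, threshold: float = 0.90) -> bool:
    """Return True when RAM usage exceeds the threshold fraction of allocated RAM."""
    return used_bytes > threshold * allocated_bytes


# Hypothetical reading of consul.runtime.alloc_bytes: 2 GiB.
working_set = 2 * 1024**3
low, high = recommended_ram_bytes(working_set)
print(f"Allocate between {low // 1024**3} and {high // 1024**3} GiB of RAM")
```

In practice, you would feed `ram_alert` with values from your monitoring system and tune the threshold against your own baseline.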
### CPU alert recommendations
Your Consul servers should scale up to handle peak CPU load, not idle load. When idle, Consul servers are waiting to react to changes in service health, placement, or other configuration. If there are any service state changes, the Consul server has to notify all impacted Consul clients simultaneously. For example, if the Consul server has to notify hundreds or thousands of Consul clients of a service state update, the Consul server CPU may spike.
If this happens, your monitoring dashboard will show a CPU spike on all servers immediately after a large registration or deregistration operation. This should not happen. You should be able to do a rollout or other high-change operation without taxing the Consul servers.
Set up an alert to detect CPU spikes on your Consul server agents. When this happens, evaluate the size of your Consul servers and upgrade them accordingly.
### Network recommendations
The data sent between all Consul agents must follow latency requirements for total round trip time (RTT):
- Average RTT for all traffic cannot exceed 50ms.
- RTT for 99 percent of traffic cannot exceed 100ms.
Refer to the [Reference architecture](/consul/tutorials/production-deploy/reference-architecture#network-latency-and-bandwidth) to learn more about network latency and bandwidth guidance.
Set an alert to detect when the RTT exceeds these values. You should also monitor metrics related to the host's network latency so that the RTT does not exceed these values.
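As a rough illustration of the two RTT limits, the following Python sketch checks a batch of RTT samples against the 50ms average and 100ms 99th-percentile requirements. The function name and the nearest-rank percentile method are assumptions made for this example. Your monitoring tool may compute percentiles differently.

```python
from statistics import mean
from typing import List, Sequence


def rtt_violations(
    samples_ms: Sequence[float],
    avg_limit_ms: float = 50.0,
    p99_limit_ms: float = 100.0,
) -> List[str]:
    """Return which RTT requirements the samples violate."""
    if not samples_ms:
        raise ValueError("at least one RTT sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank 99th percentile: ceil(0.99 * n), using integer math.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    violations = []
    if mean(ordered) > avg_limit_ms:
        violations.append("average")
    if p99 > p99_limit_ms:
        violations.append("p99")
    return violations
```

An empty result means the sampled traffic meets both requirements.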
### Monitoring Consul using Prometheus and Grafana
Time series based observability tools, such as Grafana and Prometheus, help you monitor the health of Consul clusters over long intervals of time. Refer to the
[Monitoring for Layer 7 observability with Prometheus, Grafana, and Kubernetes](/consul/tutorials/day-2-operations/kubernetes-layer7-observability) tutorial for additional information.
### Monitoring Consul using Datadog
Datadog is a SaaS-based monitoring and analytics platform for large-scale applications and infrastructure. It is one of the supported platforms for monitoring Consul. Datadog agents run on your hosts and report logs, metrics, and traces. By configuring Datadog agents on your Consul server and client instances, you can monitor your Consul cluster's health.
Refer to the following resources for more information:
- [Set up Consul logging with Datadog](https://www.datadoghq.com/blog/consul-datadog/)
- [Datadog monitoring solutions brief](https://www.datocms-assets.com/2885/1576713622-datadog-consul.pdf)
- [HashiCorp partner portal for Consul support on Datadog](https://www.hashicorp.com/partners/tech/datadog#consul)
## Next steps
In this guide, you learned which host resources to monitor and how monitoring tools can help you set up alerts to notify you when your Consul cluster exceeds its limits.
- To learn about monitoring the Consul control and data plane, visit our [Monitoring Consul components](/well-architected-framework/reliability/reliability-consul-monitoring-consul-components) documentation.
- Complete the [Monitor Consul datacenter health with Telegraf](/consul/tutorials/day-2-operations/monitor-health-telegraf) tutorial for additional metrics and alerting recommendations.


@ -0,0 +1,121 @@
---
layout: docs
page_title: Monitoring Consul components
description: >-
  Apply best practices for monitoring your Consul control and data plane.
---
# Monitoring Consul components
This document provides recommendations for monitoring your Consul control and data plane. By keeping track of these components and setting up alerts, you can better maintain the overall health and resilience of your service mesh.
## Background
A Consul datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations like service discovery or service mesh. A datacenter contains at least one Consul server agent, but a real-world deployment contains three or five server agents and several Consul client agents.
The Consul control plane consists of server agents that store all state information, including service and node IP addresses, health checks, and configuration. The control plane is also responsible for securing the mesh, facilitating service discovery, health checking, policy enforcement, and other similar operational concerns. It also contains client agents that report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter.
The Consul data plane consists of proxies deployed locally alongside each service instance. These proxies, called sidecar proxies, receive mesh configuration data from the control plane, and control network communication between their local service instance and other services in the network. The sidecar proxy handles inbound and outbound service connections, and ensures TLS connections between services are both verified and encrypted.
If you have Kubernetes workloads, you can also run Consul with an alternate service mesh configuration that deploys Envoy proxies but not client agents. Refer to [Simplified service mesh with Consul dataplanes](/consul/docs/connect/dataplane) for more information.
## Consul control plane monitoring
The Consul control plane consists of the following components:
- RPC communication between Consul servers and clients
- Data plane routing instructions for the Envoy Layer 7 proxy
- Serf traffic: LAN and WAN
- Consul cluster peering and server federation
It is important to monitor Consul control plane components and to establish baselines and alert thresholds for them so that you can detect abnormal behavior. Note that some planned events, such as Consul cluster upgrades, configuration changes, or leadership changes, can also trigger these alerts.
To help monitor your Consul control plane, we recommend that you establish a baseline and standard deviation for the following:
- [Server health](/consul/docs/agent/telemetry#server-health)
- [Leadership changes](/consul/docs/agent/telemetry#leadership-changes)
- [Key metrics](/consul/docs/agent/telemetry#key-metrics)
- [Autopilot](/consul/docs/agent/telemetry#autopilot)
- [Network activity](/consul/docs/agent/telemetry#network-activity-rpc-count)
- [Certificate authority expiration](/consul/docs/agent/telemetry#certificate-authority-expiration)
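One way to turn a baseline and standard deviation into an alert condition is to flag values that fall too far from the baseline mean. The following Python sketch is an assumption-laden illustration: the function name is hypothetical, and the three-standard-deviation cutoff is a common starting point rather than a Consul recommendation.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(
    baseline: Sequence[float],
    value: float,
    num_stdevs: float = 3.0,  # assumed cutoff; tune for your environment
) -> bool:
    """Return True when value is more than num_stdevs standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("the baseline needs at least two samples")
    center = mean(baseline)
    spread = stdev(baseline)
    return abs(value - center) > num_stdevs * spread
```

Because planned events such as upgrades can also move these metrics, consider silencing or widening the cutoff during maintenance windows.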
It is important to have a highly performant network with low latency. Ensure that network latency for gossip in all datacenters is within the 8ms latency budget for all Consul agents. View the [Production server requirements](/consul/docs/install/performance#production-server-requirements) for more information.
### Raft recommendations
Consul uses [Raft as its consensus protocol](/consul/docs/architecture/consensus). High saturation of the Raft goroutines can lead to elevated latency in the rest of the system and may cause the Consul cluster to be unstable. As a result, it is important to monitor Raft to track your control plane health. We recommend the following actions to keep the control plane healthy:
- Create an alert that notifies you when [Raft thread saturation](/consul/docs/agent/telemetry#raft-thread-saturation) exceeds 50%.
- Monitor [Raft replication capacity](/consul/docs/agent/telemetry#raft-replication-capacity-issues) when Consul is handling large amounts of data and high write throughput.
- Lower [`raft_multiplier`](/consul/docs/install/performance#production) to keep your Consul cluster stable. The value of `raft_multiplier` defines the scaling factor for Consul's Raft timing. The default value for `raft_multiplier` is `5`.
A short multiplier minimizes failure detection and election time, but failure detection may trigger frequently in high-latency situations. This can cause constant leadership churn and associated unavailability. A high multiplier reduces the chances that spurious failures will cause leadership churn, but at the expense of taking longer to detect real failures and therefore longer to restore Consul cluster availability.
Wide networks with higher latency will perform better with larger `raft_multiplier` values.
Raft uses BoltDB for storing data and maintaining its own state. Refer to the [Bolt DB performance metrics](/consul/docs/agent/telemetry#bolt-db-performance) when you are troubleshooting Raft performance issues.
## Consul data plane monitoring
The data plane of Consul consists of Consul clients or [Connect proxies](/consul/docs/connect/proxies) interacting with each other through service-to-service communication. Service-to-service traffic always stays within the data plane, while the control plane only enforces traffic rules. Monitoring service-to-service communication is important but may become extremely complex in an enterprise setup with multiple services communicating with each other across federated Consul clusters through mesh, ingress, and terminating gateways.
### Service monitoring
You can extract the following service-related information:
- Use the [`catalog`](/consul/commands/catalog) command or the Consul UI to query all registered services in a Consul datacenter.
- Use the [`/agent/service/:service_id`](/consul/api-docs/agent/service#get-service-configuration) API endpoint to query individual services. Connect proxies use this endpoint to discover embedded configuration.
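As a minimal sketch of querying an individual service, the following Python example calls the `/agent/service/:service_id` endpoint using only the standard library. It assumes the local agent's HTTP API is reachable at `http://localhost:8500` under the `/v1` prefix and that no ACL token is required. The helper names are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.parse import quote
from urllib.request import urlopen

# Assumed address of the local Consul agent's HTTP API.
DEFAULT_AGENT = "http://localhost:8500"


def service_url(service_id: str, base: str = DEFAULT_AGENT) -> str:
    """Build the URL for the agent's service configuration endpoint."""
    return f"{base}/v1/agent/service/{quote(service_id, safe='')}"


def get_service(service_id: str, base: str = DEFAULT_AGENT, timeout: float = 5.0) -> Dict[str, Any]:
    """Fetch and decode the JSON definition of one registered service."""
    with urlopen(service_url(service_id, base), timeout=timeout) as response:
        return json.load(response)


# Example usage against a running agent (not executed here):
#   definition = get_service("web")   # "web" is a hypothetical service ID
#   print(definition["Service"], definition["Port"])
```

If your datacenter enforces ACLs, you would also need to send a token with the request, which this sketch omits.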
### Proxy monitoring
Envoy is the supported Connect proxy for Consul service mesh. For virtual machines (VMs), Envoy starts as a sidecar service process. For Kubernetes, Envoy starts as a sidecar container in a Kubernetes service pod.
Refer to the [Supported Envoy versions](/consul/docs/connect/proxies/envoy#supported-versions) documentation to find the compatible Envoy versions for your version of Consul.
For troubleshooting service mesh issues, set Consul logs to `trace` or `debug`. The following example annotation sets Envoy logging to `debug`.
```yaml
annotations:
  consul.hashicorp.com/envoy-extra-args: '--log-level debug --disable-hot-restart'
```
Refer to the [Enable logging on Envoy sidecar pods](/consul/docs/k8s/annotations-and-labels#consul-hashicorp-com-envoy-extra-args) documentation for more information.
#### Envoy Admin Interface
To troubleshoot service-to-service communication issues, monitor Envoy host statistics. Envoy exposes a local administration interface that can be used to query and modify different aspects of the server on port `19000` by default. Envoy also exposes a public listener port to receive mTLS connections from other proxies in the mesh on port `20000` by default.
All endpoints exposed by Envoy are available at the node running Envoy on port `19000`. The node can be either a pod in Kubernetes or a VM running Consul service mesh. For example, if you forward the Envoy port to your local machine, you can access the Envoy admin interface at `http://localhost:19000/`.
The following Envoy admin interface endpoints are particularly useful:
- The `/listeners` endpoint lists all listeners running on `localhost`. This allows you to confirm whether the upstream services are binding correctly to Envoy.
```shell-session
$ curl http://localhost:19000/listeners
public_listener:192.168.19.168:20000::192.168.19.168:20000
Outbound_listener:127.0.0.1:15001::127.0.0.1:15001
```
- The `/clusters` endpoint displays information about the xDS clusters, such as service requests and mTLS related data. The following example shows a truncated output.
```shell-session
$ curl http://localhost:19000/clusters
local_app::observability_name::local_app
local_app::default_priority::max_connections::1024
local_app::default_priority::max_pending_requests::1024
local_app::default_priority::max_requests::1024
local_app::default_priority::max_retries::3
local_app::high_priority::max_connections::1024
local_app::high_priority::max_pending_requests::1024
local_app::high_priority::max_requests::1024
local_app::high_priority::max_retries::3
local_app::added_via_api::true
## ...
```
Visit the main admin interface (`http://localhost:19000`) to find the full list of available Envoy admin endpoints. Refer to the [Envoy docs](https://www.envoyproxy.io/docs/envoy/latest/operations/admin) for more information.
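The `/clusters` output uses `::` as a field separator, so it is straightforward to load into a structure for ad hoc checks. The following Python sketch is a hypothetical helper, not part of Consul or Envoy. It assumes each line has the form `cluster::key[::subkey...]::value`, as in the truncated example output.

```python
from typing import Dict


def parse_clusters(text: str) -> Dict[str, Dict[str, str]]:
    """Group `cluster::key...::value` lines from /clusters by cluster name."""
    clusters: Dict[str, Dict[str, str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, truncation markers, and anything without fields.
        if not line or line.startswith("#") or "::" not in line:
            continue
        *path, value = line.split("::")
        cluster, keys = path[0], path[1:]
        clusters.setdefault(cluster, {})["::".join(keys)] = value
    return clusters


sample = """local_app::default_priority::max_connections::1024
local_app::default_priority::max_retries::3
local_app::added_via_api::true
"""
limits = parse_clusters(sample)["local_app"]
print(limits["default_priority::max_connections"])
```

You could feed this helper the body returned by `curl http://localhost:19000/clusters` to compare circuit breaker limits across proxies.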
## Next steps
In this guide, you learned recommendations for monitoring your Consul control and data plane.
To learn about monitoring the Consul host and instance resources, visit our [Monitoring best practices](/well-architected-framework/reliability/reliability-monitoring-service-to-service-communication-with-envoy) documentation.


@ -0,0 +1,212 @@
---
layout: docs
page_title: Monitoring service-to-service communication with Envoy
description: >-
  Learn to monitor the appropriate metrics when using Envoy proxy.
---
# Monitoring service-to-service communication with Envoy
When running a service mesh with Envoy as the proxy, a wide array of metrics is produced from traffic flowing through the data plane. This document covers a set of scenarios, key baseline metrics, and potential alerts that help you maintain the overall health and resilience of the mesh for HTTP services. In addition, it provides examples of using these metrics to generate a Grafana dashboard with a Prometheus backend so that you can better understand how the metrics behave.
When collecting metrics, it is important to establish a baseline. This baseline ensures your Consul deployment is healthy, and serves as a reference point when troubleshooting abnormal cluster behavior. Once you have established a baseline for your metrics, use them and the following recommendations to configure reasonable alerts for your Consul agent.
<Note>
The following examples assume that the operator adds the cluster name (that is, the datacenter) using the label `cluster` and the node name (that is, the machine or pod) using the label `node` to all scrape targets.
</Note>
## General scenarios
### Is Envoy's configuration growing stale?
When Envoy connects to the Consul control plane over xDS, it will rapidly converge to the current configuration that the control plane expects it to have. If the xDS stream terminates and does not reconnect for an extended period, then the xDS configuration currently in the Envoy instances will “fail static” and slowly grow out of date.
#### Metric
`envoy_control_plane_connected_state`
#### Alerting
Alert if the value for a given node, pod, or machine is `0` for an extended period of time.
#### Example dashboard (table)
```
group(last_over_time(envoy_control_plane_connected_state{cluster="$cluster"}[1m] ) == 0) by (node)
```
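To make "an extended period" concrete, the following Python sketch checks whether the most recent samples of `envoy_control_plane_connected_state` for one node have stayed at `0` for at least a given duration. The function name and the 300-second default are assumptions for illustration. Choose a duration based on your own baseline.

```python
from typing import Sequence, Tuple


def disconnected_too_long(
    samples: Sequence[Tuple[float, float]],
    max_seconds: float = 300.0,  # assumed duration, not a Consul default
) -> bool:
    """samples holds time-ordered (unix_timestamp, value) pairs for one node."""
    zero_since = None
    for timestamp, value in samples:
        if value == 0:
            # Remember when the current run of zeros started.
            if zero_since is None:
                zero_since = timestamp
        else:
            # Any reconnect resets the run.
            zero_since = None
    if zero_since is None:
        return False
    return samples[-1][0] - zero_since >= max_seconds
```

A brief disconnect followed by a reconnect resets the timer, so only a sustained outage triggers the condition.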
## Inbound traffic scenarios
### Is this service being sent requests?
Within a mesh, a request travels from one service to another. You may choose to measure many relevant metrics from the calling-side, the serving-side, or both.
It is useful to track the perceived request rate from the calling-side because it includes all requests, even those that fail to arrive at the serving-side.
Any measurement of the request rate is also generally useful for capacity planning purposes as increased traffic typically correlates with a need for a scale-up event in the near future.
#### Metric
`envoy_cluster_upstream_rq_total`
#### Alerting
If the value changes significantly, check whether services are interacting properly with each other and whether you need to increase your Consul agent resource requirements.
#### Example dashboard (plot; rate)
```
sum(irate(envoy_cluster_upstream_rq_total{consul_destination_datacenter="$cluster",
consul_destination_service="$service"}[1m])) by (cluster, local_cluster)
```
### Are requests sent to this service mostly successful?
A service mesh is about communication between services, so it is important to track the perceived success rate of requests witnessed by the calling services.
#### Metric
`envoy_cluster_upstream_rq_xx`
#### Alerting
Alert if the value crosses a user-defined baseline.
#### Example dashboard (plot; %)
```
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",consul_destination_datacenter="$cluster",consul_destination_service="$service"}[1m])) by (cluster, local_cluster) / sum(irate(envoy_cluster_upstream_rq_xx{consul_destination_datacenter="$cluster",consul_destination_service="$service"}[1m])) by (cluster, local_cluster)
```
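The query divides the rate of non-5xx responses by the rate of all responses. The following Python sketch reproduces that arithmetic for a single service, using hypothetical per-class request rates keyed by the `envoy_response_code_class` label value.

```python
from typing import Mapping, Optional


def success_ratio(rate_by_class: Mapping[str, float]) -> Optional[float]:
    """Return the share of requests whose response code class is not 5xx."""
    total = sum(rate_by_class.values())
    if total == 0:
        # No traffic in the window, so the ratio is undefined.
        return None
    successful = sum(rate for cls, rate in rate_by_class.items() if cls != "5")
    return successful / total


# Hypothetical rates in requests per second for classes 2xx, 4xx, and 5xx.
rates = {"2": 950.0, "4": 30.0, "5": 20.0}
print(success_ratio(rates))
```

Note that this definition counts 4xx responses as successful from the mesh's point of view, matching the `envoy_response_code_class!="5"` selector in the query.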
### Are requests sent to this service handled in a timely manner?
If you undersize your infrastructure from a resource perspective, then you may expect a decline in response speed over time. You can track this by plotting the 95th percentile of the latency as experienced by the clients.
#### Metric
`envoy_cluster_upstream_rq_time_bucket`
#### Alerting
Alert if the value crosses a user-defined baseline.
#### Example dashboard (plot; value)
```
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{consul_destination_datacenter="$cluster",consul_destination_service="$service",local_cluster!=""}[1m])) by (le, cluster, local_cluster))
```
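To build intuition for how a percentile is estimated from histogram buckets, the following Python sketch interpolates linearly inside the bucket that contains the requested rank. It is a simplified approximation of the idea behind `histogram_quantile`, not Prometheus's implementation. The bucket bounds and counts in the usage comment are hypothetical.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The list must include the (math.inf, total_count) bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return ordered[-2][0]
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable when the +Inf bucket holds the total


# Hypothetical latency buckets in milliseconds:
#   histogram_quantile(0.95, [(5, 10), (10, 60), (25, 90), (50, 99), (math.inf, 100)])
```

Because the estimate depends on bucket boundaries, a 95th percentile that sits in a wide bucket is less precise than one that sits in a narrow bucket.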
### Is this service responding to requests that it receives?
Unlike the perceived request rate, which is measured from the calling-side, this is the real request rate measured on the serving-side. This serving-side metric can help clarify underlying causes of problems in the calling-side equivalent metric. Ideally, this metric should track the calling-side values roughly one-to-one.
#### Metric
`envoy_http_downstream_rq_total`
#### Alerting
Alert if the value crosses a user-defined baseline.
#### Example dashboard (plot; rate)
```
sum(irate(envoy_http_downstream_rq_total{cluster="$cluster",local_cluster="$service",envoy_http_conn_manager_prefix="public_listener"}[1m]))
```
### Are responses from this service mostly successful?
Unlike the perceived success rate of requests, which is measured from the calling-side, this is the real success rate of requests measured on the serving-side. This serving-side metric can help clarify underlying causes of problems in the calling-side equivalent metric. Ideally, this metric should track the calling-side values roughly one-to-one.
#### Metrics
- `envoy_http_downstream_rq_total`
- `envoy_http_downstream_rq_xx`
#### Alerting
Alert if the value crosses a user-defined baseline.
#### Example dashboard (plot; %)
##### Total
```
sum(increase(envoy_http_downstream_rq_total{cluster="$cluster",local_cluster="$service",envoy_http_conn_manager_prefix="public_listener"}[1m]))
```
##### By status code
```
sum(increase(envoy_http_downstream_rq_xx{cluster="$cluster",local_cluster="$service",envoy_http_conn_manager_prefix="public_listener"}[1m])) by (envoy_response_code_class)
```
## Outbound traffic scenarios
### Is this service sending traffic to its upstreams?
Similar to the real request rate for requests arriving at a service, it may be helpful to view the perceived request rate departing from a service through its upstreams.
#### Metric
`envoy_cluster_upstream_rq_total`
#### Alerting
Alert if the value crosses a user-defined threshold.
#### Example dashboard (plot; rate)
```
sum(irate(envoy_cluster_upstream_rq_total{cluster="$cluster",
local_cluster="$service",
consul_destination_target!=""}[1m])) by (consul_destination_target)
```
### Are requests from this service to its upstreams mostly successful?
Similar to the real success rate of requests arriving at a service, it is also important to track the perceived success rate of requests departing from a service through its upstreams.
#### Metric
`envoy_cluster_upstream_rq_xx`
#### Alerting
Alert if the value crosses a user-defined success threshold.
#### Example dashboard (plot; %)
```
sum(irate(envoy_cluster_upstream_rq_xx{envoy_response_code_class!="5",
cluster="$cluster",local_cluster="$service",
consul_destination_target!=""}[1m])) by (consul_destination_target) / sum(irate(envoy_cluster_upstream_rq_xx{cluster="$cluster",local_cluster="$service",consul_destination_target!=""}[1m])) by (consul_destination_target)
```
### Are requests from this service to its upstreams handled in a timely manner?
Similar to the latency of requests arriving at a service, it is useful to track the 95th percentile of the latency of requests departing from a service through its upstreams.
#### Metric
`envoy_cluster_upstream_rq_time_bucket`
#### Alerting
Alert if the value crosses a user-defined threshold.
#### Example dashboard (plot; value)
```
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{cluster="$cluster",
local_cluster="$service",consul_destination_target!=""}[1m])) by (le, consul_destination_target))
```
## Next steps
In this guide, you learned recommendations for monitoring your Envoy metrics, and why monitoring these metrics is important for your Consul deployment.
To learn about monitoring Consul components, visit our [Monitoring Consul components](/well-architected-framework/reliability/reliability-consul-monitoring-consul-components) documentation.


@ -667,6 +667,10 @@
   "title": "Access Logs",
   "path": "connect/observability/access-logs"
 },
+{
+  "title": "Monitor service-to-service communication",
+  "path": "connect/observability/service"
+},
 {
   "title": "UI Visualization",
   "path": "connect/observability/ui-visualization"
@ -1251,8 +1255,21 @@
   "path": "agent/config-entries"
 },
 {
-  "title": "Telemetry",
-  "path": "agent/telemetry"
+  "title": "Monitor Consul",
+  "routes": [
+    {
+      "title": "Agent telemetry",
+      "path": "agent/monitor/telemetry"
+    },
+    {
+      "title": "Monitor components",
+      "path": "agent/monitor/components"
+    },
+    {
+      "title": "Recommendations",
+      "path": "agent/monitor/alerts"
+    }
+  ]
 },
 {
   "title": "Sentinel",