Add units and types to metrics tables (#9674)

This commit adds units and types to the key metrics tables so that
all metrics tables in telemetry.mdx have a consistent view.

Fixes: https://github.com/hashicorp/consul/issues/9069
Robert Kuska 2021-03-11 04:36:15 +01:00 committed by GitHub
parent 9d924a81a9
commit 6fe45c075f
1 changed files with 26 additions and 26 deletions

@ -58,12 +58,12 @@ These are some metrics emitted that can help you understand the health of your c
### Transaction timing
| Metric Name | Description |
| :----------------------- | :----------------------------------------------------------------------------------- |
| `consul.kvs.apply` | This measures the time it takes to complete an update to the KV store. |
| `consul.txn.apply` | This measures the time spent applying a transaction operation. |
| `consul.raft.apply` | This counts the number of Raft transactions occurring over the interval. |
| `consul.raft.commitTime` | This measures the time it takes to commit a new entry to the Raft log on the leader. |
| Metric Name | Description | Unit | Type |
| :----------------------- | :----------------------------------------------------------------------------------- | :--------------------------- | :------ |
| `consul.kvs.apply` | This measures the time it takes to complete an update to the KV store. | ms | timer |
| `consul.txn.apply` | This measures the time spent applying a transaction operation. | ms | timer |
| `consul.raft.apply` | This counts the number of Raft transactions occurring over the interval. | raft transactions / interval | counter |
| `consul.raft.commitTime` | This measures the time it takes to commit a new entry to the Raft log on the leader. | ms | timer |
**Why they're important:** Taken together, these metrics indicate how long it takes to complete write operations in various parts of the Consul cluster. Generally these should all be fairly consistent and no more than a few milliseconds. Sudden changes in any of the timing values could be due to unexpected load on the Consul servers, or due to problems on the servers themselves.
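One way to act on "sudden changes in any of the timing values" is to compare the latest reading against a recent baseline. The following is a minimal sketch, not part of Consul: the function name `is_sudden_change`, the three-standard-deviation threshold, and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_sudden_change(baseline_ms, latest_ms, num_stdevs=3.0):
    """Return True if latest_ms is more than num_stdevs sample standard
    deviations above the mean of baseline_ms.

    baseline_ms: recent timer readings in milliseconds (hypothetical data,
    e.g. per-interval means of consul.raft.commitTime).
    """
    if len(baseline_ms) < 2:
        # Not enough history to estimate spread; do not flag anything.
        return False
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    return latest_ms > mu + num_stdevs * sigma


# Hypothetical readings, all "a few milliseconds" as the text expects.
baseline = [1.8, 2.1, 2.0, 1.9, 2.2]
normal_reading = is_sudden_change(baseline, 2.3)
spike_reading = is_sudden_change(baseline, 9.5)
```

A fixed threshold in milliseconds would also work; the relative check is used here only because the text asks for values that are "fairly consistent" rather than below a specific limit.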
@ -71,11 +71,11 @@ These are some metrics emitted that can help you understand the health of your c
### Leadership changes
| Metric Name | Description |
| :------------------------------- | :------------------------------------------------------------------------------------------------------------- |
| `consul.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. |
| `consul.raft.state.candidate` | This increments whenever a Consul server starts an election. |
| `consul.raft.state.leader` | This increments whenever a Consul server becomes a leader. |
| Metric Name | Description | Unit | Type |
| :------------------------------- | :------------------------------------------------------------------------------------------------------------- | :-------- | :------ |
| `consul.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease. | ms | timer |
| `consul.raft.state.candidate` | This increments whenever a Consul server starts an election. | elections | counter |
| `consul.raft.state.leader` | This increments whenever a Consul server becomes a leader. | leaders | counter |
**Why they're important:** Normally, your Consul cluster should have a stable leader. If there are frequent elections or leadership changes, it would likely indicate network issues between the Consul servers, or that the Consul servers themselves are unable to keep up with the load.
@ -83,9 +83,9 @@ These are some metrics emitted that can help you understand the health of your c
### Autopilot
| Metric Name | Description |
| :------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `consul.autopilot.healthy` | This tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. |
| Metric Name | Description | Unit | Type |
| :------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------- | :---- |
| `consul.autopilot.healthy` | This tracks the overall health of the local server cluster. If all servers are considered healthy by Autopilot, this will be set to 1. If any are unhealthy, this will be 0. | health state | gauge |
**Why it's important:** Obviously, you want your cluster to be healthy.
@ -93,10 +93,10 @@ These are some metrics emitted that can help you understand the health of your c
### Memory usage
| Metric Name | Description |
| :--------------------------- | :----------------------------------------------------------------- |
| `consul.runtime.alloc_bytes` | This measures the number of bytes allocated by the Consul process. |
| `consul.runtime.sys_bytes` | This is the total number of bytes of memory obtained from the OS. |
| Metric Name | Description | Unit | Type |
| :--------------------------- | :----------------------------------------------------------------- | :---- | :---- |
| `consul.runtime.alloc_bytes` | This measures the number of bytes allocated by the Consul process. | bytes | gauge |
| `consul.runtime.sys_bytes` | This is the total number of bytes of memory obtained from the OS. | bytes | gauge |
**Why they're important:** Consul keeps all of its data in memory. If Consul consumes all available memory, it will crash.
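Because exhausting memory crashes Consul, a monitoring check might compare `consul.runtime.sys_bytes` with the memory available on the host. The sketch below assumes the host total is known from elsewhere; the function name `memory_usage_fraction`, the 90% threshold, and the sample sizes are hypothetical.

```python
def memory_usage_fraction(sys_bytes, total_host_bytes):
    """Return the fraction of host memory that the process has obtained
    from the OS (sys_bytes / total_host_bytes)."""
    if total_host_bytes <= 0:
        raise ValueError("total_host_bytes must be positive")
    return sys_bytes / total_host_bytes


GIB = 2**30

# Hypothetical values: 6 GiB reported by consul.runtime.sys_bytes
# on a host with 8 GiB of memory.
fraction = memory_usage_fraction(6 * GIB, 8 * GIB)

# Hypothetical alert threshold; choose one that suits your environment.
should_alert = fraction > 0.9
```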
@ -104,9 +104,9 @@ These are some metrics emitted that can help you understand the health of your c
### Garbage collection
| Metric Name | Description |
| :--------------------------------- | :---------------------------------------------------------------------------------------------------- |
| `consul.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. |
| Metric Name | Description | Unit | Type |
| :--------------------------------- | :---------------------------------------------------------------------------------------------------- | :--- | :---- |
| `consul.runtime.total_gc_pause_ns` | Number of nanoseconds consumed by stop-the-world garbage collection (GC) pauses since Consul started. | ns | gauge |
**Why it's important:** GC pause is a "stop-the-world" event, meaning that all runtime threads are blocked until GC completes. Normally these pauses last only a few nanoseconds. But if memory usage is high, the Go runtime may GC so frequently that it starts to slow down Consul.
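Since `consul.runtime.total_gc_pause_ns` accumulates from the moment Consul starts, the raw value has to be turned into per-interval deltas before it says anything about current GC behavior. A minimal sketch of a non-negative difference follows; the function and the sample values are illustrative assumptions, not the implementation of any particular time-series database.

```python
def non_negative_difference(samples):
    """Return the differences between consecutive samples of a cumulative
    series, dropping negative differences (for example, when the process
    restarts and the cumulative value resets)."""
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        delta = curr - prev
        if delta >= 0:
            deltas.append(delta)
    return deltas


# Hypothetical samples of consul.runtime.total_gc_pause_ns, one per
# scrape interval. The drop to 200_000 simulates a Consul restart.
samples = [1_000_000, 1_400_000, 1_900_000, 200_000, 650_000]
pause_per_interval = non_negative_difference(samples)
```

Each resulting value is the GC pause time, in nanoseconds, spent during one interval, which is the quantity worth alerting on.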
@ -117,11 +117,11 @@ you will need to apply a function such as InfluxDB's [`non_negative_difference()
### Network activity - RPC Count
| Metric Name | Description |
| :--------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `consul.client.rpc` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server |
| `consul.client.rpc.exceeded` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server gets rate limited by that agent's [`limits`](/docs/agent/options#limits) configuration. |
| `consul.client.rpc.failed` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server and fails. |
| Metric Name | Description | Unit | Type |
| :--------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------- | :------ |
| `consul.client.rpc` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server | requests | counter |
| `consul.client.rpc.exceeded` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server gets rate limited by that agent's [`limits`](/docs/agent/options#limits) configuration. | requests | counter |
| `consul.client.rpc.failed` | Increments whenever a Consul agent in client mode makes an RPC request to a Consul server and fails. | requests | counter |
**Why they're important:** These measurements indicate the current load created from a Consul agent, including when the load becomes high enough to be rate limited. A high RPC count, especially from `consul.client.rpc.exceeded` (which means that requests are being rate limited), could imply a misconfigured Consul agent.
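To judge whether rate limiting is significant, the per-interval counts can be combined into a ratio. This sketch assumes that rate-limited requests are also counted in `consul.client.rpc`, which the tables above do not confirm; the function name and sample counts are hypothetical.

```python
def rate_limited_fraction(rpc_count, exceeded_count):
    """Return the fraction of RPC requests in an interval that were
    rate limited. Returns 0.0 when no requests were made."""
    if rpc_count <= 0:
        return 0.0
    return exceeded_count / rpc_count


# Hypothetical per-interval increments of consul.client.rpc and
# consul.client.rpc.exceeded for one agent.
fraction_limited = rate_limited_fraction(rpc_count=1000, exceeded_count=250)
```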