From 77fe08b7c9bdb81b842c66c3baa1aafe5e6d2bff Mon Sep 17 00:00:00 2001 From: Siva Date: Thu, 28 Jun 2018 22:09:15 -0400 Subject: [PATCH 1/3] Website: Added more telemetry metrics --- website/source/docs/agent/telemetry.html.md | 92 +++++++++++++++++++++ 1 file changed, 92 insertions(+) diff --git a/website/source/docs/agent/telemetry.html.md b/website/source/docs/agent/telemetry.html.md index a0a9685c36..b7356cc179 100644 --- a/website/source/docs/agent/telemetry.html.md +++ b/website/source/docs/agent/telemetry.html.md @@ -355,6 +355,62 @@ These metrics are used to monitor the health of the Consul servers. Unit Type + + `consul.raft.fsm.snapshot` + This metric measures the time taken by the FSM to record the current state for the snapshot. + ms + timer + + + + `consul.raft.fsm.apply` + This metric gives the number of logs committed since the last interval. + commit logs / interval + counter + + + + `consul.raft.fsm.restore` + This metric measures the time taken by the FSM to restore its state from a snapshot. + ms + timer + + + `consul.raft.snapshot.create` + This metric measures the time taken to initialize the snapshot process. + ms + timer + + + `consul.raft.snapshot.persist` + This metric measures the time taken to dump the current snapshot taken by the Consul agent to the disk. + ms + timer + + + `consul.raft.snapshot.takeSnapshot` + This metric measures the total time involved in taking the current snapshot (creating one and persisting it) by the Consul agent. + ms + timer + + + `consul.raft.replication.heartbeat` + This metric measures the time taken to invoke appendEntries on a peer, so that it doesn’t time out on a periodic basis. + ms + timer + + + `consul.serf.snapshot.appendLine` + This metric measures the time taken by the Consul agent to append an entry into the existing log. + ms + timer + + + `consul.serf.snapshot.compact` + This metric measures the time by the Consul agent to compact a log. 
This operation occurs only when the snapshot becomes too large enough to justify the compaction . + ms + timer + `consul.raft.state.leader` This increments whenever a Consul server becomes a leader. If there are frequent leadership changes this may be indication that the servers are overloaded and aren't meeting the soft real-time requirements for Raft, or that there are networking problems between the servers. @@ -655,6 +711,42 @@ These metrics give insight into the health of the cluster as a whole. suspect messages received / interval counter + + `consul.memberlist.gossip` + This metric gives the number of gossips (messages) broadcast to a set of randomly selected nodes. + messages / interval + counter + + + `consul.memberlist.msg_alive` + This metric counts the number of alive agents that the agent has mapped out so far, based on the message information given by the network layer. + nodes / interval + counter + + + `consul.memberlist.msg_dead` + This metric gives the number of dead agents that the agent has mapped out so far, based on the message information given by the network layer. + nodes / interval + counter + + + `consul.memberlist.msg_suspect` + This metric gives the number of suspect nodes that the agent has mapped out so far, based on the message information given by the network layer. + nodes / interval + counter + + + `consul.memberlist.probeNode` + This metric measures the time taken to perform a single round of failure detection on a selected agent. + nodes / interval + counter + + + `consul.memberlist.pushPullNode` + This metric measures the number of agents that have exchanged state with this agent. + nodes / interval + counter + `consul.serf.member.flap` Available in Consul 0.7 and later, this increments when an agent is marked dead and then recovers within a short time period. 
This can be an indicator of overloaded agents, network problems, or configuration errors where agents can not connect to each other on the [required ports](/docs/agent/options.html#ports). From 9579ba4e12bf8a0d5366812117252f2e405f05b9 Mon Sep 17 00:00:00 2001 From: Siva Date: Tue, 3 Jul 2018 10:27:01 -0400 Subject: [PATCH 2/3] Website: Added more telemetry details for raft and memberlist. --- website/source/docs/agent/telemetry.html.md | 43 +++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/website/source/docs/agent/telemetry.html.md b/website/source/docs/agent/telemetry.html.md index b7356cc179..df9b843e10 100644 --- a/website/source/docs/agent/telemetry.html.md +++ b/website/source/docs/agent/telemetry.html.md @@ -429,6 +429,25 @@ These metrics are used to monitor the health of the Consul servers. raft transactions / interval counter + + `consul.raft.barrier` + This metric counts the number of times the agent has started the barrier, i.e., the number of times it has + issued a blocking call to ensure that all pending queued operations are applied to the agent's FSM. + blocks / interval + counter + + + `consul.raft.verify_leader` + This metric counts the number of times an agent checks whether it is still the leader or not. + checks / interval + counter + + + `consul.raft.restore` + This metric counts the number of times the restore operation has been performed by the agent. Here, restore refers to the action of Raft consuming an external snapshot to restore its state. + operation invoked / interval + counter + `consul.raft.commitTime` This measures the time it takes to commit a new entry to the Raft log on the leader. @@ -705,6 +724,30 @@ These metrics give insight into the health of the cluster as a whole. Unit Type + + `consul.memberlist.degraded.probe` + This metric counts the number of times the agent has performed failure detection on another agent at a slower probe rate. 
The agent uses its own health metric as an indicator to perform this action. (A low health score means that the node is healthy, and vice versa.) + probes / interval + counter + + + `consul.memberlist.degraded.timeout` + This metric counts the number of times an agent was marked as a dead node while not getting enough confirmations from a randomly selected list of agent nodes in an agent's membership. + occurrence / interval + counter + + + `consul.memberlist.msg.dead` + This metric counts the number of times an agent has marked another agent as a dead node. + messages / interval + counter + + + `consul.memberlist.health.score` + This metric emits the agent's updated health score. This score is updated whenever the agent notices any changes in the response, from a set of randomly probed agents. This value ranges from 0-8, the lowest indicating the agent is healthy and vice versa. + score + gauge + `consul.memberlist.msg.suspect` This increments when an agent suspects another as failed when executing random probes as part of the gossip protocol. These can be an indicator of overloaded agents, network problems, or configuration errors where agents can not connect to each other on the [required ports](/docs/agent/options.html#ports). From 5d8bf053e0742f623290dabea13bdc8933db6c7f Mon Sep 17 00:00:00 2001 From: Siva Date: Tue, 3 Jul 2018 10:59:31 -0400 Subject: [PATCH 3/3] Website/Telemetry: Errata for snapshot.compact and reworded memberlist.health.score --- website/source/docs/agent/telemetry.html.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/website/source/docs/agent/telemetry.html.md b/website/source/docs/agent/telemetry.html.md index df9b843e10..4028ed9d10 100644 --- a/website/source/docs/agent/telemetry.html.md +++ b/website/source/docs/agent/telemetry.html.md @@ -407,7 +407,7 @@ These metrics are used to monitor the health of the Consul servers. 
`consul.serf.snapshot.compact` - This metric measures the time by the Consul agent to compact a log. This operation occurs only when the snapshot becomes too large enough to justify the compaction . + This metric measures the time taken by the Consul agent to compact a log. This operation occurs only when the snapshot becomes large enough to justify the compaction. ms timer @@ -744,7 +744,7 @@ These metrics give insight into the health of the cluster as a whole. `consul.memberlist.health.score` - This metric emits the agent's updated health score. This score is updated whenever the agent notices any changes in the response, from a set of randomly probed agents. This value ranges from 0-8, the lowest indicating the agent is healthy and vice versa. + This metric describes a node's perception of its own health based on how well it is meeting the soft real-time requirements of the protocol. This metric ranges from 0 to 8, where 0 indicates "totally healthy". This health score is used to scale the time between outgoing probes, and higher scores translate into longer probing intervals. For more details see section IV of the Lifeguard paper: https://arxiv.org/pdf/1707.00788.pdf score gauge