From 9579ba4e12bf8a0d5366812117252f2e405f05b9 Mon Sep 17 00:00:00 2001
From: Siva
Date: Tue, 3 Jul 2018 10:27:01 -0400
Subject: [PATCH] Website: Added more telemetry details for raft and memberlist.

---
 website/source/docs/agent/telemetry.html.md | 43 +++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/website/source/docs/agent/telemetry.html.md b/website/source/docs/agent/telemetry.html.md
index b7356cc179..df9b843e10 100644
--- a/website/source/docs/agent/telemetry.html.md
+++ b/website/source/docs/agent/telemetry.html.md
@@ -429,6 +429,25 @@ These metrics are used to monitor the health of the Consul servers.
     <td>raft transactions / interval</td>
     <td>counter</td>
   </tr>
+  <tr>
+    <td>`consul.raft.barrier`</td>
+    <td>This metric counts the number of times the agent has started the barrier, i.e., the number of times it has
+    issued a blocking call to ensure that all pending operations that were queued have been applied to the agent's FSM.</td>
+    <td>blocks / interval</td>
+    <td>counter</td>
+  </tr>
+  <tr>
+    <td>`consul.raft.verify_leader`</td>
+    <td>This metric counts the number of times an agent checks whether it is still the leader.</td>
+    <td>checks / interval</td>
+    <td>counter</td>
+  </tr>
+  <tr>
+    <td>`consul.raft.restore`</td>
+    <td>This metric counts the number of times the restore operation has been performed by the agent. Here, restore refers to the action of raft consuming an external snapshot to restore its state.</td>
+    <td>operations invoked / interval</td>
+    <td>counter</td>
+  </tr>
   <tr>
     <td>`consul.raft.commitTime`</td>
     <td>This measures the time it takes to commit a new entry to the Raft log on the leader.</td>
@@ -705,6 +724,30 @@ These metrics give insight into the health of the cluster as a whole.
     <th>Unit</th>
     <th>Type</th>
   </tr>
+  <tr>
+    <td>`consul.memberlist.degraded.probe`</td>
+    <td>This metric counts the number of times the agent has performed failure detection on another agent at a slower probe rate. The agent uses its own health metric as an indicator to perform this action. (A low health score means that the node is healthy, and vice versa.)</td>
+    <td>probes / interval</td>
+    <td>counter</td>
+  </tr>
+  <tr>
+    <td>`consul.memberlist.degraded.timeout`</td>
+    <td>This metric counts the number of times an agent was marked as a dead node without receiving enough confirmations from a randomly selected list of agent nodes in an agent's membership.</td>
+    <td>occurrences / interval</td>
+    <td>counter</td>
+  </tr>
+  <tr>
+    <td>`consul.memberlist.msg.dead`</td>
+    <td>This metric counts the number of times an agent has marked another agent as a dead node.</td>
+    <td>messages / interval</td>
+    <td>counter</td>
+  </tr>
+  <tr>
+    <td>`consul.memberlist.health.score`</td>
+    <td>This metric emits the agent's updated health score. The score is updated whenever the agent notices a change in the responses from a set of randomly probed agents. The value ranges from 0 to 8, where the lowest value indicates that the agent is healthy, and vice versa.</td>
+    <td>score</td>
+    <td>gauge</td>
+  </tr>
   <tr>
     <td>`consul.memberlist.msg.suspect`</td>
     <td>This increments when an agent suspects another as failed when executing random probes as part of the gossip protocol. These can be an indicator of overloaded agents, network problems, or configuration errors where agents can not connect to each other on the [required ports](/docs/agent/options.html#ports).</td>