diff --git a/website/source/assets/images/grafana-screenshot.png b/website/source/assets/images/grafana-screenshot.png
new file mode 100644
index 0000000000..6cc0d4aa4e
Binary files /dev/null and b/website/source/assets/images/grafana-screenshot.png differ
diff --git a/website/source/docs/guides/monitoring-telegraf.html.md b/website/source/docs/guides/monitoring-telegraf.html.md
new file mode 100644
index 0000000000..047e1fd194
--- /dev/null
+++ b/website/source/docs/guides/monitoring-telegraf.html.md
@@ -0,0 +1,256 @@
+---
+layout: "docs"
+page_title: "Monitoring Consul with Telegraf"
+sidebar_current: "docs-guides-monitoring-telegraf"
+description: |-
+  Best practice approaches for monitoring a production Consul cluster with Telegraf
+---
+
+# Monitoring Consul with Telegraf
+
+Consul exposes a range of metrics in various formats that you can use to measure the health and stability of a cluster and to diagnose or predict potential issues.
+
+There are a number of monitoring tools and options, but for the purposes of this guide we are going to use the [telegraf_plugin][] in conjunction with the StatsD protocol supported by Consul.
+
+You can read the full breakdown of Consul's metrics in the [telemetry documentation](/docs/agent/telemetry.html).
+
+## Configuring Telegraf
+
+### Installing Telegraf
+
+Installing Telegraf is straightforward on most Linux distributions. We recommend following the [official Telegraf installation documentation][telegraf-install].
+
+### Configuring Telegraf
+
+Besides acting as a statsd agent, Telegraf can collect additional metrics about the host that the Consul agent is running on. Telegraf itself ships with a wide range of [input plugins][telegraf-input-plugins] to collect data from many sources for this purpose.
+
+We're going to enable some of the most common ones to monitor CPU, memory, disk I/O, networking, and process status, as these are useful for debugging Consul cluster issues.
+
+The `telegraf.conf` file starts with global options:
+
+```ini
+[agent]
+  interval = "10s"
+  flush_interval = "10s"
+  omit_hostname = false
+```
+
+We set the default collection interval to 10 seconds and ask Telegraf to include a `host` tag in each metric.
+
+Telegraf also allows you to set additional tags on the metrics that pass through it. In this case, we are adding tags for the server role and datacenter. We can then use these tags in Grafana to filter queries (for example, to create a dashboard showing only servers with the `consul-server` role, or only servers in the `us-east-1` datacenter).
+
+```ini
+[global_tags]
+  role = "consul-server"
+  datacenter = "us-east-1"
+```
+
+Next, we set up a statsd listener on UDP port 8125, with instructions to calculate percentile metrics and to
+parse DogStatsD-compatible tags, when they're sent:
+
+```ini
+[[inputs.statsd]]
+  protocol = "udp"
+  service_address = ":8125"
+  delete_gauges = true
+  delete_counters = true
+  delete_sets = true
+  delete_timings = true
+  percentiles = [90]
+  metric_separator = "_"
+  parse_data_dog_tags = true
+  allowed_pending_messages = 10000
+  percentile_limit = 1000
+```
+
+The full reference to all the available statsd-related options in Telegraf is [here][telegraf-statsd-input].
+
+Now we can configure inputs for things like CPU, memory, network I/O, and disk I/O. Most of them don't require any configuration, but make sure the `interfaces` list in `inputs.net` matches the interface names you see in `ifconfig`.
+
+```ini
+[[inputs.cpu]]
+  percpu = true
+  totalcpu = true
+  collect_cpu_time = false
+
+[[inputs.disk]]
+  # mount_points = ["/"]
+  # ignore_fs = ["tmpfs", "devtmpfs"]
+
+[[inputs.diskio]]
+  # devices = ["sda", "sdb"]
+  # skip_serial_number = false
+
+[[inputs.kernel]]
+  # no configuration
+
+[[inputs.linux_sysctl_fs]]
+  # no configuration
+
+[[inputs.mem]]
+  # no configuration
+
+[[inputs.net]]
+  interfaces = ["enp0s*"]
+
+[[inputs.netstat]]
+  # no configuration
+
+[[inputs.processes]]
+  # no configuration
+
+[[inputs.swap]]
+  # no configuration
+
+[[inputs.system]]
+  # no configuration
+```
+
+Another useful plugin is the [procstat][telegraf-procstat-input] plugin, which reports metrics for processes you select:
+
+```ini
+[[inputs.procstat]]
+  pattern = "(consul)"
+```
+
+Telegraf even includes a [plugin][telegraf-consul-input] that monitors the health checks associated with the Consul agent, using the Consul API to query the data.
+
+Note that this plugin does not report the telemetry itself; Consul already reports those stats using the StatsD protocol.
+
+```ini
+[[inputs.consul]]
+  address = "localhost:8500"
+  scheme = "http"
+```
+
+## Telegraf Configuration for Consul
+
+Asking Consul to send telemetry to Telegraf is as simple as adding a `telemetry` section to your agent configuration:
+
+```json
+{
+  "telemetry": {
+    "dogstatsd_addr": "localhost:8125",
+    "disable_hostname": true
+  }
+}
+```
+
+As you can see, we only need to specify two options. The `dogstatsd_addr` option specifies the hostname and port of the
+statsd daemon.
+
+Note that we specify DogStatsD format instead of plain statsd, which tells Consul to send [tags][tagging]
+with each metric. Tags can be used by Grafana to filter data on your dashboards (for example, displaying only
+the data for which `role=consul-server`). Telegraf is compatible with the DogStatsD format and allows us to add
+our own tags too.
+
+The second option tells Consul not to insert the hostname in the names of the metrics it sends to statsd, since the hostnames will be sent as tags. Without this option, the single metric `consul.raft.apply` would become multiple metrics:
+
+    consul.server1.raft.apply
+    consul.server2.raft.apply
+    consul.server3.raft.apply
+
+If you are using a different agent (e.g. Circonus, Statsite, or plain statsd), you may want to change this configuration, and you can find the configuration reference [here][consul-telemetry-config].
+
+## Visualising Telegraf Consul Metrics
+
+There are a number of ways of consuming the information from Telegraf. Generally they are visualised using a tool like [Grafana][] or [Chronograf][].
+
+Here is an example Grafana dashboard:
+
+