# nim-metrics
[![CI](https://github.com/status-im/nim-metrics/actions/workflows/ci.yml/badge.svg)](https://github.com/status-im/nim-metrics/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![License: Apache](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
![Stability: experimental](https://img.shields.io/badge/stability-experimental-orange.svg)
## Introduction
Nim metrics client library supporting the [Prometheus](https://prometheus.io/)
monitoring toolkit, [StatsD](https://github.com/statsd/statsd/wiki) and
[Carbon](https://graphite.readthedocs.io/en/latest/feeding-carbon.html).
Designed to be thread-safe and efficient, it's disabled by default so libraries
can use it without any overhead for those library users not interested in
metrics.
## Installation
You can install the development version of the library through Nimble with the
following command:
```
nimble install https://github.com/status-im/nim-metrics@#master
```
## Usage
To enable metrics, compile your code with `-d:metrics --threads:on`.
To avoid depending on PCRE, compile with `-d:withoutPCRE`.
## Architectural overview
`Collector` objects holding various `Metric` objects are registered in one or
more `Registry` objects. There is a default registry being used for the most
common case.

Metric values are `float64`, but the API also accepts `int64` parameters which
are then cast to `float64`.

By starting an HTTP server, custom metrics (and some default ones) can be
pulled by Prometheus. By specifying backends, those same custom metrics will be
pushed to StatsD or Carbon servers, as soon as they are modified. They can also
be serialised to strings for some quick and dirty logging. Integration with the
[Chronicles](https://github.com/status-im/nim-chronicles) logging library is
available in a separate module.

The HTTP server used for pulling runs in its own thread. Metric pushing
also uses a dedicated thread for networking, in order to minimise the overhead.
## Collector types
### Counter
A counter's value can only be incremented.
```nim
# Declare a variable `myCounter` holding a `Counter` object with a `Metric`
# having the same name as the variable. The help string is mandatory. The initial
# value is 0 and it's automatically added to `defaultRegistry`.
declareCounter myCounter, "an example counter"
# increment it by 1
myCounter.inc()
# increment it by 10
myCounter.inc(10)
# count all exceptions in a block
someCounter.countExceptions:
  foo()
# or just an exception type
otherCounter.countExceptions(ValueError):
  bar()
# do you need a variable that's being exported from the module?
declarePublicCounter seenPeers, "number of seen peers"
# it's the equivalent of `var seenPeers* = ...`
# want to avoid declaring a variable, giving it a help string, or anything else for that matter?
counter("one_off_counter").inc()
# What this does is generate a {.global.} var, so as long as you use the same
# string, you're using the same counter. Using strings instead of identifiers
# skips any compiler protection in case of typos, so this API is not recommended
# for serious use.
```
### Gauge
Gauges can be incremented, decremented or set to a given value.
```nim
declareGauge myGauge, "an example gauge" # or `declarePublicGauge` to export it
myGauge.inc(4.5)
myGauge.dec(2)
myGauge.set(10)
myGauge.setToCurrentTime() # Unix timestamp in seconds
myGauge.trackInProgress:
  # myGauge is incremented at the start of the block (a `myGauge.inc()` is being inserted here)
  foo()
  # and decremented at the end (`myGauge.dec()`)
# set the gauge to the runtime of a block, in seconds
myGauge.time:
  bar()
# alternative, unrecommended API
gauge("one_off_gauge").set(42)
```
### Summary
Summaries sample observations and provide a total count and the sum of all observed values.
```nim
declareSummary mySummary, "an example summary" # or `declarePublicSummary` to export it
mySummary.observe(10)
mySummary.observe(0.5)
echo mySummary
```
This will print out:
```text
# HELP mySummary an example summary
# TYPE mySummary summary
mySummary_sum 10.5 1569332171696
mySummary_count 2.0 1569332171696
mySummary_created 1569332171.0
```
```nim
# observe the execution duration of a block, in seconds
mySummary.time:
  foo()
# alternative, unrecommended API
summary("one_off_summary").observe(10)
```
### Histogram
These cumulative histograms store the count and total sum of observed values,
just like summaries. Furthermore, they place the observed values in
configurable buckets and provide per-bucket counts.
Note that an observed value is counted in every bucket whose upper bound is greater than or equal to it.
```nim
declareHistogram myHistogram, "an example histogram" # or `declarePublicHistogram` to export it
# This uses the default bucket sizes: [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0,
# 2.5, 5.0, 7.5, 10.0, Inf]
# You can customise the buckets:
declareHistogram withCustomBuckets, "custom buckets", buckets = [0.0, 1.0, 2.0]
# if you leave out the "Inf" bucket, it's added for you
withCustomBuckets.observe(0.5)
withCustomBuckets.observe(1)
withCustomBuckets.observe(1.5)
withCustomBuckets.observe(3.7)
echo withCustomBuckets
```
This will print out:
```text
# HELP withCustomBuckets custom buckets
# TYPE withCustomBuckets histogram
withCustomBuckets_sum 6.7 1569334493506
withCustomBuckets_count 4.0 1569334493506
withCustomBuckets_created 1569334493.0
withCustomBuckets_bucket{le="0.0"} 0.0
withCustomBuckets_bucket{le="1.0"} 2.0 1569334493506
withCustomBuckets_bucket{le="2.0"} 3.0 1569334493506
withCustomBuckets_bucket{le="+Inf"} 4.0 1569334493506
```
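The per-bucket numbers above follow from cumulative counting, which can be reproduced without the library. The following is a minimal standalone sketch, not nim-metrics' implementation; `cumulativeCounts` is a hypothetical helper written only for this illustration:

```nim
# Standalone illustration of cumulative bucket counting. It does not use
# nim-metrics; `cumulativeCounts` is a made-up helper.
proc cumulativeCounts(bounds, observations: openArray[float]): seq[int] =
  ## For each upper bound, count the observations that are <= that bound.
  result = newSeq[int](bounds.len)
  for value in observations:
    for i, bound in bounds:
      if value <= bound:
        inc result[i]

let bounds = [0.0, 1.0, 2.0, Inf]
let counts = cumulativeCounts(bounds, [0.5, 1.0, 1.5, 3.7])

# Same per-bucket counts as in the serialised histogram above.
doAssert counts == @[0, 2, 3, 4]
for i, bound in bounds:
  echo "le=", bound, " count=", counts[i]
```

Because every bucket includes all smaller ones, the `+Inf` bucket always equals the total count.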
```nim
# observe the execution duration of a block, in seconds
myHistogram.time:
  foo()
# alternative, unrecommended API
histogram("one_off_histogram").observe(10)
```
### Custom collectors
Sometimes you need to create metrics on the fly, with a custom `collect()`
method of a custom collector type.
Let's say you have a USB-attached power meter and, for some reason, you want
to read the power consumption every time Prometheus reads your metrics:
```nim
import metrics, times

when defined(metrics):
  type PowerCollector = ref object of Gauge
  var powerCollector = PowerCollector.newCollector(name = "power_usage", help = "Instantaneous power usage - in watts.")

  method collect(collector: PowerCollector): Metrics =
    let timestamp = getTime().toMilliseconds()
    result[@[]] = @[
      Metric(
        name: "power_usage",
        value: getPowerUsage(), # your power-meter reader
        timestamp: timestamp,
      )
    ]
```
There's a bit of repetition in the collector and metric names, because we no
longer have behind-the-scenes name copying/deriving there.
You can output multiple metrics from your custom `collect()` method. It's
perfectly legal and we do that internally for our system/runtime metrics.
Try not to get creative with dynamic metric names - Prometheus has a hard time
dealing with that.
## Labels
Metric labels are supported for the Prometheus backend, as a way to add extra
dimensions corresponding to each combination of metric name and label values.
This can quickly get out of hand, as you can guess, so don't go overboard with
this feature. (See also the [relevant warnings in Prometheus' docs](https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels).)
You declare label names when defining the collector and label values each time
you update it:
```nim
declareCounter lCounter, "example counter with labels", ["foo", "bar"]
lCounter.inc(labelValues = ["1", "a"]) # the label values must be strings
lCounter.inc(labelValues = ["2", "b"])
# How many metrics are now in this collector? Two, because we used two sets of label values:
echo lCounter
```
```text
# HELP lCounter example counter with labels
# TYPE lCounter counter
lCounter_total{foo="1",bar="a"} 1.0 1569340503703
lCounter_created{foo="1",bar="a"} 1569340503.0
lCounter_total{foo="2",bar="b"} 1.0 1569340503703
lCounter_created{foo="2",bar="b"} 1569340503.0
```
(OK, there are four metrics in total, because each one gets a `*_created` buddy.)
So if you must use labels, make sure there's a finite and small number of
possible label values being set.
## Metric name and label name validation
We use Prometheus standards for that, so metric names must comply with the
`^[a-zA-Z_:][a-zA-Z0-9_:]*$` regex while label names have to comply with
`^[a-zA-Z_][a-zA-Z0-9_]*$`.
In the examples you've seen so far, all collectors declared with
`declare<CollectorType>` had more stringent naming rules, since their names were
also identifiers for Nim variables - which can't have colons in them.
To overcome this, without relying on the discouraged alternative API, use the `name` parameter:
```nim
declareCounter cCounter, "counter with colons in name", name = "foo:bar:baz"
cCounter.inc()
echo cCounter
```
```text
# HELP foo:bar:baz counter with colons in name
# TYPE foo:bar:baz counter
foo:bar:baz_total 1.0 1569341756504
foo:bar:baz_created 1569341756.0
```
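To check names before handing them to the library, the two rules can be expressed without PCRE. This is a minimal sketch under assumptions: `isValidMetricName` and `isValidLabelName` are hypothetical helpers, not part of the nim-metrics API, and they only mirror the two regexes quoted above.

```nim
# Standalone check of the two naming rules, written with character sets
# instead of regexes. The helper names are made up for this example.
const
  firstChars = {'a'..'z', 'A'..'Z', '_'}
  otherChars = firstChars + {'0'..'9'}

proc matches(name: string, first, rest: set[char]): bool =
  ## True if `name` is non-empty, starts with a char from `first` and
  ## continues only with chars from `rest`.
  if name.len == 0 or name[0] notin first:
    return false
  for c in name[1 .. ^1]:
    if c notin rest:
      return false
  true

proc isValidMetricName(name: string): bool =
  # metric names may additionally contain colons
  matches(name, firstChars + {':'}, otherChars + {':'})

proc isValidLabelName(name: string): bool =
  matches(name, firstChars, otherChars)

doAssert isValidMetricName("foo:bar:baz")
doAssert not isValidMetricName("9lives")
doAssert isValidLabelName("thread_id")
doAssert not isValidLabelName("foo:bar")
```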
## Logging
Metrics are not logs, but you might want to log them nonetheless. The `$`
procedure is defined for collectors and registries, so you can just use the
built-in string serialisation to print them:
```nim
echo myCounter, myGauge
echo defaultRegistry
```
Integration with [Chronicles](https://github.com/status-im/nim-chronicles) is available in a separate module:
```nim
import chronicles, metrics, metrics/chronicles_support
# ...
info "myCounter", myCounter
debug "default registry", defaultRegistry
```
## Testing
When testing, you might want to isolate some collectors by registering them
into a custom registry:
```nim
var myRegistry = newRegistry()
declareCounter myCounter, "help", registry = myRegistry
echo myRegistry
# this means that `myCounter` is no longer registered in `defaultRegistry`
echo defaultRegistry
```
These unoptimised (read "very inefficient") `value()` and `valueByName()`
procedures for accessing metric values should only be used inside test suites:
```nim
suite "counter":
  test "basic":
    declareCounter myCounter, "help"
    check myCounter.value == 0
    myCounter.inc()
    check myCounter.value == 1

    declareSummary cSummary, "summary with colons in name", name = "foo:bar:baz"
    cSummary.observe(10)
    check cSummary.valueByName("foo:bar:baz_count") == 1
    check cSummary.valueByName("foo:bar:baz_sum") == 10
```
## Prometheus endpoint
First, you need to choose the HTTP server implementation.
### Standard library
Using [asynchttpserver](https://nim-lang.org/docs/asynchttpserver.html) which is based on [asyncdispatch](https://nim-lang.org/docs/asyncdispatch.html) from the Nim standard library:
```nim
import metrics, metrics/stdlib_httpserver
```
### Chronos
Using [Chronos](https://github.com/status-im/nim-chronos/) - an asyncdispatch alternative:
```nim
import metrics, metrics/chronos_httpserver
```
### Starting the HTTP server
Start an HTTP server listening on `127.0.0.1:8000` from which the Prometheus
daemon can pull the metrics from all collectors in `defaultRegistry` (plus the
default metrics):
```nim
startMetricsHttpServer()
```
Or set your own address and port to listen to:
```nim
import net
startMetricsHttpServer("127.0.0.1", Port(8000))
```
The HTTP server will run in its own thread. It will expose two endpoints:
* http://127.0.0.1:8000/metrics - Returns the metrics consumed by Prometheus.
* http://127.0.0.1:8000/health - Health check that returns the string `OK` with a 200 status code.
### System metrics
Default metrics available (see also [the relevant Prometheus docs](https://prometheus.io/docs/instrumenting/writing_clientlibs/#standard-and-runtime-collectors)):
```text
process_cpu_seconds_total
process_open_fds
process_max_fds
process_virtual_memory_bytes
process_resident_memory_bytes
process_start_time_seconds
nim_gc_mem_bytes[thread_id]
nim_gc_mem_occupied_bytes[thread_id]
nim_gc_heap_instance_occupied_bytes[type_name]
nim_gc_heap_instance_occupied_summed_bytes
```
The `process_*` metrics are only available on Linux, for now.

`nim_gc_heap_instance_occupied_bytes` is only available when compiling with
`-d:nimTypeNames` and holds the top 10 instance types, in reverse order of
their total heap usage (from all threads), at the time the metric is created.
Since this set changes with time, you'll see more than 10 types in Grafana.

The thread-specific metrics are updated automatically when a user-defined metric
is changed in the main thread, but only if a minimal interval has passed since
the last update (defaults to 10 seconds). All other system metrics are custom
collectors which are updated at collection time.
```nim
import times
when defined(metrics):
  # get the default minimal update interval
  echo getSystemMetricsUpdateInterval()
  # you can change it
  setSystemMetricsUpdateInterval(initDuration(seconds = 2))
```
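The "minimal interval" rule is a simple throttle. The sketch below shows the idea in isolation; it is not the library's code, and the `Throttle` type and `shouldUpdate` proc are hypothetical names introduced for this example:

```nim
import std/times

# Hypothetical sketch of interval-based throttling, not nim-metrics' code.
type Throttle = object
  interval: Duration
  lastUpdate: Time # defaults to the Unix epoch, so the first call passes

proc shouldUpdate(t: var Throttle, now: Time): bool =
  ## True (and records `now`) only if at least `interval` has passed
  ## since the last accepted update.
  if now - t.lastUpdate >= t.interval:
    t.lastUpdate = now
    return true
  false

var throttle = Throttle(interval: initDuration(seconds = 10))
let start = getTime()
doAssert throttle.shouldUpdate(start)
doAssert not throttle.shouldUpdate(start + initDuration(seconds = 3))
doAssert throttle.shouldUpdate(start + initDuration(seconds = 10))
```

With this kind of throttle, updates only happen when something triggers the check, which is why the spacing between updates is irregular.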
You can also disable this automated piggy-backing on user-defined metric value
changes, if you need more regularity, and take charge of updating system
metrics yourself.
```nim
# disable automatic updates
setSystemMetricsAutomaticUpdate(false)
# somewhere in your event loop, at an interval of your choice
updateThreadMetrics()
```
The metrics with a `thread_id` label are thread-specific. The
automatic update only covers thread metrics for the main thread. You'll have to
call `updateThreadMetrics()` yourself for any other thread you care about.
Screenshot of [Grafana showing data from Prometheus that pulls it from Nimbus which uses nim-metrics](https://github.com/status-im/nimbus-eth1/#metric-visualisation):
![Grafana screenshot](https://i.imgur.com/AdtavDA.png)
## StatsD
Add a [StatsD](https://github.com/statsd/statsd/wiki) export backend where
metric updates will be pushed as soon as they are created:
```nim
import metrics, net
when defined(metrics):
  addExportBackend(
    metricProtocol = STATSD,
    netProtocol = UDP,
    address = "127.0.0.1",
    port = Port(8125)
  )

declareCounter myCounter, "some counter"
myCounter.inc()
# When we incremented the counter, the corresponding data was sent over the wire to the StatsD daemon.
```
The only supported collector types are counters and gauges. There's a dedicated
thread that does the networking part. When you update these collectors, data is
sent over a channel to that thread. If the channel's buffer is full, the data
is silently dropped. Same for an unreachable backend or any other networking
error. Reconnections are tried automatically and there's one socket per backend
being reused.
All the complexity is hidden from the API user and additional latency is kept
to a minimum. Exported metrics are treated like disposable data and dropped at
the first sign of trouble.
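The drop-when-full behaviour can be pictured with a bounded channel from Nim's standard library. This is a sketch under assumptions, not the library's code: the capacity of 2 and the message strings are made up, and it must be compiled with `--threads:on`.

```nim
# Hypothetical sketch: a bounded channel whose producer never blocks.
# Compile with --threads:on.
var updates: Channel[string]
updates.open(maxItems = 2)

var dropped = 0
for line in ["a:1|c", "b:1|c", "c:1|c"]:
  # trySend returns false instead of blocking when the buffer is full,
  # so the caller can discard the update and move on
  if not updates.trySend(line):
    inc dropped

# Nothing is consuming the channel here, so the third message is dropped.
doAssert dropped == 1
echo "dropped ", dropped, " update(s)"
updates.close()
```

The trade-off is deliberate: the code updating a metric never waits on the network, at the cost of losing data points under pressure.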
Counters support an additional parameter just for StatsD: `sampleRate`. This
allows sending just a percentage of the increments to the StatsD daemon.
Nothing else changes on the client side.
```nim
declareCounter sCounter, "counter with a sample rate set", sampleRate = 0.1
sCounter.inc()
# Now only 10% (on average) of this counter's updates will be sent over the
# wire. We roll a die when the time comes, using a simple PRNG. We also
# inform the StatsD daemon about this rate, so it can adjust its estimated value
# accordingly.
```
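For intuition, here is a standalone sketch of client-side sampling. It is not the library's implementation: `sampledCounterLine` is a hypothetical helper, and the line layout (`name:value|c|@rate`) follows the usual StatsD counter format.

```nim
import std/random

# Hypothetical helper, not the nim-metrics API: decide whether to send an
# increment and, if so, build the StatsD counter line for it.
proc sampledCounterLine(rng: var Rand, name: string, value: float,
                        sampleRate: float): string =
  ## Returns "" when the update is skipped.
  if sampleRate >= 1.0:
    return name & ":" & $value & "|c"
  if rng.rand(1.0) < sampleRate:
    # the rate is sent along so the daemon can scale the value back up
    return name & ":" & $value & "|c|@" & $sampleRate
  ""

var rng = initRand(42)
var sent = 0
for _ in 1 .. 10_000:
  if sampledCounterLine(rng, "sCounter", 1.0, 0.1).len > 0:
    inc sent
echo "sent ", sent, " of 10000 updates"
```

Roughly one in ten updates produces a line, which is what keeps the network traffic down for very hot counters.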
## Carbon
Add a [Carbon](https://graphite.readthedocs.io/en/latest/feeding-carbon.html)
export backend where metric updates will be pushed as soon as they are created:
```nim
import metrics, net
when defined(metrics):
  addExportBackend(
    metricProtocol = CARBON,
    netProtocol = TCP,
    address = "127.0.0.1",
    port = Port(2003)
  )
```
The implementation is very similar to the StatsD metric exporting described above.
You may add as many export backends as you want, but deleting them from the
`exportBackends` global variable is unsupported.
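For reference, Carbon's plaintext protocol expects one `<path> <value> <timestamp>` line per data point, with the timestamp in Unix seconds. The sketch below only illustrates that wire format; `carbonLine` is a hypothetical helper, and whether nim-metrics formats its lines exactly like this is an assumption.

```nim
import std/times

# Hypothetical helper, not the nim-metrics API: format one line of Carbon's
# plaintext protocol, "<path> <value> <unix-timestamp-in-seconds>\n".
proc carbonLine(path: string, value: float, timestamp: int64): string =
  path & " " & $value & " " & $timestamp & "\n"

stdout.write carbonLine("myCounter", 1.0, getTime().toUnix())
```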
## Contributing
When submitting pull requests, please add test cases for any new features or
fixes and make sure `nimble test` is still able to execute the entire test
suite successfully.
## License
Licensed and distributed under either of
* MIT license: [LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT
* Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
at your option. These files may not be copied, modified, or distributed except according to those terms.