a library to curb OOMs by running Go GC according to a user-defined policy.
Raúl Kripalani 31d951f370 implement automatic heapdumps when usage is above threshold.
A heapdump will be captured when the usage crosses the threshold.
Staying above the threshold won't trigger another heapdump.
If the usage goes down, then back up, that is considered another
"episode" to be captured in a heapdump.

This feature is driven by three parameters:

* HeapdumpDir: the directory where the watchdog will write the heapdump.
  It will be created if it doesn't exist upon initialization. An error when
  creating the dir will not prevent watchdog initialization; it will just
  disable the heapdump capture feature.

  If zero-valued, the feature is disabled. Heapdumps will be written to path:
  <HeapdumpDir>/<RFC3339Nano formatted timestamp>.heap.

* HeapdumpMaxCaptures: sets the maximum number of heapdumps a process will
  generate. This limits the number of episodes that will be captured, in case
  the utilization climbs repeatedly over the threshold. By default, it is 10.

* HeapdumpThreshold: sets the utilization threshold that will trigger a
  heap dump to be taken automatically. A zero value disables this feature.
  By default, it is disabled.
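
The episode semantics described above can be sketched as an edge-triggered check. This is only an illustration under stated assumptions: `episodeTracker`, `observe`, and `heapdumpPath` are hypothetical names, not identifiers from this repository, and treating "at the threshold" the same as "above the threshold" is an assumption.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// episodeTracker decides when a heapdump should be captured.
// All names here are hypothetical; they are not the library's identifiers.
type episodeTracker struct {
	threshold   float64 // utilization in (0, 1]; zero disables the feature
	maxCaptures int     // upper bound on captures for the whole process
	captures    int     // captures taken so far
	above       bool    // whether the previous observation was over the threshold
}

// observe reports whether a new heapdump should be captured for the given
// utilization (used / limit).
func (e *episodeTracker) observe(utilization float64) bool {
	if e.threshold <= 0 {
		return false // feature disabled
	}
	if utilization < e.threshold {
		e.above = false // the episode is over; the next crossing starts a new one
		return false
	}
	if e.above {
		return false // still inside the same episode
	}
	e.above = true
	if e.captures >= e.maxCaptures {
		return false // capture budget exhausted
	}
	e.captures++
	return true
}

// heapdumpPath builds <dir>/<RFC3339Nano formatted timestamp>.heap.
func heapdumpPath(dir string, t time.Time) string {
	return filepath.Join(dir, t.Format(time.RFC3339Nano)+".heap")
}

func main() {
	e := &episodeTracker{threshold: 0.9, maxCaptures: 10}
	for _, u := range []float64{0.5, 0.92, 0.95, 0.7, 0.91} {
		fmt.Printf("utilization=%.2f capture=%v\n", u, e.observe(u))
	}
	fmt.Println(heapdumpPath("/tmp/heapdumps", time.Now()))
}
```

In this sketch, the second and fifth observations start new episodes, while the third stays inside the episode opened by the second.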
2021-01-19 20:02:16 +00:00
.circleci circleci: run build on bare linux. 2021-01-18 21:33:52 +00:00
.dockerignore introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
Dockerfile.dlv introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
Dockerfile.test introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
LICENSE-APACHE initial commit. 2020-12-02 00:03:20 +00:00
LICENSE-MIT initial commit. 2020-12-02 00:03:20 +00:00
Makefile skip tests that don't work in CircleCI due to unwritable cgroup. 2021-01-18 22:15:47 +00:00
README.md introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
adaptive.go remove 'immediate' flag in policies. 2020-12-09 15:35:29 +00:00
adaptive_test.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00
doc.go introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
go.mod introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
go.sum major rewrite of go-watchdog. 2020-12-08 14:19:04 +00:00
log.go major rewrite of go-watchdog. 2020-12-08 14:19:04 +00:00
watchdog.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00
watchdog_linux.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00
watchdog_linux_test.go skip tests that don't work in CircleCI due to unwritable cgroup. 2021-01-18 22:15:47 +00:00
watchdog_other.go introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
watchdog_other_test.go introduce cgroup-driven watchdog; refactor. 2021-01-18 21:24:31 +00:00
watchdog_test.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00
watermarks.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00
watermarks_test.go implement automatic heapdumps when usage is above threshold. 2021-01-19 20:02:16 +00:00

README.md

Go memory watchdog

🐺 A library to curb OOMs by running Go GC according to a user-defined policy.


Package watchdog runs a singleton memory watchdog in the process, which watches memory utilization and forces Go GC in accordance with a user-defined policy.

There are three kinds of watchdogs:

  1. heap-driven (watchdog.HeapDriven()): applies a heap limit, adjusting GOGC dynamically in accordance with the policy.
  2. system-driven (watchdog.SystemDriven()): applies a limit to the total system memory used, obtaining the current usage through elastic/go-sigar.
  3. cgroups-driven (watchdog.CgroupDriven()): discovers the memory limit from the cgroup of the process (derived from /proc/self/cgroup), or from the root cgroup path if the PID == 1 (which indicates that the process is running in a container). It uses the cgroup stats to obtain the current usage.
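
To illustrate what deriving the cgroup from /proc/self/cgroup can look like, here is a minimal sketch that finds the memory controller's path in the cgroup v1 line format (`hierarchy-ID:controller-list:cgroup-path`). The helper `memoryCgroupPath` is hypothetical; the library's own discovery logic may differ, and cgroup v2 is not handled here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// memoryCgroupPath scans the contents of /proc/self/cgroup and returns the
// path of the cgroup that has the "memory" controller attached.
// Hypothetical helper for illustration; cgroup v1 format only.
func memoryCgroupPath(procSelfCgroup string) (string, bool) {
	for _, line := range strings.Split(procSelfCgroup, "\n") {
		// Each line looks like "11:cpu,memory:/some/path".
		fields := strings.SplitN(line, ":", 3)
		if len(fields) != 3 {
			continue
		}
		for _, controller := range strings.Split(fields[1], ",") {
			if controller == "memory" {
				return fields[2], true
			}
		}
	}
	return "", false
}

func main() {
	if os.Getpid() == 1 {
		// PID 1 suggests the process runs in a container: use the root cgroup path.
		fmt.Println("memory cgroup path: /")
		return
	}
	data, err := os.ReadFile("/proc/self/cgroup")
	if err != nil {
		fmt.Println("cannot read /proc/self/cgroup:", err)
		return
	}
	if path, ok := memoryCgroupPath(string(data)); ok {
		fmt.Println("memory cgroup path:", path)
	} else {
		fmt.Println("no cgroup v1 memory controller entry found")
	}
}
```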

The watchdog's behaviour is controlled by the policy, a pluggable function that determines when to trigger GC based on the current utilization. This library ships with two policies:

  1. watermarks policy (watchdog.NewWatermarkPolicy()): runs GC at configured watermarks of memory utilisation.
  2. adaptive policy (watchdog.NewAdaptivePolicy()): runs GC when the current usage surpasses a dynamically-set threshold.
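
As a rough illustration of the watermark idea, the sketch below picks the next watermark (in bytes) above the current usage. `nextWatermark` is a hypothetical helper written for this example and is not the signature of `watchdog.NewWatermarkPolicy()`; it assumes the watermarks are fractions of the limit, sorted in ascending order.

```go
package main

import "fmt"

// nextWatermark returns the lowest watermark, in bytes, that is strictly
// above the current usage. Watermarks are fractions of limit and are
// assumed to be sorted in ascending order. Hypothetical helper.
func nextWatermark(limit uint64, watermarks []float64, used uint64) (uint64, bool) {
	for _, w := range watermarks {
		if threshold := uint64(float64(limit) * w); threshold > used {
			return threshold, true
		}
	}
	return 0, false // usage is already past the last watermark
}

func main() {
	const limit = 1 << 30 // 1 GiB
	watermarks := []float64{0.5, 0.75}
	for _, used := range []uint64{100 << 20, 600 << 20, 900 << 20} {
		if next, ok := nextWatermark(limit, watermarks, used); ok {
			fmt.Printf("used=%d MiB: GC when usage reaches %d MiB\n", used>>20, next>>20)
		} else {
			fmt.Printf("used=%d MiB: past the last watermark\n", used>>20)
		}
	}
}
```

A policy built this way would run GC each time usage reaches the returned value, then recompute the next watermark.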

You can easily write a custom policy tailored to the allocation patterns of your program.

The recommended way to set up the watchdog is as follows, in descending order of precedence. This logic assumes that the embedding application or library supports setting a heap limit through an environment variable (e.g. MYAPP_HEAP_MAX) or config key.

  1. If heap limit is set and legal, initialize a heap-driven watchdog.
  2. Otherwise, try to use the cgroup-driven watchdog. If it succeeds, return.
  3. Otherwise, try to initialize a system-driven watchdog. If it succeeds, return.
  4. Watchdog initialization failed. Log a warning to inform the user that they're flying solo.
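
The precedence above can be sketched as a small fallback chain. To keep the example self-contained, the three initializers are passed in as plain functions; in a real program they would wrap the corresponding watchdog constructors. `initFunc`, `setupWatchdog`, and `errNoWatchdog` are hypothetical names, and the stubs in `main` do not call the library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// initFunc stands in for a call that starts one kind of watchdog.
type initFunc func() error

var errNoWatchdog = errors.New("no watchdog could be initialized")

// setupWatchdog applies the precedence: heap-driven if a limit is set,
// then cgroup-driven, then system-driven.
func setupWatchdog(heapLimit uint64, heap, cgroup, system initFunc) (string, error) {
	if heapLimit > 0 {
		return "heap-driven", heap()
	}
	if err := cgroup(); err == nil {
		return "cgroup-driven", nil
	}
	if err := system(); err == nil {
		return "system-driven", nil
	}
	return "", errNoWatchdog
}

func main() {
	// A missing or malformed MYAPP_HEAP_MAX is treated as "no heap limit set".
	limit, err := strconv.ParseUint(os.Getenv("MYAPP_HEAP_MAX"), 10, 64)
	if err != nil {
		limit = 0
	}

	// Stub initializers; replace with real watchdog setup calls.
	heap := func() error { return nil }
	cgroup := func() error { return errors.New("no cgroup memory limit found") }
	system := func() error { return nil }

	kind, err := setupWatchdog(limit, heap, cgroup, system)
	if err != nil {
		fmt.Println("warning: running without a memory watchdog:", err)
		return
	}
	fmt.Println("initialized", kind, "watchdog")
}
```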

Running the tests

Given the low-level nature of this component, some tests need to run in isolation, so that they don't carry over Go runtime metrics. For completeness, this module uses a Docker image for testing, so we can simulate cgroup memory limits.

The test execution and Docker builds have been conveniently packaged in a Makefile. Run with:

$ make

Why is this even needed?

The garbage collector that ships with the Go runtime is pretty good in some regards (low latency, negligible stop-the-world pauses), but it's unsatisfactory in a number of situations that yield ill-fated outcomes:

  1. it is incapable of dealing with bursty/spiky allocations efficiently; depending on the workload, the program may OOM as a consequence of not scheduling GC in a timely manner.
  2. part of the above is due to the fact that Go doesn't concern itself with any limits. To date, it is not possible to set a maximum heap size.
  3. its default policy of scheduling GC when the heap doubles, coupled with its ignorance of system or process limits, can easily cause the program to OOM.
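
The arithmetic behind the third point can be made concrete. With a live heap of L bytes, the next collection is scheduled at roughly L * (1 + GOGC/100), so the default GOGC=100 means "when the heap doubles". The sketch below computes that target and the largest GOGC value that keeps it within a limit. Both helpers are hypothetical and simplified; this is not presented as the formula the heap-driven watchdog actually uses.

```go
package main

import "fmt"

// nextGCTarget approximates the heap size at which the next GC is scheduled,
// given the live heap after the last GC and a non-negative GOGC value.
func nextGCTarget(live uint64, gogc int) uint64 {
	return live + live*uint64(gogc)/100
}

// gogcForLimit returns the largest GOGC value whose GC target does not
// exceed limit. It returns 0 when there is no headroom left.
func gogcForLimit(live, limit uint64) int {
	if live == 0 || limit <= live {
		return 0
	}
	return int((limit - live) * 100 / live)
}

func main() {
	const (
		live  = 600 << 20 // 600 MiB live after the last GC
		limit = 1 << 30   // 1 GiB limit
	)
	fmt.Printf("GOGC=100: next GC at %d MiB (limit %d MiB)\n",
		nextGCTarget(live, 100)>>20, uint64(limit)>>20)

	gogc := gogcForLimit(live, limit)
	fmt.Printf("GOGC=%d: next GC at %d MiB\n", gogc, nextGCTarget(live, gogc)>>20)
}
```

In this example the default setting would schedule the next collection beyond the limit, which is the failure mode described above.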

For more information, check out these GitHub issues:

License

Dual-licensed: MIT, Apache Software License v2, by way of the Permissive License Stack.