Igor Sirotin 987a9e8707
feat_: add Sentry panic reporting (#6054)
* feat_: report panics to sentry

* test_: sentry options, params and utils

* feat_: toggle sentry with centralized metrics

* test_: sentry init, report and close

* refactor_: rename public api to generic

* docs_: sentry

* fix_: typo in internal/sentry/README.md

Co-authored-by: osmaczko <33099791+osmaczko@users.noreply.github.com>

* fix_: linter

---------

Co-authored-by: osmaczko <33099791+osmaczko@users.noreply.github.com>
2024-11-25 12:13:47 +00:00

12 KiB

Description

This package encapsulates Sentry integration. So far:

  • only for status-go (including when running as part of desktop and mobile)
  • only for panics (no other error reporting)

Sentry is only enabled for users that both:

  1. Opted-in for metrics
  2. Use builds from our release CI

🛬 Where

We use self-hosted Sentry: https://sentry.infra.status.im/

🕐 When

Which panics are reported:

  • When running inside status-desktop/status-mobile:
    • during API calls in /mobile/status.go
    • inside all goroutines
  • When running status-backend:
    • any panic

Which panics are NOT reported:

  • When running inside status-desktop/status-mobile:
    • during API calls in /services/**/api.go
      NOTE: These endpoints are executed through go-ethereum's JSON-RPC server, which internally recovers all panics and doesn't provide any events or option to set an interceptor. The only way to catch these panics is to replace the JSON-RPC server implementation.
  • When running status-go unit tests:
    • any panic
      NOTE: Go internally runs tests in a goroutine. The only way to catch these panics in tests is to manually defer sentry.Recover() in each test. This also requires a linter (similar to lint-panics) that checks this call is present.
      This is not a priority right now, because:
      1. We have direct access to failed tests logs, which contain the panic stacktrace.
      2. Panics are not expected to happen. Test must be passing to be able to merge the code. So it's only possible with a flaky test.
      3. This would only apply to nightly/develop jobs, as we don't gather panic reports from PR-level jobs.

📦 What

Full list can be found in sentry.Event.

Notes regarding identity-disclosing properties:

  • ServerName - completely removed from all events

  • Stacktrace:

    • No private user paths are exposed, as we only gather reports from CI-built binaries.
  • TraceID - so far will be unique for each event

    Trace: A collection of spans representing the end-to-end journey of a request through your system that all share the same trace ID.

    More details in sentry docs.

Configuration

There are 2 main tags to identify the error. The configuration is a bit complicated, but provides full information.

Parameters

Environment

Defining question Where it is running?
Set time - production can only be set at build time to prevent users from hacking the environment
- All others can be set at runtime, because on CI we sometimes use same build for multiple environments
Expected values
ValueDescription
productionEnd user machine
developmentDeveloper machine
ci-prPR-level CI runs
ci-mainCI runs for stable branch
ci-nightlyCI nightly jobs on stable branch
development and ci-pr are dropped, because we only want to consider panics from stable code

Context

Defining question What is the executable for the library?
Set time Always at build-time
Expected values
ValueRunning as...
status-desktopLibrary embedded into status-desktop
status-mobileLibrary embedded into status-mobile
status-backendPart of cmd/status-backend
Can be other cmd/* as well.
matterbridgePart of Status/Discord bridge app
status-go-testsInside status-go tests

Environment variables

To cover these requirements, I added these environment variables:

Environment variable Provide time Description
SENTRY_DSN - At build time with direct call to sentry.Init
- At runtime with InitializeApplication endpoint
Sentry DSN to be used
SENTRY_CONTEXT_NAME
SENTRY_CONTEXT_VERSION
Build time Execution context of status-go
SENTRY_PRODUCTION Build time When true or 1:
-Defines if this is a production build
-Sets environment to production
-Has precedence over runtime SENTRY_ENVIRONMENT
SENTRY_ENVIRONMENT Run time Sets the environment. Has no effect when SENTRY_PRODUCTION is set

Client integration

  1. Set SENTRY_CONTEXT_NAME and SENTRY_CONTEXT_VERSION at status-go build time

  2. Provide sentryDSN to the InitializeApplication call. DSN must be kept private and will be provided by CI. Expect a STATUS_GO_SENTRY_DSN environment variable to be provided by CI.

    Why can't we consume `STATUS_GO_SENTRY_DSN` directly in status-go build?
     In theory, we could. But this would require us to mix approaches of getting the env variable to the code.
     Right now we prefer `go:generate + go:embed` approach (e.g. https://github.com/status-im/status-go/pull/6014), but we can't do it in this case, because we must not write the DSN to a local file, which would be a bit vulnerable. And I don't want to go back to `-ldflags="-X github.com/status-im/status-go/internal/sentry.sentryDSN=$(STATUS_GO_SENTRY_DSN:v%=%)` approach.
    

Implementation details