History

Igor Sirotin cffd2cfefb fix(sentry)_: trim production env variable (#6149 )		2024-12-02 16:32:26 +00:00
..
README.md	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
options.go	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
options_test.go	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
params.go	fix(sentry)_: trim production env variable (#6149 )	2024-12-02 16:32:26 +00:00
params_test.go	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
sentry.go	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
sentry_test.go	feat_: add Sentry panic reporting (#6054 )	2024-11-25 12:13:47 +00:00
stacktrace.go	fix(sentry)_: trim production env variable (#6149 )	2024-12-02 16:32:26 +00:00
stacktrace_test.go	fix(sentry)_: trim production env variable (#6149 )	2024-12-02 16:32:26 +00:00

README.md

Description

This package encapsulates Sentry integration. So far:

only for status-go (including when running as part of desktop and mobile)
only for panics (no other error reporting)

Sentry is only enabled for users that both:

Opted-in for metrics
Use builds from our release CI

🛬 Where

We use self-hosted Sentry: https://sentry.infra.status.im/

🕐 When

Which panics are reported:

When running inside status-desktop/status-mobile:
- during API calls in /mobile/status.go
- inside all goroutines
When running status-backend:
- any panic

Which panics are NOT reported:

When running inside status-desktop/status-mobile:
- during API calls in /services/**/api.go
  NOTE: These endpoints are executed through go-ethereum's JSON-RPC server, which internally recovers all panics and doesn't provide any events or option to set an interceptor. The only way to catch these panics is to replace the JSON-RPC server implementation.
When running status-go unit tests:
- any panic
  NOTE: Go internally runs tests in a goroutine. The only way to catch these panics in tests is to manually defer sentry.Recover() in each test. This also requires a linter (similar to lint-panics) that checks this call is present.
  This is not a priority right now, because:
  1. We have direct access to failed tests logs, which contain the panic stacktrace.
  2. Panics are not expected to happen. Test must be passing to be able to merge the code. So it's only possible with a flaky test.
  3. This would only apply to nightly/develop jobs, as we don't gather panic reports from PR-level jobs.

📦 What

Full list can be found in sentry.Event.

Notes regarding identity-disclosing properties:

ServerName - completely removed from all events
Stacktrace:
- No private user paths are exposed, as we only gather reports from CI-built binaries.
TraceID - so far will be unique for each event

Trace: A collection of spans representing the end-to-end journey of a request through your system that all share the same trace ID.

More details in sentry docs.

Configuration

There are 2 main tags to identify the error. The configuration is a bit complicated, but provides full information.

Parameters

`Environment`

Defining question Where it is running?

Set time - production can only be set at build time to prevent users from hacking the environment
- All others can be set at runtime, because on CI we sometimes use same build for multiple environments

Expected values

Value	Description
`production`	End user machine
~~`development`~~	Developer machine
~~`ci-pr`~~	PR-level CI runs
`ci-main`	CI runs for stable branch
`ci-nightly`	CI nightly jobs on stable branch

development and ci-pr are dropped, because we only want to consider panics from stable code

`Context`

Defining question What is the executable for the library?

Set time Always at build-time

Expected values

Value	Running as...
`status-desktop`	Library embedded into status-desktop
`status-mobile`	Library embedded into status-mobile
`status-backend`	Part of `cmd/status-backend` Can be other `cmd/*` as well.
`matterbridge`	Part of Status/Discord bridge app
`status-go-tests`	Inside status-go tests

Environment variables

To cover these requirements, I added these environment variables:

Environment variable	Provide time	Description
`SENTRY_DSN`	- At build time with direct call to `sentry.Init` - At runtime with `InitializeApplication` endpoint	Sentry DSN to be used
`SENTRY_CONTEXT_NAME` `SENTRY_CONTEXT_VERSION`	Build time	Execution context of status-go
`SENTRY_PRODUCTION`	Build time	When `true` or `1`: -Defines if this is a production build -Sets environment to `production` -Has precedence over runtime `SENTRY_ENVIRONMENT`
`SENTRY_ENVIRONMENT`	Run time	Sets the environment. Has no effect when `SENTRY_PRODUCTION` is set

Client integration

Set SENTRY_CONTEXT_NAME and SENTRY_CONTEXT_VERSION at status-go build time

Provide sentryDSN to the InitializeApplication call. DSN must be kept private and will be provided by CI. Expect a STATUS_GO_SENTRY_DSN environment variable to be provided by CI.

Why can't we consume `STATUS_GO_SENTRY_DSN` directly in status-go build?

 In theory, we could. But this would require us to mix approaches of getting the env variable to the code.
 Right now we prefer `go:generate + go:embed` approach (e.g. https://github.com/status-im/status-go/pull/6014), but we can't do it in this case, because we must not write the DSN to a local file, which would be a bit vulnerable. And I don't want to go back to `-ldflags="-X github.com/status-im/status-go/internal/sentry.sentryDSN=$(STATUS_GO_SENTRY_DSN:v%=%)` approach.

Implementation details

We recover from panics in here: fcedb01316/common/utils.go (L102) fcedb01316/mobile/callog/status_request_log.go (L69) fcedb01316/cmd/status-backend/main.go (L40) This covers all goroutines, because we have a linter to check that all goroutines have defer common.LogOnPanic.
Sentry is currently initialized in 2 places:
- InitializeApplication - covers desktop/mobile clients fcedb01316/mobile/status.go (L105-L108)
- in status-backend - covers functional tests: fcedb01316/cmd/status-backend/main.go (L36-L39)