logos-blockchain-testing/book/src/logging-observability.md

364 lines
12 KiB
Markdown
Raw Normal View History

# Logging & Observability
Comprehensive guide to log collection, metrics, and debugging across all runners.
## Node Logging vs Framework Logging
**Critical distinction:** Node logs and framework logs use different configuration mechanisms.
| Component | Controlled By | Purpose |
|-----------|--------------|---------|
| **Framework binaries** (`cargo run -p runner-examples --bin local_runner`) | `RUST_LOG` | Runner orchestration, deployment logs |
2026-01-26 16:36:51 +01:00
| **Node processes** (nodes spawned by runner) | `LOGOS_BLOCKCHAIN_LOG_LEVEL`, `LOGOS_BLOCKCHAIN_LOG_FILTER` (+ `LOGOS_BLOCKCHAIN_LOG_DIR` on host runner) | Consensus, mempool, network logs |
2026-01-26 16:36:51 +01:00
**Common mistake:** Setting `RUST_LOG=debug` only increases verbosity of the runner binary itself. Node logs remain at their default level unless you also set `LOGOS_BLOCKCHAIN_LOG_LEVEL=debug`.
**Example:**
```bash
# This only makes the RUNNER verbose, not the nodes:
RUST_LOG=debug cargo run -p runner-examples --bin local_runner
# This makes the NODES verbose:
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
# Both verbose (typically not needed):
2026-01-26 16:36:51 +01:00
RUST_LOG=debug LOGOS_BLOCKCHAIN_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
```
## Logging Environment Variables
See [Environment Variables Reference](environment-variables.md) for complete details. Quick summary:
| Variable | Default | Effect |
|----------|---------|--------|
2026-01-26 16:36:51 +01:00
| `LOGOS_BLOCKCHAIN_LOG_DIR` | None (console only) | Host runner: directory for per-node log files. Compose/k8s: use `cfgsync.yaml` |
| `LOGOS_BLOCKCHAIN_LOG_LEVEL` | `info` | Global log level: `error`, `warn`, `info`, `debug`, `trace` |
| `LOGOS_BLOCKCHAIN_LOG_FILTER` | None | Fine-grained target filtering (e.g., `cryptarchia=trace`) |
| `LOGOS_BLOCKCHAIN_TESTS_TRACING` | false | Enable debug tracing preset |
| `LOGOS_BLOCKCHAIN_OTLP_ENDPOINT` | None | OTLP trace endpoint (optional) |
| `LOGOS_BLOCKCHAIN_OTLP_METRICS_ENDPOINT` | None | OTLP metrics endpoint (optional) |
**Example:** Full debug logging to files:
```bash
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_TESTS_TRACING=true \
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/test-logs \
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug \
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace,chain_service=info,chain_network=info" \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
```
## Per-Node Log Files
2026-01-26 16:36:51 +01:00
When `LOGOS_BLOCKCHAIN_LOG_DIR` is set, each node writes logs to separate files:
**File naming pattern:**
2026-01-26 16:36:51 +01:00
- **Validators**: Prefix `logos-blockchain-node-0`, `logos-blockchain-node-1`, etc. (may include timestamp suffix)
**Example filenames:**
2026-01-26 16:36:51 +01:00
- `logos-blockchain-node-0.2024-12-18T14-30-00.log`
- `logos-blockchain-node-1.2024-12-18T14-30-00.log`
2026-01-26 16:36:51 +01:00
**Local runner note:** The local runner uses per-run temporary directories under the current working directory and removes them after the run unless `LOGOS_BLOCKCHAIN_TESTS_KEEP_LOGS=1`. Use `LOGOS_BLOCKCHAIN_LOG_DIR=/path/to/logs` to write per-node log files to a stable location.
## Filter Target Names
2026-01-26 16:36:51 +01:00
Common target prefixes for `LOGOS_BLOCKCHAIN_LOG_FILTER`:
| Target Prefix | Subsystem |
|---------------|-----------|
| `cryptarchia` | Consensus (Cryptarchia) |
| `nomos_blend` | Mix network/privacy layer |
| `chain_service` | Chain service (node APIs/state) |
| `chain_network` | P2P networking |
| `chain_leader` | Leader election |
**Example filter:**
```bash
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace,chain_service=info,chain_network=info"
```
---
## Accessing Logs by Runner
### Local Runner (Host Processes)
**Default (temporary directories, auto-cleanup):**
```bash
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
# Logs written to temporary directories in working directory
# Automatically cleaned up after test completes
```
**Persistent file output:**
```bash
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/local-logs \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
# After test completes:
ls /tmp/local-logs/
2026-01-26 16:36:51 +01:00
# Files with prefix: logos-blockchain-node-0*, logos-blockchain-node-1*
# May include timestamps in filename
```
2026-01-26 16:36:51 +01:00
**Tip:** Use `LOGOS_BLOCKCHAIN_LOG_DIR` for persistent per-node log files, and `LOGOS_BLOCKCHAIN_TESTS_KEEP_LOGS=1` if you want to keep the per-run temporary directories (configs/state) for post-mortem inspection.
### Compose Runner (Docker Containers)
**Via Docker logs (default, recommended):**
```bash
# List containers (note the UUID prefix in names)
docker ps --filter "name=nomos-compose-"
# Stream logs from specific container
docker logs -f <container-id-or-name>
# Or use name pattern matching:
2026-01-26 16:36:51 +01:00
docker logs -f $(docker ps --filter "name=nomos-compose-.*-node-0" -q | head -1)
# Show last 100 lines
docker logs --tail 100 <container-id>
```
**Via file collection (advanced):**
To write per-node log files inside containers, set `tracing_settings.logger: !File` in `testing-framework/assets/stack/cfgsync.yaml` (and ensure the directory is writable). To access them, you must either:
1. **Copy files out after the run:**
```bash
# Ensure cfgsync.yaml is configured to log to /logs
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_TESTNET_IMAGE=logos-blockchain-testing:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
# After test, copy files from containers:
docker ps --filter "name=nomos-compose-"
docker cp <container-id>:/logs/node* /tmp/
```
2. **Mount a host volume** (requires modifying compose template):
```yaml
volumes:
- /tmp/host-logs:/logs # Add to docker-compose.yml.tera
```
**Recommendation:** Use `docker logs` by default. File collection inside containers is complex and rarely needed.
**Keep containers for debugging:**
```bash
COMPOSE_RUNNER_PRESERVE=1 \
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_TESTNET_IMAGE=logos-blockchain-testing:local \
cargo run -p runner-examples --bin compose_runner
# Containers remain running after test—inspect with docker logs or docker exec
```
**Compose debugging variables:**
- `COMPOSE_RUNNER_HOST=127.0.0.1` — host used for readiness probes
- `COMPOSE_RUNNER_HOST_GATEWAY=host.docker.internal:host-gateway` — controls `extra_hosts` entry (set to `disable` to omit)
- `TESTNET_RUNNER_PRESERVE=1` — alias for `COMPOSE_RUNNER_PRESERVE=1`
- `COMPOSE_RUNNER_HTTP_TIMEOUT_SECS=<secs>` — override HTTP readiness timeout
2026-01-26 16:36:51 +01:00
**Note:** Container names follow pattern `nomos-compose-{uuid}-node-{index}-1` where `{uuid}` changes per run.
### K8s Runner (Kubernetes Pods)
**Via kubectl logs (use label selectors):**
```bash
# List pods
kubectl get pods
# Stream logs using label selectors (recommended)
# Helm chart labels:
2026-01-26 16:36:51 +01:00
# - nomos/logical-role=node
# - nomos/node-index
kubectl logs -l nomos/logical-role=node -f
# Stream logs from specific pod
2026-01-26 16:36:51 +01:00
kubectl logs -f logos-blockchain-node-0
# Previous logs from crashed pods
2026-01-26 16:36:51 +01:00
kubectl logs --previous -l nomos/logical-role=node
```
**Download logs for offline analysis:**
```bash
# Using label selectors
2026-01-26 16:36:51 +01:00
kubectl logs -l nomos/logical-role=node --tail=1000 > all-nodes.log
# Specific pods
2026-01-26 16:36:51 +01:00
kubectl logs logos-blockchain-node-0 > node-0.log
```
**K8s debugging variables:**
- `K8S_RUNNER_DEBUG=1` — logs Helm stdout/stderr for install commands
- `K8S_RUNNER_PRESERVE=1` — keep namespace/release after run
- `K8S_RUNNER_NODE_HOST=<ip|hostname>` — override NodePort host resolution
- `K8S_RUNNER_NAMESPACE=<name>` / `K8S_RUNNER_RELEASE=<name>` — pin namespace/release (useful for debugging)
**Specify namespace (if not using default):**
```bash
2026-01-26 16:36:51 +01:00
kubectl logs -n my-namespace -l nomos/logical-role=node -f
```
**Note:** K8s runner is optimized for local clusters (Docker Desktop K8s, minikube, kind). Remote clusters require additional setup.
---
## OTLP and Telemetry
**OTLP exporters are optional.** If you see errors about unreachable OTLP endpoints, it's safe to ignore them unless you're actively collecting traces/metrics.
**To enable OTLP:**
```bash
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_OTLP_ENDPOINT=http://localhost:4317 \
LOGOS_BLOCKCHAIN_OTLP_METRICS_ENDPOINT=http://localhost:4318 \
cargo run -p runner-examples --bin local_runner
```
**To silence OTLP errors:** Simply leave these variables unset (the default).
---
## Observability: Prometheus and Node APIs
Runners expose metrics and node HTTP endpoints for expectation code and debugging.
### Prometheus-Compatible Metrics Querying (Optional)
- Runners do **not** provision Prometheus automatically
- For a ready-to-run stack, use `scripts/setup/setup-observability.sh`:
- Compose: `scripts/setup/setup-observability.sh compose up` then `scripts/setup/setup-observability.sh compose env`
- K8s: `scripts/setup/setup-observability.sh k8s install` then `scripts/setup/setup-observability.sh k8s env`
2026-01-26 16:36:51 +01:00
- Provide `LOGOS_BLOCKCHAIN_METRICS_QUERY_URL` (PromQL base URL) to enable `ctx.telemetry()` queries
- Access from expectations when configured: `ctx.telemetry().prometheus().map(|p| p.base_url())`
**Example:**
```bash
# Start observability stack (Compose)
scripts/setup/setup-observability.sh compose up
# Get environment variables
eval $(scripts/setup/setup-observability.sh compose env)
# Run scenario with metrics
POL_PROOF_DEV_MODE=true \
2026-01-26 16:36:51 +01:00
scripts/run/run-examples.sh -t 60 -n 3 compose
```
### Grafana (Optional)
- Runners do **not** provision Grafana automatically (but `scripts/setup/setup-observability.sh` can)
2026-01-26 16:36:51 +01:00
- If you set `LOGOS_BLOCKCHAIN_GRAFANA_URL`, the deployer prints it in `TESTNET_ENDPOINTS`
- Dashboards live in `testing-framework/assets/stack/monitoring/grafana/dashboards/` (the bundled stack auto-provisions them)
**Example:**
```bash
# Bring up the bundled Prometheus+Grafana stack (optional)
scripts/setup/setup-observability.sh compose up
eval $(scripts/setup/setup-observability.sh compose env)
2026-01-26 16:36:51 +01:00
export LOGOS_BLOCKCHAIN_GRAFANA_URL=http://localhost:3000
POL_PROOF_DEV_MODE=true scripts/run/run-examples.sh -t 60 -n 3 compose
```
**Default bundled Grafana login:** `admin` / `admin` (see `scripts/observability/compose/docker-compose.yml`).
### Node APIs
2026-01-26 16:36:51 +01:00
- Access from expectations: `ctx.node_clients().node_clients().get(0)`
- Endpoints: consensus info, network info, etc.
- See `testing-framework/core/src/nodes/api_client.rs` for available methods
**Example usage in expectations:**
```rust,ignore
use testing_framework_core::scenario::{DynError, RunContext};
async fn evaluate(ctx: &RunContext) -> Result<(), DynError> {
2026-01-26 16:36:51 +01:00
let client = &ctx.node_clients().node_clients()[0];
let info = client.consensus_info().await?;
2026-01-26 16:36:51 +01:00
tracing::info!(height = info.height, "consensus info from node 0");
Ok(())
}
```
---
## Observability Flow
```mermaid
flowchart TD
Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
```
---
## Quick Reference
### Debug Logging (Host)
```bash
2026-01-26 16:36:51 +01:00
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/logs \
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug \
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace" \
POL_PROOF_DEV_MODE=true \
2026-01-26 16:36:51 +01:00
scripts/run/run-examples.sh -t 60 -n 3 host
```
### Compose with Observability
```bash
# Start observability stack
scripts/setup/setup-observability.sh compose up
eval $(scripts/setup/setup-observability.sh compose env)
# Run with metrics
POL_PROOF_DEV_MODE=true \
2026-01-26 16:36:51 +01:00
scripts/run/run-examples.sh -t 60 -n 3 compose
# Access Grafana at http://localhost:3000
```
### K8s with Debug
```bash
K8S_RUNNER_NAMESPACE=nomos-debug \
K8S_RUNNER_DEBUG=1 \
K8S_RUNNER_PRESERVE=1 \
POL_PROOF_DEV_MODE=true \
2026-01-26 16:36:51 +01:00
scripts/run/run-examples.sh -t 60 -n 3 k8s
# Inspect logs
2026-01-26 16:36:51 +01:00
kubectl logs -n nomos-debug -l nomos/logical-role=node
```
---
## See Also
- [Environment Variables](environment-variables.md) — Complete variable reference
- [Troubleshooting](troubleshooting.md) — Log-related debugging (see "Where to Find Logs")
- [Running Examples](running-examples.md) — Runner-specific logging details
- [Prerequisites & Setup](prerequisites.md) — Setup before running