logos-blockchain-testing/book/src/logging-observability.md

# Logging & Observability

Comprehensive guide to log collection, metrics, and debugging across all runners.

## Node Logging vs Framework Logging

**Critical distinction:** Node logs and framework logs use different configuration mechanisms.

| Component | Controlled By | Purpose |
|-----------|--------------|---------|
| **Framework binaries** (`cargo run -p runner-examples --bin local_runner`) | `RUST_LOG` | Runner orchestration, deployment logs |
| **Node processes** (nodes spawned by runner) | `LOGOS_BLOCKCHAIN_LOG_LEVEL`, `LOGOS_BLOCKCHAIN_LOG_FILTER` (+ `LOGOS_BLOCKCHAIN_LOG_DIR` on host runner) | Consensus, mempool, network logs |

**Common mistake:** Setting `RUST_LOG=debug` only increases verbosity of the runner binary itself. Node logs remain at their default level unless you also set `LOGOS_BLOCKCHAIN_LOG_LEVEL=debug`.

**Example:**

```bash
# This only makes the RUNNER verbose, not the nodes:
RUST_LOG=debug cargo run -p runner-examples --bin local_runner

# This makes the NODES verbose:
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner

# Both verbose (typically not needed):
RUST_LOG=debug LOGOS_BLOCKCHAIN_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
```

## Logging Environment Variables

See [Environment Variables Reference](environment-variables.md) for complete details. Quick summary:

| Variable | Default | Effect |
|----------|---------|--------|
| `LOGOS_BLOCKCHAIN_LOG_DIR` | None (console only) | Host runner: directory for per-node log files. Compose/k8s: use `cfgsync.yaml` |
| `LOGOS_BLOCKCHAIN_LOG_LEVEL` | `info` | Global log level: `error`, `warn`, `info`, `debug`, `trace` |
| `LOGOS_BLOCKCHAIN_LOG_FILTER` | None | Fine-grained target filtering (e.g., `cryptarchia=trace`) |
| `LOGOS_BLOCKCHAIN_TESTS_TRACING` | false | Enable debug tracing preset |
| `LOGOS_BLOCKCHAIN_OTLP_ENDPOINT` | None | OTLP trace endpoint (optional) |
| `LOGOS_BLOCKCHAIN_OTLP_METRICS_ENDPOINT` | None | OTLP metrics endpoint (optional) |

**Example:** Full debug logging to files:

```bash
LOGOS_BLOCKCHAIN_TESTS_TRACING=true \
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/test-logs \
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug \
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace,chain_service=info,chain_network=info" \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
```

## Per-Node Log Files

When `LOGOS_BLOCKCHAIN_LOG_DIR` is set, each node writes logs to separate files:

**File naming pattern:**
- **Validators**: Prefix `logos-blockchain-node-0`, `logos-blockchain-node-1`, etc. (may include timestamp suffix)

**Example filenames:**
- `logos-blockchain-node-0.2024-12-18T14-30-00.log`
- `logos-blockchain-node-1.2024-12-18T14-30-00.log`

**Local runner note:** The local runner uses per-run temporary directories under the current working directory and removes them after the run unless `LOGOS_BLOCKCHAIN_TESTS_KEEP_LOGS=1`. Use `LOGOS_BLOCKCHAIN_LOG_DIR=/path/to/logs` to write per-node log files to a stable location.

## Filter Target Names

Common target prefixes for `LOGOS_BLOCKCHAIN_LOG_FILTER`:

| Target Prefix | Subsystem |
|---------------|-----------|
| `cryptarchia` | Consensus (Cryptarchia) |
| `nomos_blend` | Mix network/privacy layer |
| `chain_service` | Chain service (node APIs/state) |
| `chain_network` | P2P networking |
| `chain_leader` | Leader election |

**Example filter:**

```bash
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace,chain_service=info,chain_network=info"
```

---

## Accessing Logs by Runner

### Local Runner (Host Processes)

**Default (temporary directories, auto-cleanup):**

```bash
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
# Logs written to temporary directories in working directory
# Automatically cleaned up after test completes
```

**Persistent file output:**

```bash
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/local-logs \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner

# After test completes:
ls /tmp/local-logs/
# Files with prefix: logos-blockchain-node-0*, logos-blockchain-node-1*
# May include timestamps in filename
```

**Tip:** Use `LOGOS_BLOCKCHAIN_LOG_DIR` for persistent per-node log files, and `LOGOS_BLOCKCHAIN_TESTS_KEEP_LOGS=1` if you want to keep the per-run temporary directories (configs/state) for post-mortem inspection.

### Compose Runner (Docker Containers)

**Via Docker logs (default, recommended):**

```bash
# List containers (note the UUID prefix in names)
docker ps --filter "name=nomos-compose-"

# Stream logs from specific container
docker logs -f <container-id-or-name>

# Or use name pattern matching:
docker logs -f $(docker ps --filter "name=nomos-compose-.*-node-0" -q | head -1)

# Show last 100 lines
docker logs --tail 100 <container-id>
```

**Via file collection (advanced):**

To write per-node log files inside containers, set `tracing_settings.logger: !File` in `testing-framework/assets/stack/cfgsync.yaml` (and ensure the directory is writable). To access them, you must either:

1. **Copy files out after the run:**

```bash
# Ensure cfgsync.yaml is configured to log to /logs
LOGOS_BLOCKCHAIN_TESTNET_IMAGE=logos-blockchain-testing:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner

# After test, copy files from containers:
docker ps --filter "name=nomos-compose-"
docker cp <container-id>:/logs/node* /tmp/
```

2. **Mount a host volume** (requires modifying compose template):

```yaml
volumes:
  - /tmp/host-logs:/logs  # Add to docker-compose.yml.tera
```

**Recommendation:** Use `docker logs` by default. File collection inside containers is complex and rarely needed.

**Keep containers for debugging:**

```bash
COMPOSE_RUNNER_PRESERVE=1 \
LOGOS_BLOCKCHAIN_TESTNET_IMAGE=logos-blockchain-testing:local \
cargo run -p runner-examples --bin compose_runner
# Containers remain running after test—inspect with docker logs or docker exec
```

**Compose debugging variables:**
- `COMPOSE_RUNNER_HOST=127.0.0.1` — host used for readiness probes
- `COMPOSE_RUNNER_HOST_GATEWAY=host.docker.internal:host-gateway` — controls `extra_hosts` entry (set to `disable` to omit)
- `TESTNET_RUNNER_PRESERVE=1` — alias for `COMPOSE_RUNNER_PRESERVE=1`
- `COMPOSE_RUNNER_HTTP_TIMEOUT_SECS=<secs>` — override HTTP readiness timeout

**Note:** Container names follow pattern `nomos-compose-{uuid}-node-{index}-1` where `{uuid}` changes per run.

### K8s Runner (Kubernetes Pods)

**Via kubectl logs (use label selectors):**

```bash
# List pods
kubectl get pods

# Stream logs using label selectors (recommended)
# Helm chart labels:
# - nomos/logical-role=node
# - nomos/node-index
kubectl logs -l nomos/logical-role=node -f

# Stream logs from specific pod
kubectl logs -f logos-blockchain-node-0

# Previous logs from crashed pods
kubectl logs --previous -l nomos/logical-role=node
```

**Download logs for offline analysis:**

```bash
# Using label selectors
kubectl logs -l nomos/logical-role=node --tail=1000 > all-nodes.log

# Specific pods
kubectl logs logos-blockchain-node-0 > node-0.log
```

**K8s debugging variables:**
- `K8S_RUNNER_DEBUG=1` — logs Helm stdout/stderr for install commands
- `K8S_RUNNER_PRESERVE=1` — keep namespace/release after run
- `K8S_RUNNER_NODE_HOST=<ip|hostname>` — override NodePort host resolution
- `K8S_RUNNER_NAMESPACE=<name>` / `K8S_RUNNER_RELEASE=<name>` — pin namespace/release (useful for debugging)

**Specify namespace (if not using default):**

```bash
kubectl logs -n my-namespace -l nomos/logical-role=node -f
```

**Note:** K8s runner is optimized for local clusters (Docker Desktop K8s, minikube, kind). Remote clusters require additional setup.

---

## OTLP and Telemetry

**OTLP exporters are optional.** If you see errors about unreachable OTLP endpoints, it's safe to ignore them unless you're actively collecting traces/metrics.

**To enable OTLP:**

```bash
LOGOS_BLOCKCHAIN_OTLP_ENDPOINT=http://localhost:4317 \
LOGOS_BLOCKCHAIN_OTLP_METRICS_ENDPOINT=http://localhost:4318 \
cargo run -p runner-examples --bin local_runner
```

**To silence OTLP errors:** Simply leave these variables unset (the default).

---

## Observability: Prometheus and Node APIs

Runners expose metrics and node HTTP endpoints for expectation code and debugging.

### Prometheus-Compatible Metrics Querying (Optional)

- Runners do **not** provision Prometheus automatically
- For a ready-to-run stack, use `scripts/setup/setup-observability.sh`:
  - Compose: `scripts/setup/setup-observability.sh compose up` then `scripts/setup/setup-observability.sh compose env`
  - K8s: `scripts/setup/setup-observability.sh k8s install` then `scripts/setup/setup-observability.sh k8s env`
- Provide `LOGOS_BLOCKCHAIN_METRICS_QUERY_URL` (PromQL base URL) to enable `ctx.telemetry()` queries
- Access from expectations when configured: `ctx.telemetry().prometheus().map(|p| p.base_url())`

**Example:**

```bash
# Start observability stack (Compose)
scripts/setup/setup-observability.sh compose up

# Get environment variables
eval $(scripts/setup/setup-observability.sh compose env)

# Run scenario with metrics
POL_PROOF_DEV_MODE=true \
scripts/run/run-examples.sh -t 60 -n 3 compose
```

### Grafana (Optional)

- Runners do **not** provision Grafana automatically (but `scripts/setup/setup-observability.sh` can)
- If you set `LOGOS_BLOCKCHAIN_GRAFANA_URL`, the deployer prints it in `TESTNET_ENDPOINTS`
- Dashboards live in `testing-framework/assets/stack/monitoring/grafana/dashboards/` (the bundled stack auto-provisions them)

**Example:**

```bash
# Bring up the bundled Prometheus+Grafana stack (optional)
scripts/setup/setup-observability.sh compose up
eval $(scripts/setup/setup-observability.sh compose env)

export LOGOS_BLOCKCHAIN_GRAFANA_URL=http://localhost:3000
POL_PROOF_DEV_MODE=true scripts/run/run-examples.sh -t 60 -n 3 compose
```

**Default bundled Grafana login:** `admin` / `admin` (see `scripts/observability/compose/docker-compose.yml`).

### Node APIs

- Access from expectations: `ctx.node_clients().node_clients().get(0)`
- Endpoints: consensus info, network info, etc.
- See `testing-framework/core/src/nodes/api_client.rs` for available methods

**Example usage in expectations:**

```rust,ignore
use testing_framework_core::scenario::{DynError, RunContext};

async fn evaluate(ctx: &RunContext) -> Result<(), DynError> {
    let client = &ctx.node_clients().node_clients()[0];

    let info = client.consensus_info().await?;
    tracing::info!(height = info.height, "consensus info from node 0");

    Ok(())
}
```

---

## Observability Flow

```mermaid
flowchart TD
    Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
    Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
    Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
```

---

## Quick Reference

### Debug Logging (Host)

```bash
LOGOS_BLOCKCHAIN_LOG_DIR=/tmp/logs \
LOGOS_BLOCKCHAIN_LOG_LEVEL=debug \
LOGOS_BLOCKCHAIN_LOG_FILTER="cryptarchia=trace" \
POL_PROOF_DEV_MODE=true \
scripts/run/run-examples.sh -t 60 -n 3 host
```

### Compose with Observability

```bash
# Start observability stack
scripts/setup/setup-observability.sh compose up
eval $(scripts/setup/setup-observability.sh compose env)

# Run with metrics
POL_PROOF_DEV_MODE=true \
scripts/run/run-examples.sh -t 60 -n 3 compose

# Access Grafana at http://localhost:3000
```

### K8s with Debug

```bash
K8S_RUNNER_NAMESPACE=nomos-debug \
K8S_RUNNER_DEBUG=1 \
K8S_RUNNER_PRESERVE=1 \
POL_PROOF_DEV_MODE=true \
scripts/run/run-examples.sh -t 60 -n 3 k8s

# Inspect logs
kubectl logs -n nomos-debug -l nomos/logical-role=node
```

---

## See Also

- [Environment Variables](environment-variables.md) — Complete variable reference
- [Troubleshooting](troubleshooting.md) — Log-related debugging (see "Where to Find Logs")
- [Running Examples](running-examples.md) — Runner-specific logging details
- [Prerequisites & Setup](prerequisites.md) — Setup before running