526 lines
22 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Operations
Operational readiness focuses on prerequisites, environment fit, and clear
signals:
- **Prerequisites**:
- **`versions.env` file** at repository root (required by helper scripts; defines VERSION, NOMOS_NODE_REV, NOMOS_BUNDLE_VERSION)
- Keep a sibling `nomos-node` checkout available, or use `scripts/run-examples.sh` which clones/builds on demand
- Ensure the chosen runner's platform needs are met (Docker for compose, cluster access for k8s)
- CI uses prebuilt binary artifacts from the `build-binaries` workflow
- **Artifacts**: DA scenarios require KZG parameters (circuit assets) located at
`testing-framework/assets/stack/kzgrs_test_params`. Fetch them via
`scripts/setup-nomos-circuits.sh` or override the path with `NOMOS_KZGRS_PARAMS_PATH`.
- **Environment flags**: `POL_PROOF_DEV_MODE=true` is **required for all runners**
(local, compose, k8s) unless you want expensive Groth16 proof generation that
will cause tests to timeout. Configure logging via `NOMOS_LOG_DIR`, `NOMOS_LOG_LEVEL`,
and `NOMOS_LOG_FILTER` (see [Logging and Observability](#logging-and-observability)
for details). Note that nodes ignore `RUST_LOG` and only respond to `NOMOS_*` variables.
- **Readiness checks**: verify runners report node readiness before starting
workloads; this avoids false negatives from starting too early.
- **Failure triage**: map failures to missing prerequisites (wallet seeding,
node control availability), runner platform issues, or unmet expectations.
Start with liveness signals, then dive into workload-specific assertions.
Treat operational hygiene—assets present, prerequisites satisfied, observability
reachable—as the first step to reliable scenario outcomes.
## CI Usage
Both **LocalDeployer** and **ComposeDeployer** work in CI environments:
**LocalDeployer in CI:**
- Faster (no Docker overhead)
- Good for quick smoke tests
- **Trade-off:** Less isolation (processes share host)
**ComposeDeployer in CI (recommended):**
- Better isolation (containerized)
- Reproducible environment
- Includes Prometheus/observability
- **Trade-off:** Slower startup (Docker image build)
- **Trade-off:** Requires Docker daemon
See `.github/workflows/lint.yml` (jobs: `host_smoke`, `compose_smoke`) for CI examples running the demo scenarios.
## Running Examples
The framework provides three runner modes: **host** (local processes), **compose** (Docker Compose), and **k8s** (Kubernetes).
**Recommended:** Use `scripts/run-examples.sh` for all modes:
```bash
# Host mode (local processes)
scripts/run-examples.sh -t 60 -v 1 -e 1 host
# Compose mode (Docker Compose)
scripts/run-examples.sh -t 60 -v 1 -e 1 compose
# K8s mode (Kubernetes)
scripts/run-examples.sh -t 60 -v 1 -e 1 k8s
```
This script handles circuit setup, binary building/bundling, image building, and execution.
**Environment overrides:**
- `VERSION=v0.3.1` — Circuit version
- `NOMOS_NODE_REV=<commit>` — nomos-node git revision
- `NOMOS_BINARIES_TAR=path/to/bundle.tar.gz` — Use prebuilt bundle
- `NOMOS_SKIP_IMAGE_BUILD=1` — Skip image rebuild (compose/k8s)
- `NOMOS_BUNDLE_DOCKER_PLATFORM=linux/arm64|linux/amd64` — Docker platform used when building a Linux bundle on non-Linux hosts (macOS/Windows)
- `COMPOSE_CIRCUITS_PLATFORM=linux-aarch64|linux-x86_64` — Circuits platform used when building the compose/k8s image (defaults based on host arch)
- `SLOW_TEST_ENV=true` — Doubles built-in readiness timeouts (useful in slower CI / constrained laptops)
- `TESTNET_PRINT_ENDPOINTS=1` — Print `TESTNET_ENDPOINTS` / `TESTNET_PPROF` lines during deploy (set automatically by `scripts/run-examples.sh`)
- `COMPOSE_RUNNER_HTTP_TIMEOUT_SECS=<secs>` — Override compose node HTTP readiness timeout
- `K8S_RUNNER_DEPLOYMENT_TIMEOUT_SECS=<secs>` — Override k8s deployment readiness timeout
- `K8S_RUNNER_HTTP_TIMEOUT_SECS=<secs>` — Override k8s HTTP readiness timeout for port-forwards
- `K8S_RUNNER_HTTP_PROBE_TIMEOUT_SECS=<secs>` — Override k8s HTTP readiness timeout for NodePort probes
- `K8S_RUNNER_PROMETHEUS_HTTP_TIMEOUT_SECS=<secs>` — Override k8s Prometheus readiness timeout
- `K8S_RUNNER_PROMETHEUS_HTTP_PROBE_TIMEOUT_SECS=<secs>` — Override k8s Prometheus NodePort probe timeout
### Cleanup Helper
If you hit Docker build failures, mysterious I/O errors, or are running out of disk space:
```bash
scripts/clean
```
For extra Docker cache cleanup:
```bash
scripts/clean --docker
```
### Host Runner (Direct Cargo Run)
For manual control, you can run the `local_runner` binary directly:
```bash
POL_PROOF_DEV_MODE=true \
NOMOS_NODE_BIN=/path/to/nomos-node \
NOMOS_EXECUTOR_BIN=/path/to/nomos-executor \
cargo run -p runner-examples --bin local_runner
```
**Environment variables:**
- `NOMOS_DEMO_VALIDATORS=3` — Number of validators (default: 1, or use legacy `LOCAL_DEMO_VALIDATORS`)
- `NOMOS_DEMO_EXECUTORS=2` — Number of executors (default: 1, or use legacy `LOCAL_DEMO_EXECUTORS`)
- `NOMOS_DEMO_RUN_SECS=120` — Run duration in seconds (default: 60, or use legacy `LOCAL_DEMO_RUN_SECS`)
- `NOMOS_NODE_BIN` / `NOMOS_EXECUTOR_BIN` — Paths to binaries (required for direct run)
- `NOMOS_LOG_DIR=/tmp/logs` — Directory for per-node log files (works across runners)
- `NOMOS_TESTS_KEEP_LOGS=1` — Keep per-run temporary directories (useful for debugging/CI artifacts)
- `NOMOS_TESTS_TRACING=true` — Enable the debug tracing preset (optional; combine with `NOMOS_LOG_DIR` unless you have external tracing backends configured)
- `NOMOS_LOG_LEVEL=debug` — Set log level (default: info)
- `NOMOS_LOG_FILTER=consensus=trace,da=debug` — Fine-grained module filtering
**Note:** Requires circuit assets and host binaries. Use `scripts/run-examples.sh host` to handle setup automatically.
### Compose Runner (Direct Cargo Run)
For manual control, you can run the `compose_runner` binary directly. Compose requires a Docker image with embedded assets.
**Recommended setup:** Use a prebuilt bundle:
```bash
# Build a Linux bundle (includes binaries + circuits)
scripts/build-bundle.sh --platform linux
# Creates .tmp/nomos-binaries-linux-v0.3.1.tar.gz
# Build image (embeds bundle assets)
export NOMOS_BINARIES_TAR=.tmp/nomos-binaries-linux-v0.3.1.tar.gz
testing-framework/assets/stack/scripts/build_test_image.sh
# Run
NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
```
**Platform note (macOS / Apple silicon):**
- Docker Desktop runs a `linux/arm64` engine. If Linux bundle builds are slow/unstable when producing `.tmp/nomos-binaries-linux-*.tar.gz`, prefer `NOMOS_BUNDLE_DOCKER_PLATFORM=linux/arm64` for local compose/k8s runs.
- If you need amd64 images/binaries specifically (e.g., deploying to amd64-only environments), set `NOMOS_BUNDLE_DOCKER_PLATFORM=linux/amd64` and expect slower builds via emulation.
**Alternative:** Manual circuit/image setup (rebuilds during image build):
```bash
# Fetch and copy circuits
scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
# Build image
testing-framework/assets/stack/scripts/build_test_image.sh
# Run
NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
```
**Environment variables:**
- `NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local` — Image tag (required, must match built image)
- `POL_PROOF_DEV_MODE=true`**Required** for all runners
- `NOMOS_DEMO_VALIDATORS=3` / `NOMOS_DEMO_EXECUTORS=2` / `NOMOS_DEMO_RUN_SECS=120` — Topology overrides
- `COMPOSE_NODE_PAIRS=1x1` — Alternative topology format: "validators×executors"
- `TEST_FRAMEWORK_PROMETHEUS_PORT=9091` — Override Prometheus port (default: 9090)
- `COMPOSE_RUNNER_HOST=127.0.0.1` — Host address for port mappings
- `COMPOSE_RUNNER_PRESERVE=1` — Keep containers running after test
- `NOMOS_LOG_DIR=/tmp/compose-logs` — Write logs to files inside containers
**Compose-specific features:**
- **Node control support**: Only runner that supports chaos testing (`.enable_node_control()` + chaos workloads)
- **Prometheus observability**: Metrics at `http://localhost:9090`
**Important:**
- Containers expect KZG parameters at `/kzgrs_test_params/kzgrs_test_params` (note the repeated filename)
- Use `scripts/run-examples.sh compose` to handle all setup automatically
### K8s Runner (Direct Cargo Run)
For manual control, you can run the `k8s_runner` binary directly. K8s requires the same image setup as Compose.
**Prerequisites:**
1. **Kubernetes cluster** with `kubectl` configured
2. **Test image built** (same as Compose, preferably with prebuilt bundle)
3. **Image available in cluster** (loaded or pushed to registry)
**Build and load image:**
```bash
# Build image with bundle (recommended)
scripts/build-bundle.sh --platform linux
export NOMOS_BINARIES_TAR=.tmp/nomos-binaries-linux-v0.3.1.tar.gz
testing-framework/assets/stack/scripts/build_test_image.sh
# Load into cluster
export NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local
kind load docker-image logos-blockchain-testing:local # For kind
# OR: minikube image load logos-blockchain-testing:local # For minikube
# OR: docker push your-registry/logos-blockchain-testing:local # For remote
```
**Run the example:**
```bash
export NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local
export POL_PROOF_DEV_MODE=true
cargo run -p runner-examples --bin k8s_runner
```
**Environment variables:**
- `NOMOS_TESTNET_IMAGE` — Image tag (required)
- `POL_PROOF_DEV_MODE=true`**Required** for all runners
- `NOMOS_DEMO_VALIDATORS` / `NOMOS_DEMO_EXECUTORS` / `NOMOS_DEMO_RUN_SECS` — Topology overrides
**Important:**
- K8s runner mounts `testing-framework/assets/stack/kzgrs_test_params` as a hostPath volume with file `/kzgrs_test_params/kzgrs_test_params` inside pods
- **No node control support yet**: Chaos workloads (`.enable_node_control()`) will fail
- Use `scripts/run-examples.sh k8s` to handle all setup automatically
## Circuit Assets (KZG Parameters)
DA workloads require KZG cryptographic parameters for polynomial commitment schemes.
### Asset Location
**Default path:** `testing-framework/assets/stack/kzgrs_test_params/kzgrs_test_params`
Note the repeated filename: the directory `kzgrs_test_params/` contains a file named `kzgrs_test_params`. This is the actual proving key file.
**Container path** (compose/k8s): `/kzgrs_test_params/kzgrs_test_params`
**Override:** Set `NOMOS_KZGRS_PARAMS_PATH` to use a custom location (must point to the file):
```bash
NOMOS_KZGRS_PARAMS_PATH=/path/to/custom/params cargo run -p runner-examples --bin local_runner
```
### Directory vs File (KZG)
The system uses KZG assets in two distinct ways:
| Concept | Used by | Meaning |
|--------|---------|---------|
| **KZG directory** | deployers/scripts | A directory that contains the KZG file (and related artifacts). Defaults to `testing-framework/assets/stack/kzgrs_test_params` and is controlled by `NOMOS_KZG_DIR_REL` (relative to the workspace root). |
| **KZG file path** | node processes | A single file path passed to nodes via `NOMOS_KZGRS_PARAMS_PATH` (inside containers/pods this is typically `/kzgrs_test_params/kzgrs_test_params`). |
### Getting Circuit Assets
**Option 1: Use helper script** (recommended):
```bash
# From the repository root
chmod +x scripts/setup-nomos-circuits.sh
scripts/setup-nomos-circuits.sh v0.3.1 /tmp/nomos-circuits
# Copy to default location
cp -r /tmp/nomos-circuits/* testing-framework/assets/stack/kzgrs_test_params/
```
**Option 2: Build locally** (advanced):
```bash
# This repository does not provide a `make kzgrs_test_params` target.
# If you need to regenerate KZG params from source, follow upstream tooling
# instructions (unspecified here) or use the helper scripts above to fetch a
# known-good bundle.
```
### CI Workflow
The CI automatically fetches and places assets:
```yaml
- name: Install circuits for host build
run: |
scripts/setup-nomos-circuits.sh v0.3.1 "$TMPDIR/nomos-circuits"
cp -a "$TMPDIR/nomos-circuits"/. testing-framework/assets/stack/kzgrs_test_params/
```
### When Are Assets Needed?
| Runner | When Required |
|--------|---------------|
| **Local** | Always (for DA workloads) |
| **Compose** | During image build (baked into `NOMOS_TESTNET_IMAGE`) |
| **K8s** | During image build + deployed to cluster via hostPath volume |
**Error without assets:**
```
Error: missing KZG parameters at testing-framework/assets/stack/kzgrs_test_params/kzgrs_test_params
```
If you see this error, the file `kzgrs_test_params` is missing from the directory. Use `scripts/run-examples.sh` or `scripts/setup-nomos-circuits.sh` to fetch it.
## Logging and Observability
### Node Logging vs Framework Logging
**Critical distinction:** Node logs and framework logs use different configuration mechanisms.
| Component | Controlled By | Purpose |
|-----------|--------------|---------|
| **Framework binaries** (`cargo run -p runner-examples --bin local_runner`) | `RUST_LOG` | Runner orchestration, deployment logs |
| **Node processes** (validators, executors spawned by runner) | `NOMOS_LOG_LEVEL`, `NOMOS_LOG_FILTER`, `NOMOS_LOG_DIR` | Consensus, DA, mempool, network logs |
**Common mistake:** Setting `RUST_LOG=debug` only increases verbosity of the runner binary itself. Node logs remain at their default level unless you also set `NOMOS_LOG_LEVEL=debug`.
**Example:**
```bash
# This only makes the RUNNER verbose, not the nodes:
RUST_LOG=debug cargo run -p runner-examples --bin local_runner
# This makes the NODES verbose:
NOMOS_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
# Both verbose (typically not needed):
RUST_LOG=debug NOMOS_LOG_LEVEL=debug cargo run -p runner-examples --bin local_runner
```
### Logging Environment Variables
| Variable | Default | Effect |
|----------|---------|--------|
| `NOMOS_LOG_DIR` | None (console only) | Directory for per-node log files. If unset, logs go to stdout/stderr. |
| `NOMOS_LOG_LEVEL` | `info` | Global log level: `error`, `warn`, `info`, `debug`, `trace` |
| `NOMOS_LOG_FILTER` | None | Fine-grained target filtering (e.g., `consensus=trace,da=debug`) |
| `NOMOS_TESTS_TRACING` | `false` | Enable the debug tracing preset (optional; combine with `NOMOS_LOG_DIR` unless you have external tracing backends configured) |
| `NOMOS_OTLP_ENDPOINT` | None | OTLP trace endpoint (optional, disables OTLP noise if unset) |
| `NOMOS_OTLP_METRICS_ENDPOINT` | None | OTLP metrics endpoint (optional) |
**Example:** Full debug logging to files:
```bash
NOMOS_TESTS_TRACING=true \
NOMOS_LOG_DIR=/tmp/test-logs \
NOMOS_LOG_LEVEL=debug \
NOMOS_LOG_FILTER="nomos_consensus=trace,nomos_da_sampling=debug" \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
```
### Per-Node Log Files
When `NOMOS_LOG_DIR` is set, each node writes logs to separate files:
**File naming pattern:**
- **Validators**: Prefix `nomos-node-0`, `nomos-node-1`, etc. (may include timestamp suffix)
- **Executors**: Prefix `nomos-executor-0`, `nomos-executor-1`, etc. (may include timestamp suffix)
**Local runner note:** The local runner uses per-run temporary directories under the current working directory and removes them after the run unless `NOMOS_TESTS_KEEP_LOGS=1`. Use `NOMOS_LOG_DIR=/path/to/logs` to write per-node log files to a stable location.
### Filter Target Names
Common target prefixes for `NOMOS_LOG_FILTER`:
| Target Prefix | Subsystem |
|---------------|-----------|
| `nomos_consensus` | Consensus (Cryptarchia) |
| `nomos_da_sampling` | DA sampling service |
| `nomos_da_dispersal` | DA dispersal service |
| `nomos_da_verifier` | DA verification |
| `nomos_mempool` | Transaction mempool |
| `nomos_blend` | Mix network/privacy layer |
| `chain_network` | P2P networking |
| `chain_leader` | Leader election |
**Example filter:**
```bash
NOMOS_LOG_FILTER="nomos_consensus=trace,nomos_da_sampling=debug,chain_network=info"
```
### Accessing Logs Per Runner
#### Local Runner
**Default (temporary directories, auto-cleanup):**
```bash
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
# Logs written to temporary directories in working directory
# Automatically cleaned up after test completes
```
**Persistent file output:**
```bash
NOMOS_LOG_DIR=/tmp/local-logs \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin local_runner
# After test completes:
ls /tmp/local-logs/
# Files with prefix: nomos-node-0*, nomos-node-1*, nomos-executor-0*
# May include timestamps in filename
```
**Tip:** Use `NOMOS_LOG_DIR` for persistent per-node log files, and `NOMOS_TESTS_KEEP_LOGS=1` if you want to keep the per-run temporary directories (configs/state) for post-mortem inspection.
#### Compose Runner
**Via Docker logs (default, recommended):**
```bash
# List containers (note the UUID prefix in names)
docker ps --filter "name=nomos-compose-"
# Stream logs from specific container
docker logs -f <container-id-or-name>
# Or use name pattern matching:
docker logs -f $(docker ps --filter "name=nomos-compose-.*-validator-0" -q | head -1)
```
**Via file collection (advanced):**
Setting `NOMOS_LOG_DIR` writes files **inside the container**. To access them, you must either:
1. **Copy files out after the run:**
```bash
NOMOS_LOG_DIR=/logs \
NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local \
POL_PROOF_DEV_MODE=true \
cargo run -p runner-examples --bin compose_runner
# After test, copy files from containers:
docker ps --filter "name=nomos-compose-"
docker cp <container-id>:/logs/nomos-node-0* /tmp/
```
2. **Mount a host volume** (requires modifying compose template):
```yaml
volumes:
- /tmp/host-logs:/logs # Add to docker-compose.yml.tera
```
**Recommendation:** Use `docker logs` by default. File collection inside containers is complex and rarely needed.
**Keep containers for debugging:**
```bash
COMPOSE_RUNNER_PRESERVE=1 \
NOMOS_TESTNET_IMAGE=logos-blockchain-testing:local \
cargo run -p runner-examples --bin compose_runner
# Containers remain running after test—inspect with docker logs or docker exec
```
**Compose networking/debug knobs:**
- `COMPOSE_RUNNER_HOST=127.0.0.1` — host used for readiness probes (override for remote Docker daemons / VM networking)
- `COMPOSE_RUNNER_HOST_GATEWAY=host.docker.internal:host-gateway` — controls the `extra_hosts` entry injected into compose (set to `disable` to omit)
- `TESTNET_RUNNER_PRESERVE=1` — alias for `COMPOSE_RUNNER_PRESERVE=1`
- `COMPOSE_GRAFANA_PORT=<port>` — pin Grafana to a fixed host port instead of ephemeral assignment
- `COMPOSE_RUNNER_HTTP_TIMEOUT_SECS=<secs>` — override compose node HTTP readiness timeout
**Note:** Container names follow pattern `nomos-compose-{uuid}-validator-{index}-1` where `{uuid}` changes per run.
#### K8s Runner
**Via kubectl logs (use label selectors):**
```bash
# List pods
kubectl get pods
# Stream logs using label selectors (recommended)
# Helm chart labels:
# - nomos/logical-role=validator|executor
# - nomos/validator-index / nomos/executor-index
kubectl logs -l nomos/logical-role=validator -f
kubectl logs -l nomos/logical-role=executor -f
# Stream logs from specific pod
kubectl logs -f nomos-validator-0
# Previous logs from crashed pods
kubectl logs --previous -l nomos/logical-role=validator
```
**Download logs for offline analysis:**
```bash
# Using label selectors
kubectl logs -l nomos/logical-role=validator --tail=1000 > all-validators.log
kubectl logs -l nomos/logical-role=executor --tail=1000 > all-executors.log
# Specific pods
kubectl logs nomos-validator-0 > validator-0.log
kubectl logs nomos-executor-1 > executor-1.log
```
**K8s environment notes:**
- The k8s runner is optimized for local clusters (Docker Desktop Kubernetes / minikube / kind):
- The default image `logos-blockchain-testing:local` must be available on the clusters nodes (Docker Desktop shares the local daemon; kind/minikube often requires an explicit image load step).
- The Helm chart mounts KZG params via a `hostPath` to your workspace path; this typically wont work on remote/managed clusters without replacing it with a PV/CSI volume or baking the params into an image.
- Debug helpers:
- `K8S_RUNNER_DEBUG=1` — logs Helm stdout/stderr for install commands.
- `K8S_RUNNER_PRESERVE=1` — keep the namespace/release after the run.
- `K8S_RUNNER_NODE_HOST=<ip|hostname>` — override NodePort host resolution for non-local clusters.
- `K8S_RUNNER_NAMESPACE=<name>` / `K8S_RUNNER_RELEASE=<name>` — pin namespace/release instead of random IDs (useful for debugging)
**Specify namespace (if not using default):**
```bash
kubectl logs -n my-namespace -l nomos/logical-role=validator -f
```
### OTLP and Telemetry
**OTLP exporters are optional.** If you see errors about unreachable OTLP endpoints, it's safe to ignore them unless you're actively collecting traces/metrics.
**To enable OTLP:**
```bash
NOMOS_OTLP_ENDPOINT=http://localhost:4317 \
NOMOS_OTLP_METRICS_ENDPOINT=http://localhost:4318 \
cargo run -p runner-examples --bin local_runner
```
**To silence OTLP errors:** Simply leave these variables unset (the default).
### Observability: Prometheus and Node APIs
Runners expose metrics and node HTTP endpoints for expectation code and debugging:
**Prometheus (Compose only):**
- Default: `http://localhost:9090`
- Override: `TEST_FRAMEWORK_PROMETHEUS_PORT=9091`
- Note: the host port can vary if `9090` is unavailable; prefer the printed `TESTNET_ENDPOINTS` line as the source of truth.
- Access from expectations: `ctx.telemetry().prometheus().map(|p| p.base_url())`
**Node APIs:**
- Access from expectations: `ctx.node_clients().validator_clients().get(0)`
- Endpoints: consensus info, network info, DA membership, etc.
- See `testing-framework/core/src/nodes/api_client.rs` for available methods
```mermaid
flowchart TD
Expose[Runner exposes endpoints/ports] --> Collect[Runtime collects block/health signals]
Collect --> Consume[Expectations consume signals<br/>decide pass/fail]
Consume --> Inspect[Operators inspect logs/metrics<br/>when failures arise]
```