# Nomos Testing Framework — Complete Reference

> **GitBook Structure Note**: This document is organized with structure markers indicating how to split it for GitBook deployment.

---

# Nomos Testing Framework

A purpose-built toolkit for exercising Nomos in realistic, multi-node environments.

## Quick Links

- [5-Minute Quickstart](#5-minute-quickstart) — Get running immediately
- [Foundations](#part-i--foundations) — Core concepts and architecture
- [User Guide](#part-ii--user-guide) — Authoring and running scenarios
- [Developer Reference](#part-iii--developer-reference) — Extending the framework
- [Recipes](#part-v--scenario-recipes) — Copy-paste runnable examples

## Reading Guide by Role

| If you are... | Start with... | Then read... |
|---------------|---------------|--------------|
| **Protocol/Core Engineer** | Quickstart → Testing Philosophy | Workloads & Expectations → Recipes |
| **Infra/DevOps** | Quickstart → Runners | Operations → Configuration Sync → Troubleshooting |
| **Test Designer** | Quickstart → Authoring Scenarios | DSL Cheat Sheet → Recipes → Extending |

## Prerequisites

This book assumes:

- Rust competency (async/await, traits, cargo)
- Basic familiarity with Nomos architecture (validators, executors, DA)
- Docker knowledge (for the Compose runner)
- Optional: Kubernetes access (for the K8s runner)

---

# 5-Minute Quickstart

Get a scenario running in under 5 minutes.

## Step 1: Clone and Build

```bash
# Clone the testing framework (assumes a nomos-node sibling checkout).
# Note: If the testing framework lives inside the main Nomos monorepo,
# adjust the clone URL and paths accordingly.
git clone https://github.com/logos-co/nomos-testing.git
cd nomos-testing

# Build the testing framework crates
cargo build -p testing-framework-core -p testing-framework-workflows
```

> **Build modes**: Node binaries use `--release` for realistic performance. Framework crates use debug builds for faster iteration. For pure development speed, you can build everything in debug mode.

## Step 2: Run the Simplest Scenario

```bash
# Run a local 2-validator smoke test
cargo test --package tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture
```

## Step 3: What Good Output Looks Like

```
running 1 test
[INFO] Spawning validator 0 on port 18800
[INFO] Spawning validator 1 on port 18810
[INFO] Waiting for network readiness...
[INFO] Network ready: all peers connected
[INFO] Waiting for membership readiness...
[INFO] Membership ready for session 0
[INFO] Starting workloads...
[INFO] Transaction workload submitting at 5 tx/block
[INFO] DA workload: channel inscription submitted
[INFO] Block 1 observed: 3 transactions
[INFO] Block 2 observed: 5 transactions
...
[INFO] Workloads complete, evaluating expectations
[INFO] consensus_liveness: target=8, observed heights=[12, 11] ✓
[INFO] tx_inclusion_expectation: 42/50 included (84%) ✓
test local_runner_mixed_workloads ... ok
```

## Step 4: What Failure Looks Like

```
[ERROR] consensus_liveness violated (target=8):
  - validator-0 height 2 below target 8
  - validator-1 height 3 below target 8
test local_runner_mixed_workloads ... FAILED
```

Common causes: run duration too short, readiness not complete, or a crashed node.
## Step 5: Modify a Scenario

Open `tests/workflows/tests/local_runner.rs`:

```rust
// Change this:
const RUN_DURATION: Duration = Duration::from_secs(60);

// To this for a longer run:
const RUN_DURATION: Duration = Duration::from_secs(120);

// Or change validator count:
const VALIDATORS: usize = 3; // was 2
```

Re-run:

```bash
cargo test --package tests-workflows --test local_runner -- --nocapture
```

You're now ready to explore the framework!

---

# Part I — Foundations

## Introduction

The Nomos Testing Framework bridges the gap between small, isolated unit tests and full-system validation by letting teams:

1. **Describe** a cluster layout (topology)
2. **Drive** meaningful traffic (workloads)
3. **Assert** outcomes (expectations)

...all in one coherent, portable plan (a `Scenario` in code terms).

### Why Multi-Node Testing?

Many Nomos behaviors only emerge when multiple roles interact:

```
┌─────────────────────────────────────────────────────────────────┐
│                 BEHAVIORS REQUIRING MULTI-NODE                  │
├─────────────────────────────────────────────────────────────────┤
│ • Block progression across validators                           │
│ • Data availability sampling and dispersal                      │
│ • Consensus under network partitions                            │
│ • Liveness recovery after node restarts                         │
│ • Transaction propagation and inclusion                         │
│ • Membership and session transitions                            │
└─────────────────────────────────────────────────────────────────┘
```

Unit tests can't catch these. This framework makes multi-node checks declarative, observable, and repeatable.

### Target Audience

| Role | Primary Concerns |
|------|------------------|
| **Protocol Engineers** | Consensus correctness, DA behavior, block progression |
| **Infrastructure/DevOps** | Runners, CI integration, logs, failure triage |
| **QA/Test Designers** | Scenario composition, workload tuning, coverage |

---

## Architecture Overview

The framework follows a clear pipeline:

```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌─────────────┐
│ TOPOLOGY │──▶│ SCENARIO │──▶│  RUNNER  │──▶│ WORKLOADS│──▶│EXPECTATIONS │
│          │   │          │   │          │   │          │   │             │
│  Shape   │   │ Assemble │   │ Deploy & │   │  Drive   │   │   Verify    │
│ cluster  │   │   plan   │   │   wait   │   │ traffic  │   │  outcomes   │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └─────────────┘
```

### Component Responsibilities

| Component | Responsibility | Key Types |
|-----------|----------------|-----------|
| **Topology** | Declares cluster shape: node counts, network layout, DA parameters | `TopologyConfig`, `GeneratedTopology`, `TopologyBuilder` |
| **Scenario** | Assembles topology + workloads + expectations + duration | `Scenario`, `ScenarioBuilder` |
| **Runner** | Deploys to environment, waits for readiness, provides `RunContext` | `Runner`, `LocalDeployer`, `ComposeRunner`, `K8sRunner` |
| **Workloads** | Generate traffic/conditions during the run | `Workload` trait, `TransactionWorkload`, `DaWorkload`, `RandomRestartWorkload` |
| **Expectations** | Judge success/failure after workloads complete | `Expectation` trait, `ConsensusLiveness`, `TxInclusionExpectation` |

### Type Flow Diagram

```
TopologyConfig
     │
     │ TopologyBuilder::new()
     ▼
TopologyBuilder ──.build()──▶ GeneratedTopology
                                    │
                                    │ contains
                                    ▼
                           GeneratedNodeConfig[]
                                    │
                                    │ Runner spawns
                                    ▼
                           Topology (live nodes)
                                    │
                                    │ provides
                                    ▼
                               NodeClients
                                    │
                                    │ wrapped in
                                    ▼
                               RunContext
```

```
ScenarioBuilder
     │
     │ .with_workload() / .with_expectation() / .with_run_duration()
     │
     │ .build()
     ▼
  Scenario
     │
     │ Deployer::deploy()
     ▼
   Runner
     │
     │ .run(&mut scenario)
     ▼
RunHandle (success) or ScenarioError (failure)
```
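The same flow expressed as code, as a minimal sketch: it uses only calls that appear in the recipes later in this book (Recipe 1 is the full version), so treat it as an orientation aid rather than a complete test.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

#[tokio::test]
async fn pipeline_in_miniature() {
    // SCENARIO: topology + duration + expectation, assembled as data.
    let mut plan = ScenarioBuilder::with_node_counts(2, 0)
        .with_run_duration(Duration::from_secs(60))
        .expect_consensus_liveness()
        .build();

    // RUNNER: deploy and wait for readiness (network, membership, DA).
    let runner = LocalDeployer::default().deploy(&plan).await.expect("deployment");

    // WORKLOADS + EXPECTATIONS: drive the run, then evaluate outcomes.
    runner.run(&mut plan).await.expect("scenario passed");
}
```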
---

## Testing Philosophy

### Core Principles

1. **Declarative over imperative**
   - Describe desired state; let the framework orchestrate
   - Scenarios are data, not scripts
2. **Observable health signals**
   - Prefer liveness/inclusion signals over internal debug state
   - If users can't see it, don't assert on it
3. **Determinism first**
   - Fixed topologies and traffic rates by default
   - Variability is opt-in (chaos workloads)
4. **Protocol time, not wall time**
   - Reason in blocks and slots
   - Reduces host-speed dependence
5. **Minimum run window**
   - Always allow enough blocks for meaningful assertions
   - The framework enforces a minimum of 2 blocks
6. **Chaos with intent**
   - Chaos workloads are for resilience testing only
   - Avoid chaos in basic functional smoke tests; reserve it for dedicated resilience scenarios

### Testing Spectrum

```
┌────────────────────────────────────────────────────────────────┐
│                  WHERE THIS FRAMEWORK FITS                     │
├──────────────┬────────────────────┬────────────────────────────┤
│  UNIT TESTS  │    INTEGRATION     │   MULTI-NODE SCENARIOS     │
│              │                    │                            │
│ Fast         │ Single process     │ ◀── THIS FRAMEWORK         │
│ Isolated     │ Mock network       │                            │
│ Deterministic│ No real timing     │ Real networking            │
│              │                    │ Protocol timing            │
│ ~1000s/sec   │ ~100s/sec          │ ~1-10/hour                 │
└──────────────┴────────────────────┴────────────────────────────┘
```

---

## Scenario Lifecycle

### Phase Overview

```
┌─────────┐   ┌─────────┐   ┌───────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐
│  PLAN   │──▶│ DEPLOY  │──▶│ READINESS │──▶│  DRIVE  │──▶│ COOLDOWN │──▶│ EVALUATE │──▶│ CLEANUP │
└─────────┘   └─────────┘   └───────────┘   └─────────┘   └──────────┘   └──────────┘   └─────────┘
```

### Detailed Timeline

```
Time ──────────────────────────────────────────────────────────────▶

│  PLAN   │   DEPLOY    │  READY  │    WORKLOADS     │COOL│ EVAL │
│         │             │         │                  │DOWN│      │
│ Build   │ Spawn       │ Network │ Traffic runs     │    │Check │
│scenario │ nodes       │ DA      │ Blocks produce   │ 5× │ all  │
│         │ (local/     │ Member  │                  │blk │expect│
│         │ docker/k8s) │ ship    │                  │    │      │
▼         ▼             ▼         ▼                  ▼    ▼      ▼
t=0       t=5s          t=30s     t=35s              t=95s t=100s t=105s
                                  (example 60s run)
                                                                 │
                                                                 ▼
                                                              CLEANUP
```

### Phase Details

| Phase | What Happens | Code Entry Point |
|-------|--------------|------------------|
| **Plan** | Declare topology, attach workloads/expectations, set duration | `ScenarioBuilder::build()` |
| **Deploy** | Runner provisions environment | `deployer.deploy(&scenario)` |
| **Readiness** | Wait for network peers, DA balancer, membership | `wait_network_ready()`, `wait_membership_ready()`, `wait_da_balancer_ready()` |
| **Drive** | Workloads run concurrently for the configured duration | `workload.start(ctx)` inside `Runner::run_workloads()` |
| **Cooldown** | Stabilization period (5× block interval; 30s minimum if chaos was used) | Automatic in `Runner::cooldown()` |
| **Evaluate** | All expectations run; failures are **aggregated**, not short-circuited | `expectation.evaluate(ctx)` |
| **Cleanup** | Resources reclaimed via `CleanupGuard` | `Drop` impl on `Runner` |

### Readiness Phases (Detail)

Runners perform three distinct readiness checks:

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    NETWORK      │────▶│   MEMBERSHIP    │────▶│   DA BALANCER   │
│                 │     │                 │     │                 │
│  libp2p peers   │     │   Session 0     │     │ Dispersal peers │
│   connected     │     │  assignments    │     │   available     │
│                 │     │   propagated    │     │                 │
│  Timeout: 60s   │     │  Timeout: 60s   │     │  Timeout: 60s   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

---

# Part II — User Guide

## Authoring Scenarios

### The 5-Step Process

```
┌─────────────────────────────────────────────────────────────────┐
│                   SCENARIO AUTHORING FLOW                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. SHAPE TOPOLOGY          2. ATTACH WORKLOADS                 │
│     ┌─────────────┐            ┌─────────────┐                  │
│     │ Validators  │            │ Transactions│                  │
│     │ Executors   │            │ DA blobs    │                  │
│     │ Network     │            │ Chaos       │                  │
│     │ DA params   │            └─────────────┘                  │
│     └─────────────┘                                             │
│                                                                 │
│  3. DEFINE EXPECTATIONS     4. SET DURATION                     │
│     ┌─────────────┐            ┌─────────────┐                  │
│     │ Liveness    │            │ See duration│                  │
│     │ Inclusion   │            │ heuristics  │                  │
│     │ Custom      │            │ table below │                  │
│     └─────────────┘            └─────────────┘                  │
│                                                                 │
│  5. CHOOSE RUNNER                                               │
│     ┌─────────┐  ┌─────────┐  ┌─────────┐                       │
│     │ Local   │  │ Compose │  │   K8s   │                       │
│     └─────────┘  └─────────┘  └─────────┘                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Duration Heuristics

Use protocol time (blocks), not wall time. For example, with 2-second slots and an active slot coefficient of 0.9, expect roughly one block every 2–3 seconds (subject to randomness); lower coefficients stretch the interval proportionally. Individual topologies may override these defaults.

| Scenario Type | Min Blocks | Recommended Duration | Notes |
|---------------|------------|---------------------|-------|
| Smoke test | 5-10 | 30-60s | Quick validation |
| Tx throughput | 20-50 | 2-3 min | Capture steady state |
| DA + tx combined | 30-50 | 3-5 min | Observe interaction |
| Chaos/resilience | 50-100 | 5-10 min | Allow restart recovery |
| Long-run stability | 100+ | 10-30 min | Trend validation |

> **Note**: The framework enforces a minimum of 2 blocks. Very short durations are clamped automatically.
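To translate the table into numbers, the glossary formula `protocol_interval = slot_duration / active_slot_coeff` gives a rough expected block count for any run. The helper below is a back-of-the-envelope sketch for planning, not a framework API (`RunMetrics` computes the authoritative values at run time):

```rust
use std::time::Duration;

// Planning sketch (not a framework API): expected blocks for a run, using the
// glossary formula protocol_interval = slot_duration / active_slot_coeff.
fn expected_blocks(run: Duration, slot: Duration, active_slot_coeff: f64) -> u64 {
    let interval_secs = slot.as_secs_f64() / active_slot_coeff;
    (run.as_secs_f64() / interval_secs) as u64
}

// Example: a 120s run with 2s slots and coefficient 0.9 yields ~54 expected
// blocks — comfortably above the 20-50 blocks suggested for throughput tests.
```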
### Builder Pattern Overview

```rust
ScenarioBuilder::with_node_counts(validators, executors)
    // 1. Topology sub-builder
    .topology()
        .network_star()
        .validators(n)
        .executors(n)
        .apply()                    // Returns to main builder

    // 2. Wallet seeding
    .wallets(user_count)

    // 3. Workload sub-builders
    .transactions()
        .rate(per_block)
        .users(actors)
        .apply()
    .da()
        .channel_rate(n)
        .blob_rate(n)
        .apply()

    // 4. Optional chaos (changes Caps type)
    .enable_node_control()
    .chaos_random_restart()
        .validators(true)
        .executors(true)
        .min_delay(Duration)
        .max_delay(Duration)
        .target_cooldown(Duration)
        .apply()

    // 5. Duration and expectations
    .with_run_duration(duration)
    .expect_consensus_liveness()

    // 6. Build
    .build()
```

---

## Workloads

Workloads generate traffic and conditions during a scenario run.

### Available Workloads

| Workload | Purpose | Key Config | Bundled Expectation |
|----------|---------|------------|---------------------|
| **Transaction** | Submit transactions at a configurable rate | `rate`, `users` | `TxInclusionExpectation` |
| **DA** | Create channels, publish blobs | `channel_rate`, `blob_rate` | `DaWorkloadExpectation` |
| **Chaos** | Restart nodes randomly | `min_delay`, `max_delay`, `target_cooldown` | None (use `ConsensusLiveness`) |

### Transaction Workload

Submits user-level transactions at a configurable rate.

```rust
.transactions()
    .rate(5)      // 5 transactions per block opportunity
    .users(8)     // Use 8 distinct wallet actors
    .apply()
```

**Requires**: Seeded wallets (`.wallets(n)`)

### DA Workload

Drives data-availability paths: channel inscriptions and blob publishing.

```rust
.da()
    .channel_rate(1)   // 1 channel operation per block
    .blob_rate(1)      // 1 blob per channel
    .apply()
```

**Requires**: At least one executor for blob publishing.
### Chaos Workload

Triggers controlled node restarts to test resilience.

```rust
.enable_node_control()                          // Required capability
.chaos_random_restart()
    .validators(true)                           // Include validators
    .executors(true)                            // Include executors
    .min_delay(Duration::from_secs(45))         // Min time between restarts
    .max_delay(Duration::from_secs(75))         // Max time between restarts
    .target_cooldown(Duration::from_secs(120))  // Per-node cooldown
    .apply()
```

**Safety behavior**: If only one validator is configured, the chaos workload automatically skips validator restarts to avoid halting consensus.

**Cooldown behavior**: After chaos workloads, the runner adds a minimum 30-second cooldown before evaluating expectations.

---

## Expectations

Expectations are post-run assertions that judge success or failure.

### Available Expectations

| Expectation | Asserts | Default Tolerance |
|-------------|---------|-------------------|
| **ConsensusLiveness** | All validators reach a minimum block height | 80% of expected blocks |
| **TxInclusionExpectation** | Submitted transactions appear in blocks | 50% inclusion ratio |
| **DaWorkloadExpectation** | Planned channels/blobs were included | 80% inclusion ratio |
| **PrometheusBlockProduction** | Prometheus metrics show block production | Exact minimum |

### ConsensusLiveness

The primary health check. Polls each validator's HTTP consensus info.

```rust
// With the default 80% tolerance:
.expect_consensus_liveness()

// Or with a specific minimum:
.with_expectation(ConsensusLiveness::with_minimum(10))

// Or with a custom tolerance:
.with_expectation(ConsensusLiveness::with_tolerance(0.9))
```

> **Note for advanced users**: There are two `ConsensusLiveness` implementations in the codebase:
> - `testing_framework_workflows::ConsensusLiveness` — HTTP-based, checks heights via the `consensus_info()` API. This is what `.expect_consensus_liveness()` uses.
> - `testing_framework_core::scenario::expectations::ConsensusLiveness` — Also HTTP-based but with different tolerance semantics.
>
> There is also `PrometheusBlockProduction` in core for Prometheus-based metrics checks when telemetry is configured.
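As a worked example of the default tolerance, assuming the target is simply tolerance × expected blocks (which matches the quickstart output, where a run expecting 10 blocks reported `target=8`):

```rust
// Worked example of the 80% tolerance, assuming target = tolerance × expected
// blocks (consistent with the quickstart output; exact rounding may differ).
fn liveness_target(expected_blocks: u64, tolerance: f64) -> u64 {
    (expected_blocks as f64 * tolerance) as u64
}

#[test]
fn tolerance_math() {
    // 10 expected blocks at the default 0.8 tolerance → target height 8.
    assert_eq!(liveness_target(10, 0.8), 8);
}
```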
### Expectation Lifecycle

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   init()    │────▶│start_capture│────▶│ evaluate()  │
│             │     │     ()      │     │             │
│  Validate   │     │  Snapshot   │     │   Assert    │
│  prereqs    │     │  baseline   │     │ conditions  │
│             │     │ (optional)  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
      │                    │                   │
      ▼                    ▼                   ▼
  At build()        Before workloads     After workloads
```

### Common Expectation Mistakes

| Mistake | Why It Fails | Fix |
|---------|--------------|-----|
| Expecting inclusion too soon | Transactions need blocks to be included | Increase duration |
| Wall-clock timing assertions | Host speed varies | Use block counts via `RunMetrics` |
| Duration too short | Not enough blocks observed | Use the duration heuristics table |
| Skipping `start_capture()` | Baseline not established | Implement it if comparing before/after |
| Asserting on internal state | The framework can't observe it | Use `consensus_info()` or `BlockFeed` |

---

## BlockFeed Deep Dive

The `BlockFeed` is the primary mechanism for observing block production during a run.

### What BlockFeed Provides

```rust
pub struct BlockFeed {
    // Subscribe to receive block notifications
    pub fn subscribe(&self) -> broadcast::Receiver<BlockRecord>;

    // Access aggregate statistics
    pub fn stats(&self) -> Arc<BlockStats>;
}

pub struct BlockRecord {
    pub header: HeaderId,   // Block header ID
    pub block: Arc<Block>,  // Full block with transactions
}

pub struct BlockStats {
    // Total transactions observed across all blocks
    pub fn total_transactions(&self) -> u64;
}
```

### How It Works

```
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  BlockScanner  │────▶│   BlockFeed    │────▶│  Subscribers   │
│                │     │                │     │                │
│ Polls validator│     │   broadcast    │     │   Workloads    │
│ consensus_info │     │    channel     │     │  Expectations  │
│ every 1 second │     │ (1024 buffer)  │     │                │
│                │     │                │     │                │
│ Fetches blocks │     │ Records stats  │     │                │
│ via storage_   │     │                │     │                │
│ block()        │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘
```

### Using BlockFeed in Workloads

```rust
async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();

    loop {
        match receiver.recv().await {
            Ok(record) => {
                // Process the block
                let height: u64 = record.block.header().slot().into();
                let tx_count = record.block.transactions().len();

                // Check for specific transactions
                for tx in record.block.transactions() {
                    // ... examine transaction
                }
            }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                // Fell behind; n messages were skipped
                continue;
            }
            Err(broadcast::error::RecvError::Closed) => {
                return Err("block feed closed".into());
            }
        }
    }
}
```

### Using BlockFeed in Expectations

```rust
async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();
    let observed = Arc::new(Mutex::new(HashSet::new()));
    let observed_clone = Arc::clone(&observed);

    // Spawn a background task to collect observations
    tokio::spawn(async move {
        while let Ok(record) = receiver.recv().await {
            // Record what we observe
            let mut guard = observed_clone.lock().unwrap();
            for tx in record.block.transactions() {
                guard.insert(tx.hash());
            }
        }
    });

    self.observed = Some(observed);
    Ok(())
}

async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let observed = self.observed.as_ref().ok_or("not captured")?;
    let guard = observed.lock().unwrap();

    // Compare observed vs expected
    if guard.len() < self.expected_count {
        return Err(format!(
            "insufficient inclusions: {} < {}",
            guard.len(),
            self.expected_count
        ).into());
    }
    Ok(())
}
```
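If per-transaction tracking is more than you need, the aggregate `BlockStats` shown earlier can back a much simpler check. A sketch, in the same bare-method style as the examples above:

```rust
// Sketch: a coarse expectation that checks aggregate stats instead of
// tracking individual blocks (total_transactions() is listed above).
async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let stats = ctx.block_feed().stats();
    if stats.total_transactions() == 0 {
        return Err("no transactions observed in any block".into());
    }
    Ok(())
}
```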
---

# Runner: Local

Runs node binaries as local processes on the host.

## What It Does

- Spawns validators/executors directly on the host with ephemeral data dirs.
- Binds HTTP/libp2p ports on localhost; no containers involved.
- Fastest feedback loop; best for unit-level scenarios and debugging.

## Prerequisites

- Rust toolchain installed.
- No ports in use on the default ranges (see the runner config if you need to override).

## How to Run

```bash
cargo test -p tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture
```

Adjust validator/executor counts inside the test file or via the scenario builder.

## Troubleshooting

- Port already in use → change base ports in the test or stop the conflicting process.
- Slow start on first run → binaries need to be built; reruns are faster.
- No blocks → ensure workloads are enabled and the duration is long enough (≥60s default).

---

# Runner: Docker Compose

Runs validators/executors in Docker containers using docker-compose.

## What It Does

- Builds/pulls the node image, then creates a network and one container per role.
- Uses Compose health checks for readiness, then runs workloads/expectations.
- Cleans up containers and network unless preservation is requested.

## Prerequisites

- Docker with the Compose plugin.
- Built node image available locally (default `nomos-testnet:local`).
  - Build from repo root: `testnet/scripts/build_test_image.sh`
- Optional env vars:
  - `NOMOS_TESTNET_IMAGE` (override tag)
  - `COMPOSE_NODE_PAIRS=1x1` (validators × executors)
  - `COMPOSE_RUNNER_PRESERVE=1` to keep the stack for inspection

## How to Run

```bash
POL_PROOF_DEV_MODE=true COMPOSE_NODE_PAIRS=1x1 \
cargo test -p tests-workflows compose_runner_mixed_workloads -- --nocapture
```

## Troubleshooting

- Image not found → set `NOMOS_TESTNET_IMAGE` to a built/pulled tag.
- Peers not connecting → inspect `docker compose logs` for the validator/executor.
- Stack left behind → `docker compose -p <project> down` and remove the network.

---

# Runner: Kubernetes

Deploys validators/executors as a Helm release into the current Kubernetes context.

## What It Does

- Builds/pulls the node image, packages Helm assets, installs into a unique namespace.
- Waits for pod readiness and the validator HTTP endpoint, then drives workloads.
- Tears down the namespace unless preservation is requested.

## Prerequisites

- kubectl and Helm on PATH; a running Kubernetes cluster/context (e.g., Docker Desktop, kind).
- Docker buildx to build the node image for your arch.
- Built image tag exported:
  - Build: `testnet/scripts/build_test_image.sh` (default tag `nomos-testnet:local`)
  - Export: `export NOMOS_TESTNET_IMAGE=nomos-testnet:local`
- Optional: `K8S_RUNNER_PRESERVE=1` to keep the namespace for debugging.

## How to Run

```bash
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
cargo test -p tests-workflows demo_k8s_runner_tx_workload -- --nocapture
```

## Troubleshooting

- Timeout waiting for validator HTTP → check pod logs: `kubectl logs -n <namespace> deploy/validator`.
- No peers/tx inclusion → inspect the rendered `/config.yaml` in the pod and the cfgsync logs.
- Cleanup stuck → `kubectl delete namespace <namespace>`, using the preserved namespace name.

---

## Runners

Runners deploy scenarios to different environments.

### Runner Decision Matrix

| Goal | Recommended Runner | Why |
|------|-------------------|-----|
| Fast local iteration | `LocalDeployer` | No container overhead |
| Reproducible e2e checks | `ComposeRunner` | Stable multi-node isolation |
| High fidelity / CI | `K8sRunner` | Real cluster behavior |
| Config validation only | Dry-run (future) | Catch errors before nodes |

### Runner Comparison

| Aspect | LocalDeployer | ComposeRunner | K8sRunner |
|--------|---------------|---------------|-----------|
| **Speed** | ⚡ Fastest | 🔄 Medium | 🏗️ Slowest |
| **Setup** | Binaries only | Docker daemon | Cluster access |
| **Isolation** | Process-level | Container-level | Pod-level |
| **Port discovery** | Direct | Auto via Docker | NodePort |
| **Node control** | Full | Via container restart | Via pod restart |
| **Observability** | Local files | Container logs | Prometheus + logs |
| **CI suitability** | Dev only | Good | Best |

### LocalDeployer

Spawns nodes as host processes.

```rust
let deployer = LocalDeployer::default();

// Or skip the membership check for faster startup:
let deployer = LocalDeployer::new().with_membership_check(false);

let runner = deployer.deploy(&scenario).await?;
```

### ComposeRunner

Starts nodes in Docker containers via Docker Compose.
```rust
let deployer = ComposeRunner::default();
let runner = deployer.deploy(&scenario).await?;
```

**Uses Configuration Sync (cfgsync)** — see the Operations section.

### K8sRunner

Deploys to a Kubernetes cluster.

```rust
let deployer = K8sRunner::new();
let runner = match deployer.deploy(&scenario).await {
    Ok(r) => r,
    Err(K8sRunnerError::ClientInit { source }) => {
        // Cluster unavailable; skip rather than fail
        eprintln!("skipping: cluster unavailable: {source}");
        return;
    }
    Err(e) => panic!("deployment failed: {e}"),
};
```

---

## Operations

### Prerequisites Checklist

```
□ nomos-node checkout available (sibling directory)
□ Binaries built: cargo build -p nomos-node -p nomos-executor
□ Runner platform ready:
  □ Local: binaries in target/debug/
  □ Compose: Docker daemon running
  □ K8s: kubectl configured, cluster accessible
□ KZG prover assets fetched (for DA scenarios)
□ Ports available (default ranges: 18800+, 4400 for cfgsync)
```

### Environment Variables

| Variable | Effect | Default |
|----------|--------|---------|
| `SLOW_TEST_ENV=true` | 2× timeout multiplier for all readiness checks | `false` |
| `NOMOS_TESTS_TRACING=true` | Enable debug tracing output | `false` |
| `NOMOS_TESTS_KEEP_LOGS=1` | Preserve temp directories after the run | Delete |
| `NOMOS_TESTNET_IMAGE` | Docker image for the Compose/K8s runners | `nomos-testnet:local` |
| `COMPOSE_RUNNER_PRESERVE=1` | Keep Compose resources after the run | Delete |
| `TEST_FRAMEWORK_PROMETHEUS_PORT` | Host port for Prometheus (Compose) | `9090` |
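The `SLOW_TEST_ENV` multiplier is applied by the framework internally, but the same semantics are easy to mirror in your own helpers. A sketch (the helper below is hypothetical, not a framework API):

```rust
use std::time::Duration;

// Hypothetical helper mirroring the SLOW_TEST_ENV semantics documented above:
// double a readiness timeout when the environment is marked slow.
fn readiness_timeout(base: Duration) -> Duration {
    let slow = std::env::var("SLOW_TEST_ENV")
        .map(|v| v == "true")
        .unwrap_or(false);
    if slow { base * 2 } else { base }
}
```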
### Configuration Synchronization (cfgsync)

When running in Docker Compose or Kubernetes, the framework uses **dynamic configuration injection** instead of static config files.

```
┌─────────────────┐                     ┌─────────────────┐
│   RUNNER HOST   │                     │ NODE CONTAINER  │
│                 │                     │                 │
│ ┌─────────────┐ │     HTTP :4400      │ ┌─────────────┐ │
│ │   cfgsync   │◀├─────────────────────┤ │   cfgsync   │ │
│ │   server    │ │                     │ │   client    │ │
│ │             │ │  1. Request config  │ │             │ │
│ │   Holds     │ │  2. Receive YAML    │ │  Fetches    │ │
│ │  generated  │ │  3. Start node      │ │  config at  │ │
│ │  topology   │ │                     │ │  startup    │ │
│ └─────────────┘ │                     │ └─────────────┘ │
└─────────────────┘                     └─────────────────┘
```

**Why cfgsync?**

- Handles dynamic port discovery
- Injects cryptographic keys
- Supports topology changes without rebuilding images

**Troubleshooting cfgsync:**

| Symptom | Cause | Fix |
|---------|-------|-----|
| Containers stuck at startup | cfgsync server unreachable | Check that port 4400 is not blocked |
| "connection refused" in logs | Server not started | Verify the runner started cfgsync |
| Config mismatch errors | Stale cfgsync template | Clean temp directories |
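Conceptually, the client side of steps 1–2 amounts to an HTTP fetch at startup. The sketch below is illustrative only: the real cfgsync client is internal to the node image, and the `/config` endpoint path and plain-text payload are assumptions.

```rust
// Illustrative only: the shape of the cfgsync client step. The actual client,
// endpoint path, and payload schema are internal; "/config" is an assumption.
async fn fetch_node_config(server_host: &str) -> Result<String, reqwest::Error> {
    let url = format!("http://{server_host}:4400/config");
    // 1. Request config  2. Receive YAML (then write it to disk and start the node)
    reqwest::get(url).await?.text().await
}
```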
---

# Part IV — Reference

## Troubleshooting

### Error Messages and Fixes

#### Readiness Timeout

```
Error: readiness probe failed: timed out waiting for network readiness:
  validator#0@18800: 0 peers (expected 1)
  validator#1@18810: 0 peers (expected 1)
```

**Causes:**
- Nodes not fully started
- Network configuration mismatch
- Ports blocked

**Fixes:**
- Set `SLOW_TEST_ENV=true` for 2× timeouts
- Check node logs for startup errors
- Verify ports are available

#### Consensus Liveness Violation

```
Error: expectations failed: consensus liveness violated (target=8):
  - validator-0 height 2 below target 8
  - validator-1 height 3 below target 8
```

**Causes:**
- Run duration too short
- Node crashed during the run
- Consensus stalled

**Fixes:**
- Increase `with_run_duration()`
- Check node logs for panics
- Verify network connectivity

#### Transaction Inclusion Below Threshold

```
Error: tx_inclusion_expectation: observed 15 below required 25
```

**Causes:**
- Wallet not seeded
- Transaction rate too high
- Mempool full

**Fixes:**
- Add `.wallets(n)` to the scenario
- Reduce `.rate()` in the transaction workload
- Increase duration for more blocks

#### Chaos Workload Has No Targets

```
Error: chaos restart workload has no eligible targets
```

**Causes:**
- No validators or executors configured
- Only one validator (skipped for safety)
- Chaos disabled for both roles

**Fixes:**
- Add more validators (≥2) for chaos
- Enable `.executors(true)` if executors are present
- Use a different workload for single-validator tests

#### BlockFeed Closed

```
Error: block feed closed while waiting for channel operations
```

**Causes:**
- Source validator crashed
- Network partition
- Run ended prematurely

**Fixes:**
- Check validator logs
- Increase run duration
- Verify readiness completed

### Log Locations

| Runner | Log Location |
|--------|--------------|
| Local | Temp directory (printed at startup); set `NOMOS_TESTS_KEEP_LOGS=1` to preserve it |
| Compose | `docker logs <container>` |
| K8s | `kubectl logs <pod>` |

### Debugging Flow

```
┌─────────────────┐
│ Scenario fails  │
└────────┬────────┘
         ▼
┌───────────────────────────────────────────┐
│ 1. Check the error message category       │
│    - Readiness?   → Check startup logs    │
│    - Workload?    → Check workload config │
│    - Expectation? → Check assertions      │
└────────┬──────────────────────────────────┘
         ▼
┌───────────────────────────────────────────┐
│ 2. Check node logs                        │
│    - Panics?            → Bug in node     │
│    - Connection errors? → Network         │
│    - Config errors?     → cfgsync issue   │
└────────┬──────────────────────────────────┘
         ▼
┌───────────────────────────────────────────┐
│ 3. Reproduce with tracing                 │
│    NOMOS_TESTS_TRACING=true cargo test    │
└────────┬──────────────────────────────────┘
         ▼
┌───────────────────────────────────────────┐
│ 4. Simplify the scenario                  │
│    - Reduce validators                    │
│    - Remove workloads one by one          │
│    - Increase duration                    │
└───────────────────────────────────────────┘
```

---

## DSL Cheat Sheet

### Complete Builder Reference

```rust
// ═══════════════════════════════════════════════════════════════
// TOPOLOGY
// ═══════════════════════════════════════════════════════════════
ScenarioBuilder::with_node_counts(validators, executors)
    .topology()
        .network_star()              // Star layout (hub-spoke)
        .validators(count)           // Validator count
        .executors(count)            // Executor count
        .apply()                     // Return to main builder

// ═══════════════════════════════════════════════════════════════
// WALLET SEEDING
// ═══════════════════════════════════════════════════════════════
    .wallets(user_count)             // Uniform: 100 funds/user
    .with_wallet_config(custom)      // Custom WalletConfig

// ═══════════════════════════════════════════════════════════════
// TRANSACTION WORKLOAD
// ═══════════════════════════════════════════════════════════════
    .transactions()
        .rate(txs_per_block)         // NonZeroU64
        .users(actor_count)          // NonZeroUsize
        .apply()

// ═══════════════════════════════════════════════════════════════
// DA WORKLOAD
// ═══════════════════════════════════════════════════════════════
    .da()
        .channel_rate(ops_per_block) // Channel inscriptions
        .blob_rate(blobs_per_chan)   // Blobs per channel
        .apply()

// ═══════════════════════════════════════════════════════════════
// CHAOS WORKLOAD (requires .enable_node_control())
// ═══════════════════════════════════════════════════════════════
    .enable_node_control()           // Required first!
    .chaos_random_restart()
        .validators(bool)            // Restart validators?
        .executors(bool)             // Restart executors?
        .min_delay(Duration)         // Min between restarts
        .max_delay(Duration)         // Max between restarts
        .target_cooldown(Duration)   // Per-node cooldown
        .apply()

// ═══════════════════════════════════════════════════════════════
// DURATION & EXPECTATIONS
// ═══════════════════════════════════════════════════════════════
    .with_run_duration(Duration)     // Clamped to ≥2 blocks
    .expect_consensus_liveness()     // Default 80% tolerance
    .with_expectation(custom)        // Add custom Expectation
    .with_workload(custom)           // Add custom Workload

// ═══════════════════════════════════════════════════════════════
// BUILD
// ═══════════════════════════════════════════════════════════════
    .build()                         // Returns Scenario
```

### Quick Patterns

```rust
// Minimal smoke test
ScenarioBuilder::with_node_counts(2, 0)
    .with_run_duration(Duration::from_secs(30))
    .expect_consensus_liveness()
    .build()

// Transaction throughput
ScenarioBuilder::with_node_counts(2, 0)
    .wallets(64)
    .transactions().rate(10).users(8).apply()
    .with_run_duration(Duration::from_secs(120))
    .expect_consensus_liveness()
    .build()

// DA + transactions
ScenarioBuilder::with_node_counts(1, 1)
    .wallets(64)
    .transactions().rate(5).users(4).apply()
    .da().channel_rate(1).blob_rate(1).apply()
    .with_run_duration(Duration::from_secs(180))
    .expect_consensus_liveness()
    .build()

// Chaos resilience
ScenarioBuilder::with_node_counts(3, 1)
    .enable_node_control()
    .wallets(64)
    .transactions().rate(3).users(4).apply()
    .chaos_random_restart()
        .validators(true).executors(true)
        .min_delay(Duration::from_secs(45))
        .max_delay(Duration::from_secs(75))
        .target_cooldown(Duration::from_secs(120))
        .apply()
    .with_run_duration(Duration::from_secs(300))
    .expect_consensus_liveness()
    .build()
```
---

## API Quick Reference

### RunContext

```rust
impl RunContext {
    // ─────────────────────────────────────────────────────────────
    // TOPOLOGY ACCESS
    // ─────────────────────────────────────────────────────────────

    /// Static topology configuration
    pub fn descriptors(&self) -> &GeneratedTopology;

    /// Live node handles (if available)
    pub fn topology(&self) -> Option<&Topology>;

    // ─────────────────────────────────────────────────────────────
    // CLIENT ACCESS
    // ─────────────────────────────────────────────────────────────

    /// All node clients
    pub fn node_clients(&self) -> &NodeClients;

    /// Random node client
    pub fn random_node_client(&self) -> Option<&ApiClient>;

    /// Cluster client with retry logic
    pub fn cluster_client(&self) -> ClusterClient<'_>;

    // ─────────────────────────────────────────────────────────────
    // WALLET ACCESS
    // ─────────────────────────────────────────────────────────────

    /// Seeded wallet accounts
    pub fn wallet_accounts(&self) -> &[WalletAccount];

    // ─────────────────────────────────────────────────────────────
    // OBSERVABILITY
    // ─────────────────────────────────────────────────────────────

    /// Block observation stream
    pub fn block_feed(&self) -> BlockFeed;

    /// Prometheus metrics (if configured)
    pub fn telemetry(&self) -> &Metrics;

    // ─────────────────────────────────────────────────────────────
    // TIMING
    // ─────────────────────────────────────────────────────────────

    /// Configured run duration
    pub fn run_duration(&self) -> Duration;

    /// Expected block count for this run
    pub fn expected_blocks(&self) -> u64;

    /// Full timing metrics
    pub fn run_metrics(&self) -> RunMetrics;

    // ─────────────────────────────────────────────────────────────
    // NODE CONTROL (CHAOS)
    // ─────────────────────────────────────────────────────────────

    /// Node control handle (if enabled)
    pub fn node_control(&self) -> Option<Arc<dyn NodeControlHandle>>;
}
```

### NodeClients

```rust
impl NodeClients {
    pub fn validator_clients(&self) -> &[ApiClient];
    pub fn executor_clients(&self) -> &[ApiClient];
    pub fn random_validator(&self) -> Option<&ApiClient>;
    pub fn random_executor(&self) -> Option<&ApiClient>;
    pub fn all_clients(&self) -> impl Iterator<Item = &ApiClient>;
    pub fn any_client(&self) -> Option<&ApiClient>;
    pub fn cluster_client(&self) -> ClusterClient<'_>;
}
```

### ApiClient

```rust
impl ApiClient {
    // Consensus
    pub async fn consensus_info(&self) -> reqwest::Result<CryptarchiaInfo>;

    // Network
    pub async fn network_info(&self) -> reqwest::Result</* network info */>;

    // Transactions
    pub async fn submit_transaction(&self, tx: &SignedMantleTx) -> reqwest::Result<()>;

    // Storage
    pub async fn storage_block(&self, id: &HeaderId) -> reqwest::Result<Option<Block>>;

    // DA
    pub async fn balancer_stats(&self) -> reqwest::Result</* balancer stats */>;
    pub async fn monitor_stats(&self) -> reqwest::Result</* monitor stats */>;
    pub async fn da_get_membership(&self, session: &SessionNumber) -> reqwest::Result</* membership */>;

    // URLs
    pub fn base_url(&self) -> &Url;
}
```

### CryptarchiaInfo

```rust
pub struct CryptarchiaInfo {
    pub height: u64,    // Current block height
    pub slot: Slot,     // Current slot number
    pub tip: HeaderId,  // Tip of the chain
    // ... additional fields
}
```
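A typical use of these clients: poll a validator's height until it reaches a target. A sketch built only on the calls listed above (it assumes `DynError` is the boxed error alias used throughout this reference):

```rust
use std::time::Duration;

// Sketch: wait until a validator reports at least `target` height, polling
// consensus_info() once per second. A real test should add a timeout.
async fn wait_for_height(client: &ApiClient, target: u64) -> Result<(), DynError> {
    loop {
        let info = client.consensus_info().await?;
        if info.height >= target {
            return Ok(());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}
```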
### Key Traits

```rust
#[async_trait]
pub trait Workload: Send + Sync {
    fn name(&self) -> &str;

    fn expectations(&self) -> Vec<Box<dyn Expectation>> {
        vec![]
    }

    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics) -> Result<(), DynError> {
        Ok(())
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Expectation: Send + Sync {
    fn name(&self) -> &str;

    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics) -> Result<(), DynError> {
        Ok(())
    }

    async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        Ok(())
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Deployer: Send + Sync {
    type Error;

    async fn deploy(&self, scenario: &Scenario) -> Result<Runner, Self::Error>;
}

#[async_trait]
pub trait NodeControlHandle: Send + Sync {
    async fn restart_validator(&self, index: usize) -> Result<(), DynError>;
    async fn restart_executor(&self, index: usize) -> Result<(), DynError>;
}
```
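A minimal custom workload, to show how the trait pieces fit together. This is a sketch that only logs block arrivals; all names it uses (`Workload`, `RunContext`, `DynError`, `BlockFeed`) come from the reference above:

```rust
use async_trait::async_trait;

// Minimal custom workload: subscribes to the BlockFeed and logs transaction
// counts. name() is required; init()/expectations() keep their defaults.
struct BlockLoggerWorkload;

#[async_trait]
impl Workload for BlockLoggerWorkload {
    fn name(&self) -> &str {
        "block_logger"
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let mut receiver = ctx.block_feed().subscribe();
        // Stops on the first feed error (including lag); fine for a logging sketch.
        while let Ok(record) = receiver.recv().await {
            println!(
                "observed block with {} transactions",
                record.block.transactions().len()
            );
        }
        Ok(())
    }
}

// Attach it with `.with_workload(BlockLoggerWorkload)` before `.build()`.
```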
---

## Glossary

### Protocol Terms

| Term | Definition |
|------|------------|
| **Slot** | Fixed time interval in the consensus protocol (default: 2 seconds) |
| **Block** | Unit of consensus; contains transactions and a header |
| **Active Slot Coefficient** | Probability of block production per slot (default: 0.5) |
| **Protocol Interval** | Expected time between blocks: `slot_duration / active_slot_coeff` |

### Framework Terms

| Term | Definition |
|------|------------|
| **Topology** | Declarative description of cluster shape, roles, and parameters |
| **GeneratedTopology** | Concrete topology with generated configs, ports, and keys |
| **Scenario** | Plan combining topology + workloads + expectations + duration |
| **Workload** | Traffic/behavior generator during a run |
| **Expectation** | Post-run assertion judging success/failure |
| **BlockFeed** | Stream of block observations for workloads/expectations |
| **RunContext** | Shared context with clients, metrics, observability |
| **RunMetrics** | Computed timing: expected blocks, block interval, duration |
| **NodeClients** | Collection of API clients for validators and executors |
| **ApiClient** | HTTP client for node consensus, network, and DA endpoints |
| **cfgsync** | Dynamic configuration injection for distributed runners |

### Runner Terms

| Term | Definition |
|------|------------|
| **Deployer** | Creates a `Runner` from a `Scenario` |
| **Runner** | Manages execution: workloads, expectations, cleanup |
| **RunHandle** | Returned on success; holds context and cleanup |
| **CleanupGuard** | Ensures resources are reclaimed on drop |
| **NodeControlHandle** | Interface for restarting nodes (chaos) |

---

# Part V — Scenario Recipes

Complete, copy-paste runnable scenarios.

## Recipe 1: Minimal Smoke Test

**Goal**: Verify basic consensus works with minimal setup.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

#[tokio::test]
async fn smoke_test_consensus() {
    // Minimal: 2 validators, no workloads, just check blocks produced
    let mut plan = ScenarioBuilder::with_node_counts(2, 0)
        .topology()
            .network_star()
            .validators(2)
            .executors(0)
            .apply()
        .with_run_duration(Duration::from_secs(30))
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}
```

**Expected output**:

```
[INFO] consensus_liveness: target=4, observed heights=[6, 5] ✓
```

**Common failures**:

- `height 0 below target`: Nodes didn't start; check that the binaries exist
- Timeout: Increase to 60s or set `SLOW_TEST_ENV=true`

---

## Recipe 2: Transaction Throughput Baseline

**Goal**: Measure transaction inclusion under load.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

const VALIDATORS: usize = 2;
const TX_RATE: u64 = 10;
const USERS: usize = 8;
const WALLETS: usize = 64;
const DURATION: Duration = Duration::from_secs(120);

#[tokio::test]
async fn transaction_throughput_baseline() {
    let mut plan = ScenarioBuilder::with_node_counts(VALIDATORS, 0)
        .topology()
            .network_star()
            .validators(VALIDATORS)
            .executors(0)
            .apply()
        .wallets(WALLETS)
        .transactions()
            .rate(TX_RATE)
            .users(USERS)
            .apply()
        .with_run_duration(DURATION)
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    let handle = runner.run(&mut plan).await.expect("scenario passed");

    // Optional: Check stats
    let stats = handle.context().block_feed().stats();
    println!("Total transactions included: {}", stats.total_transactions());
}
```

**Expected output**:

```
[INFO] tx_inclusion_expectation: 180/200 included (90%) ✓
[INFO] consensus_liveness: target=15, observed heights=[18, 17] ✓
Total transactions included: 180
```

**Common failures**:

- `observed 0 below required`: Forgot `.wallets()`
- Low inclusion: Reduce `TX_RATE` or increase `DURATION`
---

## Recipe 3: DA + Transaction Combined Stress

**Goal**: Exercise both transaction and data-availability paths.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
async fn da_tx_combined_stress() {
    let mut plan = ScenarioBuilder::with_node_counts(1, 1)  // Need executor for DA
        .topology()
            .network_star()
            .validators(1)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(4)
            .apply()
        .da()
            .channel_rate(2)   // 2 channel inscriptions per block
            .blob_rate(1)      // 1 blob per channel
            .apply()
        .with_run_duration(Duration::from_secs(180))
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}
```

**Expected output**:

```
[INFO] da_workload_inclusions: 2/2 channels inscribed ✓
[INFO] tx_inclusion_expectation: 45/50 included (90%) ✓
[INFO] consensus_liveness: target=22, observed heights=[25, 24] ✓
```

**Common failures**:

- `da workload requires at least one executor`: Add an executor to the topology
- Blob publish failures: Check DA balancer readiness

---

## Recipe 4: Chaos Resilience Test

**Goal**: Verify the system recovers from node restarts.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::{ChaosBuilderExt as _, ScenarioBuilderExt as _};

#[tokio::test]
async fn chaos_resilience_test() {
    let mut plan = ScenarioBuilder::with_node_counts(3, 1)  // Need >1 validator for chaos
        .enable_node_control()                              // Required for chaos!
        .topology()
            .network_star()
            .validators(3)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(3)   // Lower rate for stability during chaos
            .users(4)
            .apply()
        .chaos_random_restart()
            .validators(true)
            .executors(true)
            .min_delay(Duration::from_secs(45))
            .max_delay(Duration::from_secs(75))
            .target_cooldown(Duration::from_secs(120))
            .apply()
        .with_run_duration(Duration::from_secs(300))        // 5 minutes
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("chaos scenario passed");
}
```

**Expected output**:

```
[INFO] Restarting validator-1
[INFO] Restarting executor-0
[INFO] Restarting validator-2
[INFO] consensus_liveness: target=35, observed heights=[42, 38, 40, 39] ✓
```

**Common failures**:

- `no eligible targets`: Need ≥2 validators (safety skips a single validator)
- Liveness violation: Increase `target_cooldown`, reduce restart frequency
---

## Recipe 5: Docker Compose Reproducible Test

**Goal**: Run in containers for CI reproducibility.

```rust
use std::time::Duration;

use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_compose::ComposeRunner;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
#[ignore = "requires Docker"]
async fn compose_reproducible_test() {
    let mut plan = ScenarioBuilder::with_node_counts(2, 1)
        .topology()
            .network_star()
            .validators(2)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(8)
            .apply()
        .da()
            .channel_rate(1)
            .blob_rate(1)
            .apply()
        .with_run_duration(Duration::from_secs(120))
        .expect_consensus_liveness()
        .build();

    let deployer = ComposeRunner::default();
    let runner = deployer.deploy(&plan).await.expect("compose deployment");

    // Verify Prometheus is available
    assert!(runner.context().telemetry().is_configured());

    runner.run(&mut plan).await.expect("compose scenario passed");
}
```

**Required environment**:

```bash
# Build the Docker image first
docker build -t nomos-testnet:local .

# Or use a custom image
export NOMOS_TESTNET_IMAGE=myregistry/nomos-testnet:v1.0
```

**Common failures**:

- `cfgsync connection refused`: Check that port 4400 is accessible
- Image not found: Build or pull `nomos-testnet:local`

---

## FAQ

**Q: Why does chaos skip validators when only one is configured?**

A: Restarting the only validator would halt consensus entirely. The framework protects against this by requiring ≥2 validators before chaos will restart validators. See `RandomRestartWorkload::targets()`.

**Q: Can I run the same scenario on different runners?**

A: Yes! The `Scenario` is runner-agnostic. Just swap the deployer:

```rust
let plan = build_my_scenario();  // Same plan

// Local
let runner = LocalDeployer::default().deploy(&plan).await?;

// Or Compose
let runner = ComposeRunner::default().deploy(&plan).await?;

// Or K8s
let runner = K8sRunner::new().deploy(&plan).await?;
```

**Q: How do I debug a flaky scenario?**

A:
1. Enable tracing: `NOMOS_TESTS_TRACING=true`
2. Keep logs: `NOMOS_TESTS_KEEP_LOGS=1`
3. Increase duration
4. Simplify (remove workloads one by one)

**Q: Why are expectations evaluated after all workloads, not during?**

A: This ensures the system has reached steady state. If you need continuous assertions, implement them inside your workload using `BlockFeed`.

**Q: How long should my scenario run?**

A: See the [Duration Heuristics](#duration-heuristics) table. Rule of thumb: enough blocks to observe your workload's effects, plus margin for variability.

**Q: What's the difference between `Plan` and `Scenario`?**

A: In the code, `ScenarioBuilder` builds a `Scenario`. The term "plan" is informal shorthand for "fully constructed scenario ready for deployment."

---

## Changelog

### v3 (Current)

**New sections:**
- 5-Minute Quickstart
- Reading Guide by Role
- Duration Heuristics table
- BlockFeed Deep Dive
- Configuration Sync (cfgsync) documentation
- Environment Variables reference
- Complete Scenario Recipes (5 recipes)
- Common Expectation Mistakes table
- Debugging Flow diagram
- GitBook structure markers

**Fixes from v2:**
- All API method names verified against the codebase
- Error messages taken from actual error types
- Environment variables verified in source

**Improvements:**
- More diagrams (timeline, readiness phases, type flow)
- Troubleshooting with actual error messages
- FAQ expanded with common questions