Nomos Testing Framework — Complete Reference

GitBook Structure Note: This document is organized with <!-- FILE: path/to/file.md --> markers indicating how to split for GitBook deployment.


Nomos Testing Framework

A purpose-built toolkit for exercising Nomos in realistic, multi-node environments.

Reading Guide by Role

| If you are... | Start with... | Then read... |
| --- | --- | --- |
| Protocol/Core Engineer | Quickstart → Testing Philosophy | Workloads & Expectations → Recipes |
| Infra/DevOps | Quickstart → Runners | Operations → Configuration Sync → Troubleshooting |
| Test Designer | Quickstart → Authoring Scenarios | DSL Cheat Sheet → Recipes → Extending |

Prerequisites

This book assumes:

  • Rust competency (async/await, traits, cargo)
  • Basic familiarity with Nomos architecture (validators, executors, DA)
  • Docker knowledge (for Compose runner)
  • Optional: Kubernetes access (for K8s runner)

5-Minute Quickstart

Get a scenario running in under 5 minutes.

Step 1: Clone and Build

# Clone the testing framework (assumes nomos-node sibling checkout)
# Note: If the testing framework lives inside the main Nomos monorepo,
# adjust the clone URL and paths accordingly.
git clone https://github.com/logos-co/nomos-testing.git
cd nomos-testing

# Build the testing framework crates
cargo build -p testing-framework-core -p testing-framework-workflows

Build modes: Node binaries use --release for realistic performance. Framework crates use debug for faster iteration. For pure development speed, you can build everything in debug mode.

Step 2: Run the Simplest Scenario

# Run a local 2-validator smoke test
cargo test --package tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture

Step 3: What Good Output Looks Like

running 1 test
[INFO] Spawning validator 0 on port 18800
[INFO] Spawning validator 1 on port 18810
[INFO] Waiting for network readiness...
[INFO] Network ready: all peers connected
[INFO] Waiting for membership readiness...
[INFO] Membership ready for session 0
[INFO] Starting workloads...
[INFO] Transaction workload submitting at 5 tx/block
[INFO] DA workload: channel inscription submitted
[INFO] Block 1 observed: 3 transactions
[INFO] Block 2 observed: 5 transactions
...
[INFO] Workloads complete, evaluating expectations
[INFO] consensus_liveness: target=8, observed heights=[12, 11] ✓
[INFO] tx_inclusion_expectation: 42/50 included (84%) ✓
test local_runner_mixed_workloads ... ok

Step 4: What Failure Looks Like

[ERROR] consensus_liveness violated (target=8):
- validator-0 height 2 below target 8
- validator-1 height 3 below target 8

test local_runner_mixed_workloads ... FAILED

Common causes: run duration too short, readiness not complete, node crashed.

Step 5: Modify a Scenario

Open tests/workflows/tests/local_runner.rs:

// Change this:
const RUN_DURATION: Duration = Duration::from_secs(60);

// To this for a longer run:
const RUN_DURATION: Duration = Duration::from_secs(120);

// Or change validator count:
const VALIDATORS: usize = 3;  // was 2

Re-run:

cargo test --package tests-workflows --test local_runner -- --nocapture

You're now ready to explore the framework!


Part I — Foundations

Introduction

The Nomos Testing Framework bridges the gap between small, isolated unit tests and full-system validation by letting teams:

  1. Describe a cluster layout (topology)
  2. Drive meaningful traffic (workloads)
  3. Assert outcomes (expectations)

...all in one coherent, portable plan (a Scenario in code terms).

Why Multi-Node Testing?

Many Nomos behaviors only emerge when multiple roles interact:

┌─────────────────────────────────────────────────────────────────┐
│                BEHAVIORS REQUIRING MULTI-NODE                   │
├─────────────────────────────────────────────────────────────────┤
│ • Block progression across validators                           │
│ • Data availability sampling and dispersal                      │
│ • Consensus under network partitions                            │
│ • Liveness recovery after node restarts                         │
│ • Transaction propagation and inclusion                         │
│ • Membership and session transitions                            │
└─────────────────────────────────────────────────────────────────┘

Unit tests can't catch these. This framework makes multi-node checks declarative, observable, and repeatable.

Target Audience

| Role | Primary Concerns |
| --- | --- |
| Protocol Engineers | Consensus correctness, DA behavior, block progression |
| Infrastructure/DevOps | Runners, CI integration, logs, failure triage |
| QA/Test Designers | Scenario composition, workload tuning, coverage |

Architecture Overview

The framework follows a clear pipeline:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌─────────────┐
│ TOPOLOGY │───▶│ SCENARIO │───▶│  RUNNER  │───▶│ WORKLOADS│───▶│EXPECTATIONS │
│          │    │          │    │          │    │          │    │             │
│ Shape    │    │ Assemble │    │ Deploy & │    │ Drive    │    │ Verify      │
│ cluster  │    │ plan     │    │ wait     │    │ traffic  │    │ outcomes    │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └─────────────┘

Component Responsibilities

| Component | Responsibility | Key Types |
| --- | --- | --- |
| Topology | Declares cluster shape: node counts, network layout, DA parameters | TopologyConfig, GeneratedTopology, TopologyBuilder |
| Scenario | Assembles topology + workloads + expectations + duration | Scenario<Caps>, ScenarioBuilder |
| Runner | Deploys to environment, waits for readiness, provides RunContext | Runner, LocalDeployer, ComposeRunner, K8sRunner |
| Workloads | Generate traffic/conditions during the run | Workload trait, TransactionWorkload, DaWorkload, RandomRestartWorkload |
| Expectations | Judge success/failure after workloads complete | Expectation trait, ConsensusLiveness, TxInclusionExpectation |

Type Flow Diagram

TopologyConfig
    │
    │ TopologyBuilder::new()
    ▼
TopologyBuilder ──.build()──▶ GeneratedTopology
                                    │
                                    │ contains
                                    ▼
                            GeneratedNodeConfig[]
                                    │
                                    │ Runner spawns
                                    ▼
                              Topology (live nodes)
                                    │
                                    │ provides
                                    ▼
                              NodeClients
                                    │
                                    │ wrapped in
                                    ▼
                              RunContext
ScenarioBuilder
    │
    │ .with_workload() / .with_expectation() / .with_run_duration()
    │
    │ .build()
    ▼
Scenario<Caps>
    │
    │ Deployer::deploy()
    ▼
Runner
    │
    │ .run(&mut scenario)
    ▼
RunHandle (success) or ScenarioError (failure)
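
In code, the same flow is a short, linear sequence. The sketch below uses the local runner and is a compact version of Recipe 1 in Part V, which shows the complete runnable test:

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

// Plan: topology + duration + expectations assembled into a Scenario.
let mut plan = ScenarioBuilder::with_node_counts(2, 0)
    .with_run_duration(Duration::from_secs(60))
    .expect_consensus_liveness()
    .build();

// Deploy: a Deployer turns the Scenario into a live Runner.
let runner = LocalDeployer::default().deploy(&plan).await?;

// Run: workloads drive traffic, then expectations are evaluated.
let _handle = runner.run(&mut plan).await?;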

Testing Philosophy

Core Principles

  1. Declarative over imperative

    • Describe desired state, let framework orchestrate
    • Scenarios are data, not scripts
  2. Observable health signals

    • Prefer liveness/inclusion signals over internal debug state
    • If users can't see it, don't assert on it
  3. Determinism first

    • Fixed topologies and traffic rates by default
    • Variability is opt-in (chaos workloads)
  4. Protocol time, not wall time

    • Reason in blocks and slots
    • Reduces host speed dependence
  5. Minimum run window

    • Always allow enough blocks for meaningful assertions
    • Framework enforces minimum 2 blocks
  6. Chaos with intent

    • Chaos workloads for resilience testing only
    • Avoid chaos in basic functional smoke tests; reserve it for dedicated resilience scenarios

Testing Spectrum

┌────────────────────────────────────────────────────────────────┐
│                    WHERE THIS FRAMEWORK FITS                   │
├──────────────┬────────────────────┬────────────────────────────┤
│  UNIT TESTS  │  INTEGRATION       │  MULTI-NODE SCENARIOS      │
│              │                    │                            │
│  Fast        │  Single process    │  ◀── THIS FRAMEWORK        │
│  Isolated    │  Mock network      │                            │
│  Deterministic│  No real timing   │  Real networking           │
│              │                    │  Protocol timing           │
│  ~1000s/sec  │  ~100s/sec         │  ~1-10/hour                │
└──────────────┴────────────────────┴────────────────────────────┘

Scenario Lifecycle

Phase Overview

┌─────────┐   ┌─────────┐   ┌───────────┐   ┌─────────┐   ┌──────────┐   ┌──────────┐   ┌─────────┐
│  PLAN   │──▶│ DEPLOY  │──▶│ READINESS │──▶│  DRIVE  │──▶│ COOLDOWN │──▶│ EVALUATE │──▶│ CLEANUP │
└─────────┘   └─────────┘   └───────────┘   └─────────┘   └──────────┘   └──────────┘   └─────────┘

Detailed Timeline

Time ──────────────────────────────────────────────────────────────────────▶

     │ PLAN          │ DEPLOY        │ READY    │ WORKLOADS      │COOL│ EVAL │
     │               │               │          │                │DOWN│      │
     │ Build         │ Spawn         │ Network  │ Traffic runs   │    │Check │
     │ scenario      │ nodes         │ DA       │ Blocks produce │ 5× │ all  │
     │               │ (local/       │ Member   │                │blk │expect│
     │               │ docker/k8s)   │ ship     │                │    │      │
     │               │               │          │                │    │      │
     ▼               ▼               ▼          ▼                ▼    ▼      ▼
   t=0            t=5s           t=30s       t=35s            t=95s t=100s t=105s
                                                                          │
                                                              (example    │
                                                               60s run)   ▼
                                                                       CLEANUP

Phase Details

| Phase | What Happens | Code Entry Point |
| --- | --- | --- |
| Plan | Declare topology, attach workloads/expectations, set duration | ScenarioBuilder::build() |
| Deploy | Runner provisions environment | deployer.deploy(&scenario) |
| Readiness | Wait for network peers, DA balancer, membership | wait_network_ready(), wait_membership_ready(), wait_da_balancer_ready() |
| Drive | Workloads run concurrently for configured duration | workload.start(ctx) inside Runner::run_workloads() |
| Cooldown | Stabilization period (5× block interval, 30s min if chaos used) | Automatic in Runner::cooldown() |
| Evaluate | All expectations run; failures aggregated (not short-circuited) | expectation.evaluate(ctx) |
| Cleanup | Resources reclaimed via CleanupGuard | Drop impl on Runner |

Readiness Phases (Detail)

Runners perform three distinct readiness checks:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ NETWORK         │────▶│ MEMBERSHIP      │────▶│ DA BALANCER     │
│                 │     │                 │     │                 │
│ libp2p peers    │     │ Session 0       │     │ Dispersal peers │
│ connected       │     │ assignments     │     │ available       │
│                 │     │ propagated      │     │                 │
│ Timeout: 60s    │     │ Timeout: 60s    │     │ Timeout: 60s    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
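
The concrete wait helpers live inside the runners; the loop below is only an illustrative sketch of the polling pattern they follow. It uses the documented ApiClient::network_info() call, but peer_count() is a hypothetical accessor over the returned Libp2pInfo, named here purely for illustration:

// Sketch of a readiness poll; `peer_count` is a hypothetical helper, not framework API.
async fn wait_network_ready_sketch(
    ctx: &RunContext,
    expected_peers: usize,
) -> Result<(), DynError> {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(60);
    loop {
        let mut all_ready = true;
        for client in ctx.node_clients().all_clients() {
            let info = client.network_info().await?;
            if peer_count(&info) < expected_peers {
                all_ready = false;
                break;
            }
        }
        if all_ready {
            return Ok(());
        }
        if tokio::time::Instant::now() >= deadline {
            return Err("timed out waiting for network readiness".into());
        }
        tokio::time::sleep(Duration::from_secs(1)).await;
    }
}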

Part II — User Guide

Authoring Scenarios

The 5-Step Process

┌─────────────────────────────────────────────────────────────────┐
│                    SCENARIO AUTHORING FLOW                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. SHAPE TOPOLOGY          2. ATTACH WORKLOADS                 │
│     ┌─────────────┐            ┌─────────────┐                  │
│     │ Validators  │            │ Transactions│                  │
│     │ Executors   │            │ DA blobs    │                  │
│     │ Network     │            │ Chaos       │                  │
│     │ DA params   │            └─────────────┘                  │
│     └─────────────┘                                             │
│                                                                 │
│  3. DEFINE EXPECTATIONS     4. SET DURATION                     │
│     ┌─────────────┐            ┌─────────────┐                  │
│     │ Liveness    │            │ See duration│                  │
│     │ Inclusion   │            │ heuristics  │                  │
│     │ Custom      │            │ table below │                  │
│     └─────────────┘            └─────────────┘                  │
│                                                                 │
│  5. CHOOSE RUNNER                                               │
│     ┌─────────┐ ┌─────────┐ ┌─────────┐                         │
│     │ Local   │ │ Compose │ │ K8s     │                         │
│     └─────────┘ └─────────┘ └─────────┘                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Duration Heuristics

Use protocol time (blocks), not wall time. The expected block interval is slot_duration / active_slot_coefficient (see the Glossary); with the default 2-second slots and an active slot coefficient of 0.5, that works out to roughly one block every ~4 seconds on average, subject to randomness. Individual topologies may override these defaults.

| Scenario Type | Min Blocks | Recommended Duration | Notes |
| --- | --- | --- | --- |
| Smoke test | 5-10 | 30-60s | Quick validation |
| Tx throughput | 20-50 | 2-3 min | Capture steady state |
| DA + tx combined | 30-50 | 3-5 min | Observe interaction |
| Chaos/resilience | 50-100 | 5-10 min | Allow restart recovery |
| Long-run stability | 100+ | 10-30 min | Trend validation |

Note: The framework enforces a minimum of 2 blocks. Very short durations are clamped automatically.
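
To translate a block budget from the table above into a wall-clock duration, start from the protocol interval. The sketch below uses the default constants documented in the Glossary; treat it as an approximation, since topologies may override slot duration and the active slot coefficient, and block production is probabilistic.

use std::time::Duration;

// Defaults documented in the Glossary; real topologies may override them.
const SLOT_DURATION: Duration = Duration::from_secs(2);
const ACTIVE_SLOT_COEFF: f64 = 0.5;

/// Rough run duration needed to observe `target_blocks` blocks.
fn duration_for_blocks(target_blocks: u64) -> Duration {
    // Expected block interval = slot_duration / active_slot_coefficient.
    let block_interval = SLOT_DURATION.as_secs_f64() / ACTIVE_SLOT_COEFF;
    Duration::from_secs_f64(target_blocks as f64 * block_interval)
}

// Example: a smoke test targeting 10 blocks needs roughly 40 seconds.
let smoke_duration = duration_for_blocks(10);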

Builder Pattern Overview

ScenarioBuilder::with_node_counts(validators, executors)
    // 1. Topology sub-builder
    .topology()
        .network_star()
        .validators(n)
        .executors(n)
        .apply()  // Returns to main builder
    
    // 2. Wallet seeding
    .wallets(user_count)
    
    // 3. Workload sub-builders
    .transactions()
        .rate(per_block)
        .users(actors)
        .apply()
    
    .da()
        .channel_rate(n)
        .blob_rate(n)
        .apply()
    
    // 4. Optional chaos (changes Caps type)
    .enable_node_control()
    .chaos_random_restart()
        .validators(true)
        .executors(true)
        .min_delay(Duration)
        .max_delay(Duration)
        .target_cooldown(Duration)
        .apply()
    
    // 5. Duration and expectations
    .with_run_duration(duration)
    .expect_consensus_liveness()
    
    // 6. Build
    .build()

Workloads

Workloads generate traffic and conditions during a scenario run.

Available Workloads

| Workload | Purpose | Key Config | Bundled Expectation |
| --- | --- | --- | --- |
| Transaction | Submit transactions at configurable rate | rate, users | TxInclusionExpectation |
| DA | Create channels, publish blobs | channel_rate, blob_rate | DaWorkloadExpectation |
| Chaos | Restart nodes randomly | min_delay, max_delay, target_cooldown | None (use ConsensusLiveness) |

Transaction Workload

Submits user-level transactions at a configurable rate.

.transactions()
    .rate(5)      // 5 transactions per block opportunity
    .users(8)     // Use 8 distinct wallet actors
    .apply()

Requires: Seeded wallets (.wallets(n))

DA Workload

Drives data-availability paths: channel inscriptions and blob publishing.

.da()
    .channel_rate(1)  // 1 channel operation per block
    .blob_rate(1)     // 1 blob per channel
    .apply()

Requires: At least one executor for blob publishing.

Chaos Workload

Triggers controlled node restarts to test resilience.

.enable_node_control()  // Required capability
.chaos_random_restart()
    .validators(true)           // Include validators
    .executors(true)            // Include executors
    .min_delay(Duration::from_secs(45))    // Min time between restarts
    .max_delay(Duration::from_secs(75))    // Max time between restarts
    .target_cooldown(Duration::from_secs(120))  // Per-node cooldown
    .apply()

Safety behavior: If only one validator is configured, the chaos workload automatically skips validator restarts to avoid halting consensus.

Cooldown behavior: After chaos workloads, the runner adds a minimum 30-second cooldown before evaluating expectations.
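
Under the hood, restarts go through the NodeControlHandle capability exposed via RunContext::node_control() (see Key Traits in Part IV). The loop below is a simplified sketch of that pattern, not the actual RandomRestartWorkload implementation; it ignores executors and per-node cooldown tracking for brevity:

use rand::Rng as _;
use std::time::Duration;

// Simplified sketch of a random-restart loop using the node control capability.
async fn restart_loop_sketch(ctx: &RunContext, validator_count: usize) -> Result<(), DynError> {
    let control = ctx.node_control().ok_or("node control not enabled")?;
    loop {
        // Wait a random delay between restarts (45-75s, matching the builder example above).
        let delay: u64 = rand::thread_rng().gen_range(45..=75);
        tokio::time::sleep(Duration::from_secs(delay)).await;

        // Safety: never restart the only validator, or consensus halts.
        if validator_count >= 2 {
            let target = rand::thread_rng().gen_range(0..validator_count);
            control.restart_validator(target).await?;
        }
    }
}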


Expectations

Expectations are post-run assertions that judge success or failure.

Available Expectations

| Expectation | Asserts | Default Tolerance |
| --- | --- | --- |
| ConsensusLiveness | All validators reach minimum block height | 80% of expected blocks |
| TxInclusionExpectation | Submitted transactions appear in blocks | 50% inclusion ratio |
| DaWorkloadExpectation | Planned channels/blobs were included | 80% inclusion ratio |
| PrometheusBlockProduction | Prometheus metrics show block production | Exact minimum |

ConsensusLiveness

The primary health check. Polls each validator's HTTP consensus info.

// With default 80% tolerance:
.expect_consensus_liveness()

// Or with specific minimum:
.with_expectation(ConsensusLiveness::with_minimum(10))

// Or with custom tolerance:
.with_expectation(ConsensusLiveness::with_tolerance(0.9))

Note for advanced users: There are two ConsensusLiveness implementations in the codebase:

  • testing_framework_workflows::ConsensusLiveness — HTTP-based, checks heights via consensus_info() API. This is what .expect_consensus_liveness() uses.
  • testing_framework_core::scenario::expectations::ConsensusLiveness — Also HTTP-based but with different tolerance semantics.

There's also PrometheusBlockProduction in core for Prometheus-based metrics checks when telemetry is configured.

Expectation Lifecycle

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    init()   │────▶│start_capture│────▶│  evaluate() │
│             │     │    ()       │     │             │
│ Validate    │     │ Snapshot    │     │ Assert      │
│ prereqs     │     │ baseline    │     │ conditions  │
│             │     │ (optional)  │     │             │
└─────────────┘     └─────────────┘     └─────────────┘
     │                    │                    │
     ▼                    ▼                    ▼
  At build()         Before workloads     After workloads
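
A minimal custom expectation only has to implement name() and evaluate(); init() and start_capture() have default no-op implementations (see Key Traits in Part IV). The skeleton below is a sketch that fails the run unless a minimum number of transactions was observed via BlockFeed statistics:

use async_trait::async_trait;

/// Sketch: fail unless at least `min_txs` transactions were observed in blocks.
struct MinTransactionsExpectation {
    min_txs: u64,
}

#[async_trait]
impl Expectation for MinTransactionsExpectation {
    fn name(&self) -> &str {
        "min_transactions"
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        let observed = ctx.block_feed().stats().total_transactions();
        if observed < self.min_txs {
            return Err(format!(
                "only {observed} transactions observed, expected at least {}",
                self.min_txs
            )
            .into());
        }
        Ok(())
    }
}

Attach it to a scenario with .with_expectation(MinTransactionsExpectation { min_txs: 25 }) before .build().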

Common Expectation Mistakes

| Mistake | Why It Fails | Fix |
| --- | --- | --- |
| Expecting inclusion too soon | Transactions need blocks to be included | Increase duration |
| Wall-clock timing assertions | Host speed varies | Use block counts via RunMetrics |
| Duration too short | Not enough blocks observed | Use duration heuristics table |
| Skipping start_capture() | Baseline not established | Implement if comparing before/after |
| Asserting on internal state | Framework can't observe it | Use consensus_info() or BlockFeed |

BlockFeed Deep Dive

The BlockFeed is the primary mechanism for observing block production during a run.

What BlockFeed Provides

pub struct BlockFeed { /* ... */ }

impl BlockFeed {
    /// Subscribe to receive block notifications
    pub fn subscribe(&self) -> broadcast::Receiver<Arc<BlockRecord>>;

    /// Access aggregate statistics
    pub fn stats(&self) -> Arc<BlockStats>;
}

pub struct BlockRecord {
    pub header: HeaderId,                    // Block header ID
    pub block: Arc<Block<SignedMantleTx>>,   // Full block with transactions
}

pub struct BlockStats { /* ... */ }

impl BlockStats {
    /// Total transactions observed across all blocks
    pub fn total_transactions(&self) -> u64;
}

How It Works

┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  BlockScanner  │────▶│   BlockFeed    │────▶│  Subscribers   │
│                │     │                │     │                │
│ Polls validator│     │ broadcast      │     │ Workloads      │
│ consensus_info │     │ channel        │     │ Expectations   │
│ every 1 second │     │ (1024 buffer)  │     │                │
│                │     │                │     │                │
│ Fetches blocks │     │ Records stats  │     │                │
│ via storage_   │     │                │     │                │
│ block()        │     │                │     │                │
└────────────────┘     └────────────────┘     └────────────────┘

Using BlockFeed in Workloads

async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();
    
    loop {
        match receiver.recv().await {
            Ok(record) => {
                // Process block
                let height: u64 = record.block.header().slot().into();
                let tx_count = record.block.transactions().len();
                
                // Check for specific transactions
                for tx in record.block.transactions() {
                    // ... examine transaction
                }
            }
            Err(broadcast::error::RecvError::Lagged(n)) => {
                // Fell behind, n messages skipped
                continue;
            }
            Err(broadcast::error::RecvError::Closed) => {
                return Err("block feed closed".into());
            }
        }
    }
}

Using BlockFeed in Expectations

async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();
    let observed = Arc::new(Mutex::new(HashSet::new()));
    let observed_clone = Arc::clone(&observed);
    
    // Spawn background task to collect observations
    tokio::spawn(async move {
        while let Ok(record) = receiver.recv().await {
            // Record what we observe
            let mut guard = observed_clone.lock().unwrap();
            for tx in record.block.transactions() {
                guard.insert(tx.hash());
            }
        }
    });
    
    self.observed = Some(observed);
    Ok(())
}

async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let observed = self.observed.as_ref().ok_or("not captured")?;
    let guard = observed.lock().unwrap();
    
    // Compare observed vs expected
    if guard.len() < self.expected_count {
        return Err(format!(
            "insufficient inclusions: {} < {}",
            guard.len(), self.expected_count
        ).into());
    }
    Ok(())
}

Runner: Local

Runs node binaries as local processes on the host.

What It Does

  • Spawns validators/executors directly on the host with ephemeral data dirs.
  • Binds HTTP/libp2p ports on localhost; no containers involved.
  • Fastest feedback loop; best for unit-level scenarios and debugging.

Prerequisites

  • Rust toolchain installed.
  • No ports in use on the default ranges (see runner config if you need to override).

How to Run

cargo test -p tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture

Adjust validator/executor counts inside the test file or via the scenario builder.

Troubleshooting

  • Port already in use → change base ports in the test or stop the conflicting process.
  • Slow start on first run → binaries need to be built; reruns are faster.
  • No blocks → ensure workloads enabled and duration long enough (≥60s default).

Runner: Docker Compose

Runs validators/executors in Docker containers using docker-compose.

What It Does

  • Builds/pulls the node image, then creates a network and one container per role.
  • Uses Compose health checks for readiness, then runs workloads/expectations.
  • Cleans up containers and network unless preservation is requested.

Prerequisites

  • Docker with the Compose plugin.
  • Built node image available locally (default nomos-testnet:local).
    • Build from repo root: testnet/scripts/build_test_image.sh
  • Optional env vars:
    • NOMOS_TESTNET_IMAGE (override tag)
    • COMPOSE_NODE_PAIRS=1x1 (validators x executors)
    • COMPOSE_RUNNER_PRESERVE=1 to keep the stack for inspection

How to Run

POL_PROOF_DEV_MODE=true COMPOSE_NODE_PAIRS=1x1 \
cargo test -p tests-workflows compose_runner_mixed_workloads -- --nocapture

Troubleshooting

  • Image not found → set NOMOS_TESTNET_IMAGE to a built/pulled tag.
  • Peers not connecting → inspect docker compose logs for validator/executor.
  • Stack left behind → docker compose -p <project> down and remove the network.

Runner: Kubernetes

Deploys validators/executors as a Helm release into the current Kubernetes context.

What It Does

  • Builds/pulls the node image, packages Helm assets, installs into a unique namespace.
  • Waits for pod readiness and validator HTTP endpoint, then drives workloads.
  • Tears down the namespace unless preservation is requested.

Prerequisites

  • kubectl and Helm on PATH; a running Kubernetes cluster/context (e.g., Docker Desktop, kind).
  • Docker buildx to build the node image for your arch.
  • Built image tag exported:
    • Build: testnet/scripts/build_test_image.sh (default tag nomos-testnet:local)
    • Export: export NOMOS_TESTNET_IMAGE=nomos-testnet:local
  • Optional: K8S_RUNNER_PRESERVE=1 to keep the namespace for debugging.

How to Run

NOMOS_TESTNET_IMAGE=nomos-testnet:local \
cargo test -p tests-workflows demo_k8s_runner_tx_workload -- --nocapture

Troubleshooting

  • Timeout waiting for validator HTTP → check pod logs: kubectl logs -n <ns> deploy/validator.
  • No peers/tx inclusion → inspect rendered /config.yaml in the pod and cfgsync logs.
  • Cleanup stuck → kubectl delete namespace <ns> using the preserved namespace name.

Runners

Runners deploy scenarios to different environments.

Runner Decision Matrix

| Goal | Recommended Runner | Why |
| --- | --- | --- |
| Fast local iteration | LocalDeployer | No container overhead |
| Reproducible e2e checks | ComposeRunner | Stable multi-node isolation |
| High fidelity / CI | K8sRunner | Real cluster behavior |
| Config validation only | Dry-run (future) | Catch config errors before spawning nodes |

Runner Comparison

| Aspect | LocalDeployer | ComposeRunner | K8sRunner |
| --- | --- | --- | --- |
| Speed | Fastest | Medium | Slowest |
| Setup | Binaries only | Docker daemon | Cluster access |
| Isolation | Process-level | Container-level | Pod-level |
| Port discovery | Direct | Auto via Docker | NodePort |
| Node control | Full | Via container restart | Via pod restart |
| Observability | Local files | Container logs | Prometheus + logs |
| CI suitability | Dev only | Good | Best |

LocalDeployer

Spawns nodes as host processes.

let deployer = LocalDeployer::default();
// Or skip membership check for faster startup:
let deployer = LocalDeployer::new().with_membership_check(false);

let runner = deployer.deploy(&scenario).await?;

ComposeRunner

Starts nodes in Docker containers via Docker Compose.

let deployer = ComposeRunner::default();
let runner = deployer.deploy(&scenario).await?;

Uses Configuration Sync (cfgsync) — see Operations section.

K8sRunner

Deploys to a Kubernetes cluster.

let deployer = K8sRunner::new();
let runner = match deployer.deploy(&scenario).await {
    Ok(r) => r,
    Err(K8sRunnerError::ClientInit { source }) => {
        // Cluster unavailable
        return;
    }
    Err(e) => panic!("deployment failed: {e}"),
};

Operations

Prerequisites Checklist

□ nomos-node checkout available (sibling directory)
□ Binaries built: cargo build -p nomos-node -p nomos-executor
□ Runner platform ready:
  □ Local: binaries in target/debug/
  □ Compose: Docker daemon running
  □ K8s: kubectl configured, cluster accessible
□ KZG prover assets fetched (for DA scenarios)
□ Ports available (default ranges: 18800+, 4400 for cfgsync)

Environment Variables

| Variable | Effect | Default |
| --- | --- | --- |
| SLOW_TEST_ENV=true | 2× timeout multiplier for all readiness checks | false |
| NOMOS_TESTS_TRACING=true | Enable debug tracing output | false |
| NOMOS_TESTS_KEEP_LOGS=1 | Preserve temp directories after run | Delete |
| NOMOS_TESTNET_IMAGE | Docker image for Compose/K8s runners | nomos-testnet:local |
| COMPOSE_RUNNER_PRESERVE=1 | Keep Compose resources after run | Delete |
| TEST_FRAMEWORK_PROMETHEUS_PORT | Host port for Prometheus (Compose) | 9090 |

Configuration Synchronization (cfgsync)

When running in Docker Compose or Kubernetes, the framework uses dynamic configuration injection instead of static config files.

┌─────────────────┐                    ┌─────────────────┐
│  RUNNER HOST    │                    │  NODE CONTAINER │
│                 │                    │                 │
│ ┌─────────────┐ │   HTTP :4400       │ ┌─────────────┐ │
│ │ cfgsync     │◀├───────────────────┤│ cfgsync     │ │
│ │ server      │ │                    │ │ client      │ │
│ │             │ │  1. Request config │ │             │ │
│ │ Holds       │ │  2. Receive YAML   │ │ Fetches     │ │
│ │ generated   │ │  3. Start node     │ │ config at   │ │
│ │ topology    │ │                    │ │ startup     │ │
│ └─────────────┘ │                    │ └─────────────┘ │
└─────────────────┘                    └─────────────────┘

Why cfgsync?

  • Handles dynamic port discovery
  • Injects cryptographic keys
  • Supports topology changes without rebuilding images

Troubleshooting cfgsync:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Containers stuck at startup | cfgsync server unreachable | Check port 4400 is not blocked |
| "connection refused" in logs | Server not started | Verify runner started cfgsync |
| Config mismatch errors | Stale cfgsync template | Clean temp directories |

Part IV — Reference

Troubleshooting

Error Messages and Fixes

Readiness Timeout

Error: readiness probe failed: timed out waiting for network readiness:
  validator#0@18800: 0 peers (expected 1)
  validator#1@18810: 0 peers (expected 1)

Causes:

  • Nodes not fully started
  • Network configuration mismatch
  • Ports blocked

Fixes:

  • Set SLOW_TEST_ENV=true for 2× timeout
  • Check node logs for startup errors
  • Verify ports are available

Consensus Liveness Violation

Error: expectations failed:
consensus liveness violated (target=8):
- validator-0 height 2 below target 8
- validator-1 height 3 below target 8

Causes:

  • Run duration too short
  • Node crashed during run
  • Consensus stalled

Fixes:

  • Increase with_run_duration()
  • Check node logs for panics
  • Verify network connectivity

Transaction Inclusion Below Threshold

Error: tx_inclusion_expectation: observed 15 below required 25

Causes:

  • Wallet not seeded
  • Transaction rate too high
  • Mempool full

Fixes:

  • Add .wallets(n) to scenario
  • Reduce .rate() in transaction workload
  • Increase duration for more blocks

Chaos Workload No Targets

Error: chaos restart workload has no eligible targets

Causes:

  • No validators or executors configured
  • Only one validator (skipped for safety)
  • Chaos disabled for both roles

Fixes:

  • Add more validators (≥2) for chaos
  • Enable .executors(true) if executors present
  • Use different workload for single-validator tests

BlockFeed Closed

Error: block feed closed while waiting for channel operations

Causes:

  • Source validator crashed
  • Network partition
  • Run ended prematurely

Fixes:

  • Check validator logs
  • Increase run duration
  • Verify readiness completed

Log Locations

| Runner | Log Location |
| --- | --- |
| Local | Temp directory (printed at startup), or set NOMOS_TESTS_KEEP_LOGS=1 |
| Compose | docker logs <container_name> |
| K8s | kubectl logs <pod_name> |

Debugging Flow

┌─────────────────┐
│ Scenario fails  │
└────────┬────────┘
         ▼
┌────────────────────────────────────────┐
│ 1. Check error message category        │
│    - Readiness? → Check startup logs   │
│    - Workload? → Check workload config │
│    - Expectation? → Check assertions   │
└────────┬───────────────────────────────┘
         ▼
┌────────────────────────────────────────┐
│ 2. Check node logs                     │
│    - Panics? → Bug in node             │
│    - Connection errors? → Network      │
│    - Config errors? → cfgsync issue    │
└────────┬───────────────────────────────┘
         ▼
┌────────────────────────────────────────┐
│ 3. Reproduce with tracing              │
│    NOMOS_TESTS_TRACING=true cargo test │
└────────┬───────────────────────────────┘
         ▼
┌────────────────────────────────────────┐
│ 4. Simplify scenario                   │
│    - Reduce validators                 │
│    - Remove workloads one by one       │
│    - Increase duration                 │
└────────────────────────────────────────┘

DSL Cheat Sheet

Complete Builder Reference

// ═══════════════════════════════════════════════════════════════
// TOPOLOGY
// ═══════════════════════════════════════════════════════════════

ScenarioBuilder::with_node_counts(validators, executors)

    .topology()
        .network_star()              // Star layout (hub-spoke)
        .validators(count)           // Validator count
        .executors(count)            // Executor count
        .apply()                     // Return to main builder

// ═══════════════════════════════════════════════════════════════
// WALLET SEEDING
// ═══════════════════════════════════════════════════════════════

    .wallets(user_count)             // Uniform: 100 funds/user
    .with_wallet_config(custom)      // Custom WalletConfig

// ═══════════════════════════════════════════════════════════════
// TRANSACTION WORKLOAD
// ═══════════════════════════════════════════════════════════════

    .transactions()
        .rate(txs_per_block)         // NonZeroU64
        .users(actor_count)          // NonZeroUsize
        .apply()

// ═══════════════════════════════════════════════════════════════
// DA WORKLOAD
// ═══════════════════════════════════════════════════════════════

    .da()
        .channel_rate(ops_per_block) // Channel inscriptions
        .blob_rate(blobs_per_chan)   // Blobs per channel
        .apply()

// ═══════════════════════════════════════════════════════════════
// CHAOS WORKLOAD (requires .enable_node_control())
// ═══════════════════════════════════════════════════════════════

    .enable_node_control()           // Required first!
    
    .chaos_random_restart()
        .validators(bool)            // Restart validators?
        .executors(bool)             // Restart executors?
        .min_delay(Duration)         // Min between restarts
        .max_delay(Duration)         // Max between restarts
        .target_cooldown(Duration)   // Per-node cooldown
        .apply()

// ═══════════════════════════════════════════════════════════════
// DURATION & EXPECTATIONS
// ═══════════════════════════════════════════════════════════════

    .with_run_duration(Duration)     // Clamped to ≥2 blocks
    
    .expect_consensus_liveness()     // Default 80% tolerance
    
    .with_expectation(custom)        // Add custom Expectation
    .with_workload(custom)           // Add custom Workload

// ═══════════════════════════════════════════════════════════════
// BUILD
// ═══════════════════════════════════════════════════════════════

    .build()                         // Returns Scenario<Caps>

Quick Patterns

// Minimal smoke test
ScenarioBuilder::with_node_counts(2, 0)
    .with_run_duration(Duration::from_secs(30))
    .expect_consensus_liveness()
    .build()

// Transaction throughput
ScenarioBuilder::with_node_counts(2, 0)
    .wallets(64)
    .transactions().rate(10).users(8).apply()
    .with_run_duration(Duration::from_secs(120))
    .expect_consensus_liveness()
    .build()

// DA + transactions
ScenarioBuilder::with_node_counts(1, 1)
    .wallets(64)
    .transactions().rate(5).users(4).apply()
    .da().channel_rate(1).blob_rate(1).apply()
    .with_run_duration(Duration::from_secs(180))
    .expect_consensus_liveness()
    .build()

// Chaos resilience
ScenarioBuilder::with_node_counts(3, 1)
    .enable_node_control()
    .wallets(64)
    .transactions().rate(3).users(4).apply()
    .chaos_random_restart()
        .validators(true).executors(true)
        .min_delay(Duration::from_secs(45))
        .max_delay(Duration::from_secs(75))
        .target_cooldown(Duration::from_secs(120))
        .apply()
    .with_run_duration(Duration::from_secs(300))
    .expect_consensus_liveness()
    .build()

API Quick Reference

RunContext

impl RunContext {
    // ─────────────────────────────────────────────────────────────
    // TOPOLOGY ACCESS
    // ─────────────────────────────────────────────────────────────
    
    /// Static topology configuration
    pub fn descriptors(&self) -> &GeneratedTopology;
    
    /// Live node handles (if available)
    pub fn topology(&self) -> Option<&Topology>;
    
    // ─────────────────────────────────────────────────────────────
    // CLIENT ACCESS
    // ─────────────────────────────────────────────────────────────
    
    /// All node clients
    pub fn node_clients(&self) -> &NodeClients;
    
    /// Random node client
    pub fn random_node_client(&self) -> Option<&ApiClient>;
    
    /// Cluster client with retry logic
    pub fn cluster_client(&self) -> ClusterClient<'_>;
    
    // ─────────────────────────────────────────────────────────────
    // WALLET ACCESS
    // ─────────────────────────────────────────────────────────────
    
    /// Seeded wallet accounts
    pub fn wallet_accounts(&self) -> &[WalletAccount];
    
    // ─────────────────────────────────────────────────────────────
    // OBSERVABILITY
    // ─────────────────────────────────────────────────────────────
    
    /// Block observation stream
    pub fn block_feed(&self) -> BlockFeed;
    
    /// Prometheus metrics (if configured)
    pub fn telemetry(&self) -> &Metrics;
    
    // ─────────────────────────────────────────────────────────────
    // TIMING
    // ─────────────────────────────────────────────────────────────
    
    /// Configured run duration
    pub fn run_duration(&self) -> Duration;
    
    /// Expected block count for this run
    pub fn expected_blocks(&self) -> u64;
    
    /// Full timing metrics
    pub fn run_metrics(&self) -> RunMetrics;
    
    // ─────────────────────────────────────────────────────────────
    // NODE CONTROL (CHAOS)
    // ─────────────────────────────────────────────────────────────
    
    /// Node control handle (if enabled)
    pub fn node_control(&self) -> Option<Arc<dyn NodeControlHandle>>;
}

NodeClients

impl NodeClients {
    pub fn validator_clients(&self) -> &[ApiClient];
    pub fn executor_clients(&self) -> &[ApiClient];
    pub fn random_validator(&self) -> Option<&ApiClient>;
    pub fn random_executor(&self) -> Option<&ApiClient>;
    pub fn all_clients(&self) -> impl Iterator<Item = &ApiClient>;
    pub fn any_client(&self) -> Option<&ApiClient>;
    pub fn cluster_client(&self) -> ClusterClient<'_>;
}

ApiClient

impl ApiClient {
    // Consensus
    pub async fn consensus_info(&self) -> reqwest::Result<CryptarchiaInfo>;
    
    // Network
    pub async fn network_info(&self) -> reqwest::Result<Libp2pInfo>;
    
    // Transactions
    pub async fn submit_transaction(&self, tx: &SignedMantleTx) -> reqwest::Result<()>;
    
    // Storage
    pub async fn storage_block(&self, id: &HeaderId) 
        -> reqwest::Result<Option<Block<SignedMantleTx>>>;
    
    // DA
    pub async fn balancer_stats(&self) -> reqwest::Result<BalancerStats>;
    pub async fn monitor_stats(&self) -> reqwest::Result<MonitorStats>;
    pub async fn da_get_membership(&self, session: &SessionNumber)
        -> reqwest::Result<MembershipResponse>;
    
    // URLs
    pub fn base_url(&self) -> &Url;
}

CryptarchiaInfo

pub struct CryptarchiaInfo {
    pub height: u64,      // Current block height
    pub slot: Slot,       // Current slot number
    pub tip: HeaderId,    // Tip of the chain
    // ... additional fields
}
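
As a usage sketch, the snippet below polls every validator and returns the lowest reported height, which is essentially the signal ConsensusLiveness asserts on:

// Minimal sketch: lowest height currently reported across all validators.
async fn min_validator_height(ctx: &RunContext) -> Result<u64, DynError> {
    let mut min_height = u64::MAX;
    for client in ctx.node_clients().validator_clients() {
        let info = client.consensus_info().await?;
        min_height = min_height.min(info.height);
    }
    if min_height == u64::MAX {
        return Err("no validator clients available".into());
    }
    Ok(min_height)
}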

Key Traits

#[async_trait]
pub trait Workload: Send + Sync {
    fn name(&self) -> &str;
    fn expectations(&self) -> Vec<Box<dyn Expectation>> { vec![] }
    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics) 
        -> Result<(), DynError> { Ok(()) }
    async fn start(&self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Expectation: Send + Sync {
    fn name(&self) -> &str;
    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics)
        -> Result<(), DynError> { Ok(()) }
    async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> { Ok(()) }
    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Deployer<Caps = ()>: Send + Sync {
    type Error;
    async fn deploy(&self, scenario: &Scenario<Caps>) -> Result<Runner, Self::Error>;
}

#[async_trait]
pub trait NodeControlHandle: Send + Sync {
    async fn restart_validator(&self, index: usize) -> Result<(), DynError>;
    async fn restart_executor(&self, index: usize) -> Result<(), DynError>;
}
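
As a usage sketch, a custom workload only needs name() and start(); the other methods have defaults. The example below generates no traffic and simply logs each observed block until the feed closes (a sketch, not part of the framework):

use async_trait::async_trait;

/// Sketch: a no-traffic workload that logs every block it observes.
struct BlockLoggerWorkload;

#[async_trait]
impl Workload for BlockLoggerWorkload {
    fn name(&self) -> &str {
        "block_logger"
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        let mut receiver = ctx.block_feed().subscribe();
        // Exits on the first feed error (e.g. the feed closing at end of run).
        while let Ok(record) = receiver.recv().await {
            println!(
                "observed block with {} transactions",
                record.block.transactions().len()
            );
        }
        Ok(())
    }
}

Register it with .with_workload(BlockLoggerWorkload) on the builder; since its expectations() default to an empty list, pair it with .expect_consensus_liveness() or a custom expectation.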

Glossary

Protocol Terms

| Term | Definition |
| --- | --- |
| Slot | Fixed time interval in the consensus protocol (default: 2 seconds) |
| Block | Unit of consensus; contains transactions and header |
| Active Slot Coefficient | Probability of block production per slot (default: 0.5) |
| Protocol Interval | Expected time between blocks: slot_duration / active_slot_coeff |
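
For example, with the defaults above the protocol interval works out to 2 s / 0.5 = 4 s between blocks on average.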

Framework Terms

| Term | Definition |
| --- | --- |
| Topology | Declarative description of cluster shape, roles, and parameters |
| GeneratedTopology | Concrete topology with generated configs, ports, and keys |
| Scenario | Plan combining topology + workloads + expectations + duration |
| Workload | Traffic/behavior generator during a run |
| Expectation | Post-run assertion judging success/failure |
| BlockFeed | Stream of block observations for workloads/expectations |
| RunContext | Shared context with clients, metrics, observability |
| RunMetrics | Computed timing: expected blocks, block interval, duration |
| NodeClients | Collection of API clients for validators and executors |
| ApiClient | HTTP client for node consensus, network, and DA endpoints |
| cfgsync | Dynamic configuration injection for distributed runners |

Runner Terms

| Term | Definition |
| --- | --- |
| Deployer | Creates a Runner from a Scenario |
| Runner | Manages execution: workloads, expectations, cleanup |
| RunHandle | Returned on success; holds context and cleanup |
| CleanupGuard | Ensures resources are reclaimed on drop |
| NodeControlHandle | Interface for restarting nodes (chaos) |

Part V — Scenario Recipes

Complete, copy-paste runnable scenarios.

Recipe 1: Minimal Smoke Test

Goal: Verify basic consensus works with minimal setup.

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

#[tokio::test]
async fn smoke_test_consensus() {
    // Minimal: 2 validators, no workloads, just check blocks produced
    let mut plan = ScenarioBuilder::with_node_counts(2, 0)
        .topology()
            .network_star()
            .validators(2)
            .executors(0)
            .apply()
        .with_run_duration(Duration::from_secs(30))
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}

Expected output:

[INFO] consensus_liveness: target=4, observed heights=[6, 5] ✓

Common failures:

  • height 0 below target: Nodes didn't start, check binaries exist
  • Timeout: Increase to 60s or set SLOW_TEST_ENV=true

Recipe 2: Transaction Throughput Baseline

Goal: Measure transaction inclusion under load.

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

const VALIDATORS: usize = 2;
const TX_RATE: u64 = 10;
const USERS: usize = 8;
const WALLETS: usize = 64;
const DURATION: Duration = Duration::from_secs(120);

#[tokio::test]
async fn transaction_throughput_baseline() {
    let mut plan = ScenarioBuilder::with_node_counts(VALIDATORS, 0)
        .topology()
            .network_star()
            .validators(VALIDATORS)
            .executors(0)
            .apply()
        .wallets(WALLETS)
        .transactions()
            .rate(TX_RATE)
            .users(USERS)
            .apply()
        .with_run_duration(DURATION)
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    
    let handle = runner.run(&mut plan).await.expect("scenario passed");
    
    // Optional: Check stats
    let stats = handle.context().block_feed().stats();
    println!("Total transactions included: {}", stats.total_transactions());
}

Expected output:

[INFO] tx_inclusion_expectation: 180/200 included (90%) ✓
[INFO] consensus_liveness: target=15, observed heights=[18, 17] ✓
Total transactions included: 180

Common failures:

  • observed 0 below required: Forgot .wallets()
  • Low inclusion: Reduce TX_RATE or increase DURATION

Recipe 3: DA + Transaction Combined Stress

Goal: Exercise both transaction and data-availability paths.

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
async fn da_tx_combined_stress() {
    let mut plan = ScenarioBuilder::with_node_counts(1, 1)  // Need executor for DA
        .topology()
            .network_star()
            .validators(1)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(4)
            .apply()
        .da()
            .channel_rate(2)   // 2 channel inscriptions per block
            .blob_rate(1)      // 1 blob per channel
            .apply()
        .with_run_duration(Duration::from_secs(180))
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}

Expected output:

[INFO] da_workload_inclusions: 2/2 channels inscribed ✓
[INFO] tx_inclusion_expectation: 45/50 included (90%) ✓
[INFO] consensus_liveness: target=22, observed heights=[25, 24] ✓

Common failures:

  • da workload requires at least one executor: Add executor to topology
  • Blob publish failures: Check DA balancer readiness

Recipe 4: Chaos Resilience Test

Goal: Verify system recovers from node restarts.

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::{ChaosBuilderExt as _, ScenarioBuilderExt as _};

#[tokio::test]
async fn chaos_resilience_test() {
    let mut plan = ScenarioBuilder::with_node_counts(3, 1)  // Need >1 validator for chaos
        .enable_node_control()  // Required for chaos!
        .topology()
            .network_star()
            .validators(3)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(3)  // Lower rate for stability during chaos
            .users(4)
            .apply()
        .chaos_random_restart()
            .validators(true)
            .executors(true)
            .min_delay(Duration::from_secs(45))
            .max_delay(Duration::from_secs(75))
            .target_cooldown(Duration::from_secs(120))
            .apply()
        .with_run_duration(Duration::from_secs(300))  // 5 minutes
        .expect_consensus_liveness()
        .build();

    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("chaos scenario passed");
}

Expected output:

[INFO] Restarting validator-1
[INFO] Restarting executor-0
[INFO] Restarting validator-2
[INFO] consensus_liveness: target=35, observed heights=[42, 38, 40, 39] ✓

Common failures:

  • no eligible targets: Need ≥2 validators (safety skips single validator)
  • Liveness violation: Increase target_cooldown, reduce restart frequency

Recipe 5: Docker Compose Reproducible Test

Goal: Run in containers for CI reproducibility.

use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_compose::ComposeRunner;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
#[ignore = "requires Docker"]
async fn compose_reproducible_test() {
    let mut plan = ScenarioBuilder::with_node_counts(2, 1)
        .topology()
            .network_star()
            .validators(2)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(8)
            .apply()
        .da()
            .channel_rate(1)
            .blob_rate(1)
            .apply()
        .with_run_duration(Duration::from_secs(120))
        .expect_consensus_liveness()
        .build();

    let deployer = ComposeRunner::default();
    let runner = deployer.deploy(&plan).await.expect("compose deployment");
    
    // Verify Prometheus is available
    assert!(runner.context().telemetry().is_configured());
    
    runner.run(&mut plan).await.expect("compose scenario passed");
}

Required environment:

# Build the Docker image first
docker build -t nomos-testnet:local .

# Or use custom image
export NOMOS_TESTNET_IMAGE=myregistry/nomos-testnet:v1.0

Common failures:

  • cfgsync connection refused: Check port 4400 is accessible
  • Image not found: Build or pull nomos-testnet:local

FAQ

Q: Why does chaos skip validators when only one is configured?

A: Restarting the only validator would halt consensus entirely. The framework protects against this by requiring ≥2 validators for chaos to restart validators. See RandomRestartWorkload::targets().

Q: Can I run the same scenario on different runners?

A: Yes! The Scenario is runner-agnostic. Just swap the deployer:

let plan = build_my_scenario();  // Same plan

// Local
let runner = LocalDeployer::default().deploy(&plan).await?;

// Or Compose
let runner = ComposeRunner::default().deploy(&plan).await?;

// Or K8s
let runner = K8sRunner::new().deploy(&plan).await?;

Q: How do I debug a flaky scenario?

A:

  1. Enable tracing: NOMOS_TESTS_TRACING=true
  2. Keep logs: NOMOS_TESTS_KEEP_LOGS=1
  3. Increase duration
  4. Simplify (remove workloads one by one)

Q: Why are expectations evaluated after all workloads, not during?

A: This ensures the system has reached steady state. If you need continuous assertions, implement them inside your workload using BlockFeed.

Q: How long should my scenario run?

A: See the Duration Heuristics table. Rule of thumb: enough blocks to observe your workload's effects plus margin for variability.

Q: What's the difference between Plan and Scenario?

A: In the code, ScenarioBuilder builds a Scenario. The term "plan" is informal shorthand for "fully constructed scenario ready for deployment."


Changelog

v3 (Current)

New sections:

  • 5-Minute Quickstart
  • Reading Guide by Role
  • Duration Heuristics table
  • BlockFeed Deep Dive
  • Configuration Sync (cfgsync) documentation
  • Environment Variables reference
  • Complete Scenario Recipes (5 recipes)
  • Common Expectation Mistakes table
  • Debugging Flow diagram
  • GitBook structure markers

Fixes from v2:

  • All API method names verified against codebase
  • Error messages taken from actual error types
  • Environment variables verified in source

Improvements:

  • More diagrams (timeline, readiness phases, type flow)
  • Troubleshooting with actual error messages
  • FAQ expanded with common questions