# Nomos Testing Framework — Complete Reference
> **GitBook Structure Note**: This document is organized with `<!-- FILE: path/to/file.md -->` markers indicating how to split for GitBook deployment.
---
<!-- FILE: README.md -->
# Nomos Testing Framework
A purpose-built toolkit for exercising Nomos in realistic, multi-node environments.
## Quick Links
- [5-Minute Quickstart](#5-minute-quickstart) — Get running immediately
- [Foundations](#part-i--foundations) — Core concepts and architecture
- [User Guide](#part-ii--user-guide) — Authoring and running scenarios
- [Developer Reference](#part-iii--developer-reference) — Extending the framework
- [Recipes](#part-v--scenario-recipes) — Copy-paste runnable examples
## Reading Guide by Role
| If you are... | Start with... | Then read... |
|---------------|---------------|--------------|
| **Protocol/Core Engineer** | Quickstart → Testing Philosophy | Workloads & Expectations → Recipes |
| **Infra/DevOps** | Quickstart → Runners | Operations → Configuration Sync → Troubleshooting |
| **Test Designer** | Quickstart → Authoring Scenarios | DSL Cheat Sheet → Recipes → Extending |
## Prerequisites
This book assumes:
- Rust competency (async/await, traits, cargo)
- Basic familiarity with Nomos architecture (validators, executors, DA)
- Docker knowledge (for Compose runner)
- Optional: Kubernetes access (for K8s runner)
---
<!-- FILE: quickstart.md -->
# 5-Minute Quickstart
Get a scenario running in under 5 minutes.
## Step 1: Clone and Build
```bash
# Clone the testing framework (assumes nomos-node sibling checkout)
# Note: If the testing framework lives inside the main Nomos monorepo,
# adjust the clone URL and paths accordingly.
git clone https://github.com/logos-co/nomos-testing.git
cd nomos-testing
# Build the testing framework crates
cargo build -p testing-framework-core -p testing-framework-workflows
```
> **Build modes**: Node binaries use `--release` for realistic performance. Framework crates use debug for faster iteration. For pure development speed, you can build everything in debug mode.
## Step 2: Run the Simplest Scenario
```bash
# Run a local 2-validator smoke test
cargo test --package tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture
```
## Step 3: What Good Output Looks Like
```
running 1 test
[INFO] Spawning validator 0 on port 18800
[INFO] Spawning validator 1 on port 18810
[INFO] Waiting for network readiness...
[INFO] Network ready: all peers connected
[INFO] Waiting for membership readiness...
[INFO] Membership ready for session 0
[INFO] Starting workloads...
[INFO] Transaction workload submitting at 5 tx/block
[INFO] DA workload: channel inscription submitted
[INFO] Block 1 observed: 3 transactions
[INFO] Block 2 observed: 5 transactions
...
[INFO] Workloads complete, evaluating expectations
[INFO] consensus_liveness: target=8, observed heights=[12, 11] ✓
[INFO] tx_inclusion_expectation: 42/50 included (84%) ✓
test local_runner_mixed_workloads ... ok
```
## Step 4: What Failure Looks Like
```
[ERROR] consensus_liveness violated (target=8):
- validator-0 height 2 below target 8
- validator-1 height 3 below target 8
test local_runner_mixed_workloads ... FAILED
```
Common causes: run duration too short, readiness not complete, node crashed.
## Step 5: Modify a Scenario
Open `tests/workflows/tests/local_runner.rs`:
```rust
// Change this:
const RUN_DURATION: Duration = Duration::from_secs(60);
// To this for a longer run:
const RUN_DURATION: Duration = Duration::from_secs(120);
// Or change validator count:
const VALIDATORS: usize = 3; // was 2
```
Re-run:
```bash
cargo test --package tests-workflows --test local_runner -- --nocapture
```
You're now ready to explore the framework!
---
<!-- FILE: foundations/introduction.md -->
# Part I — Foundations
## Introduction
The Nomos Testing Framework bridges the gap between small, isolated unit tests and full-system validation by letting teams:
1. **Describe** a cluster layout (topology)
2. **Drive** meaningful traffic (workloads)
3. **Assert** outcomes (expectations)
...all in one coherent, portable plan (a `Scenario` in code terms).
### Why Multi-Node Testing?
Many Nomos behaviors only emerge when multiple roles interact:
```
┌─────────────────────────────────────────────────────────────────┐
│ BEHAVIORS REQUIRING MULTI-NODE │
├─────────────────────────────────────────────────────────────────┤
│ • Block progression across validators │
│ • Data availability sampling and dispersal │
│ • Consensus under network partitions │
│ • Liveness recovery after node restarts │
│ • Transaction propagation and inclusion │
│ • Membership and session transitions │
└─────────────────────────────────────────────────────────────────┘
```
Unit tests can't catch these. This framework makes multi-node checks declarative, observable, and repeatable.
### Target Audience
| Role | Primary Concerns |
|------|------------------|
| **Protocol Engineers** | Consensus correctness, DA behavior, block progression |
| **Infrastructure/DevOps** | Runners, CI integration, logs, failure triage |
| **QA/Test Designers** | Scenario composition, workload tuning, coverage |
---
<!-- FILE: foundations/architecture.md -->
## Architecture Overview
The framework follows a clear pipeline:
```
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐
│ TOPOLOGY │───▶│ SCENARIO │───▶│ RUNNER │───▶│ WORKLOADS│───▶│EXPECTATIONS │
│ │ │ │ │ │ │ │ │ │
│ Shape │ │ Assemble │ │ Deploy & │ │ Drive │ │ Verify │
│ cluster │ │ plan │ │ wait │ │ traffic │ │ outcomes │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └─────────────┘
```
### Component Responsibilities
| Component | Responsibility | Key Types |
|-----------|----------------|-----------|
| **Topology** | Declares cluster shape: node counts, network layout, DA parameters | `TopologyConfig`, `GeneratedTopology`, `TopologyBuilder` |
| **Scenario** | Assembles topology + workloads + expectations + duration | `Scenario<Caps>`, `ScenarioBuilder` |
| **Runner** | Deploys to environment, waits for readiness, provides `RunContext` | `Runner`, `LocalDeployer`, `ComposeRunner`, `K8sRunner` |
| **Workloads** | Generate traffic/conditions during the run | `Workload` trait, `TransactionWorkload`, `DaWorkload`, `RandomRestartWorkload` |
| **Expectations** | Judge success/failure after workloads complete | `Expectation` trait, `ConsensusLiveness`, `TxInclusionExpectation` |
### Type Flow Diagram
```
TopologyConfig
    │  TopologyBuilder::new()
TopologyBuilder ──.build()──▶ GeneratedTopology
    │  contains
GeneratedNodeConfig[]
    │  Runner spawns
Topology (live nodes)
    │  provides
NodeClients
    │  wrapped in
RunContext
```
ScenarioBuilder
│ .with_workload() / .with_expectation() / .with_run_duration()
│ .build()
Scenario<Caps>
│ Deployer::deploy()
Runner
│ .run(&mut scenario)
RunHandle (success) or ScenarioError (failure)
```
---
<!-- FILE: foundations/testing-philosophy.md -->
## Testing Philosophy
### Core Principles
1. **Declarative over imperative**
- Describe desired state, let framework orchestrate
- Scenarios are data, not scripts
2. **Observable health signals**
- Prefer liveness/inclusion signals over internal debug state
- If users can't see it, don't assert on it
3. **Determinism first**
- Fixed topologies and traffic rates by default
- Variability is opt-in (chaos workloads)
4. **Protocol time, not wall time**
- Reason in blocks and slots
- Reduces host speed dependence
5. **Minimum run window**
- Always allow enough blocks for meaningful assertions
- Framework enforces minimum 2 blocks
6. **Chaos with intent**
- Chaos workloads for resilience testing only
- Avoid chaos in basic functional smoke tests; reserve it for dedicated resilience scenarios
### Testing Spectrum
```
┌────────────────────────────────────────────────────────────────┐
│ WHERE THIS FRAMEWORK FITS │
├──────────────┬────────────────────┬────────────────────────────┤
│ UNIT TESTS │ INTEGRATION │ MULTI-NODE SCENARIOS │
│ │ │ │
│ Fast │ Single process │ ◀── THIS FRAMEWORK │
│ Isolated │ Mock network │ │
│ Deterministic│ No real timing │ Real networking │
│ │ │ Protocol timing │
│ ~1000s/sec │ ~100s/sec │ ~1-10/hour │
└──────────────┴────────────────────┴────────────────────────────┘
```
---
<!-- FILE: foundations/lifecycle.md -->
## Scenario Lifecycle
### Phase Overview
```
┌─────────┐ ┌─────────┐ ┌───────────┐ ┌─────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐
│ PLAN │──▶│ DEPLOY │──▶│ READINESS │──▶│ DRIVE │──▶│ COOLDOWN │──▶│ EVALUATE │──▶│ CLEANUP │
└─────────┘ └─────────┘ └───────────┘ └─────────┘ └──────────┘ └──────────┘ └─────────┘
```
### Detailed Timeline
```
Time ──────────────────────────────────────────────────────────────────────▶
│ PLAN │ DEPLOY │ READY │ WORKLOADS │COOL│ EVAL │
│ │ │ │ │DOWN│ │
│ Build │ Spawn │ Network │ Traffic runs │ │Check │
│ scenario │ nodes │ DA │ Blocks produce │ 5× │ all │
│ │ (local/ │ Member │ │blk │expect│
│ │ docker/k8s) │ ship │ │ │ │
│ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼
t=0 t=5s t=30s t=35s t=95s t=100s t=105s
(example │
60s run) ▼
CLEANUP
```
### Phase Details
| Phase | What Happens | Code Entry Point |
|-------|--------------|------------------|
| **Plan** | Declare topology, attach workloads/expectations, set duration | `ScenarioBuilder::build()` |
| **Deploy** | Runner provisions environment | `deployer.deploy(&scenario)` |
| **Readiness** | Wait for network peers, DA balancer, membership | `wait_network_ready()`, `wait_membership_ready()`, `wait_da_balancer_ready()` |
| **Drive** | Workloads run concurrently for configured duration | `workload.start(ctx)` inside `Runner::run_workloads()` |
| **Cooldown** | Stabilization period (5× block interval, 30s min if chaos used) | Automatic in `Runner::cooldown()` |
| **Evaluate** | All expectations run; failures **aggregated** (not short-circuited) | `expectation.evaluate(ctx)` |
| **Cleanup** | Resources reclaimed via `CleanupGuard` | `Drop` impl on `Runner` |
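As a condensed sketch, the phases map onto a handful of calls (the same ones the recipes in Part V use); the test name here is illustrative:
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

#[tokio::test]
async fn lifecycle_sketch() {
    // PLAN: declare topology, workloads, expectations, and duration
    let mut plan = ScenarioBuilder::with_node_counts(2, 0)
        .with_run_duration(Duration::from_secs(60))
        .expect_consensus_liveness()
        .build();
    // DEPLOY + READINESS: the deployer spawns nodes and waits for readiness
    let runner = LocalDeployer::default().deploy(&plan).await.expect("deployment");
    // DRIVE + COOLDOWN + EVALUATE: workloads run, then expectations are checked
    let handle = runner.run(&mut plan).await.expect("scenario passed");
    // CLEANUP: resources are reclaimed when the handle/runner drop
    drop(handle);
}
```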
### Readiness Phases (Detail)
Runners perform three distinct readiness checks:
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ NETWORK │────▶│ MEMBERSHIP │────▶│ DA BALANCER │
│ │ │ │ │ │
│ libp2p peers │ │ Session 0 │ │ Dispersal peers │
│ connected │ │ assignments │ │ available │
│ │ │ propagated │ │ │
│ Timeout: 60s │ │ Timeout: 60s │ │ Timeout: 60s │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
---
<!-- FILE: guide/authoring-scenarios.md -->
# Part II — User Guide
## Authoring Scenarios
### The 5-Step Process
```
┌─────────────────────────────────────────────────────────────────┐
│ SCENARIO AUTHORING FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. SHAPE TOPOLOGY 2. ATTACH WORKLOADS │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Validators │ │ Transactions│ │
│ │ Executors │ │ DA blobs │ │
│ │ Network │ │ Chaos │ │
│ │ DA params │ └─────────────┘ │
│ └─────────────┘ │
│ │
│ 3. DEFINE EXPECTATIONS 4. SET DURATION │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Liveness │ │ See duration│ │
│ │ Inclusion │ │ heuristics │ │
│ │ Custom │ │ table below │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ 5. CHOOSE RUNNER │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Local │ │ Compose │ │ K8s │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
### Duration Heuristics
Use protocol time (blocks), not wall time. The expected block interval is `slot_duration / active_slot_coeff` (see the Glossary); with the default 2-second slots and an active slot coefficient of 0.5, that works out to roughly one block every 4 seconds on average, subject to randomness. Individual topologies may override these defaults.
| Scenario Type | Min Blocks | Recommended Duration | Notes |
|---------------|------------|---------------------|-------|
| Smoke test | 5-10 | 30-60s | Quick validation |
| Tx throughput | 20-50 | 2-3 min | Capture steady state |
| DA + tx combined | 30-50 | 3-5 min | Observe interaction |
| Chaos/resilience | 50-100 | 5-10 min | Allow restart recovery |
| Long-run stability | 100+ | 10-30 min | Trend validation |
> **Note**: The framework enforces a minimum of 2 blocks. Very short durations are clamped automatically.
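If you prefer to derive durations rather than eyeball them, a small helper along these lines works; the slot duration and coefficient passed in are illustrative defaults (see the Glossary), so read the real values from your topology configuration:
```rust
use std::time::Duration;

/// Convert a target block count into a run duration using the Glossary formula
/// `protocol_interval = slot_duration / active_slot_coeff`.
fn duration_for_blocks(target_blocks: u64, slot: Duration, active_slot_coeff: f64) -> Duration {
    let interval_secs = slot.as_secs_f64() / active_slot_coeff;
    Duration::from_secs_f64(interval_secs * target_blocks as f64)
}

fn main() {
    // e.g. 20 blocks with 2-second slots and a 0.5 coefficient ≈ 80 seconds
    let run_duration = duration_for_blocks(20, Duration::from_secs(2), 0.5);
    println!("suggested run duration: {run_duration:?}");
}
```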
### Builder Pattern Overview
```rust
ScenarioBuilder::with_node_counts(validators, executors)
    // 1. Topology sub-builder
    .topology()
        .network_star()
        .validators(n)
        .executors(n)
        .apply()                 // Returns to main builder
    // 2. Wallet seeding
    .wallets(user_count)
    // 3. Workload sub-builders
    .transactions()
        .rate(per_block)
        .users(actors)
        .apply()
    .da()
        .channel_rate(n)
        .blob_rate(n)
        .apply()
    // 4. Optional chaos (changes Caps type)
    .enable_node_control()
    .chaos_random_restart()
        .validators(true)
        .executors(true)
        .min_delay(Duration)
        .max_delay(Duration)
        .target_cooldown(Duration)
        .apply()
    // 5. Duration and expectations
    .with_run_duration(duration)
    .expect_consensus_liveness()
    // 6. Build
    .build()
```
---
<!-- FILE: guide/workloads.md -->
## Workloads
Workloads generate traffic and conditions during a scenario run.
### Available Workloads
| Workload | Purpose | Key Config | Bundled Expectation |
|----------|---------|------------|---------------------|
| **Transaction** | Submit transactions at configurable rate | `rate`, `users` | `TxInclusionExpectation` |
| **DA** | Create channels, publish blobs | `channel_rate`, `blob_rate` | `DaWorkloadExpectation` |
| **Chaos** | Restart nodes randomly | `min_delay`, `max_delay`, `target_cooldown` | None (use `ConsensusLiveness`) |
### Transaction Workload
Submits user-level transactions at a configurable rate.
```rust
.transactions()
    .rate(5)    // 5 transactions per block opportunity
    .users(8)   // Use 8 distinct wallet actors
    .apply()
```
**Requires**: Seeded wallets (`.wallets(n)`)
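A minimal pairing of wallet seeding with the transaction workload (rates and counts are illustrative; as in the recipes, the workload sub-builders come in via `tests_workflows::ScenarioBuilderExt`):
```rust
let mut plan = ScenarioBuilder::with_node_counts(2, 0)
    .wallets(16)          // seed funded wallet accounts first
    .transactions()
        .rate(5)          // 5 transactions per block opportunity
        .users(4)         // actors drawn from the seeded wallets
        .apply()
    .with_run_duration(Duration::from_secs(120))
    .expect_consensus_liveness()
    .build();
```
Forgetting `.wallets(n)` is the most common cause of a zero-inclusion failure (see Troubleshooting).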
### DA Workload
Drives data-availability paths: channel inscriptions and blob publishing.
```rust
.da()
    .channel_rate(1)    // 1 channel operation per block
    .blob_rate(1)       // 1 blob per channel
    .apply()
```
**Requires**: At least one executor for blob publishing.
### Chaos Workload
Triggers controlled node restarts to test resilience.
```rust
.enable_node_control()                          // Required capability
.chaos_random_restart()
    .validators(true)                           // Include validators
    .executors(true)                            // Include executors
    .min_delay(Duration::from_secs(45))         // Min time between restarts
    .max_delay(Duration::from_secs(75))         // Max time between restarts
    .target_cooldown(Duration::from_secs(120))  // Per-node cooldown
    .apply()
```
**Safety behavior**: If only one validator is configured, the chaos workload automatically skips validator restarts to avoid halting consensus.
**Cooldown behavior**: After chaos workloads, the runner adds a minimum 30-second cooldown before evaluating expectations.
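For fault patterns beyond random restarts, a custom workload can drive the node-control handle directly. A hedged sketch using only the `node_control()` and `restart_validator()` calls listed in the API reference (the target index is arbitrary):
```rust
// Inside a custom Workload::start(), assuming the scenario enabled node control.
async fn restart_one_validator(ctx: &RunContext) -> Result<(), DynError> {
    if let Some(control) = ctx.node_control() {
        // Restart validator #1; with a single validator this would halt
        // consensus, which is why the built-in chaos workload skips that case.
        control.restart_validator(1).await?;
    }
    Ok(())
}
```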
---
<!-- FILE: guide/expectations.md -->
## Expectations
Expectations are post-run assertions that judge success or failure.
### Available Expectations
| Expectation | Asserts | Default Tolerance |
|-------------|---------|-------------------|
| **ConsensusLiveness** | All validators reach minimum block height | 80% of expected blocks |
| **TxInclusionExpectation** | Submitted transactions appear in blocks | 50% inclusion ratio |
| **DaWorkloadExpectation** | Planned channels/blobs were included | 80% inclusion ratio |
| **PrometheusBlockProduction** | Prometheus metrics show block production | Exact minimum |
### ConsensusLiveness
The primary health check. Polls each validator's HTTP consensus info.
```rust
// With default 80% tolerance:
.expect_consensus_liveness()
// Or with specific minimum:
.with_expectation(ConsensusLiveness::with_minimum(10))
// Or with custom tolerance:
.with_expectation(ConsensusLiveness::with_tolerance(0.9))
```
> **Note for advanced users**: There are two `ConsensusLiveness` implementations in the codebase:
> - `testing_framework_workflows::ConsensusLiveness` — HTTP-based, checks heights via `consensus_info()` API. This is what `.expect_consensus_liveness()` uses.
> - `testing_framework_core::scenario::expectations::ConsensusLiveness` — Also HTTP-based but with different tolerance semantics.
>
> There's also `PrometheusBlockProduction` in core for Prometheus-based metrics checks when telemetry is configured.
### Expectation Lifecycle
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ init() │────▶│start_capture│────▶│ evaluate() │
│ │ │ () │ │ │
│ Validate │ │ Snapshot │ │ Assert │
│ prereqs │ │ baseline │ │ conditions │
│ │ │ (optional) │ │ │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
▼ ▼ ▼
At build() Before workloads After workloads
```
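A skeleton custom expectation that follows this lifecycle by snapshotting a baseline in `start_capture()` and asserting growth in `evaluate()`. The struct, its fields, and the growth check are illustrative; the trait methods and client calls are the ones documented in the API reference:
```rust
use async_trait::async_trait;

struct HeightGrowthExpectation {
    min_growth: u64,
    baseline: Vec<u64>,
}

#[async_trait]
impl Expectation for HeightGrowthExpectation {
    fn name(&self) -> &str {
        "height_growth_expectation"
    }

    async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        // Snapshot baseline tip heights before workloads start.
        for client in ctx.node_clients().validator_clients() {
            self.baseline.push(client.consensus_info().await?.height);
        }
        Ok(())
    }

    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
        // Assert conditions after workloads and cooldown complete.
        for (i, client) in ctx.node_clients().validator_clients().iter().enumerate() {
            let height = client.consensus_info().await?.height;
            let grew = height.saturating_sub(self.baseline[i]);
            if grew < self.min_growth {
                return Err(format!(
                    "validator-{i} grew {grew} blocks, expected at least {}",
                    self.min_growth
                )
                .into());
            }
        }
        Ok(())
    }
}
```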
### Common Expectation Mistakes
| Mistake | Why It Fails | Fix |
|---------|--------------|-----|
| Expecting inclusion too soon | Transactions need blocks to be included | Increase duration |
| Wall-clock timing assertions | Host speed varies | Use block counts via `RunMetrics` |
| Duration too short | Not enough blocks observed | Use duration heuristics table |
| Skipping `start_capture()` | Baseline not established | Implement if comparing before/after |
| Asserting on internal state | Framework can't observe it | Use `consensus_info()` or `BlockFeed` |
---
<!-- FILE: guide/blockfeed.md -->
## BlockFeed Deep Dive
The `BlockFeed` is the primary mechanism for observing block production during a run.
### What BlockFeed Provides
```rust
impl BlockFeed {
    /// Subscribe to receive block notifications
    pub fn subscribe(&self) -> broadcast::Receiver<Arc<BlockRecord>>;
    /// Access aggregate statistics
    pub fn stats(&self) -> Arc<BlockStats>;
}

pub struct BlockRecord {
    pub header: HeaderId,                  // Block header ID
    pub block: Arc<Block<SignedMantleTx>>, // Full block with transactions
}

impl BlockStats {
    /// Total transactions observed across all blocks
    pub fn total_transactions(&self) -> u64;
}
```
### How It Works
```
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ BlockScanner │────▶│ BlockFeed │────▶│ Subscribers │
│ │ │ │ │ │
│ Polls validator│ │ broadcast │ │ Workloads │
│ consensus_info │ │ channel │ │ Expectations │
│ every 1 second │ │ (1024 buffer) │ │ │
│ │ │ │ │ │
│ Fetches blocks │ │ Records stats │ │ │
│ via storage_ │ │ │ │ │
│ block() │ │ │ │ │
└────────────────┘ └────────────────┘ └────────────────┘
```
### Using BlockFeed in Workloads
```rust
async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();
    loop {
        match receiver.recv().await {
            Ok(record) => {
                // Process the block: annotate the target type so `.into()` resolves
                let slot: u64 = record.block.header().slot().into();
                let tx_count = record.block.transactions().len();
                println!("block at slot {slot} with {tx_count} transactions");
                // Check for specific transactions
                for tx in record.block.transactions() {
                    // ... examine transaction
                }
            }
            Err(broadcast::error::RecvError::Lagged(skipped)) => {
                // Fell behind; `skipped` messages were dropped
                println!("block feed lagged; {skipped} messages skipped");
                continue;
            }
            Err(broadcast::error::RecvError::Closed) => {
                return Err("block feed closed".into());
            }
        }
    }
}
```
### Using BlockFeed in Expectations
```rust
async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let mut receiver = ctx.block_feed().subscribe();
    let observed = Arc::new(Mutex::new(HashSet::new()));
    let observed_clone = Arc::clone(&observed);
    // Spawn background task to collect observations
    tokio::spawn(async move {
        while let Ok(record) = receiver.recv().await {
            // Record what we observe
            let mut guard = observed_clone.lock().unwrap();
            for tx in record.block.transactions() {
                guard.insert(tx.hash());
            }
        }
    });
    self.observed = Some(observed);
    Ok(())
}

async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError> {
    let observed = self.observed.as_ref().ok_or("not captured")?;
    let guard = observed.lock().unwrap();
    // Compare observed vs expected
    if guard.len() < self.expected_count {
        return Err(format!(
            "insufficient inclusions: {} < {}",
            guard.len(), self.expected_count
        ).into());
    }
    Ok(())
}
```
---
<!-- FILE: runners/local.md -->
# Runner: Local
Runs node binaries as local processes on the host.
## What It Does
- Spawns validators/executors directly on the host with ephemeral data dirs.
- Binds HTTP/libp2p ports on localhost; no containers involved.
- Fastest feedback loop; best for unit-level scenarios and debugging.
## Prerequisites
- Rust toolchain installed.
- No ports in use on the default ranges (see runner config if you need to override).
## How to Run
```bash
cargo test -p tests-workflows --test local_runner -- local_runner_mixed_workloads --nocapture
```
Adjust validator/executor counts inside the test file or via the scenario builder.
## Troubleshooting
- Port already in use → change base ports in the test or stop the conflicting process.
- Slow start on first run → binaries need to be built; reruns are faster.
- No blocks → ensure workloads are enabled and the run duration is long enough (≥60s by default).
---
<!-- FILE: runners/compose.md -->
# Runner: Docker Compose
Runs validators/executors in Docker containers using docker-compose.
## What It Does
- Builds/pulls the node image, then creates a network and one container per role.
- Uses Compose health checks for readiness, then runs workloads/expectations.
- Cleans up containers and network unless preservation is requested.
## Prerequisites
- Docker with the Compose plugin.
- Built node image available locally (default `nomos-testnet:local`).
  - Build from repo root: `testnet/scripts/build_test_image.sh`
- Optional env vars:
  - `NOMOS_TESTNET_IMAGE` (override tag)
  - `COMPOSE_NODE_PAIRS=1x1` (validators x executors)
  - `COMPOSE_RUNNER_PRESERVE=1` to keep the stack for inspection
## How to Run
```bash
POL_PROOF_DEV_MODE=true COMPOSE_NODE_PAIRS=1x1 \
cargo test -p tests-workflows compose_runner_mixed_workloads -- --nocapture
```
## Troubleshooting
- Image not found → set `NOMOS_TESTNET_IMAGE` to a built/pulled tag.
- Peers not connecting → inspect `docker compose logs` for validator/executor.
- Stack left behind → `docker compose -p <project> down` and remove the network.
---
<!-- FILE: runners/k8s.md -->
# Runner: Kubernetes
Deploys validators/executors as a Helm release into the current Kubernetes context.
## What It Does
- Builds/pulls the node image, packages Helm assets, installs into a unique namespace.
- Waits for pod readiness and validator HTTP endpoint, then drives workloads.
- Tears down the namespace unless preservation is requested.
## Prerequisites
- kubectl and Helm on PATH; a running Kubernetes cluster/context (e.g., Docker Desktop, kind).
- Docker buildx to build the node image for your arch.
- Built image tag exported:
  - Build: `testnet/scripts/build_test_image.sh` (default tag `nomos-testnet:local`)
  - Export: `export NOMOS_TESTNET_IMAGE=nomos-testnet:local`
- Optional: `K8S_RUNNER_PRESERVE=1` to keep the namespace for debugging.
## How to Run
```bash
NOMOS_TESTNET_IMAGE=nomos-testnet:local \
cargo test -p tests-workflows demo_k8s_runner_tx_workload -- --nocapture
```
## Troubleshooting
- Timeout waiting for validator HTTP → check pod logs: `kubectl logs -n <ns> deploy/validator`.
- No peers/tx inclusion → inspect rendered `/config.yaml` in the pod and cfgsync logs.
- Cleanup stuck → run `kubectl delete namespace <ns>` using the preserved namespace name.
---
<!-- FILE: guide/runners.md -->
## Runners
Runners deploy scenarios to different environments.
### Runner Decision Matrix
| Goal | Recommended Runner | Why |
|------|-------------------|-----|
| Fast local iteration | `LocalDeployer` | No container overhead |
| Reproducible e2e checks | `ComposeRunner` | Stable multi-node isolation |
| High fidelity / CI | `K8sRunner` | Real cluster behavior |
| Config validation only | Dry-run (future) | Catch errors before nodes |
### Runner Comparison
| Aspect | LocalDeployer | ComposeRunner | K8sRunner |
|--------|---------------|---------------|-----------|
| **Speed** | ⚡ Fastest | 🔄 Medium | 🏗️ Slowest |
| **Setup** | Binaries only | Docker daemon | Cluster access |
| **Isolation** | Process-level | Container-level | Pod-level |
| **Port discovery** | Direct | Auto via Docker | NodePort |
| **Node control** | Full | Via container restart | Via pod restart |
| **Observability** | Local files | Container logs | Prometheus + logs |
| **CI suitability** | Dev only | Good | Best |
### LocalDeployer
Spawns nodes as host processes.
```rust
let deployer = LocalDeployer::default();
// Or skip membership check for faster startup:
let deployer = LocalDeployer::new().with_membership_check(false);
let runner = deployer.deploy(&scenario).await?;
```
### ComposeRunner
Starts nodes in Docker containers via Docker Compose.
```rust
let deployer = ComposeRunner::default();
let runner = deployer.deploy(&scenario).await?;
```
**Uses Configuration Sync (cfgsync)** — see Operations section.
### K8sRunner
Deploys to a Kubernetes cluster.
```rust
let deployer = K8sRunner::new();
let runner = match deployer.deploy(&scenario).await {
    Ok(r) => r,
    Err(K8sRunnerError::ClientInit { .. }) => {
        // Cluster unavailable; skip the test
        return;
    }
    Err(e) => panic!("deployment failed: {e}"),
};
```
---
<!-- FILE: guide/operations.md -->
## Operations
### Prerequisites Checklist
```
□ nomos-node checkout available (sibling directory)
□ Binaries built: cargo build -p nomos-node -p nomos-executor
□ Runner platform ready:
  □ Local: binaries in target/debug/
  □ Compose: Docker daemon running
  □ K8s: kubectl configured, cluster accessible
□ KZG prover assets fetched (for DA scenarios)
□ Ports available (default ranges: 18800+, 4400 for cfgsync)
```
### Environment Variables
| Variable | Effect | Default |
|----------|--------|---------|
| `SLOW_TEST_ENV=true` | 2× timeout multiplier for all readiness checks | `false` |
| `NOMOS_TESTS_TRACING=true` | Enable debug tracing output | `false` |
| `NOMOS_TESTS_KEEP_LOGS=1` | Preserve temp directories after run | Delete |
| `NOMOS_TESTNET_IMAGE` | Docker image for Compose/K8s runners | `nomos-testnet:local` |
| `COMPOSE_RUNNER_PRESERVE=1` | Keep Compose resources after run | Delete |
| `TEST_FRAMEWORK_PROMETHEUS_PORT` | Host port for Prometheus (Compose) | `9090` |
### Configuration Synchronization (cfgsync)
When running in Docker Compose or Kubernetes, the framework uses **dynamic configuration injection** instead of static config files.
```
┌─────────────────┐ ┌─────────────────┐
│ RUNNER HOST │ │ NODE CONTAINER │
│ │ │ │
│ ┌─────────────┐ │ HTTP :4400 │ ┌─────────────┐ │
│ │ cfgsync │◀├───────────────────┤│ cfgsync │ │
│ │ server │ │ │ │ client │ │
│ │ │ │ 1. Request config │ │ │ │
│ │ Holds │ │ 2. Receive YAML │ │ Fetches │ │
│ │ generated │ │ 3. Start node │ │ config at │ │
│ │ topology │ │ │ │ startup │ │
│ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘
```
**Why cfgsync?**
- Handles dynamic port discovery
- Injects cryptographic keys
- Supports topology changes without rebuilding images
**Troubleshooting cfgsync:**
| Symptom | Cause | Fix |
|---------|-------|-----|
| Containers stuck at startup | cfgsync server unreachable | Check port 4400 is not blocked |
| "connection refused" in logs | Server not started | Verify runner started cfgsync |
| Config mismatch errors | Stale cfgsync template | Clean temp directories |
---
<!-- FILE: reference/troubleshooting.md -->
# Part IV — Reference
## Troubleshooting
### Error Messages and Fixes
#### Readiness Timeout
```
Error: readiness probe failed: timed out waiting for network readiness:
validator#0@18800: 0 peers (expected 1)
validator#1@18810: 0 peers (expected 1)
```
**Causes:**
- Nodes not fully started
- Network configuration mismatch
- Ports blocked
**Fixes:**
- Set `SLOW_TEST_ENV=true` for 2× timeout
- Check node logs for startup errors
- Verify ports are available
#### Consensus Liveness Violation
```
Error: expectations failed:
consensus liveness violated (target=8):
- validator-0 height 2 below target 8
- validator-1 height 3 below target 8
```
**Causes:**
- Run duration too short
- Node crashed during run
- Consensus stalled
**Fixes:**
- Increase `with_run_duration()`
- Check node logs for panics
- Verify network connectivity
#### Transaction Inclusion Below Threshold
```
Error: tx_inclusion_expectation: observed 15 below required 25
```
**Causes:**
- Wallet not seeded
- Transaction rate too high
- Mempool full
**Fixes:**
- Add `.wallets(n)` to scenario
- Reduce `.rate()` in transaction workload
- Increase duration for more blocks
#### Chaos Workload No Targets
```
Error: chaos restart workload has no eligible targets
```
**Causes:**
- No validators or executors configured
- Only one validator (skipped for safety)
- Chaos disabled for both roles
**Fixes:**
- Add more validators (≥2) for chaos
- Enable `.executors(true)` if executors present
- Use different workload for single-validator tests
#### BlockFeed Closed
```
Error: block feed closed while waiting for channel operations
```
**Causes:**
- Source validator crashed
- Network partition
- Run ended prematurely
**Fixes:**
- Check validator logs
- Increase run duration
- Verify readiness completed
### Log Locations
| Runner | Log Location |
|--------|--------------|
| Local | Temp directory (printed at startup), or set `NOMOS_TESTS_KEEP_LOGS=1` |
| Compose | `docker logs <container_name>` |
| K8s | `kubectl logs <pod_name>` |
### Debugging Flow
```
┌─────────────────┐
│ Scenario fails │
└────────┬────────┘
┌────────────────────────────────────────┐
│ 1. Check error message category │
│ - Readiness? → Check startup logs │
│ - Workload? → Check workload config │
│ - Expectation? → Check assertions │
└────────┬───────────────────────────────┘
┌────────────────────────────────────────┐
│ 2. Check node logs │
│ - Panics? → Bug in node │
│ - Connection errors? → Network │
│ - Config errors? → cfgsync issue │
└────────┬───────────────────────────────┘
┌────────────────────────────────────────┐
│ 3. Reproduce with tracing │
│ NOMOS_TESTS_TRACING=true cargo test │
└────────┬───────────────────────────────┘
┌────────────────────────────────────────┐
│ 4. Simplify scenario │
│ - Reduce validators │
│ - Remove workloads one by one │
│ - Increase duration │
└────────────────────────────────────────┘
```
---
<!-- FILE: reference/dsl-cheat-sheet.md -->
## DSL Cheat Sheet
### Complete Builder Reference
```rust
// ═══════════════════════════════════════════════════════════════
// TOPOLOGY
// ═══════════════════════════════════════════════════════════════
ScenarioBuilder::with_node_counts(validators, executors)
    .topology()
        .network_star()              // Star layout (hub-spoke)
        .validators(count)           // Validator count
        .executors(count)            // Executor count
        .apply()                     // Return to main builder
// ═══════════════════════════════════════════════════════════════
// WALLET SEEDING
// ═══════════════════════════════════════════════════════════════
    .wallets(user_count)             // Uniform: 100 funds/user
    .with_wallet_config(custom)      // Custom WalletConfig
// ═══════════════════════════════════════════════════════════════
// TRANSACTION WORKLOAD
// ═══════════════════════════════════════════════════════════════
    .transactions()
        .rate(txs_per_block)         // NonZeroU64
        .users(actor_count)          // NonZeroUsize
        .apply()
// ═══════════════════════════════════════════════════════════════
// DA WORKLOAD
// ═══════════════════════════════════════════════════════════════
    .da()
        .channel_rate(ops_per_block) // Channel inscriptions
        .blob_rate(blobs_per_chan)   // Blobs per channel
        .apply()
// ═══════════════════════════════════════════════════════════════
// CHAOS WORKLOAD (requires .enable_node_control())
// ═══════════════════════════════════════════════════════════════
    .enable_node_control()           // Required first!
    .chaos_random_restart()
        .validators(bool)            // Restart validators?
        .executors(bool)             // Restart executors?
        .min_delay(Duration)         // Min between restarts
        .max_delay(Duration)         // Max between restarts
        .target_cooldown(Duration)   // Per-node cooldown
        .apply()
// ═══════════════════════════════════════════════════════════════
// DURATION & EXPECTATIONS
// ═══════════════════════════════════════════════════════════════
    .with_run_duration(Duration)     // Clamped to ≥2 blocks
    .expect_consensus_liveness()     // Default 80% tolerance
    .with_expectation(custom)        // Add custom Expectation
    .with_workload(custom)           // Add custom Workload
// ═══════════════════════════════════════════════════════════════
// BUILD
// ═══════════════════════════════════════════════════════════════
    .build()                         // Returns Scenario<Caps>
```
### Quick Patterns
```rust
// Minimal smoke test
ScenarioBuilder::with_node_counts(2, 0)
    .with_run_duration(Duration::from_secs(30))
    .expect_consensus_liveness()
    .build()

// Transaction throughput
ScenarioBuilder::with_node_counts(2, 0)
    .wallets(64)
    .transactions().rate(10).users(8).apply()
    .with_run_duration(Duration::from_secs(120))
    .expect_consensus_liveness()
    .build()

// DA + transactions
ScenarioBuilder::with_node_counts(1, 1)
    .wallets(64)
    .transactions().rate(5).users(4).apply()
    .da().channel_rate(1).blob_rate(1).apply()
    .with_run_duration(Duration::from_secs(180))
    .expect_consensus_liveness()
    .build()

// Chaos resilience
ScenarioBuilder::with_node_counts(3, 1)
    .enable_node_control()
    .wallets(64)
    .transactions().rate(3).users(4).apply()
    .chaos_random_restart()
        .validators(true).executors(true)
        .min_delay(Duration::from_secs(45))
        .max_delay(Duration::from_secs(75))
        .target_cooldown(Duration::from_secs(120))
        .apply()
    .with_run_duration(Duration::from_secs(300))
    .expect_consensus_liveness()
    .build()
```
---
<!-- FILE: reference/api-reference.md -->
## API Quick Reference
### RunContext
```rust
impl RunContext {
    // ─────────────────────────────────────────────────────────────
    // TOPOLOGY ACCESS
    // ─────────────────────────────────────────────────────────────
    /// Static topology configuration
    pub fn descriptors(&self) -> &GeneratedTopology;
    /// Live node handles (if available)
    pub fn topology(&self) -> Option<&Topology>;

    // ─────────────────────────────────────────────────────────────
    // CLIENT ACCESS
    // ─────────────────────────────────────────────────────────────
    /// All node clients
    pub fn node_clients(&self) -> &NodeClients;
    /// Random node client
    pub fn random_node_client(&self) -> Option<&ApiClient>;
    /// Cluster client with retry logic
    pub fn cluster_client(&self) -> ClusterClient<'_>;

    // ─────────────────────────────────────────────────────────────
    // WALLET ACCESS
    // ─────────────────────────────────────────────────────────────
    /// Seeded wallet accounts
    pub fn wallet_accounts(&self) -> &[WalletAccount];

    // ─────────────────────────────────────────────────────────────
    // OBSERVABILITY
    // ─────────────────────────────────────────────────────────────
    /// Block observation stream
    pub fn block_feed(&self) -> BlockFeed;
    /// Prometheus metrics (if configured)
    pub fn telemetry(&self) -> &Metrics;

    // ─────────────────────────────────────────────────────────────
    // TIMING
    // ─────────────────────────────────────────────────────────────
    /// Configured run duration
    pub fn run_duration(&self) -> Duration;
    /// Expected block count for this run
    pub fn expected_blocks(&self) -> u64;
    /// Full timing metrics
    pub fn run_metrics(&self) -> RunMetrics;

    // ─────────────────────────────────────────────────────────────
    // NODE CONTROL (CHAOS)
    // ─────────────────────────────────────────────────────────────
    /// Node control handle (if enabled)
    pub fn node_control(&self) -> Option<Arc<dyn NodeControlHandle>>;
}
```
### NodeClients
```rust
impl NodeClients {
    pub fn validator_clients(&self) -> &[ApiClient];
    pub fn executor_clients(&self) -> &[ApiClient];
    pub fn random_validator(&self) -> Option<&ApiClient>;
    pub fn random_executor(&self) -> Option<&ApiClient>;
    pub fn all_clients(&self) -> impl Iterator<Item = &ApiClient>;
    pub fn any_client(&self) -> Option<&ApiClient>;
    pub fn cluster_client(&self) -> ClusterClient<'_>;
}
```
### ApiClient
```rust
impl ApiClient {
    // Consensus
    pub async fn consensus_info(&self) -> reqwest::Result<CryptarchiaInfo>;
    // Network
    pub async fn network_info(&self) -> reqwest::Result<Libp2pInfo>;
    // Transactions
    pub async fn submit_transaction(&self, tx: &SignedMantleTx) -> reqwest::Result<()>;
    // Storage
    pub async fn storage_block(&self, id: &HeaderId)
        -> reqwest::Result<Option<Block<SignedMantleTx>>>;
    // DA
    pub async fn balancer_stats(&self) -> reqwest::Result<BalancerStats>;
    pub async fn monitor_stats(&self) -> reqwest::Result<MonitorStats>;
    pub async fn da_get_membership(&self, session: &SessionNumber)
        -> reqwest::Result<MembershipResponse>;
    // URLs
    pub fn base_url(&self) -> &Url;
}
```
### CryptarchiaInfo
```rust
pub struct CryptarchiaInfo {
    pub height: u64,   // Current block height
    pub slot: Slot,    // Current slot number
    pub tip: HeaderId, // Tip of the chain
    // ... additional fields
}
```
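Tying these together, a small hedged sketch that surveys every validator's tip height from inside a workload or expectation; only `node_clients()`, `consensus_info()`, and the `height` field shown above are assumed:
```rust
async fn min_validator_height(ctx: &RunContext) -> Result<u64, DynError> {
    let mut min_height = u64::MAX;
    for client in ctx.node_clients().validator_clients() {
        let info = client.consensus_info().await?;
        min_height = min_height.min(info.height);
    }
    if min_height == u64::MAX {
        // No validator clients were available in this topology.
        return Err("no validator clients in this topology".into());
    }
    Ok(min_height)
}
```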
### Key Traits
```rust
#[async_trait]
pub trait Workload: Send + Sync {
    fn name(&self) -> &str;
    fn expectations(&self) -> Vec<Box<dyn Expectation>> { vec![] }
    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics)
        -> Result<(), DynError> { Ok(()) }
    async fn start(&self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Expectation: Send + Sync {
    fn name(&self) -> &str;
    fn init(&mut self, topology: &GeneratedTopology, metrics: &RunMetrics)
        -> Result<(), DynError> { Ok(()) }
    async fn start_capture(&mut self, ctx: &RunContext) -> Result<(), DynError> { Ok(()) }
    async fn evaluate(&mut self, ctx: &RunContext) -> Result<(), DynError>;
}

#[async_trait]
pub trait Deployer<Caps = ()>: Send + Sync {
    type Error;
    async fn deploy(&self, scenario: &Scenario<Caps>) -> Result<Runner, Self::Error>;
}

#[async_trait]
pub trait NodeControlHandle: Send + Sync {
    async fn restart_validator(&self, index: usize) -> Result<(), DynError>;
    async fn restart_executor(&self, index: usize) -> Result<(), DynError>;
}
```
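A skeleton custom workload against the `Workload` trait above; the struct and its logging-only body are illustrative. Register it with `.with_workload(...)` from the DSL cheat sheet (boxing it if the builder signature expects a `Box<dyn Workload>`):
```rust
use async_trait::async_trait;

struct BlockWatcher;

#[async_trait]
impl Workload for BlockWatcher {
    fn name(&self) -> &str {
        "block_watcher"
    }

    async fn start(&self, ctx: &RunContext) -> Result<(), DynError> {
        // Watch blocks until the feed closes (or lags); the runner stops
        // workloads once the configured run duration elapses.
        let mut receiver = ctx.block_feed().subscribe();
        while let Ok(record) = receiver.recv().await {
            println!(
                "{}: block with {} transactions",
                self.name(),
                record.block.transactions().len()
            );
        }
        Ok(())
    }
}
```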
---
<!-- FILE: reference/glossary.md -->
## Glossary
### Protocol Terms
| Term | Definition |
|------|------------|
| **Slot** | Fixed time interval in the consensus protocol (default: 2 seconds) |
| **Block** | Unit of consensus; contains transactions and header |
| **Active Slot Coefficient** | Probability of block production per slot (default: 0.5) |
| **Protocol Interval** | Expected time between blocks: `slot_duration / active_slot_coeff` |
### Framework Terms
| Term | Definition |
|------|------------|
| **Topology** | Declarative description of cluster shape, roles, and parameters |
| **GeneratedTopology** | Concrete topology with generated configs, ports, and keys |
| **Scenario** | Plan combining topology + workloads + expectations + duration |
| **Workload** | Traffic/behavior generator during a run |
| **Expectation** | Post-run assertion judging success/failure |
| **BlockFeed** | Stream of block observations for workloads/expectations |
| **RunContext** | Shared context with clients, metrics, observability |
| **RunMetrics** | Computed timing: expected blocks, block interval, duration |
| **NodeClients** | Collection of API clients for validators and executors |
| **ApiClient** | HTTP client for node consensus, network, and DA endpoints |
| **cfgsync** | Dynamic configuration injection for distributed runners |
### Runner Terms
| Term | Definition |
|------|------------|
| **Deployer** | Creates a `Runner` from a `Scenario` |
| **Runner** | Manages execution: workloads, expectations, cleanup |
| **RunHandle** | Returned on success; holds context and cleanup |
| **CleanupGuard** | Ensures resources are reclaimed on drop |
| **NodeControlHandle** | Interface for restarting nodes (chaos) |
---
<!-- FILE: recipes/index.md -->
# Part V — Scenario Recipes
Complete, copy-paste runnable scenarios.
## Recipe 1: Minimal Smoke Test
**Goal**: Verify basic consensus works with minimal setup.
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;

#[tokio::test]
async fn smoke_test_consensus() {
    // Minimal: 2 validators, no workloads, just check blocks produced
    let mut plan = ScenarioBuilder::with_node_counts(2, 0)
        .topology()
            .network_star()
            .validators(2)
            .executors(0)
            .apply()
        .with_run_duration(Duration::from_secs(30))
        .expect_consensus_liveness()
        .build();
    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}
```
**Expected output**:
```
[INFO] consensus_liveness: target=4, observed heights=[6, 5] ✓
```
**Common failures**:
- `height 0 below target`: Nodes didn't start; check that the binaries exist
- Timeout: Increase to 60s or set `SLOW_TEST_ENV=true`
---
## Recipe 2: Transaction Throughput Baseline
**Goal**: Measure transaction inclusion under load.
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

const VALIDATORS: usize = 2;
const TX_RATE: u64 = 10;
const USERS: usize = 8;
const WALLETS: usize = 64;
const DURATION: Duration = Duration::from_secs(120);

#[tokio::test]
async fn transaction_throughput_baseline() {
    let mut plan = ScenarioBuilder::with_node_counts(VALIDATORS, 0)
        .topology()
            .network_star()
            .validators(VALIDATORS)
            .executors(0)
            .apply()
        .wallets(WALLETS)
        .transactions()
            .rate(TX_RATE)
            .users(USERS)
            .apply()
        .with_run_duration(DURATION)
        .expect_consensus_liveness()
        .build();
    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    let handle = runner.run(&mut plan).await.expect("scenario passed");
    // Optional: Check stats
    let stats = handle.context().block_feed().stats();
    println!("Total transactions included: {}", stats.total_transactions());
}
```
**Expected output**:
```
[INFO] tx_inclusion_expectation: 180/200 included (90%) ✓
[INFO] consensus_liveness: target=15, observed heights=[18, 17] ✓
Total transactions included: 180
```
**Common failures**:
- `observed 0 below required`: Forgot `.wallets()`
- Low inclusion: Reduce `TX_RATE` or increase `DURATION`
---
## Recipe 3: DA + Transaction Combined Stress
**Goal**: Exercise both transaction and data-availability paths.
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
async fn da_tx_combined_stress() {
    let mut plan = ScenarioBuilder::with_node_counts(1, 1) // Need executor for DA
        .topology()
            .network_star()
            .validators(1)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(4)
            .apply()
        .da()
            .channel_rate(2) // 2 channel inscriptions per block
            .blob_rate(1)    // 1 blob per channel
            .apply()
        .with_run_duration(Duration::from_secs(180))
        .expect_consensus_liveness()
        .build();
    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("scenario passed");
}
```
**Expected output**:
```
[INFO] da_workload_inclusions: 2/2 channels inscribed ✓
[INFO] tx_inclusion_expectation: 45/50 included (90%) ✓
[INFO] consensus_liveness: target=22, observed heights=[25, 24] ✓
```
**Common failures**:
- `da workload requires at least one executor`: Add executor to topology
- Blob publish failures: Check DA balancer readiness
---
## Recipe 4: Chaos Resilience Test
**Goal**: Verify system recovers from node restarts.
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_local::LocalDeployer;
use tests_workflows::{ChaosBuilderExt as _, ScenarioBuilderExt as _};

#[tokio::test]
async fn chaos_resilience_test() {
    let mut plan = ScenarioBuilder::with_node_counts(3, 1) // Need >1 validator for chaos
        .enable_node_control()                             // Required for chaos!
        .topology()
            .network_star()
            .validators(3)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(3) // Lower rate for stability during chaos
            .users(4)
            .apply()
        .chaos_random_restart()
            .validators(true)
            .executors(true)
            .min_delay(Duration::from_secs(45))
            .max_delay(Duration::from_secs(75))
            .target_cooldown(Duration::from_secs(120))
            .apply()
        .with_run_duration(Duration::from_secs(300)) // 5 minutes
        .expect_consensus_liveness()
        .build();
    let deployer = LocalDeployer::default();
    let runner = deployer.deploy(&plan).await.expect("deployment");
    runner.run(&mut plan).await.expect("chaos scenario passed");
}
```
**Expected output**:
```
[INFO] Restarting validator-1
[INFO] Restarting executor-0
[INFO] Restarting validator-2
[INFO] consensus_liveness: target=35, observed heights=[42, 38, 40, 39] ✓
```
**Common failures**:
- `no eligible targets`: Need ≥2 validators (safety skips single validator)
- Liveness violation: Increase `target_cooldown`, reduce restart frequency
---
## Recipe 5: Docker Compose Reproducible Test
**Goal**: Run in containers for CI reproducibility.
```rust
use std::time::Duration;
use testing_framework_core::scenario::{Deployer as _, ScenarioBuilder};
use testing_framework_runner_compose::ComposeRunner;
use tests_workflows::ScenarioBuilderExt as _;

#[tokio::test]
#[ignore = "requires Docker"]
async fn compose_reproducible_test() {
    let mut plan = ScenarioBuilder::with_node_counts(2, 1)
        .topology()
            .network_star()
            .validators(2)
            .executors(1)
            .apply()
        .wallets(64)
        .transactions()
            .rate(5)
            .users(8)
            .apply()
        .da()
            .channel_rate(1)
            .blob_rate(1)
            .apply()
        .with_run_duration(Duration::from_secs(120))
        .expect_consensus_liveness()
        .build();
    let deployer = ComposeRunner::default();
    let runner = deployer.deploy(&plan).await.expect("compose deployment");
    // Verify Prometheus is available
    assert!(runner.context().telemetry().is_configured());
    runner.run(&mut plan).await.expect("compose scenario passed");
}
```
**Required environment**:
```bash
# Build the node image first (see the Compose runner prerequisites)
testnet/scripts/build_test_image.sh
# Or use custom image
export NOMOS_TESTNET_IMAGE=myregistry/nomos-testnet:v1.0
```
**Common failures**:
- `cfgsync connection refused`: Check port 4400 is accessible
- Image not found: Build or pull `nomos-testnet:local`
---
<!-- FILE: reference/faq.md -->
## FAQ
**Q: Why does chaos skip validators when only one is configured?**
A: Restarting the only validator would halt consensus entirely. The framework protects against this by requiring ≥2 validators for chaos to restart validators. See `RandomRestartWorkload::targets()`.
**Q: Can I run the same scenario on different runners?**
A: Yes! The `Scenario` is runner-agnostic. Just swap the deployer:
```rust
let plan = build_my_scenario(); // Same plan
// Local
let runner = LocalDeployer::default().deploy(&plan).await?;
// Or Compose
let runner = ComposeRunner::default().deploy(&plan).await?;
// Or K8s
let runner = K8sRunner::new().deploy(&plan).await?;
```
**Q: How do I debug a flaky scenario?**
A:
1. Enable tracing: `NOMOS_TESTS_TRACING=true`
2. Keep logs: `NOMOS_TESTS_KEEP_LOGS=1`
3. Increase duration
4. Simplify (remove workloads one by one)
**Q: Why are expectations evaluated after all workloads, not during?**
A: This ensures the system has reached steady state. If you need continuous assertions, implement them inside your workload using `BlockFeed`.
**Q: How long should my scenario run?**
A: See the [Duration Heuristics](#duration-heuristics) table. Rule of thumb: enough blocks to observe your workload's effects plus margin for variability.
**Q: What's the difference between `Plan` and `Scenario`?**
A: In the code, `ScenarioBuilder` builds a `Scenario`. The term "plan" is informal shorthand for "fully constructed scenario ready for deployment."
---
## Changelog
### v3 (Current)
**New sections:**
- 5-Minute Quickstart
- Reading Guide by Role
- Duration Heuristics table
- BlockFeed Deep Dive
- Configuration Sync (cfgsync) documentation
- Environment Variables reference
- Complete Scenario Recipes (5 recipes)
- Common Expectation Mistakes table
- Debugging Flow diagram
- GitBook structure markers
**Fixes from v2:**
- All API method names verified against codebase
- Error messages taken from actual error types
- Environment variables verified in source
**Improvements:**
- More diagrams (timeline, readiness phases, type flow)
- Troubleshooting with actual error messages
- FAQ expanded with common questions