# Best Practices

This page collects proven patterns for authoring, running, and maintaining test scenarios that are reliable and maintainable and that produce actionable results.

## Scenario Design

**State your intent**

- Document the goal of each scenario (throughput, DA validation, resilience) so the choice of expectations is obvious
- Use descriptive variable names that explain the topology's purpose (e.g., `star_topology_3val_2exec` vs. `topology`)
- Add comments explaining why specific rates or durations were chosen (see the sketch below)
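
A minimal sketch of this style, reusing the `TopologyBuilder` API from the presets later on this page (the goal comment and the binding name are illustrative):

```rust
// Goal: sanity-check DA dispersal on the smallest interesting topology.
// Star topology keeps message paths simple while debugging.
let star_topology_3val_2exec = TopologyBuilder::new()
    .network_star()
    .validators(3) // minimum for Byzantine fault tolerance
    .executors(2)  // enough to exercise dispersal and sampling
    .generate();
```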

**Keep runs meaningful**

- Choose durations long enough for multiple blocks so timing-based assertions are trustworthy
- Use the [FAQ: Run Duration Calculator](faq.md#how-long-should-a-scenario-run) to estimate the minimum duration
- Avoid runs shorter than 30 seconds unless you are specifically testing startup behavior

**Separate concerns**

- Start with deterministic workloads for functional checks
- Add chaos in dedicated resilience scenarios to avoid noisy failures
- Don't mix high transaction load with aggressive chaos in the same test; it makes failures hard to debug

**Start small, scale up**

- Begin with a minimal topology (1-2 validators) to validate scenario logic
- Gradually increase topology size and workload rates
- Use the Host runner for fast iteration, then validate on Compose before production

## Code Organization

**Reuse patterns**

- Standardize on shared topology and workload presets so results are comparable across environments and teams
- Extract common topology builders into helper functions
- Create workspace-level constants for standard rates and durations

**Example: Topology preset**

```rust
pub fn standard_da_topology() -> GeneratedTopology {
    TopologyBuilder::new()
        .network_star()
        .validators(3)
        .executors(2)
        .generate()
}
```

**Example: Shared constants**

```rust
use std::time::Duration;

pub const STANDARD_TX_RATE: f64 = 10.0;
pub const STANDARD_DA_CHANNEL_RATE: f64 = 2.0;
pub const SHORT_RUN_DURATION: Duration = Duration::from_secs(60);
pub const LONG_RUN_DURATION: Duration = Duration::from_secs(300);
```
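
A hedged sketch of how the preset and constants might be wired together; `ScenarioBuilder::new` and `.build()` are stand-in names, while the chained methods mirror those shown in the anti-patterns below:

```rust
// Sketch only: build a scenario from the shared preset and constants so
// results stay comparable across teams. Exact builder names may differ.
let mut scenario = ScenarioBuilder::new(standard_da_topology())
    .transactions_with(|tx| tx.rate(STANDARD_TX_RATE))
    .with_run_duration(SHORT_RUN_DURATION)
    .build();
```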

## Debugging & Observability

**Observe first, tune second**

- Rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology
- Enable detailed logging (`RUST_LOG=debug`, `NOMOS_LOG_LEVEL=debug`) only after an initial failure
- Use `NOMOS_TESTS_KEEP_LOGS=1` to persist logs when debugging failures

**Use BlockFeed effectively**

- Subscribe to the BlockFeed in expectations for real-time block monitoring
- Track the block production rate to detect liveness issues early
- Use block statistics (`block_feed.stats().total_transactions()`) to verify inclusion, as in the sketch below
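
A minimal sketch of such a check; the `stats().total_transactions()` accessor is the one shown above, while `block_feed` and `submitted_txs` are assumed to be provided by the scenario:

```rust
// Illustrative only: after the run, confirm blocks carried the workload.
// `block_feed` and `submitted_txs` are assumed to come from the scenario.
let stats = block_feed.stats();
assert!(
    stats.total_transactions() >= submitted_txs,
    "expected at least {submitted_txs} included transactions, saw {}",
    stats.total_transactions()
);
```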

**Collect metrics**

- Set up Prometheus/Grafana via `scripts/observability/deploy.sh -t compose -a up` to visualize node behavior
- Use metrics to identify bottlenecks before adding more load
- Monitor mempool size, block size, and consensus timing

## Environment & Runner Selection

**Environment fit**

- Pick runners that match the feedback loop you need:
  - **Host**: fast iteration during development, quick CI smoke tests
  - **Compose**: reproducible environments (recommended for CI), chaos testing
  - **K8s**: production-like fidelity, large topologies (10+ nodes)

**Runner-specific considerations**

| Runner | When to Use | When to Avoid |
|--------|-------------|---------------|
| Host | Development iteration, fast feedback | Chaos testing, container-specific issues |
| Compose | CI pipelines, chaos tests, reproducibility | Very large topologies (>10 nodes) |
| K8s | Production-like testing, cluster behaviors | Local development, fast iteration |

**Minimal surprises**

- Seed only the wallets you need, and keep configuration deltas explicit when moving between CI and developer machines
- Use `versions.env` to pin node versions consistently across environments
- Document non-default environment variables in scenario comments or the README

## CI/CD Integration

**Use matrix builds**

```yaml
strategy:
  matrix:
    runner: [host, compose]
    topology: [small, medium]
```

**Cache aggressively**

- Cache Rust build artifacts (`target/`)
- Cache circuit parameters (`assets/stack/kzgrs_test_params/`)
- Cache Docker layers (use BuildKit cache)

**Collect logs on failure**

```yaml
- name: Collect logs on failure
  if: failure()
  run: |
    mkdir -p test-logs
    find /tmp -name "nomos-*.log" -exec cp {} test-logs/ \;
- uses: actions/upload-artifact@v3
  if: failure()
  with:
    name: test-logs-${{ matrix.runner }}
    path: test-logs/
```

**Time limits**

- Set a job timeout to prevent hung runs: `timeout-minutes: 30`
- Use shorter durations in CI (60s) than in local testing (300s)
- Run expensive tests (K8s, large topologies) only on the main branch or release tags

**See also:** [CI Integration](ci-integration.md) for complete workflow examples

## Anti-Patterns to Avoid

**DON'T: Run without POL_PROOF_DEV_MODE**

```bash
# BAD: will hang or time out on proof generation
cargo run -p runner-examples --bin local_runner

# GOOD: fast mode for testing
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
```

**DON'T: Use tiny durations**

```rust
// BAD: not enough time for blocks to propagate
.with_run_duration(Duration::from_secs(5))

// GOOD: allow multiple consensus rounds
.with_run_duration(Duration::from_secs(60))
```

**DON'T: Ignore cleanup failures**

```rust
// BAD: the next run inherits leaked state
runner.run(&mut scenario).await?;
// forgot to call cleanup or use CleanupGuard

// GOOD: clean up via a guard (runs automatically, even on panic)
let _cleanup = CleanupGuard::new(runner.clone());
runner.run(&mut scenario).await?;
```

**DON'T: Mix concerns in one scenario**

```rust
// BAD: hard to debug when it fails
.transactions_with(|tx| tx.rate(50).users(100))  // high load
.chaos_with(|c| c.restart().min_delay(...))      // AND chaos
.da_with(|da| da.channel_rate(10).blob_rate(20)) // AND DA stress

// GOOD: separate tests for each concern
// Test 1: high transaction load only
// Test 2: chaos resilience only
// Test 3: DA stress only
```

**DON'T: Hardcode paths or ports**

```rust
// BAD: breaks on different machines
let path = PathBuf::from("/home/user/circuits/kzgrs_test_params");
let port = 9000; // might conflict

// GOOD: use env vars and dynamic allocation
let path = std::env::var("NOMOS_KZGRS_PARAMS_PATH")
    .unwrap_or_else(|_| "assets/stack/kzgrs_test_params/kzgrs_test_params".to_string());
let port = get_available_tcp_port();
```

**DON'T: Ignore resource limits**

```bash
# BAD: large topology without checking resources
scripts/run/run-examples.sh -v 20 -e 10 compose
# (might OOM or exhaust ulimits)

# GOOD: scale gradually and monitor resources
scripts/run/run-examples.sh -v 3 -e 2 compose  # start small
docker stats                                   # monitor resource usage
# then increase if resources allow
```

## Scenario Design Heuristics

**Minimal viable topology**

- Consensus: 3 validators (the minimum for Byzantine fault tolerance)
- DA: 2+ executors (to test dispersal and sampling)
- Network: star topology (simplest to debug)

**Workload rate selection**

- Start with 1-5 tx/s per user, then increase
- DA: 1-2 channels and 1-3 blobs per channel initially
- Chaos: 30s+ intervals between restarts to allow recovery (see the sketch below)
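
Expressed with the workload builders from the anti-patterns section (the method shapes are borrowed from there; exact signatures may differ):

```rust
// Conservative starting point: low transaction rate, a single DA channel,
// and restarts spaced at least 30s apart so nodes can recover.
.transactions_with(|tx| tx.rate(2).users(5))
.da_with(|da| da.channel_rate(1).blob_rate(2))
.chaos_with(|c| c.restart().min_delay(Duration::from_secs(30)))
```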

**Duration guidelines**

| Test Type | Minimum Duration | Typical Duration |
|-----------|------------------|------------------|
| Smoke test | 30s | 60s |
| Integration test | 60s | 120s |
| Load test | 120s | 300s |
| Resilience test | 120s | 300s |
| Soak test | 600s (10m) | 3600s (1h) |

**Expectation selection**

| Test Goal | Expectations |
|-----------|--------------|
| Basic functionality | `expect_consensus_liveness()` |
| Transaction handling | `expect_consensus_liveness()` + custom inclusion check |
| DA correctness | `expect_consensus_liveness()` + DA dispersal/sampling checks |
| Resilience | `expect_consensus_liveness()` + recovery time measurement |

## Testing the Tests

**Validate scenarios before committing**

1. Run on the Host runner first (fast feedback)
2. Run on the Compose runner (reproducibility check)
3. Check logs for warnings or errors
4. Verify cleanup (no leaked processes or containers)
5. Run 2-3 times to check for flakiness

**Handling flaky tests**

- Increase the run duration (timing-sensitive assertions need longer runs)
- Reduce workload rates (the load might be saturating nodes)
- Check resource limits (CPU, RAM, ulimits)
- Add debugging output to identify race conditions
- Consider whether the test is over-specified (expectations too strict)

**See also:**

- [Troubleshooting](troubleshooting.md) for common failure patterns
- [FAQ](faq.md) for design decisions and gotchas