# Best Practices

This page collects proven patterns for authoring, running, and maintaining test scenarios that are reliable and maintainable and that produce actionable results.

## Scenario Design

**State your intent**

- Document the goal of each scenario (throughput, DA validation, resilience) so the choice of expectations is obvious
- Use descriptive variable names that explain the topology's purpose (e.g., `star_topology_3val_2exec` vs. `topology`)
- Add comments explaining why specific rates or durations were chosen (see the sketch below)
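
A minimal sketch of this style, reusing the `TopologyBuilder` API from the presets later on this page (the goal comment and the binding name are illustrative):

```rust
// Goal: sanity-check DA dispersal on the smallest interesting topology.
// Star topology keeps message paths simple while debugging.
let star_topology_3val_2exec = TopologyBuilder::new()
    .network_star()
    .validators(3) // minimum for Byzantine fault tolerance
    .executors(2)  // enough to exercise dispersal and sampling
    .generate();
```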

**Keep runs meaningful**

- Choose durations long enough for multiple blocks so timing-based assertions are trustworthy
- Use the [FAQ: Run Duration Calculator](faq.md#how-long-should-a-scenario-run) to estimate the minimum duration
- Avoid runs shorter than 30 seconds unless you are specifically testing startup behavior

**Separate concerns**

- Start with deterministic workloads for functional checks
- Add chaos in dedicated resilience scenarios to avoid noisy failures
- Don't mix high transaction load with aggressive chaos in the same test; it makes failures hard to debug

**Start small, scale up**

- Begin with a minimal topology (1-2 validators) to validate scenario logic
- Gradually increase topology size and workload rates
- Use the Host runner for fast iteration, then validate on Compose before production

## Code Organization

**Reuse patterns**

- Standardize on shared topology and workload presets so results are comparable across environments and teams
- Extract common topology builders into helper functions
- Create workspace-level constants for standard rates and durations

**Example: Topology preset**

```rust
pub fn standard_da_topology() -> GeneratedTopology {
    TopologyBuilder::new()
        .network_star()
        .validators(3)
        .executors(2)
        .generate()
}
```

**Example: Shared constants**

```rust
use std::time::Duration;

pub const STANDARD_TX_RATE: f64 = 10.0;
pub const STANDARD_DA_CHANNEL_RATE: f64 = 2.0;
pub const SHORT_RUN_DURATION: Duration = Duration::from_secs(60);
pub const LONG_RUN_DURATION: Duration = Duration::from_secs(300);
```
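
A hedged sketch of how the preset and constants might be wired together; `ScenarioBuilder::new` and `.build()` are stand-in names, while the chained methods mirror those shown in the anti-patterns below:

```rust
// Sketch only: build a scenario from the shared preset and constants so
// results stay comparable across teams. Exact builder names may differ.
let mut scenario = ScenarioBuilder::new(standard_da_topology())
    .transactions_with(|tx| tx.rate(STANDARD_TX_RATE))
    .with_run_duration(SHORT_RUN_DURATION)
    .build();
```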

## Debugging & Observability

**Observe first, tune second**

- Rely on liveness and inclusion signals to interpret outcomes before tweaking rates or topology
- Enable detailed logging (`RUST_LOG=debug`, `NOMOS_LOG_LEVEL=debug`) only after an initial failure
- Use `NOMOS_TESTS_KEEP_LOGS=1` to persist logs when debugging failures

**Use BlockFeed effectively**

- Subscribe to the BlockFeed in expectations for real-time block monitoring
- Track the block production rate to detect liveness issues early
- Use block statistics (`block_feed.stats().total_transactions()`) to verify inclusion, as in the sketch below
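
A minimal sketch of such a check; the `stats().total_transactions()` accessor is the one shown above, while `block_feed` and `submitted_txs` are assumed to be provided by the scenario:

```rust
// Illustrative only: after the run, confirm blocks carried the workload.
// `block_feed` and `submitted_txs` are assumed to come from the scenario.
let stats = block_feed.stats();
assert!(
    stats.total_transactions() >= submitted_txs,
    "expected at least {submitted_txs} included transactions, saw {}",
    stats.total_transactions()
);
```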

**Collect metrics**

- Set up Prometheus/Grafana via `scripts/observability/deploy.sh -t compose -a up` to visualize node behavior
- Use metrics to identify bottlenecks before adding more load
- Monitor mempool size, block size, and consensus timing

## Environment & Runner Selection

**Environment fit**

- Pick runners that match the feedback loop you need:
  - **Host**: fast iteration during development, quick CI smoke tests
  - **Compose**: reproducible environments (recommended for CI), chaos testing
  - **K8s**: production-like fidelity, large topologies (10+ nodes)

**Runner-specific considerations**

| Runner | When to Use | When to Avoid |
|--------|-------------|---------------|
| Host | Development iteration, fast feedback | Chaos testing, container-specific issues |
| Compose | CI pipelines, chaos tests, reproducibility | Very large topologies (>10 nodes) |
| K8s | Production-like testing, cluster behaviors | Local development, fast iteration |

**Minimal surprises**

- Seed only the wallets you need, and keep configuration deltas explicit when moving between CI and developer machines
- Use `versions.env` to pin node versions consistently across environments
- Document non-default environment variables in scenario comments or the README

## CI/CD Integration

**Use matrix builds**

```yaml
strategy:
  matrix:
    runner: [host, compose]
    topology: [small, medium]
```

**Cache aggressively**

- Cache Rust build artifacts (`target/`)
- Cache circuit parameters (`assets/stack/kzgrs_test_params/`)
- Cache Docker layers (use BuildKit cache)

**Collect logs on failure**

```yaml
- name: Collect logs on failure
  if: failure()
  run: |
    mkdir -p test-logs
    find /tmp -name "nomos-*.log" -exec cp {} test-logs/ \;
- uses: actions/upload-artifact@v3
  if: failure()
  with:
    name: test-logs-${{ matrix.runner }}
    path: test-logs/
```

**Time limits**

- Set a job timeout to prevent hung runs: `timeout-minutes: 30`
- Use shorter durations in CI (60s) than in local testing (300s)
- Run expensive tests (K8s, large topologies) only on the main branch or release tags

**See also:** [CI Integration](ci-integration.md) for complete workflow examples

## Anti-Patterns to Avoid

**DON'T: Run without POL_PROOF_DEV_MODE**

```bash
# BAD: will hang or time out on proof generation
cargo run -p runner-examples --bin local_runner

# GOOD: fast mode for testing
POL_PROOF_DEV_MODE=true cargo run -p runner-examples --bin local_runner
```

**DON'T: Use tiny durations**

```rust
// BAD: not enough time for blocks to propagate
.with_run_duration(Duration::from_secs(5))

// GOOD: allow multiple consensus rounds
.with_run_duration(Duration::from_secs(60))
```

**DON'T: Ignore cleanup failures**

```rust
// BAD: the next run inherits leaked state
runner.run(&mut scenario).await?;
// forgot to call cleanup or use CleanupGuard

// GOOD: clean up via a guard (runs automatically, even on panic)
let _cleanup = CleanupGuard::new(runner.clone());
runner.run(&mut scenario).await?;
```

**DON'T: Mix concerns in one scenario**

```rust
// BAD: hard to debug when it fails
.transactions_with(|tx| tx.rate(50).users(100))  // high load
.chaos_with(|c| c.restart().min_delay(...))      // AND chaos
.da_with(|da| da.channel_rate(10).blob_rate(20)) // AND DA stress

// GOOD: separate tests for each concern
// Test 1: high transaction load only
// Test 2: chaos resilience only
// Test 3: DA stress only
```

**DON'T: Hardcode paths or ports**

```rust
// BAD: breaks on different machines
let path = PathBuf::from("/home/user/circuits/kzgrs_test_params");
let port = 9000; // might conflict

// GOOD: use env vars and dynamic allocation
let path = std::env::var("NOMOS_KZGRS_PARAMS_PATH")
    .unwrap_or_else(|_| "assets/stack/kzgrs_test_params/kzgrs_test_params".to_string());
let port = get_available_tcp_port();
```

**DON'T: Ignore resource limits**

```bash
# BAD: large topology without checking resources
scripts/run/run-examples.sh -v 20 -e 10 compose
# (might OOM or exhaust ulimits)

# GOOD: scale gradually and monitor resources
scripts/run/run-examples.sh -v 3 -e 2 compose  # start small
docker stats                                   # monitor resource usage
# then increase if resources allow
```

## Scenario Design Heuristics

**Minimal viable topology**

- Consensus: 3 validators (the minimum for Byzantine fault tolerance)
- DA: 2+ executors (to test dispersal and sampling)
- Network: star topology (simplest to debug)

**Workload rate selection**

- Start with 1-5 tx/s per user, then increase
- DA: 1-2 channels and 1-3 blobs per channel initially
- Chaos: 30s+ intervals between restarts to allow recovery (see the sketch below)
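
Expressed with the workload builders from the anti-patterns section (the method shapes are borrowed from there; exact signatures may differ):

```rust
// Conservative starting point: low transaction rate, a single DA channel,
// and restarts spaced at least 30s apart so nodes can recover.
.transactions_with(|tx| tx.rate(2).users(5))
.da_with(|da| da.channel_rate(1).blob_rate(2))
.chaos_with(|c| c.restart().min_delay(Duration::from_secs(30)))
```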

**Duration guidelines**

| Test Type | Minimum Duration | Typical Duration |
|-----------|------------------|------------------|
| Smoke test | 30s | 60s |
| Integration test | 60s | 120s |
| Load test | 120s | 300s |
| Resilience test | 120s | 300s |
| Soak test | 600s (10m) | 3600s (1h) |

**Expectation selection**

| Test Goal | Expectations |
|-----------|--------------|
| Basic functionality | `expect_consensus_liveness()` |
| Transaction handling | `expect_consensus_liveness()` + custom inclusion check |
| DA correctness | `expect_consensus_liveness()` + DA dispersal/sampling checks |
| Resilience | `expect_consensus_liveness()` + recovery time measurement |

## Testing the Tests

**Validate scenarios before committing**

1. Run on the Host runner first (fast feedback)
2. Run on the Compose runner (reproducibility check)
3. Check logs for warnings or errors
4. Verify cleanup (no leaked processes or containers)
5. Run 2-3 times to check for flakiness

**Handling flaky tests**

- Increase the run duration (timing-sensitive assertions need longer runs)
- Reduce workload rates (the load might be saturating nodes)
- Check resource limits (CPU, RAM, ulimits)
- Add debugging output to identify race conditions
- Consider whether the test is over-specified (expectations too strict)

**See also:**

- [Troubleshooting](troubleshooting.md) for common failure patterns
- [FAQ](faq.md) for design decisions and gotchas