# Chaos Workloads

Chaos in the framework uses node control to introduce failures and validate recovery. The built-in restart workload lives in `testing_framework_workflows::workloads::chaos::RandomRestartWorkload`.

## How it works

- Requires `NodeControlCapability` (`enable_node_control()` in the scenario builder) and a runner that provides a `NodeControlHandle`.
- Randomly selects nodes (validators, executors) to restart, based on your include/exclude flags.
- Waits a random delay (between the configured minimum and maximum) before each restart and enforces a per-node cooldown so the same node is not flapped too frequently.
- Runs alongside other workloads; expectations should account for the added disruption.
- Support varies by runner: node control is not provided by the local runner and is not yet implemented for the k8s runner. Use a runner that advertises `NodeControlHandle` support (e.g., compose) for chaos workloads.

## Usage

```rust
use std::time::Duration;

use testing_framework_core::scenario::ScenarioBuilder;
use testing_framework_workflows::workloads::chaos::RandomRestartWorkload;

let plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(2)
        .executors(1)
})
.enable_node_control()
.with_workload(RandomRestartWorkload::new(
    Duration::from_secs(45),  // min delay
    Duration::from_secs(75),  // max delay
    Duration::from_secs(120), // target cooldown
    true,                     // include validators
    true,                     // include executors
))
.expect_consensus_liveness()
.with_run_duration(Duration::from_secs(150))
.build();

// Deploy with a runner that supports node control and run the scenario.
```

## Expectations to pair

- **Consensus liveness**: ensure blocks keep progressing despite restarts.
- **Height convergence**: optionally check that all nodes converge after the chaos window (see the convergence example below).
- Any workload-specific inclusion checks if you're also driving tx/DA traffic.

## Best practices

- Keep delays and cooldowns realistic; avoid back-to-back restarts that would never happen in production.
- Limit chaos scope: toggle validators vs. executors based on what you want to test (see the validators-only example below).
- Combine with observability: monitor metrics and logs to explain failures.
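
## Example: validators-only chaos

A minimal sketch of scoping chaos to validators, assuming the same builder API as the usage example above. The topology sizes, delays, and cooldown are illustrative choices, not recommended defaults.

```rust
use std::time::Duration;

use testing_framework_core::scenario::ScenarioBuilder;
use testing_framework_workflows::workloads::chaos::RandomRestartWorkload;

// Restart only validators: the final flag excludes executors, so the run
// isolates how consensus tolerates validator churn.
let plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(4)
        .executors(2)
})
.enable_node_control()
.with_workload(RandomRestartWorkload::new(
    Duration::from_secs(60),  // min delay
    Duration::from_secs(120), // max delay
    Duration::from_secs(180), // target cooldown
    true,                     // include validators
    false,                    // exclude executors
))
.expect_consensus_liveness()
.with_run_duration(Duration::from_secs(300))
.build();
```

Keeping executors out of scope makes failures easier to attribute: if liveness stalls, the cause is validator churn rather than executor disruption.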
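
## Example: pairing a convergence check

The expectations listed above suggest combining liveness with a post-chaos convergence check. The sketch below shows what that pairing could look like; `expect_height_convergence()` is a hypothetical method name standing in for whatever convergence expectation your framework version exposes, since this page does not name one.

```rust
use std::time::Duration;

use testing_framework_core::scenario::ScenarioBuilder;
use testing_framework_workflows::workloads::chaos::RandomRestartWorkload;

let plan = ScenarioBuilder::topology_with(|t| {
    t.network_star()
        .validators(3)
        .executors(1)
})
.enable_node_control()
.with_workload(RandomRestartWorkload::new(
    Duration::from_secs(45),  // min delay
    Duration::from_secs(75),  // max delay
    Duration::from_secs(120), // target cooldown
    true,                     // include validators
    true,                     // include executors
))
// Liveness guards progress while restarts are happening...
.expect_consensus_liveness()
// ...and a convergence check (hypothetical name) verifies that all nodes
// agree on chain height once the chaos window ends.
.expect_height_convergence()
.with_run_duration(Duration::from_secs(180))
.build();
```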