# FFI benchmarks This directory holds Nim micro/stress benchmarks. Neither is part of `nimble test`. - `bench_codec.nim` — `cbor` vs `c` (cwire) wire-format codec microbenchmark (documented below). Pure measurement, not a gate. - `bench_ffi_submit.nim` — concurrent-submit stress test + throughput benchmark for `sendRequestToFFIThread` (documented next). Carries a **scaling gate** that fails CI until the per-request submit lock is replaced. ## `sendRequestToFFIThread` concurrent-submit stress / throughput `bench_ffi_submit.nim` motivates [issue #90](https://github.com/logos-messaging/nim-ffi/issues/90): every foreign-thread call serialises the whole `trySend + reqSignal.fireSync + reqReceivedSignal.waitSync` cycle under a single `ctx.lock`. The lock is load-bearing because `reqChannel` is single-slot and the accept handshake waits on a *shared* `reqReceivedSignal`, so producers cannot overlap. The bench fans **K producer threads (1 → 8)** at one context, each firing the same per-thread volume of no-op requests. It times the **submit phase only** — from the start gate until every producer returns from its last `sendRequestToFFIThread` — because that is the path the fix parallelises; completion is bounded by the single FFI thread and deliberately excluded. Each thread count runs `FFI_SUBMIT_ITERS` times (default 5) and the **median** submit/sec is reported, so run-to-run noise can't move the verdict. It is also a correctness stress test: the aggregate callback count must match the submit count **exactly** (no drops or double-fires), with zero submit errors and (under asan/lsan/tsan) zero leaks or races. ```sh nimble bench_ffi_submit # smaller / faster (handy under sanitizers — they distort timing, so disable the gate): FFI_SUBMIT_PER_THREAD=2000 FFI_SUBMIT_ITERS=1 FFI_SCALING_GATE=0 nimble bench_ffi_submit # under a sanitizer (proves no leaks/races; gate off — see below): NIM_FFI_SAN=tsan FFI_SUBMIT_PER_THREAD=2000 FFI_SCALING_GATE=0 nimble bench_ffi_submit ``` Env knobs: `FFI_SUBMIT_PER_THREAD` (volume per producer, default 20000), `FFI_SUBMIT_ITERS` (median sample count, default 5), `FFI_SCALING_GATE` (default `1`; set `0` to report numbers without failing). ### Scaling gate — red until the lock is replaced By default the bench **fails** (non-zero exit) unless submit throughput at 8 threads is at least `1.5x` the 1-thread rate. This is a forcing function: it cannot pass while `sendRequestToFFIThread` holds `ctx.lock` across the synchronous `reqReceivedSignal` accept, because that serialises every submit no matter how many producers run. Baseline measured 2026-06-24 (16-core Linux, orc, `-d:danger`, median of 5): submit scaling held at **0.98–1.16x** across threads — flat, as the lock dictates. `1.5x` sits above that noise ceiling (so the lock-bound code fails reliably) and well below the `>=2x` that parallel lock-free MPSC ingress yields on any multicore host (so the fix clears it with margin). Once it lands and this turns green, keep the gate as a regression guard. The gate runs in the non-sanitized **Submit Scaling Gate** CI job (`.github/workflows/ci.yml`); the sanitized jobs run the same bench with `FFI_SCALING_GATE=0` for leak/race coverage only, since sanitizer instrumentation makes throughput scaling meaningless. # FFI wire-format codec benchmark `bench_codec.nim` is a single-process Nim microbenchmark comparing the two FFI wire-format codecs head-to-head on identical payloads: - **cbor** — `cborEncode` / `cborDecode`, self-describing bytes over `seq[byte]`. The codec the `cbor` ABI uses on every boundary crossing. - **c (cwire)** — `cwirePack` / `cwireUnpack` / `cwireFree`, flat C-struct shared-memory packing. The codec the `c` ABI uses, emitted for every `{.ffi: "abi = c".}` type as its `_CWire` companion. Both paths run in the same process on the same values, so the numbers isolate **codec cost only** — no thread hop, no callback dispatch, no chronos work. The full FFI round-trip (thread channel + callback) is identical for both ABIs, so the codec is where the ABI difference actually lives. ## Running ```sh nimble bench_codec # or directly, with the size sweep extended to 1 MiB: nim c -r --mm:orc -d:danger tests/bench/bench_codec.nim --include-1mib ``` Build with `-d:danger` (the nimble task does) so the figures reflect optimized codegen rather than a debug build. ## Payload shapes covered | Type | Shape | |------------------|----------------------------------------------------| | `EchoRequest` | 1 string + 1 int (small struct) | | `EchoResponse` | 2 strings | | `ComplexRequest` | `seq[EchoRequest]`, `seq[string]`, `Option[...]` | | `BytesPayload` | `seq[byte]`, swept 100 B → 150 KiB (`--include-1mib` adds 1 MiB) | ## Interpreting Small structs are dominated by CBOR's per-field tag/length framing, so cwire wins by a large factor. As payloads grow into big `seq[byte]` blobs, both codecs become `memcpy`-bound and the ratio converges toward ~1×. The byte-blob sweep also reports throughput (MiB/s) for each codec. > Note: this benchmark exercises the `c` ABI **codec** (the cwire companions on > the Nim side). Wiring the `c` ABI through the full proc-dispatch path and the > foreign (C++/Rust) generators is tracked separately; only `cbor` currently > generates working end-to-end bindings.