Chrysostomos Nanakos
b59ddaf648
feat(k8s): add DigitalOcean (DO) Kubernetes cluster Terraform configuration
Add Terraform configuration to provision a DigitalOcean Kubernetes cluster
with auto-scaling node pools for running Codex benchmarks.
Signed-off-by: Chrysostomos Nanakos <chris@include.gr>
2025-10-21 18:17:26 +03:00
Chrysostomos Nanakos
200c749cb5
feat(workflows): add Vector log parsing workflow template and synchronization
Add workflow template for parsing logs collected by Vector from Kubernetes
pods, with semaphore synchronization to prevent concurrent access conflicts.
- log-parsing-workflow-template-vector: new workflow template that scales
down the Vector aggregator to access the RWO PVC, parses JSONL logs, then
scales the aggregator back up
- vector-log-parsing-semaphore: ConfigMap semaphore limiting execution to one
log-parsing workflow at a time (prevents RWO PVC mount conflicts)
- codex-workflows-rbac: add deployment get/patch/update permissions to the
executor role (required for scaling the Vector aggregator)
Signed-off-by: Chrysostomos Nanakos <chris@include.gr>
2025-10-21 13:25:00 +03:00
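Note on 200c749cb5: the ordering described above (take the single parsing slot, scale the aggregator down, parse the JSONL logs, scale back up) can be sketched as follows. This is an illustrative sketch only, not the workflow template itself. The names `run_log_parsing`, `aggregator_scaled_down` and the `scale` callable are hypothetical, an in-process semaphore stands in for the ConfigMap semaphore, and restoring to one replica is an assumption.

```python
import json
import threading
from contextlib import contextmanager

# Hypothetical stand-in for the ConfigMap semaphore: one parser at a time.
_parse_slot = threading.Semaphore(1)


@contextmanager
def aggregator_scaled_down(scale):
    """Scale the aggregator to 0 replicas and restore it afterwards.

    `scale` is a hypothetical callable that takes the desired replica count.
    The restore runs in `finally`, so it also happens when parsing fails.
    """
    scale(0)
    try:
        yield
    finally:
        scale(1)  # assumption: the aggregator normally runs one replica


def parse_jsonl(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def run_log_parsing(scale, lines):
    """Hold the parsing slot, scale down, parse, then scale back up."""
    with _parse_slot:
        with aggregator_scaled_down(scale):
            return parse_jsonl(lines)
```

Passing `scale` in as a callable keeps the sketch independent of any particular Kubernetes client; in the real workflow the equivalent step would patch the aggregator deployment.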
Chrysostomos Nanakos
0c8e28fa46
feat(k8s): add Vector logging infrastructure for benchmarks
Add Vector agent/aggregator deployment for collecting logs from Codex
benchmark experiments in K8s. Includes PVC for log storage, S3 secret
template and RBAC.
Vector collects logs from benchmark pods and writes JSONL files for
post-processing by the log-parsing workflow.
Signed-off-by: Chrysostomos Nanakos <chris@include.gr>
2025-10-21 13:13:49 +03:00
Chrysostomos Nanakos
8d11207e73
fix(k8s): make codex-node memory resources conditional
Make the resources block for the codex-node container conditional on
experiment.memory being set and non-empty. If codexMemory is not
provided in the workflow parameters or is set to an empty string,
no resource limits are set on the pod.
Signed-off-by: Chrysostomos Nanakos <chris@include.gr>
2025-10-20 11:46:37 +03:00
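Note on 8d11207e73: the conditional described above amounts to leaving the resources block out whenever the memory value is missing or empty. A minimal sketch of that rule, assuming a hypothetical helper `codex_node_container` that builds the container spec as a plain dict (the actual change lives in the chart/workflow templates, not in Python):

```python
def codex_node_container(memory=None):
    """Build a container spec; add a memory limit only when one is given.

    `memory` mirrors the codexMemory workflow parameter; None and "" are
    both treated as "not set".
    """
    container = {"name": "codex-node"}
    if memory:  # falsy for None and for the empty string
        container["resources"] = {"limits": {"memory": memory}}
    return container
```

With this rule, `codex_node_container("")` and `codex_node_container()` yield the same spec, so an empty parameter never produces an invalid, empty limit.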
gmega
04828514e4
feat: port bug fixes and features from swarm branch
2025-06-09 20:06:19 -03:00
gmega
4cbb401d12
feat: add optional data removal with adjusted quotas
2025-06-09 19:59:00 -03:00
gmega
67ca362ee7
misc: minor refactor, add simple network perf test deploy
2025-04-16 12:56:34 -03:00
gmega
b2491c26f9
fix: correct workflow expressions
2025-02-27 18:49:48 -03:00
gmega
81cda58a9d
feat: add download speed plot, dedup experiment datasets
2025-02-27 18:47:36 -03:00
gmega
a366f04e7c
feat: allow re-running failed experiments from previous workflow runs
2025-02-25 12:14:15 -03:00
gmega
5a9543259b
feat: add support for region k8s annotations
2025-02-24 14:16:59 -03:00
gmega
8dbc3faed8
feat: add tunable parallelism
2025-02-23 11:33:49 -03:00
gmega
73219922f6
feat: add Codex chart values for cluster experiments
2025-02-20 12:16:05 -03:00
gmega
48e71a315a
feat: add support for setting the node tag in benchmark workflow
2025-02-20 12:14:49 -03:00
gmega
688091c965
feat: allow use of custom runner and node tags for Codex
2025-02-20 11:59:24 -03:00
gmega
a8c19364b7
fix: Minikube env param in workflow
2025-02-20 10:21:45 -03:00
gmega
0d08814929
feat: generalize benchmark workflow to run Codex in addition to Deluge
2025-02-17 10:44:00 -03:00
gmega
38434f4590
fix: container label for Codex experiment runner
2025-02-14 15:59:59 -03:00
gmega
e8441b7bea
fix: respect logger increments even when stream returns less data than expected
2025-02-14 15:59:28 -03:00
gmega
f7adf878eb
feat: add memory parameter to Deluge values file
2025-02-14 14:30:56 -03:00
gmega
205f926f89
feat: add stable bootstrap node
2025-02-14 14:30:18 -03:00
gmega
f336df8da7
fix: adjust Codex logging cooldown, insert polling backoff on download completion, define default Codex experiment
2025-02-14 12:14:52 -03:00
gmega
68ee1bad87
feat: add working Codex helm chart
2025-02-14 11:00:17 -03:00
gmega
74ee71889e
feat: add Codex node and initial integration tests
2025-02-04 19:18:58 -03:00
gmega
99992d2e7e
fix: enable cleanup on failure by default
2025-02-03 15:46:26 -03:00
gmega
61f2172304
feat: add workflow for the final experiment
2025-01-30 11:48:09 -03:00
gmega
94893c0f93
fix: conditional expression for cleanup
2025-01-29 20:35:26 -03:00
gmega
a29c010e7a
feat: allow keeping pods around on failure, add optional log parsing at end of experiment run
2025-01-29 08:47:01 -03:00
gmega
7ed29ddb4c
fix: add RAM settings on Deluge node
2025-01-28 20:33:13 -03:00
gmega
1b83f8047c
feat: update RBAC for Codex workflows
2025-01-28 18:20:47 -03:00
gmega
ee67a92726
feat: grant Codex runner permissions to launch subworkflows
2025-01-27 18:07:56 -03:00
gmega
ba1b93d77c
feat: add structured experiment iteration logs
2025-01-27 17:26:09 -03:00
gmega
90dda4f932
fix: add -C so tars do not include parent folders
2025-01-24 19:19:54 -03:00
gmega
4d4d06e7a9
feat: add log parsing workflow with upload to Hetzner storage bucket
2025-01-24 18:28:28 -03:00
gmega
fdac384ad8
fix: add autoscaler eviction annotations to prevent pods from being relocated mid-experiment
2025-01-23 12:12:42 -03:00
gmega
a9b9fd8332
fix: add quotation so Argo does not mangle the value array
2025-01-23 08:06:43 -03:00
gmega
8096c9f4e0
feat: add ordering to parameter matrix expander
2025-01-22 17:12:46 -03:00
gmega
d70b87d2bb
fix: production values for Argo workflows and RBAC
2025-01-22 10:31:08 -03:00
gmega
aeb2f044c8
chore: remove leftover values from chart
2025-01-20 20:01:48 -03:00
gmega
882392bef2
fix: add missing parameters to cleanup hook
2025-01-20 18:41:11 -03:00
gmega
6ae5b1620f
chore: add missing EOL
2025-01-20 17:59:07 -03:00
gmega
7e07eda3c2
feat: allow running workflows from locally loaded images under Minikube
2025-01-20 17:57:21 -03:00
gmega
5a203fad18
chore: eliminate 5GB experiment for now
2025-01-20 15:29:27 -03:00
gmega
ab100c4841
feat: runnable experiment with working test runner and agents
2025-01-20 15:24:03 -03:00
gmega
94556d7a53
working deployment of agents on Minikube
2025-01-20 11:39:43 -03:00
gmega
60fd274b18
feat: add node affinity/anti-affinity and storage class knobs to run this on a cluster
2025-01-15 11:52:32 -03:00
gmega
fc0630224f
fix: remove redundant group suffix from node ID
2025-01-10 16:31:12 -03:00
gmega
b505e7a3e1
fix: fix README link, add missing pre-commit config, bump ruff
2025-01-09 16:48:44 -03:00
gmega
bfabd1c4c8
feat: label components with /component label, use /name to refer to benchmark pods; add README
2025-01-09 09:27:21 -03:00
gmega
a4fe12e620
feat: add new Helm chart parameters to workflow
2025-01-08 16:43:01 -03:00