18 Commits

Author SHA1 Message Date
E M
2a94718b37
Increase runner node disk space
Attempt to avoid insufficient disk space errors in tests
2026-05-01 12:47:29 +10:00
E M
a9cb218d59
Keep 1 node alive so autoscaler doesn't scale to 0
Should help speed up startup, avoiding errors like "pod couldn't be scheduled"
2026-05-01 11:36:53 +10:00
E M
c470b3e102
use on-demand VMs instead of spot instances
Attempting to fix a lot of errors in the console relating to spot instances being unschedulable.
2026-04-30 18:35:46 +10:00
E M
2872a2800f
Move from a single zone to multiple zones to increase spot instance availability
2026-04-30 17:59:36 +10:00
E M
c889c0283f
Reduce nodes in pool from 10 to 5
Reduces resource contention. 2 parallel tests x 10 containers => 2-3 nodes needed, 5 gives room
2026-04-30 17:58:56 +10:00
E M
5da74edda0
cap boot drive size to 20 GB (default is 100 GB) to avoid resource exhaustion
2026-04-29 21:26:47 +10:00
E M
4fd5bdca92
refactor: remove allow-tests-pods node label from GKE node pools
The `allow-tests-pods` boolean label was used by the test framework to steer pods away from runner nodes via a node affinity exclusion. Pod scheduling now uses the existing `workload-type` label directly as a nodeSelector, making the boolean label redundant.
2026-04-29 16:46:24 +10:00
E M
6c86e8a9ed
set cluster creation timeout to 20 minutes
temporary timeout so we can see if the latest commits work without waiting too long between tries
2026-04-28 17:57:15 +10:00
E M
6bc28b68d7
change monitoring to default service
Cluster deployment seems to stall because the metrics service does not start, so return monitoring to the default service to see whether that fixes the issue.
2026-04-28 17:50:43 +10:00
E M
f2b26ae5eb
inline node pools so they can be created in parallel
speeds up cluster creation
2026-04-28 16:46:22 +10:00
E M
70ae988c9b
remove unneeded setup
2026-04-28 16:28:04 +10:00
E M
8f13be1dc4
chore: reduce GKE release test cluster provisioning time and cost
- Configure runners-ci node pool inline in the cluster resource instead
  of using remove_default_node_pool=true, eliminating the
  provision-then-delete cycle that added ~5 min to terraform apply
- Remove the separate infra pool; runners-ci is now the only pool on
  the critical path of cluster creation
- Set tests-pods pool min_node_count=0 so no node is provisioned at
  apply time — nodes scale up only when test pods are scheduled
- Enable spot instances on the tests-pods pool for ~60-91% cost saving
- Add 60 min job timeout to release-tests to bound hung cluster cost
- Add Terraform plugin cache keyed on the lock file to skip provider
  re-downloads on subsequent runs (~30-60s saved)
- Install gke-gcloud-auth-plugin via setup-gcloud to fix kubectl auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 09:46:59 +10:00
E M
00a6264030
chore: use zonal GKE cluster to reduce provisioning time
Switch cluster and all node pools from regional to zonal (`europe-west4-b`) to avoid the 40+ minute provisioning time of a regional (multi-zone) cluster. Adds a `zone` variable to the GKE module and cluster config, and updates the workflow's `gcloud get-credentials` call to use `--zone` instead of `--region`.
2026-04-23 17:10:08 +10:00
E M
9a45095a98
rename cluster to match previous change
2026-04-23 16:01:28 +10:00
E M
5cb9cc3379
reduce length of cluster name
2026-04-23 16:01:28 +10:00
E M
5da037fb11
Port Terraform cluster creation/destruction from DigitalOcean to GCP
2026-04-23 16:01:28 +10:00
E M
661308deb1
chore: rename Codex references to Logos Storage in release tests
Replace all "Codex" branding in the release test workflow and supporting
files: rename the K8s cluster, Terraform state key, secret, log paths,
env var (CODEXDOCKERIMAGE → STORAGEDOCKERIMAGE), and test runner image
(cs-codex-dist-tests → logos-storage-dist-tests) to align with the
already-updated logos-storage-nim-cs-dist-tests repo in https://github.com/logos-storage/logos-storage-nim-cs-dist-tests/pull/124. Also fix the
dotnet test path to the correct Tests/LogosStorageReleaseTests directory.
2026-04-23 16:01:28 +10:00
E M
99eb388362
Add release tests workflow
Adds a workflow for release tests:
- builds a docker image for launching nodes in the tests (basically has additional nimflags set)
- creates a K8s cluster in DigitalOcean
- one pod in the cluster is dedicated as the test runner (uses the logos-storage-nim-cs-dist-tests:latest image)
- the release will fail if the docker image build or the release tests fail
- the K8s cluster is torn down after the tests finish (failure or not)
2026-04-23 16:01:26 +10:00