4 Commits

Author SHA1 Message Date
E M
8f13be1dc4
chore: reduce GKE release test cluster provisioning time and cost
- Configure runners-ci node pool inline in the cluster resource instead
  of using remove_default_node_pool=true, eliminating the
  provision-then-delete cycle that added ~5 min to terraform apply
- Remove the separate infra pool; runners-ci is now the only pool on
  the critical path of cluster creation
- Set tests-pods pool min_node_count=0 so no node is provisioned at
  apply time — nodes scale up only when test pods are scheduled
- Enable spot instances on the tests-pods pool for ~60-91% cost saving
- Add 60 min job timeout to release-tests to bound hung cluster cost
- Add Terraform plugin cache keyed on the lock file to skip provider
  re-downloads on subsequent runs (~30-60s saved)
- Install gke-gcloud-auth-plugin via setup-gcloud to fix kubectl auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 09:46:59 +10:00
E M
00a6264030
chore: use zonal GKE cluster to reduce provisioning time
Switch cluster and all node pools from regional to zonal (`europe-west4-b`) to avoid the 40+ minute provisioning time of a regional (multi-zone) cluster. Adds a `zone` variable to the GKE module and cluster config, and updates the workflow's `gcloud get-credentials` call to use `--zone` instead of `--region`.
2026-04-23 17:10:08 +10:00
E M
5da037fb11
Port terraform cluster creation/destruction from digital ocean to gcp 2026-04-23 16:01:28 +10:00
E M
99eb388362
Add release tests workflow
Adds a workflow for release tests:
- builds a docker image for launching nodes in the tests (basically has additional nimflags set)
- creates a K8s cluster in Digital Ocean
- one pod in the cluster is dedicated as the test runner (uses the logos-storage-nim-cs-dist-tests:latest image)
- the release will fail if the docker image build or the release tests fail
- the K8s cluster is torn down after the tests finish (failure or not)
2026-04-23 16:01:26 +10:00