When the workflow was cancelled, either manually or automatically because a long-running step timed out, the Terraform state lock had to be deleted by hand; otherwise the next workflow run would never succeed. This change ensures that the state lock file is always deleted after each run.
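One way such a cleanup could look is sketched below. It assumes the state is kept in a GCS backend, where the lock is an object named `<prefix>/<workspace>.tflock`. The function names, bucket, and prefix are hypothetical, and the actual workflow may do this differently.

```shell
# Hypothetical helper: print the GCS object path of the Terraform lock file.
# Assumes a GCS backend, which stores the lock as <prefix>/<workspace>.tflock.
tf_lock_object() {
  bucket="$1"
  prefix="$2"
  workspace="${3:-default}"
  printf 'gs://%s/%s/%s.tflock\n' "$bucket" "$prefix" "$workspace"
}

# Delete the lock object if it exists. A missing lock is not treated as an
# error, so the cleanup itself never fails the job.
release_tf_lock() {
  gsutil -q rm "$(tf_lock_object "$@")" 2>/dev/null || true
}
```

Calling `release_tf_lock <bucket> <prefix>` from a final workflow step guarded by `if: always()` would run the cleanup after cancelled and timed-out runs as well as successful ones.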
The `allow-tests-pods` boolean label was used by the test framework to steer pods away from runner nodes via a node affinity exclusion. Pod scheduling now uses the existing `workload-type` label directly as a nodeSelector, making the boolean label redundant.
`STORAGEDOCKERIMAGE` now resolves to:
- `logosstorage/logos-storage-nim:latest-dist-tests` for a `workflow_dispatch` on a branch
- `logosstorage/logos-storage-nim:v0.1.8-dist-tests` for a `v0.1.8` tag push
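The mapping above could be computed along these lines. This is a sketch rather than the workflow's actual logic: the function name `storage_docker_image` and its two arguments (modelled on the `github.ref_type` and `github.ref_name` values in GitHub Actions) are assumptions.

```shell
# Hypothetical helper: pick the image tag from the kind of ref that
# triggered the run.
#   $1: ref type, "tag" or "branch"
#   $2: ref name, e.g. "v0.1.8" or "master"
storage_docker_image() {
  ref_type="$1"
  ref_name="$2"
  repo="logosstorage/logos-storage-nim"
  if [ "$ref_type" = "tag" ]; then
    # Tag pushes use the tag name itself as the image version.
    printf '%s:%s-dist-tests\n' "$repo" "$ref_name"
  else
    # Branch runs (e.g. workflow_dispatch) fall back to the latest image.
    printf '%s:latest-dist-tests\n' "$repo"
  fi
}
```

For example, `storage_docker_image tag v0.1.8` prints the second image listed above, and `storage_docker_image branch master` prints the first.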
- Configure the runners-ci node pool inline in the cluster resource instead
  of using remove_default_node_pool=true, eliminating the
  provision-then-delete cycle that added ~5 min to terraform apply
- Remove the separate infra pool; runners-ci is now the only pool on
  the critical path of cluster creation
- Set tests-pods pool min_node_count=0 so no node is provisioned at
  apply time; nodes scale up only when test pods are scheduled
- Enable spot instances on the tests-pods pool for ~60-91% cost saving
- Add 60 min job timeout to release-tests to bound hung cluster cost
- Add Terraform plugin cache keyed on the lock file to skip provider
  re-downloads on subsequent runs (~30-60s saved)
- Install gke-gcloud-auth-plugin via setup-gcloud to fix kubectl auth
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch the cluster and all node pools from regional to zonal (`europe-west4-b`) to avoid the 40+ minute provisioning time of a regional (multi-zone) cluster. Add a `zone` variable to the GKE module and cluster config, and update the workflow's `gcloud container clusters get-credentials` call to use `--zone` instead of `--region`.
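The credentials call for a zonal cluster might look like the sketch below. The wrapper name `get_gke_credentials` and the cluster name passed to it are hypothetical; only the zone value comes from this change.

```shell
# Hypothetical wrapper around the credentials call for a zonal cluster.
# A zonal cluster is addressed with --zone; a regional one would use --region.
#   $1: cluster name
#   $2: zone (defaults to the zone used in this change)
get_gke_credentials() {
  cluster_name="$1"
  zone="${2:-europe-west4-b}"
  gcloud container clusters get-credentials "$cluster_name" --zone "$zone"
}
```

A workflow step would call it as `get_gke_credentials <cluster-name>` before running any `kubectl` commands.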