963 Commits

Author SHA1 Message Date
E M
5da74edda0
cap boot drive size to 20gb (default is 100gb) to avoid resource exhaustion 2026-04-29 21:26:47 +10:00
E M
9e732b16b9
Add a "Delete PVCs before cluster teardown" step to the workflow to prevent future PVC leaks 2026-04-29 20:37:18 +10:00
E M
c177225677
try zone a one more time 2026-04-29 18:35:08 +10:00
E M
0965220c6d
hanging at 64% deploying again, trying zone c 2026-04-29 17:48:54 +10:00
E M
ebf0abb35c
fix encoding of logging url 2026-04-29 17:13:57 +10:00
E M
a4750824bc
move back to europe-west4-b zone due to exhausted quota 2026-04-29 17:13:44 +10:00
E M
4fd5bdca92
refactor: remove allow-tests-pods node label from GKE node pools
The `allow-tests-pods` boolean label was used by the test framework to steer pods away from runner nodes via a node affinity exclusion. Pod scheduling now uses the existing `workload-type` label directly as a nodeSelector, making the boolean label redundant.
2026-04-29 16:46:24 +10:00
E M
b7b01f92e8
Logging URL filters by RUNID instead of namespace/container name 2026-04-29 15:20:44 +10:00
E M
f5150109f7
fix: avoid building in parallel
Avoids "file in use" errors while building in CI
2026-04-28 20:02:00 +10:00
E M
970012aa04
remove unneeded priority request 2026-04-28 18:07:43 +10:00
E M
6c86e8a9ed
set cluster creation timeout to 20mins
temporary timeout so we can see if the latest commits work without waiting too long between tries
2026-04-28 17:57:15 +10:00
E M
11cb97e97d
Try changing zones in case the cluster deployment stall is due to a zonal unavailability. 2026-04-28 17:51:01 +10:00
E M
6bc28b68d7
change monitoring to default service
Cluster deployment seems to be stalling because the metrics service is not started. So returning it to default to see if that fixes the issue.
2026-04-28 17:50:43 +10:00
E M
f2b26ae5eb
inline node pools so they can be created in parallel
speeds up cluster creation
2026-04-28 16:46:22 +10:00
E M
70ae988c9b
remove unneeded setup 2026-04-28 16:28:04 +10:00
E M
9f46e1ce8a
move state bucket from gh secret to variable 2026-04-24 17:20:03 +10:00
E M
93fc629706
create the terraform cache dir first 2026-04-24 17:11:54 +10:00
E M
0839bd0301
add debug output 2026-04-24 17:01:52 +10:00
E M
580a424086
change script so it doesn't non-zero exit when no pods exist 2026-04-24 15:45:47 +10:00
E M
ffde5e0fdc
fix terraform cache, should remove warning 2026-04-24 15:33:32 +10:00
E M
aebe3a4262
fix polling script 2026-04-24 15:30:56 +10:00
E M
073dc7c408
check pod phase instead 2026-04-24 15:12:39 +10:00
E M
df79cb097b
refactor polling loop 2026-04-24 15:01:45 +10:00
E M
a72e933d38
temp comment out release workflow 2026-04-24 14:41:55 +10:00
E M
df25b12356
temp comment out build workflow 2026-04-24 14:40:57 +10:00
E M
c114a54851
temp comment out build to make testing ci changes faster 2026-04-24 14:40:11 +10:00
E M
b05c345143
Keeps timing out waiting for start, so try polling loop 2026-04-24 14:37:56 +10:00
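The polling approach this commit switches to can be sketched roughly as below. This is a minimal illustration in Python, not the script used in the workflow: the name `wait_for_pod_running`, its parameters, and the injected `get_phase` callable (which would wrap whatever query returns the pod phase) are all hypothetical. It treats "no pod yet" as a reason to keep waiting rather than as an error.

```python
import time
from typing import Callable, Optional


def wait_for_pod_running(
    get_phase: Callable[[], Optional[str]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll a pod phase until it is Running, or give up (hypothetical helper).

    get_phase returns the current pod phase, or None when no pod exists yet.
    """
    deadline = clock() + timeout_s
    while True:
        phase = get_phase()
        if phase == "Running":
            return True
        if phase == "Failed":
            # The pod will not start; stop polling early.
            return False
        if clock() >= deadline:
            return False
        # None (no pod yet), Pending, or Unknown: wait and try again.
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays; a caller in CI would leave them at their defaults.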
E M
9c0e749e99
wait for runners-ci node to be ready before continuing workflow 2026-04-24 13:28:01 +10:00
E M
7994d88996
reorder wait command flags 2026-04-24 12:35:40 +10:00
E M
8527868c45
Show storage node logs URL in workflow summary 2026-04-24 12:25:45 +10:00
E M
c85c658c48
move RELEASE_TESTS_GCP_PROJECT from secret to var for logging URL 2026-04-24 12:25:17 +10:00
E M
937f3c88c0
bump kubectl to latest 2026-04-24 12:24:56 +10:00
E M
627d795e67
change pod condition to wait for (create) 2026-04-24 12:01:27 +10:00
E M
4e8c781299
reusable workflow outputs can silently fail to propagate in certain conditions
Now, STORAGEDOCKERIMAGE is:
- logosstorage/logos-storage-nim:latest-dist-tests for workflow_dispatch on a branch
- logosstorage/logos-storage-nim:v0.1.8-dist-tests for a v0.1.8 tag push
2026-04-24 10:27:59 +10:00
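The tag selection described in the commit above can be sketched as follows. This is a small Python illustration, not the workflow's actual logic: the function name `storage_docker_image` and its parameters are hypothetical, and it assumes that only tag pushes get a version-specific image while every other trigger falls back to `latest-dist-tests`.

```python
IMAGE_REPO = "logosstorage/logos-storage-nim"


def storage_docker_image(ref_type: str, ref_name: str) -> str:
    """Return the image reference for a Git ref (hypothetical helper).

    ref_type is "tag" for a tag push and anything else (e.g. "branch")
    for a manual run on a branch.
    """
    if ref_type == "tag":
        # A tag push pins the image to that release, e.g. v0.1.8-dist-tests.
        return f"{IMAGE_REPO}:{ref_name}-dist-tests"
    # Assumption: all non-tag triggers use the latest dist-tests image.
    return f"{IMAGE_REPO}:latest-dist-tests"
```

Computing the value in one place, rather than relying on an output passed between workflows, is one way to avoid the silent propagation failure the commit mentions.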
E M
5c22e5d7bd
wait for an existing pod before completing step 2026-04-24 10:27:12 +10:00
E M
8f13be1dc4
chore: reduce GKE release test cluster provisioning time and cost
- Configure runners-ci node pool inline in the cluster resource instead
  of using remove_default_node_pool=true, eliminating the
  provision-then-delete cycle that added ~5 min to terraform apply
- Remove the separate infra pool; runners-ci is now the only pool on
  the critical path of cluster creation
- Set tests-pods pool min_node_count=0 so no node is provisioned at
  apply time — nodes scale up only when test pods are scheduled
- Enable spot instances on the tests-pods pool for ~60-91% cost saving
- Add 60 min job timeout to release-tests to bound hung cluster cost
- Add Terraform plugin cache keyed on the lock file to skip provider
  re-downloads on subsequent runs (~30-60s saved)
- Install gke-gcloud-auth-plugin via setup-gcloud to fix kubectl auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 09:46:59 +10:00
E M
00a6264030
chore: use zonal GKE cluster to reduce provisioning time
Switch cluster and all node pools from regional to zonal (`europe-west4-b`) to avoid the 40+ minute provisioning time of a regional (multi-zone) cluster. Adds a `zone` variable to the GKE module and cluster config, and updates the workflow's `gcloud container clusters get-credentials` call to use `--zone` instead of `--region`.
2026-04-23 17:10:08 +10:00
E M
229cff0065
ignore claude files 2026-04-23 16:01:29 +10:00
E M
9a45095a98
rename cluster to match previous change 2026-04-23 16:01:28 +10:00
E M
5cb9cc3379
reduce length of cluster name 2026-04-23 16:01:28 +10:00
E M
1bc1336268
add permissions to gcp auth 2026-04-23 16:01:28 +10:00
E M
5da037fb11
Port Terraform cluster creation/destruction from DigitalOcean to GCP 2026-04-23 16:01:28 +10:00
E M
661308deb1
chore: rename Codex references to Logos Storage in release tests
Replace all "Codex" branding in the release test workflow and supporting
files: rename the K8s cluster, Terraform state key, secret, log paths,
env var (CODEXDOCKERIMAGE → STORAGEDOCKERIMAGE), and test runner image
(cs-codex-dist-tests → logos-storage-dist-tests) to align with the
already-updated logos-storage-nim-cs-dist-tests repo in https://github.com/logos-storage/logos-storage-nim-cs-dist-tests/pull/124. Also fix the
dotnet test path to the correct Tests/LogosStorageReleaseTests directory.
2026-04-23 16:01:28 +10:00
E M
117bc74099
Update workflow success condition 2026-04-23 16:01:27 +10:00
E M
c6a9320648
export kubeconfig values so template works 2026-04-23 16:01:27 +10:00
E M
fc50479c1e
Create static kubeconfig with bearer token
Replace the use of doctl as a credential manager for executing k8s calls with a freshly created bearer token (expires after 2h). Avoids passing a DO personal access token to the cs-dist-tests runner pod.
2026-04-23 16:01:27 +10:00
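A static kubeconfig of the kind this commit describes might look like the sketch below. It is a Python illustration under stated assumptions, not the template used in the repository: the function `build_kubeconfig`, its parameters, and the entry name `release-tests` are hypothetical. The field names follow the standard kubeconfig layout, and the output is JSON, which is also valid YAML.

```python
import json


def build_kubeconfig(server: str, ca_data: str, token: str,
                     name: str = "release-tests") -> str:
    """Render a kubeconfig that authenticates with a bearer token.

    server is the API server URL, ca_data the base64-encoded cluster CA
    certificate, and token a short-lived bearer token. All inputs are
    assumed to come from the cluster-creation step.
    """
    config = {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": name,
            "cluster": {
                "server": server,
                "certificate-authority-data": ca_data,
            },
        }],
        # The token replaces an exec-based credential plugin, so the
        # consumer needs no cloud CLI or personal access token.
        "users": [{"name": name, "user": {"token": token}}],
        "contexts": [{
            "name": name,
            "context": {"cluster": name, "user": name},
        }],
        "current-context": name,
    }
    return json.dumps(config, indent=2)
```

Because the file carries only a token with a limited lifetime, leaking it from the runner pod exposes less than leaking a long-lived provider credential would.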
E M
fdb47887d2
restore using latest cs-dist-tests image 2026-04-23 16:01:27 +10:00
E M
2b13ed9c07
WIP update cs-dist-tests docker image tag
The change to the cs-dist-tests image name was to test if installing doctl to the image would fix the release tests not being able to authenticate into the cluster. This is mainly due to the kubeconfig being generated and stored in a DO secret, as opposed to a static kubeconfig for a permanent cluster as before.

IMPORTANT:
The image tag should be changed back to 'latest'!
2026-04-23 16:01:27 +10:00
E M
122ad42038
wait for pod to start before streaming logs 2026-04-23 16:01:26 +10:00
E M
df880b959b
prefix repository secret names with RELEASE_TESTS_ 2026-04-23 16:01:26 +10:00