76 Commits

Author SHA1 Message Date
E M
2a514e379c
update workflow run summary
- add retention date
- update titles and links for readability
2026-05-28 15:41:29 +10:00
E M
a021908788
Read test result ConfigMap instead of trying to scrape logs for test result info 2026-05-28 15:41:29 +10:00
E M
640d7f0943
more accurate name for step 2026-05-28 15:41:29 +10:00
E M
01677bf9cb
move test summary generation above check job status
If the check job status step failed (e.g. when a test failed), the test summary generation was skipped. Moving the test summary generation step above the job status check avoids this.
2026-05-28 15:41:29 +10:00
E M
1ad70fec22
don't wait for pvc disks to be deleted, delete all at end in case runner crashes 2026-05-28 15:41:29 +10:00
E M
2d7aca1054
wait for pvcs to be deleted before destroying the cluster 2026-05-28 15:41:29 +10:00
E M
f84fd7f25c
do not generate test summary if previous steps were skipped/cancelled 2026-05-28 15:41:29 +10:00
E M
5203cf93e4
fix error in "print storage logs url" step 2026-05-28 15:41:29 +10:00
E M
b4180c471b
delete terraform state lock
When the workflow was cancelled, either manually or automatically by a long-running step hitting its timeout, the terraform state lock had to be deleted manually, or else the next workflow run would never succeed. This change ensures that the state lock file is always deleted after each run.
2026-05-28 15:41:28 +10:00
E M
0e46c9f684
generate test summary to show in workflow summary 2026-05-28 15:41:28 +10:00
E M
c1855fb13a
put cluster name in an env var 2026-05-28 15:41:28 +10:00
E M
10ca94261b
avoid sleeping a full 60s to wait for job completion
Instead, wait for a job condition using kubectl wait
2026-05-28 15:41:28 +10:00
E M
3679040178
try to ensure the log stream survives long silences 2026-05-28 15:41:28 +10:00
E M
e58c8f93c7
add starttime param to logging URL 2026-05-28 15:41:28 +10:00
E M
b04672ebce
Add a "Delete PVCs before cluster teardown" step to the workflow to prevent future PVC leaks 2026-05-28 15:41:28 +10:00
E M
eac099b819
try zone a one more time 2026-05-28 15:41:28 +10:00
E M
2c627c9ed2
hanging at 64% deploying again, trying zone c 2026-05-28 15:41:28 +10:00
E M
c520e79383
fix encoding of logging url 2026-05-28 15:41:27 +10:00
E M
be582eca17
move back to europe-west4-b zone due to exhausted quota 2026-05-28 15:41:27 +10:00
E M
68c319863a
Logging URL filters by RUNID instead of namespace/container name 2026-05-28 15:41:27 +10:00
E M
97750a47ca
Try changing zones in case the cluster deployment stall is due to a zonal unavailability. 2026-05-28 15:41:27 +10:00
E M
bbc4b1caf3
remove unneeded setup 2026-05-28 15:41:27 +10:00
E M
898010d58f
move state bucket from gh secret to variable 2026-05-28 15:41:27 +10:00
E M
77e8d6d64a
create the terraform cache dir first 2026-05-28 15:41:26 +10:00
E M
0e298bddbd
add debug output 2026-05-28 15:41:26 +10:00
E M
48b444d8fe
change script so it doesn't exit non-zero when no pods exist 2026-05-28 15:41:26 +10:00
E M
7a9b93a981
fix terraform cache, should remove warning 2026-05-28 15:41:26 +10:00
E M
d4d52c008a
fix polling script 2026-05-28 15:41:26 +10:00
E M
3ed677c9d1
check pod phase instead 2026-05-28 15:41:26 +10:00
E M
cd972ef9bb
refactor polling loop 2026-05-28 15:41:26 +10:00
E M
1696aa83a9
temp comment out release workflow 2026-05-28 15:41:26 +10:00
E M
a901e1495c
temp comment out build workflow 2026-05-28 15:41:26 +10:00
E M
7f782cf6a1
temp comment out build to make testing ci changes faster 2026-05-28 15:41:26 +10:00
E M
dabdc6d3e9
Keeps timing out waiting for start, so try polling loop 2026-05-28 15:41:26 +10:00
E M
3cb3a176b2
wait for runners-ci node to be ready before continuing workflow 2026-05-28 15:41:26 +10:00
E M
ea5110246c
reorder wait command flags 2026-05-28 15:41:25 +10:00
E M
4732b44e36
Show storage node logs URL in workflow summary 2026-05-28 15:41:25 +10:00
E M
a9f878effb
move RELEASE_TESTS_GCP_PROJECT from secret to var for logging URL 2026-05-28 15:41:25 +10:00
E M
ba49d8b223
bump kubectl to latest 2026-05-28 15:41:25 +10:00
E M
ca138cf2bf
change pod condition to wait for (create) 2026-05-28 15:41:25 +10:00
E M
c93b8c0ec2
reusable workflow outputs can silently fail to propagate in certain conditions
Now, STORAGEDOCKERIMAGE is:
- logosstorage/logos-storage-nim:latest-dist-tests for workflow_dispatch on a branch
- logosstorage/logos-storage-nim:v0.1.8-dist-tests for a v0.1.8 tag push
2026-05-28 15:41:25 +10:00
E M
159e1def65
wait for an existing pod before completing step 2026-05-28 15:41:25 +10:00
E M
bc7a277d9b
chore: reduce GKE release test cluster provisioning time and cost
- Configure runners-ci node pool inline in the cluster resource instead
  of using remove_default_node_pool=true, eliminating the
  provision-then-delete cycle that added ~5 min to terraform apply
- Remove the separate infra pool; runners-ci is now the only pool on
  the critical path of cluster creation
- Set tests-pods pool min_node_count=0 so no node is provisioned at
  apply time — nodes scale up only when test pods are scheduled
- Enable spot instances on the tests-pods pool for ~60-91% cost saving
- Add 60 min job timeout to release-tests to bound hung cluster cost
- Add Terraform plugin cache keyed on the lock file to skip provider
  re-downloads on subsequent runs (~30-60s saved)
- Install gke-gcloud-auth-plugin via setup-gcloud to fix kubectl auth

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 15:41:25 +10:00
E M
cf6d93f52c
chore: use zonal GKE cluster to reduce provisioning time
Switch cluster and all node pools from regional to zonal (`europe-west4-b`) to avoid the 40+ minute provisioning time of a regional (multi-zone) cluster. Add a `zone` variable to the GKE module and cluster config, and update the workflow's `gcloud container clusters get-credentials` call to use `--zone` instead of `--region`.
2026-05-28 15:41:25 +10:00
E M
2e82a4a15c
rename cluster to match previous change 2026-05-28 15:41:25 +10:00
E M
582576675e
add permissions to gcp auth 2026-05-28 15:41:24 +10:00
E M
7c74437bb7
Port terraform cluster creation/destruction from DigitalOcean to GCP 2026-05-28 15:41:24 +10:00
E M
1a376d80db
chore: rename Codex references to Logos Storage in release tests
Replace all "Codex" branding in the release test workflow and supporting
files: rename the K8s cluster, Terraform state key, secret, log paths,
env var (CODEXDOCKERIMAGE → STORAGEDOCKERIMAGE), and test runner image
(cs-codex-dist-tests → logos-storage-dist-tests) to align with the
already-updated logos-storage-nim-cs-dist-tests repo in https://github.com/logos-storage/logos-storage-nim-cs-dist-tests/pull/124. Also fix the
dotnet test path to the correct Tests/LogosStorageReleaseTests directory.
2026-05-28 15:41:24 +10:00
E M
b3e88603bd
Update workflow success condition 2026-05-28 15:41:24 +10:00
E M
d7ad50a924
export kubeconfig values so template works 2026-05-28 15:41:24 +10:00