Added required pod anti-affinity (kubernetes.io/hostname, pod-uuid Exists) to K8sController.cs so all pods within a test namespace are forced onto distinct nodes (Pending if unsatisfiable) to cover the worst-case pod counts per namespace.
The anti-affinity TopologyKey of "kubernetes.io/hostname" makes the node the topology domain. Combined with the "Exists" operator, the rule reads: this pod must not run on a node that is already running one or more pods with a "pod-uuid" label key (all test pods have one).
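For reference, the rule described above roughly corresponds to the following pod spec fragment. This is a sketch of the resulting manifest, not the code in K8sController.cs; only the topology key, the "pod-uuid" label key, and the "Exists" operator come from this change.

```yaml
# Sketch of the pod spec produced by the required anti-affinity rule.
spec:
  affinity:
    podAntiAffinity:
      # "required" means the pod stays Pending if no node satisfies the rule.
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: pod-uuid        # every test pod carries this label key
                operator: Exists     # match on presence of the key, any value
          topologyKey: kubernetes.io/hostname   # one topology domain per node
```

Since no `namespaces` field is set, the selector only matches pods in the same namespace as the pod being scheduled, which is what limits the rule to a single test namespace.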
* fix: delete PVCs after stopping containers
* Didn't work, instead try to delete all PVCs just before the namespace is deleted, after all pods are destroyed.
* Didn't work, force kill pods, then delete PVCs
Force kill pods and wait for them to be killed. Then remove the finaliser that protects the PVC from deletion. Finally, delete the PVC. The finaliser removal step is there in case the pod force kill times out.
* try without waiting for pods to be killed
* prevent double delete race
* remove unneeded method, improve log output in PVC deletion
* Use emptyDir ephemeral volumes instead of PVCs
* fix dist tests workflow summary
After kubeconfig was replaced with an in-cluster service account, k8sClient was returning null, so no test summaries were being written to ConfigMaps. This change makes k8sClient fall back to a default kubeconfig when one is not passed in an environment variable.
* remove PVC volume deletion since PVCs are no longer created
Replace the indirect `SetSchedulingAffinity(notIn: "false")` / `allow-tests-pods` mechanism with `ScheduleInPoolsWithLabel(key, value)` and `AddToleration(key, value, effect)` in ContainerRecipeFactory. This is much more readable from an API perspective: `SetSchedulingAffinity(notIn: "false")` was a double negative that was hard to reason about, and it was not clear that it was meant to schedule on pools labelled `allow-tests-pods=true`.
Previously, pods were steered to the spot node pool via a node affinity exclusion on a boolean label (`allow-tests-pods NotIn ["false"]`), and spot taint toleration was added implicitly by using the `system-node-critical` priority class. The priority class was removed earlier because it caused a ResourceQuota admission error in GCP, which silently broke spot node scheduling.
The new API is explicit: recipes call `ScheduleInPoolsWithLabel` to set a nodeSelector label that targets the intended pool, and `AddToleration` to declare a toleration for each taint the pool carries. Tolerations are set at the recipe level so that a recipe can move back to Digital Ocean if needed, simply by removing the toleration it no longer needs. All four recipes (storage, prometheus, discord bot, rewarder bot) now call both.
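As an illustration, a recipe that calls both methods would be expected to produce a pod spec along these lines. This is a sketch: the `allow-tests-pods=true` label comes from the description above, while the taint key, value, and effect are hypothetical placeholders, since the actual spot pool taint is not stated here.

```yaml
# Sketch of the scheduling fields set by the two calls.
spec:
  # From ScheduleInPoolsWithLabel("allow-tests-pods", "true")
  nodeSelector:
    allow-tests-pods: "true"
  # From AddToleration(key, value, effect); the values below are placeholders.
  tolerations:
    - key: example.com/spot      # hypothetical taint key
      operator: Equal
      value: "true"              # hypothetical taint value
      effect: NoSchedule         # hypothetical taint effect
```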
Cleanup applied alongside:
- `PodToleration` converted to a record for structural equality and simpler deduplication
- `ExposedPorts`, `InternalPorts`, `EnvVars`, `Volumes` on `ContainerRecipe` changed to `IReadOnlyList<T>` for consistent immutable typing
- `SetCriticalPriority` property renamed to `IsCriticalPriority`
- `GetPriorityClassName` returns `string?` instead of `null!`
- `Reset()` extracted in `ContainerRecipeFactory` to consolidate post-create state reset
- Fixed bug: `nodePoolLabels` and `tolerations` were passed by reference and then cleared, leaving the recipe with empty collections; now snapshotted before clearing
- `SchedulingAffinity.cs` deleted (no remaining callers)
* ci(docker): build dist-tests images
* Update to .NET 10, kubernetes client 18.0.13
Kubernetes client 18.0.13 is compatible with Kubernetes 1.34.x. The Kubernetes version is selected automatically by kubeadm in Docker Desktop (v1.34.1). See https://github.com/kubernetes-client/csharp#version-compatibility for a compatibility table.
* Updates to support Kubernetes upgrade
* bump openapi.yaml to match openapi.yaml in the logos-storage-nim docker image
* bump doc to .NET 10
* bump docker to .NET 10
* Build image with latest tag always
Always build an image with a latest tag (as well as a commit SHA tag) when there's a push to master
* docker image tag as "latest" only when pushing to master
* Update docker image to install doctl
* Remove doctl install
The kubeconfig is now generated with a plain bearer token instead of using doctl as a credential manager
* Rename and remove all instances of Codex
* Further remove CodexNetDeployer as it is no longer needed
---------
Co-authored-by: Adam Uhlíř <adam@uhlir.dev>