103 Commits

Author SHA1 Message Date
E M
5f2b537fcd
refactor: replace scheduling affinity with explicit node pool label selection
Replace the indirect `SetSchedulingAffinity(notIn: "false")` / `allow-tests-pods` mechanism with `ScheduleInPoolsWithLabel(key, value)` and `AddToleration(key, value, effect)` in ContainerRecipeFactory. The new calls are easier to read at the call site: `SetSchedulingAffinity(notIn: "false")` was a double negative that was hard to reason about, and it did not make clear that the intent was to schedule on pools labelled `allow-tests-pods=true`.

Previously, pods were steered to the spot node pool via a node affinity exclusion on a boolean label (`allow-tests-pods NotIn ["false"]`), and spot taint toleration was added implicitly by using the `system-node-critical` priority class. The priority class was removed earlier because it caused a ResourceQuota admission error in GCP, which silently broke spot node scheduling.

The new API is explicit: recipes call `ScheduleInPoolsWithLabel` to set a nodeSelector label that targets the intended pool, and `AddToleration` to tolerate any taints that pool carries. Tolerations are set at the recipe level so that a recipe can move back to DigitalOcean if needed, dropping the toleration it no longer requires. All four recipes (storage, prometheus, discord bot, rewarder bot) now call both.

Cleanup applied alongside:
- `PodToleration` converted to a record for structural equality and simpler deduplication
- `ExposedPorts`, `InternalPorts`, `EnvVars`, `Volumes` on `ContainerRecipe` changed to
  `IReadOnlyList<T>` for consistent immutable typing
- `SetCriticalPriority` property renamed to `IsCriticalPriority`
- `GetPriorityClassName` returns `string?` instead of `null!`
- `Reset()` extracted in `ContainerRecipeFactory` to consolidate post-create state reset
- Fixed bug: `nodePoolLabels` and `tolerations` were passed by reference and then cleared,
  leaving the recipe with empty collections; now snapshotted before clearing
- `SchedulingAffinity.cs` deleted (no remaining callers)
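
A minimal sketch of the snapshot fix and the two new factory calls, for illustration only. The method names `ScheduleInPoolsWithLabel`, `AddToleration`, and `Reset`, and the `PodToleration` record, come from the message above; `RecipeSketch`, `RecipeFactorySketch`, the property names, and the method bodies are hypothetical stand-ins and may not match the real `ContainerRecipe` / `ContainerRecipeFactory`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape; the commit only states that PodToleration became a record.
public record PodToleration(string Key, string Value, string Effect);

// Hypothetical stand-in for ContainerRecipe, exposing read-only collections.
public sealed class RecipeSketch
{
    public RecipeSketch(
        IReadOnlyDictionary<string, string> nodePoolLabels,
        IReadOnlyList<PodToleration> tolerations)
    {
        NodePoolLabels = nodePoolLabels;
        Tolerations = tolerations;
    }

    public IReadOnlyDictionary<string, string> NodePoolLabels { get; }
    public IReadOnlyList<PodToleration> Tolerations { get; }
}

// Hypothetical stand-in for ContainerRecipeFactory.
public sealed class RecipeFactorySketch
{
    private readonly Dictionary<string, string> nodePoolLabels = new();
    private readonly List<PodToleration> tolerations = new();

    public void ScheduleInPoolsWithLabel(string key, string value)
    {
        nodePoolLabels[key] = value;
    }

    public void AddToleration(string key, string value, string effect)
    {
        var toleration = new PodToleration(key, value, effect);
        // Records compare by value, so Contains detects duplicates structurally.
        if (!tolerations.Contains(toleration)) tolerations.Add(toleration);
    }

    public RecipeSketch Create()
    {
        // Snapshot before resetting: the recipe receives copies, so clearing
        // the factory's own collections cannot empty the recipe afterwards.
        var recipe = new RecipeSketch(
            new Dictionary<string, string>(nodePoolLabels),
            tolerations.ToList());
        Reset();
        return recipe;
    }

    private void Reset()
    {
        nodePoolLabels.Clear();
        tolerations.Clear();
    }
}
```

Under these assumptions, a recipe would call something like `ScheduleInPoolsWithLabel("allow-tests-pods", "true")` followed by `AddToleration("example-taint", "true", "NoSchedule")` before `Create()`. The taint key and effect here are placeholders, not the values used by the actual spot pool.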
2026-04-29 16:45:55 +10:00
E M
5b17395380
fix: "pod IP unknown" failures while pod still scheduling 2026-04-28 21:56:25 +10:00
Eric
13d453d5ed
chore: Docker updates to support release tests in logos-storage-nim, and remove Codex references (#124)
* ci(docker): build dist-tests images

* Update to .NET 10, Kubernetes client 18.0.13

Kubernetes client 18.0.13 is compatible with Kubernetes 1.34.x. The Kubernetes version (v1.34.1) is selected automatically by kubeadm in Docker Desktop. See https://github.com/kubernetes-client/csharp#version-compatibility for a compatibility table.

* Updates to support Kubernetes upgrade

* bump openapi.yaml to match openapi.yaml in the logos-storage-nim docker image

* bump doc to .NET 10

* bump docker to .NET 10

* Build image with latest tag always

Always build an image with a latest tag (as well as a sha commit hash) when there's a push to master

* docker image tag as "latest" only when pushing to master

* Update docker image to install doctl

* Remove doctl install

kubeconfig is now created with a plain bearer token instead of using doctl as a credential manager

* Rename and remove all instances of Codex

* Further remove CodexNetDeployer as it is no longer needed

---------

Co-authored-by: Adam Uhlíř <adam@uhlir.dev>
2026-04-17 15:03:22 +10:00
Ben
e03f5982d3
Requires new contracts image with configurable marketplace config 2025-08-25 11:11:56 +02:00
ThatBen
8817ce56ae
increases pod deployment timeout 2025-07-09 10:22:41 +02:00
ThatBen
c1fa309271
Applies long timesets for cluster runs 2025-06-20 08:40:48 +02:00
ThatBen
c98cf1ffc4
fixes naming for parameterised fixtures 2025-06-05 15:27:06 +02:00
Ben
6d02285d9a
Merge branch 'master' into feature/proofs-and-frees
# Conflicts:
#	Framework/KubernetesWorkflow/LogHandler.cs
#	Framework/Utils/EthAddress.cs
#	ProjectPlugins/CodexContractsPlugin/Marketplace/Marketplace.cs
#	Tests/CodexReleaseTests/MarketTests/FailTest.cs
#	Tests/CodexReleaseTests/MarketTests/FinishTest.cs
#	Tests/CodexReleaseTests/Utils/MarketplaceAutoBootstrapDistTest.cs
#	Tools/AutoClient/Modes/FolderStore/FileSaver.cs
2025-06-04 12:31:19 +02:00
ThatBen
9e9b147b68
wip log downloading 2025-05-20 14:16:33 +02:00
ThatBen
0e66e8e94a
Merge branch 'master' into feature/extended-marketplace-testing
# Conflicts:
#	Tests/CodexReleaseTests/Parallelism.cs
2025-05-20 09:00:33 +02:00
Ben
5e62c3520c
Prevents downloading of crash log in retry loop 2025-05-14 11:45:15 +02:00
ThatBen
76ea98e783
reverse disable pod ip 2025-04-29 12:03:13 +02:00
ThatBen
8d50c008b8
disables pod ips for cluster testing, fix when not at offsite 2025-04-28 15:44:46 +02:00
ThatBen
3bb9a29054
sets up multiple successfulcontract tests 2025-04-24 12:53:08 +02:00
Ben
a676e0463d
Sets up asserting of balances in case of failed contract. 2025-03-12 14:06:17 +01:00
Ben
fe75609ecb
Fix cid formatting in bot. Fix autoclient folder-uploader crash. Fix several openapi alignments. 2025-02-26 11:59:50 +01:00
ThatBen
92a5a1e361
fixes crashwatcher stop 2025-02-21 15:21:42 +01:00
ThatBen
0c961b8348
updates marketplace contract, fixes type conversions 2025-02-21 14:17:13 +01:00
ThatBen
c73fa186fc
wip 2025-01-16 13:24:57 +01:00
ThatBen
4a151880d4
Extracts codexClient assembly 2025-01-16 11:31:50 +01:00
ThatBen
ec644eed4a
Moving downloadedlog to logging module 2025-01-16 10:15:02 +01:00
ThatBen
e45ed0c21e
pushes container concepts out of codexAccess 2025-01-15 15:00:25 +01:00
Ben
e45b8bde54
adds container names to container log filenames 2024-12-09 15:57:10 +01:00
Ben
3e2cad3c17
Fixes issue where occasionally a CID contains 'ERR' and fails the log assert. 2024-12-04 15:57:49 +01:00
benbierens
04f087efe4
Bump to dotnet 8 2024-10-03 14:02:28 +02:00
Ben
ffb5eb294a
Fix kubernetes port name issue 2024-09-30 15:23:14 +02:00
Ben
c9fedac592
Adds call to get availabilities 2024-09-23 10:52:12 +02:00
Ben
ecd0e70261
fixes serialization issue of containerAdditionals 2024-08-21 10:45:17 +02:00
benbierens
d16b8cb011
Fixes identity issue for runningpod/runningcontainer and log saving for stopped containers 2024-08-01 10:39:06 +02:00
benbierens
9d9f65c5a3
Fixes missing name and null events 2024-07-29 11:02:24 +02:00
benbierens
87271f4f37
Sets up starting event and bootstrap event 2024-07-29 10:16:37 +02:00
Ben
5ca135646a
Container crash detection during start-up 2024-06-19 10:39:14 +02:00
benbierens
54f053cfcc
Improves crash detection 2024-06-14 09:05:56 +02:00
Ben
4fc9835f43
Attempt to log disk space before and after uploads and downloads 2024-06-12 10:48:52 +02:00
benbierens
aa416d50b3
ensuring enough mounted disk space 2024-06-08 10:36:23 +02:00
benbierens
3a61fc89c6
Adds WaitForCleanup test attribute to allow tests to wait for resources to be cleaned up 2024-06-06 15:09:52 +02:00
Ben
0ec43a9325
Shuts up json-serialization codex logging 2024-05-07 11:04:32 +02:00
benbierens
e187bfc941
Changes time.retry to fixed timelength instead of fixed number of retries 2024-05-02 08:41:20 +02:00
benbierens
015d8da21d
Better logging for Time.WaitUntil. 2024-04-14 09:22:55 +02:00
benbierens
d847c4f3ec
Adds names to kube wait functions 2024-04-14 08:56:22 +02:00
gmega
e3b16fd742
add ability to stop single containers 2024-04-13 17:12:14 +03:00
benbierens
69666d3fee
Setting up future-containers 2024-04-09 09:30:45 +02:00
Ben
c5fb066c75
Allows for non-blocking stop of containers 2024-03-13 10:57:26 +01:00
Ben
e42f1ddbd7
Adds support for command overrides to container recipes. 2024-03-13 10:01:37 +01:00
benbierens
5dc918287c
Merge branch 'master' into feature/public-testnet-deploying 2023-12-11 08:30:25 +01:00
Slava
4c46a708ab
Change affinity label (#85)
https://github.com/codex-storage/infra-codex/issues/100
2023-11-27 19:06:04 +02:00
benbierens
3761e236a3
Sets node critical priority for codex and geth nodes 2023-11-23 14:50:54 +01:00
benbierens
ff52e8e841
Defines public IP fetching service 2023-11-23 14:15:37 +01:00
benbierens
485e3cf02e
Merge branch 'master' into feature/public-testnet-deploying 2023-11-14 10:50:41 +01:00
benbierens
b47b596062
Fetches used external ports in order to guarantee no collisions. 2023-11-14 10:49:14 +01:00