From daf6afda9c9d725aa2f8314dc50685032a2c6a57 Mon Sep 17 00:00:00 2001
From: Eric <5089238+emizzle@users.noreply.github.com>
Date: Mon, 5 May 2025 08:38:07 +0200
Subject: [PATCH] add phases, start on phase 2 analysis
---
analysis/recoverability_analysis.md | 104 +++++++++++++++++++---------
design/sales2.md | 17 ++---
2 files changed, 81 insertions(+), 40 deletions(-)
diff --git a/analysis/recoverability_analysis.md b/analysis/recoverability_analysis.md
index 578bbe3..fe8f422 100644
--- a/analysis/recoverability_analysis.md
+++ b/analysis/recoverability_analysis.md
@@ -44,11 +44,28 @@ RepoStore will add support for explicit deletes.
- Marketplace calls `onStore` with a `requestId, slotIdx` (and a manifest cid if needed)
- Marketplace calls `delete` with a `requestId, slotIdx` (and a manifest cid if needed)
+#### `onStore` considerations
+
+`onStore` is a long-running operation that may not finish before the request
+is cancelled. If this happens, `SaleDownloading.run` will be cancelled, which
+in turn cancels the `onStore` call, so `onStore` will need to handle the
+cancellation appropriately.
+
### Phase III
-Phase III will include additional Marketplace features that go beyond the
-"baseline" [simplification of the Marketplace sales
-design](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md).
+Phase III will include [resumption of local state in the Marketplace](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md).
+
+#### Resumable downloads support
+
+Depends on: [Marketplace resumption of local state](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md#resuming-local-state-eg-downloading)
+
+- `onStore` will need to internally track the state of building a slot, to allow
+ for recovery (Download Manager), and it must still handle cancellations
+ appropriately.
+- When the Marketplace restores a local sale to the downloading state, `onStore`
+ will be called and it will resume its operation based on the state.
+
+### Phase IV and beyond
#### Marketplace concurrency support
@@ -58,22 +75,6 @@ Depends on: [Marketplace support for concurrent workers](https://github.com/code
stores/deletes (due to locking) by implementing a Marketplace-level dataset
ref count.
-#### Resumable downloads support
-
-Depends on: [Marketplace resumption of local state](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md#resuming-local-state-eg-downloading)
-
-- `onStore` will need to internally track the state of building a slot, to allow
- for recovery (Download Manager)
-- When the Marketplace restores a local sale to the downloading state, `onStore`
- will be called and it will resume its operation based on the state.
-
-#### `onStore` considerations
-
-`onStore` is a long-running operation that may not finish before a request
-cancellation. If this happens, `SaleDownloading.run` will be cancelled, which
-will cancel the `onStore` call, so `onStore` will need to handle this exception
-appropriately.
-
## Phase I: Expiries but no delete
Identify points in the sales state machine with crashes, exceptions, or
@@ -140,6 +141,11 @@ flowchart LR
### Downloading
+Note that the Marketplace does not update the dataset expiry directly. The
+expiry is updated indirectly via `onStore(expiry)`. However, whether that
+expiry update succeeded must still be considered when analysing
+recoverability.
+
```mermaid
---
config:
@@ -179,7 +185,7 @@ config:
flowchart LR
WaitForStableChallenge["Wait for stable challenge"] -->
GetChallenge["Get challenge"] -- challenge -->
- onProve["onProve(slot, challege)"] -->
+ onProve["onProve(slot, challenge)"] -->
SaleFilling
WaitForStableChallenge -- Exception --> SaleErrored
@@ -193,7 +199,7 @@ flowchart LR
| Situation | Outcome |
|--------------------------------------------------------------------------------|---------------------------------------------|
| CRASH at any point | The slot is forfeited. |
-| EXCEPTION at any point | Goes to SaleErrored. The slot is forfeited. |
+| EXCEPTION at any point | Goes to `SaleErrored`. The slot is forfeited. |
| CANCELLATION during "wait for stable challenge", "get challenge", or `onProve` | The slot is forfeited. |
### Filling
@@ -290,10 +296,10 @@ flowchart TB
WaitUntilNextPeriod -- Exception --> SaleErrored
```
-| Situation | Outcome | Solution |
-|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| CRASH at any point | On chain state is restored at startup, starting in `SaleFilled` state which extends the expiry to request end (no op), then moves back to `SaleProving`. | |
-| EXCEPTION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | For network exceptions, implement retry functionality, with exponential backoff. For other exceptions, go to `SaleErrored`. Exceptions may include any errors resulting from the JSON RPC call, or errors originating in ethers. |
+| Situation | Outcome | Solution |
+|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| CRASH at any point | On chain state is restored at startup, starting in `SaleFilled` state which extends the expiry to request end (no op), then moves back to `SaleProving`. | |
+| EXCEPTION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | For network exceptions, implement retry functionality, with exponential backoff. For other exceptions, go to `SaleErrored`. Exceptions may include any errors resulting from the JSON RPC call, or errors originating in ethers. |
| CANCELLATION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | Possible options include:
1. Raise a Defect, crashing the node and forcing a restart. On startup, slot will enter the `SaleFilled` state, then move to `SaleProving`.
2. Log an error and increment a metric counter.
2. Do not propagate the cancellation. The sale will continue in the `SaleProving` state. Waiters (eg `cancelAndWait` may wait indefinitely). |
### Payout
@@ -313,13 +319,13 @@ flowchart LR
OnFailed["RequestFailed contract event"] --> SaleFailed
```
-| Situation | Outcome | Solution |
-|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
-| `freeSlot` is successful | SUCCESS | |
-| CRASH before `freeSlot` completes | On chain state is restored at startup, starting in `SalePayout`, where `freeSlot` will be tried again. | |
-| CRASH after `freeSlot` completes | Slot is no longer part of `mySlots`, so on startup, slot state will not be restored. | |
+| Situation | Outcome | Solution |
+|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
+| `freeSlot` is successful | SUCCESS | |
+| CRASH before `freeSlot` completes | On chain state is restored at startup, starting in `SalePayout`, where `freeSlot` will be tried again. | |
+| CRASH after `freeSlot` completes | Slot is no longer part of `mySlots`, so on startup, slot state will not be restored. | |
| EXCEPTION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the exception, no clean up routines will be performed. If `freeSlot` did not succeed before the exception, no funds will have been paid out and there are no recovery options. | For network exceptions, implement retry functionality with exponential backoff. |
-| CANCELLATION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the cancellation, no clean up routines will be performed. If `freeSlot` did not succeed before the cancellation, no funds will have been paid out and there are no recovery options. | |
+| CANCELLATION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the cancellation, no clean up routines will be performed. If `freeSlot` did not succeed before the cancellation, no funds will have been paid out and there are no recovery options. | |
### Finished
@@ -348,13 +354,47 @@ flowchart LR
## Phase II: No expiries but deletes
-Depends on: `SalesOrder` implementation
+Depends on: `SalesOrder` "baseline" design implementation.
Identify points of `SalesOrder` updates and `RepoStore` writes, with crashes,
exceptions, or cancellations in between.
### Downloading
+```mermaid
+---
+config:
+ layout: elk
+---
+flowchart LR
+ FetchSlotState["Fetch slot state"] -- isRepairing -->
+ CreateSalesOrder["Create SalesOrder"] -->
+ OnStore["onStore(isRepairing)"] -->
+ SaleInitialProving
+
+ OnStore -- Exception --> SaleErrored
+ FetchSlotState -- Exception --> SaleErrored
+ CreateSalesOrder -- Exception --> SaleErrored
+
+ OnCancelled["Cancelled timer elapsed"] --> SaleCancelled
+ OnFailed["RequestFailed contract event"] --> SaleFailed
+```
+
+| Situation | Outcome |
+|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| No crash/exception in `onStore(expiry)` | Success |
+| CRASH before create `SalesOrder` | The slot is forfeited, with no recovery on startup since the slot was not filled. |
+| CRASH before `onStore(expiry)` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the `SalesOrder` will be archived. |
+| CRASH in `onStore(expiry)` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the dataset will be cleaned up (corrective cleanup) and the `SalesOrder` archived. |
+| CRASH after `onStore(expiry)` but before the transition to `SaleInitialProving` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the dataset will be cleaned up (corrective cleanup) and the `SalesOrder` archived. |
+| EXCEPTION before create `SalesOrder` | Goes to `SaleErrored`. The slot is forfeited. |
+| EXCEPTION before `onStore(expiry)` | Goes to `SaleErrored`. The slot is forfeited. The `SalesOrder` will be archived. |
+| EXCEPTION in `onStore(expiry)` | Goes to `SaleErrored`. The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+| EXCEPTION after `onStore(expiry)` but before the transition to `SaleInitialProving` | The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+| CANCELLATION while fetching slot state | The slot is forfeited. |
+| CANCELLATION during `onStore` | The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+
+
`SalesOrders` are created in the downloading state before any data is written to
disk.
diff --git a/design/sales2.md b/design/sales2.md
index f22cc1f..c993a77 100644
--- a/design/sales2.md
+++ b/design/sales2.md
@@ -192,8 +192,9 @@ until it is deleted. The properties of a created `Availability` can be updated
at any time.
Because availability(ies) represents *future* sales (and not active sales), and
-because fields of the matching `Availability` are persisted in a `SalesOrder`,
-availabilities are not tied to active sales and can be manipulated at any time.
+because fields of the matching `Availability` can be persisted in a `SalesOrder`
+(if needed), availabilities are not tied to active sales and can be manipulated
+at any time.
### `SalesOrder` object
@@ -328,17 +329,17 @@ The underlying `RepoStore` of the `SalesRepo` is responsible to reading and
writing datasets to storage. Its API will include:
```nim
-proc store(id: DatasetId)
- ## Stores blocks of the dataset, incrementing their ref count.
-proc delete(id: DatasetId)
- ## Decreases the ref count of blocks of the dataset, deleting if the ref count is 0.
+proc onStore(id: DatasetId)
+ ## Stores a dataset, incrementing its ref count.
+proc onClear(id: DatasetId)
+ ## Decreases the ref count of the dataset, deleting if the ref count is 0.
```
Datasets will be tracked by a particular id, but it is TBD as to what that ID
will be:
-- Preferred option for MP is `manifestCID + slotIndex`.
-- Alternative options discussed: `treeCid + slotIndex`, `slotRoot`.
+- Preferred option for MP: `requestId + slotIndex`.
+- Alternative options discussed: `treeCid + slotIndex`, `slotRoot`, `requestId + slotIndex + manifestCid`.
## Total collateral