diff --git a/analysis/recoverability_analysis.md b/analysis/recoverability_analysis.md
index 578bbe3..fe8f422 100644
--- a/analysis/recoverability_analysis.md
+++ b/analysis/recoverability_analysis.md
@@ -44,11 +44,28 @@ RepoStore will add support for explicit deletes.
 - Marketplace calls `onStore` with a `requestId, slotIdx` (and a manifest cid if needed)
 - Marketplace calls `delete` with a `requestId, slotIdx` (and a manifest cid if needed)

+#### `onStore` considerations
+
+`onStore` is a long-running operation that may not finish before a request
+cancellation. If this happens, `SaleDownloading.run` will be cancelled, which
+will cancel the `onStore` call, so `onStore` will need to handle this exception
+appropriately.
+
 ### Phase III

-Phase III will include additional Marketplace features that go beyond the
-"baseline" [simplification of the Marketplace sales
-design](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md).
+Phase III will include [resumption of local state in the Marketplace](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md).
+
+#### Resumable downloads support
+
+Depends on: [Marketplace resumption of local state](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md#resuming-local-state-eg-downloading)
+
+- `onStore` will need to internally track the state of building a slot, to allow
+  for recovery (Download Manager), remembering to handle cancellations
+  appropriately.
+- When the Marketplace restores a local sale to the downloading state, `onStore`
+  will be called and it will resume its operation based on the state.
+
+### Phase IV and beyond

 #### Marketplace concurrency support

@@ -58,22 +75,6 @@ Depends on: [Marketplace support for concurrent workers](https://github.com/code
   stores/deletes (due to locking) by implementing a Marketplace-level dataset
   ref count.

-#### Resumable downloads support - -Depends on: [Marketplace resumption of local state](https://github.com/codex-storage/codex-research/blob/refactor/simplified-sales-and-purchasing/design/sales2.md#resuming-local-state-eg-downloading) - -- `onStore` will need to internally track the state of building a slot, to allow - for recovery (Download Manager) -- When the Marketplace restores a local sale to the downloading state, `onStore` - will be called and it will resume its operation based on the state. - -#### `onStore` considerations - -`onStore` is a long-running operation that may not finish before a request -cancellation. If this happens, `SaleDownloading.run` will be cancelled, which -will cancel the `onStore` call, so `onStore` will need to handle this exception -appropriately. - ## Phase I: Expiries but no delete Identify points in the sales state machine with crashes, exceptions, or @@ -140,6 +141,11 @@ flowchart LR ### Downloading +It is important to note that the Marketplace does not update the expiry +directly. Updating of the dataset expiry is done indirectly via +`onStore(expiry)`. However, consideration of expiration update success is still +needed to analyse recoverability. + ```mermaid --- config: @@ -179,7 +185,7 @@ config: flowchart LR WaitForStableChallenge["Wait for stable challenge"] --> GetChallenge["Get challenge"] -- challenge --> - onProve["onProve(slot, challege)"] --> + onProve["onProve(slot, challenge)"] --> SaleFilling WaitForStableChallenge -- Exception --> SaleErrored @@ -193,7 +199,7 @@ flowchart LR | Situation | Outcome | |--------------------------------------------------------------------------------|---------------------------------------------| | CRASH at any point | The slot is forfeited. | -| EXCEPTION at any point | Goes to SaleErrored. The slot is forfeited. | +| EXCEPTION at any point | Goes to `SaleErrored`. The slot is forfeited. 
| | CANCELLATION during "wait for stable challenge", "get challenge", or `onProve` | The slot is forfeited. | ### Filling @@ -290,10 +296,10 @@ flowchart TB WaitUntilNextPeriod -- Exception --> SaleErrored ``` -| Situation | Outcome | Solution | -|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| CRASH at any point | On chain state is restored at startup, starting in `SaleFilled` state which extends the expiry to request end (no op), then moves back to `SaleProving`. | | -| EXCEPTION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | For network exceptions, implement retry functionality, with exponential backoff. For other exceptions, go to `SaleErrored`. Exceptions may include any errors resulting from the JSON RPC call, or errors originating in ethers. 
| +| Situation | Outcome | Solution | +|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| CRASH at any point | On chain state is restored at startup, starting in `SaleFilled` state which extends the expiry to request end (no op), then moves back to `SaleProving`. | | +| EXCEPTION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | For network exceptions, implement retry functionality, with exponential backoff. For other exceptions, go to `SaleErrored`. Exceptions may include any errors resulting from the JSON RPC call, or errors originating in ethers. | | CANCELLATION during `getSlotState`, `waitForNextPeriod`, `getCurrentPeriod`, or `getChallenge` | Goes to `SaleErrored`, all proving is stopped and SP becomes at risk of being slashed. | Possible options include:
1. Raise a Defect, crashing the node and forcing a restart. On startup, slot will enter the `SaleFilled` state, then move to `SaleProving`.
2. Log an error and increment a metric counter.
3. Do not propagate the cancellation. The sale will continue in the `SaleProving` state. Waiters (eg `cancelAndWait`) may wait indefinitely. |

 ### Payout
@@ -313,13 +319,13 @@ flowchart LR
   OnFailed["RequestFailed contract event"] --> SaleFailed
 ```

-| Situation | Outcome | Solution |
-|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
-| `freeSlot` is successful | SUCCESS | |
-| CRASH before `freeSlot` completes | On chain state is restored at startup, starting in `SalePayout`, where `freeSlot` will be tried again. | |
-| CRASH after `freeSlot` completes | Slot is no longer part of `mySlots`, so on startup, slot state will not be restored. | |
+| Situation | Outcome | Solution |
+|-----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
+| `freeSlot` is successful | SUCCESS | |
+| CRASH before `freeSlot` completes | On chain state is restored at startup, starting in `SalePayout`, where `freeSlot` will be tried again. | |
+| CRASH after `freeSlot` completes | Slot is no longer part of `mySlots`, so on startup, slot state will not be restored. | |
 | EXCEPTION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the exception, no clean up routines will be performed. If `freeSlot` did not succeed before the exception, no funds will have been paid out and there are no recovery options. | For network exceptions, implement retry functionality with exponential backoff. |
-| CANCELLATION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the cancellation, no clean up routines will be performed. If `freeSlot` did not succeed before the cancellation, no funds will have been paid out and there are no recovery options. | |
+| CANCELLATION during `freeSlot` | Goes to `SaleErrored`. If `freeSlot` succeeded before the cancellation, no clean up routines will be performed. If `freeSlot` did not succeed before the cancellation, no funds will have been paid out and there are no recovery options. | |

 ### Finished

@@ -348,13 +354,47 @@ flowchart LR

 ## Phase II: No expiries but deletes

-Depends on: `SalesOrder` implementation
+Depends on: `SalesOrder` "baseline" design implementation.

 Identify points of `SalesOrder` updates and `RepoStore` writes, with crashes,
 exceptions, or cancellations in between.

 ### Downloading

+```mermaid
+---
+config:
+  layout: elk
+---
+flowchart LR
+  FetchSlotState["Fetch slot state"] -- isRepairing -->
+  CreateSalesOrder["Create SalesOrder"] -->
+  OnStore["onStore(isRepairing)"] -->
+  SaleInitialProving
+
+  OnStore -- Exception --> SaleErrored
+  FetchSlotState -- Exception --> SaleErrored
+  CreateSalesOrder -- Exception --> SaleErrored
+
+  OnCancelled["Cancelled timer elapsed"] --> SaleCancelled
+  OnFailed["RequestFailed contract event"] --> SaleFailed
+```
+
+| Situation | Outcome |
+|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| No crash/exception in `onStore(expiry)` | Success |
+| CRASH before create `SalesOrder` | The slot is forfeited, with no recovery on startup since the slot was not filled. |
+| CRASH before `onStore(expiry)` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the `SalesOrder` will be archived. |
+| CRASH in `onStore(expiry)` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the dataset will be cleaned up (corrective cleanup) and the `SalesOrder` archived. |
+| CRASH after `onStore(expiry)` but before the transition to `SaleInitialProving` | The slot is forfeited, with no recovery on startup since the slot was not filled. On restart, the dataset will be cleaned up (corrective cleanup) and the `SalesOrder` archived. |
+| EXCEPTION before create `SalesOrder` | Goes to `SaleErrored`. The slot is forfeited. |
+| EXCEPTION before `onStore(expiry)` | Goes to `SaleErrored`. The slot is forfeited. The `SalesOrder` will be archived. |
+| EXCEPTION in `onStore(expiry)` | Goes to `SaleErrored`. The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+| EXCEPTION after `onStore(expiry)` but before the transition to `SaleInitialProving` | The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+| CANCELLATION while fetching slot state | The slot is forfeited. |
+| CANCELLATION during `onStore` | The slot is forfeited. The dataset will be cleaned up (active cleanup) and the `SalesOrder` archived. |
+
+
 `SalesOrders` are created in the downloading state before any data is written to
 disk.

diff --git a/design/sales2.md b/design/sales2.md
index f22cc1f..c993a77 100644
--- a/design/sales2.md
+++ b/design/sales2.md
@@ -192,8 +192,9 @@ until it is deleted.

 The properties of a created `Availability` can be updated at any time. Because
 availability(ies) represents *future* sales (and not active sales), and
-because fields of the matching `Availability` are persisted in a `SalesOrder`,
-availabilities are not tied to active sales and can be manipulated at any time.
+because fields of the matching `Availability` can be persisted in a `SalesOrder`
+(if needed), availabilities are not tied to active sales and can be manipulated
+at any time.

 ### `SalesOrder` object

@@ -328,17 +329,17 @@ The underlying `RepoStore` of the `SalesRepo` is responsible to reading and
 writing datasets to storage. Its API will include:

 ```nim
-proc store(id: DatasetId)
-  ## Stores blocks of the dataset, incrementing their ref count.
-proc delete(id: DatasetId)
-  ## Decreases the ref count of blocks of the dataset, deleting if the ref count is 0.
+proc onStore(id: DatasetId)
+  ## Stores a dataset, incrementing its ref count.
+proc onClear(id: DatasetId)
+  ## Decreases the ref count of the dataset, deleting if the ref count is 0.
 ```

 Datasets will be tracked by a particular id, but it is TBD as to what that ID
 will be:

-- Preferred option for MP is `manifestCID + slotIndex`.
-- Alternative options discussed: `treeCid + slotIndex`, `slotRoot`.
+- Preferred option for MP: `requestId + slotIndex`.
+- Alternative options discussed: `treeCid + slotIndex`, `slotRoot`, `requestId + slotIndex + manifestCid`

 ## Total collateral