Erasure coding specification

Erasure coding spec (https://github.com/codex-storage/codex-spec/issues/19)
2026-02-17 11:43:11 +00:00 · 2025-08-13 18:24:17 +03:00 · 2025-08-13 18:24:17 +03:00 · 33588d8d49
commit 33588d8d49
parent 112bb06faf
1 changed files with 421 additions and 0 deletions
--- a/Notes/Component
+++ b/Notes/Component
@ -0,0 +1,421 @@
+# Erasure Coding Module Specification
+
+## 1. Purpose and Scope
+
+### Purpose
+The erasure coding module provides data redundancy and recovery capabilities. It encodes original data blocks into erasure-protected datasets using systematic Reed-Solomon erasure coding, enabling data recovery even when some blocks are unavailable or corrupted.
+
+The Codex implementation uses the Leopard library, but alternative Reed-Solomon implementations may be used as well.
+
+### Scope
+- Encoding manifests into erasure-protected manifests using K data blocks + M parity blocks
+- Decoding erasure-protected manifests back to original data
+- Repairing incomplete or damaged datasets
+- Merkle tree generation and verification for data integrity
+
+### Boundaries and Limitations
+- Requires at least K blocks to reconstruct original data
+- Dataset must be padded to have blocks aligned for interleaving
+- Maximum recovery capability limited to M missing blocks per column
+- All blocks must be of uniform size within a manifest
+
+## 2. Interfaces
+
+| Interface | Description | Input | Output |
+|-----------|-------------|-------|---------|
+| `encode()` | Encodes a manifest into erasure-protected format | `manifest: Manifest`<br>`blocks: Natural` (K)<br>`parity: Natural` (M)<br>`strategy: StrategyType` (default: SteppedStrategy) | `Future[?!Manifest]` - Protected manifest with erasure coding metadata |
+| `decode()` | Decodes protected manifest to original | `encoded: Manifest` (must be protected) | `Future[?!Manifest]` - Original manifest reconstructed from available blocks |
+| `repair()` | Repairs a damaged protected manifest by reconstructing full dataset | `encoded: Manifest` (protected) | `Future[?!void]` - Success/failure status |
+| `new()` | Creates new Erasure instance | `store: BlockStore`<br>`encoderProvider: EncoderProvider`<br>`decoderProvider: DecoderProvider`<br>`taskPool: Taskpool` | `Erasure` instance |
+| `start()` | Starts the erasure service | None | `Future[void]` |
+| `stop()` | Stops the erasure service | None | `Future[void]` |
+
+### Internal Helper Interfaces
+
+| Interface | Description | Input | Output |
+|-----------|-------------|-------|---------|
+| `asyncEncode()` | Performs async encoding using thread pool | `blockSize: int`<br>`blocksLen: int`<br>`parityLen: int`<br>`blocks: ref seq[seq[byte]]`<br>`parity: ptr UncheckedArray` | `Future[?!void]` |
+| `asyncDecode()` | Performs async decoding using thread pool | `blockSize: int`<br>`blocksLen: int`<br>`parityLen: int`<br>`blocks, parity: ref seq[seq[byte]]`<br>`recovered: ptr UncheckedArray` | `Future[?!void]` |
+| `getPendingBlocks()` | Creates async iterator for block retrieval | `manifest: Manifest`<br>`indices: seq[int]` | `AsyncIter[(?!Block, int)]` |
+
+## 3. Functional Requirements
+
+### 3.1 Systematic Reed-Solomon Erasure Coding
+- Generate M parity blocks from K data blocks using Reed-Solomon erasure coding
+- Use interleaving to encode blocks at the same position across shards
+- Support recovery with any K blocks from K+M total blocks
+- Maintain systematic property where original data is readable without decoding
+- Ensure deterministic encoding
+
+### 3.2 Block Organization
+Erasure coding uses interleaving (SteppedStrategy) to encode data blocks. This works by taking blocks from the same position across multiple data shards and encoding them together:
+
+```
+Interleaving Process
+--------------------
+
+Data shards (K=4 in this example):
+
+    -------------    -------------    -------------    -------------
+    |x| | | | | |    |x| | | | | |    |x| | | | | |    |x| | | | | |
+    -------------    -------------    -------------    -------------
+     |                /                /                |
+      \___________   |   _____________/                 |
+                  \  |  /  ____________________________/
+                   | | |  /
+                   v v v v
+                  
+                  ---------         ---------
+            data  |x|x|x|x|   -->   |p|p|p|p|  parity
+                  ---------         ---------
+                  
+                                     | | | |
+       _____________________________/ /  |  \_________
+      /                 _____________/   |             \
+     |                 /                /               |
+     v                v                v                v
+    -------------    -------------    -------------    -------------
+    |p| | | | | |    |p| | | | | |    |p| | | | | |    |p| | | | | |
+    -------------    -------------    -------------    -------------
+    
+Parity shards (M parity shards generated)
+```
+
+**Key concepts**:
+
+- **K**: Number of original data shards
+- **M**: Number of parity shards
+- **N**: Total shards (K + M)
+- **Steps**: Number of encoding iterations, one for each data block position
+
+The dataset is organized as:
+- **Rows**: K + M total
+- **Columns**: B blocks per row, where B = Steps
+- **Total blocks**: (K + M) x B blocks in the encoded dataset
+
+### 3.3 Data Recovery
+- Reconstruct missing data blocks from any K available blocks
+- Support partial column recovery - up to M blocks missing per column
+- Verify recovered data against original tree root (originalTreeCid)
+- Complete partial block downloads when recovery succeeds
+- Store recovered blocks back to BlockStore for future access
+
+### 3.4 Manifest Management
+- Transform unprotected manifests to protected manifests with erasure coding metadata
+- Preserve original metadata (filename, mimetype, dimensions)
+- Generate and store merkle tree proofs for all blocks
+- Support verifiable manifests with slot roots for proof generation
+- Track both original and encoded dataset sizes
+
+## 4. Non-Functional Requirements
+
+### Performance
+- **Latency**: Yield control periodically (10ms sleep) to prevent blocking
+- **Scalability**: Linear complexity with dataset size O(n)
+
+### Reliability
+- **Error Handling**: 
+  - Detailed error types (ErasureError, InsufficientBlocksError)
+  - Proper error propagation through Result types
+- **Data Integrity**:
+  - Merkle tree verification for all operations
+  - CID-based content addressing
+  - Tree root comparison for validation
+
+### Scalability
+- Handle datasets of arbitrary size - limited only by storage
+- Support configurable erasure coding parameters (K, M)
+- Thread pool size configurable based on system resources
+- Streaming block retrieval to avoid memory exhaustion
+
+## 5. Internal Behavior
+
+### 5.1 Encoding Workflow
+
+```mermaid
+flowchart TD
+    A[Start Encoding] --> B{Validate Parameters}
+    B -->|Invalid| C[Return InsufficientBlocksError]
+    B -->|Valid| D[Calculate Dimensions]
+    D --> E[Initialize EncodingParams]
+    E --> F[Call encodeData]
+    F --> G[For each step 0..steps-1]
+    G --> H[prepareEncodingData]
+    H --> I[Retrieve K blocks via strategy]
+    I --> J[Pad with empty blocks if needed]
+    J --> K[asyncEncode]
+    K --> L[Spawn leopardEncodeTask]
+    L --> M[Wait for thread completion]
+    M --> N[Store M parity blocks]
+    N --> O{More steps?}
+    O -->|Yes| G
+    O -->|No| P[Build Merkle Tree]
+    P --> Q[Store tree proofs]
+    Q --> R[Create Protected Manifest]
+    R --> S[Return Success]
+```
+
+### 5.2 Decoding Workflow
+
+```mermaid
+flowchart TD
+    A[Start Decoding] --> B[Parse Protected Manifest]
+    B --> C[Call decodeInternal]
+    C --> D[For each step 0..steps-1]
+    D --> E[prepareDecodingData]
+    E --> F[Retrieve available blocks]
+    F --> G{dataPieces >= K?}
+    G -->|Yes| H[Skip decoding]
+    G -->|No| I[asyncDecode]
+    I --> J[Spawn leopardDecodeTask]
+    J --> K[Wait for completion]
+    K --> L[Store recovered blocks]
+    H --> M{More steps?}
+    L --> M
+    M -->|Yes| D
+    M -->|No| N[Build Original Tree]
+    N --> O{Verify Tree Root}
+    O -->|Match| P[Store proofs]
+    O -->|Mismatch| Q[Return Error]
+    P --> R[Create Decoded Manifest]
+    R --> S[Return Success]
+```
+
+### 5.3 Repair Workflow
+
+```mermaid
+flowchart TD
+    A[Start Repair] --> B[Call decodeInternal]
+    B --> C[Recover all blocks]
+    C --> D[Build Original Tree]
+    D --> E{Verify Original Tree Root}
+    E -->|Mismatch| F[Return Error]
+    E -->|Match| G[Store all proofs]
+    G --> H[Create Decoded Manifest]
+    H --> I[Call encode with same params]
+    I --> J[Re-encode dataset]
+    J --> K{Verify Repaired Tree}
+    K -->|Mismatch| L[Return Error]
+    K -->|Match| M[Return Success]
+```
+
+### 5.4 Data organization with interleaving
+The encoded dataset uses interleaving, where data blocks at the same position acrosss shards are processed together:
+
+```
+Logical Organization with interleaving:
+
+        Position 0   Position 1   ...   Position B-1
+        ----------------------------------------
+Shard 0  | Block 0  | Block 1   | ... | Block B-1      | Data
+Shard 1  | Block B  | Block B+1 | ... | Block 2B-1     | Data
+...      | ...      | ...       | ... | ...            | Data
+Shard K-1| Block (K-1)×B | ...  | ... | Block K×B-1    | Data
+         |----------|-----------|-----|----------------|
+Shard K  | Parity 0 | Parity 1  | ... | Parity B-1     | Parity
+Shard K+1| Parity B | Parity B+1| ... | Parity 2B-1    | Parity
+...      | ...      | ...       | ... | ...            | Parity
+Shard K+M-1| Parity (M-1)×B |...| ... | Parity M×B-1   | Parity
+
+where:
+- K = number of data shards
+- M = number of parity shards
+- B = number of positions (steps) per shard
+- Each column represents one encoding step
+- Elements at the same position form an encoding group
+```
+
+## 6. Dependencies
+
+### 6.1 External Libraries
+| Dependency | Purpose | Version/Module |
+|------------|---------|----------------|
+| `pkg/chronos` | Async/await runtime, Future types | Threading, async operations |
+| `pkg/libp2p` | CID, multicodec, multihash operations | Content addressing |
+| `pkg/taskpools` | Thread pool management | Parallel processing |
+| `pkg/stew` | Utility functions (byteutils) | Helper operations |
+| `pkg/leopard` | Reed-Solomon erasure coding | Erasure coding module |
+
+### 6.2 Internal Components
+| Component | Purpose | Interface |
+|-----------|---------|-----------|
+| `BlockStore` | Block storage and retrieval | `getBlock(cid/treeCid,index/address)`, `putBlock(blk,ttl)`, `completeBlock(address,blk)`, `putCidAndProof()`, `getCidAndProof()`, `hasBlock()`, `delBlock()` |
+| `Manifest` | Dataset metadata representation | Verifiable/Protected/unprotected manifests with erasure coding metadata |
+| `IndexingStrategy` | Block organization strategies | `getIndices(iteration)`, `init(strategy,firstIndex,lastIndex,iterations)` |
+| `Backend` | Leopard erasure coding | `encode()`, `decode()` interfaces provided via EncoderProvider/DecoderProvider |
+| `CodexTree` | Merkle tree operations | Tree generation, proof creation, root CID calculation |
+| `MerkleTree[H,K]` | Generic merkle tree | `getProof(index)`, `reconstructRoot()`, `verify()` with configurable hash function |
+| `BlockType` | Block data structure | CID-based block representation with data payload |
+
+
+
+### 6.3 Helper Functions
+| Function | Purpose | Input | Output |
+|----------|---------|-------|---------|
+| `putSomeProofs()` | Store merkle proofs for specific indices | `store: BlockStore`<br>`tree: CodexTree`<br>`iter: Iter[int/Natural]` | `Future[?!void]` |
+| `putAllProofs()` | Store all merkle proofs for a tree | `store: BlockStore`<br>`tree: CodexTree` | `Future[?!void]` |
+
+## 7. Data Models
+
+### 7.1 Core Types
+
+#### Erasure Object
+```nim
+type Erasure* = ref object
+  taskPool: Taskpool
+  encoderProvider*: EncoderProvider
+  decoderProvider*: DecoderProvider
+  store*: BlockStore
+```
+
+#### EncodingParams
+```nim
+type EncodingParams = object
+  ecK: Natural           # Number of data blocks (K)
+  ecM: Natural           # Number of parity blocks (M)
+  rounded: Natural       # Dataset rounded to multiple of K
+  steps: Natural         # Number of encoding iterations (steps)
+  blocksCount: Natural   # Total blocks after encoding
+  strategy: StrategyType # Indexing strategy used
+```
+
+### 7.2 Task Types
+
+#### EncodeTask
+```nim
+type EncodeTask = object
+  success: Atomic[bool]                                # Operation success flag
+  erasure: ptr Erasure                                 # Erasure instance
+  blocks: ptr UncheckedArray[ptr UncheckedArray[byte]] # Input data blocks
+  parity: ptr UncheckedArray[ptr UncheckedArray[byte]] # Output parity blocks
+  blockSize: int                                       # Size of each block
+  blocksLen: int                                       # Number of data blocks (K)
+  parityLen: int                                       # Number of parity blocks (M)
+  signal: ThreadSignalPtr                              # Completion signal
+```
+
+#### DecodeTask
+```nim
+type DecodeTask = object
+  success: Atomic[bool]                                   # Operation success flag
+  erasure: ptr Erasure                                    # Erasure instance
+  blocks: ptr UncheckedArray[ptr UncheckedArray[byte]]    # Available data blocks
+  parity: ptr UncheckedArray[ptr UncheckedArray[byte]]    # Available parity blocks
+  recovered: ptr UncheckedArray[ptr UncheckedArray[byte]] # Recovered blocks output
+  blockSize: int                                          # Size of each block
+  blocksLen: int                                          # Number of data blocks (K)
+  parityLen: int                                          # Number of parity blocks (M)
+  recoveredLen: int                                       # Number of recovered blocks
+  signal: ThreadSignalPtr                                 # Completion signal
+```
+
+### 7.3 Error Types
+
+```nim
+type
+  ErasureError* = object of CodexError
+    # Base error type for erasure coding operations
+  
+  InsufficientBlocksError* = object of ErasureError
+    # Raised when insufficient blocks for encoding
+    minSize*: NBytes  # Minimum dataset size required
+```
+
+### 7.4 Manifest
+
+#### Protected and Verifiable Manifest Fields
+```nim
+case protected: bool
+of true:
+  ecK: int                        # Number of data blocks
+  ecM: int                        # Number of parity blocks
+  originalTreeCid: Cid            # CID of original dataset tree
+  originalDatasetSize: NBytes     # Size before erasure coding
+  protectedStrategy: StrategyType # Strategy used for encoding
+  
+  case verifiable: bool
+  of true:
+    verifyRoot: Cid                  # Root of verification tree
+    slotRoots: seq[Cid]              # Individual slot roots
+    cellSize: NBytes                 # Size of verification cells
+    verifiableStrategy: StrategyType # Strategy for verification
+```
+
+### 7.5 Indexing Strategy
+
+```nim
+type
+  StrategyType* = enum
+    LinearStrategy    # Sequential block grouping
+    SteppedStrategy   # Interleaved block grouping
+  
+  IndexingStrategy* = object
+    strategyType*: StrategyType
+    firstIndex*: int      # Start of index range
+    lastIndex*: int       # End of index range
+    iterations*: int      # Number of iteration steps
+    step*: int            # Step size between indices
+    groupCount*: int      # Number of groups
+    padBlockCount*: int   # Padding blocks per group
+```
+
+### 7.6 Supporting Types
+
+#### BlockAddress
+```nim
+type BlockAddress* = object
+  case leaf*: bool
+  of true:
+    treeCid*: Cid    # CID of the merkle tree
+    index*: Natural  # Index of block in the tree
+  of false:
+    cid*: Cid        # Direct CID reference
+```
+
+#### Empty Block Handling
+```nim
+proc emptyCid*(version: CidVersion, hcodec: MultiCodec, dcodec: MultiCodec): ?!Cid
+  # Returns CID representing empty content for padding
+```
+
+#### Merkle Tree Types
+```nim
+type
+  CodexTree* = MerkleTree[Poseidon2Hash, PoseidonKeysEnum]
+  CodexProof* = MerkleProof[Poseidon2Hash, PoseidonKeysEnum]
+  
+  MerkleTree*[H, K] = ref object of RootObj
+    layers*: seq[seq[H]]        # Tree layers from leaves to root
+    compress*: CompressFn[H, K] # Hash compression function
+    zero*: H                    # Zero/empty value
+    
+  MerkleProof*[H, K] = ref object of RootObj
+    index*: int                 # Leaf index, starting from 0
+    path*: seq[H]               # Proof path from bottom to top
+    nleaves*: int               # Total number of leaves
+    compress*: CompressFn[H, K] # Compress function
+    zero*: H                    # Zero value
+```
+
+### 7.7 System Constants
+- **DefaultBlockSize**: 65536 bytes
+- **DefaultCellSize**: 2048 bytes
+- **DefaultMaxSlotDepth**: 32
+- **DefaultMaxDatasetDepth**: 8
+- **DefaultBlockDepth**: 5
+- **DefaultCellElms**: 67
+- **DefaultSamplesNum**: 5
+
+### 7.8 Supported Hash Codecs
+- **Sha256HashCodec**: SHA2-256 hash function
+- **Sha512HashCodec**: SHA2-512 hash function
+- **Pos2Bn128SpngCodec**: Poseidon2 sponge construction
+- **Pos2Bn128MrklCodec**: Poseidon2 merkle tree construction
+
+### 7.9 Codex-Specific Codecs
+- **ManifestCodec**: For manifest encoding
+- **DatasetRootCodec**: For dataset root CIDs
+- **BlockCodec**: For block encoding
+- **SlotRootCodec**: For slot root CIDs
+- **SlotProvingRootCodec**: For proving root CIDs
+- **CodexSlotCellCodec**: For slot cell encoding