Processed review comments.

Co-Authored-By: Leonardo Bautista-Gomez <leobago@gmail.com>
Co-Authored-By: Csaba Kiraly <csaba.kiraly@gmail.com>
Mark Spanbroek 2023-10-02 17:13:35 +02:00 committed by markspanbroek
parent 767e4256ea
commit faf2d3dd29


@@ -10,23 +10,30 @@ Erasure coding is used for multiple purposes in Codex:
- To speed up downloads
- To increase the probability of detecting missing data on a host

The first two purposes can be handled quite effectively by expanding and
splitting a dataset using a standard erasure coding scheme, whereby each of the
resulting pieces is distributed to a different host. These hosts enter into a
contract with a client to store their piece. Their part of the contract [is
called a 'slot'][0], so we'll refer to the piece that a single host stores as
its 'slot data'.

In the rest of this document we will ignore these first two purposes and dive
deeper into the third purpose: increasing the probability of finding missing
slot data on a host. For this reason we introduce a secondary erasure coding
scheme that makes it easier to detect missing or corrupted slot data on a host
through storage proofs.

Storage proofs
--------------
Our proofs of storage allow a host to prove that they are still in possession of
the slot data that they promised to hold. A proof is generated by sampling a
number of blocks and providing a Merkle proof for those blocks. The Merkle proof
is generated inside a SNARK to compress it to a small size to allow for
cost-effective verification on a blockchain.
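
To make the sampling step concrete, here is a minimal sketch in Python of a
Merkle proof over a list of blocks. It only illustrates the structure: it uses
SHA-256 and plain Python, whereas the real proof is produced with a
SNARK-friendly hash inside a circuit, and the details (hash function, padding,
sampling) are assumptions rather than the Codex implementation.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_tree(blocks: list[bytes]) -> list[list[bytes]]:
    """Return all tree levels, from the block hashes up to the root."""
    levels = [[h(block) for block in blocks]]
    while len(levels[-1]) > 1:
        level = levels[-1]
        if len(level) % 2:                     # duplicate the last node on odd levels
            level = level + [level[-1]]
        levels.append([h(level[i] + level[i + 1]) for i in range(0, len(level), 2)])
    return levels

def merkle_proof(levels: list[list[bytes]], index: int) -> list[bytes]:
    """Sibling hashes from leaf to root for the block at `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append(level[index ^ 1])         # sibling of the current node
        index //= 2
    return proof

def verify(root: bytes, block: bytes, index: int, proof: list[bytes]) -> bool:
    node = h(block)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

# A host proves possession of a few sampled blocks against a known root.
blocks = [bytes([i]) * 64 for i in range(8)]
levels = merkle_tree(blocks)
root = levels[-1][0]
for i in (2, 5, 7):                            # sampled block indices
    assert verify(root, blocks[i], i, merkle_proof(levels, i))
```
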
Erasure coding increases the odds of detecting missing slot data with these
proofs.

Consider this example without erasure coding:
@@ -38,8 +45,8 @@ Consider this example without erasure coding:
missing

When we query a block, we have a low chance of detecting the missing block. But
the slot data can no longer be considered complete, because a single block
is missing.
When we add erasure coding:
@@ -50,10 +57,10 @@ When we add erasure coding:
original data parity data
In this example, more than 50% of the erasure coded data needs to be missing
before the slot data can no longer be considered complete. When we now query a
block from this dataset, we have a more than 50% chance of detecting a missing
block. And when we query multiple blocks, the odds of detecting a missing block
increase exponentially.
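
As a rough back-of-the-envelope illustration, assuming each sampled block
independently has a 50% chance of hitting missing data (which is what the worst
case looks like when just over half of the erasure coded data is gone):

```python
# Chance of catching at least one missing block with n independent samples,
# when each sample detects a missing block with probability 1/2.
p = 0.5
for n in (1, 5, 10, 20):
    print(f"{n:2d} samples -> {1 - (1 - p) ** n:.6f}")
# 1 -> 0.500000, 5 -> 0.968750, 10 -> 0.999023, 20 -> 0.999999
```
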
Erasure coding
--------------
@@ -124,8 +131,8 @@ This is repeated for each element inside the shards. In this manner, we can
employ erasure coding on a Galois field of 2^8 to encode 256 shards of data, no
matter how big the shards are.

The number of original data shards is typically called K, the number of parity
shards M, and the total number of shards N.
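
The interleaving described above can be sketched as follows. The `rs_encode`
function is a hypothetical stand-in for a Reed-Solomon encoder over a Galois
field of 2^8 (not a real library call); the point is only to show how each
element position across the K data shards forms its own codeword.

```python
def rs_encode(symbols: list[int], m: int) -> list[int]:
    """Hypothetical: compute m Reed-Solomon parity symbols over GF(2^8)
    for the given data symbols."""
    raise NotImplementedError

def encode_interleaved(data_shards: list[bytes], m: int) -> list[bytes]:
    """Produce m parity shards for k equally sized data shards."""
    length = len(data_shards[0])
    parity_shards = [bytearray(length) for _ in range(m)]
    for i in range(length):                           # element position within a shard
        column = [shard[i] for shard in data_shards]  # i-th element of every data shard
        for j, symbol in enumerate(rs_encode(column, m)):
            parity_shards[j][i] = symbol
    return [bytes(shard) for shard in parity_shards]
```

With K data shards and M parity shards produced this way, N = K + M, and any K
of the N shards suffice to reconstruct each element position.
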
Adversarial erasure
-------------------
@@ -133,11 +140,12 @@ Adversarial erasure
The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.

An adversarial host can now strategically remove only the first element from
more than half of the shards, and the slot data can no longer be recovered from
the data that the host stores. For example, with 1TB of slot data erasure coded
into 256 data and parity shards, an adversary could strategically remove 129
bytes, and the data can no longer be fully recovered with the erasure coded data
that is present on the host.
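
A quick count makes this concrete, assuming the 256 shards split evenly into
K = 128 data and M = 128 parity shards (the 1:2 ratio used elsewhere in this
document):

```python
K, M = 128, 128              # assumed split of the 256 shards
N = K + M
removed = M + 1              # first element removed from 129 shards
remaining = N - removed      # symbols left in the codeword for element 0
print(remaining < K)         # True: fewer than K symbols left, element 0 is unrecoverable
print(removed)               # 129 bytes removed out of ~1 TB
```
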
Implications for storage proofs
-------------------------------
@@ -155,10 +163,10 @@ iterations of the hashing algorithm, which also leads to a larger circuit. A
larger circuit means longer computation and higher memory consumption.
Ideally, we'd like to have small blocks to keep Merkle proofs inside SNARKs
relatively performant, but we are limited by the maximum number of shards that a
particular Reed-Solomon algorithm supports. For instance, the [leopard][1]
library can create at most 65536 shards, because it uses a Galois field of 2^16.
Should we use this to encode a 1TB slot, we'd end up with shards of 16MB, far
too large to be practical in a SNARK.
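
The 16MB figure is simply the slot size divided by the maximum shard count
(binary units are an assumption about how the 1TB is counted):

```python
slot_size = 2 ** 40               # 1 TB of slot data
max_shards = 2 ** 16              # leopard's limit with a Galois field of 2^16
print(slot_size // max_shards)    # 16777216 bytes = 16 MB per shard
```
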
Design space
@@ -168,8 +176,8 @@ This limits the choices that we can make. The limiting factors seem to be:
- Maximum number of shards, determined by the field size of the erasure coding
algorithm
- Number of shards per proof, which determines how likely we are to detect
missing shards
- Capacity of the SNARK algorithm; how many bytes can we hash in a reasonable
time inside the SNARK
@@ -278,8 +286,8 @@ columns:
column parity

This allows us to use the maximum number of shards for our rows, and the maximum
number of shards for our columns. When we erasure code using a Galois field of
2^16 in a two-dimensional structure, we can now have a maximum of 2^16 x 2^16 =
2^32 shards. Or we could go up another two dimensions and have a maximum of 2^64
shards in a four-dimensional structure.
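
The shard counts, and the shard size they imply for 1TB of slot data, work out
as follows (binary units assumed):

```python
row = 2 ** 16                     # maximum shards per dimension with GF(2^16)
print(row ** 2)                   # 4294967296 = 2^32 shards in two dimensions
print(row ** 4)                   # 2^64 shards in four dimensions
print(2 ** 40 // row ** 2)        # 256-byte shards for 1 TB in the two-dimensional case
```
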
@@ -294,8 +302,8 @@ There are however a number of drawbacks to adding more dimensions.
##### Data corrupted sooner #####

In a one-dimensional scheme, corrupting a number of shards just larger than the
number of parity shards ( M + 1 ) will render the slot data incomplete:

<--------- missing: M + 1---------------->
@@ -305,7 +313,7 @@ amount of parity shards ( M + 1 ) will render data lost:
<-------- original: K ----------> <-------- parity: M ------------>
In a two-dimensional scheme, we only need to lose an amount much smaller than
the total amount of parity before the slot data becomes incomplete:

<-------- original: K ----------> <- parity: M ->
@@ -338,27 +346,27 @@ the total amount of parity before data is lost:
<-- missing: M + 1 -->

This is only (M + 1)² shards from a total of N² shards. This gets worse when you
go to three, four or higher dimensions. This means that our chances of detecting
whether the data is incomplete go down, which means that we need to check more
shards in our Merkle storage proofs. This is exacerbated by the need to counter
parity blowup.
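
To see how much smaller that fraction is, take illustrative values K = M = N / 2
(the 1:2 ratio): in one dimension roughly half the shards must be lost, in two
dimensions only about a quarter in the worst case:

```python
N, M = 256, 128                   # illustrative values: K = M = N / 2
print((M + 1) / N)                # ≈ 0.504 of the shards in one dimension
print((M + 1) ** 2 / N ** 2)      # ≈ 0.254 of the shards in two dimensions
```
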
##### Parity blowup #####
When we perform a regular one-dimensional erasure coding, we like to use a ratio
of 1:2 between original data (K) and total data (N), because it gives us a >50%
chance of detecting incomplete data by checking a single shard. If we were to
use the same K and M in a 2-dimensional setting, we'd get a ratio of 1:4 between
original data and total data. In other words, we would blow up the original data
by a factor of 4. This gets worse with higher dimensions.
To counter this blow-up, we can choose an M that is smaller. For two dimensions,
we could choose K = N / √2, and therefore M = N - N / √2. This ensures that the
total amount of data N² is double that of the original data K². For three
dimensions we'd choose K = N / ∛2, etc. This however means that the chances of
detecting incomplete rows or columns go down, which means that we'd again have
to sample more shards in our Merkle storage proofs.
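
A rough sketch of what choosing K = N / √2 means for a row of 2^16 shards (the
exact rounding is an assumption):

```python
import math

N = 2 ** 16
K = round(N / math.sqrt(2))       # ≈ 46341 data shards per row
M = N - K                         # ≈ 19195 parity shards per row
print(N ** 2 / K ** 2)            # ≈ 2.0: total data is about double the original
print((M + 1) / N)                # ≈ 0.29: chance a single sample from a row missing M + 1 shards hits a missing one
```
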
##### Larger encoding times #####
@@ -384,7 +392,7 @@ sample more shards in our Merkle proofs. For example, using a 2 dimensional
structure of erasure coded shards in a Galois field of 2^16, we can handle 1TB
of data with shards of size 256 bytes. When we allow parity data to take up to
half of the total data, we would need to sample 160 shards to have a 0.999999
chance of detecting incomplete slot data. This is much more than the number of
shards that we need in a one-dimensional setting, but the shards are much
smaller. This leads to less hashing in a SNARK, just 40 KB.
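
The hashing volume and the relation between per-sample detection probability and
the required number of samples can be checked as follows. The calculation
assumes independent samples; the per-sample probability of roughly 0.083 is the
value implied by the 160-sample figure, not a number taken from the design.

```python
import math

print(160 * 256)                   # 40960 bytes ≈ 40 KB hashed inside the SNARK

def samples_needed(p: float, target: float = 0.999999) -> int:
    """Samples needed to reach `target` detection probability when each
    sample detects incomplete data with probability `p`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(samples_needed(0.5))         # 20 samples when each sample detects with probability 1/2
print(samples_needed(0.083))       # ≈ 160 samples at the lower per-sample probability
```
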
@@ -413,6 +421,7 @@ Two concrete options are:
use the leopard library for erasure coding and keep memory requirements for
erasure coding to a negligible level.
[0]: ./marketplace.md
[1]: https://github.com/catid/leopard
[2]: https://github.com/Bulat-Ziganshin/FastECC
[3]: https://ieeexplore.ieee.org/abstract/document/6545355