Processed review comments.

Co-Authored-By: Leonardo Bautista-Gomez <leobago@gmail.com>
Co-Authored-By: Csaba Kiraly <csaba.kiraly@gmail.com>

Erasure coding is used for multiple purposes in Codex:

- To speed up downloads
- To increase the probability of detecting missing data on a host

The first two purposes can be handled quite effectively by expanding and
splitting a dataset using a standard erasure coding scheme, whereby each of the
resulting pieces is distributed to a different host. These hosts enter into a
contract with a client to store their piece. Their part of the contract [is
called a 'slot'][0], so we'll refer to the piece that a single host stores as
its 'slot data'.

In the rest of this document we will ignore these first two purposes and dive
deeper into the third purpose: increasing the probability of finding missing
slot data on a host. For this reason we introduce a secondary erasure coding
scheme that makes it easier to detect missing or corrupted slot data on a host
through storage proofs.

Storage proofs
--------------

Our proofs of storage allow a host to prove that they are still in possession
of the slot data that they promised to hold. A proof is generated by sampling a
number of blocks and providing a Merkle proof for those blocks. The Merkle
proof is generated inside a SNARK to compress it to a small size to allow for
cost-effective verification on a blockchain.
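
To make the proof mechanics concrete, here is a minimal sketch of Merkle proof
generation and verification in Python. It is illustrative only: the hash
function, tree layout, and function names are assumptions rather than the
scheme Codex actually uses, and the real verification happens inside a SNARK
circuit rather than in plain code.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute the Merkle root of a list of block hashes."""
    level = leaves
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last hash on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[bytes]:
    """Collect the sibling hashes proving that leaves[index] is in the tree."""
    proof, level = [], leaves
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        proof.append(level[index ^ 1])   # sibling of node i is i ± 1
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(root: bytes, leaf: bytes, index: int, proof: list[bytes]) -> bool:
    """Recompute the root from a sampled leaf and its proof."""
    current = leaf
    for sibling in proof:
        if index % 2 == 0:
            current = h(current + sibling)
        else:
            current = h(sibling + current)
        index //= 2
    return current == root

# Example: a verifier samples a few block indices and checks each proof.
blocks = [bytes([i]) * 64 for i in range(100)]   # toy slot data, 100 blocks
leaves = [h(block) for block in blocks]
root = merkle_root(leaves)
for i in (13, 42, 87):                           # sampled block indices
    assert verify(root, leaves[i], i, merkle_proof(leaves, i))
```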

These storage proofs depend on erasure coding to ensure that a large part of
the data needs to be missing before the original dataset can no longer be
restored. Erasure coding increases the odds of detecting missing slot data with
these proofs.

Consider this example without erasure coding:

[diagram: a row of blocks from the slot data, with a single block missing]

When we query a block, we have a low chance of detecting the missing block. But
the slot data can no longer be considered to be complete, because a single
block is missing.

When we add erasure coding:

[diagram: blocks of original data followed by blocks of parity data]

In this example, more than 50% of the erasure coded data needs to be missing
before the slot data can no longer be considered complete. When we now query a
block from this dataset, we have a more than 50% chance of detecting a missing
block. And when we query multiple blocks, the odds of detecting a missing block
increase exponentially.
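
As a rough illustration of why querying multiple blocks helps, the sketch below
computes the probability of detecting critical data loss, assuming each sampled
block independently has at least a 50% chance of being one of the missing
blocks (the threshold in the example above). The numbers are illustrative, not
a Codex specification.

```python
# Probability of detecting critical data loss when each sampled block has at
# least probability p of being missing. Samples are treated as independent,
# which is a simplification of sampling without replacement.
p = 0.5
for samples in (1, 5, 10, 20, 40):
    detection = 1 - (1 - p) ** samples
    print(f"{samples:>2} samples -> detection probability {detection:.10f}")
# 1 sample gives 0.5; 20 samples already give ~0.9999990463.
```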

Erasure coding
--------------

This is repeated for each element inside the shards. In this manner, we can
employ erasure coding on a Galois field of 2^8 to encode 256 shards of data, no
matter how big the shards are.

The number of original data shards is typically called K, the number of parity
shards M, and the total number of shards N.
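
The interleaving described here can be sketched as follows. This is a
simplified illustration rather than Codex's actual encoder: `rs_encode` stands
in for any Reed-Solomon encoder over GF(2^8) that turns K data symbols into
K + M codeword symbols, and the shard handling is deliberately naive.

```python
# Sketch of interleaved Reed-Solomon encoding over GF(2^8). Each call to
# rs_encode encodes one K-symbol word into a (K + M)-symbol codeword.

def encode_interleaved(shards: list[bytes], m: int, rs_encode) -> list[bytes]:
    """Encode K equally sized shards into K + M shards, element by element."""
    k = len(shards)
    size = len(shards[0])
    assert all(len(s) == size for s in shards), "shards must be equally sized"
    assert k + m <= 256, "GF(2^8) supports at most 256 symbols per codeword"

    # K data shards plus M parity shards, all of the same size.
    out = [bytearray(size) for _ in range(k + m)]

    for i in range(size):
        # Take the i-th element of every data shard...
        word = [shard[i] for shard in shards]
        # ...encode that K-symbol word into a codeword of K + M symbols...
        codeword = rs_encode(word, m)
        # ...and scatter the result back as the i-th element of every shard.
        for j, symbol in enumerate(codeword):
            out[j][i] = symbol

    return [bytes(shard) for shard in out]
```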

Adversarial erasure
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.

An adversarial host can now strategically remove only the first element from
more than half of the shards, and the slot data can no longer be recovered from
the data that the host stores. For example, with 1TB of slot data erasure coded
into 256 data and parity shards, an adversary could strategically remove 129
bytes, and the data can no longer be fully recovered with the erasure coded
data that is present on the host.
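
A quick back-of-the-envelope check of that example; the even 128/128 split
between data and parity shards is an assumption for illustration:

```python
# 1TB of slot data, interleaved Reed-Solomon with 256 shards in total.
K, M = 128, 128                        # assumed split: K data + M parity shards
N = K + M
slot_bytes = 2 ** 40                   # treating 1TB as 1 TiB
shard_bytes = slot_bytes // N          # 4 GiB per shard

# The first byte of every shard together forms one interleaved codeword of N
# symbols. Removing that byte from M + 1 = 129 shards leaves only 127 intact
# symbols, fewer than the K = 128 needed to reconstruct the codeword.
removed = M + 1
intact = N - removed
print(f"removed {removed} bytes out of {slot_bytes}")      # 129 bytes of 2^40
print(f"intact symbols per codeword: {intact} < K = {K}")  # unrecoverable
```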

Implications for storage proofs
-------------------------------

…iterations of the hashing algorithm, which also leads to a larger circuit. A
larger circuit means longer computation and higher memory consumption.

Ideally, we'd like to have small blocks to keep Merkle proofs inside SNARKs
relatively performant, but we are limited by the maximum number of shards that
a particular Reed-Solomon algorithm supports. For instance, the [leopard][1]
library can create at most 65536 shards, because it uses a Galois field of
2^16. Should we use this to encode a 1TB slot, we'd end up with shards of 16MB,
far too large to be practical in a SNARK.
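
The 16MB figure follows directly from the shard count; this treats 1TB as 2^40
bytes:

```python
slot_bytes = 2 ** 40           # 1 TiB slot
max_shards = 2 ** 16           # leopard's limit, from its Galois field of 2^16
shard_bytes = slot_bytes // max_shards
print(shard_bytes)             # 16777216 bytes, i.e. 16 MiB per shard
```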

Design space
------------

This limits the choices that we can make. The limiting factors seem to be:

- Maximum number of shards, determined by the field size of the erasure coding
  algorithm
- Number of shards per proof, which determines how likely we are to detect
  missing shards
- Capacity of the SNARK algorithm; how many bytes can we hash in a reasonable
  time inside the SNARK

…columns:

[diagram: a two-dimensional arrangement of shards with row parity and column
parity]

This allows us to use the maximum number of shards for our rows, and the
maximum number of shards for our columns. When we erasure code using a Galois
field of 2^16 in a two-dimensional structure, we can now have a maximum of
2^16 x 2^16 = 2^32 shards. Or we could go up another two dimensions and have a
maximum of 2^64 shards in a four-dimensional structure.
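
For the 1TB slot from earlier, the two-dimensional arrangement brings the shard
size down dramatically; as before, 1TB is treated as 2^40 bytes:

```python
slot_bytes = 2 ** 40                 # 1 TiB slot
shards_per_dim = 2 ** 16             # GF(2^16) limit per row or column
shards_2d = shards_per_dim ** 2      # 2^32 shards in two dimensions
shards_4d = shards_per_dim ** 4      # 2^64 shards in four dimensions
print(slot_bytes // shards_2d)       # 256 bytes per shard in two dimensions
print(shards_2d, shards_4d)          # 4294967296 and 18446744073709551616
```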

There are however a number of drawbacks to adding more dimensions.

##### Data corrupted sooner #####

In a one-dimensional scheme, corrupting a number of shards just larger than the
number of parity shards ( M + 1 ) will render the slot data incomplete:

[diagram: a row of N shards, split into original: K and parity: M, with a span
of missing: M + 1 shards]

In a two-dimensional scheme, we only need to lose an amount much smaller than
the total amount of parity before the slot data becomes incomplete:

[diagram: an N x N grid of shards with original: K and parity: M per row and
column, and a square of missing shards measuring M + 1 on each side]

This is only (M + 1)² shards from a total of N² shards. This gets worse when
you go to three, four or higher dimensions. This means that our chances of
detecting whether the data is incomplete go down, which means that we need to
check more shards in our Merkle storage proofs. This is exacerbated by the need
to counter parity blowup.
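
To see how the detection odds drop, compare the smallest critical loss in one
and two dimensions when roughly half of the shards are parity; N, K and M below
are illustrative:

```python
# Smallest number of missing shards that makes the slot data incomplete, and
# the chance that one uniformly sampled shard is among the missing ones.
N = 2 ** 16
M = N // 2                     # illustrative: half of all shards are parity
K = N - M

critical_1d = M + 1            # out of N shards in one dimension
critical_2d = (M + 1) ** 2     # out of N**2 shards in two dimensions

print(critical_1d / N)         # ~0.50 chance per sampled shard in 1D
print(critical_2d / N ** 2)    # ~0.25 chance per sampled shard in 2D
```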

##### Parity blowup #####

When we perform a regular one-dimensional erasure coding, we like to use a
ratio of 1:2 between original data (K) and total data (N), because it gives us
a >50% chance of detecting incomplete data by checking a single shard. If we
were to use the same K and M in a 2-dimensional setting, we'd get a ratio of
1:4 between original data and total data. In other words, we would blow up the
original data by a factor of 4. This gets worse with higher dimensions.

To counter this blow-up, we can choose an M that is smaller. For two
dimensions, we could choose K = N / √2, and therefore M = N - N / √2. This
ensures that the total amount of data N² is double that of the original data
K². For three dimensions we'd choose K = N / ∛2, etc. This however means that
the chances of detecting incomplete rows or columns go down, which means that
we'd again have to sample more shards in our Merkle storage proofs.
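
A quick sketch of what that choice looks like for a Galois field of 2^16; the
rounding of K is an illustrative detail, not something the document fixes:

```python
import math

N = 2 ** 16                      # shards per row or column in GF(2^16)
K = round(N / math.sqrt(2))      # ~46341 original shards per row
M = N - K                        # ~19195 parity shards per row

blowup = N ** 2 / K ** 2         # total data vs original data, ~2.0
detect_per_row = (M + 1) / N     # chance of hitting a missing shard when a
                                 # single row is critically damaged, ~0.29
print(K, M, round(blowup, 3), round(detect_per_row, 3))
```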

##### Larger encoding times #####

…sample more shards in our Merkle proofs. For example, using a 2-dimensional
structure of erasure coded shards in a Galois field of 2^16, we can handle 1TB
of data with shards of size 256 bytes. When we allow parity data to take up to
half of the total data, we would need to sample 160 shards to have a 0.999999
chance of detecting incomplete slot data. This is much more than the number of
shards that we need in a one-dimensional setting, but the shards are much
smaller. This leads to less hashing in a SNARK, just 40 KB.
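
Those numbers can be sanity-checked with the same rough independence assumption
as before; with K = N / √2, the smallest critical fraction of missing shards is
about (1 - 1/√2)²:

```python
import math

shard_bytes = 256
samples = 160

# Smallest fraction of shards that must be missing before the slot data is
# incomplete in the 2D scheme with K = N / √2: roughly ((M + 1) / N) ** 2.
critical_fraction = (1 - 1 / math.sqrt(2)) ** 2       # ~0.086

detection = 1 - (1 - critical_fraction) ** samples
print(round(critical_fraction, 4))                    # ~0.0858
print(detection)                                      # ~0.9999994 > 0.999999
print(samples * shard_bytes)                          # 40960 bytes, i.e. 40 KB
```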

Two concrete options are:

…use the leopard library for erasure coding and keep memory requirements for
erasure coding to a negligible level.

[0]: ./marketplace.md
[1]: https://github.com/catid/leopard
[2]: https://github.com/Bulat-Ziganshin/FastECC
[3]: https://ieeexplore.ieee.org/abstract/document/6545355