Storage proofs & erasure coding
===============================

Authors: Codex Team

Erasure coding is used for multiple purposes in Codex:

- To restore data when a host drops from the network; other hosts can restore
  the data that the missing host was storing.
- To speed up downloads
- To increase the probability of detecting missing data on a host

The first two purposes can be handled quite effectively by expanding and
splitting a dataset using a standard erasure coding scheme, whereby each of the
resulting pieces is distributed to a different host. These hosts enter into a
contract with a client to store their piece. Their part of the contract [is
called a 'slot'][0], so we'll refer to the piece that a single host stores as
its 'slot data'.

In the rest of this document we will ignore those first two purposes and dive
deeper into the third: increasing the probability of finding missing slot data
on a host. To that end we introduce a secondary erasure coding scheme that
makes it easier to detect missing or corrupted slot data on a host through
storage proofs.

Storage proofs
--------------

Our proofs of storage allow a host to prove that it is still in possession of
the slot data that it promised to hold. A proof is generated by sampling a
number of blocks and providing a Merkle proof for those blocks. The Merkle proof
is generated inside a SNARK to compress it to a small size, allowing for
cost-effective verification on a blockchain.

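To make the sampling concrete, here is a minimal sketch (in Python) of the
Merkle proof part of such a storage proof. SHA-256 stands in for the
SNARK-friendly hash that would be used in practice, the number of blocks is
assumed to be a power of two, and all names are illustrative:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a binary Merkle tree (leaf count assumed a power of two)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(a + b) for a, b in zip(level[0::2], level[1::2])]
    return level[0]

def merkle_proof(leaves: list[bytes], index: int) -> list[bytes]:
    """Sibling hashes needed to recompute the root for one sampled block."""
    level, proof = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        proof.append(level[index ^ 1])  # sibling of the current node
        level = [h(a + b) for a, b in zip(level[0::2], level[1::2])]
        index //= 2
    return proof

def verify(block: bytes, index: int, proof: list[bytes], root: bytes) -> bool:
    """Recompute the root from a sampled block and its Merkle proof."""
    node = h(block)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

blocks = [bytes([i]) * 64 for i in range(8)]  # 8 toy blocks of slot data
root = merkle_root(blocks)                    # commitment agreed up front
assert verify(blocks[5], 5, merkle_proof(blocks, 5), root)
```

In the real protocol, the verification step is what happens inside the SNARK,
once for each sampled block.
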
Erasure coding increases the odds of detecting missing slot data with these
proofs.

Consider this example without erasure coding:

    -------------------------------------
    |///|///|///|///|///|///|///|   |///|
    -------------------------------------
                                  ^
                                  |
                              missing

When we query a block, we have a low chance of detecting the missing block. But
the slot data can no longer be considered complete, because a single block is
missing.

When we add erasure coding:

    ---------------------------------   ---------------------------------
    |   |///|   |///|   |   |///|   |   |///|///|   |   |///|///|   |   |
    ---------------------------------   ---------------------------------
              original data                        parity data

In this example, more than 50% of the erasure coded data needs to be missing
before the slot data can no longer be considered complete. When we now query a
block from this dataset, we have a more than 50% chance of detecting a missing
block. And when we query multiple blocks, the odds of detecting a missing block
increase exponentially.

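A quick way to see this exponential effect is to compute the detection
probability directly. A minimal sketch, under the simplifying assumption that
each queried block independently has at least a 50% chance of being one of the
missing ones:

```python
def detection_probability(samples: int, missing_fraction: float = 0.5) -> float:
    """Chance that at least one of the sampled blocks turns out to be missing."""
    return 1 - (1 - missing_fraction) ** samples

for samples in (1, 5, 20, 80):
    print(samples, detection_probability(samples))
# 1 block   -> 0.5
# 5 blocks  -> ~0.97
# 20 blocks -> ~0.999999 ("six nines", see below)
# 80 blocks -> indistinguishable from 1.0 in floating point
```
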
Erasure coding
--------------

Reed-Solomon erasure coding works by representing data as a polynomial, and then
sampling parity data from that polynomial.

                     __
            __      /  \                        __      __
           /  \    /    \                      /  \    /  \
          /    \  /      \          __        /    \  /    \
      -- /      \/        \        /  \      /      \/      \ --
        \                  \      /    \    /
         \                  \____/      \__/
        ^  ^  ^  ^  ^  ^  ^  ^          |  |  |  |  |  |  |  |
        |  |  |  |  |  |  |  |          v  v  v  v  v  v  v  v

        -------------------------       -------------------------
        |//|//|//|//|//|//|//|//|       |//|//|//|//|//|//|//|//|
        -------------------------       -------------------------

              original data                       parity

This only works for small amounts of data. When the polynomial is, for
instance, defined over byte-sized elements from a Galois field of 2^8, you can
only encode 2^8 = 256 bytes (data and parity combined).

Interleaving
------------

To encode larger pieces of data with erasure coding, interleaving is used. This
works by taking larger shards of data, and encoding smaller elements from these
shards.

                              data shards

    -------------   -------------   -------------   -------------
    |x| | | | | |   |x| | | | | |   |x| | | | | |   |x| | | | | |
    -------------   -------------   -------------   -------------
         |               /               /               |
         \___________   |   ____________/               |
                     \  |  /  ___________________________/
                     |  |  | /
                     v  v  v v

                   ---------        ---------
            data   |x|x|x|x|  -->   |p|p|p|p|   parity
                   ---------        ---------

                     |  |  | \
          ___________/  |  |  \___________________________
         /      ________/  \____________                  \
        |      /                        \                  |
        v      v                         v                 v
    -------------   -------------   -------------   -------------
    |p| | | | | |   |p| | | | | |   |p| | | | | |   |p| | | | | |
    -------------   -------------   -------------   -------------

                             parity shards

This is repeated for each element inside the shards. In this manner, we can
employ erasure coding over a Galois field of 2^8 to encode 256 shards of data,
no matter how big the shards are.

The number of original data shards is typically called K, the number of parity
shards M, and the total number of shards N.

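As an illustration, here is a minimal sketch of interleaved encoding using the
Python reedsolo library, which implements systematic Reed-Solomon over GF(2^8);
the shard counts K and M and the shard size are arbitrary example values:

```python
from reedsolo import RSCodec

K, M = 4, 4            # example counts of data and parity shards (N = 8)
codec = RSCodec(M)     # systematic Reed-Solomon over GF(2^8), M parity symbols

def encode_interleaved(shards: list[bytes]) -> list[bytes]:
    """Derive M parity shards from K equally sized data shards."""
    size = len(shards[0])
    parity = [bytearray(size) for _ in range(M)]
    for i in range(size):
        # Gather the i-th element of every shard and encode those together.
        column = bytes(shard[i] for shard in shards)
        encoded = codec.encode(column)     # K data symbols + M parity symbols
        for j in range(M):
            parity[j][i] = encoded[K + j]
    return [bytes(p) for p in parity]

data_shards = [bytes([i]) * 1024 for i in range(K)]  # 4 shards of 1KB each
parity_shards = encode_interleaved(data_shards)
assert len(parity_shards) == M and len(parity_shards[0]) == 1024
```
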
Adversarial erasure
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.

An adversarial host can now strategically remove only the first element from
more than half of the shards, and the slot data can no longer be recovered from
the data that the host stores. For example, with 1TB of slot data erasure coded
into 256 data and parity shards, an adversary could strategically remove as few
as 129 bytes (the first byte of 129 shards), and the data can no longer be
fully recovered from the erasure coded data that is present on the host.

Implications for storage proofs
-------------------------------

This means that when we check for missing data, we should perform our checks on
entire shards to protect against adversarial erasure. In the case of our Merkle
storage proofs, this means that we need to hash the entire shard, and then check
that hash with a Merkle proof. Effectively, the block size for Merkle proofs
should equal the shard size of the erasure coding interleaving. This is rather
unfortunate, because hashing large amounts of data is expensive to perform in a
SNARK, which is what we use to compress proofs in size.

A large amount of input data in a SNARK leads to a larger circuit, and to more
iterations of the hashing algorithm, which also leads to a larger circuit. A
larger circuit means longer computation and higher memory consumption.

Ideally, we'd like to have small blocks to keep Merkle proofs inside SNARKs
relatively performant, but we are limited by the maximum number of shards that a
particular Reed-Solomon algorithm supports. For instance, the [leopard][1]
library can create at most 65536 shards, because it uses a Galois field of 2^16.
Should we use this to encode a 1TB slot, we'd end up with shards of 16MB, far
too large to be practical in a SNARK.

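The arithmetic behind that last claim, as a quick check:

```python
# A 1TB slot split into the maximum 2^16 shards that a Galois field of 2^16
# allows still yields 16MB shards.
shard_size = 2**40 // 2**16
print(shard_size // 2**20, "MB")  # 16 MB
```
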
Design space
------------

This limits the choices that we can make. The limiting factors seem to be:

- Maximum number of shards, determined by the field size of the erasure coding
  algorithm
- Number of shards per proof, which determines how likely we are to detect
  missing shards
- Capacity of the SNARK algorithm; how many bytes we can hash in a reasonable
  time inside the SNARK

From these limiting factors we can derive:

- Block size; equals shard size
- Maximum slot size; the maximum amount of data that can be verified with a
  proof
- Erasure coding memory requirements

For example, when we use the leopard library with a Galois field of 2^16,
require 80 blocks to be sampled per proof, and can implement a SNARK that can
hash 80 * 64KB bytes, then we have:

- Block size: 64KB
- Maximum slot size: 4GB (2^16 * 64KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

This has the disadvantage of a rather low maximum slot size of 4GB. When we
want to improve on this to support e.g. 1TB slot sizes, we'll need to either
increase the capacity of the SNARK, increase the field size of the erasure
coding algorithm, or decrease the durability guarantees.

> The [accompanying spreadsheet][4] allows you to explore the design space
> yourself.

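As a small illustration of how the derived quantities follow from the limiting
factors, here is the worked example above as code (the SNARK capacity is the
hypothetical one assumed in the text):

```python
KiB, GiB = 2**10, 2**30

def derive(max_shards: int, blocks_per_proof: int, snark_capacity: int):
    """Derive block size and maximum slot size from the limiting factors."""
    block_size = snark_capacity // blocks_per_proof  # block size = shard size
    return block_size, max_shards * block_size       # ...which bounds the slot

block_size, max_slot = derive(max_shards=2**16,      # Galois field of 2^16
                              blocks_per_proof=80,
                              snark_capacity=80 * 64 * KiB)
print(block_size // KiB, "KB blocks,", max_slot // GiB, "GB max slot")
# 64 KB blocks, 4 GB max slot
```
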
Increasing SNARK capacity
-------------------------

Increasing the computational capacity of SNARKs is an active field of study,
but it is unlikely that we'll see an implementation of SNARKs that is
100-1000x faster before we launch Codex. Better hashing algorithms are also
being designed for use in SNARKs, but it is equally unlikely that we'll see
such a speedup there.

Decreasing durability guarantees
--------------------------------

We could reduce the durability guarantees by requiring e.g. 20 instead of 80
blocks per proof. This would still give us a probability of detecting missing
data of 1 - 0.5^20 = 0.999999046, or "six nines". Arguably this is still good
enough. Choosing 20 blocks per proof allows for slots up to 16GB:

- Block size: 256KB
- Maximum slot size: 16GB (2^16 * 256KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

Erasure coding field size
-------------------------

If we could perform erasure coding on a field of around 2^20 to 2^30 elements,
then this would allow us to get to larger slots. For instance, with a field of
at least size 2^24, we could support slot sizes up to 1TB:

- Block size: 64KB
- Maximum slot size: 1TB (2^24 * 64KB)
- Erasure coding memory: > 48MB (2^24 * 24 bits)

We are however unaware of any implementation of Reed-Solomon that uses a field
size larger than 2^16 while remaining efficient, i.e. O(N log N). [FastECC][2]
uses a prime field of 20 bits, but its decoder isn't released yet, and it is
unclear whether its byte encoding scheme allows for a systematic erasure code.
The paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat
Number Transforms"][3] describes a scheme that uses Proth fields of 2^30, but
lacks an implementation, and has the same encoding challenges as FastECC.

If we were to adopt an erasure coding scheme with a large field, it is likely
that we'll either have to modify Leopard or FastECC, or implement one ourselves.

More dimensions
---------------

Another thing that we could do is to keep using existing erasure coding
implementations, but perform erasure coding in more than one dimension. For
instance, with two dimensions you would encode first in rows, and then in
columns:

           original data                row parity

    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
    |///|///|///|///|///|///|///|///| -> | p | p | p |
    ---------------------------------    -------------
      |   |   |   |   |   |   |   |        |   |   |
      v   v   v   v   v   v   v   v        v   v   v
    ---------------------------------    -------------
    | p | p | p | p | p | p | p | p |    | p | p | p |
    ---------------------------------    -------------
    | p | p | p | p | p | p | p | p |    | p | p | p |
    ---------------------------------    -------------
    | p | p | p | p | p | p | p | p |    | p | p | p |
    ---------------------------------    -------------

                column parity

This allows us to use the maximum number of shards for our rows, and the maximum
number of shards for our columns. When we erasure code using a Galois field of
2^16 in a two-dimensional structure, we can now have a maximum of 2^16 x 2^16 =
2^32 shards. Or we could go up another two dimensions and have a maximum of 2^64
shards in a four-dimensional structure.

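A minimal sketch of this two-dimensional encoding, reusing the reedsolo library
from the interleaving sketch (GF(2^8) instead of 2^16, and toy values for K and
M, purely for illustration):

```python
from reedsolo import RSCodec

K, M = 4, 2         # example values; N = K + M = 6
codec = RSCodec(M)  # systematic Reed-Solomon, M parity symbols

def encode_2d(grid: list[list[int]]) -> list[list[int]]:
    """Extend a K x K grid of byte-sized shards to an N x N grid."""
    # First extend every row with M row parity shards...
    rows = [list(codec.encode(bytes(row))) for row in grid]
    # ...then extend every one of the N columns with M column parity shards
    # (which includes the parity of the row parity).
    columns = [list(codec.encode(bytes(column))) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]  # transpose back to row-major

encoded = encode_2d([[1, 2, 3, 4]] * K)
assert len(encoded) == K + M and all(len(row) == K + M for row in encoded)
```
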
> Note that although we now have multiple dimensions of erasure coding, we do
> not need multiple dimensions of Merkle trees. We can simply unfold the
> multi-dimensional structure into a one-dimensional one (like you would do when
> writing the structure to disk), and then construct a Merkle tree on top of
> that.

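Continuing the sketches above, and reusing merkle_root from the storage proofs
section, the unfolding might look as follows; the zero padding to a
power-of-two leaf count is an illustrative choice of this sketch, not part of
the design:

```python
# Unfold the N x N grid row by row into a flat list of leaves...
leaves = [bytes([element]) for row in encoded for element in row]
# ...pad to a power of two, which the toy merkle_root above requires...
while len(leaves) & (len(leaves) - 1):
    leaves.append(b"\x00")
# ...and commit to the whole structure with a single Merkle tree.
root = merkle_root(leaves)
```
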
There are however a number of drawbacks to adding more dimensions.

##### Data corrupted sooner #####

In a one-dimensional scheme, corrupting a number of shards just larger than the
number of parity shards (M + 1) will render the slot data incomplete:

                                 <------------ missing: M + 1 ------------>
    ---------------------------------   ---------------------------------
    |///|///|///|///|///|///|///|   |   |   |   |   |   |   |   |   |   |
    ---------------------------------   ---------------------------------
    <-------- original: K ---------->   <--------- parity: M ----------->

In a two-dimensional scheme, we only need to lose an amount much smaller than
the total amount of parity before the slot data becomes incomplete:

    <-------- original: K ---------->   <- parity: M ->

    ---------------------------------   -------------   ^
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   |
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   |
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   |
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   K
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   |
    |///|///|///|///|///|///|///|///|   |///|///|///|   |
    ---------------------------------   -------------   |
    |///|///|///|///|///|///|///|///|   |///|///|///|   |   ^
    ---------------------------------   -------------   |   |
    |///|///|///|///|///|///|///|   |   |   |   |   |   v   |
    ---------------------------------   -------------       M
                                                             +
    ---------------------------------   -------------   ^   1
    |///|///|///|///|///|///|///|   |   |   |   |   |   |   |
    ---------------------------------   -------------   M   |
    |///|///|///|///|///|///|///|   |   |   |   |   |   |   |
    ---------------------------------   -------------   |   |
    |///|///|///|///|///|///|///|   |   |   |   |   |   |   |
    ---------------------------------   -------------   v   v

                             <- missing: M + 1 ->

This is only (M + 1)² shards from a total of N² shards. This gets worse when
you go to three, four, or higher dimensions. This means that our chances of
detecting whether the data is incomplete go down, which in turn means that we
need to check more shards in our Merkle storage proofs. This is exacerbated by
the need to counter parity blowup.

##### Parity blowup #####

When we perform regular one-dimensional erasure coding, we like to use a ratio
of 1:2 between original data (K) and total data (N), because it gives us a >50%
chance of detecting incomplete data by checking a single shard. If we were to
use the same K and M in a two-dimensional setting, we'd get a ratio of 1:4
between original data and total data. In other words, we would blow up the
original data by a factor of 4. This gets worse with higher dimensions.

To counter this blow-up, we can choose a smaller M. For two dimensions, we
could choose K = N / √2, and therefore M = N - N / √2. This ensures that the
total amount of data N² is double the original data K². For three dimensions
we'd choose K = N / ∛2, etc. This however means that the chances of detecting
incomplete rows or columns go down, which means that we'd again have to sample
more shards in our Merkle storage proofs.

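A quick check of that choice of K for two dimensions:

```python
from math import isclose, sqrt

N = 2**16
K = N / sqrt(2)
assert isclose(N**2, 2 * K**2)  # total data is double the original data
print(1 - K / N)  # ~0.29: the fraction of each row/column that is parity
```
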
##### Larger encoding times #####

Another drawback of multi-dimensional erasure coding is that we now need to
erasure code the original data multiple times, and we also need to erasure code
some of the parity data. For a two-dimensional code this means that encoding
times go up by a factor of at least 2, for a three-dimensional code a factor of
at least 3, etc.

##### Complexity #####

The final drawback of multi-dimensional erasure coding is its complexity. It is
harder to reason about its correctness, and implementations must take great
care to correctly handle the corner cases that arise when the data is not
exactly K² shaped (or K³, or ...). Decoding is also more involved, because it
might require restoring parity data before it is possible to restore the
original data.

##### The good news #####

Despite these drawbacks, the multi-dimensional approach allows us to make the
shards almost arbitrarily small. This allows us to compensate for the need to
sample more shards in our Merkle proofs. For example, using a two-dimensional
structure of erasure coded shards in a Galois field of 2^16, we can handle 1TB
of data with shards of size 256 bytes. When we allow parity data to take up to
half of the total data, we would need to sample 160 shards to have a 0.999999
chance of detecting incomplete slot data. This is many more shards than we need
in a one-dimensional setting, but the shards are much smaller. This leads to
less hashing in a SNARK: just 40KB.

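A rough check of these sampling numbers, under the same simplifying
independence assumption as before: with parity taking half of the total
(N² = 2K²), the smallest adversarial erasure is a block of (M + 1)² out of N²
shards, so a single uniformly sampled shard detects it with probability roughly
((M + 1) / N)² = (1 - 1/√2)²:

```python
from math import ceil, log, sqrt

per_sample = (1 - 1 / sqrt(2)) ** 2                      # ~0.086
samples = ceil(log(1 - 0.999999) / log(1 - per_sample))  # ~154
print(samples, samples * 256, "bytes hashed")
# ~154 samples, in line with the 160 shards and ~40KB quoted above
```
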
> The numbers for multi-dimensional erasure coding schemes can be found in the
> [accompanying spreadsheet][4].

Conclusion
----------

It is likely that with the current state of the art in SNARK design and erasure
coding implementations we can only support slot sizes up to 4GB. There are two
design directions that allow an increase in slot size. One is to extend an
existing erasure coding implementation, or write our own, to use a larger field
size. The other is to use existing erasure coding implementations in a
multi-dimensional setting.

Two concrete options are:

1. Erasure code with a field size that allows for 2^28 shards. Check 20 shards
   per proof. For 1TB this leads to shards of 4KB. This means the SNARK needs
   to hash 80KB plus the Merkle paths for a storage proof. This requires a
   custom implementation of Reed-Solomon, and at least 1GB of memory while
   performing erasure coding.
2. Erasure code with a field size of 2^16 in two dimensions. Check 160 shards
   per proof. For 1TB this leads to shards of 256 bytes. This means that the
   SNARK needs to hash 40KB plus the Merkle paths for a storage proof. We can
   use the leopard library for erasure coding and keep memory requirements for
   erasure coding to a negligible level.

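The hashing workload implied by each option, as a quick check:

```python
TiB, KiB = 2**40, 2**10

for name, shards, samples in [("2^28 shards, 20 sampled", 2**28, 20),
                              ("2^16 x 2^16 shards, 160 sampled", 2**32, 160)]:
    shard_size = TiB // shards
    print(name, "->", shard_size, "B shards,",
          shard_size * samples // KiB, "KB hashed per proof")
# 2^28 shards, 20 sampled -> 4096 B shards, 80 KB hashed per proof
# 2^16 x 2^16 shards, 160 sampled -> 256 B shards, 40 KB hashed per proof
```
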
[0]: ./marketplace.md
[1]: https://github.com/catid/leopard
[2]: https://github.com/Bulat-Ziganshin/FastECC
[3]: https://ieeexplore.ieee.org/abstract/document/6545355
[4]: ./proof-erasure-coding.ods