Storage proofs & erasure coding
===============================

Erasure coding is used for multiple purposes in Codex:

- To restore data when a host drops from the network; other hosts can restore
  the data that the missing host was storing.
- To speed up downloads
- To increase the probability of detecting missing data on a host

For the first two items we'll use a different erasure coding scheme than we do
for the last. In this document we focus on the last item: an erasure coding
scheme that makes it easier to detect missing or corrupted data on a host
through storage proofs.

Storage proofs
--------------

Our proofs of storage allow a host to prove that it is still in possession of
the data that it promised to hold. A proof is generated by sampling a number of
blocks and providing a Merkle proof for those blocks. The Merkle proof is
generated inside a SNARK, which compresses it to a small size and allows for
cost-effective verification on a blockchain.

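As an illustration, here is a minimal sketch of such a proof in plain code. It
assumes SHA-256 and a power-of-two number of blocks; the real scheme would use a
SNARK-friendly hash and produce the proof inside a SNARK circuit rather than in
Python.

```python
# Minimal sketch of block sampling with Merkle proofs (illustrative only).
# SHA-256 stands in for a SNARK-friendly hash; the number of blocks is assumed
# to be a power of two.
import hashlib

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_tree(leaves):
    """All tree levels, from the hashed leaves up to the root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def merkle_proof(levels, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])  # sibling at this level
        index //= 2
    return proof

def verify(root, block, index, proof):
    digest = h(block)
    for sibling in proof:
        digest = h(digest, sibling) if index % 2 == 0 else h(sibling, digest)
        index //= 2
    return digest == root

blocks = [bytes([i]) * 64 for i in range(8)]   # 8 toy "blocks"
levels = merkle_tree(blocks)
root = levels[-1][0]
sampled = 5                                    # block index chosen by the verifier
assert verify(root, blocks[sampled], sampled, merkle_proof(levels, sampled))
```
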
These storage proofs depend on erasure coding to ensure that a large part of the
data needs to be missing before the original dataset can no longer be restored.
This makes it easier to detect when a dataset is no longer recoverable.

Consider this example without erasure coding:

-------------------------------------
|///|///|///|///|///|///|///|   |///|
-------------------------------------
                              ^
                              |
                           missing

When we query a block from this dataset, we have a low chance of detecting the
missing block. But the dataset is no longer recoverable, because a single block
is missing.

When we add erasure coding:

---------------------------------    ---------------------------------
|   |///|   |///|   |   |///|   |    |///|///|   |   |///|///|   |   |
---------------------------------    ---------------------------------
          original data                         parity data

In this example, more than 50% of the erasure-coded data needs to be missing
before the dataset can no longer be recovered. When we now query a block from
this dataset, we have a more than 50% chance of detecting a missing block. And
when we query multiple blocks, the odds of detecting a missing block increase
dramatically: with n randomly sampled blocks, the probability of detecting an
unrecoverable dataset is at least 1 - 0.5^n.

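As a rough sketch of this calculation (assuming blocks are sampled independently
and that at least half of the erasure-coded blocks must be missing before
recovery fails, as in the example above):

```python
# Probability of detecting an unrecoverable dataset when sampling blocks at
# random. Assumes independent samples and that more than half of the blocks
# must be missing before the dataset becomes unrecoverable.
def detection_probability(samples: int, fraction_missing: float = 0.5) -> float:
    return 1 - (1 - fraction_missing) ** samples

print(detection_probability(1))     # 0.5
print(detection_probability(20))    # ~0.999999 ("six nines")
print(detection_probability(80))    # ~1.0
```
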
Erasure coding
--------------

Reed-Solomon erasure coding works by representing data as a polynomial, and then
sampling parity data from that polynomial.

    __          __          __          __
   /  \        /  \        /  \        /  \
  /    \      /    \      /    \      /    \
_/      \____/      \____/      \____/      \_
 ^  ^  ^  ^  ^  ^  ^  ^      |  |  |  |  |  |  |  |
 |  |  |  |  |  |  |  |      v  v  v  v  v  v  v  v

-------------------------    -------------------------
|//|//|//|//|//|//|//|//|    |//|//|//|//|//|//|//|//|
-------------------------    -------------------------

      original data                   parity

This only works for small amounts of data. When the polynomial is, for instance,
defined over byte-sized elements from a Galois field of size 2^8, you can only
encode 2^8 = 256 bytes (data and parity combined).

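To make the idea concrete, here is a toy Reed-Solomon sketch. It uses a small
prime field and Lagrange interpolation purely for illustration; real
implementations such as leopard work over binary Galois fields with much faster
FFT-based encoders.

```python
# Toy Reed-Solomon over the prime field GF(257) (illustrative only). Data
# symbols are evaluations of a polynomial at x = 0..k-1; parity symbols are
# further evaluations of the same polynomial. Any k of the n symbols suffice
# to reconstruct the polynomial, and therefore the data.
P = 257

def lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at `x`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, parity_count):
    """Parity symbols: extra evaluations of the data polynomial."""
    points = list(enumerate(data))
    return [lagrange_eval(points, len(data) + i) for i in range(parity_count)]

def recover(known, k):
    """Recover the k data symbols from any k known (x, y) pairs."""
    points = known[:k]
    return [lagrange_eval(points, x) for x in range(k)]

data = [10, 20, 30, 40]
parity = encode(data, 4)
remaining = list(enumerate(data + parity))[2:6]    # lose 4 of the 8 symbols
assert recover(remaining, len(data)) == data
```
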
Interleaving
------------

To encode larger pieces of data with erasure coding, interleaving is used. This
works by taking larger blocks of data, and encoding smaller elements from these
blocks.

                           data blocks

------------- ------------- ------------- -------------
|x| | | | | | |x| | | | | | |x| | | | | | |x| | | | | |
------------- ------------- ------------- -------------
  |              /              /               |
  \___________   |   __________/                |
              \  |  /    ______________________/
              |  |  |   /
              v  v  v  v

             ---------     ---------
      data   |x|x|x|x| --> |p|p|p|p|   parity
             ---------     ---------

                            |  |  |  |
     _______________________/  |  |  \___________
    /        __________________/  |              \
    |       /                     |               |
    v       v                     v               v
------------- ------------- ------------- -------------
|p| | | | | | |p| | | | | | |p| | | | | | |p| | | | | |
------------- ------------- ------------- -------------

                          parity blocks

This is repeated for each element inside the blocks.

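A sketch of interleaved encoding, reusing the hypothetical `encode` function
from the Reed-Solomon sketch above:

```python
# Interleaved encoding: the i-th element of every data block forms one small
# codeword, and its parity symbols become the i-th element of the parity
# blocks. Assumes an `encode(symbols, parity_count)` function such as the
# Reed-Solomon sketch in the previous section.
def interleave_encode(data_blocks, parity_block_count):
    block_length = len(data_blocks[0])
    parity_blocks = [[0] * block_length for _ in range(parity_block_count)]
    for i in range(block_length):
        symbols = [block[i] for block in data_blocks]   # i-th element of each block
        parity = encode(symbols, parity_block_count)    # one codeword per position
        for j, symbol in enumerate(parity):
            parity_blocks[j][i] = symbol
    return parity_blocks

data_blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
parity_blocks = interleave_encode(data_blocks, 4)
```
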
Adversarial erasure
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.

An adversary can now strategically remove only the first element from more than
half of the blocks, and the dataset will be damaged beyond repair. For example,
with a dataset of 1TB erasure coded into 256 data and parity blocks, an
adversary could strategically remove just 129 bytes (the first byte of 129
different blocks), and the data can no longer be fully recovered.

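A back-of-the-envelope check of this example (assuming the 1TB covers data and
parity combined, split evenly over 256 blocks with byte-sized elements):

```python
# Adversarial erasure arithmetic for the example above.
total_size = 2**40                     # 1TB of data and parity combined (assumption)
blocks = 256                           # limited by the 2^8 field size
block_size = total_size // blocks      # 4GB per block
codewords = block_size                 # one interleaved codeword per byte position
bytes_to_destroy = blocks // 2 + 1     # 129: erase one element in >half the blocks
print(block_size // 2**30, "GB per block,", bytes_to_destroy, "bytes to destroy")
```
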
Implications for storage proofs
-------------------------------

This means that when we check for missing data, we should perform our checks on
entire blocks to protect against adversarial erasure. In the case of our Merkle
storage proofs, this means that we need to hash the entire block, and then check
that hash with a Merkle proof. This is unfortunate, because hashing large
amounts of data is rather expensive to perform in a SNARK, which is what we use
to compress proofs.

A large amount of input data in a SNARK leads to a larger circuit, and to more
iterations of the hashing algorithm, which also leads to a larger circuit. A
larger circuit means longer computation and higher memory consumption.

Ideally, we'd like to have small blocks to keep Merkle proofs inside SNARKs
relatively performant, but we are limited by the maximum number of blocks that a
particular Reed-Solomon algorithm supports. For instance, the [leopard][1]
library can create at most 65536 blocks, because it uses a Galois field of size
2^16. Should we use this to encode a 1TB file, we'd end up with blocks of 16MB,
far too large to be practical in a SNARK.

Design space
------------

This limits the choices that we can make. The limiting factors seem to be:

- Maximum number of blocks, determined by the field size of the erasure coding
  algorithm
- Number of blocks per proof, which determines how likely we are to detect
  missing blocks
- Capacity of the SNARK algorithm; how many bytes can we hash in a reasonable
  time inside the SNARK

From these limiting factors we can derive:

- Block size
- Maximum slot size; the maximum amount of data that can be verified with a
  proof
- Erasure coding memory requirements

For example, when we use the leopard library with its Galois field of size 2^16,
require 80 blocks to be sampled per proof, and can implement a SNARK that can
hash 80 * 64KB, then we have:

- Block size: 64KB
- Maximum slot size: 4GB (2^16 * 64KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

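The arithmetic behind these numbers can be captured in a small calculator sketch
(parameter names are hypothetical; the values reproduce the example above):

```python
# Back-of-the-envelope calculator for the design space above. Parameter names
# are hypothetical; the numbers reproduce the leopard / 80-blocks example.
def derive(field_bits, blocks_per_proof, snark_capacity_bytes):
    max_blocks = 2 ** field_bits                     # limited by the field size
    block_size = snark_capacity_bytes // blocks_per_proof
    max_slot_size = max_blocks * block_size
    coding_memory = max_blocks * field_bits // 8     # one symbol per block
    return block_size, max_slot_size, coding_memory

block, slot, memory = derive(field_bits=16,
                             blocks_per_proof=80,
                             snark_capacity_bytes=80 * 64 * 1024)
print(block // 1024, "KB")            # 64 KB
print(slot // 2**30, "GB")            # 4 GB
print(memory // 1024, "KB")           # 128 KB
```
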
This has the disadvantage of a rather low maximum slot size of 4GB. If we want
to improve on this to support e.g. 1TB slot sizes, we'll need to either increase
the capacity of the SNARK, increase the field size of the erasure coding
algorithm, or decrease the durability guarantees.

> The [accompanying spreadsheet][4] allows you to explore the design space
> yourself.

Increasing SNARK capacity
-------------------------

Increasing the computational capacity of SNARKs is an active field of study, but
it is unlikely that we'll see an implementation of SNARKs that is 100-1000x
faster before we launch Codex. Better hashing algorithms are also being designed
for use in SNARKs, but it is equally unlikely that we'll see such a speedup
there either.

Decreasing durability guarantees
--------------------------------

We could reduce the durability guarantees by requiring e.g. 20 instead of 80
blocks per proof. This would still give us a probability of detecting missing
data of 1 - 0.5^20, which is 0.999999046, or "six nines". Arguably this is still
good enough. Because the same SNARK capacity now covers 4x larger blocks,
choosing 20 blocks per proof allows for slots up to 16GB:

- Block size: 256KB
- Maximum slot size: 16GB (2^16 * 256KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

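Using the hypothetical `derive` and `detection_probability` sketches from
earlier, this trade-off looks as follows:

```python
# Same assumed SNARK capacity (80 * 64KB of hashing), but only 20 sampled blocks.
block, slot, memory = derive(field_bits=16,
                             blocks_per_proof=20,
                             snark_capacity_bytes=80 * 64 * 1024)
print(block // 1024, "KB")             # 256 KB
print(slot // 2**30, "GB")             # 16 GB
print(detection_probability(20))       # 0.9999990463256836 ("six nines")
```
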
Erasure coding field size
-------------------------

If we could perform erasure coding on a field of size around 2^20 to 2^30, then
this would allow us to get to larger slots. For instance, with a field of size
at least 2^24, we could support slot sizes up to 1TB:

- Block size: 64KB
- Maximum slot size: 1TB (2^24 * 64KB)
- Erasure coding memory: > 48MB (2^24 * 24 bits)

We are, however, unaware of any implementation of Reed-Solomon that uses a field
size larger than 2^16 while remaining efficient (O(N log N)). [FastECC][2] uses
a prime field of 20 bits, but it lacks a decoder and a byte encoding scheme. The
paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat
Number Transforms"][3] describes a scheme that uses Proth fields of size 2^30,
but lacks an implementation.

If we were to adopt an erasure coding scheme with a large field, it is likely
that we'll have to implement one ourselves.

Conclusion
----------

It is likely that with the current state of the art in SNARK design and erasure
coding implementations we can only support slot sizes up to 4GB. The most
promising way to increase the supported slot sizes seems to be to implement an
erasure coding algorithm using a field size of around 2^24.

[1]: https://github.com/catid/leopard
[2]: https://github.com/Bulat-Ziganshin/FastECC
[3]: https://ieeexplore.ieee.org/abstract/document/6545355
[4]: ./proof-erasure-coding.ods