From 6271a18ae7f48405b9a550453d932f4fa02d8f46 Mon Sep 17 00:00:00 2001
From: Mark Spanbroek
Date: Thu, 21 Sep 2023 15:45:59 +0200
Subject: [PATCH] erasure coding for storage proofs writeup

---
 design/proof-erasure-coding.md | 244 +++++++++++++++++++++++++++++++++
 1 file changed, 244 insertions(+)
 create mode 100644 design/proof-erasure-coding.md

diff --git a/design/proof-erasure-coding.md b/design/proof-erasure-coding.md
new file mode 100644
index 0000000..f397cd3
--- /dev/null
+++ b/design/proof-erasure-coding.md
@@ -0,0 +1,244 @@
Storage proofs & erasure coding
===============================

Erasure coding is used for multiple purposes in Codex:

- To restore data when a host drops from the network; other hosts can restore
  the data that the missing host was storing.
- To speed up downloads.
- To increase the probability of detecting missing data on a host.

For the first two items we'll use a different erasure coding scheme than we do
for the last. In this document we focus on the last item: an erasure coding
scheme that makes it easier to detect missing or corrupted data on a host
through storage proofs.

Storage proofs
--------------

Our storage proofs allow a host to prove that it is still in possession of the
data that it promised to hold. A proof is generated by sampling a number of
blocks and providing a Merkle proof for those blocks. The Merkle proof is
generated inside a SNARK, which compresses it to a small size and allows for
cost-effective verification on a blockchain.

These storage proofs depend on erasure coding to ensure that a large part of the
data needs to be missing before the original dataset can no longer be restored.
This makes it easier to detect when a dataset is no longer recoverable.

Consider this example without erasure coding:

    -------------------------------------
    |///|///|///|///|///|///|///|   |///|
    -------------------------------------
                                  ^
                                  |
                               missing

When we query a block from this dataset, we have a low chance of hitting the
missing block. Yet the dataset is already beyond repair, because even a single
missing block prevents restoring the original data.

When we add erasure coding:

    ---------------------------------     ---------------------------------
    |   |///|   |///|   |   |///|   |     |///|///|   |   |///|///|   |   |
    ---------------------------------     ---------------------------------
              original data                          parity data

In this example, more than 50% of the erasure coded data needs to be missing
before the dataset can no longer be recovered. So by the time the dataset is
damaged beyond repair, more than half of its blocks are missing, and querying a
single random block detects the damage with a probability of more than 50%. And
when we query multiple blocks, the odds of detecting the damage increase
dramatically.

Erasure coding
--------------

Reed-Solomon erasure coding works by representing data as a polynomial, and then
sampling parity data from that polynomial:

        __              __              __              __
       /  \            /  \            /  \            /  \
    __/    \__________/    \__________/    \__________/    \
       ^  ^  ^  ^  ^  ^  ^  ^
       |  |  |  |  |  |  |  |         |  |  |  |  |  |  |  |
                                      v  v  v  v  v  v  v  v

      -------------------------      -------------------------
      |//|//|//|//|//|//|//|//|      |//|//|//|//|//|//|//|//|
      -------------------------      -------------------------

           original data                      parity
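To make this concrete, here is a minimal, illustrative sketch in Python (not the
codec Codex uses): it treats a handful of data symbols as evaluations of a
polynomial over a small prime field and takes extra evaluations of the same
polynomial as parity, so that any sufficiently large subset of the symbols
recovers the data. Field size, symbol values and helper names are arbitrary
choices for this sketch.

    # k data symbols define a polynomial of degree k-1; parity symbols are
    # extra evaluations of that polynomial. Any k of the k+m symbols are
    # enough to rebuild the polynomial, and with it the original data.

    P = 257  # small prime field for illustration; real codecs use GF(2^8) or GF(2^16)

    def interpolate(points, x):
        """Evaluate, at x, the unique polynomial through the given (xi, yi)
        points, using Lagrange interpolation modulo P."""
        result = 0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    # pow(a, -1, P) is the modular inverse (Python 3.8+)
                    term = term * (x - xj) * pow(xi - xj, -1, P) % P
            result = (result + term) % P
        return result

    def encode(data, parity_count):
        """Data symbol i is the polynomial's value at x = i; parity symbols
        are its values at x = len(data) ... len(data) + parity_count - 1."""
        points = list(enumerate(data))
        return [interpolate(points, x)
                for x in range(len(data), len(data) + parity_count)]

    data = [104, 101, 108, 108, 111]  # 5 data symbols
    parity = encode(data, 3)          # 3 parity symbols

    # Lose any 3 of the 8 symbols; the remaining 5 still determine the
    # polynomial, so the original data can be restored.
    symbols = list(enumerate(data + parity))
    remaining = [s for s in symbols if s[0] not in (1, 3, 6)]
    restored = [interpolate(remaining, x) for x in range(len(data))]
    assert restored == data

Production Reed-Solomon codecs implement the same idea far more efficiently over
binary Galois fields, which is where the size limits discussed next come from.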
This only works for small amounts of data. When the polynomial is, for instance,
defined over byte-sized elements from the Galois field GF(2^8), you can only
encode 2^8 = 256 bytes (data and parity combined).

Interleaving
------------

To encode larger pieces of data with erasure coding, interleaving is used. This
works by taking larger blocks of data and encoding the smaller elements taken
from these blocks:

                           data blocks

    -------------  -------------  -------------  -------------
    |x| | | | | |  |x| | | | | |  |x| | | | | |  |x| | | | | |
    -------------  -------------  -------------  -------------
     |              |              |              |
     \_____________________        |              |
                    \________      |              |
                           | |  ___/              |
                           | | |  ________________/
                           v v v v

                          ---------        ---------
                   data   |x|x|x|x|  -->   |p|p|p|p|   parity
                          ---------        ---------

                                            | | | |
      ______________________________________/ | | |
     |               _________________________/ | |
     |              |               ____________/ |
     |              |              |              |
     v              v              v              v
    -------------  -------------  -------------  -------------
    |p| | | | | |  |p| | | | | |  |p| | | | | |  |p| | | | | |
    -------------  -------------  -------------  -------------

                          parity blocks

This is repeated for each element inside the blocks.

Adversarial erasure
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.

An adversary can now strategically remove only the first element from more than
half of the blocks, and the dataset will be damaged beyond repair. For example,
with a dataset of 1TB erasure coded into 256 data and parity blocks, an
adversary could strategically remove just 129 bytes (the first byte of 129 of
the 256 blocks), and the data can no longer be fully recovered.

Implications for storage proofs
-------------------------------

This means that when we check for missing data, we should perform our checks on
entire blocks to protect against adversarial erasure. In the case of our Merkle
storage proofs, this means that we need to hash an entire block, and then check
that hash with a Merkle proof. This is unfortunate, because hashing large
amounts of data is expensive to perform in a SNARK, which is what we use to
compress our proofs.

A large amount of input data in a SNARK leads to a larger circuit, and to more
iterations of the hashing algorithm, which also leads to a larger circuit. A
larger circuit means longer computation and higher memory consumption.

Ideally, we'd like to have small blocks to keep Merkle proofs inside SNARKs
relatively performant, but we are limited by the maximum number of blocks that a
particular Reed-Solomon algorithm supports. For instance, the [leopard][1]
library can create at most 65536 blocks, because it uses a Galois field of 2^16.
Should we use it to encode a 1TB file, we'd end up with blocks of 16MB, far too
large to be practical in a SNARK.

Design space
------------

This limits the choices that we can make. The limiting factors seem to be:

- Maximum number of blocks, determined by the field size of the erasure coding
  algorithm
- Number of blocks per proof, which determines how likely we are to detect
  missing blocks
- Capacity of the SNARK algorithm: how many bytes we can hash in a reasonable
  time inside the SNARK

From these limiting factors we can derive:

- Block size
- Maximum slot size: the maximum amount of data that can be verified with a
  single proof
- Erasure coding memory requirements

For example, when we use the leopard library with a Galois field of 2^16,
require 80 blocks to be sampled per proof, and can implement a SNARK that hashes
80 * 64KB per proof, then we have:

- Block size: 64KB
- Maximum slot size: 4GB (2^16 * 64KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

This has the disadvantage of a rather low maximum slot size of 4GB. When we want
to improve on this to support e.g. 1TB slot sizes, we'll need to either increase
the capacity of the SNARK, increase the field size of the erasure coding
algorithm, or decrease the durability guarantees.

> The [accompanying spreadsheet][4] allows you to explore the design space
> yourself.
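As a rough, illustrative counterpart to the spreadsheet, the small Python sketch
below derives these numbers from the limiting factors. The formulas and names
are simplifications based on the examples in this document, not Codex code; the
configurations with 20 blocks per proof and with a 2^24 field are the
alternatives explored in the sections below.

    KB = 1024

    def design_point(field_bits, blocks_per_proof, snark_bytes_per_proof):
        """Derive block size, maximum slot size and erasure coding memory
        from the limiting factors listed above."""
        max_blocks = 2 ** field_bits                # bounded by the Galois field
        block_size = snark_bytes_per_proof // blocks_per_proof
        max_slot_size = max_blocks * block_size
        ec_memory = max_blocks * field_bits // 8    # at least one field element per block
        detection = 1 - 0.5 ** blocks_per_proof     # chance of detecting an unrecoverable slot
        return block_size, max_slot_size, ec_memory, detection

    snark_capacity = 80 * 64 * KB  # assumed hashing budget per proof (5MB)

    print(design_point(16, 80, snark_capacity))  # 64KB blocks,  4GB slots,  128KB, ~1.0
    print(design_point(16, 20, snark_capacity))  # 256KB blocks, 16GB slots, 128KB, ~0.999999
    print(design_point(24, 80, snark_capacity))  # 64KB blocks,  1TB slots,  48MB,  ~1.0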
Increasing SNARK capacity
-------------------------

Increasing the computational capacity of SNARKs is an active field of study, but
it is unlikely that we'll see an implementation of SNARKs that is 100-1000x
faster before we launch Codex. Better hashing algorithms are also being designed
for use in SNARKs, but it is equally unlikely that we'll see such a speedup here
either.

Decreasing durability guarantees
--------------------------------

We could reduce the durability guarantees by requiring e.g. 20 instead of 80
blocks per proof. This would still give us a probability of detecting missing
data of 1 - 0.5^20, which is 0.999999046, or "six nines". Arguably this is still
good enough. Choosing 20 blocks per proof allows for slots up to 16GB:

- Block size: 256KB
- Maximum slot size: 16GB (2^16 * 256KB)
- Erasure coding memory: > 128KB (2^16 * 16 bits)

Erasure coding field size
-------------------------

If we could perform erasure coding on a field of around 2^20 to 2^30 elements,
then this would allow us to get to larger slots. For instance, with a field of
at least size 2^24, we could support slot sizes up to 1TB:

- Block size: 64KB
- Maximum slot size: 1TB (2^24 * 64KB)
- Erasure coding memory: > 48MB (2^24 * 24 bits)

We are however unaware of any implementation of Reed-Solomon that uses a field
size larger than 2^16 while remaining efficient (O(N log N)). [FastECC][2] uses
a prime field of 20 bits, but it lacks a decoder and a byte encoding scheme. The
paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat
Number Transforms"][3] describes a scheme that uses Proth fields of 2^30, but
lacks an implementation.

If we were to adopt an erasure coding scheme with a larger field, it is likely
that we'd have to implement it ourselves.

Conclusion
----------

It is likely that with the current state of the art in SNARK design and erasure
coding implementations we can only support slot sizes up to 4GB. The most
promising way to increase the supported slot sizes seems to be to implement an
erasure coding algorithm that uses a field size of around 2^24.

[1]: https://github.com/catid/leopard
[2]: https://github.com/Bulat-Ziganshin/FastECC
[3]: https://ieeexplore.ieee.org/abstract/document/6545355
[4]: ./proof-erasure-coding.ods