expand the documentation

This commit is contained in:
Balazs Komuves 2025-10-31 12:46:57 +01:00
parent 39be48f67d
commit f7955ac21b
4 changed files with 392 additions and 20 deletions

docs/Disk_layout.md Normal file

@ -0,0 +1,137 @@
Disk layout
-----------
We hope that this protocol will be feasible for data sizes up to at least \~10GB (for example 8GB original data + 8GB parity, i.e. 16GB total size).
However, we don't want to keep such an amount of data all in memory, for several reasons:
- we want to support relatively low-spec hardware
- the machine may want to do other things at the same time (this is supposed to be a background process)
- we may want to support parallelism
So it would be good if the memory consumption could be bounded (with the maximum residency being a configuration parameter).
Recall that semantically, all our data is organized as _matrices_ (2 dimensional); however, data as stored on disk is 1 dimensional. These two facts are in conflict with each other, especially as we have different access patterns:
- when the client computes the Merkle root of the original data, that's built on the top of _matrix rows_
- during the Reed-Solomon encoding itself, we access the _matrix columns_ independently (and the CPU computation can be done in parallel)
- when computing the Merkle tree of the encoded data, that's again _rows_
- during the FRI proof, we sample random _matrix rows_
- during future storage proofs, again one or a few _rows_ are sampled
We also expect people to use HDDs (spinning disks, as opposed to SSDs), especially if the network's data storage scales up to non-trivial sizes, as HDDs are much cheaper. Unfortunately, spinning disks are very slow, both in linear read/write (about 200-300 MB/sec) and seeking (up to 1000 seeks/sec); and disk access, unlike CPU tasks, cannot be parallelized.
This means disk layout is critical for performance.
#### Data Conversion
We can encode 31 bytes into 4 Goldilocks field elements. That's about 10% more efficient than encoding 7 bytes into a single field element, while still quite simple, so we should do that (a free 10% is a free 10%!).
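For illustration, here is one possible way to pack 31 bytes into 4 field elements (62 bits per element, which always fits below the Goldilocks prime); this is only a sketch, the actual byte-to-field-element convention still has to be fixed:

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime, log2(p) ~ 63.99

def pack_31_bytes(chunk: bytes) -> list[int]:
    """Pack 31 bytes (248 bits) into 4 field elements of 62 bits each.
    Since 2^62 < p, every limb is automatically a valid field element."""
    assert len(chunk) == 31
    x = int.from_bytes(chunk, "little")                           # 248-bit integer
    return [(x >> (62 * i)) & ((1 << 62) - 1) for i in range(4)]

def unpack_31_bytes(limbs: list[int]) -> bytes:
    """Inverse of `pack_31_bytes`."""
    x = sum(limb << (62 * i) for i, limb in enumerate(limbs))
    return x.to_bytes(31, "little")

assert unpack_31_bytes(pack_31_bytes(bytes(range(31)))) == bytes(range(31))
```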
#### Parallel row hashing
We are using a sponge construction (with state size of 12 field elements and rate 8) with the Monolith permutation for linear hashing.
The matrix Merkle trees (original and encoded) are built on top of these row hashes.
As the permutation state of a single hash takes $12\times 8 = 96$ bytes of memory, as long as the number of rows is not too big, we can compute these hashes in parallel even if we access the matrix columnwise: simply process 8 columns (of size $N$) in each step, while keeping all the $N$ hash states ($N\times 96$ bytes) in memory.
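A sketch of this strategy (assuming addition into the rate as the sponge absorption rule; `monolith_perm` is only a placeholder for the real Monolith permutation):

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime

def monolith_perm(state: list[int]) -> list[int]:
    """Placeholder for the Monolith permutation on t = 12 field elements."""
    raise NotImplementedError

def absorb_fat_column(states: list[list[int]], fat_column: list[list[int]]) -> list[list[int]]:
    """Absorb one N x 8 "fat column" into N independent sponge states.

    states     -- N sponge states, each a list of 12 field elements
    fat_column -- N blocks of 8 field elements (one rate-sized block per row)
    """
    new_states = []
    for state, block in zip(states, fat_column):
        rate = [(s + b) % P for s, b in zip(state[:8], block)]   # absorb into the rate part
        new_states.append(monolith_perm(rate + state[8:]))       # one permutation per 8 elements
    return new_states
```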
### Example configuration
As an example, let's aim for 8GB of original data and 8GB of parity, so in total 16GB of data.
With $N=2^{22}$ and $N'=2^{23}$, we need $M=265$ to encode this amount of data (one field element can on average encode 7.75 bytes, as 4 field elements can encode 31 bytes).
With this setup, one (already RS-encoded) column of $N'=2^{23}$ field elements takes 64 megabytes of memory (62MB worth of raw data decoded into 64MB of field elements).
If processing columnwise, doing the RS-encoding and computing the row hashes at the same time, we need to keep the equivalent of $12+8 = 20$ columns in memory simultaneously (12 for the hash states, and 8 for the hash input); that's about 1.25 GB, which seems acceptable (of course more memory is needed for other purposes, but presumably this will dominate).
So it seems to be a reasonable idea to store the original data columnwise.
#### Collecting small files
However, this seems to be in conflict with the main intended use case, namely _collecting small files_: because those are _also_ identified by Merkle roots, and we want to prove _merging_ of such partial datasets (with extremely short Merkle proofs), they must also consist of contiguous _rows_ (of the same size, which is here $M$). But those pieces can still be stored on disk columnwise; this however means more seeking when reading in the merged dataset.
M
/ +-----------------------------+ _
| 2^18 | file #0 | \
| +-----------------------------+ +--+
| 2^18 | file #1 | _/ \
| +-----------------------------+ _ +---+
| 2^18 | file #2 | \ / \
| +-----------------------------+ +--+ \
| 2^18 | file #3 | _/ \
| +-----------------------------+ \
| | | _ \
| 2^19 | file #4 | \ +-- root
N | | | \ /
= | +-----------------------------+ +-+ /
2^22 | | | / \ /
| 2^19 | file #5 | _/ \ /
| | | \ /
| +-----------------------------+ +-+
| | | /
| | | /
| | | /
| 2^21 | file #6 | __/
| | |
| | |
| | |
\ +-----------------------------+
For a single file, it makes sense to store the field elements columnwise. Assuming $N=2^n \ge 4$, we can read $7.75\times N = 31\times 2^{n-2}$ bytes of raw data in a single column.
Even if we normally read 8 columns at the same time, storing fully columnwise still makes sense because this allows a memory tradeoff with arbitrary constituent file sizes (we can read 1/2/4 columns as a contiguous block at a time instead of 8 or more):
+-----+------+------+-- ... --+----------+
| 0 | N | 2N | | (M-1)N |
| 1 | N+1 | 2N+1 | | (M-1)N+1 |
| 2 | N+2 | 2N+2 | | (M-1)N+2 |
| | | | | |
| ... | ... | ... | ... | ... |
| | | | | |
| N-2 | 2N-2 | 3N-2 | | MxN-2 |
| N-1 | 2N-1 | 3N-1 | | MxN-1 |
+-----+------+------+-- ... --+----------+
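Put differently, with a fully column-major layout the position of any matrix element on disk is a trivial formula (a minimal sketch, assuming 8-byte field elements):

```python
def element_offset(row: int, col: int, n_rows: int, elem_bytes: int = 8) -> int:
    """Byte offset of matrix element (row, col) in a column-major file."""
    return (col * n_rows + row) * elem_bytes
```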
We can also collect the small files on SSD and store only the "sealed", erasure-coded merged blocks on HDD.
#### Building the Merkle trees
After computing the $N'\approx 2^{23}$ hashes, each consisting of 4 field elements (32 bytes in memory), what we have is 256MB of data.
We build a binary Merkle tree on the top of that, that's another 256MB (in total 512MB with the leaf hashes).
This (the Merkle tree) we want to keep for the FRI prover. However, we don't want to keep all the actual data in memory, as that is too large. In the future phases (FRI sampling, and later storage proofs), we want to sample individual (random) _rows_. That calls for a different disk layout.
#### Partial on-line transposition
So we are processing the data in $N' \times 8$ "fat columns" (both computing the RS encoding and the row hashes at the same time), but we want to store the result of the RS-encoding in a way which is more suitable for efficient reading of random _rows_.
As we cannot really transpose the whole matrix without either consuming a very large amount of memory (which we don't want) or a very large number of disk seeks (which is _really_ slow), we can only have some kind of trade-off. This is pretty standard in computer science, see for example "cache-oblivious data structures".
We can also use the SSD as a "cache" during the transposition process (as almost all machines these days include at least a small SSD).
Option #1: Cut the $N'\times 8$ "fat columns" into smaller $K\times 8$ blocks, transpose them in-memory one-by-one (so they are now row-major), and write them to disk. To read a row you need to read $8M$ bytes and need $M/8$ seeks (the latter will dominate). No need for temporary SSD space.
Option #2: Do the same, but reorder the $K\times 8$ blocks (using an SSD as temporary storage), such that consecutive rows form a "fat row" of size $K\times M$. This needs $(N'/K)\times(M/8)$ seeks when writing; reading a row then means reading the $M/8$ consecutive blocks containing it, that is, $8\times K\times (M/8)$ field elements, but with no seeking.
M
___________________________________________________________
/ \
8 8 8 8 8
+---------+---------+---------+---------+-- --+---------+
| | | | | | |
K | 0 | 1 | 2 | 3 | ... | M/8-1 |
| | | | | | |
+---------+---------+---------+---------+-- --+---------+
| | | | | | |
K | M/8 | M/8+1 | M/8+2 | M/8+3 | ... | 2M/8-1 |
| | | | | | |
+---------+---------+---------+---------+-- --+---------+
| | | | | | |
For example, choosing $K=2^{10}$ we have 64kB blocks of size $K\times 8$ elements, and reading a row requires reading $8\times K\times M \approx 2\textrm{ MB}$ of data.
Creating this "semi-transposed" structure takes $(N'/K)\times(M/8)\approx 300k$ seeks, which should hopefully be acceptable on an SSD; the result can then be copied to an HDD linearly.
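A small sketch reproducing these numbers (the parameters are the ones from the example configuration above; the block-count arithmetic is only approximate when $M$ is not a multiple of 8):

```python
def semi_transposed_costs(N_enc: int, M: int, K: int, elem_bytes: int = 8):
    """Rough cost model of the K x 8 block layout (Option #2)."""
    block_bytes    = K * 8 * elem_bytes        # one K x 8 block, stored contiguously
    write_seeks    = (N_enc // K) * (M // 8)   # reordering the blocks via the SSD
    row_read_bytes = 8 * K * M                 # one "fat row", read linearly
    return block_bytes, write_seeks, row_read_bytes

# Example: N' = 2^23, M = 265, K = 2^10
print(semi_transposed_costs(2**23, 265, 2**10))
# roughly: 64 kB blocks, ~270k seeks to create, ~2.2 MB read per sampled row
```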


@ -152,7 +152,20 @@ NR-1\, \}
where $R=2^r=1/\rho$ is the expansion ratio.
Note: using $r>1$ (so $R>2$ or $\rho<1/2$) in the FRI protocol may make sense even if we only keep a smaller amount of the parity data at the end (to limit the storage overhead), as it may improve soundness or decrease proof size (while also making the proof time longer). But then this needs to be re-indexed so that $\mathsf{data}$, $\mathsf{parity_1}$ etc become subtrees. This can be calculated as follows:
\begin{align*}
\mathsf{idx}_{\mathsf{Merkle}} &=
\left\lfloor \frac{\mathsf{idx}_{\mathsf{natural}}}{R} \right\rfloor
+ N\,\times\, \mathrm{mod}\big(\mathsf{idx}_{\mathsf{natural}}\,,\,R\big) \\
\mathsf{idx}_{\mathsf{natural}} &=
\left\lfloor \frac{\mathsf{idx}_{\mathsf{Merkle}}}{N} \right\rfloor
+ R\,\times\, \mathrm{mod}\big(\mathsf{idx}_{\mathsf{Merkle}}\,,\,N\big)
\end{align*}
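The same re-indexing in code (a minimal sketch; `N` is the number of original rows and `R` the expansion ratio):

```python
def natural_to_merkle(idx_natural: int, N: int, R: int) -> int:
    """Row index in the Merkle order, where data, parity_1, ... form subtrees."""
    return idx_natural // R + N * (idx_natural % R)

def merkle_to_natural(idx_merkle: int, N: int, R: int) -> int:
    """Inverse mapping, back to the natural (interleaved) row order."""
    return idx_merkle // N + R * (idx_merkle % N)

# round-trip sanity check on a small example
N, R = 8, 4
assert all(merkle_to_natural(natural_to_merkle(i, N, R), N, R) == i for i in range(N * R))
```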
**WARNING:** Keeping only a smaller amount of the parity data may have non-trivial security implications.
#### Commit phase vectors ordering


@ -5,13 +5,15 @@ The purpose of local erasure coding (we used to call this "slot-level EC" in old
The core principle behind this idea is the distance amplification property of Reed-Solomon codes.
The concept is simple: If we encode $K$ data symbols into $N>K$ code symbols, then for the data to be irrecoverably lost, you need to lose at least $N-K+1$ symbols (it's a bit more complicated with data corruption, but let's ignore that for now). In a typical case of $N=2K$, this means that checking just one randomly chosen symbol gives you approximately $p\approx 1/2$ probability of detecting _data loss_.
Technical remark: Obviously the (encoded) data can still be corrupted. The idea here is that even if there is corruption or loss, in theory the provider _could_ reconstruct the original data _if they wanted_, hence the _data itself_ is in principle not lost.
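To make the sampling argument a bit more precise: if the data is irrecoverably lost, then at most $K-1$ of the $N$ code symbols survive, so the probability that $q$ independently and uniformly sampled symbols all miss the damage is at most
$$ \Pr[\textrm{loss not detected}] \;\le\; \left(\frac{K-1}{N}\right)^{q} \;<\; 2^{-q} \qquad \textrm{for } N=2K. $$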
### Outsourcing to the provider
In "old Codex", this encoding (together with the network-level erasure coding) was done by the client before uploading.
However, it would be preferable to outsource the local encoding to the providers, for several reasons:
- the providers typically have more computational resources than the clients (especially if the client is for example a mobile phone)
- because the network chunks are held by different providers, the work could be distributed among several providers, further decreasing the per-person work
@ -33,7 +35,7 @@ The FRI protocol (short for "Fast Reed-Solomon Interactive Oracle Proofs of Prox
Note that obviously we cannot do better than "close to" without checking every single element of the vector (which again obviously wouldn't be a useful approach), so "close to" must be an acceptable compromise.
However, in the ideal situation, if the precise distance bound $\delta$ of the "close to" concept is small enough, then there is exactly 1 codeword within that radius ("unique decoding regime"). In that situation errors in the codeword can be corrected simply by replacing it with the closest codeword. A somewhat more complicated situation is the so-called "list decoding regime". For more details see the accompanying document about security/soundness.
As usual we can make this non-interactive via the Fiat-Shamir heuristic.
@ -46,23 +48,23 @@ This gives us a relatively simple plan of attack:
- the provider also distributes the Merkle root of the parity data together with the FRI proof. This is the proof (a singleton Merkle path) connecting the original data Merkle root and the codeword Merkle root (storage proofs will be validated against the latter)
- the metadata is updated: the new content ID will be the Merkle root of the codeword, against which storage proofs will be required in the future. Of course one will also need a mapping from the original content ID(s) to the new locations (root + pointer(s) inside the RS encoded data)
Remark: If the 2x storage overhead is too big, after executing the protocol, we may try to trim some of the parity (say 50% of it). You can probably still prove some new Merkle root with a little bit of care, but non-power-of-two sizes make everything more complicated. WARNING: This "truncation" may have serious implications on the security of the whole protocol!
Remark #2: There are also improved versions of the FRI protocol like STIR and WHIR. I believe those can be used in the same way. But as FRI is already rather complicated, for now let's just concentrate on that.
### Batching
FRI is a relatively expensive protocol. I expect this proposal to work well for say up to 1 GB data sizes, and acceptably up to 10 GB of data. But directly executing FRI on such data sizes would presumably be very expensive.
Fortunately, FRI can be batched in a very simple way: Suppose you want to prove that $M$ vectors $v_i\in \mathbb{F}^N$ (for $0\le i <M$) are all codewords. To do that, just consider the random linear combination (recall that Reed-Solomon is a linear code):
$$ V := \sum_{i=0}^{M-1} \alpha^i v_i $$
with a randomly chosen $0\neq\alpha\in\mathbb{F}$ (chosen by the verifier or via Fiat-Shamir). Intuitively, it's very unlikely that any of the $v_i$ is _not a codeword_ but $V$ is (this can be quantified precisely). So it's enough to run the FRI protocol on the combined vector $V$.
Note: If the field is not big enough, you may need to either repeat this with several different $\alpha$-s, or consider a field extension. This is the case for example with the Goldilocks field, which has size $|\mathbb{F}|\approx 2^{64}$. Plonky2 for example chooses $\alpha\in\widetilde{\mathbb{F}}$ from a degree two field extension $\widetilde{\mathbb{F}}$ (so approx. 128 bits), which is big enough for all practical purposes. FRI is then executed in that bigger field.
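A sketch of computing this combination (shown over the base field for brevity; as noted above, in practice $\alpha$ and the combined vector live in the quadratic extension field):

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime

def batch_columns(columns: list[list[int]], alpha: int) -> list[int]:
    """Combine the vectors v_0, ..., v_{M-1} into V = sum_i alpha^i * v_i."""
    n = len(columns[0])
    V = [0] * n
    coeff = 1                                   # alpha^0
    for v in columns:
        V = [(acc + coeff * x) % P for acc, x in zip(V, v)]
        coeff = (coeff * alpha) % P
    return V
```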
This batching approach has another nice consequence: Now instead of doing one big RS encoding, we have to do many smaller ones. This is good, because:
- that's always faster (because of the $O(N\log(N))$ scaling of FFT)
- it needs much less memory
@ -76,7 +78,7 @@ We need to choose a prime field (but see below) for the Reed-Solomon encoding, a
Just for executing the FRI protocol, the hash function could be any (cryptographic) hash, and we could even use different hash functions for constructing the row hashes and the Merkle tree. However, if in the future we want to do recursive proof aggregation, then since in that situation the Merkle path proofs need to be calculated inside ZK too, it's better to choose a ZK-friendly hash.
With these in mind, a reasonable choice seems to be the Goldilocks field ($p=2^{64}-2^{32}+1$) and the Monolith hash function (which is one of the fastest ZK-friendly hashes). This way the Reed-Solomon encoding and the hash function use a compatible structure.
Remark: While in principle both FFT and FRI should work with a binary field instead of a prime field (see e.g. FRI-Binius), I'm not at all familiar with those variations, so let's leave that for future work. Also, if we want to do recursive proof aggregation, again using prime fields for such proof systems is more common (but again, in principle that should be possible too with a binary field).
@ -84,19 +86,19 @@ Remark: While in principle both FFT and FRI should work with a binary field inst
So the basic idea is to organize the data into a $2^n\times M$ matrix of field elements.
Then extend each column via a rate $\rho=1/2\,$ Reed-Solomon code to a matrix of size $2^{n+1}\times M$, so that the top half is the original data, and the bottom half is parity.
Then hash each row, and build a Merkle tree on top of the row hashes (this is the structure we need for the batched FRI argument).
#### Row hashes
However, we have a lot of freedom on how to construct the row hashes. The simplest is of course just a linear (sequential) hash. This is efficient (linear sponge hash with $t=12$ should be faster than building a Merkle tree, as you can consume 8 field elements with a single permutation call, instead of an average 4 with a binary Merkle tree); however a disadvantage is that a Merkle path proof (when doing random sampling for storage proofs) needs to include a whole row, which can be potentially a large amount of data if $M$ is big.
The other end of the spectrum is to use a Merkle tree over the rows too. Then the Merkle path is really only a Merkle path. However, to make this reasonably simple, we need $M$ to be a power of two.
We can also go anywhere in between: Split the row into say 8, 16 or 32 chunks; hash those individually; and build a tiny Merkle tree (of depth 3, 4 or 5, respectively) on top of them. The Merkle root of this small tree will be the row hash. This looks like a nice compromise with the good properties of both extreme cases, while also keeping the Merkle trees complete binary trees (that is, power-of-two number of leaves).
A **problem** though with this approach is that a single Merkle path doesn't really _prove_ the full file, except when you also include the full row data, which can be much bigger than the Merkle path itself... This is because if each column is Reed-Solomon encoded independently, then you have to prove all of them independently...
So maybe actually including full rows is the right approach, and we just have to accept larger proofs (e.g. with 2048-byte rows and a 16GB dataset, we have a Merkle tree of depth 23, that is, a Merkle proof is 736 bytes; together with the 2048-byte row data that is 2784 bytes of proof per sample).
@ -128,11 +130,13 @@ Another subtle issue is how to order this data on the disk. Unfortunately spinni
What are our typical access patterns?
- to do the FFT encoding, we need to access all the columns, independently (and then they will be processed in parallel)
- to do the query phase of the FRI protocol, we need to access some randomly selected rows (maybe 50--100 of them)
- when doing the randomly sampled "storage proofs" (which are supposed to happen periodically!), again we need to access random rows
Accessing rows efficiently and accessing columns efficiently are pretty much contradictory requirements...
(See also the separate "disk layout" document here!)
Maybe a compromise could be something like a "cache-oblivious" layout on disk: for example, partition the matrix into medium-sized squares, so that both row and column access are somewhat painful, but neither of them too painful.
On the other hand, ideally the encoding and the FRI proof are done only once, while storage proofs are done periodically, so maybe row access should have priority? It's a bit hard to estimate the cost-benefit profile of this; it also depends on the average lifetime of the stored data. We have a rather heavy one-time cost, and a rather light but periodically occurring cost...


@ -1,9 +1,227 @@
The RS outsourcing protocol
---------------------------
The Reed-Solomon outsourcing protocol is an interactive protocol to convince a client that an untrusted server applied Reed-Solomon encoding to the client's data correctly.
More precisely, the data is organized as a matrix; it is committed to by a Merkle root built on top of row hashes; and the Reed-Solomon encoding is applied to each column independently.
The client uploads the data, and the server replies with the Merkle root of the RS-encoded data and a proof connecting the Merkle root of the original data with the Merkle root of the encoded data (and also proving the correctness of the encoding).
As usual, the protocol can be made mostly non-interactive; but still there are two phases of communication: data upload by the client, and then a server reply with a proof.
Note: In the proposed use case, there isn't really a "client", and the original data is already committed; however, it's easier to describe the protocol in a two-party setting (which could also be useful in other settings).
The protocol has three major parts:
- data preparation
- the FRI protocol to prove that the encoded columns are (close to) codewords
- connecting the original data to the encoded data
Furthermore we need to specify the conventions used, for example:
- encoding of raw data into field elements
- hash function used to hash the rows
- Merkle tree construction
### Global parameters
The client and the server need to agree on the global parameters. These include:
- the size and shape of the data to be encoded
- the rate $\rho = 2^{-r}$ of the Reed-Solomon coding
- all the parameters of the FRI proof
We use the Goldilocks prime field $p=2^{64}-2^{32}+1$. In the FRI protocol we also use the quadratic field extension $\widetilde{\mathbb{F}} = \mathbb{F}_p[X]/(X^2-7)$; this particular extension was chosen mainly for compatibility reasons: $p(X) = X^2\pm 7$ are the two "simplest" irreducible polynomials over $\mathbb{F}_p$, and $X^2-7$ was chosen by Plonky2 and Plonky3.
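For concreteness, arithmetic in this quadratic extension looks as follows (a minimal sketch; an element $a_0 + a_1 X$ is represented as the pair `(a0, a1)`, and we use $X^2 = 7$):

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime

def ext_add(a, b):
    return ((a[0] + b[0]) % P, (a[1] + b[1]) % P)

def ext_mul(a, b):
    # (a0 + a1*X) * (b0 + b1*X) = (a0*b0 + 7*a1*b1) + (a0*b1 + a1*b0)*X
    return ((a[0] * b[0] + 7 * a[1] * b[1]) % P,
            (a[0] * b[1] + a[1] * b[0]) % P)
```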
We use the [Monolith](https://eprint.iacr.org/2023/1025) hash function. For linear hashing, we use the sponge construction, with state size `t = 12` and `rate = 8`, the `10*` padding strategy, and a custom IV (for domain separation).
For Merkle trees we use a custom construction with a keyed compression function based on the Monolith permutation.
### Data preparation
We need to convert the linear sequence of bytes on some harddrive into a matrix of field elements. There are many ways to do that, and some important trade-offs to decide about.
TODO.
See also the separate disk layout document.
### Connecting the original data to the encoded data
As mentioned above, the original data is encoded as a matrix $D\in\mathbb{F}^{N\times M}$ of field elements, and the columnwise RS-encoded data by another such matrix $A\in\mathbb{F}^{N'\times M}$ of size $N'\times M$. Ideally both $N$ and $N'$ are powers of two; however, in practice we may require trade-offs where this is not satisfied.
Both matrices are committed by a Merkle root, with the binary Merkle tree built over the (linear) hashes of the matrix rows.
#### The ideal case
In the simplest, ideal case, we have $N=2^n$ and $N'=N/\rho=2^{n+r}$ where the code rate $\rho=2^{-r}$.
The encoded data can be partitioned as $(D;P_1,P_2,\dots,P_{R-1})$ where $D\in \mathbb{F}^{N\times M}$ is the original data, and $P_i\in \mathbb{F}^{N\times M}$ are the parity data (here $R=2^r=\rho^{-1}$).
Each of these $R$ matrices can be committed with a Merkle root $\mathsf{h}_i\in\mathcal{H}$; the commitment of the encoded data $A$ (the vertical concatenation of these matrices) will then be the root of the Merkle tree built on these $R=2^r$ roots.
Here is an ASCII diagram for $\rho = 1/4$:
root(A)
/ \
/ \
/ \
/\ /\
/ \ / \
root(D) = h0 h1 h2 h3 = root(P3)
/\ /\ /\ /\
/ \ / \ / \ / \
/____\ /____\ /____\ /____\
D P1 P2 P3
We want to connect the commitment of the original data $\mathsf{root}(D)=\mathsf{h}_0$ with the commitment of the encoded data $\mathsf{root}(A)$.
Note that $(\mathsf{h}_1,\mathsf{h}_2,\mathsf{h}_3)\in\mathcal{H}^3$ form a Merkle proof for this connection! To verify, we only need to check that:
$$ \mathsf{root}(A) \;=\; \mathsf{hash}\Big(\;\mathsf{hash}\big(
\mathsf{root}(D)\,\|\,\mathsf{h}_1\big) \;\big\|\;
\mathsf{hash}\big(\mathsf{h}_2\,\|\,\mathsf{h}_3\big) \;\Big)
$$
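A sketch of this check (with `hash2` standing in for the keyed, Monolith-based two-to-one compression function; the exact argument and padding conventions are not specified here):

```python
def hash2(left: bytes, right: bytes) -> bytes:
    """Placeholder for the Merkle compression function (Monolith-based in practice)."""
    raise NotImplementedError

def verify_connection(root_A: bytes, root_D: bytes, h1: bytes, h2: bytes, h3: bytes) -> bool:
    """Check that root(D) is the leftmost of the R = 4 subtree roots under root(A)."""
    return root_A == hash2(hash2(root_D, h1), hash2(h2, h3))
```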
#### What if $N$ is not a power of two?
We may want to allow the number of rows $N$ of the original data matrix to not be a power of two.
Reasons to do this include:
- finer control over data size than just changing $M$
- having some other restrictions on the number of columns $M$ and wishing for less waste of space
As we want to use the Fast Fourier Transform for the encoding, the simplest solution is to pad to the next power of two $2^n$, for example with zeros.
We don't have to store these zeros on disk, they can always be "added" at runtime. By default, the size of the parity data will still be $(\rho^{-1}-1)\times 2^n$ rows though.
#### What if $N'$ is not a power of two?
This is more interesting. In the above ideal setting, we allow $\rho=1/2^r$, however in practice that means $\rho=1/2$, as already $\rho=1/4$ would mean a 4x storage overhead!
But even in the only practical case we have a 2x overhead, and we may want a smaller one, say 1.5x.
Unfortunately, the FRI protocol as described only works on power of two sizes, so in this case we would still do the encoding with $\rho=1/2$, but then discard half of the parity data (**WARNING!** Doing this may have serious security implications, which we ignore here).
In this case we will have to connect _three_ Merkle roots:
- the original data $D$;
- the RS codewords $E$ with $\rho=1/2$ (recall that FRI itself is also a proof against a Merkle commitment);
- and the truncated codewords (with effective $\rho=2/3$) $A$.
In glorious ASCII, this would look something like this:
root(E) = h0123
/\
/ \
/ \
/ \
root(D) = h01 -- r(A) h23
/\ \ /\
/ \ \ / \
h0 h1 h2 h3 = root(P3)
/\ /\ /\ /\
/ \ / \ / \ / \
/____\ /____\ /____\ /____\
D0 D1 P2 P3
\____________/ \____________/
data D parity
\____________________/
truncated A
Here the public information is $\mathsf{root}(D)=\mathsf{h}_{01}$ and
$$\mathsf{root}(A) \;:=\; \mathsf{hash}\big (\mathsf{root}(D)\,\|\,\mathsf{root}(P_2)\big) \;=\; \mathsf{hash}(\mathsf{h}_{01}\|\mathsf{h}_2)
$$
The connection proof will then consist of $\mathsf{h}_{2,3}=\mathsf{root}(P_{2,3})$ and $\mathsf{root}(E)=\mathsf{h}_{0123}$; and we can then check that:
$$
\begin{align*}
\mathsf{root}(A) &= \mathsf{hash}\big(\;
\mathsf{h_{01}}\;\|\; \mathsf{h_2}\;\big) \\
\mathsf{root}(E) &= \mathsf{hash}\big(\;\mathsf{h}_{01} \;\|\; \mathsf{hash}(\mathsf{h}_{2}\|\mathsf{h}_3)\;\big)
\end{align*}
$$
and that $E$ is really a matrix of codewords.
#### What if we want $\rho^{-1}>2$, that is, a larger code?
The security reasoning of the FRI protocol is very involved (see the corresponding security document), and we may want the codeword to be significantly larger because of that; for example $\rho=1/8$.
From the "connecting original to encoded" point of view, this situation is similar to the previous one.
### Batched FRI protocol
The FRI protocol we use is essentially the same as the one in Plonky2, which is furthermore essentially the same as in the paper ["DEEP-FRI: Sampling outside the box improves soundness"](https://eprint.iacr.org/2019/336) by Eli Ben-Sasson et al.
Setup: We have a matrix of Goldilocks field elements of size $N\times M$ with $N=2^n$ being a power of two. We encode each column with Reed-Solomon encoding into size $N/\rho$ (also assumed to be a power of two), interpreting the data as the values of a polynomial on a coset $\mathcal{C}\subset \mathbb{F}^\times$, and the codeword as its values on a larger coset $\mathcal{C}'\supset\mathcal{C}$.
The protocol proves that (the columns of) the matrix are "close to" Reed-Solomon codewords (in a precise sense).
**The prover's side:**
- the prover and the verifier agree on the public parameters
- the prover computes the Reed-Solomon encoding of the columns, and commits to the encoded (larger) matrix with a Merkle root (or Merkle cap)
- the verifier samples a random $\alpha\in\widetilde{\mathbb{F}}$ combining coefficient
- the prover computes the linear combination of the RS-encoded columns with coefficients $1,\alpha,\alpha^2,\dots,\alpha^{M-1}$
- "commit phase": the prover repeatedly
- commits the current vector of values
- the verifier chooses a random $\beta_k\in\widetilde{\mathbb{F}}$ folding coefficient
- "folds" the polynomial with the pre-agreed folding arity $A_k = 2^{a_k}$
- evaluates the folded polynomial on the evaluation domain $\mathcal{D}_{k+1} = \mathcal{D}_{k} ^ {A_k}$
- until the degree of the folded polynomial becomes small enough
- then the final polynomial is sent in the clear (by its coefficients)
- an optional proof-of-work "grinding" is performed by the prover
- the verifier samples random row indices $0 \le \mathsf{idx}_j < N/\rho$ for $0\le j < n_{\mathrm{rounds}}$
- "query phase": repeatedly (by the pre-agreed number $n_{\mathrm{rounds}}$ of times)
- the provers sends the full row corresponding the index $\mathsf{idx}_j$, together with a Merkle proof (against the Merkle root of the encoded matrix)
- repeatedly (for each folding step):
- extract the folding coset including the "upstream index" from the folded encoded vector
- send it together with a Merkle proof against the corresponding commit phase Merkle root
- serialize the proof into a bytestring
**The verifier's side:**
- deserialize the proof from a bytestring
- check the "shape" of the proof data structure against the global parameters:
- all the global parameters match the expected (if included in the proof)
- merkle cap sizes
- number of folding steps and folding arities
- number of commit phase Merkle caps
- degree of the final polynomial
- all Merkle proof lengths
- number of query rounds
- number of steps in each query round
- opened matrix row sizes
- opened folding coset sizes
- compute all the FRI challenges from the transcript:
  - combining coefficient $\alpha\in\widetilde{\mathbb{F}}$
  - folding coefficients $\beta_k\in\widetilde{\mathbb{F}}$
  - grinding PoW response
  - query indices $0 \le \mathsf{idx}_j < N/\rho$
- check the grinding proof-of-work condition
- for each query round:
  - check the row opening Merkle proof
  - compute the combined "upstream value" $u_0 = \sum_j \alpha^j\cdot \mathsf{row}_j \in\widetilde{\mathbb{F}}$
  - for each folding step:
    - check the "upstream value" $u_k\in\widetilde{\mathbb{F}}$ against the corresponding element in the opened coset
    - check the Merkle proof of the opened folding coset values
    - compute the "downstream value" $u_{k+1}\in\widetilde{\mathbb{F}}$ from the coset values using the folding coefficient $\beta_k$ (for example by applying an IFFT on the values and linearly combining the result with powers of $\beta_k$; a sketch of the simplest arity-2 case is given below)
  - check the final downstream value against the evaluation of the final polynomial at the right location
- accept if all checks passed.
This concludes the batched FRI protocol.
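As an illustration of the folding check, here is the simplest, arity-2 folding step, shown over the base field for brevity (in the real protocol the values and $\beta_k$ live in the quadratic extension field):

```python
P = 2**64 - 2**32 + 1   # Goldilocks prime

def fold2(f_x: int, f_minus_x: int, x: int, beta: int) -> int:
    """Arity-2 FRI fold: from the evaluations f(x) and f(-x), compute the value
    of the folded polynomial g at x^2, where g = f_even + beta * f_odd, that is

        g(x^2) = (f(x) + f(-x)) / 2  +  beta * (f(x) - f(-x)) / (2x)
    """
    inv2  = pow(2, P - 2, P)           # 1/2 in the field
    inv2x = pow(2 * x % P, P - 2, P)   # 1/(2x) in the field
    even = (f_x + f_minus_x) * inv2 % P
    odd  = (f_x - f_minus_x) * inv2x % P
    return (even + beta * odd) % P
```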
### Summary
We have to prove two things:
- that commitment to the "encoded data" really corresponds to something which looks like a set of Reed-Solomon codewords
- and that it is really an encoding of the original data; since we use a so-called "systematic code", in practice this means that the original data is contained in the encoded data
The first can be proven using the FRI protocol, and the second via a very simple Merkle-proof-type argument.
There are a lot of complications in the details, starting from how to encode the data into a matrix of field elements (important because of performance considerations) to all the peculiar details of the (optimized) FRI protocol.
- E. Ben-Sasson, L. Goldberg, S. Kopparty, and S. Saraf: _"DEEP-FRI: Sampling outside the box improves soundness"_