diff --git a/docs/Disk_layout.md b/docs/Disk_layout.md
new file mode 100644
index 0000000..e2c02ed
--- /dev/null
+++ b/docs/Disk_layout.md
@@ -0,0 +1,137 @@
+Disk layout
+-----------
+
+We hope that this protocol will be feasible for data sizes up to at least \~10GB (for example 8GB original data + 8GB parity, 16GB in total).
+
+However, we don't want to keep such an amount of data entirely in memory, for several reasons:
+
+- we want to support relatively low-spec hardware
+- the machine may want to do other things at the same time (this is supposed to be a background process)
+- we may want to support parallelism
+
+So it would be good if the memory consumption could be bounded (with the maximum residency being a configuration parameter).
+
+Recall that semantically, all our data is organized as _matrices_ (2 dimensional); however, data as stored on disk is 1 dimensional. These two facts are in conflict with each other, especially as we have different access patterns:
+
+- when the client computes the Merkle root of the original data, that's built on top of _matrix rows_
+- during the Reed-Solomon encoding itself, we access the _matrix columns_ independently (and the CPU computation can be done in parallel)
+- when computing the Merkle tree of the encoded data, that's again _rows_
+- during the FRI proof, we sample random _matrix rows_
+- during future storage proofs, again one or a few _rows_ are sampled
+
+We also expect people to use HDDs (spinning disks, as opposed to SSDs), especially if the network data storage scales up to non-trivial sizes, as HDDs are much cheaper. Unfortunately, spinning disks are very slow, both in linear read/write (about 200-300 MB/sec) and seeking (up to 1000 seeks/sec); and disk access, unlike CPU tasks, cannot be parallelized.
+
+This means disk layout is critical for performance.
+
+#### Data Conversion
+
+We can encode 31 bytes into 4 Goldilocks field elements. That's about 10% more efficient than encoding 7 bytes into a single field element, while still quite simple, so we should do that (a free 10% is a free 10%!).
+
+#### Parallel row hashing
+
+We are using a sponge construction (with a state size of 12 field elements and rate 8) with the Monolith permutation for linear hashing.
+
+The matrix Merkle trees (original and encoded) are built on top of these row hashes.
+
+As the permutation state of a single hash is encoded in $12\times 8 = 96$ bytes of memory, as long as the number of rows is not too big, we can compute these hashes in parallel even if we access the matrix columnwise: simply process 8 columns (of size $N$) per step, while keeping all the $N$ hash states ($N\times 96$ bytes) in memory.
+
+### Example configuration
+
+As an example, let's aim for 8GB original data and 8GB parity, so 16GB of data in total.
+
+With $N=2^{22}$ and $N'=2^{23}$, we need $M=265$ columns to encode this amount of data (one field element encodes 7.75 bytes on average, as 4 field elements encode 31 bytes).
+
+With this setup, one (already RS-encoded) column takes 64 megabytes of memory (62MB of raw data becomes 64MB of field elements).
+
+If processing columnwise, doing the RS-encoding and computing the row hashes, we need to keep $12+8 = 20$ columns in memory at the same time (12 for the hash states, and 8 for the hash input); that's about 1.25 GB, which seems acceptable (of course more memory is needed for other purposes, but presumably this will dominate).
+
+So it seems to be a reasonable idea to store the original data columnwise.
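A quick back-of-the-envelope check of these numbers (a sketch only: the 7.75 bytes/element packing and the 12+8 column residency are taken from the sections above, and the variable names are purely illustrative):

```python
# Sanity check of the example configuration above (illustrative parameter names).
import math

GiB = 2**30
FIELD_ELEMENT_BYTES = 8            # one Goldilocks element held in memory
RAW_BYTES_PER_ELEMENT = 31 / 4     # 7.75 raw bytes encoded per field element

N       = 2**22                    # rows of original data
N_PRIME = 2**23                    # rows after rate-1/2 Reed-Solomon encoding
data    = 8 * GiB                  # original data size

# number of columns needed to hold the original data
M = math.ceil(data / (N * RAW_BYTES_PER_ELEMENT))      # = 265

# one RS-encoded column, kept in memory as field elements
column_bytes = N_PRIME * FIELD_ELEMENT_BYTES           # = 64 MiB

# columnwise processing: 12 columns of hash state + 8 columns of hash input
residency = (12 + 8) * column_bytes                    # ~1.25 GiB

print(M, column_bytes // 2**20, residency / GiB)       # 265 64 1.25
```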
+
+#### Collecting small files
+
+However, this seems to be in conflict with the main intended use case, namely _collecting small files_: because those are _also_ identified by Merkle roots, and we want to prove _merging_ of such partial datasets (with extremely short Merkle proofs), those must also consist of contiguous _rows_ (all of the same size, which is here $M$). But those pieces can still be stored on disk columnwise - this however means more seeking when reading in the merged dataset.
+
+                            M
+       /       +-----------------------------+          _
+       |  2^18 |           file #0           |           \
+       |       +-----------------------------+            +--+
+       |  2^18 |           file #1           |          _/    \
+       |       +-----------------------------+       _         +---+
+       |  2^18 |           file #2           |        \       /     \
+       |       +-----------------------------+         +--+          \
+       |  2^18 |           file #3           |       _/               \
+       |       +-----------------------------+                         \
+       |       |                             |       _                  \
+       |  2^19 |           file #4           |        \                   +-- root
+   N   |       |                             |         \                 /
+   =   |       +-----------------------------+          +-+             /
+  2^22 |       |                             |         /    \          /
+       |  2^19 |           file #5           |       _/      \        /
+       |       |                             |                \      /
+       |       +-----------------------------+                 +-+
+       |       |                             |                /
+       |       |                             |               /
+       |       |                             |              /
+       |  2^21 |           file #6           |           __/
+       |       |                             |
+       |       |                             |
+       |       |                             |
+       \       +-----------------------------+
+
+For a single file, it makes sense to store the field elements columnwise. Assuming $N=2^n \ge 4$, we can read $7.75\times N = 31\times 2^{n-2}$ bytes in a column.
+
+Even if we normally read 8 columns at the same time, storing fully columnwise still makes sense because this allows a memory tradeoff with arbitrary constituent file sizes (we can read 1/2/4 columns as a contiguous block at a time instead of 8 or more):
+
+     +-----+------+------+-- ... --+----------+
+     |  0  |  N   |  2N  |         | (M-1)N   |
+     |  1  | N+1  | 2N+1 |         | (M-1)N+1 |
+     |  2  | N+2  | 2N+2 |         | (M-1)N+2 |
+     |     |      |      |         |          |
+     | ... | ...  | ...  |   ...   |   ...    |
+     |     |      |      |         |          |
+     | N-2 | 2N-2 | 3N-2 |         |  MxN-2   |
+     | N-1 | 2N-1 | 3N-1 |         |  MxN-1   |
+     +-----+------+------+-- ... --+----------+
+
+We can also collect the small files on SSD and store only the "sealed", erasure-coded merged blocks on HDD.
+
+#### Building the Merkle trees
+
+After computing the $N'\approx 2^{23}$ hashes, each consisting of 4 field elements (32 bytes in memory), what we have is 256MB of data.
+
+We build a binary Merkle tree on top of that; that's another 256MB (512MB in total with the leaf hashes).
+
+This (the Merkle tree) we want to keep for the FRI prover. However, we don't want to keep all the actual data in memory, as that is too large. In the future phases (FRI sampling, and later storage proofs), we want to sample individual (random) _rows_. That calls for a different disk layout.
+
+#### Partial on-line transposition
+
+So we are processing the data in $N' \times 8$ "fat columns" (computing both the RS encoding and the row hashes at the same time), but we want to store the result of the RS-encoding in a way which is more suitable for efficient reading of random _rows_.
+
+As we cannot really transpose the whole matrix without either consuming a very large amount of memory (which we don't want) or a very large number of disk seeks (which is _really_ slow), we can only have some kind of trade-off. This is pretty standard in computer science; see for example "cache-oblivious data structures".
+
+We can also use the SSD as a "cache" during the transposition process (as almost all machines these days include at least a small SSD).
+
+Option #1: Cut the $N'\times 8$ "fat columns" into smaller $K\times 8$ blocks, transpose them in-memory one-by-one (so they are now row-major), and write them to disk. To read a row you then need to read $8M$ bytes and do $M/8$ seeks (the latter will dominate). No temporary SSD space is needed.
+
+Option #2: Do the same, but reorder the $K\times 8$ blocks (using an SSD as temporary storage), such that consecutive rows form a "fat row" of size $K\times M$. This needs $(N'/K)\times(M/8)$ seeks when writing, and reading a row means reading the whole fat row of $8\times K\times M$ bytes, but no seeks when reading.
+
+                                 M
+      ___________________________________________________________
+     /                                                           \
+          8         8         8         8                 8
+     +---------+---------+---------+---------+--     --+---------+
+     |         |         |         |         |         |         |
+   K |    0    |    1    |    2    |    3    |   ...   |  M/8-1  |
+     |         |         |         |         |         |         |
+     +---------+---------+---------+---------+--     --+---------+
+     |         |         |         |         |         |         |
+   K |   M/8   |  M/8+1  |  M/8+2  |  M/8+3  |   ...   | 2M/8-1  |
+     |         |         |         |         |         |         |
+     +---------+---------+---------+---------+--     --+---------+
+     |         |         |         |         |         |         |
+
+For example, choosing $K=2^{10}$ we have 64KB blocks of size $K\times 8$, and reading a row requires reading $8\times K\times M \approx 2\textrm{ MB}$ of data.
+
+Creating this "semi-transposed" structure takes $(N'/K)\times(M/8)\approx 300k$ seeks, which should hopefully be acceptable on an SSD; the result can then be copied to an HDD linearly.
+
diff --git a/docs/FRI_details.md b/docs/FRI_details.md
index f4aba9f..a94c7a7 100644
--- a/docs/FRI_details.md
+++ b/docs/FRI_details.md
@@ -152,7 +152,20 @@ NR-1\, \}

 where $R=2^r=1/\rho$ is the expansion ratio.

-Note: using $r>1$ (so $R>2$ or $\rho<1/2$) in the FRI protocol may make sense even if we only keep a smaller amount of the parity data at the end (to limit the storage overhead), as it may improve soundness or decrease proof size (while also making the proof time longer).
+But then this needs to be re-indexed so that $\mathsf{data}$, $\mathsf{parity}_1$ etc. become subtrees. This can be calculated as follows:
+
+\begin{align*}
+\mathsf{idx}_{\mathsf{Merkle}} &=
+  \left\lfloor \frac{\mathsf{idx}_{\mathsf{natural}}}{R} \right\rfloor
+  + N\,\times\,\textrm{mod}\big(\mathsf{idx}_{\mathsf{natural}}\,,\,R\big) \\
+\mathsf{idx}_{\mathsf{natural}} &=
+  \left\lfloor \frac{\mathsf{idx}_{\mathsf{Merkle}}}{N} \right\rfloor
+  + R\,\times\,\textrm{mod}\big(\mathsf{idx}_{\mathsf{Merkle}}\,,\,N\big)
+\end{align*}
+
+Note: using $r>1$ (so $R>2$ or $\rho<1/2$) in the FRI protocol may make sense even if we only keep a smaller amount of the parity data at the end (to limit the storage overhead), as it may improve soundness or decrease proof size (while also making the proving time longer). **WARNING:** This may have non-trivial security implications.

 #### Commit phase vectors ordering
diff --git a/docs/Overview.md b/docs/Overview.md
index 4be757e..20a746d 100644
--- a/docs/Overview.md
+++ b/docs/Overview.md
@@ -5,13 +5,15 @@ The purpose of local erasure coding (we used to call this "slot-level EC" in old

 The core principle behind this idea is the distance amplification property of Reed-Solomon codes.

-The concept is simple: If we encode $K$ data symbols into $N$ code symbols, then for the data to be irrecoverably lost, you need to lose at least $N-K+1$ symbols (it's a bit more complicated with data corruption, but let's ignore that for now). In a typical case of $N=2K$, this means that checking just one randomly chosen symbol gives you approximately $p\approx 1/2$ probability of detecting data loss.
+The concept is simple: If we encode $K$ data symbols into $N>K$ code symbols, then for the data to be irrecoverably lost, you need to lose at least $N-K+1$ symbols (it's a bit more complicated with data corruption, but let's ignore that for now). In a typical case of $N=2K$, this means that checking just one randomly chosen symbol gives you approximately $p\approx 1/2$ probability of detecting _data loss_.
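To make the distance amplification argument concrete, here is a toy calculation (illustration only; the helper function and the parameters are made up for this example and are not part of the protocol):

```python
# Toy calculation: K data symbols encoded into N code symbols become irrecoverable
# only when more than N-K symbols are lost, so even one random query detects such a
# loss with probability at least (N-K+1)/N, i.e. roughly 1/2 for N = 2K.

def detection_probability(n: int, lost: int, queries: int) -> float:
    """Probability that at least one of `queries` distinct random symbols is lost."""
    p_miss = 1.0
    for i in range(queries):
        p_miss *= (n - lost - i) / (n - i)
    return 1.0 - p_miss

K, N = 2**10, 2**11                  # toy rate-1/2 example, N = 2K
critical_loss = N - K + 1            # minimum loss making the data unrecoverable
print(detection_probability(N, critical_loss, 1))    # ~0.50
print(detection_probability(N, critical_loss, 20))   # ~0.999999
```

This is also why the parity trimming discussed further below needs care: removing parity symbols shrinks the code's distance, and with it the per-query detection probability.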
+
+Technical remark: Obviously the (encoded) data can still be corrupted. The idea here is that even if there is corruption or loss, in theory the provider _could_ reconstruct the original data _if they wanted_, hence the _data itself_ is in principle not lost.

 ### Outsourcing to the provider

 In "old Codex", this encoding (together with the network-level erasure coding) was done by the client before uploading.

-However, it would be preferable to outsource the local encoding to the providers, for several reasons:
+However, it would be preferable to outsource the local encoding to the providers, for several reasons:

 - the providers typically have more computational resources than the clients (especially if the client is for example a mobile phone)
 - because the network chunks are held by different providers, the work could be distributed among several providers, further decreasing the per-person work
@@ -33,7 +35,7 @@ The FRI protocol (short for "Fast Reed-Solomon Interactive Oracle Proofs of Proximity")

 Note that obviously we cannot do better than "close to" without checking every single element of the vector (which again obviously wouldn't be a useful approach), so "close to" must be an acceptable compromise.

-However, in the ideal situation, if the precise distance bound $\delta$ of the "close to" concept is small enough, then there is exactly 1 codeword within that radius ("unique decoding regime"). In that situation errors in the codeword can be corrected simply by replacing it with the closest codeword. A somewhat more complicated situation is the so-called "list decoding regime".
+However, in the ideal situation, if the precise distance bound $\delta$ of the "close to" concept is small enough, then there is exactly 1 codeword within that radius ("unique decoding regime"). In that situation errors in the codeword can be corrected simply by replacing it with the closest codeword. A somewhat more complicated situation is the so-called "list decoding regime". For more details see the accompanying document about security/soundness.

 As usual we can make this non-interactive via the Fiat-Shamir heuristic.
@@ -46,23 +48,23 @@ This gives us a relatively simple plan of attack:

 - the provider also distributes the Merkle root of the parity data together with the FRI proof. This is the proof (a singleton Merkle path) connecting the original data Merkle root and the codeword Merkle root (storage proofs will be validated against the latter)
 - the metadata is updated: the new content ID will be the Merkle root of the codeword, against which storage proofs will be required in the future. Of course one will also need a mapping from the original content ID(s) to the new locations (root + pointer(s) inside the RS encoded data)

-Remark: If the 2x storage overhead is too big, after executing the protocol, we may try to trim some of the parity (say 50% of it). You can probably still prove some new Merkle root with a little bit of care, but non-power-of-two sizes make everything more complicated.
+Remark: If the 2x storage overhead is too big, after executing the protocol, we may try to trim some of the parity (say 50% of it). You can probably still prove some new Merkle root with a little bit of care, but non-power-of-two sizes make everything more complicated. WARNING: This "truncation" may have serious implications on the security of the whole protocol!!
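The "singleton Merkle path" item in the plan above can be sketched as follows, assuming $R=2$ (a single parity half) and the subtree re-indexing described in FRI_details.md; SHA-256 stands in for the protocol's actual Monolith-based hash, so this is an illustration of the idea rather than the real commitment scheme:

```python
# Sketch: with R = 2 and the subtree ordering, the codeword Merkle root is the hash
# of (original data root, parity root), so the "connecting proof" is a single Merkle
# path of length one, i.e. the parity root itself.
# SHA-256 is a stand-in for the real Monolith-based hash (illustration only).

import hashlib

def combine(left: bytes, right: bytes) -> bytes:
    """One internal Merkle node (stand-in compression function)."""
    return hashlib.sha256(left + right).digest()

def check_connection(original_root: bytes, parity_root: bytes, codeword_root: bytes) -> bool:
    # the parity root is the whole proof
    return combine(original_root, parity_root) == codeword_root

# toy usage with dummy 32-byte roots
data_root     = hashlib.sha256(b"original data root").digest()
parity_root   = hashlib.sha256(b"parity root").digest()
codeword_root = combine(data_root, parity_root)
assert check_connection(data_root, parity_root, codeword_root)
```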
-Remark #2: There are also improved versions of the FRI protocol like STIR and WHIR. I believe those can be used in the same way. But as FRI is already rather complicated, for now let's concentrate on that.
+Remark #2: There are also improved versions of the FRI protocol like STIR and WHIR. I believe those can be used in the same way. But as FRI is already rather complicated, for now let's just concentrate on that.

 ### Batching

 FRI is a relatively expensive protocol. I expect this proposal to work well for, say, up to 1 GB data sizes, and acceptably up to 10 GB of data. But directly executing FRI on such data sizes would presumably be very expensive.

-Fortunately, FRI can be batched, in a very simple way: Suppose you want to prove that $M$ vectors $v_i\in \mathbb{F}^N$ are all codewords. To do that, just consider a random linear combination (recall that Reed-Solomon is a linear code)
+Fortunately, FRI can be batched, in a very simple way: Suppose you want to prove that $M$ vectors $v_i\in \mathbb{F}^N$ (for $0\le i < M$) are all codewords. To do that, just consider a random linear combination (recall that Reed-Solomon is a linear code)
+
+### What if $R > 2$, that is, a larger code?
+
+The security reasoning of the FRI protocol is very involved (see the corresponding security document), and we may want the codeword to be significantly larger because of that; for example $\rho=1/8$.
+
+From the "connecting original to encoded" point of view, this situation is similar to the previous one.
+
+### Batched FRI protocol
+
+The FRI protocol we use is essentially the same as the one in Plonky2, which is furthermore essentially the same as in the paper ["DEEP-FRI: Sampling outside the box improves soundness"](https://eprint.iacr.org/2019/336) by Eli Ben-Sasson et al.
+
+Setup: We have a matrix of Goldilocks field elements of size $N\times M$ with $N=2^n$ being a power of two. We encode each column with Reed-Solomon encoding into size $N/\rho$ (also assumed to be a power of two), interpreting the data as the values of a polynomial on a coset $\mathcal{C}\subset \mathbb{F}^\times$, and the codeword as its values on a larger coset $\mathcal{C}'\supset\mathcal{C}$.
+
+The protocol proves that (the columns of) the matrix are "close to" Reed-Solomon codewords (in a precise sense).
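As a toy illustration of this setup (deliberately naive: a tiny prime field instead of Goldilocks, $O(N^2)$ Lagrange interpolation instead of FFTs, and an arbitrarily chosen coset pair), encoding a single column could look like this:

```python
# Toy illustration of the setup: a column is read as the values of a degree < N
# polynomial on a coset C, and its codeword is the values of the same polynomial on
# a larger coset C' (here |C'| = 4N, i.e. rho = 1/4, chosen arbitrarily).
# Deliberately naive: a tiny prime field instead of Goldilocks, O(N^2) interpolation.

P = 257                              # toy prime field, NOT the Goldilocks field

def interpolate(xs, ys, x):
    """Evaluate at x the unique degree < len(xs) polynomial through (xs[i], ys[i])."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def rs_encode(column, small_coset, large_coset):
    return [interpolate(small_coset, column, x) for x in large_coset]

g, h = 3, 5                          # g generates F_257^* (order 256); h is a coset shift
N, N_ENC = 4, 16
C     = [h * pow(g, (256 // N) * k, P) % P for k in range(N)]          # small coset
C_BIG = [h * pow(g, (256 // N_ENC) * k, P) % P for k in range(N_ENC)]  # larger coset

codeword = rs_encode([10, 20, 30, 40], C, C_BIG)
# systematic: C is a subset of C_BIG, so the original values reappear in the codeword
assert set([10, 20, 30, 40]) <= set(codeword)
```

A real implementation would of course use FFT-based encoding over the Goldilocks field; the point here is only the coset picture of the setup, and that with $\mathcal{C}\subset\mathcal{C}'$ the code is systematic.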
+
+**The prover's side:**
+
+- the prover and the verifier agree on the public parameters
+- the prover computes the Reed-Solomon encoding of the columns, and commits to the encoded (larger) matrix with a Merkle root (or Merkle cap)
+- the verifier samples a random combining coefficient $\alpha\in\widetilde{\mathbb{F}}$
+- the prover computes the linear combination of the RS-encoded columns with coefficients $1,\alpha,\alpha^2,\dots,\alpha^{M-1}$
+- "commit phase": the prover repeatedly
+  - commits to the current vector of values
+  - the verifier chooses a random folding coefficient $\beta_k\in\widetilde{\mathbb{F}}$
+  - "folds" the polynomial with the pre-agreed folding arity $A_k = 2^{a_k}$
+  - evaluates the folded polynomial on the evaluation domain $\mathcal{D}_{k+1} = \mathcal{D}_k^{A_k}$
+- until the degree of the folded polynomial becomes small enough
+- then the final polynomial is sent in the clear (by its coefficients)
+- an optional proof-of-work "grinding" is performed by the prover
+- the verifier samples random row indices $0 \le \mathsf{idx}_j < N/\rho$ for $0\le j < n_{\mathrm{rounds}}$
+- "query phase": repeatedly (the pre-agreed number $n_{\mathrm{rounds}}$ of times)
+  - the prover sends the full row corresponding to the index $\mathsf{idx}_j$, together with a Merkle proof (against the Merkle root of the encoded matrix)
+  - repeatedly (for each folding step):
+    - extract the folding coset containing the "upstream index" from the folded encoded vector
+    - send it together with a Merkle proof against the corresponding commit phase Merkle root
+- serialize the proof into a bytestring
+
+**The verifier's side:**
+
+- deserialize the proof from a bytestring
+- check the "shape" of the proof data structure against the global parameters:
+  - all the global parameters match the expected ones (if included in the proof)
+  - Merkle cap sizes
+  - number of folding steps and folding arities
+  - number of commit phase Merkle caps
+  - degree of the final polynomial
+  - all Merkle proof lengths
+  - number of query rounds
+  - number of steps in each query round
+  - opened matrix row sizes
+  - opened folding coset sizes
+- compute all the FRI challenges from the transcript:
+  - combining coefficient $\alpha\in\widetilde{\mathbb{F}}$
+  - folding coefficients $\beta_k\in\widetilde{\mathbb{F}}$
+  - grinding PoW response
+  - query indices $0 \le \mathsf{idx}_j < N/\rho$
+- check the grinding proof-of-work condition
+- for each query round:
+  - check the row opening Merkle proof
+  - compute the combined "upstream value" $u_0 = \sum \alpha^j\cdot \mathsf{row}_j \in\widetilde{\mathbb{F}}$
+  - for each folding step:
+    - check the "upstream value" $u_k\in\widetilde{\mathbb{F}}$ against the corresponding element in the opened coset
+    - check the Merkle proof of the opened folding coset values
+    - compute the "downstream value" $u_{k+1}\in\widetilde{\mathbb{F}}$ from the coset values using the folding coefficient $\beta_k$ (for example by applying an IFFT on the values and linearly combining the result with powers of $\beta_k$)
+  - check the final downstream value against the evaluation of the final polynomial at the right location
+- accept if all checks passed.
+
+This concludes the batched FRI protocol.
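The folding step used both in the commit phase and in the verifier's "downstream value" computation can be illustrated for arity 2 (a toy sketch over the same small prime field as in the previous example; the real protocol works over the Goldilocks extension field $\widetilde{\mathbb{F}}$ and with the pre-agreed arities $A_k$):

```python
# Toy arity-2 folding step (same toy prime field as above): writing
#   f(X) = f_even(X^2) + X * f_odd(X^2),
# the folded polynomial is f_even(X) + beta * f_odd(X), and its value at y = x^2
# can be computed from the two "upstream" values f(x) and f(-x) alone.

P = 257

def fold_pair(f_x, f_minus_x, x, beta):
    """Value of the folded polynomial at y = x^2, from the coset pair {x, -x}."""
    inv2  = pow(2, P - 2, P)
    inv2x = pow(2 * x % P, P - 2, P)
    f_even = (f_x + f_minus_x) * inv2 % P           # f_even(x^2)
    f_odd  = (f_x - f_minus_x) * inv2x % P          # f_odd(x^2)
    return (f_even + beta * f_odd) % P

def evaluate(coeffs, x):
    return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P

f      = [7, 1, 2, 5]                               # degree < 4
beta   = 11
folded = [(f[0] + beta * f[1]) % P, (f[2] + beta * f[3]) % P]   # f_even + beta*f_odd

x = 3
assert fold_pair(evaluate(f, x), evaluate(f, (-x) % P), x, beta) == evaluate(folded, x * x % P)
```

Higher arities work the same way, with the two-point IFFT replaced by an $A_k$-point one, as in the verifier's folding step above.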
+
+### Summary
+
+We have to prove two things:
+
+- that the commitment to the "encoded data" really corresponds to something which looks like a set of Reed-Solomon codewords
+- and that this is really an encoding of the original data, which in practice means (because we use a so-called "systematic code") that the original data is contained in the encoded data
+
+The first can be done using the FRI protocol, and the second via a very simple Merkle proof-type argument.
+
+There are a lot of complications in the details, starting from how to encode the data into a matrix of field elements (important because of performance considerations) to all the peculiar details of the (optimized) FRI protocol.
+
+-- E. Ben-Sasson, L. Goldberg, S. Kopparty, and S. Saraf: _"DEEP-FRI: Sampling outside the box improves soundness"_
\ No newline at end of file