improve the documentation (especially disk layout)

Balazs Komuves 2025-12-04 11:53:44 +01:00
parent ae3c9827d3
commit c9c5533b3d
No known key found for this signature in database
GPG Key ID: F63B7AEF18435562
2 changed files with 48 additions and 43 deletions

View File

@ -23,10 +23,14 @@ We also expect people using HDDs (spinning disks, as opposed to SSDs), especiall
This means disk layout is critical for performance.
However, it's fair to assume that the machine also contains an SSD (for the operating system etc), which can be used for temporary storage.
#### Data Conversion
We can encode 31 bytes into 4 Goldilocks field elements. That's about 10% more efficient than encoding 7 bytes into a single field element, while still quite simple, so we should do that (a free 10% is a free 10%!).
For example, 2048 bytes _almost_ fit into 66 such 31-byte blocks: $66\times 31 = 2046$. The remaining 2 bytes can be stored in an extra field element, resulting in $66\times 4 + 1 = 265$ field elements. This can be padded (virtually, with zeros or with the `10*` strategy) to $34\times 8 = 272$ field elements (for example for hashing).
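A minimal sketch of one possible packing (the exact 62-bit limb split below is an illustrative assumption; the text only fixes the 31-byte-to-4-element ratio and the 265-element row format):

```python
# Sketch: pack 31 bytes into 4 Goldilocks field elements, and a 2048-byte row
# into 265 field elements. The limb split is an assumption for illustration.

GOLDILOCKS_P = 2**64 - 2**32 + 1

def pack_31_bytes(block: bytes) -> list[int]:
    """Pack 31 bytes (248 bits) into 4 limbs of 62 bits each;
    every limb is well below the Goldilocks prime."""
    assert len(block) == 31
    x = int.from_bytes(block, "little")
    limbs = [(x >> (62 * i)) & ((1 << 62) - 1) for i in range(4)]
    assert all(limb < GOLDILOCKS_P for limb in limbs)
    return limbs

def pack_row(row: bytes) -> list[int]:
    """Pack a 2048-byte row: 66 blocks of 31 bytes plus 2 leftover bytes,
    giving 66*4 + 1 = 265 field elements."""
    assert len(row) == 2048
    elems = []
    for i in range(66):
        elems += pack_31_bytes(row[31 * i : 31 * (i + 1)])
    elems.append(int.from_bytes(row[2046:], "little"))  # the remaining 2 bytes
    return elems  # 265 field elements
```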
#### Parallel row hashing
We are using a sponge construction (with state size of 12 field elements and rate 8) with the Monolith permutation for linear hashing.
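For concreteness, here is a rough sketch of such a rate-8 sponge. The Monolith permutation itself is left as a placeholder, and the additive absorption and `10*` padding details are assumptions of the sketch, not a specification:

```python
# Rate-8 sponge over a 12-element Goldilocks state, squeezing a 4-element digest.

GOLDILOCKS_P = 2**64 - 2**32 + 1
STATE_SIZE, RATE, DIGEST_SIZE = 12, 8, 4

def monolith_permutation(state):
    # Placeholder for the real Monolith permutation over 12 field elements.
    raise NotImplementedError

def sponge_hash(elements):
    """Linear (sponge) hash of a list of field elements."""
    state = [0] * STATE_SIZE
    padded = list(elements) + [1]          # `10*` padding (assumed here)
    while len(padded) % RATE != 0:
        padded.append(0)
    for i in range(0, len(padded), RATE):
        for j in range(RATE):              # absorb one rate-sized block
            state[j] = (state[j] + padded[i + j]) % GOLDILOCKS_P
        state = monolith_permutation(state)
    return state[:DIGEST_SIZE]             # squeeze the 4-element digest
```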
@ -39,7 +43,7 @@ As the permutation state of a single hash is encoded in $12\times 8 = 96$ bytes
As an example, let's aim for 8GB of original data and 8GB of parity, so 16GB of data in total.
With $N=2^{22}$ and $N'=2^{23}$, we need $M=265$ to encode this amount of data (one field element can on average encode 7.75 bytes, as 4 field elements can encode 31 bytes; a row corresponds to 2048 bytes of data).
With this setup, one (already RS-encoded) column takes 64 megabytes of memory (62MB of raw data decoded into 64MB of field elements).
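A quick sanity check of these numbers (informal, assuming the 2048-byte / 265-element row format from above):

```python
N  = 2**22           # original rows
N2 = 2**23           # rows after rate-1/2 RS encoding
M  = 265             # field elements per row
ROW_BYTES = 2048     # raw bytes packed into one row

print((N  * ROW_BYTES) / 2**30)      # 8.0   -- GB of original data
print((N2 * ROW_BYTES) / 2**30)      # 16.0  -- GB including parity
print((N2 * 8) / 2**20)              # 64.0  -- MB for one encoded column of field elements
print((N2 * ROW_BYTES / M) / 2**20)  # ~61.8 -- MB of raw bytes behind that column
```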
@ -49,7 +53,9 @@ So it seems to be a reasonable idea to store the original data columnwise.
#### Collecting small files
However, this seems to be in conflict with the main intended use case, namely _collecting small files_: because those are _also_ identified by Merkle roots, and we want to prove _merging_ of such partial datasets (with extremely short Merkle proofs), those must also consist of contiguous _rows_ (of the same size! which here is $M$).
Furthermore, it's important to remember that as providers, we have to serve the original data in 64kb contiguous blocks. Since we want the RS-encoded data stored row-wise (for easy sampling), the original data layout must be compatible with this (as we don't want to store it twice, but only as part of the encoded dataset), that is, row-major. And the encoded dataset must be re-packed to 2048 bytes per row (from the 265 field elements, which in memory are represented by $265\times 8 = 2120$ bytes).
(ASCII diagram of the dataset as an $N\times M$ matrix: $N$ rows of $M$ field elements each.)

@ -77,23 +83,8 @@ However, this seems to be in conflict with the main intended use case, namely _c
However, we can use the SSD as faster storage both for collecting the small files and as a "cache" while doing the FFT encoding, using the slow HDD only for the final, "sealed", erasure-coded merged blocks.
#### Building the Merkle trees
@ -101,37 +92,49 @@ After computing the $N'\approx 2^{23}$ hashes, each consisting 4 field elements
We build a binary Merkle tree on top of that; that's another 256MB (in total 512MB with the leaf hashes).
This (the Merkle tree) we want to keep for the FRI prover. However, we don't want to keep all the actual data in memory, as that is too large. In the future phases (FRI sampling, and later storage proofs), we want to sample individual (random) _rows_.
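As a reference point, building the binary Merkle tree over the row hashes is straightforward; the sketch below uses a stand-in compression function (SHA-256) where the real construction would compress two 4-element Monolith digests into one:

```python
# Sizes: 2^23 leaves * 32 bytes = 256MB, plus ~256MB of internal nodes.

import hashlib

def compress(left: bytes, right: bytes) -> bytes:
    # Stand-in 2-to-1 compression; the real one would be Monolith-based,
    # mapping 2 x 4 field elements to 4 field elements.
    return hashlib.sha256(left + right).digest()

def build_merkle_tree(leaves: list[bytes]) -> list[list[bytes]]:
    """Return all levels, from the leaf layer up to the root."""
    assert (len(leaves) & (len(leaves) - 1)) == 0, "expects a power-of-two leaf count"
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([compress(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels  # levels[-1][0] is the Merkle root
```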
#### Partial on-line transposition
So we are processing the data in $N' \times 8$ "fat columns" (both computing the RS encoding and the row hashes at the same time), but we want to store the result of the RS encoding in a way which is more suitable for efficient reading of random _rows_, and the original input data is also stored in row-major order.
As we cannot really transpose the whole matrix without either consuming a very large amount of memory (which we don't want) or a very large number of disk seeks (which is _really_ slow), we can only have some kind of trade-off. This is pretty standard in computer science; see for example "cache-oblivious data structures".
We can also use the SSD as a "cache" during the transposition process (as almost all machines these days include at least a small SSD).
#### Final proposed strategy
The proposed strategy is the following (a toy-scale sketch of the data movement is given after the list):
- the input data is stored in several smaller files (virtually padded to power-of-two sizes), which we interpret in row-major order, each row consisting of 2048 bytes, which can be unpacked into 265 field elements
- these files are virtually concatenated into a bundle of size at most 8GB
- this bundle can be split into, say, 256MB chunks (interpreted as $2^{17}\times 265$ matrices). Note that a chunk can contain several files, or only a portion of a single file, or even both!
- we can read one such chunk into memory, unpack the rows, and transpose the resulting matrix while consuming, say, less than 300MB of memory
- we can immediately do an IFFT on the columns of this chunk, as that will be needed anyway as part of the full column IFFT algorithm. As these columns are relatively small, we are not required to use an in-place algorithm here
- write out the transposed and IFFT-ed chunk to a temporary location on SSD
- do the same for all the $32 = 8\textrm{GB} \,/\, 256\textrm{MB}$ chunks
- now proceed in blocks of 4 or 8 columns to do the RS-encoding:
  - from each of the 32 chunks, read in the first 8 columns; that's 32 matrices of size $2^{17}\times 8$, occupying about $32\times 8\textrm{MB} = 256\textrm{MB}$ of memory
  - each chunk is already IFFT-d; we can probably finish the large $2^{22}$-sized IFFTs with an in-place algorithm
  - proceed similarly for the shifted FFT computing the 8 parity columns
  - write out the result to SSD, but in $2^{17}\times 8$ sized blocks (!)
- do a final transposition to row-major order and store on long-term storage (HDD):
  - again for each of the 32 chunks (a matrix of field elements of size $2^{17} \times 265$), read the $2^{17}\times 8$ blocks belonging to it
  - do a transposition in-memory so it's row-major
  - compute the row hashes (and store them in memory)
  - re-pack the rows into 2048 bytes
  - store the rows linearly on HDD
- compute the Merkle tree over the row hashes (size $2^{23}$)
- execute the FRI protocol. Reading the rows from HDD should be fine (100-200 seeks for 100-200 samples), but optionally we can also cache the final encoded dataset on SSD for the duration of this
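To make the data movement concrete, here is a toy-scale sketch of the pipeline above. The field arithmetic (column IFFTs, the shifted FFT producing parity, Monolith hashing) is stubbed out, the "SSD"/"HDD" are just in-memory lists, and the names and tiny parameters are assumptions of the sketch; only the chunking and transposition pattern is illustrated:

```python
ROWS_PER_CHUNK = 4        # stands in for 2^17
M              = 6        # stands in for 265 field elements per row
NUM_CHUNKS     = 3        # stands in for 32
COL_BLOCK      = 2        # stands in for blocks of 8 columns

def ifft_column(col):           # placeholder for the small per-chunk column IFFT
    return list(col)

def finish_encode(block_cols):  # placeholder for the large IFFT + shifted FFT
    return [list(c) for c in block_cols]   # (parity generation omitted)

def hash_row(row):              # placeholder for the Monolith row hash
    return hash(tuple(row))

# 1. read each chunk (row-major), transpose to column-major, IFFT, park on "SSD"
rows = [[r * M + c for c in range(M)] for r in range(ROWS_PER_CHUNK * NUM_CHUNKS)]
ssd_chunks = []
for k in range(NUM_CHUNKS):
    chunk = rows[k * ROWS_PER_CHUNK : (k + 1) * ROWS_PER_CHUNK]
    cols  = [ifft_column([row[c] for row in chunk]) for c in range(M)]
    ssd_chunks.append(cols)                       # column-major, per chunk

# 2. RS-encode blocks of columns gathered across all chunks
ssd_col_blocks = []
for c0 in range(0, M, COL_BLOCK):
    block = [sum((ssd_chunks[k][c] for k in range(NUM_CHUNKS)), [])
             for c in range(c0, c0 + COL_BLOCK)]  # full-height columns
    ssd_col_blocks.append(finish_encode(block))

# 3. final transposition back to row-major, hash rows, store on "HDD"
hdd_rows, row_hashes = [], []
for r in range(ROWS_PER_CHUNK * NUM_CHUNKS):
    row = [ssd_col_blocks[c // COL_BLOCK][c % COL_BLOCK][r] for c in range(M)]
    hdd_rows.append(row)                          # would be re-packed to 2048 bytes
    row_hashes.append(hash_row(row))              # leaves of the Merkle tree
```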
### Estimated time for the whole process
All critical operations:
- reading/writing from HDD
- Monolith hashing
- NTT / INTT
seem to have similar speeds: namely, around 200-300 MB/sec (but the CPU-bound ones can be parallelized; these were measured single-core on an M2 MacBook Pro).
As $16\,\textrm{GB}\,/\,200\,\textrm{MB/s} \approx 82$ seconds for a single pass over the data, I expect the whole process to finish in maybe 3 minutes or less for 8GB of original (16GB of encoded) data.
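The same estimate as a one-liner (the number of full passes over the data is an assumption here):

```python
encoded_bytes = 16 * 2**30      # 16 GB of encoded data
throughput    = 200 * 2**20     # ~200 MB/s per critical operation (single core)
passes        = 2               # assumed: e.g. one encoding pass and one final write pass

print(encoded_bytes / throughput * passes / 60, "minutes")   # ~2.7 minutes
```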

View File

@ -1,7 +1,9 @@
Outsourcing local erasure coding
--------------------------------
We would like to outsource the Reed-Solomon encoding of some piece of data to an untrusted server. The "client" authenticates its data with some kind of hash commitment (in practice, a Merkle root), and similarly the "server" returns another Merkle root of the encoded data, together with a proof connecting the two and ensuring the correctness of the encoding.
Why do we want this?
The purpose of local erasure coding (we used to call this "slot-level EC" in old Codex) is to increase the strength of storage proofs based on random sampling.
The core principle behind this idea is the distance amplification property of Reed-Solomon codes.
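Schematically, the exchange described at the top could be shaped like this (the type and field names are hypothetical, not an actual Codex API):

```python
from dataclasses import dataclass

@dataclass
class EncodingRequest:
    data_root: bytes      # client's Merkle root of the original data

@dataclass
class EncodingResponse:
    encoded_root: bytes   # server's Merkle root of the RS-encoded data
    proof: bytes          # proof that encoded_root is a correct encoding
                          # of the data committed to by data_root
```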
@ -15,9 +17,9 @@ In "old Codex", this encoding (together with the network-level erasure coding) w
However, it would be preferable to outsource the local encoding to the providers, for several reasons:
- the providers typically have more computational resources than the clients (especially if the client is for example a mobile phone);
- because the network chunks (redundancy encoding) are held by different providers, the work could be distributed among several providers, further decreasing the per-person work;
- if it's the provider who does it, it can be postponed until enough data (possibly many small pieces from many different clients) is accumulated to make the resulting data unit size economical.
However, in general we don't want to trust the provider(s), but instead verify that they did it correctly.