Disk layout
-----------
We hope that this protocol will be feasible for data sizes up to at least \~10GB (for example, 8GB of original data plus 8GB of parity, 16GB in total).
However, we don't want to keep such an amount of data in memory all at once, for several reasons:
- we want to support relatively low-spec hardware
- the machine may want to do others things at the same time (this is supposed to be a background process)
- we may want to support parallelism
So it would be good if the memory consumption could be bounded (with the maximum residency being a configuration parameter).
Recall that semantically, all our data is organized as _matrices_ (2-dimensional); however, data as stored on disk is 1-dimensional. These two facts are in conflict with each other, especially as we have different access patterns:
- when the client computes the Merkle root of the original data, that is built on top of _matrix rows_
- during the Reed-Solomon encoding itself, we access the _matrix columns_ independently (and the CPU computation can be done in parallel)
- when computing the Merkle tree of the encoded data, that is again built over _rows_
- during the FRI proof, we sample random _matrix rows_
- during future storage proofs, again one or a few _rows_ are sampled
We also expect people to use HDDs (spinning disks, as opposed to SSDs), especially if the network's data storage scales up to non-trivial sizes, as HDDs are much cheaper. Unfortunately, spinning disks are very slow, both in linear read/write (about 200-300 MB/sec) and in seeking (up to 1000 seeks/sec); and disk access, unlike CPU tasks, cannot be parallelized.
This means disk layout is critical for performance.
However, it's fair to assume that the machine also contains an SSD (for the operating system etc), which can be used for temporary storage.
#### Data Conversion
We can encode 31 bytes into 4 Goldilocks field elements. That's about 10% more efficient than encoding 7 bytes into a single field element, while still quite simple, so we should do that (a free 10% is a free 10%!).
For example, 2048 bytes _almost_ fit into 66 such 31-byte blocks: $66\times 31 = 2046$. The remaining 2 bytes can be stored in an extra field element, resulting in $66\times 4 + 1 = 265$ field elements. This can be padded (virtually, with zeros or with the `10*` strategy) to $34\times 8 = 272$ field elements (for example for hashing).
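
For illustration, here is a minimal Python sketch of this packing; the 62-bit little-endian split and the zero padding are assumptions made for the example (any injective packing into elements below the Goldilocks prime would work equally well):

```python
# Sketch only: pack 31 bytes into 4 Goldilocks field elements (p = 2^64 - 2^32 + 1),
# and a 2048-byte row into 66*4 + 1 = 265 elements, virtually padded to 272.
# The 62-bit little-endian split is an illustrative choice, not a fixed format.

GOLDILOCKS_P = 2**64 - 2**32 + 1

def pack_31_bytes(block: bytes) -> list[int]:
    """31 bytes = 248 bits = 4 x 62 bits; every element stays below p."""
    assert len(block) == 31
    n = int.from_bytes(block, "little")
    return [(n >> (62 * i)) & ((1 << 62) - 1) for i in range(4)]

def pack_row(row: bytes) -> list[int]:
    """2048-byte row -> 265 field elements, padded (here with zeros) to 272."""
    assert len(row) == 2048
    fes: list[int] = []
    for i in range(66):                                   # 66 * 31 = 2046 bytes
        fes.extend(pack_31_bytes(row[31 * i : 31 * (i + 1)]))
    fes.append(int.from_bytes(row[2046:], "little"))      # the remaining 2 bytes
    fes.extend([0] * (272 - len(fes)))                    # virtual padding to 34 * 8
    return fes

assert len(pack_row(bytes(2048))) == 272
```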
#### Parallel row hashing
We are using a sponge construction (with a state size of 12 field elements and rate 8) with the Monolith permutation for linear hashing.
The matrix Merkle trees (original and encoded) are built on top of these row hashes.
As the permutation state of a single hash fits in $12\times 8 = 96$ bytes of memory, as long as the number of rows is not too big, we can compute these hashes in parallel even if we access the matrix columnwise: simply process 8 columns (of size $N$) per step, while keeping all the $N$ hash states ($N\times 96$ bytes) in memory.
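
A sketch of this access pattern follows; the Monolith permutation is taken as a given callable, and sponge details such as padding and squeezing are simplified (the 4-element digest matches the Merkle tree section below):

```python
# Sketch only: hash all N rows while reading the matrix 8 columns at a time.
# `permutation` stands for the 12-element Monolith permutation (not defined here);
# absorption and squeezing are simplified compared to the real sponge.

GOLDILOCKS_P = 2**64 - 2**32 + 1
STATE_SIZE, RATE = 12, 8

def hash_rows_columnwise(columns, n_rows, permutation):
    """`columns`: M columns (M padded to a multiple of 8), each of length n_rows."""
    states = [[0] * STATE_SIZE for _ in range(n_rows)]   # n_rows * 96 bytes in total
    for c0 in range(0, len(columns), RATE):              # one "fat column" of 8 per step
        fat = columns[c0 : c0 + RATE]
        for r in range(n_rows):
            st = states[r]
            for j, col in enumerate(fat):                # absorb into the rate part
                st[j] = (st[j] + col[r]) % GOLDILOCKS_P
            states[r] = permutation(st)
    return [st[:4] for st in states]                     # 4-element row digests
```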
### Example configuration
As an example, let's aim for 8GB of original data and 8GB of parity, so 16GB of data in total.
With $N=2^{22}$ and $N'=2^{23}$, we need $M=265$ to encode this amount of data (one field element encodes 7.75 bytes on average, as 4 field elements encode 31 bytes; a row corresponds to 2048 bytes of data).
With this setup, one (already RS-encoded) column takes 64 megabytes of memory (62MB of raw data decoded into 64MB of field elements).
If processing columnwise, doing the RS-encoding and computing the row hashes at the same time, we need to keep $12+8 = 20$ columns in memory simultaneously (12 for the hash states, and 8 for the hash input); that's about 1.25 GB, which seems acceptable (of course more memory is needed for other purposes, but presumably this will dominate).
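
A quick back-of-the-envelope check of these numbers (8-byte field elements assumed):

```python
# Memory budget of the example configuration.
N_PRIME   = 2**23                    # encoded column length
COL_BYTES = N_PRIME * 8              # 64 MiB per encoded column of field elements
RESIDENT  = 12 + 8                   # 12 hash-state "columns" + 8 input columns
print(COL_BYTES // 2**20, "MiB per column")           # 64 MiB per column
print(RESIDENT * COL_BYTES / 2**30, "GiB resident")   # 1.25 GiB resident
```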
So it seems to be a reasonable idea to store the original data columnwise.
#### Collecting small files
However, this seems to be in conflict with the main intended use case, namely _collecting small files_: because those are _also_ identified by Merkle roots, and we want to prove _merging_ of such partial datasets (with extremely short Merkle proofs), they must also consist of contiguous _rows_ (of the same size! which is here $M$).
Furthermore, it's important to remember that as providers, we have to serve the original data in 64kB contiguous blocks. Since we want the RS-encoded data stored row-wise (for easy sampling), the original data layout must be compatible with this (as we don't want to store it twice, but only as part of the encoded dataset), that is, row-major. And the encoded dataset must be re-packed to 2048 bytes per row (from the 265 field elements, which in memory are represented by $265\times 8 = 2120$ bytes).

                               M
         /      +-----------------------------+ _
         | 2^18 |           file #0           |  \
         |      +-----------------------------+   +--+
         | 2^18 |           file #1           | _/     \
         |      +-----------------------------+ _       +---+
         | 2^18 |           file #2           |  \     /     \
         |      +-----------------------------+   +--+        \
         | 2^18 |           file #3           | _/             \
         |      +-----------------------------+                 \
         |      |                             | _                \
         | 2^19 |           file #4           |  \                +-- root
    N    |      |                             |   \              /
    =    |      +-----------------------------+   +-+           /
    2^22 |      |                             |   /    \       /
         | 2^19 |           file #5           | _/      \     /
         |      |                             |          \   /
         |      +-----------------------------+           +-+
         |      |                             |          /
         |      |                             |         /
         |      |                             |        /
         | 2^21 |           file #6           |  _____/
         |      |                             |
         |      |                             |
         |      |                             |
         \      +-----------------------------+

However, we can use the SSD as faster storage, both for collecting the small files and as a "cache" while doing the FFT encoding, using the slow HDD only for the final, "sealed", erasure-coded merged blocks.
#### Building the Merkle trees
After computing the $N'\approx 2^{23}$ hashes, each consisting of 4 field elements (32 bytes in memory), what we have is 256MB of data.
We build a binary Merkle tree on top of that; that's another 256MB (512MB in total, together with the leaf hashes).
This (the Merkle tree) we want to keep for the FRI prover. However, we don't want to keep all the actual data in memory, as that is too large. In the later phases (FRI sampling, and later storage proofs), we want to sample individual (random) _rows_.
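
A sketch of this construction; the 2-to-1 compression function is passed in (presumably Monolith-based), and a power-of-two number of leaves is assumed:

```python
# Sketch only: build all layers of a binary Merkle tree over the row hashes.
# With 2^23 leaves of 32 bytes each, layers[0] is 256MB and the remaining
# layers together take roughly another 256MB.

def merkle_layers(leaf_hashes, compress):
    """`compress` is the 2-to-1 hash on digests, supplied by the caller."""
    assert len(leaf_hashes) & (len(leaf_hashes) - 1) == 0, "power-of-two leaf count"
    layers = [list(leaf_hashes)]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append([compress(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)])
    return layers                      # layers[-1][0] is the Merkle root
```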
#### Partial on-line transposition
So we are processing the data in $N' \times 8$ "fat columns" (both computing the RS encoding and the row hashes at the same time), but we want to store the result of the RS-encoding in a way which is more suitable for efficient reading of random _rows_, and the original input data is also stored in row-major order.
As we cannot really transpose the whole matrix without either consuming a very large amount of memory (which we don't want) or a very large number of disk seeks (which is _really_ slow), we can only have some kind of trade-off. This is pretty standard in computer science; see for example "cache-oblivious data structures".
We can also use the SSD as a "cache" during the transposition process (as almost all machines these days include at least a small SSD).
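
To illustrate the trade-off, here is a blocked (tiled) transposition sketch: data moves in fixed-size tiles, so with memory-mapped source and destination each step touches only a bounded region of either file. The tile size is a tunable assumption, and this sketches the access pattern only, not the actual on-disk implementation:

```python
import numpy as np

def blocked_transpose(src: np.ndarray, dst: np.ndarray, tile: int = 4096) -> None:
    """Transpose `src` (rows x cols) into `dst` (cols x rows) tile by tile.
    With np.memmap arrays this bounds both the resident memory and the seek pattern."""
    rows, cols = src.shape
    assert dst.shape == (cols, rows)
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            dst[c:c + tile, r:r + tile] = src[r:r + tile, c:c + tile].T
```

With memory-mapped `src` and `dst`, the resident set stays around two tiles, while the number of large seeks grows only with the number of tiles rather than with the number of elements.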
#### Final proposed strategy
The proposed strategy is the following (a numeric sketch of the chunk geometry follows the list):
- the input data is stored in several smaller files (virtually padded to power of two sizes), which we interpret in row-major order, each row consisting of 2048 bytes, which can be depacked into 265 field elements.
- these files are virtually concatenated into a bundle of size at most 8GB
- this bundle can be split into, say, 256MB chunks (interpreted as $2^{17}\times 265$ matrices). Note that a chunk can contain several files, only a portion of a single file, or even both!
- we can read one such chunk into memory, depack the rows, and transpose the resulting matrix, while consuming say less than 300MB of memory
- we can immediately do an IFFT on the columns of this chunk, as that will be needed anyway as part of the full column IFFT algorithm. As these columns are relatively small, we are not required to use an in-place algorithm here
- write out the transposed and IFFT-ed chunk to a temporary location on SSD
- do the same for all the $32 = 8\textrm{GB} \,/\, 256\textrm{MB}$ chunks.
- now proceed in blocks of 4 or 8 columns to do the RS-encoding:
- from each of the 32 chunks, read in the first 8 columns; that's 32 matrices of size $2^{17}\times 8$, occupying about $32\times 8\textrm{MB} = 256\textrm{MB}$ of memory
- each chunk is already IFFT-d; we can probably finish the large $2^{22}$ sized IFFT-s with an in-place algorithm
- proceed similarly for the shifted FFT computing the 8 parity columns
- write out the result to SSD, but in $2^{17}\times 8$ sized blocks (!)
- do a final transposition to row-major order and store on long-term storage (HDD):
- again for each of the 32 chunks (a matrix of field elements of size $2^{17} \times 265$), read the corresponding $2^{17}\times 8$ block of cells
- do a transposition in-memory so it's row-major
- compute the row hashes (and store in memory)
- re-pack the rows into 2048 bytes
- store the rows linearly on HDD
- compute the Merkle tree over the row hashes (size $2^{23}$)
- execute the FRI protocol. Reading the rows from HDD should be fine (100-200 seeks for 100-200 samples), but optionally we can also cache the final encoded dataset on SSD for the duration of this step
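
The numeric sketch below just recomputes the chunk and block geometry referred to in the list (all numbers are the example configuration's assumptions):

```python
# Chunk/block geometry for 8 GiB of original data, M = 265, N = 2^22.
ROW_BYTES  = 2048
M          = 265                  # field elements per depacked row
N          = 2**22                # original rows
CHUNK_ROWS = 2**17                # rows per chunk

chunk_bytes        = CHUNK_ROWS * ROW_BYTES          # 256 MiB of packed data per chunk
num_chunks         = N // CHUNK_ROWS                 # 32 chunks
chunk_unpacked     = CHUNK_ROWS * M * 8              # 265 MiB of field elements (< 300 MB)
colgroup_per_chunk = CHUNK_ROWS * 8 * 8              # 8 MiB: 8 columns of one chunk
colgroup_total     = num_chunks * colgroup_per_chunk # 256 MiB to assemble 8 full columns

print(num_chunks, chunk_bytes >> 20, chunk_unpacked >> 20,
      colgroup_per_chunk >> 20, colgroup_total >> 20)   # 32 256 265 8 256
```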
### Estimated time for the whole process
All critical operations:
- reading/writing from HDD
- Monolith hashing
- NTT / INTT
seem to have similar speeds: namely, around 200-300 MB/sec (though the CPU-bound ones can be parallelized; these are single-core measurements on an M2 MacBook Pro).
As $16\,\textrm{GB}\,/\,200\,\textrm{MB/s} \approx 82$ seconds per pass over the data, I expect the whole process to finish in maybe 3 minutes or less for 8GB of original (16GB of encoded) data.