update the README to include the per-block hashing convention

Balazs Komuves 2023-11-28 15:23:50 +01:00
parent 55015008e7
commit 4d101442ca


@ -5,14 +5,24 @@ Codex Storage Proofs for the MVP
This document describes the storage proof system for the Codex 2023 Q4 MVP.
Repo organization
-----------------
- `README.md` - this document
- `circuit/` - the proof circuit (`circom` code)
- `reference/haskell/` - Haskell reference implementation of the proof input generation
- `reference/nim/` - Nim reference implementation of the proof input generation
- `test/` - tests for (some parts of the) circuit (using the `r1cs-solver` tool)
Setup
-----
We assume that a user dataset is split into `nSlots` number of (not necessarily
uniformly sized) "slots" of size `slotSize`, for example 10 GB or 100 GB or even
1,000 GB (for the MVP we may choose smaller sizes). The slots of the same dataset
are spread over different storage nodes, but a single storage node can hold
several slots (of different sizes, and belonging to different datasets). The
slots themselves can be optionally erasure coded, but this does not change the
proof system, only its robustness.
@ -26,23 +36,30 @@ Note that we can simply calculate:
nCells = slotSize / cellSize
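
As a purely hypothetical illustration (the actual `slotSize` and `cellSize` for
the MVP are fixed elsewhere in this document), in Haskell:

```haskell
-- Hypothetical example sizes, for illustration only; the actual MVP
-- parameters are specified elsewhere in this document.
slotSize, cellSize, nCells :: Integer
slotSize = 64 * 1024 * 1024          -- a 64 MB slot
cellSize = 2048                      -- a 2 kB cell
nCells   = slotSize `div` cellSize   -- = 32768 = 2^15
```
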
We hash each cell independently, using the sponge construction with Poseidon2
(see below for details).
The cells are then organized into `blockSize = 64kb` blocks, each block
containing `blockSize / cellSize = 32` cells. This is for compatibility with the
networking layer, which uses larger (right now 64kb) blocks. For each block, we
compute a block hash by building a depth `5 = log2(32)` complete Merkle tree,
again using the Poseidon2 hash, with the Merkle tree conventions described below.
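
A minimal sketch of this per-block hashing convention, assuming hypothetical
`hashCell` (the per-cell sponge hash) and `poseidon2Compress` (a two-to-one node
hash) helpers; the names are illustrative, and the precise conventions are the
ones described below:

```haskell
-- Illustrative sketch only; `F` stands in for a field element, and the two
-- helpers below stand in for the Poseidon2-based hashes described in this
-- document (they are not taken from the reference implementations).
type F = Integer

hashCell :: [F] -> F              -- assumed per-cell sponge hash
hashCell = undefined

poseidon2Compress :: F -> F -> F  -- assumed two-to-one Merkle node hash
poseidon2Compress = undefined

-- Hash adjacent pairs to produce the next (half-sized) layer of the tree.
merkleLayer :: [F] -> [F]
merkleLayer (x:y:rest) = poseidon2Compress x y : merkleLayer rest
merkleLayer []         = []
merkleLayer _          = error "layer size must be even"

-- A block consists of exactly blockSize / cellSize = 32 cells, so the tree
-- is complete and has depth 5 = log2(32).
blockHash :: [[F]] -> F
blockHash cells
  | length cells == 32 = root (map hashCell cells)
  | otherwise          = error "a block must contain exactly 32 cells"
  where
    root [r]   = r
    root layer = root (merkleLayer layer)
```
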
Then on the set of block hashes in a slot (we have `slotSize / blockSize` of
them), we build another (big) Merkle tree, whose root will identify the slot;
we call this the "slot root", and denote it by `slotRoot`.
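
Continuing the sketch above (and reusing its hypothetical `merkleLayer` helper),
the slot root is simply the root of a binary Merkle tree built over the block
hashes; the padding rules needed for non-power-of-two layers are described below
and omitted here:

```haskell
-- Continuation of the sketch above, reusing `merkleLayer`; this assumes the
-- number of block hashes (slotSize / blockSize) is a power of two, so no
-- padding is shown.
slotRoot :: [F] -> F
slotRoot blockHashes = go blockHashes
  where
    go [r]   = r
    go layer = go (merkleLayer layer)
```
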
Then for a given dataset, containing several slots, we can build a third binary
Merkle tree on top of its slot roots, resulting in the "dataset root" (note:
this is not the same as the SHA256 hash associated with the original dataset
uploaded by the user). Grafting these Merkle trees together we get a big dataset
Merkle tree; however, one should be careful about the padding conventions
(it makes sense to construct the dataset-level Merkle tree separately, as
`nSlots` may not be a power of two, and later maybe `nCells` and `nBlocks` won't
be powers of two either).

The dataset root is a commitment to the whole (erasure coded) dataset, and will
be posted on-chain, to ensure that the nodes really store the user's data and
not something else.
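
As a hedged summary of how the three trees graft together (again reusing the
hypothetical helpers from the sketches above, and ignoring the dataset-level
padding needed when `nSlots` is not a power of two):

```haskell
-- Hedged summary sketch, reusing `blockHash`, `slotRoot` and `merkleLayer`
-- from the sketches above; the dataset-level padding (needed when the number
-- of slot roots is not a power of two) is intentionally omitted.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)

-- A slot is a list of cells (each cell a list of field elements);
-- a dataset is a list of slots.
datasetRoot :: [[[F]]] -> F
datasetRoot slots = go (map slotRootOfCells slots)
  where
    slotRootOfCells cells = slotRoot (map blockHash (chunksOf 32 cells))
    go [r]   = r
    go layer = go (merkleLayer layer)
```
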
Optionally, the slot roots can be posted on-chain, but this seems to be somewhat
wasteful.
@ -194,16 +211,15 @@ the samples in single slot; then use Groth16 to prove it.
Public inputs:
- dataset root
- slot index within the dataset
- entropy (public randomness)
Private inputs:
- the slot root
- number of cells in the slot
- the number of slots in the dataset
- the underlying data of the cells, as sequences of field elements - the underlying data of the cells, as sequences of field elements
- the Merkle paths from the leaves (the cell hashes) to the slot root - the Merkle paths from the leaves (the cell hashes) to the slot root
- the Merkle path from the slot root to the dataset root - the Merkle path from the slot root to the dataset root
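
For orientation only, the circuit inputs listed above could be grouped as in the
following hypothetical record types (field names and types are illustrative and
do not come from the circuit or the reference implementations; `F` again stands
in for a field element):

```haskell
-- Hypothetical grouping of the inputs listed above; names and types are
-- illustrative only. `F` stands in for a field element, as in the sketches above.
data PublicInputs = PublicInputs
  { pubDatasetRoot :: F      -- the on-chain commitment
  , pubSlotIndex   :: Int    -- which slot of the dataset is being proven
  , pubEntropy     :: F      -- public randomness
  }

data PrivateInputs = PrivateInputs
  { privSlotRoot   :: F      -- root of the slot being proven
  , privNCells     :: Int    -- number of cells in the slot
  , privNSlots     :: Int    -- number of slots in the dataset
  , privCellData   :: [[F]]  -- the opened cells, as sequences of field elements
  , privCellPaths  :: [[F]]  -- Merkle paths from the cell hashes to the slot root
  , privSlotPath   :: [F]    -- Merkle path from the slot root to the dataset root
  }
```
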