update the README to include the per-block hashing convention

Balazs Komuves 2023-11-28 15:23:50 +01:00
parent 55015008e7
commit 4d101442ca


@@ -5,14 +5,24 @@ Codex Storage Proofs for the MVP

This document describes the storage proof system for the Codex 2023 Q4 MVP.

Repo organization
-----------------

- `README.md` - this document
- `circuit/` - the proof circuit (`circom` code)
- `reference/haskell/` - Haskell reference implementation of the proof input generation
- `reference/nim/` - Nim reference implementation of the proof input generation
- `test/` - tests for (some parts of the) circuit (using the `r1cs-solver` tool)

Setup
-----

We assume that a user dataset is split into `nSlots` number of (not necessarily
uniformly sized) "slots" of size `slotSize`, for example 10 GB, 100 GB or even
1,000 GB (for the MVP we may choose smaller sizes). The slots of the same dataset
are spread over different storage nodes, but a single storage node can hold several
slots (of different sizes, and belonging to different datasets). The slots themselves
can optionally be erasure coded, but this does not change the proof system,
only its robustness.

@@ -26,23 +36,30 @@ Note that we can simply calculate:
nCells = slotSize / cellSize
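
For concreteness, here is a tiny worked example with purely illustrative sizes
(the actual MVP parameters may differ); it also uses the 64kb block size
introduced below:

```haskell
-- Purely illustrative sizes; the MVP may use different parameters.
slotSize, cellSize, blockSize :: Integer
slotSize  = 1024 * 1024 * 1024      -- assume a 1 GB slot
cellSize  = 2048                    -- assume 2 KB cells
blockSize = 64 * 1024               -- the 64kb network block size (see below)

nCells, cellsPerBlock, nBlocks :: Integer
nCells        = slotSize  `div` cellSize    -- = 524288 = 2^19
cellsPerBlock = blockSize `div` cellSize    -- = 32
nBlocks       = slotSize  `div` blockSize   -- = 16384 = 2^14
```
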
We hash each cell independently, using the sponge construction with Poseidon2
(see below for details).
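
A generic sketch of such a sponge hash is given below; the permutation, rate,
capacity and padding used here are placeholders for illustration only, not the
actual conventions (those are described below):

```haskell
-- Generic sponge over field elements, with a toy permutation standing in for
-- Poseidon2. Rate, capacity, padding and domain separation are illustrative.
type F = Integer      -- stand-in for a field element

rate, capacity :: Int
rate     = 2          -- assumed: absorb 2 field elements per permutation call
capacity = 1

-- toy stand-in for the Poseidon2 permutation on a width-3 state (NOT the real one)
poseidon2Perm :: [F] -> [F]
poseidon2Perm st = [ (x * x * x + sum st + fromIntegral i) `mod` p
                   | (i, x) <- zip [(0 :: Int) ..] st ]
  where p = 2 ^ 61 - 1

-- hash one (already padded) cell, given as a list of field elements
cellHash :: [F] -> F
cellHash input = head (foldl absorb initState (chunksOf rate input))
  where
    initState       = replicate (rate + capacity) 0
    absorb st chunk = poseidon2Perm (zipWith (+) st (chunk ++ repeat 0))
    chunksOf n xs
      | null xs     = []
      | otherwise   = take n xs : chunksOf n (drop n xs)
```
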
The cells are then organized into `blockSize = 64kb` blocks, each block containing
`blockSize / cellSize = 32` cells. This is for compatibility with the networking
layer, which uses larger (right now 64kb) blocks. For each block, we compute a
block hash by building a depth `5 = log2(32)` complete Merkle tree, again using
the Poseidon2 hash, with the Merkle tree conventions described below.

Then, on the set of block hashes in a slot (we have `slotSize / blockSize` of
them), we build another (big) Merkle tree, whose root identifies the slot; this
root is called the "slot root" and is denoted by `slotRoot`.
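
Below is a minimal sketch of these two steps, assuming an abstract two-to-one
compression based on Poseidon2; the `compress` function here is a toy
placeholder, and the exact Merkle tree conventions are the ones described below:

```haskell
-- Sketch only: a toy stand-in for the Poseidon2-based 2-to-1 compression.
type Hash = Integer

compress :: Hash -> Hash -> Hash
compress x y = (3 * x + 5 * y + 7) `mod` (2 ^ 61 - 1)   -- NOT the real hash

-- root of a complete binary Merkle tree over a power-of-two number of leaves
merkleRoot :: [Hash] -> Hash
merkleRoot []  = error "empty leaf list"
merkleRoot [h] = h
merkleRoot hs  = merkleRoot (pairUp hs)
  where
    pairUp (x : y : rest) = compress x y : pairUp rest
    pairUp []             = []
    pairUp [_]            = error "expected a power-of-two number of leaves"

-- block hash: depth-5 complete tree over the 32 cell hashes of one block
blockHash :: [Hash] -> Hash
blockHash cellHashes
  | length cellHashes == 32 = merkleRoot cellHashes
  | otherwise               = error "a block must contain exactly 32 cells"

-- slot root: Merkle tree over all block hashes of the slot
-- (assumed here to be a power of two; see the padding discussion below)
slotRoot :: [Hash] -> Hash
slotRoot blockHashes = merkleRoot blockHashes
```
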
Then, for a given dataset containing several slots, we can build a third binary
Merkle tree on top of its slot roots, resulting in the "dataset root" (note:
this is not the same as the SHA256 hash associated with the original dataset
uploaded by the user). Grafting these Merkle trees together we get a big dataset
Merkle tree; however, one should be careful about the padding conventions (it
makes sense to construct the dataset-level Merkle tree separately, as `nSlots`
may not be a power of two, and later maybe `nCells` and `nBlocks` won't be
powers of two either).

The dataset root is a commitment to the whole (erasure coded) dataset, and will
be posted on-chain, to ensure that the nodes really store the user's data and
not something else.
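
As an illustration of the grafting, reusing the `merkleRoot` helper from the
sketch above (the zero-padding here is only a placeholder; the actual padding
convention is the one described below):

```haskell
-- Illustration only: pad the slot roots up to the next power of two with a
-- dummy leaf, then Merkle-ize them. The real padding convention may differ.
datasetRoot :: [Hash] -> Hash
datasetRoot slotRoots = merkleRoot (slotRoots ++ replicate (pow2 - n) 0)
  where
    n    = length slotRoots
    pow2 = head [ p | k <- [(0 :: Int) ..], let p = 2 ^ k, p >= n ]
```
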
Optionally, the slot roots can be posted on-chain, but this seems to be somewhat
wasteful.

@@ -194,16 +211,15 @@ the samples in single slot; then use Groth16 to prove it.

Public inputs:
- dataset root
- slot index within the dataset (that is, which slot of the dataset we are
  talking about, in the range `[1..nSlots]`)
- entropy (public randomness)

Private inputs:
- the slot root
- the number of cells in the slot
- the number of slots in the dataset
- the underlying data of the cells, as sequences of field elements
- the Merkle paths from the leaves (the cell hashes) to the slot root
- the Merkle path from the slot root to the dataset root
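
To summarize the two lists above, here is a hedged sketch of how the prover-side
inputs could be organized; the field names are hypothetical and reuse the `Hash`
stand-in type from the sketches above (the actual circuit signals live in
`circuit/` and may be named and encoded differently):

```haskell
-- Hypothetical grouping of the circuit inputs, for illustration only.
data PublicInputs = PublicInputs
  { pubDatasetRoot :: Hash      -- the dataset commitment posted on-chain
  , pubSlotIndex   :: Int       -- which slot of the dataset is being proven
  , pubEntropy     :: Hash      -- public randomness selecting the sampled cells
  }

data PrivateInputs = PrivateInputs
  { privSlotRoot   :: Hash      -- root of the slot-level Merkle tree
  , privNCells     :: Int       -- number of cells in the slot
  , privNSlots     :: Int       -- number of slots in the dataset
  , privCellData   :: [[Hash]]  -- the sampled cells, as sequences of field elements
  , privCellPaths  :: [[Hash]]  -- Merkle paths from the cell hashes to the slot root
  , privSlotPath   :: [Hash]    -- Merkle path from the slot root to the dataset root
  }
```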