initial Merkle tree doc draft

Balazs Komuves 2023-11-06 23:23:17 +01:00
parent f0c0d5bb38
commit 415bb6db8c

design/Merkle.md Normal file

@ -0,0 +1,226 @@
Merkle tree API proposal (WIP draft)
------------------------------------
Let's collect the possible problems and solutions with constructing Merkle trees.
### Vocabulary
A Merkle tree, built on a hash function `H`, produces a Merkle root of type `T`.
This is usually the same type as the output of the hash function. Some examples:
- SHA1: `T` is 160 bits
- SHA256: `T` is 256 bits
- Poseidon: `T` is one (or a few) finite field element(s)
The hash function `H` can also have different types `S` of inputs. For example:
- SHA1 / SHA256 / SHA3: `S` is an arbitrary sequence of bits
- some less-conforming implementation of these could take a sequence of bytes instead
- Poseidon: `S` is a sequence of finite field elements
- Poseidon compression function: at most `t-1` field elements (in our case `t=3`, so
that's two field elements)
- A naive Merkle tree implementation could for example accept only a power-of-two
sized sequence of `T`
Notation: Let's denote a sequence of `T` by `[T]`.
### Merkle tree API
We usually need at least two types of Merkle tree APIs:
- one which takes a sequence `S = [T]` of length `n` as input, and produces an
output (Merkle root) of type `T`
- and one which takes a sequence of bytes (or even bits, but in practice we probably
only need bytes): `S = [byte]`
We can decompose the latter into the composition of a function
`deserialize : [byte] -> [T]` and the former.
### Naive Merkle tree implementation
A straightforward implementation of a binary Merkle tree `merkleRoot : [T] -> T`
could be for example:
- if the input has length 1, it's the root
- if the input has even length `2*k`, group it into pairs, apply a
`compress : (T,T) -> T` compression function, producing the next layer of size `k`
- if the input has odd length `2*k+1`, pad it with an extra element `dummy` of
type `T`, then apply the procedure for even length, producing the next layer of size `k+1`
The compression function could be implemented in several ways:
- when `S` and `T` are just sequences of bits or bytes (as in the case of classical hash
functions like SHA256), we can just concatenate the two leaves of the node and apply the
hash: `compress(x,y) := H(x|y)`
- in case of hash functions based on the sponge construction (like Poseidon or Keccak/SHA3),
we can just fill the "capacity part" of the state with a constant (say 0), the "absorbing
part" of the state with the two inputs, apply the permutation, and extract a single `T`
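The naive construction above can be sketched as follows. This is only an illustration, assuming `H` is SHA256, `T` is 32 bytes, and `dummy` is the all-zero value (the text does not fix `dummy`; the names `DUMMY`, `compress` and `merkle_root` are chosen here for the example).

```python
import hashlib
from typing import Sequence

DUMMY = bytes(32)  # assumption: the text leaves the value of `dummy` open

def compress(x: bytes, y: bytes) -> bytes:
    """compress(x,y) := H(x|y), with H = SHA256."""
    return hashlib.sha256(x + y).digest()

def merkle_root(leaves: Sequence[bytes]) -> bytes:
    """Naive binary Merkle root of a non-empty sequence of 32-byte values."""
    if not leaves:
        raise ValueError("empty input")
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(DUMMY)  # odd length 2*k+1: pad to 2*k+2
        # group into pairs and compress, producing the next layer
        layer = [compress(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]
```

For three leaves `a, b, c` this computes `compress(compress(a,b), compress(c,dummy))`.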
### Attacks
When implemented without enough care (like the above naive algorithm), there are several
possible attacks producing hash collisions or second preimages:
1. The root of any particular layer is the same as the root of the input
2. The root of `[x_0,x_1,...,x_(2*k)]` (length is `n=2*k+1`) is the same as the root of
`[x_0,x_1,...,x_(2*k),dummy]` (length is `n=2*k+2`)
3. When using bytes as the input, already `deserialize` can have similar collision attacks
4. The root of a singleton sequence is itself
Traditional (linear) hash functions usually solve the analogous problems by clever padding.
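Attacks 1, 2 and 4 are easy to reproduce against a naive implementation. The sketch below assumes SHA256 as `H` and an all-zero `dummy`; all names in it are made up for the illustration.

```python
import hashlib

DUMMY = bytes(32)  # assumed value of `dummy`

def compress(x, y):
    return hashlib.sha256(x + y).digest()

def next_layer(layer):
    """One step of the naive algorithm: pad if odd, then compress pairs."""
    if len(layer) % 2 == 1:
        layer = layer + [DUMMY]
    return [compress(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]

def naive_root(leaves):
    layer = list(leaves)
    while len(layer) > 1:
        layer = next_layer(layer)
    return layer[0]

xs = [bytes([i]) * 32 for i in range(1, 6)]  # five distinct leaves

# Attack 1: an inner layer, used as an input, has the same root.
collision_layer = naive_root(next_layer(xs)) == naive_root(xs)
# Attack 2: appending `dummy` to an odd-length input does not change the root.
collision_dummy = naive_root(xs + [DUMMY]) == naive_root(xs)
# Attack 4: a singleton sequence is its own root.
singleton_is_root = naive_root([xs[0]]) == xs[0]
```

All three booleans come out true, so each pair of distinct inputs collides.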
### Domain separation
It's a good practice in general to ensure that different constructions using the same
underlying hash functions will never produce the same output. This is called "domain separation",
and it's a bit similar to _multihash_; however instead of adding extra bits of information to a hash
(and thus increasing its size), we just compress the extra information into the hash itself.
A simple example would be using `H(dom|H(...))` instead of `H(...)`. The below solutions
can be interpreted as an application of this idea, where we want to separate the different
lengths `n`.
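A minimal sketch of the `H(dom|H(...))` pattern, assuming SHA256 and a byte-string domain tag (the tag values below are hypothetical):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def H_dom(dom: bytes, data: bytes) -> bytes:
    """H(dom|H(data)). The inner hash always has 32 bytes, so the last
    32 bytes of dom|H(data) are the hash and the rest is the tag."""
    return H(dom + H(data))

msg = b"same payload"
tree_hash = H_dom(b"merkle-tree", msg)  # hypothetical tag
file_hash = H_dom(b"file-id", msg)      # hypothetical tag
```

The same payload hashed under two different tags gives unrelated outputs, and the output size stays 32 bytes.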
### Possible solutions (for the tree attacks)
While the third problem (`deserialize` may not be injective) is similar to the second problem,
let's deal first with the tree problems, and come back to `deserialize` (see below) later.
**Solution 0b.** Pre-hash each input element. This solves 2) and 4) (if we choose `dummy` to be
something we don't expect anybody to find a preimage), but does not solve 1); also it
doubles the computation time.
**Solution 1.** Just prepend the data with the length `n` of the input sequence. Note that any
cryptographic hash function needs an output size of at least 160 bits (and usually at least
256 bits), so we can always embed the length (surely less than `2^64`) into `T`. This solves
both problems 1) and 2) (the height of the tree is a deterministic function of the length),
and 4) too.
However, a typical application of a Merkle tree is the case where the length of the input
`n=2^d` is a power of two; in this case it looks a little bit "inelegant" to increase the size
to `n=2^d+1`, though the overhead with the above even-odd construction is only `log2(n)`.
An advantage is that you can _prove_ the size of the input with a standard Merkle inclusion proof.
Alternative version: append instead of prepend; then the indexing of the leaves does not change.
**Solution 2.** Apply an extra compression step at the very end including the length `n`,
calculating `newRoot = compress(n,origRoot)`. This again solves all three tree problems, 1), 2) and 4). However, it
makes the code a bit less regular; and you have to submit the length as part of Merkle proofs.
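Solution 2 could look roughly like this. It is a sketch assuming SHA256, an all-zero `dummy`, and the length `n` embedded into `T` as a 32-byte big-endian integer (that encoding is an assumption, not something fixed by the text).

```python
import hashlib

DUMMY = bytes(32)  # assumed value of `dummy`

def compress(x, y):
    return hashlib.sha256(x + y).digest()

def naive_root(leaves):
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(DUMMY)
        layer = [compress(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def root_with_length(leaves):
    """newRoot = compress(n, origRoot)"""
    n_as_T = len(leaves).to_bytes(32, "big")  # assumed embedding of n into T
    return compress(n_as_T, naive_root(leaves))

xs = [bytes([i]) * 32 for i in range(1, 6)]
```

Here `naive_root(xs)` and `naive_root(xs + [DUMMY])` still coincide, but the two values of `root_with_length` differ because `n` is 5 in one case and 6 in the other.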
**Solution 3a.** Use two different compression functions, one for the bottom layer (by bottom
I mean the closest to the input) and another for all the other layers. For example you can
use `compress(x,y) := H(isBottomLayer|x|y)`. This solves problem 1).
**Solution 3b.** Use two different compression functions, one for the even nodes, and another
for the odd nodes (that is, those with a single child instead of two). Similarly to the
previous case, you can use for example `compress(x,y) := H(isOddNode|x|y)` (note that for
the odd nodes, we will have `y=dummy`). This solves problem 2). Remark: The extra bits of
information (odd/even) added to the last nodes (one in each layer) are exactly the binary
expansion of the length `n`. A disadvantage is that for verifying a Merkle proof, we need to
know for each node whether it's the last or not, so we need to include the length `n` into
any Merkle proof here too.
**Solution 3.** Combining **3a** and **3b**, we can solve both problems 1) and 2); so here we add
two bits of information to each node (that is, we need 4 different compression functions).
Problem 4) can always be solved by adding a final compression call.
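A sketch of the combined Solution 3, assuming SHA256 and packing the two flags `isBottomLayer` and `isOddNode` into a single key byte (the packing and all names are assumptions for this example). The final compression call for problem 4) is left out.

```python
import hashlib

DUMMY = bytes(32)  # assumed value of `dummy`

def keyed_compress(is_bottom, is_odd, x, y):
    """compress(x,y) := H(isBottomLayer|isOddNode|x|y), flags packed in one byte."""
    key = bytes([2 * int(is_bottom) + int(is_odd)])
    return hashlib.sha256(key + x + y).digest()

def next_layer(layer, is_bottom):
    out = []
    for i in range(0, len(layer), 2):
        if i + 1 < len(layer):
            out.append(keyed_compress(is_bottom, False, layer[i], layer[i + 1]))
        else:  # odd node: single child, paired with dummy
            out.append(keyed_compress(is_bottom, True, layer[i], DUMMY))
    return out

def root_sol3(leaves):
    layer, is_bottom = list(leaves), True
    while len(layer) > 1:
        layer = next_layer(layer, is_bottom)
        is_bottom = False
    return layer[0]

xs = [bytes([i]) * 32 for i in range(1, 6)]
```

With these four compression functions, neither feeding an inner layer back in as input nor appending `dummy` reproduces the root of `xs`.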
**Solution 4a.** Replace each input element `x_i` with `compress(i,x_i)`. This solves
both problems again (and 4) too), but doubles the amount of computation.
**Solution 4b.** Only in the bottom layer, use `H(1|isOddNode|i|x_{2i}|x_{2i+1})` for
compression (note that for the odd node we have `x_{2i+1}=dummy`). This is similar to
the previous solution, but does not increase the amount of computation.
**Solution 4c.** Only in the bottom layer, use `H(i|j|x_i|x_j)` for even nodes
(with `i=2*k` and `j=2*k+1`), and `H(i|0|x_i|0)` for the odd node (or alternatively
we could also use `H(i|i|x_i|x_i)` for the odd node). Note: when verifying
a Merkle proof, you still need to know whether the element you prove is the last _and_
odd element, or not. However instead of submitting the length, you can encode this
into a single bit (not sure if that's much better though).
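The bottom layer of Solution 4c might be sketched like this, assuming SHA256, indices encoded as 8-byte big-endian integers, and 32 zero bytes for the `0` placeholder (all of these encodings are assumptions).

```python
import hashlib

ZERO = bytes(32)  # stands in for the `0` in H(i|0|x_i|0)

def enc(i: int) -> bytes:
    return i.to_bytes(8, "big")  # assumed fixed-width index encoding

def bottom_layer_4c(xs):
    out = []
    for i in range(0, len(xs), 2):
        if i + 1 < len(xs):
            # even node: H(i|j|x_i|x_j) with j = i+1
            out.append(hashlib.sha256(enc(i) + enc(i + 1) + xs[i] + xs[i + 1]).digest())
        else:
            # odd node: H(i|0|x_i|0); j = 0 never occurs for an even node
            out.append(hashlib.sha256(enc(i) + enc(0) + xs[i] + ZERO).digest())
    return out

a, b, c = (bytes([i]) * 32 for i in (1, 2, 3))
```

Since the indices go into the hash, `[a, b, c]` and `[a, b, c, ZERO]` give different last nodes, and the same pair at a different position gives a different node.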
**Solution 5 (??).** Use a different tree shape, where the left subtree is always a complete
(full) binary tree with `2^floor(log2(n-1))` leaves, and the right subtree is
constructed recursively. Then the shape of the tree encodes the number of inputs `n`.
This however complicates the Merkle proofs (they won't have uniform size).
TODO: think more about this!
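To make the shape concrete, here is a rough sketch of the recursion, assuming SHA256 for `compress`. It only illustrates the tree shape and the non-uniform proof sizes; it makes no claim about which attacks this construction resists.

```python
import hashlib

def compress(x, y):
    return hashlib.sha256(x + y).digest()

def left_size(n: int) -> int:
    """2^floor(log2(n-1)) for n >= 2: the largest power of two strictly below n."""
    return 1 << ((n - 1).bit_length() - 1)

def root_sol5(leaves):
    n = len(leaves)
    if n == 1:
        return leaves[0]
    k = left_size(n)  # the left subtree is a complete tree with k leaves
    return compress(root_sol5(leaves[:k]), root_sol5(leaves[k:]))

def proof_length(n: int, i: int) -> int:
    """Number of sibling hashes on the path from leaf i to the root."""
    depth = 0
    while n > 1:
        k = left_size(n)
        if i < k:
            n = k
        else:
            i, n = i - k, n - k
        depth += 1
    return depth
```

For `n=5` the first four leaves sit at depth 3 while the last leaf sits at depth 1, which is the non-uniform proof size mentioned above.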
### Keyed compression functions
How can we have many different compression functions? Consider three case studies:
**Poseidon.** The Poseidon family of hashes is built on a (fixed) permutation
`pi : F^t -> F^t`, where `F` is a (large) finite field. For simplicity consider the case `t=3`.
The standard compression function is then defined as:

    compress(x,y) := let (u,_,_) = pi(x,y,0) in u

That is, we take the triple `(x,y,0)`, apply the permutation to get another triple `(u,v,w)`, and
extract the field element `u` (we could use `v` or `w` too, it shouldn't matter).
Now we can see that it is in fact very easy to generalize this to a _keyed_ (or _indexed_)
compression function:

    compress_k(x,y) := let (u,_,_) = pi(x,y,k) in u

where `k` is the key. Note that there is no overhead in doing this. And since `F` is pretty
big (in our case, about 253 bits), there is plenty of information we can encode in the key `k`.
Note: We probably lose a few bits of security here, if somebody looks for a preimage among
_all_ keys; however in our constructions the keys have a fixed structure, so it's probably
not that dangerous. If we want to be extra safe, we could use `t=4` and `pi(x,y,k,0)`
instead (but that has some computation overhead).
**SHA256.** When using SHA256 as our hash function, normally the compression function is
defined as `compress(x,y) := SHA256(x|y)`, that is, concatenate the (bitstring representation of the)
two elements, and apply SHA256 to the resulting (bit)string. Normally `x` and `y` are both
256 bits long, and so is the result. If we look into the details of how SHA256 is specified,
this is actually wasteful. That's because while SHA256 processes the input in 512 bit chunks,
it also prescribes a mandatory nonempty padding. So when calling SHA256 on an input of size
512 bit (64 bytes), it will actually process two chunks, the second chunk consisting purely
of padding. When constructing a binary Merkle tree using a compression function like before,
the input is always of the same size, so this padding is unnecessary; nevertheless, people
usually prefer to follow the standardized SHA256 call. But, if we are processing 1024 bits
anyway, we have a lot of free space to include our key `k`! In fact we can add up to
`512-64-1=447` bits of additional information; so for example

    compress_k(x,y) := SHA256(k|x|y)

works perfectly well with no overhead compared to `SHA256(x|y)`.
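The block count can be checked with a few lines. The sketch below assumes a byte-aligned key of fixed length (at most 55 bytes, i.e. 440 of the available 447 bits); the helper names are made up for the illustration.

```python
import hashlib

def sha256_blocks(msg_bits: int) -> int:
    """512-bit blocks processed by SHA256: the message, one mandatory
    padding bit and the 64-bit length field, rounded up to whole blocks."""
    return -(-(msg_bits + 1 + 64) // 512)

MAX_KEY_BYTES = 55  # 440 bits, the largest whole number of bytes below 447 bits

def compress_k(k: bytes, x: bytes, y: bytes) -> bytes:
    """compress_k(x,y) := SHA256(k|x|y). Assumes all keys in one tree have
    the same length, so that k|x|y parses unambiguously."""
    if len(k) > MAX_KEY_BYTES:
        raise ValueError("key would spill into a third block")
    if len(x) != 32 or len(y) != 32:
        raise ValueError("x and y must be 32 bytes each")
    return hashlib.sha256(k + x + y).digest()
```

`sha256_blocks(512)` and `sha256_blocks(512 + 447)` are both 2, while one more bit of key pushes the count to 3.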
**MiMC.** MiMC is another arithmetic construction, however in this
case the starting point is a _block cipher_, that is, we start with
a keyed permutation! Unfortunately MiMC-p/p is a (keyed) permutation
of `F`, which is not very useful for us; however in Feistel mode we
get a keyed permutation of `F^2`, and we can just take the first
component of the output of that as the compressed output.
### Tree padding proposal
It seems to me that whatever way we try to solve problem 2) without pre-hashing, we need
to include the length (or at least one bit of information about the length) into the Merkle
proofs. So maybe we should just live with that.
Then from the above choices, right now maybe solution **4c**, or some variation of it
looks the nicest to me.
### Making `deserialize` injective
Consider the following simple algorithm to deserialize a sequence of bytes into chunks of
31 bytes:
- pad the input with at most 30 zero bytes such that the padded length becomes divisible
by 31
- split the padded sequence into `ceil(n/31)` chunks, each 31 bytes.
The problem with this is that for example `0x123456`, `0x12345600` and `0x1234560000`
all result in the same output.
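The collision is easy to reproduce. A minimal sketch of the zero-padding deserializer (the function name is chosen here for the example):

```python
def deserialize_naive(data: bytes) -> list:
    """Zero-pad to a multiple of 31 bytes, then split into 31-byte chunks."""
    pad = (-len(data)) % 31  # between 0 and 30 zero bytes
    padded = data + bytes(pad)
    return [padded[i:i + 31] for i in range(0, len(padded), 31)]

c1 = deserialize_naive(bytes.fromhex("123456"))
c2 = deserialize_naive(bytes.fromhex("12345600"))
c3 = deserialize_naive(bytes.fromhex("1234560000"))
```

All three calls return the same single chunk, because the trailing zero bytes of the input are indistinguishable from padding.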
Some possible solutions:
- prepend the length (number of input bytes) to the input, say as a 64-bit little-endian integer (8 bytes),
before padding as above
- or append the length instead of prepending, then pad
- or first pad with zero bytes, but leave 8 bytes for the length (so that when we finally append
the length, the result will be divisible by 31).
- use the following padding strategy: _always_ add a single 0x01 byte, then enough 0x00 bytes so that the length
is divisible by 31. Why does this work? Well, consider an already padded sequence. Count
the number of zero bytes from the end: you get a number `0 <= m < 31`. This number
determines the residue class of the original length `n` modulo 31; then this class,
together with the padded length fully determines the original length.
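The last strategy and its inverse can be sketched as follows (function names are made up for the example). The existence of an inverse is exactly what makes the padding injective.

```python
def pad_10(data: bytes) -> bytes:
    """Always append one 0x01 byte, then 0x00 bytes up to a multiple of 31."""
    padded = data + b"\x01"
    return padded + bytes((-len(padded)) % 31)

def unpad_10(padded: bytes) -> bytes:
    """Recover the original input: strip the trailing zeros and the 0x01 marker."""
    stripped = padded.rstrip(b"\x00")
    m = len(padded) - len(stripped)  # number of zero bytes at the end
    if (not padded or len(padded) % 31 != 0 or m >= 31
            or not stripped.endswith(b"\x01")):
        raise ValueError("not a validly padded sequence")
    return stripped[:-1]

samples = [b"", bytes.fromhex("123456"), bytes.fromhex("12345600"),
           bytes.fromhex("1234560000"), bytes(30), bytes(31)]
```

Note that an input of exactly 31 bytes grows to 62 bytes, since the marker byte is always added.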