diff --git a/design/Merkle.md b/design/Merkle.md new file mode 100644 index 0000000..89e91f6 --- /dev/null +++ b/design/Merkle.md @@ -0,0 +1,226 @@ + +Merkle tree API proposal (WIP draft) +------------------------------------ + +Let's collect the possible problems and solutions with constructing Merkle trees. + +### Vocabulary + +A Merkle tree, built on a hash function `H`, produces a Merkle root of type `T`. +This is usually the same type as the output of the hash function. Some examples: + +- SHA1: `T` is 160 bits +- SHA256: `T` is 256 bits +- Poseidon: `T` is one (or a few) finite field element(s) + +The hash function `H` can also have different types `S` of inputs. For example: + +- SHA1 / SHA256 / SHA3: `S` is an arbitrary sequence of bits +- some less-conforming implementation of these could take a sequence of bytes instead +- Poseidon: `S` is a sequence of finite field elements +- Poseidon compression function: at most `t-1` field elements (in our case `t=3`, so + that's two field elements) +- A naive Merkle tree implementation could for example accept only a power-of-two + sized sequence of `T` + +Notation: Let's denote a sequence of `T` by `[T]`. + +### Merkle tree API + +We usually need at least two types of Merkle tree APIs: + +- one which takes a sequence `S = [T]` of length `n` as input, and produces an + output (Merkle root) of type `T` +- and one which takes a sequence of bytes (or even bits, but in practice we probably + only need bytes): `S = [byte]` + +We can decompose the latter into the composition of a function +`deserialize : [byte] -> [T]` and the former. 
+ +### Naive Merkle tree implementation + +A straightforward implementation of a binary Merkle tree `merkleRoot : [T] -> T` +could be, for example: + +- if the input has length 1, it's the root +- if the input has even length `2*k`, group it into pairs, apply a + `compress : (T,T) -> T` compression function, producing the next layer of size `k` +- if the input has odd length `2*k+1`, pad it with an extra element `dummy` of + type `T`, then apply the procedure for even length, producing the next layer of size `k+1` + +The compression function could be implemented in several ways: + +- when `S` and `T` are just sequences of bits or bytes (as in the case of classical hash + functions like SHA256), we can just concatenate the two leaves of the node and apply the + hash: `compress(x,y) := H(x|y)` +- in case of hash functions based on the sponge construction (like Poseidon or Keccak/SHA3), + we can just fill the "capacity part" of the state with a constant (say 0), the "absorbing + part" of the state with the two inputs, apply the permutation, and extract a single `T` + +### Attacks + +When implemented without enough care (like the above naive algorithm), there are several +possible attacks producing hash collisions or second preimages: + +1. The root of any particular layer is the same as the root of the input +2. The root of `[x_0,x_1,...,x_(2*k)]` (length is `n=2*k+1`) is the same as the root of + `[x_0,x_1,...,x_(2*k),dummy]` (length is `n=2*k+2`) +3. When using bytes as the input, `deserialize` itself can already have similar collision attacks +4. The root of a singleton sequence is itself + +Traditional (linear) hash functions usually solve the analogous problems by clever padding. + +### Domain separation + +It's a good practice in general to ensure that different constructions using the same +underlying hash functions will never produce the same output. 
This is called "domain separation", +and it's a bit similar to _multihash_; however instead of adding extra bits of information to a hash +(and thus increasing its size), we just compress the extra information into the hash itself. +A simple example would be using `H(dom|H(...))` instead of `H(...)`. The below solutions +can be interpreted as an application of this idea, where we want to separate the different +lengths `n`. + +### Possible solutions (for the tree attacks) + +While the third problem (`deserialize` may not be injective) is similar to the second problem, +let's deal first with the tree problems, and come back to `deserialize` (see below) later. + +**Solution 0b.** Pre-hash each input element. This solves 2) and 4) (if we choose `dummy` to be +something for which we don't expect anybody to find a preimage), but does not solve 1); also it +doubles the computation time. + +**Solution 1.** Just prepend the data with the length `n` of the input sequence. Note that any +cryptographic hash function needs an output size of at least 160 bits (and usually at least +256 bits), so we can always embed the length (surely less than `2^64`) into `T`. This solves +both problems 1) and 2) (the height of the tree is a deterministic function of the length), +and 4) too. +However, a typical application of a Merkle tree is the case where the length of the input +`n=2^d` is a power of two; in this case it looks a little bit "inelegant" to increase the size +to `n=2^d+1`, though the overhead with the above even-odd construction is only `log2(n)`. +An advantage is that you can _prove_ the size of the input with a standard Merkle inclusion proof. +Alternative version: append instead of prepend; then the indexing of the leaves does not change. + +**Solution 2.** Apply an extra compression step at the very end including the length `n`, +calculating `newRoot = compress(n,origRoot)`. This again solves all 3 problems. 
However, it +makes the code a bit less regular; and you have to submit the length as part of Merkle proofs. + +**Solution 3a.** Use two different compression functions, one for the bottom layer (by bottom +I mean the closest to the input) and another for all the other layers. For example you can +use `compress(x,y) := H(isBottomLayer|x|y)`. This solves problem 1). + +**Solution 3b.** Use two different compression functions, one for the even nodes, and another +for the odd nodes (that is, those with a single child instead of two). Similarly to the +previous case, you can use for example `compress(x,y) := H(isOddNode|x|y)` (note that for +the odd nodes, we will have `y=dummy`). This solves problem 2). Remark: The extra bits of +information (odd/even) added to the last nodes (one in each layer) are exactly the binary +expansion of the length `n`. A disadvantage is that for verifying a Merkle proof, we need to +know for each node whether it's the last or not, so we need to include the length `n` into +any Merkle proof here too. + +**Solution 3.** Combining **3a** and **3b**, we can solve both problems 1) and 2); so here we add +two bits of information to each node (that is, we need 4 different compression functions). +4) can always be solved by adding a final compression call. + +**Solution 4a.** Replace each input element `x_i` with `compress(i,x_i)`. This solves +both problems again (and 4) too), but doubles the amount of computation. + +**Solution 4b.** Only in the bottom layer, use `H(1|isOddNode|i|x_{2i}|x_{2i+1})` for +compression (note that for the odd node we have `x_{2i+1}=dummy`). This is similar to +the previous solution, but does not increase the amount of computation. + +**Solution 4c.** Only in the bottom layer, use `H(i|j|x_i|x_j)` for even nodes +(with `i=2*k` and `j=2*k+1`), and `H(i|0|x_i|0)` for the odd node (or alternatively +we could also use `H(i|i|x_i|x_i)` for the odd node). 
Note: when verifying +a Merkle proof, you still need to know whether the element you prove is the last _and_ +odd element, or not. However instead of submitting the length, you can encode this +into a single bit (not sure if that's much better though). + +**Solution 5 (??).** Use a different tree shape, where the left subtree is always a complete +(full) binary tree with `2^floor(log2(n-1))` leaves, and the right subtree is +constructed recursively. Then the shape of the tree encodes the number of inputs `n`. +This however complicates the Merkle proofs (they won't have uniform size). +TODO: think more about this! + +### Keyed compression functions + +How can we have many different compression functions? Consider three case studies: + +**Poseidon.** The Poseidon family of hashes is built on a (fixed) permutation +`pi : F^t -> F^t`, where `F` is a (large) finite field. For simplicity consider the case `t=3`. +The standard compression function is then defined as: + + compress(x,y) := let (u,_,_) = pi(x,y,0) in u + +That is, we take the triple `(x,y,0)`, apply the permutation to get another triple `(u,v,w)`, and +extract the field element `u` (we could use `v` or `w` too, it shouldn't matter). +Now we can see that it is in fact very easy to generalize this to a _keyed_ (or _indexed_) +compression function: + + compress_k(x,y) := let (u,_,_) = pi(x,y,k) in u + +where `k` is the key. Note that there is no overhead in doing this. And since `F` is pretty +big (in our case, about 253 bits), there is plenty of information we can encode in the key `k`. + +Note: We probably lose a few bits of security here, if somebody looks for a preimage among +_all_ keys; however in our constructions the keys have a fixed structure, so it's probably +not that dangerous. If we want to be extra safe, we could use `t=4` and `pi(x,y,k,0)` +instead (but that has some computation overhead). 
+ +**SHA256.** When using SHA256 as our hash function, normally the compression function is +defined as `compress(x,y) := SHA256(x|y)`, that is, concatenate the (bitstring representation of the) +two elements, and apply SHA256 to the resulting (bit)string. Normally `x` and `y` are both +256 bits long, and so is the result. If we look into the details of how SHA256 is specified, +this is actually wasteful. That's because while SHA256 processes the input in 512 bit chunks, +it also prescribes a mandatory nonempty padding. So when calling SHA256 on an input of size +512 bit (64 bytes), it will actually process two chunks, the second chunk consisting purely +of padding. When constructing a binary Merkle tree using a compression function like before, +the input is always of the same size, so this padding is unnecessary; nevertheless, people +usually prefer to follow the standardized SHA256 call. But, if we are processing 1024 bits +anyway, we have a lot of free space to include our key `k`! In fact we can add up to +`512-64-1=447` bits of additional information; so for example + + compress_k(x,y) := SHA256(k|x|y) + +works perfectly well with no overhead compared to `SHA256(x|y)`. + +**MiMC.** MiMC is another arithmetic construction, however in this +case the starting point is a _block cipher_, that is, we start with +a keyed permutation! Unfortunately MiMC-p/p is a (keyed) permutation +of `F`, which is not very useful for us; however in Feistel mode we +get a keyed permutation of `F^2`, and we can just take the first +component of the output of that as the compressed output. + +### Tree padding proposal + +It seems to me that whatever way we try to solve problem 2) without pre-hashing, we need +to include the length (or at least one bit of information about the length) into the Merkle +proofs. So maybe we should just live with that. + +Then from the above choices, right now maybe solution **4c**, or some variation of it +looks the nicest to me. 
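To make attack 2) and the bottom-layer idea of solution **4c** a bit more concrete, here is a minimal Python sketch using the keyed SHA256 compression `SHA256(k|x|y)`. Everything specific in it is an assumption made for illustration and not part of the proposal: the names `compress`, `naive_root`, `bottom_layer_4c` and `root_4c`, the all-zero `DUMMY` element, and the encoding of the indices `i` and `j` as 8-byte little-endian integers. It only shows that the particular collision between `[x_0,x_1,x_2]` and `[x_0,x_1,x_2,dummy]` disappears; it is not a security argument.

```python
import hashlib

# Assumption: the padding element is 32 zero bytes.
DUMMY = bytes(32)


def compress(x: bytes, y: bytes, key: bytes = b"") -> bytes:
    """Keyed compression SHA256(k|x|y); an empty key gives plain SHA256(x|y)."""
    return hashlib.sha256(key + x + y).digest()


def naive_root(leaves):
    """Naive even-odd construction: odd layers are padded with DUMMY."""
    layer = list(leaves)
    while len(layer) > 1:
        if len(layer) % 2 == 1:
            layer.append(DUMMY)
        layer = [compress(layer[i], layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]


def bottom_layer_4c(leaves):
    """Bottom layer in the spirit of 4c: H(i|j|x_i|x_j), or H(i|0|x_i|0) for the odd node."""
    n = len(leaves)
    out = []
    for i in range(0, n, 2):
        if i + 1 < n:
            key = i.to_bytes(8, "little") + (i + 1).to_bytes(8, "little")
            out.append(compress(leaves[i], leaves[i + 1], key))
        else:
            key = i.to_bytes(8, "little") + (0).to_bytes(8, "little")
            out.append(compress(leaves[i], DUMMY, key))
    return out


def root_4c(leaves):
    """Keyed bottom layer, then the naive construction for the upper layers."""
    return naive_root(bottom_layer_4c(leaves))


leaves = [bytes([i]) * 32 for i in range(1, 4)]
# Attack 2: the naive roots of [x_0,x_1,x_2] and [x_0,x_1,x_2,dummy] coincide.
print(naive_root(leaves) == naive_root(leaves + [DUMMY]))
# With the keyed bottom layer the last nodes get different keys, so the roots differ.
print(root_4c(leaves) == root_4c(leaves + [DUMMY]))
```

The 16-byte key used here is well below the 447 free bits mentioned above, so each bottom-layer call still fits into two SHA256 blocks. The upper layers in this sketch are left unkeyed, which is a simplification.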
+ +### Making `deserialize` injective + +Consider the following simple algorithm to deserialize a sequence of bytes into chunks of +31 bytes: + +- pad the input with at most 30 zero bytes such that the padded length becomes divisible + by 31 +- split the padded sequence into `ceil(n/31)` chunks, each 31 bytes. + +The problem with this is that for example `0x123456`, `0x12345600` and `0x1234560000` +all result in the same output. + +Some possible solutions: + +- prepend the length (number of input bytes) to the input, say as a 64-bit little-endian integer (8 bytes), + before padding as above +- or append the length instead of prepending, then pad +- or first pad with zero bytes, but leave 8 bytes for the length (so that when we finally append + the length, the result will be divisible by 31). +- use the following padding strategy: _always_ add a single 0x01 byte, then enough 0x00 bytes so that the length + is divisible by 31. Why does this work? Well, consider an already padded sequence. Count + the number of zero bytes from the end: you get a number `0 <= m < 31`. This number + determines the residue class of the original length `n` modulo 31; then this class, + together with the padded length fully determines the original length. +
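The last padding strategy can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names `zero_pad_chunks`, `deserialize` and `serialize` are made up here, and chunks are kept as raw 31-byte strings rather than being converted to field elements. The inverse function strips the trailing zero bytes and the 0x01 marker, which is one way to see that the padding is injective.

```python
from typing import List

CHUNK = 31  # chunk size in bytes


def zero_pad_chunks(data: bytes) -> List[bytes]:
    """The problematic version: pad with zero bytes only, then split."""
    padded = data + b"\x00" * (-len(data) % CHUNK)
    return [padded[i:i + CHUNK] for i in range(0, len(padded), CHUNK)]


def deserialize(data: bytes) -> List[bytes]:
    """Always append 0x01, then zero bytes up to a multiple of 31, then split."""
    padded = data + b"\x01"
    padded += b"\x00" * (-len(padded) % CHUNK)
    return [padded[i:i + CHUNK] for i in range(0, len(padded), CHUNK)]


def serialize(chunks: List[bytes]) -> bytes:
    """Inverse of deserialize: drop the trailing zero bytes and the 0x01 marker."""
    stripped = b"".join(chunks).rstrip(b"\x00")
    if not stripped.endswith(b"\x01"):
        raise ValueError("input was not produced by deserialize")
    return stripped[:-1]


a = bytes.fromhex("123456")
b = bytes.fromhex("12345600")
c = bytes.fromhex("1234560000")

# Zero-only padding maps all three inputs to the same chunk list.
print(zero_pad_chunks(a) == zero_pad_chunks(b) == zero_pad_chunks(c))
# With the 0x01 marker the three outputs are pairwise distinct.
print(len({tuple(deserialize(x)) for x in (a, b, c)}))
# The original bytes can be recovered.
print(serialize(deserialize(b)) == b)
```

Note that an input whose length is already a multiple of 31 still gets the marker, so it grows by one full chunk; that is the price of the "_always_ add" rule.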