diff --git a/design/Merkle.md b/design/Merkle.md
index 89e91f6..45576a1 100644
--- a/design/Merkle.md
+++ b/design/Merkle.md
@@ -4,6 +4,9 @@ Merkle tree API proposal (WIP draft)
 
 Let's collect the possible problems and solutions with constructing Merkle trees.
 
+See [section "Final proposal"](#Final-proposal) at the bottom for the concrete 
+version we decided to implement.
+
 ### Vocabulary
 
 A Merkle tree, built on a hash function `H`, produces a Merkle root of type `T`. 
@@ -73,9 +76,11 @@ Traditional (linear) hash functions usually solve the analogous problems by clev
 ### Domain separation
 
 It's a good practice in general to ensure that different constructions using the same 
-underlying hash functions will never produce the same output. This is called "domain separation",
-and it's a bit similar to _multihash_; however instead of adding extra bits of information to a hash
-(and thus increasing its size), we just compress the extra information into the hash itself.
+underlying hash function will never produce the same output. This is called "domain separation",
+and it can very loosely remind one to _multihash_; however instead of adding extra bits of information 
+to a hash (and thus increasing its size), we just compress the extra information into the hash itself.
+So the information itself is lost, however collisions between different domains are prevented.
+
 A simple example would be using `H(dom|H(...))` instead of `H(...)`. The below solutions
 can be interpreted as an application of this idea, where we want to separate the different
 lengths `n`.
@@ -135,10 +140,11 @@ a Merkle proof, you still need to know whether the element you prove is the last
 odd element, or not. However instead of submitting the length, you can encode this
 into a single bit (not sure if that's much better though).
 
-**Solution 5 (??).** Use a different tree shape, where the left subtree is always a complete
+**Solution 5.** Use a different tree shape, where the left subtree is always a complete
 (full) binary tree with `2^floor(log2(n-1))` leaves, and the right subtree is
 constructed recursively. Then the shape of tree encodes the number of inputs `n`.
-This however complicates the Merkle proofs (they won't have uniform size).
+Blake3 hash uses such a strategy internally. This however complicates the Merkle proofs 
+(they won't have uniform size anymore).
 TODO: think more about this!
 
 ### Keyed compression functions
@@ -146,17 +152,17 @@ TODO: think more about this!
 How can we have many different compression functions? Consider three case studies:
 
 **Poseidon.** The Poseidon family of hashes is built on a (fixed) permutation 
-`pi : F^t -> F^t`, where `F` is a (large) finite field. For simplicity consider the case `t=3`. 
+`perm : F^t -> F^t`, where `F` is a (large) finite field. For simplicity consider the case `t=3`. 
 The standard compression function is then defined as:
 
-    compress(x,y) := let (u,_,_) = pi(x,y,0) in u
+    compress(x,y) := let (u,_,_) = perm(x,y,0) in u
 
 That, we take the triple `(x,y,0)`, apply the permutation to get another triple `(u,v,w)`, and
 extract the field element `u` (we could use `v` or `w` too, it shouldn't matter).
 Now we can see that it is in fact very easy to generalize this to a _keyed_ (or _indexed_)
 compression function:
 
-    compress_k(x,y) := let (u,_,_) = pi(x,y,k) in u
+    compress_k(x,y) := let (u,_,_) = perm(x,y,k) in u
 
 where `k` is the key. Note that there is no overhead in doing this. And since `F` is pretty
 big (in our case, about 253 bits), there is plenty of information we can encode in the key `k`.
@@ -186,19 +192,10 @@ works perfectly well with no overhead compared to `SHA256(x|y)`.
 **MiMC.** MiMC is another arithmetic construction, however in this
 case the starting point is a _block cipher_, that is, we start with
 a keyed permutation! Unfortunately MiMC-p/p is a (keyed) permutation 
-of `F`, which is not very useful for usl; however in Feistel mode we
+of `F`, which is not very useful for us; however in Feistel mode we
 get a keyed permutation of `F^2`, and we can just take the first
 component of the output of that as the compressed output.
 
-### Tree padding proposal
-
-It seems to me, that whatever way we try to solve problem 2) without pre-hashing, we need
-to include the length (or at least one bit information about the length) into the Merkle
-proofs. So maybe we should just live with that.
-
-Then from the above choices, right now maybe solution **4c**, or some variation of it
-looks the nicest to me.
-
 ### Making `deserialize` injective
 
 Consider the following simple algorithm to deserialize a sequence of bytes into chunks of
@@ -211,16 +208,82 @@ Consider the following simple algorithm to deserialize a sequence of bytes into
 The problem with this, is that for example `0x123456`, `0x12345600` and `0x1234560000` 
 all results in the same output.
 
-Some possible solutions:
+#### About padding in general
+
+Let's take a step back, and meditate a little bit of what's the meaning of padding.
+
+What is padding? It's a mapping from a set of sequences into a subset. In our case 
+we have an arbitrary sequence of bytes, and we want to map into the subset of sequences 
+whose length is divisible by 31.
+
+Why do we want padding? Because we want to apply an algorithm (in this case a hash function) 
+to arbitrary sequences, but the algorithm can only handle a subset of all sequences.
+In our case we first map the arbitrary sequence of bytes into a sequence of bytes
+whose length is divisible by 31, and then map that into a sequence of finite field
+elements.
+
+What properties do we want from padding? Well, that depends on what what properties we 
+want from the resulting algorithm. In this case we do hashing, so we definitely want 
+to avoid collisions. This means that our padding should never map two different input 
+sequences into the same padded sequence (because that would create a trivial collision). 
+In mathematics, we call such functions "injective".
+
+How do you prove that a function is injective? You provide an inverse function, 
+which takes a padded sequences and outputs the original one. 
+
+In summary we need to come up with an injective padding strategy for arbitrary byte 
+sequences, which always results in a byte sequence whose length is divisible by 31.
+
+#### Some possible solutions:
 
 - prepend the length (number of input bytes) to the input, say as a 64-bit little-endian integer (8 bytes),
   before padding as above
-- or append the length instead of prepending, then pad
+- or append the length instead of prepending, then pad (note: appending is streaming-friendly; prepending is not)
 - or first pad with zero bytes, but leave 8 bytes for the length (so that when we finally append
-  the length, the result will be divisible 31).
-- use the following padding strategy: _always_ add a single 0x01 byte, then enough 0x00 bytes so that the length
-  is divisible by 31. Why does this work? Well, consider an already padded sequence. Count
-  the number of zero bytes from the end: you get a number `0 <= m < 31`. This number 
-  determines the residue class of the original length `n` modulo 31; then this class,
-  together with the padded length fully determines the original length. 
+  the length, the result will be divisible 31). This is _almost_ exactly what SHA2 does.
+- use the following padding strategy: _always_ add a single `0x01` byte, then enough `0x00` bytes so that the length
+  is divisible by 31. This is usually called the `10*` padding strategy, abusing regexp notation.
+  Why does this work? Well, consider an already padded sequence. It's very easy to recover the
+  original byte sequence by 1) first removing all trailing zeros; and 2) after that, remove the single
+  trailing `0x01` byte. This proves that the padding is an injective function.
+- one can easily come up with many similar padding strategies. For example SHA3/Keccak uses `10*1` 
+  (but on bits, not bytes), and SHA2 uses a combination of `10*` and appending the bit length of the
+  original input.
 
+Remark: Any safe padding strategy will result in at least one extra field element
+if the input length was already divisible by 31. This is both unavoidable in general,
+and not an issue in practice (as the size of the input grows, the overhead becomes 
+negligible). The same thing happens when you SHA256 hash an integer multiple of 64 bytes.
+
+
+### Final proposal
+
+We decided to implement the following version.
+
+- pad byte sequences (to have length divisible by 31) with the `10*` padding strategy; that is, 
+  always append a single `0x01` byte, and after that add a number of zero bytes (between 0 and 30),
+  so that the resulting sequence have length divisible by 31
+- when converting an (already padded) byte sequence to a sequence of field elements, 
+  split it up into 31 byte chunks, interpret those as little-endian 248-bit unsigned
+  integers, and finally interpret those integers as field elements in the BN254 prime 
+  field (using the standard mapping `Z -> Z/p`).
+- when using the Poseidon2 sponge construction to compute a linear hash out of 
+  a sequence of field elements, we use the BN254 field, `t=3` and `(0,0,domsep)` 
+  as the initial state, where `domsep := 2^64 + 256*t + rate` is the domain separation
+  IV. Note that because `t=3`, we can only have `rate=1` or `rate=2`. We need
+  a padding strategy here too (since the input length must be divisible by `rate`):
+  we use `10*` again, but here on field elements. 
+  Remark: For `rate=1` this makes things always a tiny bit slower, but we plan to use
+  `rate=2` anyway (as it's twice as fast), and it's better not to have exceptional cases.
+- when using Poseidon2 to build a binary Merkle tree, we use "solution #3" from above.
+  That is, we use a keyed compression function, with the key being one of `{0,1,2,3}`
+  (two bits). The lowest bit is 1 in the bottom-most (that is, the widest) layer,
+  and 0 otherwise; the other bit is 1 if it's both the last element of the layer,
+  _and_ it is an odd layer; 0 otherwise. In odd layers, we also add an extra 0 field
+  element to make it even. This is also valid for the singleton input: in that case
+  it's both odd and the bottommost, so the root of a singleton input `[x]` will
+  be `H_{key=3}(x|0)`
+- we will use the same strategy when constructing binary Merkle trees with the
+  SHA256 hash; in that case, the compression function will be `SHA256(x|y|key)`.
+  Note: since SHA256 already uses padding internally, adding the key does not
+  result in any overhoad.