nim-eth/doc/trie.md
Jacek Sieka fd6caa0fdc
Rlp experimental (#227)
* rlp: remove experimental features

* avoid range library

* trie: avoid reference-unsafe bitrange type
2020-04-20 20:14:39 +02:00

12 KiB

trie

Nim Implementation of the Ethereum Trie structure

Hexary Trie

Binary Trie

Binary-trie is a dictionary-like data structure to store key-value pair. Much like it's sibling Hexary-trie, the key-value pair will be stored into key-value flat-db. The primary difference with Hexary-trie is, each node of Binary-trie only consist of one or two child, while Hexary-trie node can contains up to 16 or 17 child-nodes.

Unlike Hexary-trie, Binary-trie store it's data into flat-db without using rlp encoding. Binary-trie store its value using simple Node-Types encoding. The encoded-node will be hashed by keccak_256 and the hash value will be the key to flat-db. Each entry in the flat-db will looks like:

key value
32-bytes-keccak-hash encoded-node(KV or BRANCH or LEAF encoded)

Node-Types

  • KV = [0, encoded-key-path, 32 bytes hash of child]
  • BRANCH = [1, 32 bytes hash of left child, 32 bytes hash of right child]
  • LEAF = [2, value]

The KV node can have BRANCH node or LEAF node as it's child, but cannot a KV node. The internal algorithm will merge a KV(parent)->KV(child) into one KV node. Every KV node contains encoded keypath to reduce the number of blank nodes.

The BRANCH node can have KV, BRANCH, or LEAF node as it's children.

The LEAF node is the terminal node, it contains the value of a key.

encoded-key-path

While Hexary-trie encode the path using Hex-Prefix encoding, Binary-trie encode the path using binary encoding, the scheme looks like this table below.

            |--------- odd --------|
       00mm yyyy xxxx xxxx xxxx xxxx
            |------ even -----|
  1000 00mm yyyy xxxx xxxx xxxx
symbol explanation
xxxx nibble of binary keypath in bits, 0 = left, 1 = right
yyyy nibble contains 0-3 bits padding + binary keypath
mm number of binary keypath bits modulo 4 (0-3)
00 zero zero prefix
1000 even numbered nibbles prefix

if there is no padding, then yyyy bit sequence is absent, mm also zero. yyyy = mm bits + padding bits must be 4 bits length.

The API

The primary API for Binary-trie is set and get.

  • set(key, value) --- store a value associated with a key
  • get(key): value --- get a value using a key

Both key and value are of seq[byte] type. And they cannot have zero length.

Getting a non-existent key will return zero length seq[byte].

Binary-trie also provide dictionary syntax API for set and get.

  • trie[key] = value -- same as set
  • value = trie[key] -- same as get
  • contains(key) a.k.a. in operator

Additional APIs are:

  • exists(key) -- returns bool, to check key-value existence -- same as contains
  • delete(key) -- remove a key-value from the trie
  • deleteSubtrie(key) -- remove a key-value from the trie plus all of it's subtrie that starts with the same key prefix
  • rootNode() -- get root node
  • rootNode(node) -- replace the root node
  • getRootHash(): KeccakHash with seq[byte] type
  • getDB(): DB -- get flat-db pointer

Constructor API:

  • initBinaryTrie(DB, rootHash[optional]) -- rootHash has seq[byte] or KeccakHash type
  • init(BinaryTrie, DB, rootHash[optional])

Normally you would not set the rootHash when constructing an empty Binary-trie. Setting the rootHash occured in a scenario where you have a populated DB with existing trie structure and you know the rootHash, and then you want to continue/resume the trie operations.

Examples

import
  eth/trie/[db, binary, utils]

var db = newMemoryDB()
var trie = initBinaryTrie(db)
trie.set("key1", "value1")
trie.set("key2", "value2")
doAssert trie.get("key1") == "value1".toBytes
doAssert trie.get("key2") == "value2".toBytes

# delete all subtrie with key prefixes "key"
trie.deleteSubtrie("key")
doAssert trie.get("key1") == []
doAssert trie.get("key2") == []]

trie["moon"] = "sun"
doAssert "moon" in trie
doAssert trie["moon"] == "sun".toBytes

Remember, set and get are trie operations. A single set operation may invoke more than one store/lookup operation into the underlying DB. The same is also happened to get operation, it could do more than one flat-db lookup before it return the requested value.

The truth behind a lie

What kind of lie? actually, delete and deleteSubtrie doesn't remove the 'deleted' node from the underlying DB. It only make the node inaccessible from the user of the trie. The same also happened if you update the value of a key, the old value node is not removed from the underlying DB. A more subtle lie also happened when you add new entrie into the trie using set operation. The previous hash of affected branch become obsolete and replaced by new hash, the old hash become inaccessible to the user. You may think that is a waste of storage space. Luckily, we also provide some utilities to deal with this situation, the branch utils.

The branch utils

The branch utils consist of these API:

  • checkIfBranchExist(DB; rootHash; keyPrefix): bool
  • getBranch(DB; rootHash; key): branch
  • isValidBranch(branch, rootHash, key, value): bool
  • getWitness(DB; nodeHash; key): branch
  • getTrieNodes(DB; nodeHash): branch

keyPrefix, key, and value are bytes container with length greater than zero. They can be openArray[byte].

rootHash and nodeHash also bytes container, but they have constraint: must be 32 bytes in length, and it must be a keccak_256 hash value.

branch is a list of nodes, or in this case a seq[seq[byte]]. A list? yes, the structure is stored along with the encoded node. Therefore a list is enough to reconstruct the entire trie/branch.

import
  eth/trie/[db, binary, utils]

var db = newMemoryDB()
var trie = initBinaryTrie(db)
trie.set("key1", "value1")
trie.set("key2", "value2")

doAssert checkIfBranchExist(db, trie.getRootHash(), "key") == true
doAssert checkIfBranchExist(db, trie.getRootHash(), "key1") == true
doAssert checkIfBranchExist(db, trie.getRootHash(), "ken") == false
doAssert checkIfBranchExist(db, trie.getRootHash(), "key123") == false

The tree will looks like:

    root --->  A(kvnode, *common key prefix*)
                         |
                         |
                         |
                    B(branchnode)
                     /         \
                    /           \
                   /             \
C1(kvnode, *remain kepath*) C2(kvnode, *remain kepath*)
            |                           |
            |                           |
            |                           |
  D1(leafnode, b'value1')       D2(leafnode, b'value2')
var branchA = getBranch(db, trie.getRootHash(), "key1")
# ==> [A, B, C1, D1]

var branchB = getBranch(db, trie.getRootHash(), "key2")
# ==> [A, B, C2, D2]

doAssert isValidBranch(branchA, trie.getRootHash(), "key1", "value1") == true
# wrong key, return zero bytes
doAssert isValidBranch(branchA, trie.getRootHash(), "key5", "") == true

doAssert isValidBranch(branchB, trie.getRootHash(), "key1", "value1") # InvalidNode

var x = getBranch(db, trie.getRootHash(), "key")
# ==> [A]

x = getBranch(db, trie.getRootHash(), "key123") # InvalidKeyError
x = getBranch(db, trie.getRootHash(), "key5") # there is still branch for non-exist key
# ==> [A]

var branch = getWitness(db, trie.getRootHash(), "key1")
# equivalent to `getBranch(db, trie.getRootHash(), "key1")`
# ==> [A, B, C1, D1]

branch = getWitness(db, trie.getRootHash(), "key")
# this will include additional nodes of "key2"
# ==> [A, B, C1, D1, C2, D2]

var wholeTrie = getWitness(db, trie.getRootHash(), "")
# this will return the whole trie
# ==> [A, B, C1, D1, C2, D2]

var node = branch[1] # B
let nodeHash = keccak256.digest(node.baseAddr, uint(node.len))
var nodes = getTrieNodes(db, nodeHash)
doAssert nodes.len == wholeTrie.len - 1
# ==> [B, C1, D1, C2, D2]

Remember the lie?

Because trie delete, deleteSubtrie and set operation create inaccessible nodes in the underlying DB, we need to remove them if necessary. We already see that wholeTrie = getWitness(db, trie.getRootHash(), "") will return the whole trie, a list of accessible nodes. Then we can write the clean tree into a new DB instance to replace the old one.

Sparse Merkle Trie

Sparse Merkle Trie(SMT) is a variant of Binary Trie which uses binary encoding to represent path during trie travelsal. When Binary Trie uses three types of node, SMT only use one type of node without any additional special encoding to store it's key-path.

Actually, it doesn't even store it's key-path anywhere like Binary Trie, the key-path is stored implicitly in the trie structure during key-value insertion.

Because the key-path is not encoded in any special ways, the bits can be extracted directly from the key without any conversion.

However, the key restricted to a fixed length because the algorithm demand a fixed height trie to works properly. In this case, the trie height is limited to 160 level, or the key is of fixed length 20 bytes (8 bits x 20 = 160).

To be able to use variable length key, the algorithm can be adapted slightly using hashed key before constructing the binary key-path. For example, if using keccak256 as the hashing function, then the height of the tree will be 256, but the key itself can be any length.

The API

The primary API for Binary-trie is set and get.

  • set(key, value, rootHash[optional]) --- store a value associated with a key
  • get(key, rootHash[optional]): value --- get a value using a key

Both key and value are of BytesRange type. And they cannot have zero length. You can also use convenience API get and set which accepts Bytes or string (a string is conceptually wrong in this context and may costlier than a BytesRange, but it is good for testing purpose).

rootHash is an optional parameter. When used, get will get a key from specific root, and set will also set a key at specific root.

Getting a non-existent key will return zero length BytesRange or a zeroBytesRange.

Sparse Merkle Trie also provide dictionary syntax API for set and get.

  • trie[key] = value -- same as set
  • value = trie[key] -- same as get
  • contains(key) a.k.a. in operator

Additional APIs are:

  • exists(key) -- returns bool, to check key-value existence -- same as contains
  • delete(key) -- remove a key-value from the trie
  • getRootHash(): KeccakHash with BytesRange type
  • getDB(): DB -- get flat-db pointer
  • prove(key, rootHash[optional]): proof -- useful for merkling

Constructor API:

  • initSparseBinaryTrie(DB, rootHash[optional])
  • init(SparseBinaryTrie, DB, rootHash[optional])

Normally you would not set the rootHash when constructing an empty Sparse Merkle Trie. Setting the rootHash occured in a scenario where you have a populated DB with existing trie structure and you know the rootHash, and then you want to continue/resume the trie operations.

Examples

import
  eth/trie/[db, sparse_binary, utils]

var
  db = newMemoryDB()
  trie = initSparseMerkleTrie(db)

let
  key1 = "01234567890123456789"
  key2 = "abcdefghijklmnopqrst"

trie.set(key1, "value1")
trie.set(key2, "value2")
doAssert trie.get(key1) == "value1".toBytes
doAssert trie.get(key2) == "value2".toBytes

trie.delete(key1)
doAssert trie.get(key1) == []

trie.delete(key2)
doAssert trie[key2] == []

Remember, set and get are trie operations. A single set operation may invoke more than one store/lookup operation into the underlying DB. The same is also happened to get operation, it could do more than one flat-db lookup before it return the requested value. While Binary Trie perform a variable numbers of lookup and store operations, Sparse Merkle Trie will do constant numbers of lookup and store operations each get and set operation.

Merkle Proofing

Using prove dan verifyProof API, we can do some merkling with SMT.

  let
    value1 = "hello world"
    badValue = "bad value"

  trie[key1] = value1
  var proof = trie.prove(key1)

  doAssert verifyProof(proof, trie.getRootHash(), key1, value1) == true
  doAssert verifyProof(proof, trie.getRootHash(), key1, badValue) == false
  doAssert verifyProof(proof, trie.getRootHash(), key2, value1) == false