trie
Nim Implementation of the Ethereum Trie structure
Hexary Trie
Binary Trie
Binary-trie is a dictionary-like data structure for storing key-value pairs. Much like its sibling Hexary-trie, the key-value pairs are stored in a key-value flat-db. The primary difference from Hexary-trie is that each node of a Binary-trie has only one or two children, while a Hexary-trie node can contain up to 16 or 17 child nodes.
Unlike Hexary-trie, Binary-trie stores its data in the flat-db without using RLP encoding. It stores its values using a simple Node-Types encoding. Each encoded node is hashed with keccak_256, and the hash value is the key into the flat-db. Each entry in the flat-db looks like:
key | value |
---|---|
32-bytes-keccak-hash | encoded-node(KV or BRANCH or LEAF encoded) |
Node-Types
- KV = [0, encoded-key-path, 32 bytes hash of child]
- BRANCH = [1, 32 bytes hash of left child, 32 bytes hash of right child]
- LEAF = [2, value]
The KV node can have a BRANCH node or a LEAF node as its child, but not another KV node. The internal algorithm merges a KV(parent)->KV(child) pair into one KV node. Every KV node contains an encoded key-path to reduce the number of blank nodes.
The BRANCH node can have KV, BRANCH, or LEAF nodes as its children.
The LEAF node is the terminal node; it contains the value of a key.
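The three node types above can be sketched as plain byte concatenations. This is only an illustration: the proc names are hypothetical and not part of the library's API, and the layout (a one-byte type tag followed directly by the fields, with no length prefixes) is an assumption based on the list above.

```nim
# Hypothetical helpers illustrating the Node-Types layout; not the library's API.
const
  KvType = 0'u8
  BranchType = 1'u8
  LeafType = 2'u8

proc encodeKvNode(encodedKeyPath, childHash: seq[byte]): seq[byte] =
  ## KV = [0, encoded-key-path, 32 bytes hash of child]
  doAssert childHash.len == 32
  result = @[KvType] & encodedKeyPath & childHash

proc encodeBranchNode(leftHash, rightHash: seq[byte]): seq[byte] =
  ## BRANCH = [1, 32 bytes hash of left child, 32 bytes hash of right child]
  doAssert leftHash.len == 32 and rightHash.len == 32
  result = @[BranchType] & leftHash & rightHash

proc encodeLeafNode(value: seq[byte]): seq[byte] =
  ## LEAF = [2, value]; values cannot have zero length
  doAssert value.len > 0
  result = @[LeafType] & value

# A dummy all-zero hash is enough to show the resulting sizes.
let dummyHash = newSeq[byte](32)
doAssert encodeLeafNode(@[byte 0x76, 0x31]) == @[byte 2, 0x76, 0x31]
doAssert encodeBranchNode(dummyHash, dummyHash).len == 1 + 32 + 32
doAssert encodeKvNode(@[byte 0x35], dummyHash).len == 1 + 1 + 32
```

Under this assumed layout, the first byte of an encoded node is enough to tell which of the three node types it is.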
encoded-key-path
While Hexary-trie encodes the path using Hex-Prefix encoding, Binary-trie encodes the path using binary encoding. The scheme is shown in the table below.
|--------- odd --------|
00mm yyyy xxxx xxxx xxxx xxxx
|------ even -----|
1000 00mm yyyy xxxx xxxx xxxx
symbol | explanation |
---|---|
xxxx | nibble of binary keypath in bits, 0 = left, 1 = right |
yyyy | nibble contains 0-3 bits padding + binary keypath |
mm | number of binary keypath bits modulo 4 (0-3) |
00 | zero zero prefix |
1000 | even numbered nibbles prefix |
If there is no padding, the yyyy bit sequence is absent and mm is zero. Otherwise yyyy holds the padding bits followed by mm key-path bits, which together must be 4 bits long.
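The scheme above can be illustrated with a short sketch. `encodeKeyPath` is a hypothetical name, not the library's API, and the sketch assumes that padding bits are zeros placed in front of the key-path bits, as the yyyy row of the table suggests.

```nim
# Hypothetical sketch of the binary key-path encoding described above.
proc encodeKeyPath(bits: seq[int]): seq[byte] =
  ## `bits` is the key-path as a sequence of 0 (left) / 1 (right).
  let mm = bits.len mod 4            # number of key-path bits modulo 4
  let padLen = (4 - mm) mod 4        # 0-3 zero bits of padding
  let nibbles = (bits.len + padLen) div 4
  var stream: seq[int] = @[]
  if nibbles mod 2 == 1:
    stream.add [0, 0]                # odd number of nibbles: "00" prefix
  else:
    stream.add [1, 0, 0, 0, 0, 0]    # even number of nibbles: "1000" + "00"
  stream.add [mm shr 1, mm and 1]    # the two mm bits
  for _ in 0 ..< padLen:
    stream.add 0                     # padding inside the yyyy nibble
  stream.add bits
  # The stream length is now a multiple of 8; pack it MSB first.
  result = newSeq[byte](stream.len div 8)
  for i, b in stream:
    if b == 1:
      result[i div 8] = result[i div 8] or byte(0x80 shr (i mod 8))

# 3 bits -> one nibble (odd): 00 11 0101 = 0x35
doAssert encodeKeyPath(@[1, 0, 1]) == @[byte 0x35]
# 8 bits -> two nibbles (even): 1000 0000 1111 0000 = 0x80 0xF0
doAssert encodeKeyPath(@[1, 1, 1, 1, 0, 0, 0, 0]) == @[byte 0x80, 0xF0]
```

In both cases the prefix, mm bits, and nibbles add up to a whole number of bytes, which is what the two prefix variants are for.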
The API
The primary API for Binary-trie is `set` and `get`.
- set(key, value) --- store a value associated with a key
- get(key): value --- get a value using a key
Both `key` and `value` are of type `seq[byte]`, and they cannot have zero length.
Getting a non-existent key returns a zero-length `seq[byte]`.
Binary-trie also provides a dictionary-syntax API for `set` and `get`.
- trie[key] = value -- same as `set`
- value = trie[key] -- same as `get`
- contains(key) a.k.a. the `in` operator
Additional APIs are:
- exists(key) -- returns `bool`, to check key-value existence -- same as contains
- delete(key) -- remove a key-value from the trie
- deleteSubtrie(key) -- remove a key-value from the trie plus all of its subtrie that starts with the same key prefix
- rootNode() -- get root node
- rootNode(node) -- replace the root node
- getRootHash(): `Hash32` with `seq[byte]` type
- getDB(): `DB` -- get flat-db pointer
Constructor API:
- initBinaryTrie(DB, rootHash[optional]) -- rootHash has `seq[byte]` or `Hash32` type
- init(BinaryTrie, DB, rootHash[optional])
Normally you would not set the rootHash when constructing an empty Binary-trie. Setting the rootHash is meant for the scenario where you have a populated DB with an existing trie structure and a known rootHash, and you want to continue/resume the trie operations.
Examples
import eth/trie/[db, binary, utils]
var db = newMemoryDB()
var trie = initBinaryTrie(db)
trie.set("key1", "value1")
trie.set("key2", "value2")
doAssert trie.get("key1") == "value1".toBytes
doAssert trie.get("key2") == "value2".toBytes
# delete all subtrie with key prefixes "key"
trie.deleteSubtrie("key")
doAssert trie.get("key1") == []
doAssert trie.get("key2") == []
trie["moon"] = "sun"
doAssert "moon" in trie
doAssert trie["moon"] == "sun".toBytes
Remember, `set` and `get` are trie operations. A single `set` operation may invoke more than one store/lookup operation on the underlying DB. The same applies to `get`: it may perform more than one flat-db lookup before it returns the requested value.
The truth behind a lie
What kind of lie? Actually, `delete` and `deleteSubtrie` don't remove the 'deleted' node from the underlying DB. They only make the node inaccessible to the user of the trie. The same happens when you update the value of a key: the old value node is not removed from the underlying DB.
A more subtle lie happens when you add new entries to the trie using the `set` operation. The previous hash of the affected branch becomes obsolete and is replaced by a new hash; the old hash becomes inaccessible to the user.
You may think that is a waste of storage space.
Luckily, we also provide some utilities to deal with this situation: the branch utils.
The branch utils
The branch utils consist of these APIs:
- checkIfBranchExist(DB; rootHash; keyPrefix): bool
- getBranch(DB; rootHash; key): branch
- isValidBranch(branch, rootHash, key, value): bool
- getWitness(DB; nodeHash; key): branch
- getTrieNodes(DB; nodeHash): branch
`keyPrefix`, `key`, and `value` are byte containers with length greater than zero. They can be `openArray[byte]`.
`rootHash` and `nodeHash` are also byte containers, but with constraints: they must be 32 bytes long and must be keccak_256 hash values.
`branch` is a list of nodes, or in this case a `seq[seq[byte]]`. A list? Yes, the structure is stored along with the encoded nodes. Therefore a list is enough to reconstruct the entire trie/branch.
import eth/trie/[db, binary, utils]
var db = newMemoryDB()
var trie = initBinaryTrie(db)
trie.set("key1", "value1")
trie.set("key2", "value2")
doAssert checkIfBranchExist(db, trie.getRootHash(), "key") == true
doAssert checkIfBranchExist(db, trie.getRootHash(), "key1") == true
doAssert checkIfBranchExist(db, trie.getRootHash(), "ken") == false
doAssert checkIfBranchExist(db, trie.getRootHash(), "key123") == false
The tree will look like:
    root --->  A(kvnode, *common key prefix*)
                         |
                         |
                         |
                    B(branchnode)
                     /         \
                    /           \
                   /             \
    C1(kvnode, *remain keypath*) C2(kvnode, *remain keypath*)
                |                           |
                |                           |
                |                           |
      D1(leafnode, b'value1')       D2(leafnode, b'value2')
var branchA = getBranch(db, trie.getRootHash(), "key1")
# ==> [A, B, C1, D1]
var branchB = getBranch(db, trie.getRootHash(), "key2")
# ==> [A, B, C2, D2]
doAssert isValidBranch(branchA, trie.getRootHash(), "key1", "value1") == true
# wrong key, return zero bytes
doAssert isValidBranch(branchA, trie.getRootHash(), "key5", "") == true
doAssert isValidBranch(branchB, trie.getRootHash(), "key1", "value1") # raises InvalidNode
var x = getBranch(db, trie.getRootHash(), "key")
# ==> [A]
x = getBranch(db, trie.getRootHash(), "key123") # raises InvalidKeyError
x = getBranch(db, trie.getRootHash(), "key5") # a branch is still returned for a non-existent key
# ==> [A]
var branch = getWitness(db, trie.getRootHash(), "key1")
# equivalent to `getBranch(db, trie.getRootHash(), "key1")`
# ==> [A, B, C1, D1]
branch = getWitness(db, trie.getRootHash(), "key")
# this will include additional nodes of "key2"
# ==> [A, B, C1, D1, C2, D2]
var wholeTrie = getWitness(db, trie.getRootHash(), "")
# this will return the whole trie
# ==> [A, B, C1, D1, C2, D2]
var node = branch[1] # B
let nodeHash = keccak256.digest(node.baseAddr, uint(node.len))
var nodes = getTrieNodes(db, nodeHash)
doAssert nodes.len == wholeTrie.len - 1
# ==> [B, C1, D1, C2, D2]
Remember the lie?
Because the trie `delete`, `deleteSubtrie`, and `set` operations create inaccessible nodes in the underlying DB, we need to remove them when necessary. We have already seen that wholeTrie = getWitness(db, trie.getRootHash(), "") returns the whole trie, a list of accessible nodes.
We can then write the clean trie into a new DB instance to replace the old one.
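The clean-up step can be sketched with a toy content-addressed store. Everything here is an assumption made for illustration: `FlatDb`, `nodeKey`, `put`, and `compact` are hypothetical names, the store is a plain `Table`, and `std/hashes` stands in for keccak_256 (it is not cryptographic). The `reachable` list plays the role of the node list that `getWitness` returns for the whole trie.

```nim
import std/[tables, hashes]

type FlatDb = Table[string, string]  # node key -> encoded node

proc nodeKey(node: string): string =
  ## Stand-in for keccak_256 (assumption): NOT a cryptographic hash.
  $hash(node)

proc put(db: var FlatDb, node: string) =
  ## Store a node under the hash of its own encoding.
  db[nodeKey(node)] = node

proc compact(reachable: openArray[string]): FlatDb =
  ## Build a fresh DB that contains only the reachable nodes.
  result = initTable[string, string]()
  for node in reachable:
    result.put(node)

# The old DB still holds a stale leaf left behind by an update.
var oldDb = initTable[string, string]()
for node in ["A", "B", "C1", "D1-old", "D1"]:
  oldDb.put(node)

# Nodes still reachable from the current root.
let reachable = @["A", "B", "C1", "D1"]
let newDb = compact(reachable)

doAssert oldDb.len == 5
doAssert newDb.len == 4
doAssert nodeKey("D1-old") notin newDb
```

Because every node is stored under the hash of its own encoding, copying the reachable nodes is enough; no separate index has to be rebuilt.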
Sparse Merkle Trie
Sparse Merkle Trie (SMT) is a variant of Binary Trie which uses binary encoding to represent the path during trie traversal. Whereas Binary Trie uses three types of node, SMT uses only one type of node, without any additional special encoding to store its key-path.
Actually, it doesn't even store its key-path anywhere like Binary Trie does; the key-path is stored implicitly in the trie structure during key-value insertion.
Because the key-path is not encoded in any special way, the bits can be extracted directly from the key without any conversion.
However, the key is restricted to a fixed length because the algorithm demands a fixed-height trie to work properly. In this case, the trie height is limited to 160 levels, i.e. the key has a fixed length of 20 bytes (8 bits x 20 = 160).
To be able to use variable-length keys, the algorithm can be adapted slightly by hashing the key before constructing the binary key-path. For example, if keccak256 is used as the hashing function, the height of the tree will be 256, but the key itself can be any length.
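Extracting the path directly from the key can be sketched as follows. `keyToPath` is a hypothetical helper, not part of the library, and the most-significant-bit-first order with 0 = left and 1 = right is an assumption carried over from the Binary Trie description.

```nim
# Hypothetical helper: turn a key into its implicit SMT path.
proc keyToPath(key: openArray[byte]): seq[int] =
  ## One path step per key bit, most significant bit first (assumed order).
  result = newSeqOfCap[int](key.len * 8)
  for b in key:
    for shift in countdown(7, 0):
      result.add int((b shr shift) and 1)

# A 20-byte key yields a path of exactly 160 steps.
let key = newSeq[byte](20)
doAssert keyToPath(key).len == 160

# 0xA0 = 1010 0000
doAssert keyToPath([byte 0xA0]) == @[1, 0, 1, 0, 0, 0, 0, 0]
```

With a hashed key, the same helper would be applied to the 32-byte digest instead, giving a 256-step path.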
The API
The primary API for Sparse Merkle Trie is `set` and `get`.
- set(key, value, rootHash[optional]) --- store a value associated with a key
- get(key, rootHash[optional]): value --- get a value using a key
Both `key` and `value` are of type `BytesRange`, and they cannot have zero length.
You can also use the convenience API `get` and `set`, which accept `Bytes` or `string` (a `string` is conceptually wrong in this context and may be costlier than a `BytesRange`, but it is good for testing purposes).
rootHash is an optional parameter. When used, `get` will get a key from a specific root, and `set` will also set a key at a specific root.
Getting a non-existent key returns a zero-length BytesRange, or a zeroBytesRange.
Sparse Merkle Trie also provides a dictionary-syntax API for `set` and `get`.
- trie[key] = value -- same as `set`
- value = trie[key] -- same as `get`
- contains(key) a.k.a. the `in` operator
Additional APIs are:
- exists(key) -- returns `bool`, to check key-value existence -- same as contains
- delete(key) -- remove a key-value from the trie
- getRootHash(): `KeccakHash` with `BytesRange` type
- getDB(): `DB` -- get flat-db pointer
- prove(key, rootHash[optional]): proof -- useful for merkling
Constructor API:
- initSparseBinaryTrie(DB, rootHash[optional])
- init(SparseBinaryTrie, DB, rootHash[optional])
Normally you would not set the rootHash when constructing an empty Sparse Merkle Trie. Setting the rootHash is meant for the scenario where you have a populated DB with an existing trie structure and a known rootHash, and you want to continue/resume the trie operations.
Examples
import eth/trie/[db, sparse_binary, utils]
var
  db = newMemoryDB()
  trie = initSparseBinaryTrie(db)
let
  key1 = "01234567890123456789"
  key2 = "abcdefghijklmnopqrst"
trie.set(key1, "value1")
trie.set(key2, "value2")
doAssert trie.get(key1) == "value1".toBytes
doAssert trie.get(key2) == "value2".toBytes
trie.delete(key1)
doAssert trie.get(key1) == []
trie.delete(key2)
doAssert trie[key2] == []
Remember, `set` and `get` are trie operations. A single `set` operation may invoke more than one store/lookup operation on the underlying DB. The same applies to `get`: it may perform more than one flat-db lookup before it returns the requested value.
While Binary Trie performs a variable number of lookup and store operations, Sparse Merkle Trie performs a constant number of lookup and store operations for each `get` and `set` operation.
Merkle Proofing
Using the `prove` and `verifyProof` APIs, we can do some merkling with SMT.
let
  value1 = "hello world"
  badValue = "bad value"
trie[key1] = value1
var proof = trie.prove(key1)
doAssert verifyProof(proof, trie.getRootHash(), key1, value1) == true
doAssert verifyProof(proof, trie.getRootHash(), key1, badValue) == false
doAssert verifyProof(proof, trie.getRootHash(), key2, value1) == false
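To give an idea of what a verifier generally does with such a proof, here is a minimal self-contained sketch of sparse Merkle proof checking. It is not the library's `verifyProof`: the proof layout (one sibling hash per level, in root-to-leaf order), the leaf hash being the hash of the value, and the helper names `h`, `bitAt`, `calcRoot`, and `verify` are all assumptions, and `std/hashes` stands in for keccak_256 (it is not cryptographic).

```nim
import std/hashes

proc h(data: string): string =
  ## Stand-in for keccak_256 (assumption): NOT a cryptographic hash.
  $hash(data)

proc bitAt(key: string, i: int): int =
  ## i-th bit of the key, most significant bit first.
  (ord(key[i div 8]) shr (7 - (i mod 8))) and 1

proc calcRoot(key, value: string, proof: seq[string]): string =
  ## Recompute the root from the leaf upwards.
  ## proof[d] is the sibling hash at depth d + 1 (root-to-leaf order).
  result = h(value)
  for depth in countdown(proof.high, 0):
    if bitAt(key, depth) == 1:
      result = h(proof[depth] & result)   # we are the right child
    else:
      result = h(result & proof[depth])   # we are the left child

proc verify(proof: seq[string], root, key, value: string): bool =
  calcRoot(key, value, proof) == root

# Toy trie of height 8 (one-byte keys) with made-up sibling hashes.
let
  siblings = @["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
  goodKey = "\xA0"
  root = calcRoot(goodKey, "hello world", siblings)

doAssert verify(siblings, root, goodKey, "hello world")
doAssert not verify(siblings, root, goodKey, "bad value")
doAssert not verify(siblings, root, "\x20", "hello world")
```

The key decides at every level whether the running hash goes on the left or the right, which is why a proof for one key does not validate against another.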