Merge pull request #1383 from topos-protocol/mpt_specs

Add specs for MPTs
Alonso González 2023-11-27 16:23:19 +01:00 committed by GitHub
commit acd3b1ad54
4 changed files with 82 additions and 15 deletions


@ -18,3 +18,13 @@
year = {2019},
note = {\url{https://ia.cr/2019/953}},
}
@article{yellowpaper,
title={Ethereum: A secure decentralised generalised transaction ledger},
author={Wood, Gavin and others},
journal={Ethereum project yellow paper},
volume={151},
number={2014},
pages={1--32},
year={2014}
}


@ -53,17 +53,6 @@ Finally, once the three MPTs have been updated, we need to carry out final check
\end{itemize}
Once those final checks are performed, the program halts.
\paragraph{MPT hashing:}
MPTs are a complex structure in the kernel, and we will not delve into all of its aspects. Here, we only explain how the hashing works, since it is part of the initialization and final checks.
The data required for the MPTs are stored in the ``TrieData'' segment in memory. Whenever we need to hash an MPT, we recover the information from the ``TrieData'' segment and write it in the correct format in the ``RlpRaw'' segment. We start by getting the node type. If the node is a hash node, we simply return its value. Otherwise, we RLP encode the node recursively:
\begin{itemize}
\item If it is an empty node, the encoding is $\texttt{0x80}$.
\item If it is a branch node, we encode the node's value and append it to the RLP tape. Then, we encode each of the children and append the encodings to the RLP tape.
\item If it is an extension node, we RLP encode its child and hex prefix it.
\item If it is a leaf, we RLP encode it depending on the type of trie, and hex prefix the encoding. Note that for a receipt leaf, the encoding is $\texttt{RLP}(\texttt{type} \,||\, \texttt{RLP}(\texttt{receipt}))$. In the case of a transaction, its RLP encoding is already provided by the input, so we simply load it from memory.
\end{itemize}
Finally, we hash the output of the RLP encoding, stored in ``RlpRaw'' -- unless it is already a hash.
\subsection{Simple opcodes VS Syscalls}
For simplicity and efficiency, EVM opcodes are categorized into two groups: ``simple opcodes'' and ``syscalls''. Simple opcodes are generated directly in Rust, in \href{https://github.com/0xPolygonZero/plonky2/blob/main/evm/src/witness/operation.rs}{operation.rs}. Every call to a simple opcode adds exactly one row to the \href{https://github.com/0xPolygonZero/plonky2/blob/main/evm/spec/tables/cpu.tex}{cpu table}. Syscalls are more complex structures written with simple opcodes, in the kernel.


@ -1,9 +1,16 @@
\section{Merkle Patricia tries}
\section{Merkle Patricia Tries}
\label{tries}
The \emph{EVM World state} is a representation of the different accounts at a particular time, as well as the last processed transactions together with their receipts. The world state is represented using \emph{Merkle Patricia Tries} (MPTs) \cite[App.~D]{yellowpaper}, and there are three different tries: the state trie, the transaction trie and the receipt trie.
For each transaction we need to show that the prover knows preimages of the hashed initial and final EVM states. When the kernel starts execution, it stores these three tries within the {\tt Segment::TrieData} segment. The prover loads the initial tries from the inputs into memory. Subsequently, the tries are modified during transaction execution, inserting new nodes or deleting existing nodes.
An MPT is composed of five different types of nodes: branch, extension, leaf, empty and digest nodes. Branch and leaf nodes might contain a payload whose format depends on the particular trie. The nodes are encoded, primarily using RLP encoding and Hex-prefix encoding (see \cite{yellowpaper} App. B and C, respectively). The resulting encoding is then hashed, following a strategy similar to that of normal Merkle trees, to generate the trie hashes.
Insertion and deletion are performed in the same way as in other MPT implementations. The only difference arises when inserting extension nodes, where we create a new node with the new data instead of modifying the existing one. In the rest of this section we describe how the MPTs are represented in memory, how they are given as input, and how they are hashed.
\subsection{Internal memory format}
Withour our zkEVM's kernel memory,
The tries are stored in kernel memory, specifically in the {\tt Segment::TrieData} segment. Each node type is stored as follows:
\begin{enumerate}
\item An empty node is encoded as $(\texttt{MPT\_NODE\_EMPTY})$.
\item A branch node is encoded as $(\texttt{MPT\_NODE\_BRANCH}, c_1, \dots, c_{16}, v)$, where each $c_i$ is a pointer to a child node, and $v$ is a pointer to a value. If a branch node has no associated value, then $v = 0$, i.e. the null pointer.
@ -12,15 +19,76 @@ Withour our zkEVM's kernel memory,
\item A digest node is encoded as $(\texttt{MPT\_NODE\_HASH}, d)$, where $d$ is a Keccak256 digest.
\end{enumerate}
The values, or payloads, on the other hand, are represented differently depending on the particular trie.
\subsubsection{State trie}
The state trie payload contains the account data. Each account is stored in 4 contiguous memory addresses containing
\begin{enumerate}
\item the nonce,
\item the balance,
\item a pointer to the account's storage trie,
\item a hash of the account's code.
\end{enumerate}
The storage trie payload in turn is a single word.
\subsubsection{Transaction Trie}
The transaction trie nodes contain the length of the RLP encoded transaction, followed by the bytes of the RLP encoding of the transaction.
\subsubsection{Receipt Trie}
The payload of the receipts trie is a receipt. Each receipt is stored as
\begin{enumerate}
\item the length in words of the payload,
\item the status,
\item the cumulative gas used,
\item the bloom filter, stored as 256 words,
\item the number of topics,
\item the topics,
\item the data length,
\item the data.
\end{enumerate}
\subsection{Prover input format}
The initial state of each trie is given by the prover as a nondeterministic input tape. This tape has a slightly different format:
\begin{enumerate}
\item An empty node is encoded as $(\texttt{MPT\_NODE\_EMPTY})$.
\item A branch node is encoded as $(\texttt{MPT\_NODE\_BRANCH}, v_?, c_1, \dots, c_{16})$. Here $v_?$ consists of a flag indicating whether a value is present,\todo{In the current implementation, we use a length prefix rather than a is-present prefix, but we plan to change that.} followed by the actual value payload if one is present. Each $c_i$ is the encoding of a child node.
\item An extension node is encoded as $(\texttt{MPT\_NODE\_EXTENSION}, k, c)$, $k$ represents the part of the key associated with this extension, and is encoded as a 2-tuple $(\texttt{packed\_nibbles}, \texttt{num\_nibbles})$. $c$ is a pointer to a child node.
\item A branch node is encoded as $(\texttt{MPT\_NODE\_BRANCH}, v_?, c_1, \dots, c_{16})$. Here $v_?$ consists of a flag indicating whether a value is present, followed by the actual value payload if one is present. Each $c_i$ is the encoding of a child node.
\item An extension node is encoded as $(\texttt{MPT\_NODE\_EXTENSION}, k, c)$, where $k$ represents the part of the key associated with this extension, and is encoded as a 2-tuple $(\texttt{packed\_nibbles}, \texttt{num\_nibbles})$. $c$ is a pointer to a child node.
\item A leaf node is encoded as $(\texttt{MPT\_NODE\_LEAF}, k, v)$, where $k$ is a 2-tuple as above, and $v$ is a value payload.
\item A digest node is encoded as $(\texttt{MPT\_NODE\_HASH}, d)$, where $d$ is a Keccak256 digest.
\end{enumerate}
Nodes are thus given in depth-first order, enabling natural recursive methods for encoding and decoding this format.
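The depth-first layout lends itself to a short recursive reader. The following Python sketch is only an illustration of the format described above, not the kernel's implementation. The numeric tag values, the fixed payload size {\tt PAYLOAD\_LEN}, and the choice to inline each child encoding directly after its parent are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical numeric tags: the text only names these constants.
(MPT_NODE_EMPTY, MPT_NODE_HASH, MPT_NODE_BRANCH,
 MPT_NODE_EXTENSION, MPT_NODE_LEAF) = range(5)

# Assumed fixed payload size in words (e.g. a single-word storage value).
PAYLOAD_LEN = 1


def decode_node(tape: List[int], pos: int = 0) -> Tuple[tuple, int]:
    """Decode one node starting at tape[pos]; return (node, next position)."""
    tag = tape[pos]
    pos += 1
    if tag == MPT_NODE_EMPTY:
        return ("empty",), pos
    if tag == MPT_NODE_HASH:
        return ("digest", tape[pos]), pos + 1
    if tag == MPT_NODE_BRANCH:
        # v_?: an is-present flag, followed by the payload if present.
        has_value = tape[pos]
        pos += 1
        value = None
        if has_value:
            value = tape[pos:pos + PAYLOAD_LEN]
            pos += PAYLOAD_LEN
        children = []
        for _ in range(16):
            child, pos = decode_node(tape, pos)
            children.append(child)
        return ("branch", value, children), pos
    if tag == MPT_NODE_EXTENSION:
        # k = (packed_nibbles, num_nibbles), then the child encoding.
        key = (tape[pos], tape[pos + 1])
        child, pos = decode_node(tape, pos + 2)
        return ("extension", key, child), pos
    if tag == MPT_NODE_LEAF:
        key = (tape[pos], tape[pos + 1])
        pos += 2
        value = tape[pos:pos + PAYLOAD_LEN]
        return ("leaf", key, value), pos + PAYLOAD_LEN
    raise ValueError(f"unknown node tag {tag} at position {pos - 1}")


# An extension node pointing to a branch with one leaf and one digest child.
tape = (
    [MPT_NODE_EXTENSION, 0xAB, 2, MPT_NODE_BRANCH, 0]
    + [MPT_NODE_LEAF, 0x1, 1, 42]
    + [MPT_NODE_HASH, 0xDEAD]
    + [MPT_NODE_EMPTY] * 14
)
node, end = decode_node(tape)
```

Because every node is self-delimiting, the decoder needs no length fields for the tree structure itself; only variable-size payloads require explicit lengths.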
The payloads of the state and receipt tries are given in the natural sequential way. The transaction and receipt payloads contain variable-size data, so their input is slightly different. The prover input for a transaction is its RLP encoding preceded by its length. The input for a receipt is given in the natural sequential way, except that the topics and the data are each preceded by their length.
\subsection{Encoding and Hashing}
Encoding is done recursively, starting from the trie root. Leaf, branch and extension nodes are each encoded as the RLP encoding of a list, following \cite[App.~D]{yellowpaper}. The list contains
\begin{description}
\item[Leaf Node:] the hex-prefix encoding of the node key and the encoding of the payload,
\item[Branch Node:] the hash or encoding of each of the 16 children and the encoding of the payload,
\item[Extension Node:] the hex-prefix encoding of the node key and the hash or encoding of the child.
\end{description}
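The hex-prefix encoding of node keys follows Appendix C of the Yellow Paper \cite{yellowpaper}. The Python sketch below illustrates that definition; it is not the kernel's assembly routine, and the function name {\tt hex\_prefix} and the representation of a key as a list of nibbles are choices made for the example.

```python
from typing import List


def hex_prefix(nibbles: List[int], is_leaf: bool) -> bytes:
    """Hex-prefix encode a nibble sequence (Yellow Paper, Appendix C).

    The high nibble of the first byte stores two flags: bit 1 marks a
    leaf key, bit 0 marks an odd number of nibbles.
    """
    flag = 2 if is_leaf else 0
    if len(nibbles) % 2 == 0:
        out = [16 * flag]          # even length: low nibble is padding
        rest = nibbles
    else:
        out = [16 * (flag + 1) + nibbles[0]]  # odd: first nibble fits here
        rest = nibbles[1:]
    # Pack the remaining nibbles two per byte.
    for i in range(0, len(rest), 2):
        out.append(16 * rest[i] + rest[i + 1])
    return bytes(out)
```

For instance, the odd-length extension key $(1, 2, 3, 4, 5)$ encodes to the bytes {\tt 0x11 0x23 0x45}, while the same key on a leaf encodes to {\tt 0x31 0x23 0x45}.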
For the rest of the nodes we have:
\begin{description}
\item[Empty Node:] the encoding of an empty node is {\tt 0x80},
\item[Digest Node:] the encoding of a digest node stored as $({\tt MPT\_NODE\_HASH}, d)$ is $d$.
\end{description}
The payloads in turn are RLP encoded as follows:
\begin{description}
\item[State Trie:] Encoded as a list containing the nonce, the balance, the storage trie hash and the code hash.
\item[Storage Trie:] The RLP encoding of the value (hence the double RLP encoding).
\item[Transaction Trie:] The RLP encoded transaction.
\item[Receipt Trie:] Depending on the transaction type, it is encoded as ${\sf RLP}({\sf RLP}({\tt receipt}))$ for legacy transactions or ${\sf RLP}({\tt txn\_type}||{\sf RLP}({\tt receipt}))$ for transactions of type 1 or 2. Each receipt is encoded as a list containing:
\begin{enumerate}
\item the status,
\item the cumulative gas used,
\item the bloom filter, stored as a list of length 256,
\item the list of topics,
\item the data string.
\end{enumerate}
\end{description}
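To make the payload encodings concrete, the following Python sketch implements RLP as defined in Appendix B of the Yellow Paper \cite{yellowpaper} and applies it to an account payload and a storage value. It is an illustration under stated assumptions rather than the kernel's code: the names {\tt rlp\_encode} and {\tt \_length\_prefix} and the sample account values are hypothetical.

```python
def _length_prefix(length: int, offset: int) -> bytes:
    """RLP header for a string (offset 0x80) or a list (offset 0xC0)."""
    if length < 56:
        return bytes([offset + length])
    len_bytes = length.to_bytes((length.bit_length() + 7) // 8, "big")
    return bytes([offset + 55 + len(len_bytes)]) + len_bytes


def rlp_encode(item) -> bytes:
    """RLP-encode an int (as a scalar), a byte string, or a list of items."""
    if isinstance(item, int):
        # Scalars are big-endian with no leading zeros; 0 is the empty string.
        item = item.to_bytes((item.bit_length() + 7) // 8, "big")
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] < 0x80:
            return item  # a single small byte is its own encoding
        return _length_prefix(len(item), 0x80) + item
    payload = b"".join(rlp_encode(x) for x in item)
    return _length_prefix(len(payload), 0xC0) + payload


# State trie payload: [nonce, balance, storage trie hash, code hash].
# The two 32-byte hashes are placeholders, not real digests.
account = rlp_encode([1, 0, b"\x11" * 32, b"\x22" * 32])

# Storage trie payload: the value is RLP encoded, and that byte string is
# RLP encoded again when it is placed in the leaf node's list.
storage_value = rlp_encode(0x1234)
storage_in_leaf = rlp_encode(storage_value)
```

Note that {\tt rlp\_encode(b"")} yields the single byte {\tt 0x80}, which matches the encoding of an empty node given above.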
Once a node is encoded, it is written to the {\tt Segment::RlpRaw} segment as a sequence of bytes. The RLP-encoded data is then hashed if its length is at least 32 bytes; otherwise, we return the encoding itself. Further details can be found in the \href{https://github.com/0xPolygonZero/plonky2/tree/main/evm/src/cpu/mpt/hash}{mpt hash folder}.

Binary file not shown.