plonky2/evm/spec/mpts.tex

\section{Merkle Patricia Tries}
\label{tries}
The \emph{EVM World state} is a representation of the different accounts at a particular time, as well as the last processed transactions together with their receipts. The world state is represented using \emph{Merkle Patricia Tries} (MPTs) \cite[App.~D]{yellowpaper}, and there are three different tries: the state trie, the transaction trie and the receipt trie.

For each transaction we need to show that the prover knows preimages of the hashed initial and final EVM states.  When the kernel starts execution, it stores these three tries within the {\tt Segment::TrieData} segment. The prover loads the initial tries from the inputs into memory. Subsequently, the tries are modified during transaction execution, inserting new nodes or deleting existing nodes.

An MPT is composed of five different nodes: branch, extension, leaf, empty and digest nodes. Branch and leaf nodes might contain a payload whose format depends on the particular trie. The nodes are encoded, primarily using RLP encoding and Hex-prefix encoding (see \cite{yellowpaper} App. B and C, respectively). The resulting encoding is then hashed, following a strategy similar to that of normal Merkle trees, to generate the trie hashes.

Insertion and deletion is performed in the same way as other MPTs implementations. The only difference is for inserting extension nodes where we create a new node with the new data, instead of modifying the existing one. In the rest of this section we describe how the MPTs are represented in memory, how they are given as input, and how MPTs are hashed.

\subsection{Internal memory format}

The tries are stored in kernel memory, specifically in the {\tt Segment:TrieData} segment. Each node type is stored as
\begin{enumerate}
  \item An empty node is encoded as $(\texttt{MPT\_NODE\_EMPTY})$.
  \item A branch node is encoded as $(\texttt{MPT\_NODE\_BRANCH}, c_1, \dots, c_{16}, v)$, where each $c_i$ is a pointer to a child node, and $v$ is a pointer to a value. If a branch node has no associated value, then $v = 0$, i.e. the null pointer.
  \item An extension node is encoded as $(\texttt{MPT\_NODE\_EXTENSION}, k, c)$, $k$ represents the part of the key associated with this extension, and is encoded as a 2-tuple $(\texttt{packed\_nibbles}, \texttt{num\_nibbles})$. $c$ is a pointer to a child node.
  \item A leaf node is encoded as $(\texttt{MPT\_NODE\_LEAF}, k, v)$, where $k$ is a 2-tuple as above, and $v$ is a pointer to a value.
  \item A digest node is encoded as $(\texttt{MPT\_NODE\_HASH}, d)$, where $d$ is a Keccak256 digest.
\end{enumerate}

On the other hand the values or payloads are represented differently depending on the particular trie.

\subsubsection{State trie}
The state trie payload contains the account data. Each account is stored in 4 contiguous memory addresses containing
\begin{enumerate}
	\item the nonce,
	\item the balance,
	\item a pointer to the account's storage trie,
	\item a hash of the account's code.
\end{enumerate}
The storage trie payload in turn is a single word.

\subsubsection{Transaction Trie}
The transaction trie nodes contain the length of the RLP encoded transaction, followed by the bytes of the RLP encoding of the transaction.

\subsubsection{Receipt Trie}
The payload of the receipts trie is a receipt. Each receipt is stored as
\begin{enumerate}
	\item the length in words of the payload,
	\item the status,
	\item the cumulative gas used,
	\item the bloom filter, stored as 256 words.
	\item the number of topics,
	\item the topics
	\item the data length,
	\item the data.
\end{enumerate}


\subsection{Prover input format}

The initial state of each trie is given by the prover as a nondeterministic input tape. This tape has a slightly different format:
\begin{enumerate}
  \item An empty node is encoded as $(\texttt{MPT\_NODE\_EMPTY})$.
  \item A branch node is encoded as $(\texttt{MPT\_NODE\_BRANCH}, v_?, c_1, \dots, c_{16})$. Here $v_?$ consists of a flag indicating whether a value is present, followed by the actual value payload if one is present. Each $c_i$ is the encoding of a child node.
  \item An extension node is encoded as $(\texttt{MPT\_NODE\_EXTENSION}, k, c)$, where $k$ represents the part of the key associated with this extension, and is encoded as a 2-tuple $(\texttt{packed\_nibbles}, \texttt{num\_nibbles})$. $c$ is a pointer to a child node.
  \item A leaf node is encoded as $(\texttt{MPT\_NODE\_LEAF}, k, v)$, where $k$ is a 2-tuple as above, and $v$ is a value payload.
  \item A digest node is encoded as $(\texttt{MPT\_NODE\_HASH}, d)$, where $d$ is a Keccak256 digest.
\end{enumerate}
Nodes are thus given in depth-first order, enabling natural recursive methods for encoding and decoding this format.
The payload of state and receipt tries is given in the natural sequential way. The transaction an receipt payloads contain variable size data, thus the input is slightly different. The prover input for for the transactions is the transaction RLP encoding preceeded by its length. For the receipts is in the natural sequential way, except that topics and data are preceeded by their lengths, respectively.

\subsection{Encoding and Hashing}

Encoding is done recursively starting from the trie root. Leaf, branch and extension nodes are encoded as the RLP encoding of list containing the hex prefix encoding of the node key as well as

\begin{description}
	\item[Leaf Node:] the encoding of the the payload,
	\item[Branch Node:] the hash or encoding of the 16 children and the encoding of the payload,
	\item[Extension Node:] the hash or encoding of the child and the encoding of the payload.
\end{description}
For the rest of the nodes we have:
\begin{description}
	\item[Empty Node:] the encoding of an empty node is {\tt 0x80},
	\item[Digest Node:] the encoding of a digest node stored as $({\tt MPT\_HASH\_NODE}, d)$ is $d$.
\end{description}

The payloads in turn are RLP encoded as follows
\begin{description}
	\item[State Trie:] Encoded as a list containing nonce, balance, storage trie hash and code hash.
	\item[Storage Trie:] The RLP encoding of the value (thus the double RLP encoding)
	\item[Transaction Trie:] The RLP encoded transaction.
	\item[Receipt Trie:] Depending on the transaction type it's encoded as ${\sf RLP}({\sf RLP}({\tt receipt}))$ for Legacy transactions or ${\sf RLP}({\tt txn\_type}||{\sf RLP}({\tt receipt}))$ for transactions of type 1 or 2. Each receipt is encoded as a list containing:
	\begin{enumerate}
		\item the status,
		\item the cumulative gas used,
		\item the bloom filter, stored as a list of length 256.
		\item the list of topics
		\item the data string.
	\end{enumerate}
\end{description}

Once a node is encoded it is written to the {\tt Segment::RlpRaw} segment as a sequence of bytes. Then the RLP encoded data is hashed if the length of the data is more than 32 bytes. Otherwise we return the encoding. Further details can be found in the \href{https://github.com/0xPolygonZero/plonky2/tree/main/evm/src/cpu/mpt/hash}{mpt hash folder}.