nimbus-eth1/nimbus/db/aristo/README.md

Aristo Trie -- a Patricia Trie with Merkle hash labeled edges
=============================================================
These data structures allows to overlay the *Patricia Trie* with *Merkel
Trie* hashes. With a particular layout, the structure is called
and *Aristo Trie* (Patricia = Roman Aristocrat, Patrician.)

This description does assume familiarity with the abstract notion of a hexary
*Merkle Patricia [Trie](https://en.wikipedia.org/wiki/Trie)*. Suffice it to
say the state of a valid *Merkle Patricia Tree* is uniquely verified by its
top level vertex.

1. Deleting entries in a compact *Merkle Patricia Tree*
-------------------------------------------------------
The main feature of the *Aristo Trie* representation is that there are no
double used nodes any sub-trie as it happens with the representation as a
[compact Merkle Patricia Tree](http://archive.is/TinyK). For example,
consider the following state data for the latter.

      leaf = (0xf,0x12345678)                                            (1)
      branch = (a,a,a,,, ..) with a = hash(leaf)
      root = hash(branch)

These two nodes, called *leaf* and *branch*, and the *root* hash are a state
(aka key-value pairs) representation as a *compact Merkle Patricia Tree*. The
actual state is

      0x0f ==> 0x12345678
      0x1f ==> 0x12345678
      0x2f ==> 0x12345678

The elements from *(1)* can be organised in a key-value table with the *Merkle*
hashes as lookup keys

      a    -> leaf
      root -> branch

This is a space efficient way of keeping data as there is no duplication of
the sub-trees made up by the *Leaf* node with the same payload *0x12345678*
and path snippet *0xf*. One can imagine how this property applies to more
general sub-trees in a similar fashion.

Now delete some key-value pair of the state, e.g. for the key *0x0f*. This
amounts to removing the first of the three *a* hashes from the *branch*
record. The new state of the *Merkle Patricia Tree* will look like

      leaf = (0xf,0x12345678)                                            (2)
      branch1 = (,a,a,,, ..)
      root1 = hash(branch1)

      a     -> leaf
      root1 -> branch1

A problem arises when all keys are deleted and there is no reference to the
*leaf* data record, anymore. One should find out in general when it can be
deleted, too. It might be unknown whether the previous states leading to here
had only a single *Branch* record referencing to this *leaf* data record.

Finding a stale data record can be achieved by a *mark and sweep* algorithm,
but it becomes too clumsy to be useful on a large state (i.e. database).
Reference counts come to mind but maintaining these is generally error prone
when actors concurrently manipulate the state (i.e. database).

2. *Patricia Trie* example with *Merkle hash* labelled edges
------------------------------------------------------------
Continuing with the example from chapter 1, the *branch* node is extended by
an additional set of structural identifiers *x, w, z*. It allows to handle
the deletion of entries in a more benign way while keeping the *Merkle hashes*
for validating sub-trees.

A solution for the deletion problem is to represent the situation *(1)* as

      leaf-a = (0xf,0x12345678) copy of leaf from (1)                    (3)
      leaf-b = (0xf,0x12345678) copy of leaf from (1)
      leaf-c = (0xf,0x12345678) copy of leaf from (1)
      branch2 = ((x,y,z,,, ..)(a,b,c,,, ..))
      root2 = (w,root) with root from (1)

where

      a = hash(leaf-a) same as a from (1)
      b = hash(leaf-b) same as a from (1)
      c = hash(leaf-c) same as a from (1)

      w,x,y,z numbers, mutually different

The records above are stored in a key-value database as

      w -> branch2
      x -> leaf-a
      y -> leaf-b
      z -> leaf-c

Then this structure encodes the key-value pairs as before

      0x0f ==> 0x12345678
      0x1f ==> 0x12345678
      0x2f ==> 0x12345678

Deleting the data for key *0x0f* now results in the new state

      leaf-b = (0xf,0x12345678)                                          (4)
      leaf-c = (0xf,0x12345678)
      branch3 = ((,y,z,,, ..)(,b,c,,, ..))

      w -> branch3
      y -> leaf-b
      z -> leaf-c

Due to duplication of the *leaf* node in *(3)*, no reference count is needed
in order to detect stale records cleanly when deleting key *0x0f*. Removing
this key allows to remove hash *a* from *branch2* as well as also structural
key *x* which will consequently be deleted from the lookup table.

A minor observation is that manipulating a state entry, e.g. changing the
payload associated with key *0x0f* to

      0x0f ==> 0x987654321

the structural layout of the above trie will not change, that is the indexes
*w, x, y, z* of the table that holds the data records as values. All that
changes are values.

      leaf-d = (0xf,0x987654321)                                         (5)
      leaf-b = (0xf,0x12345678)
      leaf-c = (0xf,0x12345678)
      branch3 = ((x,y,z,,, ..)(d,b,c,,, ..))

      root3 = (w,hash(d,b,c,,, ..))

3. Discussion of the examples *(1)* and *(3)*
---------------------------------------------
Examples *(1)* and *(3)* differ in that the structural *Patricia Trie*
information from *(1)* has been removed from the *Merkle hash* instances and
implemented as separate table lookup IDs (called *vertexID*s later on.) The
values of these lookup IDs are arbitrary as long as they are all different.

In fact, the [Erigon](http://archive.is/6MJV7) project discusses a similar
situation in **Separation of keys and the structure**, albeit aiming for a
another scenario with the goal of using mostly flat data lookup structures.

A graph for the example *(1)* would look like

                |
               root
                |
         +-------------+
         |   branch    |
         +-------------+
              | | |
              a a a
              | | |
              leaf

while example *(2)* has

              (root)                                                     (6)
                |
                w
                |
         +-------------+
         |   branch2   |
         | (a) (b) (c) |
         +-------------+
            /   |   \
           x    y    z
          /     |     \
       leaf-a leaf-b leaf-c

The labels on the edges indicate the downward target of an edge while the
round brackets enclose separated *Merkle hash* information.

This last example (6) can be completely split into structural tree and Merkel
hash mapping.

         structural trie              hash map                           (7)
         ---------------              --------
                |                  (root) -> w
                w                     (a) -> x
                |                     (b) -> y
         +-------------+              (c) -> z
         |   branch2   |
         +-------------+
            /   |   \
           x    y    z
          /     |     \
       leaf-a leaf-b leaf-c


4. *Patricia Trie* node serialisation with *Merkle hash* labelled edges
-----------------------------------------------------------------------
The data structure for the *Aristo Trie* forllows example *(7)* by keeping
structural information separate from the Merkle hash labels. As for teminology,

* an *Aristo Trie* is a pair *(structural trie, hash map)* where
* the *structural trie* realises a haxary *Patricia Trie* containing the payload
  values in the leaf records
* the *hash map* contains the hash information so that this trie operates as a
  *Merkle Patricia Tree*.

In order to accommodate for the additional structural elements, a non RLP-based
data layout is used for the *Branch*, *Extension*, and *Leaf* containers used
in the key-value table that implements the *Patricia Trie*. It is now called
*Aristo Trie* for this particular data layout.

The structural keys *w, x, y, z* from the example *(3)* are called *vertexID*
and implemented as 64 bit values, stored *Big Endian* in the serialisation.

### 4.1 Branch record serialisation

        0 +--+--+--+--+--+--+--+--+--+
          |                          |       -- first vertexID
        8 +--+--+--+--+--+--+--+--+--+
          ...                                -- more vertexIDs
          +--+--+
          |     |                            -- access(16) bitmap
          +--+--+
          || |                               -- marker(2) + unused(6)
          +--+

        where
          marker(2) is the double bit array 00

For a given index *n* between *0..15*, if the bit at position *n* of the it
vector *access(16)* is reset to zero, then there is no *n*-th structural
*vertexID*. Otherwise one calculates

        the n-th vertexID is at position Vn * 8
        for Vn the number of non-zero bits in the range 0..(n-1) of access(16)

Note that data are stored *Big Endian*, so the bits *0..7* of *access* are
stored in the right byte of the serialised bitmap.

### 4.2 Extension record serialisation

        0 +--+--+--+--+--+--+--+--+--+
          |                          |       -- vertexID
        8 +--+--+--+--+--+--+--+--+--+
          |  | ...                           -- path segment
          +--+
          || |                               -- marker(2) + pathSegmentLen(6)
          +--+

        where
          marker(2) is the double bit array 10

The path segment of the *Extension* record is compact encoded. So it has at
least one byte. The first byte *P0* has bit 5 reset, i.e. *P0 and 0x20* is
zero (bit 4 is set if the right nibble is the first part of the path.)

Note that the *pathSegmentLen(6)* is redunant as it is determined by the length
of the extension record (as *recordLen - 9*.)

### 4.3 Leaf record serialisation

        0 +-- ..
          ...                                -- payload (may be empty)
          +--+
          |  | ...                           -- path segment
          +--+
          || |                               -- marker(2) + pathSegmentLen(6)
          +--+

        where
          marker(2) is the double bit array 11

A *Leaf* record path segment is compact encoded. So it has at least one byte.
The first byte *P0* has bit 5 set, i.e. *P0 and 0x20* is non-zero (bit 4 is
also set if the right nibble is the first part of the path.)

If present, the serialisation of the payload field can be either for account
data, for RLP encoded or for unstructured data as defined below.

### 4.4 Leaf record payload serialisation for account data

        0 +-- ..  --+
          |         |                        -- nonce, 0 or 8 bytes
          +-- ..  --+--+
          |            |                     -- balance, 0, 8, or 32 bytes
          +-- ..  --+--+
          |         |                        -- storage ID, 0 or 8 bytes
          +-- ..  --+--+
          |            |                     -- code hash, 0, 8 or 32 bytes
          +--+ .. --+--+
          |  |                               -- bitmask(2)-word array
          +--+

        where each bitmask(2)-word array entry defines the length of
        the preceeding data fields:
          00 -- field is missing
          01 -- field lengthh is 8 bytes
          10 -- field lengthh is 32 bytes

Apparently, entries 0 and and 2 of the bitmask(2) word array cannot have the
value 10 as they refer to the nonce and the storage ID data fields. So, joining
the bitmask(2)-word array to a single byte, the maximum value of that byte is
0x99.

### 4.5 Leaf record payload serialisation for RLP encoded data

        0 +--+ .. --+
          |  |      |                        -- data, at least one byte
          +--+ .. --+
          |  |                               -- marker byte
          +--+

        where the marker byte is 0xaa

### 4.6 Leaf record payload serialisation for unstructured data

        0 +--+ .. --+
          |  |      |                        -- data, at least one byte
          +--+ .. --+
          |  |                               -- marker byte
          +--+

        where the marker byte is 0xff

### 4.7 Descriptor record serialisation

        0 +-- ..
          ...                                -- recycled vertexIDs
          +--+--+--+--+--+--+--+--+
          |                       |          -- bottom of unused vertexIDs
          +--+--+--+--+--+--+--+--+
          || |                               -- marker(2) + unused(6)
          +--+

        where
          marker(2) is the double bit array 01

Currently, the descriptor record only contains data for producing unique
vectorID values that can be used as structural keys. If this descriptor is
missing, the value *(0x40000000,0x01)* is assumed. The last vertexID in the
descriptor list has the property that that all values greater or equal than
this value can be used as vertexID.

The vertexIDs in the descriptor record must all be non-zero and record itself
should be allocated in the structural table associated with the zero key.

5. *Patricia Trie* implementation notes
---------------------------------------

### 5.1 Database decriptor representation

        ^      +----------+
        |      | top      |   active delta layer, application cache
        |      +----------+
        |      +----------+   ^
       db      | stack[n] |   |
       desc    |    :     |   |  optional passive delta layers, handled by
       obj     | stack[1] |   |  transaction management (can be used to
        |      | stack[0] |   |  successively replace the top layer)
        |      +----------+   v
        |      +----------+
        |      | roFilter |   optional read-only backend filter
        |      +----------+
        |      +----------+
        |      | backend  |   optional physical key-value backend database
        v      +----------+

 There is a three tier access to a key-value database entry as in

        top -> roFilter -> backend

where only the *top* layer is obligatory.

### 5.2 Distributed access using the same backend

There can be many descriptors for the same database. Due to delta layers and
filters, each descriptor instance can work with a different state of the
database.

Although there is only one of the instances that can write the current state
on the physical backend database, this priviledge can be shifted to any
instance for the price of updating the *roFiters* for all other instances.

#### Example:

        db1   db2   db3       -- db1, db2, .. database descriptors/handles
         |     |     |
        tx1   tx2   tx3       -- tx1, tx2, ..transaction/top layers
         |     |     |
         ø     ø     ø        -- no backend filters yet
          \    |    /
           \   |   /
              PBE             -- physical backend database

After collapse/committing *tx1* and saving it to the physical backend
database, the above architecture mutates to

        db1   db2   db3
         |     |     |
         ø    tx2   tx3
         |     |     |
         ø   ~tx1  ~tx1       -- filter reverting the effect of tx1 on PBE
          \    |    /
           \   |   /
            tx1+PBE           -- tx1 merged into physical backend database

When looked at descriptor API there are no changes when accessing data via
*db1*, *db2*, or *db3*. In a different, more algebraic notation, the above
tansformation is written as

        | tx1, ø |                                                   (8)
        | tx2, ø | PBE
        | tx3, ø |

            ||
            \/

        |  ø,    ø  |                                                (9)
        | tx2, ~tx1 | tx1+PBE
        | tx3, ~tx1 |

 The system can be further converted without changing the API by committing
 and saving *tx2* on the middle line of matrix (9)

        |  ø,       ø  |                                             (10)
        |  ø, tx2+~tx1 | tx1+PBE
        | tx3,    ~tx1 |

            ||
            \/

        |  ø,       ~(tx2+~tx1) |                                    (11)
        |  ø,               ø   | (tx2+~tx1)+tx1+PBE
        | tx3, ~tx1+~(tx2+~tx1) |

The *+* notation just means the repeated application of filters in
left-to-right order. The notation looks like algebraic group notation but this
will not be analysed further as there is no need for a general theory for the
current implementation.

Suffice to say that the inverse *~tx* of *tx* is calculated against the
current state of the physical backend database which makes it messy to
formulate boundary conditions.

Nevertheless, *(8)* can alse be transformed by committing and saving *tx2*
(rather than *tx1*.) This gives

        | tx1, ~tx2 |                                                (12)
        |  ø,    ø  | tx2+PBE
        | tx3, ~tx2 |

            ||
            \/

        |  ø, (tx1+~tx2) |                                           (13)
        |  ø,        ø   | tx2+PBE
        | tx3,     ~tx2  |

 As *(11)* and *(13)* represent the same API, one has

        tx2+PBE == tx1+(tx2+~tx1)+PBE    because of the middle rows  (14)
        ~tx2    == ~tx1+~(tx2+~tx1)      because of (14)             (15)

 which shows some distributive property in *(14)* and commutative property in
 *(15)* for this example. In particulat it might be handy for testing/verifying
 against this example.