nimbus-eth1/nimbus/db/aristo
Jacek Sieka d39c589ec3
lru cache updates (#2590)
* replace rocksdb row cache with larger rdb lru caches - these serve the
same purpose but are more efficient because they skips serialization,
locking and rocksdb layering
* don't append fresh items to cache - this has the effect of evicting
the existing items and replacing them with low-value entries that might
never be read - during write-heavy periods of processing, the
newly-added entries were evicted during the store loop
* allow tuning rdb lru size at runtime
* add (hidden) option to print lru stats at exit (replacing the
compile-time flag)

pre:
```
INF 2024-09-03 15:07:01.136+02:00 Imported blocks
blockNumber=20012001 blocks=12000 importedSlot=9216851 txs=1837042
mgas=181911.265 bps=11.675 tps=1870.397 mgps=176.819 avgBps=10.288
avgTps=1574.889 avgMGps=155.952 elapsed=19m26s458ms
```

post:
```
INF 2024-09-03 13:54:26.730+02:00 Imported blocks
blockNumber=20012001 blocks=12000 importedSlot=9216851 txs=1837042
mgas=181911.265 bps=11.637 tps=1864.384 mgps=176.250 avgBps=11.202
avgTps=1714.920 avgMGps=169.818 elapsed=17m51s211ms
```

9%:ish import perf improvement on similar mem usage :)
2024-09-05 11:18:32 +02:00
..
aristo_check Cache a storage root ID forever in the leaf payload of an account (#2551) 2024-08-07 13:28:01 +00:00
aristo_delete Revert lazy implementation (#2585) 2024-09-02 10:34:42 +00:00
aristo_delta Revert lazy implementation (#2585) 2024-09-02 10:34:42 +00:00
aristo_desc avoid some trivial memory allocations (#2587) 2024-09-02 16:03:10 +02:00
aristo_init lru cache updates (#2590) 2024-09-05 11:18:32 +02:00
aristo_merge Store cached hash at the layer corresponding to the source data (#2492) 2024-07-18 09:13:56 +02:00
aristo_part Remove vertex ID recycle function (#2558) 2024-08-12 20:56:15 +00:00
aristo_tx Avoid unnecessary layer copy (#2567) 2024-08-19 08:46:04 +02:00
aristo_walk Aristo and kvt balancer management update (#2504) 2024-07-18 21:32:32 +00:00
README.md No ext update (#2494) 2024-07-16 19:47:59 +00:00
TODO.md Simplify aristo tree deletion functionality (#2563) 2024-08-14 12:09:30 +00:00
aristo_api.nim Provide portal proof functionality with coredb (#2550) 2024-08-07 11:30:55 +00:00
aristo_blobify.nim avoid some trivial memory allocations (#2587) 2024-09-02 16:03:10 +02:00
aristo_check.nim Provide portal proof functionality with coredb (#2550) 2024-08-07 11:30:55 +00:00
aristo_compute.nim avoid some trivial memory allocations (#2587) 2024-09-02 16:03:10 +02:00
aristo_constants.nim Simplify aristo tree deletion functionality (#2563) 2024-08-14 12:09:30 +00:00
aristo_debug.nim Aristo lazily delete larger subtrees (#2560) 2024-08-14 08:54:44 +00:00
aristo_delete.nim Aristo lazily delete larger subtrees (#2560) 2024-08-14 08:54:44 +00:00
aristo_delta.nim Coredb fix kvt save only fringe condition (#2592) 2024-09-04 13:48:38 +00:00
aristo_desc.nim Aristo lazily delete larger subtrees (#2560) 2024-08-14 08:54:44 +00:00
aristo_fetch.nim Aristo lazily delete larger subtrees (#2560) 2024-08-14 08:54:44 +00:00
aristo_get.nim Provide partial tree support for preloading tests (#2536) 2024-07-29 20:15:17 +00:00
aristo_hike.nim Store cached hash at the layer corresponding to the source data (#2492) 2024-07-18 09:13:56 +02:00
aristo_init.nim Aristo and ledger small updates (#1888) 2023-11-08 16:52:25 +00:00
aristo_layers.nim Revert lazy implementation (#2585) 2024-09-02 10:34:42 +00:00
aristo_merge.nim Aristo lazily delete larger subtrees (#2560) 2024-08-14 08:54:44 +00:00
aristo_nearby.nim Revive json tracer unit tests (#2538) 2024-08-01 10:41:20 +00:00
aristo_part.nim Cache a storage root ID forever in the leaf payload of an account (#2551) 2024-08-07 13:28:01 +00:00
aristo_path.nim Allocation-free nibbles buffer (#2406) 2024-06-22 22:33:37 +02:00
aristo_persistent.nim Aristo and ledger small updates (#1888) 2023-11-08 16:52:25 +00:00
aristo_profile.nim use Nim 2.0.6 (#2384) 2024-06-19 01:27:54 +00:00
aristo_serialise.nim avoid some trivial memory allocations (#2587) 2024-09-02 16:03:10 +02:00
aristo_sign.nim No ext update (#2494) 2024-07-16 19:47:59 +00:00
aristo_tx.nim Aristo and kvt balancer management update (#2504) 2024-07-18 21:32:32 +00:00
aristo_utils.nim Cache a storage root ID forever in the leaf payload of an account (#2551) 2024-08-07 13:28:01 +00:00
aristo_vid.nim Remove vertex ID recycle function (#2558) 2024-08-12 20:56:15 +00:00
aristo_walk.nim Aristo and ledger small updates (#1888) 2023-11-08 16:52:25 +00:00

README.md

Aristo Trie -- a Patricia Trie with Merkle hash labeled edges

These data structures allows to overlay the Patricia Trie with Merkel Trie hashes. With a particular layout, the structure is called and Aristo Trie (Patricia = Roman Aristocrat, Patrician.)

This description does assume familiarity with the abstract notion of a hexary Merkle Patricia Trie. Suffice it to say the state of a valid Merkle Patricia Tree is uniquely verified by its top level vertex.

Contents

  1. Deleting entries in a compact Merkle Patricia Tree

The main feature of the Aristo Trie representation is that there are no double used nodes any sub-trie as it happens with the representation as a compact Merkle Patricia Tree. For example, consider the following state data for the latter.

  leaf = (0xf,0x12345678)                                            (1)
  branch = (a,a,a,,, ..) with a = hash(leaf)
  root = hash(branch)

These two nodes, called leaf and branch, and the root hash are a state (aka key-value pairs) representation as a compact Merkle Patricia Tree. The actual state is

  0x0f ==> 0x12345678
  0x1f ==> 0x12345678
  0x2f ==> 0x12345678

The elements from (1) can be organised in a key-value table with the Merkle hashes as lookup keys

  a    -> leaf
  root -> branch

This is a space efficient way of keeping data as there is no duplication of the sub-trees made up by the Leaf node with the same payload 0x12345678 and path snippet 0xf. One can imagine how this property applies to more general sub-trees in a similar fashion.

Now delete some key-value pair of the state, e.g. for the key 0x0f. This amounts to removing the first of the three a hashes from the branch record. The new state of the Merkle Patricia Tree will look like

  leaf = (0xf,0x12345678)                                            (2)
  branch1 = (,a,a,,, ..)
  root1 = hash(branch1)

  a     -> leaf
  root1 -> branch1

A problem arises when all keys are deleted and there is no reference to the leaf data record, anymore. One should find out in general when it can be deleted, too. It might be unknown whether the previous states leading to here had only a single Branch record referencing to this leaf data record.

Finding a stale data record can be achieved by a mark and sweep algorithm, but it becomes too clumsy to be useful on a large state (i.e. database). Reference counts come to mind but maintaining these is generally error prone when actors concurrently manipulate the state (i.e. database).

2. Patricia Trie example with Merkle hash labelled edges

Continuing with the example from chapter 1, the branch node is extended by an additional set of structural identifiers x, w, z. It allows to handle the deletion of entries in a more benign way while keeping the Merkle hashes for validating sub-trees.

A solution for the deletion problem is to represent the situation (1) as

  leaf-a = (0xf,0x12345678) copy of leaf from (1)                    (3)
  leaf-b = (0xf,0x12345678) copy of leaf from (1)
  leaf-c = (0xf,0x12345678) copy of leaf from (1)
  branch2 = ((x,y,z,,, ..)(a,b,c,,, ..))
  root2 = (w,root) with root from (1)

where

  a = hash(leaf-a) same as a from (1)
  b = hash(leaf-b) same as a from (1)
  c = hash(leaf-c) same as a from (1)

  w,x,y,z numbers, mutually different

The records above are stored in a key-value database as

  w -> branch2
  x -> leaf-a
  y -> leaf-b
  z -> leaf-c

Then this structure encodes the key-value pairs as before

  0x0f ==> 0x12345678
  0x1f ==> 0x12345678
  0x2f ==> 0x12345678

Deleting the data for key 0x0f now results in the new state

  leaf-b = (0xf,0x12345678)                                          (4)
  leaf-c = (0xf,0x12345678)
  branch3 = ((,y,z,,, ..)(,b,c,,, ..))

  w -> branch3
  y -> leaf-b
  z -> leaf-c

Due to duplication of the leaf node in (3), no reference count is needed in order to detect stale records cleanly when deleting key 0x0f. Removing this key allows to remove hash a from branch2 as well as also structural key x which will consequently be deleted from the lookup table.

A minor observation is that manipulating a state entry, e.g. changing the payload associated with key 0x0f to

  0x0f ==> 0x987654321

the structural layout of the above trie will not change, that is the indexes w, x, y, z of the table that holds the data records as values. All that changes are values.

  leaf-d = (0xf,0x987654321)                                         (5)
  leaf-b = (0xf,0x12345678)
  leaf-c = (0xf,0x12345678)
  branch3 = ((x,y,z,,, ..)(d,b,c,,, ..))

  root3 = (w,hash(d,b,c,,, ..))

3. Discussion of the examples (1) and (3)

Examples (1) and (3) differ in that the structural Patricia Trie information from (1) has been removed from the Merkle hash instances and implemented as separate table lookup IDs (called vertexIDs later on.) The values of these lookup IDs are arbitrary as long as they are all different.

In fact, the Erigon project discusses a similar situation in Separation of keys and the structure, albeit aiming for a another scenario with the goal of using mostly flat data lookup structures.

A graph for the example (1) would look like

            |
           root
            |
     +-------------+
     |   branch    |
     +-------------+
          | | |
          a a a
          | | |
          leaf

while example (2) has

          (root)                                                     (6)
            |
            w
            |
     +-------------+
     |   branch2   |
     | (a) (b) (c) |
     +-------------+
        /   |   \
       x    y    z
      /     |     \
   leaf-a leaf-b leaf-c

The labels on the edges indicate the downward target of an edge while the round brackets enclose separated Merkle hash information.

This last example (6) can be completely split into structural tree and Merkel hash mapping.

     structural trie              hash map                           (7)
     ---------------              --------
            |                  (root) -> w
            w                     (a) -> x
            |                     (b) -> y
     +-------------+              (c) -> z
     |   branch2   |
     +-------------+
        /   |   \
       x    y    z
      /     |     \
   leaf-a leaf-b leaf-c

4. Patricia Trie node serialisation with Merkle hash labelled edges

The data structure for the Aristo Trie forllows example (7) by keeping structural information separate from the Merkle hash labels. As for teminology,

  • an Aristo Trie is a pair (structural trie, hash map) where
  • the structural trie realises a haxary Patricia Trie containing the payload values in the leaf records
  • the hash map contains the hash information so that this trie operates as a Merkle Patricia Tree.

In order to accommodate for the additional structural elements, a non RLP-based data layout is used for the Branch, Extension, and Leaf containers used in the key-value table that implements the Patricia Trie. It is now called Aristo Trie for this particular data layout.

The structural keys w, x, y, z from the example (3) are called vertexID and implemented as 64 bit values, stored Big Endian in the serialisation.

4.1 Branch record serialisation

    0 +--+--+--+--+--+--+--+--+--+
      |                          |       -- first vertexID
    8 +--+--+--+--+--+--+--+--+--+
      ...                                -- more vertexIDs
      +--+
      |  | ...                           -- nibble path segment
      +--+--+
      |     |                            -- access(16) bitmap
      +--+--+
      |  |                               -- marker(2) + pathSegmentLen(6)
      +--+

    where
      marker(2) is the double bit array 10

For a given index n between 0..15, if the bit at position n of the bit vector access(16) is reset to zero, then there is no n-th structural vertexID. Otherwise one calculates

    the n-th vertexID is at position Vn * 8
    for Vn the number of non-zero bits in the range 0..(n-1) of access(16)

Note that data are stored Big Endian, so the bits 0..7 of access are stored in the right byte of the serialised bitmap.

The nibble path segment of the Branch record is compact encoded or missing, at all. So an empty nibble path has at least one byte. The first path byte P0 has bit 5 reset, i.e. P0 and 0x20 is zero (bit 4 is set if the right nibble is the first part of the path.)

4.2 Leaf record serialisation

    0 +-- ..
      ...                                -- payload (may be empty)
      +--+
      |  | ...                           -- path segment
      +--+
      || |                               -- marker(2) + pathSegmentLen(6)
      +--+

    where
      marker(2) is the double bit array 11

A Leaf record path segment is compact encoded. So it has at least one byte. The first byte P0 has bit 5 set, i.e. P0 and 0x20 is non-zero (bit 4 is also set if the right nibble is the first part of the path.)

If present, the serialisation of the payload field can be either for account data, for RLP encoded or for unstructured data as defined below.

4.3 Leaf record payload serialisation for account data

    0 +-- ..  --+
      |         |                        -- nonce, 0 or 8 bytes
      +-- ..  --+--+
      |            |                     -- balance, 0, 8, or 32 bytes
      +-- ..  --+--+
      |         |                        -- storage ID, 0 or 8 bytes
      +-- ..  --+--+
      |            |                     -- code hash, 0, 8 or 32 bytes
      +--+ .. --+--+
      |  |                               -- 4 x bitmask(2), word array
      +--+

    where each bitmask(2)-word array entry defines the length of
    the preceeding data fields:
      00 -- field is missing
      01 -- field length is 8 bytes
      10 -- field length is 32 bytes

Apparently, entries 0 and and 2 of the 4 x bitmask(2) word array cannot have the two bit value 10 as they refer to the nonce and the storage ID data fields. So, joining the 4 x bitmask(2) word array to a single byte, the maximum value of that byte is 0x99.

4.4 Leaf record payload serialisation for unstructured data

    0 +--+ .. --+
      |  |      |                        -- data, at least one byte
      +--+ .. --+
      |  |                               -- marker(8), 0x6b
      +--+

    where
      marker(8) is the eight bit array *0110-1011*

4.5 Serialisation of the top used vertex ID

    0 +--+--+--+--+--+--+--+--+
      |                       |          -- last used vertex IDs
    8 +--+--+--+--+--+--+--+--+
      |  |                               -- marker(8), 0x7c
      +--+

    where
      marker(8) is the eight bit array *0111-1100*

The vertex IDs in this record must all be non-zero. The last entry in the list indicates that all ID values greater or equal than this value are free and can be used as vertex IDs. If this record is missing, the value (1u64,0x01) is assumed, i.e. the list with the single vertex ID 1.

4.6 Serialisation of a last saved state record

     0 +--+--+--+--+--+ .. --+--+ .. --+
       |                               | -- 32 bytes state hash
    32 +--+--+--+--+--+ .. --+--+ .. --+
       |                       |         -- state number/block number
    40 +--+--+--+--+--+--+--+--+
       |  |                              -- marker(8), 0x7f
       +--+

    where
      marker(8) is the eight bit array *0111-111f*

4.7 Serialisation record identifier tags

Any of the above records can uniquely be identified by its trailing marker, i.e. the last byte of a serialised record.

** Bit mask** Hex value Record type Chapter reference
10xx xxxx 0x80 + x(6) Branch record 4.1
11xx xxxx 0xC0 + x(6) Leaf record 4.2
0xxx 0yyy (x(3)<<4) + y(3) Account payload 4.3
0110 1011 0x6b Unstructured payload 4.4
0111 1100 0x7c Last used vertex ID 4.5
0111 1111 0x7f Last saved state 4.6

5. Patricia Trie implementation notes

5.1 Database decriptor representation

    ^      +----------+
    |      | top      |   active delta layer, application cache
    |      +----------+
    |      +----------+   ^
   db      | stack[n] |   |
   desc    |    :     |   |  optional passive delta layers, handled by
   obj     | stack[1] |   |  transaction management (can be used to
    |      | stack[0] |   |  successively recover the top layer)
    |      +----------+   v
    |      +----------+
    |      | balancer |   optional read-only backend filter
    |      +----------+
    |      +----------+
    |      | backend  |   optional physical key-value backend database
    v      +----------+

There is a three tier access to a key-value database entry as in

    top -> balancer -> backend

where only the top layer is obligatory.

5.2 Distributed access using the same backend

There can be many descriptors for the same database. Due to delta layers and filters, each descriptor instance can work with a different state of the database.

Although there is only one of the instances that can write the current state on the physical backend database, this priviledge can be shifted to any instance for the price of updating the roFiters for all other instances.

Example:

    db1   db2   db3       -- db1, db2, .. database descriptors/handles
     |     |     |
    tx1   tx2   tx3       -- tx1, tx2, ..transaction/top layers
     |     |     |
     ø     ø     ø        -- no backend filters yet
      \    |    /
       \   |   /
          PBE             -- physical backend database

After collapse/committing tx1 and saving it to the physical backend database, the above architecture mutates to

    db1   db2   db3
     |     |     |
     ø    tx2   tx3
     |     |     |
     ø   ~tx1  ~tx1       -- filter reverting the effect of tx1 on PBE
      \    |    /
       \   |   /
        tx1+PBE           -- tx1 merged into physical backend database

When looked at descriptor API there are no changes when accessing data via db1, db2, or db3. In a different, more algebraic notation, the above tansformation is written as

    | tx1, ø |                                                   (8)
    | tx2, ø | PBE
    | tx3, ø |

        ||
        \/

    |  ø,    ø  |                                                (9)
    | tx2, ~tx1 | tx1+PBE
    | tx3, ~tx1 |

The system can be further converted without changing the API by committing and saving tx2 on the middle line of matrix (9)

    |  ø,       ø  |                                             (10)
    |  ø, tx2+~tx1 | tx1+PBE
    | tx3,    ~tx1 |

        ||
        \/

    |  ø,       ~(tx2+~tx1) |                                    (11)
    |  ø,               ø   | (tx2+~tx1)+tx1+PBE
    | tx3, ~tx1+~(tx2+~tx1) |

The + notation just means the repeated application of filters in left-to-right order. The notation looks like algebraic group notation but this will not be analysed further as there is no need for a general theory for the current implementation.

Suffice to say that the inverse ~tx of tx is calculated against the current state of the physical backend database which makes it messy to formulate boundary conditions.

Nevertheless, (8) can alse be transformed by committing and saving tx2 (rather than tx1.) This gives

    | tx1, ~tx2 |                                                (12)
    |  ø,    ø  | tx2+PBE
    | tx3, ~tx2 |

        ||
        \/

    |  ø, (tx1+~tx2) |                                           (13)
    |  ø,        ø   | tx2+PBE
    | tx3,     ~tx2  |

As (11) and (13) represent the same API, one has

    tx2+PBE =~ tx1+(tx2+~tx1)+PBE    because of the middle rows  (14)
    ~tx2    =~ ~tx1+~(tx2+~tx1)      because of (14)             (15)

which looks like some distributive property in (14) and commutative property in (15) for this example (but it is not straight algebraically.) The =~ operator above indicates that the representations are equivalent in the sense that they have the same effect on the backend database (looks a bit like residue classes.)

It might be handy for testing/verifying an implementation using this example.