Distributed access architecture specs (#1680)

why: Manage access to different MPTs via the same database (makes sense only if the MPTs are not too different.)
2023-08-14 19:39:10 +01:00 · 2023-08-14 19:39:10 +01:00 · 8f21cf48a8
parent 01fe172738
commit 8f21cf48a8
1 changed files with 144 additions and 23 deletions
--- a/nimbus/db/aristo/README.md
+++ b/nimbus/db/aristo/README.md
@ -205,7 +205,7 @@ in the key-value table that implements the *Patricia Trie*. It is now called
 The structural keys *w, x, y, z* from the example *(3)* are called *vertexID*
 and implemented as 64 bit values, stored *Big Endian* in the serialisation.

-### Branch record serialisation
+### 4.1 Branch record serialisation

        0 +--+--+--+--+--+--+--+--+--+
          |                          |       -- first vertexID
@ -230,7 +230,7 @@ vector *access(16)* is reset to zero, then there is no *n*-th structural
 Note that data are stored *Big Endian*, so the bits *0..7* of *access* are
 stored in the right byte of the serialised bitmap.

-### Extension record serialisation
+### 4.2 Extension record serialisation

        0 +--+--+--+--+--+--+--+--+--+
          |                          |       -- vertexID
@ -250,7 +250,7 @@ zero (bit 4 is set if the right nibble is the first part of the path.)
 Note that the *pathSegmentLen(6)* is redunant as it is determined by the length
 of the extension record (as *recordLen - 9*.)

-### Leaf record serialisation
+### 4.3 Leaf record serialisation

        0 +-- ..
          ...                                -- payload (may be empty)
@ -270,52 +270,52 @@ also set if the right nibble is the first part of the path.)
 If present, the serialisation of the payload field can be either for account
 data, for RLP encoded or for unstructured data as defined below.

-### Leaf record payload serialisation for account data
+### 4.4 Leaf record payload serialisation for account data

        0 +-- ..  --+
-		  |         |                        -- nonce, 0 or 8 bytes
-		  +-- ..  --+--+
-		  |            |                     -- balance, 0, 8, or 32 bytes
-		  +-- ..  --+--+
-		  |         |                        -- storage ID, 0 or 8 bytes
-		  +-- ..  --+--+
-		  |            |                     -- code hash, 0, 8 or 32 bytes
+          |         |                        -- nonce, 0 or 8 bytes
+          +-- ..  --+--+
+          |            |                     -- balance, 0, 8, or 32 bytes
+          +-- ..  --+--+
+          |         |                        -- storage ID, 0 or 8 bytes
+          +-- ..  --+--+
+          |            |                     -- code hash, 0, 8 or 32 bytes
          +--+ .. --+--+
          |  |                               -- bitmask(2)-word array
          +--+

        where each bitmask(2)-word array entry defines the length of
-		the preceeding data fields:
-		  00 -- field is missing
-		  01 -- field lengthh is 8 bytes
-		  10 -- field lengthh is 32 bytes
+        the preceeding data fields:
+          00 -- field is missing
+          01 -- field lengthh is 8 bytes
+          10 -- field lengthh is 32 bytes

 Apparently, entries 0 and and 2 of the bitmask(2) word array cannot have the
 value 10 as they refer to the nonce and the storage ID data fields. So, joining
 the bitmask(2)-word array to a single byte, the maximum value of that byte is
 0x99.

-### Leaf record payload serialisation for RLP encoded data
+### 4.5 Leaf record payload serialisation for RLP encoded data

        0 +--+ .. --+
-		  |  |      |                        -- data, at least one byte
-		  +--+ .. --+
+          |  |      |                        -- data, at least one byte
+          +--+ .. --+
          |  |                               -- marker byte
          +--+

        where the marker byte is 0xaa

-### Leaf record payload serialisation for unstructured data
+### 4.6 Leaf record payload serialisation for unstructured data

        0 +--+ .. --+
-		  |  |      |                        -- data, at least one byte
-		  +--+ .. --+
+          |  |      |                        -- data, at least one byte
+          +--+ .. --+
          |  |                               -- marker byte
          +--+

        where the marker byte is 0xff

-### Descriptor record serialisation
+### 4.7 Descriptor record serialisation

        0 +-- ..
          ...                                -- recycled vertexIDs
@ -330,9 +330,130 @@ the bitmask(2)-word array to a single byte, the maximum value of that byte is

 Currently, the descriptor record only contains data for producing unique
 vectorID values that can be used as structural keys. If this descriptor is
-missing, the value `(0x40000000,0x01)` is assumed. The last vertexID in the
+missing, the value *(0x40000000,0x01)* is assumed. The last vertexID in the
 descriptor list has the property that that all values greater or equal than
 this value can be used as vertexID.

 The vertexIDs in the descriptor record must all be non-zero and record itself
 should be allocated in the structural table associated with the zero key.
+
+5. *Patricia Trie* implementation notes
+---------------------------------------
+
+### 5.1 Database decriptor representation
+
+        ^      +----------+
+        |      | top      |   active delta layer, application cache
+        |      +----------+
+        |      +----------+   ^
+       db      | stack[n] |   |
+       desc    |    :     |   |  optional passive delta layers, handled by
+       obj     | stack[1] |   |  transaction management (can be used to
+        |      | stack[0] |   |  successively replace the top layer)
+        |      +----------+   v
+        |      +----------+
+        |      | roFilter |   optional read-only backend filter
+        |      +----------+
+        |      +----------+
+        |      | backend  |   optional physical key-value backend database
+        v      +----------+
+
+ There is a three tier access to a key-value database entry as in
+
+        top -> roFilter -> backend
+
+where only the *top* layer is obligatory.
+
+### 5.2 Distributed access using the same backend
+
+There can be many descriptors for the same database. Due to delta layers and
+filters, each descriptor instance can work with a different state of the
+database.
+
+Although there is only one of the instances that can write the current state
+on the physical backend database, this priviledge can be shifted to any
+instance for the price of updating the *roFiters* for all other instances.
+
+#### Example:
+
+        db1   db2   db3       -- db1, db2, .. database descriptors/handles
+         |     |     |
+        tx1   tx2   tx3       -- tx1, tx2, ..transaction/top layers
+         |     |     |
+         ø     ø     ø        -- no backend filters yet
+          \    |    /
+           \   |   /
+              PBE             -- physical backend database
+
+After collapse/committing *tx1* and saving it to the physical backend
+database, the above architecture mutates to
+
+        db1   db2   db3
+         |     |     |
+         ø    tx2   tx3
+         |     |     |
+         ø   ~tx1  ~tx1       -- filter reverting the effect of tx1 on PBE
+          \    |    /
+           \   |   /
+            tx1+PBE           -- tx1 merged into physical backend database
+
+When looked at descriptor API there are no changes when accessing data via
+*db1*, *db2*, or *db3*. In a different, more algebraic notation, the above
+tansformation is written as
+ 
+        | tx1, ø |                                                   (8)
+        | tx2, ø | PBE
+        | tx3, ø |
+
+            ||
+            \/
+
+        |  ø,    ø  |                                                (9)
+        | tx2, ~tx1 | tx1+PBE
+        | tx3, ~tx1 |
+
+ The system can be further converted without changing the API by committing
+ and saving *tx2* on the middle line of matrix (9)
+ 
+        |  ø,       ø  |                                             (10)
+        |  ø, tx2+~tx1 | tx1+PBE
+        | tx3,    ~tx1 |
+
+            ||
+            \/
+
+        |  ø,       ~(tx2+~tx1) |                                    (11)
+        |  ø,               ø   | (tx2+~tx1)+tx1+PBE
+        | tx3, ~tx1+~(tx2+~tx1) |
+
+The *+* notation just means the repeated application of filters in
+left-to-right order. The notation looks like algebraic group notation but this
+will not be analysed further as there is no need for a general theory for the
+current implementation.
+
+Suffice to say that the inverse *~tx* of *tx* is calculated against the
+current state of the physical backend database which makes it messy to
+formulate boundary conditions.
+
+Nevertheless, *(8)* can alse be transformed by committing and saving *tx2*
+(rather than *tx1*.) This gives
+
+        | tx1, ~tx2 |                                                (12)
+        |  ø,    ø  | tx2+PBE
+        | tx3, ~tx2 |
+
+            ||
+            \/
+
+        |  ø, (tx1+~tx2) |                                           (13)
+        |  ø,        ø   | tx2+PBE
+        | tx3,     ~tx2  |
+
+ As *(11)* and *(13)* represent the same API, one has
+
+        tx2+PBE == tx1+(tx2+~tx1)+PBE    because of the middle rows  (14)
+        ~tx2    == ~tx1+~(tx2+~tx1)      because of (14)             (15)
+
+ which shows some distributive property in *(14)* and commutative property in
+ *(15)* for this example. In particulat it might be handy for testing/verifying
+ against this example.