From 8f21cf48a8cf9295f3a95d2909af011b11ec5ae6 Mon Sep 17 00:00:00 2001 From: Jordan Hrycaj Date: Mon, 14 Aug 2023 19:39:10 +0100 Subject: [PATCH] Distributed access architecture specs (#1680) why: Manage access to different MPTs via the same database (makes sense only if the MPTs are not too different.) --- nimbus/db/aristo/README.md | 167 ++++++++++++++++++++++++++++++++----- 1 file changed, 144 insertions(+), 23 deletions(-) diff --git a/nimbus/db/aristo/README.md b/nimbus/db/aristo/README.md index efd68caba..d1c11ce04 100644 --- a/nimbus/db/aristo/README.md +++ b/nimbus/db/aristo/README.md @@ -205,7 +205,7 @@ in the key-value table that implements the *Patricia Trie*. It is now called The structural keys *w, x, y, z* from the example *(3)* are called *vertexID* and implemented as 64 bit values, stored *Big Endian* in the serialisation. -### Branch record serialisation +### 4.1 Branch record serialisation 0 +--+--+--+--+--+--+--+--+--+ | | -- first vertexID @@ -230,7 +230,7 @@ vector *access(16)* is reset to zero, then there is no *n*-th structural Note that data are stored *Big Endian*, so the bits *0..7* of *access* are stored in the right byte of the serialised bitmap. -### Extension record serialisation +### 4.2 Extension record serialisation 0 +--+--+--+--+--+--+--+--+--+ | | -- vertexID @@ -250,7 +250,7 @@ zero (bit 4 is set if the right nibble is the first part of the path.) Note that the *pathSegmentLen(6)* is redunant as it is determined by the length of the extension record (as *recordLen - 9*.) -### Leaf record serialisation +### 4.3 Leaf record serialisation 0 +-- .. ... -- payload (may be empty) @@ -270,52 +270,52 @@ also set if the right nibble is the first part of the path.) If present, the serialisation of the payload field can be either for account data, for RLP encoded or for unstructured data as defined below. -### Leaf record payload serialisation for account data +### 4.4 Leaf record payload serialisation for account data 0 +-- .. --+ - | | -- nonce, 0 or 8 bytes - +-- .. --+--+ - | | -- balance, 0, 8, or 32 bytes - +-- .. --+--+ - | | -- storage ID, 0 or 8 bytes - +-- .. --+--+ - | | -- code hash, 0, 8 or 32 bytes + | | -- nonce, 0 or 8 bytes + +-- .. --+--+ + | | -- balance, 0, 8, or 32 bytes + +-- .. --+--+ + | | -- storage ID, 0 or 8 bytes + +-- .. --+--+ + | | -- code hash, 0, 8 or 32 bytes +--+ .. --+--+ | | -- bitmask(2)-word array +--+ where each bitmask(2)-word array entry defines the length of - the preceeding data fields: - 00 -- field is missing - 01 -- field lengthh is 8 bytes - 10 -- field lengthh is 32 bytes + the preceeding data fields: + 00 -- field is missing + 01 -- field lengthh is 8 bytes + 10 -- field lengthh is 32 bytes Apparently, entries 0 and and 2 of the bitmask(2) word array cannot have the value 10 as they refer to the nonce and the storage ID data fields. So, joining the bitmask(2)-word array to a single byte, the maximum value of that byte is 0x99. -### Leaf record payload serialisation for RLP encoded data +### 4.5 Leaf record payload serialisation for RLP encoded data 0 +--+ .. --+ - | | | -- data, at least one byte - +--+ .. --+ + | | | -- data, at least one byte + +--+ .. --+ | | -- marker byte +--+ where the marker byte is 0xaa -### Leaf record payload serialisation for unstructured data +### 4.6 Leaf record payload serialisation for unstructured data 0 +--+ .. --+ - | | | -- data, at least one byte - +--+ .. --+ + | | | -- data, at least one byte + +--+ .. --+ | | -- marker byte +--+ where the marker byte is 0xff -### Descriptor record serialisation +### 4.7 Descriptor record serialisation 0 +-- .. ... -- recycled vertexIDs @@ -330,9 +330,130 @@ the bitmask(2)-word array to a single byte, the maximum value of that byte is Currently, the descriptor record only contains data for producing unique vectorID values that can be used as structural keys. If this descriptor is -missing, the value `(0x40000000,0x01)` is assumed. The last vertexID in the +missing, the value *(0x40000000,0x01)* is assumed. The last vertexID in the descriptor list has the property that that all values greater or equal than this value can be used as vertexID. The vertexIDs in the descriptor record must all be non-zero and record itself should be allocated in the structural table associated with the zero key. + +5. *Patricia Trie* implementation notes +--------------------------------------- + +### 5.1 Database decriptor representation + + ^ +----------+ + | | top | active delta layer, application cache + | +----------+ + | +----------+ ^ + db | stack[n] | | + desc | : | | optional passive delta layers, handled by + obj | stack[1] | | transaction management (can be used to + | | stack[0] | | successively replace the top layer) + | +----------+ v + | +----------+ + | | roFilter | optional read-only backend filter + | +----------+ + | +----------+ + | | backend | optional physical key-value backend database + v +----------+ + + There is a three tier access to a key-value database entry as in + + top -> roFilter -> backend + +where only the *top* layer is obligatory. + +### 5.2 Distributed access using the same backend + +There can be many descriptors for the same database. Due to delta layers and +filters, each descriptor instance can work with a different state of the +database. + +Although there is only one of the instances that can write the current state +on the physical backend database, this priviledge can be shifted to any +instance for the price of updating the *roFiters* for all other instances. + +#### Example: + + db1 db2 db3 -- db1, db2, .. database descriptors/handles + | | | + tx1 tx2 tx3 -- tx1, tx2, ..transaction/top layers + | | | + ø ø ø -- no backend filters yet + \ | / + \ | / + PBE -- physical backend database + +After collapse/committing *tx1* and saving it to the physical backend +database, the above architecture mutates to + + db1 db2 db3 + | | | + ø tx2 tx3 + | | | + ø ~tx1 ~tx1 -- filter reverting the effect of tx1 on PBE + \ | / + \ | / + tx1+PBE -- tx1 merged into physical backend database + +When looked at descriptor API there are no changes when accessing data via +*db1*, *db2*, or *db3*. In a different, more algebraic notation, the above +tansformation is written as + + | tx1, ø | (8) + | tx2, ø | PBE + | tx3, ø | + + || + \/ + + | ø, ø | (9) + | tx2, ~tx1 | tx1+PBE + | tx3, ~tx1 | + + The system can be further converted without changing the API by committing + and saving *tx2* on the middle line of matrix (9) + + | ø, ø | (10) + | ø, tx2+~tx1 | tx1+PBE + | tx3, ~tx1 | + + || + \/ + + | ø, ~(tx2+~tx1) | (11) + | ø, ø | (tx2+~tx1)+tx1+PBE + | tx3, ~tx1+~(tx2+~tx1) | + +The *+* notation just means the repeated application of filters in +left-to-right order. The notation looks like algebraic group notation but this +will not be analysed further as there is no need for a general theory for the +current implementation. + +Suffice to say that the inverse *~tx* of *tx* is calculated against the +current state of the physical backend database which makes it messy to +formulate boundary conditions. + +Nevertheless, *(8)* can alse be transformed by committing and saving *tx2* +(rather than *tx1*.) This gives + + | tx1, ~tx2 | (12) + | ø, ø | tx2+PBE + | tx3, ~tx2 | + + || + \/ + + | ø, (tx1+~tx2) | (13) + | ø, ø | tx2+PBE + | tx3, ~tx2 | + + As *(11)* and *(13)* represent the same API, one has + + tx2+PBE == tx1+(tx2+~tx1)+PBE because of the middle rows (14) + ~tx2 == ~tx1+~(tx2+~tx1) because of (14) (15) + + which shows some distributive property in *(14)* and commutative property in + *(15)* for this example. In particulat it might be handy for testing/verifying + against this example.