Each branch node may have up to 16 sub-items - currently, these are
given VertexID based when they are first needed leading to a
mostly-random order of vertexid for each subitem.
Here, we pre-allocate all 16 vertex ids such that when a branch subitem
is filled, it already has a vertexid waiting for it. This brings several
important benefits:
* subitems are sorted and "close" in their id sequencing - this means
that when rocksdb stores them, they are likely to end up in the same
data block thus improving read efficiency
* because the ids are consequtive, we can store just the starting id and
a bitmap representing which subitems are in use - this reduces disk
space usage for branches allowing more of them fit into a single disk
read, further improving disk read and caching performance - disk usage
at block 18M is down from 84 to 78gb!
* the in-memory footprint of VertexRef reduced allowing more instances
to fit into caches and less memory to be used overall.
Because of the increased locality of reference, it turns out that we no
longer need to iterate over the entire database to efficiently generate
the hash key database because the normal computation is now faster -
this significantly benefits "live" chain processing as well where each
dirtied key must be accompanied by a read of all branch subitems next to
it - most of the performance benefit in this branch comes from this
locality-of-reference improvement.
On a sample resync, there's already ~20% improvement with later blocks
seeing increasing benefit (because the trie is deeper in later blocks
leading to more benefit from branch read perf improvements)
```
blocks: 18729664, baseline: 190h43m49s, contender: 153h59m0s
Time (total): -36h44m48s, -19.27%
```
Note: clients need to be resynced as the PR changes the on-disk format
R.I.P. little bloom filter - your life in the repo was short but
valuable
When `nimbus import` runs, we end up with a database without MPT roots
leading to long startup times the first time one is needed.
Computing the state root is slow because the on-disk order based on
VertexID sorting does not match the trie traversal order and therefore
makes lookups inefficent.
Here we introduce a helper that speeds up this computation by traversing
the trie in on-disk order and computing the trie hashes bottom up
instead - even though this leads to some redundant reads of nodes that
we cannot yet compute, it's still a net win as leaves and "bottom"
branches make up the majority of the database.
This PR also addresses a few other sources of inefficiency largely due
to the separation of AriKey and AriVtx into their own column families.
Each column family is its own LSM tree that produces hundreds of SST
filtes - with a limit of 512 open files, rocksdb must keep closing and
opening files which leads to expensive metadata reads during random
access.
When rocksdb makes a lookup, it has to read several layers of files for
each lookup. Ribbon filters to skip over files that don't have the
requested data but when these filters are not in memory, reading them is
slow - this happens in two cases: when opening a file and when the
filter has been evicted from the LRU cache. Addressing the open file
limit solves one source of inefficiency, but we must also increase the
block cache size to deal with this problem.
* rocksdb.max_open_files increased to 2048
* per-file size limits increased so that fewer files are created
* WAL size increased to avoid partial flushes which lead to small files
* rocksdb block cache increased
All these increases of course lead to increased memory usage, but at
least performance is acceptable - in the future, we'll need to explore
options such as joining AriVtx and AriKey and/or reducing the row count
(by grouping branch layers under a single vertexid).
With this PR, the mainnet state root can be computed in ~8 hours (down
from 2-3 days) - not great, but still better.
Further, we write all keys to the database, also those that are less
than 32 bytes - because the mpt path is part of the input, it is very
rare that we actually hit a key like this (about 200k such entries on
mainnet), so the code complexity is not worth the benefit really, in the
current database layout / design.
* Aristo: Merge `delta_siblings` module into `deltaPersistent()`
* Aristo: Add `isEmpty()` for canonical checking whether a layer is empty
* Aristo: Merge `LayerDeltaRef` into `LayerObj`
why:
No need to maintain nested object refs anymore. Previously the
`LayerDeltaRef` object had a companion `LayerFinalRef` which held
non-delta layer information.
* Kvt: Merge `LayerDeltaRef` into `LayerRef`
why:
No need to maintain nested object refs (as with `Aristo`)
* Kvt: Re-write balancer logic similar to `Aristo`
why:
Although `Kvt` was a cheap copy of `Aristo` it sort of got out of
sync and the balancer code was wrong.
* Update iterator over forked peers
why:
Yield additional field `isLast` indicating that the last iteration
cycle was approached.
* Optimise balancer calculation.
why:
One can often avoid providing a new object containing the merge of two
layers for the balancer. This avoids copying tables. In some cases this
is replaced by `hasKey()` look ups though. One uses one of the two
to combine and merges the other into the first.
Of course, this needs some checks for making sure that none of the
components to merge is eventually shared with something else.
* Fix copyright year
The state and account MPT:s currenty share key space in the database
based on that vertex id:s are assigned essentially randomly, which means
that when two adjacent slot values from the same contract are accessed,
they might reside at large distance from each other.
Here, we prefix each vertex id by its root causing them to be sorted
together thus bringing all data belonging to a particular contract
closer together - the same effect also happens for the main state MPT
whose nodes now end up clustered together more tightly.
In the future, the prefix given to the storage keys can also be used to
perform range operations such as reading all the storage at once and/or
deleting an account with a batch operation.
Notably, parts of the API already supported this rooting concept while
parts didn't - this PR makes the API consistent by always working with a
root+vid.
* Remove all journal related stuff
* Refactor function names journal*() => delta*(), filter*() => delta*()
* remove `trg` fileld from `FilterRef`
why:
Same as `kMap[$1]`
* Re-type FilterRef.src as `HashKey`
why:
So it is directly comparable to `kMap[$1]`
* Moved `vGen[]` field from `LayerFinalRef` to `LayerDeltaRef`
why:
Then a separate `FilterRef` type is not needed, anymore
* Rename `roFilter` field in `AristoDbRef` => `balancer`
why:
New name more appropriate.
* Replace `FilterRef` by `LayerDeltaRef` type
why:
This allows to avoid copying into the `balancer` (see next patch set)
most of the time. Typically, only one instance is running on the backend
and the `balancer` is only used as a stage before saving data.
* Refactor way how to store data persistently
why:
Avoid useless copy when staging `top` layer for persistently saving to
backend.
* Fix copyright header?
* Code cosmetics
* Aristo+Kvt: Fix api wrappers
why:
Api setup killed the backend descriptor when backend mapping was
disabled.
* Aristo: Implement masked profiling entries
why:
Database backend should be listed but not counted in tally
* CoreDb: Simplify backend() methods
why:
DBMS backend access Was provided very early and over engineered. Now
there are only two backend machines, one for `Kvt` and the other one
for an `Mpt` available only via new API.
* CoreDb: Code cleanup regarding descriptor types
* CoreDb: Refactor/redefine `persistent()` methods
why:
There were `persistent()` methods for any type of caching storage
facilities `Kvt`, `Mpt`, `Phk`, and `Acc`. Now there is only a single
`persistent()` method storing all facilities in tandem (similar to
how transactions work.)
For non shared `Kvt` tables, there is now an extra storage method
`saveOffSite()`.
* CoreDb lingo update: `trie` becomes `column`
why:
Notion of a `trie` is pretty much hidden by the new `CoreDb` api.
Revealed are sort of database columns for accounts an storage data,
any of which have an internal state represented by a Keccack hash.
So a `trie` or `MPT` becomes a `column` and a `rootHash` becomes a
column state.
* Aristo: rename backend filed `filters` => `journal`
* Update full sync logging
details:
+ Disable eth handler noise while syncing
+ Log journal depth (if available)
* Fix copyright year
* Fix cruft and unwanted imports
* Aristo: Update unit test suite
* Aristo/Kvt: Fix iterators
why:
Generic iterators were not properly updated after backend change
* Aristo: Add sub-trie deletion functionality
why:
For storage tries linked to an account payload vertex ID, a the
whole storage trie needs to be deleted with the account.
* Aristo: Reserve vertex ID numbers for static custom state roots
why:
Static custom state roots may be controlled by an application,
e.g. for a receipt or a transaction root. The `Aristo` functions
are agnostic of what the static state roots are when different
from the internal tree vertex ID 1.
details;
The `merge()` function applied to a non-static state root (assumed
to be a storage root) will check the payload of an accounts leaf
and mark its Merkle keys to be re-checked.
* Aristo: Correct error code symbol
* Aristo: Update error code symbols
* Aristo: Code cosmetics/comments
* Aristo: Fix hashify schedule calculator
why:
Had a tendency to stop early leaving an incomplete job
* Update KVT layers abstraction
details:
modelled after Aristo layers
* Simplified KVT database iterators (removed item counters)
why:
Not needed for production functions
* Simplify KVT merge function `layersCc()`
* Simplified Aristo database iterators (removed item counters)
why:
Not needed for production functions
* Update failure condition for hash labels compiler `hashify()`
why:
Node need not be rejected as long as links are on the schedule. In
that case, `redo[]` is to become `wff.base[]` at a later stage.
* Update merging layers and label update functions
why:
+ Merging a stack of layers with `layersCc()` could be simplified
+ Merging layers will optimise the reverse `kMap[]` table maps
`pAmk: label->{vid, ..}` by deleting empty mappings `label->{}` where
they are redundant.
+ Updated `layersPutLabel()` for optimising `pAmk[]` tables
* Fix kvt headers
* Provide differential layers for KVT transaction stack
why:
Significant performance improvement
* Provide abstraction layer for database top cache layer
why:
This will eventually implemented as a differential database layers
or transaction layers. The latter is needed to improve performance.
behavioural changes:
Zero vertex and keys (i.e. delete requests) are not optimised out
until the last layer is written to the database.
* Provide differential layers for Aristo transaction stack
why:
Significant performance improvement
* Fix debug noise in `hashify()` for perfectly normal situation
why:
Was previously considered a fixable error
* Fix test sample file names
why:
The larger test file `goerli68161.txt.gz` is already in the local
archive. So there is no need to use the smaller one from the external
repo.
* Activate `accounts_cache` module from `db/ledger`
why:
A copy of the original `accounts_cache.nim` source to be integrated
into the `Ledger` module wrapper which allows to switch between
different `accounts_cache` implementations unser tha same API.
details:
At a later state, the `db/accounts_cache.nim` wrapper will be
removed so that there is only one access to that module via
`db/ledger/accounts_cache.nim`.
* Fix copyright headers in source code
* Kvt: Implemented multi-descriptor access on the same backend
why:
This behaviour mirrors the one of Aristo and can be used for
simultaneous transactions on Aristo + Kvt
* Kvt: Update database iterators
why:
Forgot to run on the top layer first
* Kvt: Misc fixes
* Aristo, use `openArray[byte]` rather than `Blob` in prototype
* Aristo, by default hashify right after cloning descriptor
why:
Typically, a completed descriptor is expected after cloning. Hashing
can be suppressed by argument flag.
* Aristo provides `replicate()` iterator, similar to legacy `replicate()`
* Aristo API fixes and updates
* CoreDB: Rename `legacy_persistent` => `legacy_rocksdb`
why:
More systematic, will be in line with Aristo DB which might have
more than one persistent backends
* CoreDB: Prettify API sources
why:
Better to read and maintain
details:
Annotating with custom pragmas which cleans up the prototypes
* CoreDB: Update MPT/put() prototype allowing `CatchableError`
why:
Will be needed for Aristo API (legacy is OK with `RlpError`)
* Update docu
* Update Aristo/Kvt constructor prototype
why:
Previous version used an `enum` value to indicate what backend is to
be used. This was replaced by using the backend object type.
* Rewrite `hikeUp()` return code into `Result[Hike,(Hike,AristoError)]`
why:
Better code maintenance. Previously, the `Hike` object was returned. It
had an internal error field so partial success was also available on
a failure. This error field has been removed.
* Use `openArray[byte]` rather than `Blob` in functions prototypes
* Provide synchronised multi instance transactions
why:
The `CoreDB` object was geared towards the legacy DB which used a single
transaction for the key-value backend DB. Different state roots are
provided by the backend database, so all instances work directly on the
same backend.
Aristo db instances have different in-memory mappings (aka different
state roots) and the transactions are on top of there mappings. So each
instance might run different transactions.
Multi instance transactions are a compromise to converge towards the
legacy behaviour. The synchronised transactions span over all instances
available at the time when base transaction was opened. Instances
created later are unaffected.
* Provide key-value pair database iterator
why:
Needed in `CoreDB` for `replicate()` emulation
also:
Some update of internal code
* Extend API (i.e. prototype variants)
why:
Needed for `CoreDB` geared towards the legacy backend which has a more
basic API than Aristo.
* Set scheduler state as part of the backend descriptor
details:
Moved type definitions `QidLayoutRef` and `QidSchedRef` to
`desc_structural.nim` so that it shares the same folder as
`desc_backend.nim`
* Automatic filter queue table initialisation in backend
details:
Scheduler can be tweaked or completely disabled
* Updated backend unit tests
details:
+ some code clean up/beautification, reads better now
+ disabled persistent filters so that there is no automated filter
management which will be implemented next
* Prettify/update unit tests source code
details:
Mostly replacing the `check()` paradigm by `xCheck()`
* Somewhat simplified backend type management
why:
Backend objects are labelled with a `BackendType` symbol where the
`BackendVoid` label is implicitly assumed for a `nil` backend object
reference.
To make it easier, a `kind()` function is used now applicable to
`nil` references as well.
* Fix DB storage layout for filter objects
why:
Need to store the filter ID with the object
* Implement reverse [] index on fifo
why:
An integer index argument on `[]` retrieves the QueueID (label) of the
fifo item while a QueueID argument on `[]` retrieves the index (so
it is inverse to the former variant).
* Provide iterator over filters as fifo
why:
This iterator goes along the cascased fifo structure (i.e. in
historical order)
* Rename FilterID => QueueID
why:
The current usage does not identify a particular filter but uses it as
storage tag to manage it on the database (to be organised in a set of
FIFOs or queues.)
* Split `aristo_filter` source into sub-files
why:
Make space for filter management API
* Store filter queue IDs in pairs on the backend
why:
Any pair will will describe a FIFO accessed by bottom/top IDs
* Reorg some source file names
why:
The "aristo_" prefix for make local/private files is tedious to
use, so removed.
* Implement filter slot scheduler
details:
Filters will be stored on the database on cascaded FIFOs. When a FIFO
queue is full, some filter items are bundled together and stored on the
next FIFO.
* Removed dedicated transcoder tests
why:
will implicitely be provided by other tests:
+ encode/write -> hashify -> test_tx
+ decode/read -> merge raw nodes -> test_tx
+ de/blobfiy -> backend operations, taext_tx, test_backend, test_filter
* Clarify how the vertex ID generator state is accessed from the backend
why:
This state is a list of unused vertex IDs. It was just stored somewhere
on the backend which details were exposed when iterating over some
sub-table(s).
As there will be more such single information records, an admin
sub-tables has been defined (formerly ID generator table) with dedicated
access keys and type. Also, the iterator over the single ID generator
state item has been removed. It must be accessed via the `get()`
interface.
* Remove trailing space from file name
why:
fixes windows bail out
* Remove concept of empty/blind filters
why:
Not needed. A non-existent filter is is coded as a nil reference.
* Slightly generalised backend iterators
why:
* VertexID as key for the ID generator state makes no sense
* there will be more tables addressed by non-VertexID keys
* Store serialised/blobified vertices on memory backend
why:
This is more in line with the RocksDB backend so more appropriate
for testing when comparing behaviour. For a speedy memory database,
a backend-less variant should be used.
* Drop the `Aristo` prefix from names `AristoLayerRef`, etc.
* Suppress compiler warning
why:
duplicate imports
* Add filter serialisation transcoder
why:
Will be used as storage format
* Renamed type `NoneBackendRef` => `VoidBackendRef`
* Clarify names: `BE=filter+backend` and `UBE=backend (unfiltered)`
why:
Most functions used full names as `getVtxUnfilteredBackend()` or
`getKeyBackend()`. After defining abbreviations (and its meaning) it
seems easier to use `getVtxUBE()` and `getKeyBE()`.
* Integrate `hashify()` process into transaction logic
why:
Is now transparent unless explicitly controlled.
details:
Cache changes imply setting a `dirty` flag which in turn triggers
`hashify()` processing in transaction and `pack()` directives.
* Removed `aristo_tx.exec()` directive
why:
Inconsistent implementation, functionality will be provided with a
different paradigm.
* Provide deep copy for each transaction layer
why:
Localising changes. Selective deep copy was just overlooked.
* Generalise vertex ID generator state reorg function `vidReorg()`
why:
makes it somewhat easier to handle when saving layers.
* Provide dummy back end descriptor `NoneBackendRef`
* Optional read-only filter between backend and transaction cache
why:
Some staging area for accumulating changes to the backend DB. This
will eventually be an access layer for emulating a backend with
multiple/historic state roots.
* Re-factor `persistent()` with filter between backend/tx-cache => `stow()`
why:
The filter provides an abstraction from the physically stored data on
disk. So, there can be several MPT instances using the same disk data
with different state roots. Of course, all the MPT instances should
not differ too much for practical reasons :).
TODO:
Filter administration tools need to be provided.