A bit unexpectedly, nibble handling shows up in the profiler mainly
because the current impl is tuned towards slicing while the most common
operation is prefix comparison - since the code is simple, we might as
well get rid of some of the excess fat by always aligning the nibbles to
the byte buffer.
Each branch node may have up to 16 sub-items - currently, these are
given a VertexID when they are first needed, leading to a
mostly-random order of vertex ids for each sub-item.
Here, we pre-allocate all 16 vertex ids such that when a branch subitem
is filled, it already has a vertexid waiting for it. This brings several
important benefits:
* subitems are sorted and "close" in their id sequencing - this means
that when rocksdb stores them, they are likely to end up in the same
data block thus improving read efficiency
* because the ids are consecutive, we can store just the starting id and
a bitmap representing which subitems are in use - this reduces disk
space usage for branches, allowing more of them to fit into a single disk
read, further improving disk read and caching performance - disk usage
at block 18M is down from 84 to 78gb!
* the in-memory footprint of VertexRef is reduced, allowing more instances
to fit into caches and less memory to be used overall.
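As a rough illustration of the scheme (names and widths here are assumptions, not the actual Aristo encoding), a branch can be persisted as a starting vertex id plus a 16-bit occupancy bitmap, from which the id of any used sub-item is reconstructed:
```nim
# Illustrative sketch only - `BranchStub`, its fields and widths are
# assumptions, not the actual Aristo on-disk encoding.
type
  VertexID = uint64

  BranchStub = object
    startVid: VertexID  # first of the 16 pre-allocated sub-item ids
    used: uint16        # bit i set <=> sub-item i is present

proc subVid(b: BranchStub, nibble: int): VertexID =
  ## Reconstruct the vertex id of sub-item `nibble` (0..15), if present.
  doAssert nibble in 0..15
  if (b.used and (1'u16 shl nibble)) != 0:
    b.startVid + VertexID(nibble)
  else:
    VertexID(0) # unused sub-item

when isMainModule:
  let b = BranchStub(startVid: VertexID(100), used: 0b101)
  doAssert b.subVid(0) == VertexID(100)
  doAssert b.subVid(2) == VertexID(102)
  doAssert b.subVid(1) == VertexID(0)
```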
Because of the increased locality of reference, it turns out that we no
longer need to iterate over the entire database to efficiently generate
the hash key database because the normal computation is now faster -
this significantly benefits "live" chain processing as well where each
dirtied key must be accompanied by a read of all branch subitems next to
it - most of the performance benefit in this branch comes from this
locality-of-reference improvement.
On a sample resync, there's already ~20% improvement with later blocks
seeing increasing benefit (because the trie is deeper in later blocks
leading to more benefit from branch read perf improvements)
```
blocks: 18729664, baseline: 190h43m49s, contender: 153h59m0s
Time (total): -36h44m48s, -19.27%
```
Note: clients need to be resynced as the PR changes the on-disk format
R.I.P. little bloom filter - your life in the repo was short but
valuable
When walking AriVtx, parsing integers and nibbles actually becomes a
hotspot - these trivial changes reduce CPU usage during initial key
cache computation by ~15%.
Currently, computed hash keys are stored in a separate column family
with respect to the MPT data they're generated from - this has several
disadvantages:
* A lot of space is wasted because the lookup key (`RootedVertexID`) is
repeated in both tables - this is 30% of the `AriKey` content!
* rocksdb must maintain in-memory bloom filters and LRU caches for said
keys, doubling its "minimal efficient cache size"
* An extra disk traversal must be made to check for existence of cached
hash key
* Doubles the amount of files on disk due to each column family being
its own set of files
Here, the two CFs are joined such that both key and data is stored in
`AriVtx`. This means:
* we save ~30% disk space on repeated lookup keys
* we save ~2gb of memory overhead that can be used to cache data instead
of indices
* we can skip storing hash keys for MPT leaf nodes - these are trivial
to compute and waste a lot of space - previously they had to be present in
the `AriKey` CF to avoid having to look in two tables on the happy path.
* There is a small increase in write amplification because when a hash
value is updated for a branch node, we must write both key and branch
data - previously we would write only the key
* There's a small shift in CPU usage - instead of performing lookups in
the database, hashes for leaf nodes are (re)-computed on the fly
* We can return to slightly smaller on-disk SST files since there are
fewer of them, which should reduce disk traffic a bit
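A conceptual sketch of the joined record (the real Aristo/RocksDB serialisation differs; names below are assumptions for illustration): one lookup against `AriVtx` now yields both the vertex and, for branches, its cached hash key, instead of two reads against two column families:
```nim
# Sketch only - not the actual Aristo record encoding.
import std/[hashes, options, tables]

type
  RootedVertexID = tuple[root, vid: uint64]
  JoinedRecord = object
    vtxBlob: seq[byte]            # serialised MPT vertex
    key: Option[array[32, byte]]  # cached hash key (branches only)

var ariVtx: Table[RootedVertexID, JoinedRecord]  # stand-in for the CF

proc fetch(rvid: RootedVertexID): Option[JoinedRecord] =
  ## A single lookup returns vertex data and cached key together;
  ## leaf hash keys are simply recomputed on the fly instead.
  if rvid in ariVtx:
    some(ariVtx[rvid])
  else:
    none(JoinedRecord)
```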
Internally, there are also other advantages:
* when clearing keys, we no longer have to store a zero hash in memory -
instead, we deduce staleness of the cached key from the presence of an
updated VertexRef - this saves ~1gb of mem overhead during import
* hash key cache becomes dedicated to branch keys since leaf keys are no
longer stored in memory, reducing churn
* key computation is a lot faster thanks to the skipped second disk
traversal - a key computation for mainnet can be completed in 11 hours
instead of ~2 days (!) thanks to better cache usage and less read
amplification - with additional improvements to the on-disk format, we
can probably get rid of the initial full traversal method of seeding the
key cache on first start after import
All in all, this PR reduces the size of a mainnet database from 160gb to
110gb and the peak memory footprint during import by ~1-2gb.
This kind of data is not used except in tests where it is used only to
create databases that don't match actual usage of aristo.
Removing it simplifies future optimizations that can focus on processing
specific leaf types more efficiently.
A casualty of this removal is some test code as well as some proof
generation code that is unused - on the surface, it looks like it should
be possible to port both of these to the more specific data types -
doing so would ensure that a database written by one part of the
codebase can interact with the other - as it stands, there is confusion
on this point since using the proof generation code will result in a
database of a shape that is incompatible with the rest of eth1.
This is a minimal set of changes to make things work with the new types
in nim-eth - this is the minimal PR that merely resolves
incompatibilities while the full change set would include more cleanup
and migration.
* batch database key writes during `computeKey` calls
* log progress when there are many keys to update
* avoid evicting the vertex cache when traversing the trie for key
computation purposes
* avoid storing trivial leaf hashes that can be loaded directly from the
vertex
* Add missing leaf cache update when a leaf turns to a branch with two
leaves (on merge) and vice versa (on delete) - this could lead to stale
leaves being returned from the cache causing validation failures - it
didn't happen because the leaf caches were not being used efficiently :)
* Replace `seq` with `ArrayBuf` in `Hike` allowing it to become
allocation-free - this PR also works around an inefficiency in nim in
returning large types via a `var` parameter
* Use the leaf cache instead of `getVtxRc` to fetch recent leaves - this
makes the vertex cache more efficient at caching branches because fewer
leaf requests pass through it.
* move pfx out of variant which avoids pointless field type panic checks
and copies on access
* make `VertexRef` a non-inheritable object which reduces its memory
footprint and simplifies its use - it's also unclear from a semantic
point of view why inheritance makes sense for storing keys
detail:
For practical reasons, if such an account is asked for a slot, an empty
proof list is returned. It is up to the user to provide an account
proof that shows that there is no storage tree.
* pre-allocate `blobify` data and remove redundant error handling
(cannot fail on correct data)
* use threadvar for temporary storage when decoding rdb, avoiding
closure env (see the sketch after this list)
* speed up database walkers by avoiding many temporaries
~5% perf improvement on block import, 100x on database iteration (useful
for building analysis tooling)
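The threadvar trick looks roughly like this (a sketch, not the actual rdb decoding code): one per-thread scratch buffer is reused across calls instead of allocating a closure environment per lookup:
```nim
# Sketch of the threadvar pattern - `decodeBuf` and `onData` are illustrative
# names, not the real rdb decoding callback.
var decodeBuf {.threadvar.}: seq[byte]

proc onData(data: openArray[byte]) =
  ## Copies the backend result into the per-thread buffer so no heap
  ## allocation or closure capture happens per call.
  decodeBuf.setLen(data.len)
  if data.len > 0:
    copyMem(addr decodeBuf[0], unsafeAddr data[0], data.len)
```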
* Extract sub-tree deletion functions into separate sub-modules
* Move/rename `aristo_desc.accLruSize` => `aristo_constants.ACC_LRU_SIZE`
* Lazily delete sub-trees
why:
This gives some control over the memory used to keep the deleted vertices
in the cached layers. For larger sub-trees, keys and vertices might be
on the persistent backend to a large extent, so deleting them eagerly would
pull an amount of extra information from the backend into the cached layer.
For lazy deletion it is enough to remember sub-trees by a small set of
(at most 16) sub-roots to be processed when storing persistent data.
Marking the tree root deleted immediately allows most of the code
base to work as before.
* Comments and cosmetics
* No need to import all for `Aristo` here
* Kludge to make `chronicle` usage in sub-modules work with `fluffy`
why:
That `fluffy` would not run with any logging in `core_db` is a problem
I have known for a while. Up to now, logging was only used for debugging.
With the current `Aristo` PR, there are cases where logging might be
wanted but this works only if `chronicles` runs without the
`json[dynamic]` sinks.
So this should be re-visited.
* More of a kludge
* Extracted `test_tx.testTxMergeProofAndKvpList()` => separate file
* Fix serialiser
why:
A typo led to duplicate rlp-encoded nodes in the chain
* Remove cruft
* Implement portal proof nodes generators `partXxxTwig()`
* Add unit test for portal proof nodes generator `partAccountTwig()`
* Cosmetics
* Simplify serialiser return code format
* Fix proof generator for extension nodes
why:
Code was simply bonkers, not detected before the unit tests were
adapted to check for just this.
* Implemented portal proof nodes verifier `partUntwig()`
* Cosmetics
* Fix `testutp` cli problem
* Implement partial trees
why:
This is currently needed for unit tests to pre-load the database
with test data similar to `proof` node pre-load.
The basic features for `snap-sync` boundary proofs are available
as well for future use. What is missing is the final proof verification
and a complete storage data load/merge function (stub is available.)
* Cosmetics, clean up
* Update config for Ledger and CoreDb
why:
Prepare for tracer which depends on the API jump table (as well as
the profiler.) The API jump table is now enabled in unit/integration
test mode piggybacking on the `unittest2DisableParamFiltering`
compiler flag or on an extra compiler flag `dbjapi_enabled`.
* No need for error field in `NodeRef`
why:
Was only needed by the proof nodes pre-loader which will be re-implemented
* Cosmetics
* Aristo: Merge `delta_siblings` module into `deltaPersistent()`
* Aristo: Add `isEmpty()` for canonical checking whether a layer is empty
* Aristo: Merge `LayerDeltaRef` into `LayerObj`
why:
No need to maintain nested object refs anymore. Previously the
`LayerDeltaRef` object had a companion `LayerFinalRef` which held
non-delta layer information.
* Kvt: Merge `LayerDeltaRef` into `LayerRef`
why:
No need to maintain nested object refs (as with `Aristo`)
* Kvt: Re-write balancer logic similar to `Aristo`
why:
Although `Kvt` was a cheap copy of `Aristo` it sort of got out of
sync and the balancer code was wrong.
* Update iterator over forked peers
why:
Yield an additional field `isLast` indicating that the last iteration
cycle has been reached.
* Optimise balancer calculation.
why:
One can often avoid creating a new object containing the merge of two
layers for the balancer, which avoids copying tables, although in some
cases this is replaced by `hasKey()` lookups. One of the two layers is
kept as the base and the other is merged into it.
Of course, this needs some checks to make sure that none of the
components being merged is shared with something else.
* Fix copyright year
* Remove `chunkedMpt` from `persistent()`/`stow()` function
why:
Proof-mode code was removed with PR #2445 and needs to be re-designed.
* Remove unused `beStateRoot` argument from `deltaMerge()`
* Update/drastically simplify `txStow()`
why:
Got rid of many boundary conditions
details:
Many pre-conditions have changed. In particular, previous versions
used the account state (hash), which was conveniently available, and
checked it against the backend in order to find out whether there
was anything to do at all. Currently, the balancer update is only
skipped when all tables in the delta layer are empty.
Notable changes are:
* no check against account state (see above)
* balancer filters have no hash signature (some legacy stuff left over
from journals)
* no (snap sync) proof data which made the generation of a top layer
more complex
* Cosmetics, cruft removal
* Update unit test file & function name
why:
Was legacy module
* Remove cruft left-over from PR #2494
* TODO
* Update comments on `HashKey` type values
* Remove obsolete hash key conversion flag `forceRoot`
why:
Is treated implicitly by having vertex keys as `HashKey` type and
root vertex states converted to `Hash256`
* Imported/rebase from `no-ext`, PR #2485
Store extension nodes together with the branch
Extension nodes must be followed by a branch - as such, it makes sense
to store the two together both in the database and in memory:
* fewer reads, writes and updates to traverse the tree
* simpler logic for maintaining the node structure
* less space used, both memory and storage, because there are fewer
nodes overall
There is also a downside: hashes can no longer be cached for an
extension - instead, only the extension+branch hash can be cached - this
seems like a fine tradeoff since computing it should be fast.
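Schematically, the combined vertex can be thought of as a branch carrying an optional extension prefix (a sketch with assumed names, not the actual `VertexRef` layout):
```nim
# Sketch only - assumed field names, not the real Aristo vertex definition.
type
  CombinedBranch = object
    ePfx: seq[byte]             # extension nibbles; empty for a plain branch
    subVids: array[16, uint64]  # child vertex ids, 0 = unused slot

proc hasExtension(v: CombinedBranch): bool =
  ## A non-empty prefix means this vertex replaces an extension + branch pair.
  v.ePfx.len > 0
```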
TODO: fix commented code
* Fix merge functions and `toNode()`
* Update `merkleSignCommit()` prototype
why:
Result is always a 32 byte hash
* Update short Merkle hash key generation
details:
Ethereum reference MPTs use Keccak hashes as node links if the size of
an RLP encoded node is at least 32 bytes. Otherwise, the RLP encoded
node value is used as a pseudo node link (rather than a hash.) This is
specified in the yellow paper, appendix D.
Unlike the `Aristo` implementation, the reference MPT would not store
such a node in the key-value database. Rather, the RLP encoded node value
is stored directly in the parent node in place of a node link.
Only for the root is the top level node always referred to by its
hash.
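In a small sketch (assuming the `nimcrypto` package for Keccak; not the actual serialiser code), the appendix D rule reads:
```nim
# Nodes whose RLP encoding is shorter than 32 bytes are embedded verbatim in
# the parent instead of being referenced by their Keccak hash; the root node
# is always referenced by hash.
import nimcrypto

proc nodeRef(rlpEncoded: seq[byte], isRoot: bool): seq[byte] =
  if isRoot or rlpEncoded.len >= 32:
    @(keccak256.digest(rlpEncoded).data)  # 32 byte hash used as node link
  else:
    rlpEncoded                            # short RLP used as pseudo node link
```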
* Fix/update `Extension` sections
why:
Were commented out after removal of a dedicated `Extension` type which
left the system dysfunctional.
* Clean up unused error codes
* Update unit tests
* Update docu
---------
Co-authored-by: Jacek Sieka <jacek@status.im>
This PR adds a storage hike cache similar to the account hike cache
already present - this cache is less efficient because account storage
is already partially cached in the account ledger but nonetheless helps
keep hiking down.
Notably, there's an opportunity to optimise this cache and the others so
that they cooperate better instead of overlapping, which is left for a
future PR.
This PR also fixes an O(N) memory usage for storage slots where the
delete would keep the full storage in a work list which on mainnet can
grow very large - the work list is replaced with a more conventional
recursive `O(log N)` approach.
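The shape of the recursive replacement, shown on a toy in-memory structure (assumed names; the actual Aristo deletion code differs): children are deleted depth-first so only O(tree depth) state is held at any time instead of an O(N) work list:
```nim
# Toy sketch of depth-first sub-tree deletion - not the real Aristo code.
import std/tables

type
  ToyDb = object
    kids: Table[uint64, seq[uint64]]  # vertex id -> child vertex ids

proc delSubTree(db: var ToyDb, vid: uint64) =
  ## Delete children first, then the vertex itself; recursion depth is
  ## bounded by the trie depth rather than the number of slots.
  for child in db.kids.getOrDefault(vid, @[]):
    db.delSubTree(child)
  db.kids.del(vid)
```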
The Vertex type unifies branches, extensions and leaves into a single
memory area where the largest member is the branch (128 bytes + overhead) -
the payloads we have are all smaller than 128 bytes, thus wrapping them in an
extra layer of `ref` is wasteful from a memory usage perspective.
Further, the refs must be visited during the mark-and-sweep phase of garbage
collection - since we keep millions of these, many of them
short-lived, this takes up significant CPU time.
```
Function CPU Time: Total CPU Time: Self Module Function (Full) Source File Start Address
system::markStackAndRegisters 10.0% 4.922s nimbus system::markStackAndRegisters(var<system::GcHeap>).constprop.0 gc.nim 0x701230
```
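A sketch of the direction (assumed names and field sizes; not the exact definition): a single variant object sized by its largest arm, so payloads no longer need an extra `ref` indirection:
```nim
# Sketch only - simplified to two vertex kinds with assumed field names.
type
  VertexType = enum
    Leaf, Branch

  Vertex = object
    pfx: seq[byte]            # path prefix shared by both kinds
    case vType: VertexType
    of Leaf:
      payload: seq[byte]      # leaf payload stored inline, no `ref` wrapper
    of Branch:
      startVid: uint64        # pre-allocated sub-item id range (see above)
      used: uint16            # occupancy bitmap
```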
* Updates and corrections
* Extract `CoreDb` configuration from `base.nim` into separate module
why:
This makes it easier to avoid circular imports, in particular
when the capture journal (aka tracer) is revived.
* Extract `Ledger` configuration from `base.nim` into separate module
why:
This makes it easier to avoid circular imports (if any.)
also:
Move `accounts_ledger.nim` file to sub-folder `backend`. That way the
layout resembles that of the `core_db`.
hike allocations (and the garbage collection maintenance that follows)
are responsible for some 10% of cpu time (not wall time!) at this point
- this PR avoids them by stepping through the layers one step at a time,
simplifying the code at the same time.
Introduce a new `StoData` payload type similar to `AccountData`
* slightly more efficient storage format
* typed api
* fewer seqs
* fix encoding docs - it wasn't rlp after all :)
The state and account MPTs currently share key space in the database
and vertex ids are assigned essentially randomly, which means
that when two adjacent slot values from the same contract are accessed,
they might reside at a large distance from each other.
Here, we prefix each vertex id by its root causing them to be sorted
together thus bringing all data belonging to a particular contract
closer together - the same effect also happens for the main state MPT
whose nodes now end up clustered together more tightly.
In the future, the prefix given to the storage keys can also be used to
perform range operations such as reading all the storage at once and/or
deleting an account with a batch operation.
Notably, parts of the API already supported this rooting concept while
parts didn't - this PR makes the API consistent by always working with a
root+vid.
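Conceptually, the lookup key becomes a (root, vid) pair serialised root-first, so RocksDB's lexicographic ordering clusters all vertices of one trie (a sketch; the real Aristo key serialisation differs):
```nim
# Sketch of root-prefixed keys - not the actual on-disk key format.
type RootedVertexID = tuple[root, vid: uint64]

proc toDbKey(rvid: RootedVertexID): array[16, byte] =
  ## Big-endian, root first: keys of the same sub-trie sort next to
  ## each other, so a contract's slots end up in nearby data blocks.
  for i in 0 ..< 8:
    result[i] = byte((rvid.root shr (8 * (7 - i))) and 0xff)
    result[8 + i] = byte((rvid.vid shr (8 * (7 - i))) and 0xff)
```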
The storage id is frequently accessed when executing contract code and
finding the path via the database requires several hops, making the
process slow - here, we add a cache to keep the most recently used
account storage ids in memory.
A possible future improvement would be to cache all account accesses so
that for example updating the balance doesn't cause several hikes.
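A minimal LRU sketch of the idea (not the cache implementation actually used; names are assumptions): the most recently used account-path to storage-vertex-id mappings are kept in memory, evicting the oldest entry when capacity is reached:
```nim
# Toy LRU for account path -> storage vertex id, illustration only.
import std/[hashes, options, tables]

type
  StoIdCache = object
    cap: int
    entries: OrderedTable[array[32, byte], uint64]

proc get(c: var StoIdCache, accPath: array[32, byte]): Option[uint64] =
  if accPath in c.entries:
    let vid = c.entries[accPath]
    c.entries.del(accPath)      # re-insert to mark as most recently used
    c.entries[accPath] = vid
    some(vid)
  else:
    none(uint64)

proc put(c: var StoIdCache, accPath: array[32, byte], vid: uint64) =
  if accPath notin c.entries and c.entries.len >= c.cap:
    var oldest: array[32, byte]
    for k in c.entries.keys:    # first key is the least recently used
      oldest = k
      break
    c.entries.del(oldest)
  c.entries[accPath] = vid
```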
* Update some docu
* Resolve obsolete compile time option
why:
Not optional anymore
* Update checks
why:
The notion of what constitutes a valid `Aristo` db has changed due to
(even more) lazy calculating Merkle hash keys.
* Disable redundant unit test for production
* Remove `dirty` set from structural objects
why:
Not used anymore, the tree is dirty by default.
* Rename `aristo_hashify` -> `aristo_compute`
* Remove cruft, update comments, cosmetics, etc.
* Simplify `SavedState` object
why:
The key chaining has become obsolete after extra lazy hashing. There
is some available space for a state hash to be maintained in future.
details:
Accept the legacy `SavedState` object serialisation format for a
while (which will be overwritten by new format.)
* Normalised storage tree addressing in function prototypes
detail:
Argument list is always `<db> <account-path> <slot-path> ..` with
both path arguments as `openArray[]`
* Remove cruft
* CoreDb internally Use full account paths rather than addresses
* Update API logging
* Use hashed account address only in prototypes
why:
This avoids unnecessary repeated hashing of the same account address.
The burden of doing that is upon the application. In the case here,
the ledger caches all kinds of stuff anyway so it is common sense to
exploit that for account address hashes.
caveat:
Using `openArray[byte]` argument types for hashed accounts is inherently
fragile. In non-release mode, a length verification `doAssert` is
enabled by default.
* No accPath in data record (use `AristoAccount` as `CoreDbAccount`)
* Remove now unused `eAddr` field from ledger `AccountRef` type
why:
Is duplicate of lookup key
* Avoid merging the account record/statement in the ledger twice.
* Tighten `CoreDb` API for accounts
why:
Apart from cruft, the way to fetch the accounts state root via a
`CoreDbColRef` record was unnecessarily complicated.
* Extend `CoreDb` API for accounts to cover storage tries
why:
In future, this will make the notion of column objects obsolete. Storage
trees will then be indexed by the account address rather than the vertex
ID equivalent like a `CoreDbColRef`.
* Apply new/extended accounts API to ledger and tests
details:
This makes the `distinct_ledger` module obsolete
* Remove column object constructors
why:
They were needed as an abstraction of MPT sub-trees including storage
trees. Now, storage trees are handled by the account (e.g. via address)
they belong to and all other trees can be identified by a constant well
known vertex ID. So there is no need for column objects anymore.
Still there are some left-over column object methods which will be
removed next.
* Remove `serialise()` and `PayloadRef` from default Aristo API
why:
Not needed. `PayloadRef` was used for unstructured/unknown payload
formats (account or blob) and `serialise()` was used for decoding
`PayloadRef`. Now it is known in advance what the payload looks
like.
* Added query function `hasStorageData()` whether a storage area exists
why:
Useful for supporting `slotStateEmpty()` of the `CoreDb` API
* In the `Ledger` replace `storage.stateEmpty()` by `slotStateEmpty()`
* On Aristo, hide the storage root/vertex ID in the `PayloadRef`
why:
The storage vertex ID is fully controlled by Aristo while the
`AristoAccount` object is controlled by the application. With the
storage root part of the `AristoAccount` object, there was a useless
administrative burden to keep that storage root field up to date.
* Remove cruft, update comments etc.
* Update changed MPT access paradigms
why:
Fixes verified proxy tests
* Fluffy cosmetics
* creating a seq from a table that holds lots of changes means copying
all data into the seq - this can be several GB of data while syncing
blocks
* nim fails to optimize the moving of the `WidthFirstForest` - the real
solution is to not construct a `wff` to begin with, but this PR provides
relief while that is being worked on
This spike fix allows us to bump the rocksdb cache by another 2 GB and
still have a significantly lower peak memory usage during sync.
This buffer eliminates a large part of allocations during MPT traversal,
reducing overall memory usage and GC pressure.
Ideally, we would use it throughout in the API instead of
`openArray[byte]` since the built-in length limit appropriately exposes
the natural 64-nibble depth constraint that `openArray` fails to
capture.
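A sketch of such a buffer (assumed names, not the actual implementation): a path is at most 64 nibbles (32 bytes), so a stack-allocated array plus a length covers every case without heap allocation:
```nim
# Illustrative fixed-capacity nibble buffer - names are assumptions.
type
  NibblesBuf = object
    len: int
    data: array[64, byte]   # one nibble (0..15) per entry

proc add(b: var NibblesBuf, nibble: byte) =
  doAssert b.len < 64 and nibble <= 0x0f
  b.data[b.len] = nibble
  inc b.len

proc fromBytes(bytes: openArray[byte]): NibblesBuf =
  ## Split each byte of a (max 32 byte) path into two nibbles.
  doAssert bytes.len <= 32
  for b in bytes:
    result.add(b shr 4)
    result.add(b and 0x0f)
```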
* Provide dedicated functions for fetching accounts and storage trees
why:
Different prototypes for each class `account`, `generic` and
`storage`.
* Remove `fetchPayload()` and other cruft from API, `aristo_fetch`, etc.
* Fix typos, debugging left overs, comments
* Provide dedicated functions for deleting accounts and storage trees
why:
Storage trees are always linked to an account, so there is no need
for an application to fiddle about (e.g. re-cycling, unlinking)
storage tree vertex IDs.
* Remove `delete()` and other cruft from API, `aristo_delete`, etc.
* clean up delete functions
details:
The delete implementations `deleteImpl()` and `delTreeImpl()` do not
need to be super generic anymore as all the edge cases are covered by
the specialised `deleteAccountPayload()`, `deleteGenericData()`, etc.
* Avoid unnecessary re-calculations of account keys
why:
The function `registerAccountForUpdate()` did extract the storage ID
(if any) and automatically marked the Merkle keys along the account
path for re-hashing.
This would also apply if it was later detected that the account
or the storage tree did not need to be updated.
So the `registerAccountForUpdate()` function was split into a part
which retrieved the storage ID, and another one which marked the
Merkle keys for re-calculation to be applied only when needed.
* Remove unused `merge*()` functions (for production)
details:
Some functionality moved to test suite
* Make sure that only the `AccountData` leaf type is used on VertexID(1)
* clean up payload type
* Provide dedicated functions for merging accounts and storage trees
why:
Storage trees are always linked to an account, so there is no need
for an application to fiddle about (e.g. creating, re-cycling) with
storage tree vertex IDs.
* CoreDb: Disable tracer functionality
why:
Must be updated to accommodate new/changed `Aristo` functions.
* CoreDb: Use new `mergeXXX()` functions
why:
Makes explicit vertex ID management obsolete for creating new
storage trees.
* Remove `mergePayload()` and other cruft from API, `aristo_merge`, etc.
* clean up merge functions
details:
The merge implementation `mergePayloadImpl()` does not need to be super
generic anymore as all the edge cases are covered by the specialised
functions `mergeAccountPayload()`, `mergeGenericData()`, and
`mergeStorageData()`.
* No tracer available at the moment, so disable offending tests
* Fix initialiser
why:
Possible crash (app profiling, tracer etc.)
* Update column family options processing
why:
Same for kvt as for aristo
* Move `AristoDbDualRocks` backend type to the test suite
why:
So it is not available for production
* Fix typos in API jump table
why:
Used for tracing and app profiling only. Needed some update
* Purged CoreDb legacy API
why:
Not needed anymore, was transitionary and disabled.
* Rename `flush` argument to `eradicate` in a DB close context
why:
The word `eradicate` leaves no doubt what is meant
* Rename `stoFlush()` -> `stoDelete()`
* Rename `core_apps_newapi` -> `core_apps` (not so new anymore)
* Bump nim-eth, nim-web3, nimbus-eth2
- Replace std.Option with results.Opt
- Fields name changes
* More fixes
* Fix Portal stream async raises and portal testnet Opt usage
* Bump eth + nimbus-eth2 + more fixes related to eth_types changes
* Fix in utp test app and nimbus-eth2 bump
* Fix test_blockchain_json rebase conflict
* Fix EVMC block_timestamp conversion plus commentary
---------
Co-authored-by: kdeme <kim.demey@gmail.com>
* bump rocksdb
* Rename `KVT` objects related to filters according to `Aristo` naming
details:
filter* => delta*
roFilter => balancer
* Compulsory error handling if `persistent()` fails
* Add return code to `reCentre()`
why:
Might eventually fail if re-centring is blocked. Some logic will be
added in subsequent patch sets.
* Add column families from earlier session to rocksdb in opening procedure
why:
All previously used CFs must be declared when re-opening an existing
database.
* Update `init()` and add rocksdb `reinit()` methods for changing parameters
why:
Opening a set of column families (with different open options) must span
at least the ones that are already on disk.
* Provide write-trigger-event interface into `Aristo` backend
why:
This allows data from a guest application (think `KVT`) to get
synced with the write cycle so that the guest and `Aristo` save
everything atomically.
* Use `KVT` with new column family interface from `Aristo`
* Remove obsolete guest interface
* Implement `KVT` piggyback on `Aristo` backend
* CoreDb: Add separate `KVT`/`Aristo` backend mode for debugging
* Remove `rocks_db` import from `persist()` function
why:
Some systems (in particular `fluffy` and friends) use the `Aristo` memory
backend emulation and do not link against rocksdb when building the
application. So this should fix that problem.
* Use RocksDb column families instead of a prefixed single column
why:
Better performance
* Use structural objects `VertexRef` and `HashKey` in LRU cache for RocksDb
why:
Avoids repeated de/serialisation