When lazily verifying state roots, we may end up with an entire state
whose roots have not yet been computed, meaning the computation then
has to run over the whole database - in the current design, that would
result in hashes for the entire trie being held in memory.
Since the hash depends only on the data in the vertex, we can store it
directly at the top-most layer holding the vertices it was derived from
- be that memory or database - which makes the memory usage broadly
linear with respect to the already-existing in-memory change set stored
in the layers.
It also ensures that if we have multiple forks in memory, hashes get
cached in the correct layer maximising reuse between forks.
The same layer numbering scheme as elsewhere is reused, where -2 is the
backend, -1 is the balancer, and 0 and up are the stack layers up to the
top of the stack.
A downside of this approach is that we create many small batches - a
future improvement could be to collect all such writes in a single
batch, though the memory profile of this approach should be examined
first (where is the batch kept, exactly?).
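For illustration, a sketch of how the cached hashes sit next to the
per-layer change set (simplified types; field names only loosely modelled
on the actual layer objects):
```
import std/[options, tables]

type
  VertexID = uint64
  HashKey = array[32, byte]

  Layer = object
    sTab: Table[VertexID, seq[byte]]  # vertex change set of this layer (simplified)
    kMap: Table[VertexID, HashKey]    # hashes cached for vertices in this layer

proc cachedKey(layers: openArray[Layer], vid: VertexID): Option[HashKey] =
  ## Search from the top of the stack downwards (assuming the most recent
  ## layer is last); the first cached hash wins, and the balancer/backend
  ## is only consulted when no layer holds one.
  for i in countdown(layers.high, 0):
    if vid in layers[i].kMap:
      return some(layers[i].kMap[vid])
  none(HashKey)
```
Because each layer carries its own hash cache, dropping or replacing a
fork layer also drops exactly the hashes that were computed from it.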
* Imported/rebase from `no-ext`, PR #2485
Store extension nodes together with the branch
Extension nodes must be followed by a branch - as such, it makes sense
to store the two together both in the database and in memory:
* fewer reads, writes and updates to traverse the tree
* simpler logic for maintaining the node structure
* less space used, both memory and storage, because there are fewer
nodes overall
There is also a downside: hashes can no longer be cached for an
extension - instead, only the extension+branch hash can be cached - this
seems like a fine tradeoff since computing it should be fast.
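For illustration, the combined layout this implies could look like the
following (field names are hypothetical, not the actual declarations):
```
type
  BranchVertex = object
    ePfx: seq[byte]           # extension path prefix; empty if there is none
    bVid: array[16, uint64]   # the 16 child vertex ids of the branch
```
A single read or write then covers what used to be two nodes, and the
cached hash belongs to the pair as a whole.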
TODO: fix commented code
* Fix merge functions and `toNode()`
* Update `merkleSignCommit()` prototype
why:
Result is always a 32 byte hash
* Update short Merkle hash key generation
details:
Ethereum reference MPTs use Keccak hashes as node links if the size of
an RLP encoded node is at least 32 bytes. Otherwise, the RLP encoded
node value is used as a pseudo node link (rather than a hash.) This is
specified in the yellow paper, appendix D.
Unlike the `Aristo` implementation, the reference MPT does not store
such a node in the key-value database. Instead, the RLP encoded node
value is stored in place of a node link in the parent node.
Only the top level node is always referred to by a hash, namely the
root hash.
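For illustration, the rule boils down to something like this (a sketch
assuming `nimcrypto` for Keccak; not the actual Aristo code):
```
import nimcrypto

proc nodeLink(rlpNode: openArray[byte]): seq[byte] =
  ## Yellow paper, appendix D: short nodes are embedded verbatim in the
  ## parent, longer ones are referenced by their Keccak hash.
  if rlpNode.len < 32:
    @rlpNode                             # pseudo node link: the RLP itself
  else:
    @(keccak256.digest(rlpNode).data)    # proper node link: 32 byte hash
```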
* Fix/update `Extension` sections
why:
Were commented out after removal of the dedicated `Extension` type,
which left the system dysfunctional.
* Clean up unused error codes
* Update unit tests
* Update docu
---------
Co-authored-by: Jacek Sieka <jacek@status.im>
The Vertex type unifies branches, extensions and leaves into a single
memory area where the largest member is the branch (128 bytes + overhead) -
the payloads we have are all smaller than 128 bytes, thus wrapping them
in an extra layer of `ref` is wasteful from a memory usage perspective.
Further, the ref:s must be visited during the M&S phase of garbage
collection - since we keep millions of these, many of them
short-lived, this takes up significant CPU time.
```
Function CPU Time: Total CPU Time: Self Module Function (Full) Source File Start Address
system::markStackAndRegisters 10.0% 4.922s nimbus system::markStackAndRegisters(var<system::GcHeap>).constprop.0 gc.nim 0x701230
```
Hike allocations (and the garbage collection maintenance that follows)
are responsible for some 10% of CPU time (not wall time!) at this point
- this PR avoids them by stepping through the layers one step at a time,
simplifying the code along the way.
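For illustration, the shape of the unified type (a self-contained
sketch, not the actual Aristo declarations):
```
type
  VertexType = enum
    Leaf, Branch
  VertexObj = object
    case vType: VertexType
    of Leaf:
      lData: array[100, byte]   # payload stored inline; always under 128 bytes
      lLen: uint8
    of Branch:
      bVid: array[16, uint64]   # largest member: 128 bytes of child vertex ids
```
With the payload inline, there is no per-payload `ref` left for the
garbage collector to trace.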
The state and account MPT:s currently share key space in the database,
with vertex id:s assigned essentially randomly, which means that when
two adjacent slot values from the same contract are accessed, they might
reside at a large distance from each other.
Here, we prefix each vertex id with its root, causing them to be sorted
together and thus bringing all data belonging to a particular contract
closer together - the same effect also applies to the main state MPT,
whose nodes now end up clustered more tightly.
In the future, the prefix given to the storage keys can also be used to
perform range operations such as reading all the storage at once and/or
deleting an account with a batch operation.
Notably, parts of the API already supported this rooting concept while
parts didn't - this PR makes the API consistent by always working with a
root+vid.
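For illustration, a sketch of the prefixed key (the exact encoding here
is an assumption, not the actual Aristo key format):
```
proc dbKey(root, vid: uint64): array[16, byte] =
  ## Root id first, vertex id second, both big-endian, so all vertices
  ## belonging to one root sort next to each other on disk.
  for i in 0 .. 7:
    result[i]     = byte((root shr (8 * (7 - i))) and 0xff'u64)
    result[i + 8] = byte((vid  shr (8 * (7 - i))) and 0xff'u64)
```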
The storage id is frequently accessed when executing contract code, and
finding the path via the database requires several hops, making the
process slow - here, we add a cache to keep the most recently used
account storage id:s in memory.
A possible future improvement would be to cache all account accesses so
that for example updating the balance doesn't cause several hikes.
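For illustration, a much simplified sketch of such a cache (names are
hypothetical and the eviction here is crude; the real cache keeps the
most recently used entries):
```
import std/tables

const cacheLimit = 1000                    # illustrative bound

type StoIdCache = object
  ids: Table[array[32, byte], uint64]      # hashed account path -> storage root id

proc lookup(c: StoIdCache, accPath: array[32, byte]): uint64 =
  c.ids.getOrDefault(accPath, 0'u64)       # 0 = not cached, fall back to the hike

proc remember(c: var StoIdCache, accPath: array[32, byte], stoId: uint64) =
  if c.ids.len >= cacheLimit:
    c.ids.clear()                          # crude eviction, for the sketch only
  c.ids[accPath] = stoId
```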
* avoid costly hike memory allocations for operations that don't need to
re-traverse it
* avoid unnecessary state checks (which might trigger unwanted state
root computations)
* disable optimize-for-hits due to the MPT no longer being complete at
all times
* Normalised storage tree addressing in function prototypes
detail:
Argument list is always `<db> <account-path> <slot-path> ..` with
both path arguments as `openArray[]`
* Remove cruft
* CoreDb: internally use full account paths rather than addresses
* Update API logging
* Use hashed account address only in prototypes
why:
This avoids unnecessary repeated hashing of the same account address.
The burden of doing that is upon the application. In the case here,
the ledger caches all kinds of stuff anyway so it is common sense to
exploit that for account address hashes.
caveat:
Using `openArray[byte]` argument types for hashed accounts is inherently
fragile. In non-release mode, a length verification `doAssert` is
enabled by default.
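For illustration, the kind of guard meant here (a hypothetical helper,
not the actual code):
```
proc checkAccPath(accPath: openArray[byte]) =
  ## Catch callers passing an un-hashed 20 byte address by mistake; the
  ## check is compiled out in release builds.
  when not defined(release):
    doAssert accPath.len == 32, "expected a 32 byte hashed account address"
  else:
    discard
```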
* No accPath in data record (use `AristoAccount` as `CoreDbAccount`)
* Remove now unused `eAddr` field from ledger `AccountRef` type
why:
Is duplicate of lookup key
* Avoid merging the account record/statement in the ledger twice.
* Tighten `CoreDb` API for accounts
why:
Apart from cruft, the way to fetch the accounts state root via a
`CoreDbColRef` record was unnecessarily complicated.
* Extend `CoreDb` API for accounts to cover storage tries
why:
In future, this will make the notion of column objects obsolete. Storage
trees will then be indexed by the account address rather than the vertex
ID equivalent like a `CoreDbColRef`.
* Apply new/extended accounts API to ledger and tests
details:
This makes the `distinct_ledger` module obsolete
* Remove column object constructors
why:
They were needed as an abstraction of MPT sub-trees including storage
trees. Now, storage trees are handled by the account (e.g. via address)
they belong to and all other trees can be identified by a constant well
known vertex ID. So there is no need for column objects anymore.
There are still some left-over column object methods which will be
removed next.
* Remove `serialise()` and `PayloadRef` from default Aristo API
why:
Not needed. `PayloadRef` was used for unstructured/unknown payload
formats (account or blob) and `serialise()` was used for decoding
`PayloadRef`. Now it is known in advance what the payload looks
like.
* Added query function `hasStorageData()` whether a storage area exists
why:
Useful for supporting `slotStateEmpty()` of the `CoreDb` API
* In the `Ledger` replace `storage.stateEmpty()` by `slotStateEmpty()`
* On Aristo, hide the storage root/vertex ID in the `PayloadRef`
why:
The storage vertex ID is fully controlled by Aristo while the
`AristoAccount` object is controlled by the application. With the
storage root part of the `AristoAccount` object, there was a useless
administrative burden to keep that storage root field up to date.
* Remove cruft, update comments etc.
* Update changed MPT access paradigms
why:
Fixes verified proxy tests
* Fluffy cosmetics
This buffer eliminates a large part of allocations during MPT traversal,
reducing overall memory usage and GC pressure.
Ideally, we would use it throughout the API instead of
`openArray[byte]`, since the built-in length limit appropriately exposes
the natural 64-nibble depth constraint that `openArray` fails to
capture.
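For illustration, a sketch of such a fixed-size buffer (field names are
assumptions, not the actual type):
```
type
  NibblesBuf = object
    bytes: array[32, byte]   # room for the full 64 nibble / 32 byte MPT path
    iend: uint8              # number of nibbles currently in use

func nibble(n: NibblesBuf, i: int): byte =
  ## Return the i:th nibble of the path.
  if (i and 1) == 0: n.bytes[i shr 1] shr 4
  else: n.bytes[i shr 1] and 0x0f'u8
```
Because the whole path fits on the stack, traversal does not need to
allocate for it.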
* Remove unused `merge*()` functions (for production)
details:
Some functionality moved to test suite
* Make sure that only the `AccountData` leaf type is used on VertexID(1)
* clean up payload type
* Provide dedicated functions for merging accounts and storage trees
why:
Storage trees are always linked to an account, so there is no need
for an application to fiddle about (e.g. creating, re-cycling) with
storage tree vertex IDs.
* CoreDb: Disable tracer functionality
why:
Must be updated to accommodate new/changed `Aristo` functions.
* CoreDb: Use new `mergeXXX()` functions
why:
Makes explicit vertex ID management obsolete for creating new
storage trees.
* Remove `mergePayload()` and other cruft from API, `aristo_merge`, etc.
* clean up merge functions
details:
The merge implementation `mergePayloadImpl()` does not need to be super
generic anymore as all the edge cases are covered by the specialised
functions `mergeAccountPayload()`, `mergeGenericData()`, and
`mergeStorageData()`.
* No tracer available at the moment, so disable offending tests
* Code cosmetics
* Re-org `aristo_merge`, internally split into sub-modules
why:
Became a burden for maintenance because it hosts two different
functionalities under the same merge paradigm: account/data merge
and snap proof merge, where the latter produces a partial trie.
* Fix CoreDb tracer
* Ledger: fix potential account vs. storage tree sync problems
* Remove bound on the size of removable whole storage trees
* Activate `test_tracer_json`