Syncing
=======
Syncing blocks is performed in two partially overlapping phases:

* loading the header chains into separate database tables
* removing headers from the header chains, fetching the rest of the
  block each header belongs to, and executing it

Header chains
-------------
The header chains are the triple of

* a consecutively linked chain of headers starting at Genesis
* followed by a sequence of missing headers
* followed by a consecutively linked chain of headers ending at a
  finalised block header received from the consensus layer

A sequence *@[h(1),h(2),..]* of block headers is called a *consecutively
linked chain* if

* block numbers join without gaps, i.e. *h(n).number+1 == h(n+1).number*
* parent hashes match, i.e. *h(n).hash == h(n+1).parentHash*

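Spelled out as code, the two clauses read as follows. This is a minimal
Python sketch; the `Header` record and its field names are stand-ins for the
real data structures, not the actual implementation.

    from dataclasses import dataclass

    @dataclass
    class Header:              # hypothetical stand-in for the real header record
        number: int            # block number
        hash: str              # hash of this header
        parentHash: str        # hash of the parent header

    def is_linked_chain(headers: list) -> bool:
        # Check both clauses for every adjacent pair h(n), h(n+1).
        return all(
            a.number + 1 == b.number and a.hash == b.parentHash
            for a, b in zip(headers, headers[1:]))
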
General header chains layout diagram:

    G                B                     L                F            (1)
    o----------------o---------------------o----------------o--->
    | <-- linked --> | <-- unprocessed --> | <-- linked --> |

Here, the single upper letter symbols *G*, *B*, *L*, *F* denote block numbers.
For convenience, these letters are also identified with their associated block
headers or full blocks. Saying *"the header G"* is short for *"the header
with block number G"*.
Meaning of *G*, *B*, *L*, *F*:

* *G* -- Genesis, block number *#0*
* *B* -- base, maximal block number of the linked chain starting at *G*
* *L* -- least, minimal block number of the linked chain ending at *F*,
  with *B <= L*
* *F* -- final, some finalised block

This definition implies *G <= B <= L <= F*, and the header chains can be
uniquely described by the triple of block numbers *(B,L,F)*.

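For illustration, the triple can be modelled as a small record with its
invariant; the `HeaderChains` name and layout below are assumptions made for
this sketch, not the actual implementation.

    from dataclasses import dataclass

    G = 0                      # Genesis block number

    @dataclass
    class HeaderChains:        # hypothetical model of the (B,L,F) triple
        B: int                 # base: top of the linked chain starting at G
        L: int                 # least: bottom of the linked chain ending at F
        F: int                 # final: some finalised block number

        def invariant(self) -> bool:
            return G <= self.B <= self.L <= self.F

On a pristine system the triple starts out as `HeaderChains(B=G, L=G, F=G)`,
cf. the initialisation section below.
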
### Storage of header chains:
Some block numbers from the set *{w|G<=w<=B}* may correspond to finalised
blocks which may be stored anywhere. If some block numbers do not correspond
to finalised blocks, then their headers must reside in the *flareHeader*
database table. Because they are finalised, the former block numbers
constitute a sub-chain starting at *G*.

The headers for the block numbers from the set *{w|L<=w<=F}* must reside in
the *flareHeader* database table. They do not correspond to finalised blocks.

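Continuing the sketch above, the storage rule can be read as a classification
over block numbers (the returned strings merely describe where a header may
live):

    def storage_of(w: int, c: HeaderChains) -> str:
        # Classify block number w according to the storage rule above.
        if G <= w <= c.B:
            # Finalised headers may be stored anywhere (e.g. pre-loaded
            # era1 data); non-finalised ones must be in flareHeader.
            return "anywhere if finalised, otherwise flareHeader table"
        if c.L <= w <= c.F:
            return "flareHeader table"
        return "unprocessed, not fetched yet"
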
### Header chains initialisation:
Minimal layout on a pristine system:

    G                                                                    (2)
    B
    L
    F
    o--->

When first initialised, the header chains are set to *(G,G,G)*.
### Updating header chains:
A header chain with a non-empty open interval *(B,L)* can be updated only by
increasing *B* or decreasing *L*, i.e. by adding headers so that the linked
chain condition is not violated.

Only when the open interval *(B,L)* vanishes can the right end *F* be
increased, to *Z* say. Then

* *B==L* because the interval *(B,L)* is empty
* *B==F* because *B* is maximal

and the header chains *(F,F,F)* (depicted in *(3)*) can be set to *(B,Z,Z)*
(as depicted in *(4)*).

Layout before updating of *F*:

                     B                                                   (3)
                     L
    G                F                     Z
    o----------------o---------------------o---->
    | <-- linked --> |

New layout with *Z*:

                                           L'                            (4)
    G                B                     F'
    o----------------o---------------------o---->
    | <-- linked --> | <-- unprocessed --> |

with *L'=Z* and *F'=Z*.
Note that diagram *(3)* is a generalisation of *(2)*.
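Continuing the sketch, the permitted transitions look as follows (hash
linkage checks are omitted for brevity):

    def extend_base(c: HeaderChains, headers: list) -> None:
        # Append linked headers on top of B, increasing B.
        assert headers[0].number == c.B + 1
        c.B = headers[-1].number

    def extend_least(c: HeaderChains, headers: list) -> None:
        # Prepend linked headers below L, decreasing L.
        assert headers[-1].number == c.L - 1
        c.L = headers[0].number

    def update_final(c: HeaderChains, Z: int) -> None:
        # Only legal once the open interval (B,L) has vanished, i.e. the
        # triple is (F,F,F) as in layout (3); it then becomes (B,Z,Z).
        assert c.B == c.L == c.F and c.F <= Z
        c.L = Z
        c.F = Z
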
### Complete header chain:
The header chain is *relatively complete* if it satisfies the layout *(3)*
above with *G < B*. It is *fully complete* if additionally *F==Z*. The latter
condition can only be temporary on a live system (as *Z* is permanently
updated.)

When a *relatively complete* header chain is reached for the first time, the
execution layer can start running an importer in the background, compiling
or executing blocks (starting from block number *#1*.) This way, the ledger
database state is updated incrementally.

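In terms of the sketch above, the two notions are simple predicates; reading
*fully complete* as *relatively complete* with *F==Z* is an assumption of
this sketch.

    def relatively_complete(c: HeaderChains) -> bool:
        # Layout (3) with a non-trivial linked chain: G < B == L == F.
        return G < c.B and c.B == c.L == c.F

    def fully_complete(c: HeaderChains, Z: int) -> bool:
        # Additionally caught up with the latest finalised block Z;
        # only ever a transient state on a live system.
        return relatively_complete(c) and c.F == Z
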
Imported block chain
--------------------
The following imported block chain diagram amends the layout *(1)*:

    G                  T       B                     L                F  (5)
    o------------------o-------o---------------------o----------------o-->
    | <-- imported --> |       |                     |                |
    | <-------- linked ------> | <-- unprocessed --> | <-- linked --> |

where *T* is the number of the last imported and executed block. *T* also
refers to the global state of the ledger database.

The headers corresponding to the half open interval *(T,B]* can be completed
by fetching the missing block bodies; the resulting blocks are then
imported/executed.

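A sketch of the corresponding import step is shown below; `fetch_bodies()`
and `persist_blocks()` are hypothetical placeholders for the real networking
and execution machinery, which batches and schedules this work across peers.

    BATCH_LEN = 64                # assumed import batch size

    def fetch_bodies(lo: int, hi: int) -> list:
        ...                       # placeholder: fetch bodies for blocks lo..hi

    def persist_blocks(blocks: list) -> None:
        ...                       # placeholder: import/execute a block batch

    def import_blocks(c: HeaderChains, T: int) -> int:
        # Complete the headers in the half open interval (T,B] with their
        # bodies and execute the resulting blocks, advancing T towards B.
        while T < c.B:
            n = min(T + BATCH_LEN, c.B)
            persist_blocks(fetch_bodies(T + 1, n))
            T = n
        return T                  # new ledger state block number
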
Running the sync process for *MainNet*
--------------------------------------
For syncing, a beacon node is needed that regularly informs via *RPC* of a
recently finalised block header.
The beacon node program used here is the *nimbus_beacon_node* binary from the
*nimbus-eth2* project (any other will do.) *Nimbus_beacon_node* is started as

    ./run-mainnet-beacon-node.sh \
       --web3-url=http://127.0.0.1:8551 \
       --jwt-secret=/tmp/jwtsecret

where *http://127.0.0.1:8551* is the URL of the sync process that receives the
finalised block header (here on the same physical machine) and `/tmp/jwtsecret`
is the shared secret file needed for mutual communication authentication.
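The shared secret is a 32 byte value stored hex-encoded. If the file does not
exist yet, it can be created in many ways; a sketch (any source of 32 random
bytes will do):

    # Write a fresh 32 byte (64 hex digit) shared secret for the engine API.
    import secrets

    with open("/tmp/jwtsecret", "w") as f:
        f.write(secrets.token_hex(32))
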
It will take a while for *nimbus_beacon_node* to catch up (see the
[Nimbus Guide](https://nimbus.guide/quick-start.html) for details.)
### Starting `nimbus` for syncing
As the sync process is quite slow, it makes sense to pre-load the database
with data from an `Era1` archive (if available) before starting the real
sync process. The command would be something like

    ./build/nimbus import \
       --era1-dir:/path/to/main-era1/repo \
       ...

which will take a while for the full *MainNet* era1 repository (but way faster
than the sync.)
On a system with memory considerably larger than *8GiB* the *nimbus*
binary is started on the same machine where the beacon node runs as

    ./build/nimbus \
       --network=mainnet \
       --sync-mode=flare \
       --engine-api=true \
       --engine-api-port=8551 \
       --engine-api-ws=true \
       --jwt-secret=/tmp/jwtsecret \
       ...

Note that *--engine-api-port=8551* and *--jwt-secret=/tmp/jwtsecret* match
the corresponding options from the *nimbus-eth2* beacon source example.
### Syncing on a low memory machine
On a system with *8GiB* of memory, the following additional options proved
useful for *nimbus* to reduce the memory footprint.
For the *Era1* pre-load (if any) the following extra options apply to
"*nimbus import*":

    --chunk-size=1024
    --debug-rocksdb-row-cache-size=512000
    --debug-rocksdb-block-cache-size=1500000

To start syncing, the following additional options apply to *nimbus*:

    --debug-flare-chunk-size=384
    --debug-rocksdb-max-open-files=384
    --debug-rocksdb-write-buffer-size=50331648
    --debug-rocksdb-block-cache-size=1073741824
    --debug-rdb-key-cache-size=67108864
    --debug-rdb-vtx-cache-size=268435456

Also, to reduce the backlog for *nimbus-eth2* stored on disk, the following
changes might be considered. For file
*nimbus-eth2/vendor/mainnet/metadata/config.yaml* change setting constants:

    MIN_EPOCHS_FOR_BLOCK_REQUESTS: 33024
    MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS: 4096

to

    MIN_EPOCHS_FOR_BLOCK_REQUESTS: 8
    MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS: 8

Caveat: These changes are not useful when running *nimbus_beacon_node* as a
production system.
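For a feeling of what these constants mean in wall-clock terms, assuming
MainNet timing of 32 slots per epoch and 12 seconds per slot:

    # Back of the envelope: wall-clock span covered by a number of epochs.
    def epoch_span_days(epochs: int) -> float:
        return epochs * 32 * 12 / 86400   # slots/epoch * secs/slot / secs/day

    print(epoch_span_days(33024))   # default backlog: ~146.8 days
    print(epoch_span_days(8))       # reduced backlog: ~51 minutes
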
Metrics
-------
The following metrics are defined in *worker/update/metrics.nim*. They are
available if *nimbus* is compiled with the additional make flags
*NIMFLAGS="-d:metrics \-\-threads:on"*:

| *Variable* | *Logic type* | *Short description* |
|:------------------------------|:------------:|:--------------------|
| | | |
| flare_state_block_number | block height | **T**, *increasing* |
| flare_base_block_number | block height | **B**, *increasing* |
| flare_least_block_number | block height | **L** |
| flare_final_block_number | block height | **F**, *increasing* |
| flare_beacon_block_number | block height | **Z**, *increasing* |
| | | |
| flare_headers_staged_queue_len| size | # of staged header list records |
| flare_headers_unprocessed | size | # of accumulated header block numbers|
| flare_blocks_staged_queue_len | size | # of staged block list records |
| flare_blocks_unprocessed | size | # of accumulated body block numbers |
| | | |
| flare_number_of_buddies | size | # of working peers |