Apply some basic rocksdb options (#2339)

These options, inspired by Nethermind and general internet wisdom, bring
the database size down to roughly 2/3 of what it was without affecting
throughput. In theory, they should also reduce memory usage and/or make
more efficient use of whatever memory is already assigned to rocksdb, but
this needs verification in a longer test at synced-mainnet sizes.

In the meantime, they make testing easier by removing some noise that
the profiler flags as bad, such as excessive SkipList access (countered
by bloom filters).
Jacek Sieka 2024-06-12 14:52:27 +02:00 committed by GitHub
parent 1b784695d5
commit 54f793f946
2 changed files with 80 additions and 6 deletions

@@ -43,12 +43,49 @@ proc init*(
except OSError, IOError:
  return err((RdbBeCantCreateDataDir, ""))
-let
-  cfOpts = defaultColFamilyOptions()
# TODO the configuration options below have not been tuned but are rather
# based on gut feeling, guesses and by looking at other clients - it
# would make sense to test different settings and combinations once the
# data model itself has settled down as their optimal values will depend
# on the shape of the data - it'll also be different per column family..
+let cfOpts = defaultColFamilyOptions()
if opts.writeBufferSize > 0:
  cfOpts.setWriteBufferSize(opts.writeBufferSize)
# Without this option, the WAL might never get flushed since a small column
# family (like the admin CF) with only tiny writes might keep it open - this
# negatively affects startup times since the WAL is replayed on every startup.
# https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L719
# Flushing the oldest
let writeBufferSize =
  if opts.writeBufferSize > 0:
    opts.writeBufferSize
  else:
    64 * 1024 * 1024 # TODO read from rocksdb?
cfOpts.setMaxTotalWalSize(2 * writeBufferSize)
# When data is written to rocksdb, it is first put in an in-memory table
# whose index is a skip list. Since the mem table holds the most recent data,
# all reads must go through this skiplist which results in slow lookups for
# already-written data.
# We enable a bloom filter on the mem table to avoid this lookup in the cases
# where the data is actually on disk already (ie wasn't updated recently).
# TODO there's also a hashskiplist that has both a hash index and a skip list
# which maybe could be used - uses more memory, requires a key prefix
# extractor
cfOpts.setMemtableWholeKeyFiltering(true)
cfOpts.setMemtablePrefixBloomSizeRatio(0.1)
# LZ4 seems to cut database size to 2/3 roughly, at the time of writing
# Using it for the bottom-most level means it applies to 90% of data but
# delays compression until data has settled a bit, which seems like a
# reasonable tradeoff.
# TODO evaluate zstd compression with a trained dictionary
# https://github.com/facebook/rocksdb/wiki/Compression
cfOpts.setBottommostCompression(Compression.lz4Compression)
let
  cfs = @[initColFamilyDescriptor(AdmCF, cfOpts),
          initColFamilyDescriptor(VtxCF, cfOpts),
@@ -60,15 +97,52 @@ proc init*(
dbOpts.setMaxBytesForLevelBase(opts.writeBufferSize)
if opts.rowCacheSize > 0:
  # Good for GET queries, which is what we do most of the time - if we start
  # using range queries, we should probably give more attention to the block
  # cache
  # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L1276
  dbOpts.setRowCache(cacheCreateLRU(opts.rowCacheSize))
-if opts.blockCacheSize > 0:
# We mostly look up data we know is there, so we don't need filters at the
# last level of the database - this option saves 90% bloom filter memory usage
# TODO verify this point
# https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md#indexes-and-filter-blocks
# https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/advanced_options.h#L696
dbOpts.setOptimizeFiltersForHits(true)
let tableOpts = defaultTableOptions()
# This bloom filter helps avoid having to read multiple SST files when looking
# for a value.
# A ribbon filter sized as the equivalent of a 9.9-bits-per-key bloom filter
# takes ~7 bits per key and has a ~1% false positive rate, which feels like a
# good enough starting point, though this should be better investigated.
# https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter
# https://github.com/facebook/rocksdb/blob/d64eac28d32a025770cba641ea04e697f475cdd6/include/rocksdb/filter_policy.h#L208
tableOpts.setFilterPolicy(createRibbonHybrid(9.9))
if opts.blockCacheSize > 0:
  # Size the block cache from its own setting rather than from the row cache
  tableOpts.setBlockCache(cacheCreateLRU(opts.blockCacheSize))
# Single-level indices might cause long stalls due to their large size -
# two-level indexing allows the first level to be kept in memory at all times
# while the second level is partitioned resulting in smoother loading
# https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters#how-to-use-it
tableOpts.setIndexType(IndexType.twoLevelIndexSearch)
tableOpts.setPinTopLevelIndexAndFilter(true)
tableOpts.setCacheIndexAndFilterBlocksWithHighPriority(true)
tableOpts.setPartitionFilters(true) # TODO do we need this?
# This option adds a small hash index to each data block, presumably speeding
# up Get queries (but again not range queries) - takes up space, apparently
# a good tradeoff for most workloads
# https://github.com/facebook/rocksdb/wiki/Data-Block-Hash-Index
tableOpts.setDataBlockIndexType(DataBlockIndexType.binarySearchAndHash)
tableOpts.setDataBlockHashRatio(0.75)
dbOpts.setBlockBasedTableFactory(tableOpts)
# Reserve a family corner for `Aristo` on the database
-let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies=cfs).valueOr:
+let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies = cfs).valueOr:
   raiseAssert initFailed & " cannot create base descriptor: " & error
# Initialise column handlers (this stores implicitly `baseDb`)
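
The filter sizing quoted in the comment above `createRibbonHybrid(9.9)` can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming the textbook Bloom filter false-positive formula with an optimal number of hash functions, and assuming a ~30% space saving for ribbon filters as implied by the comment's "9.9 bits -> ~7 bits" figures. `bloomFalsePositiveRate` and `ribbonBitsPerKey` are hypothetical helper names for illustration and are not part of nim-rocksdb.

```nim
import std/math

proc bloomFalsePositiveRate(bitsPerKey: float): float =
  ## Idealised false-positive rate of a classic Bloom filter that uses the
  ## optimal number of hash functions, k = ln(2) * bitsPerKey.
  let k = ln(2.0) * bitsPerKey
  pow(1.0 - exp(-k / bitsPerKey), k)

proc ribbonBitsPerKey(bloomEquivalentBitsPerKey: float; saving = 0.30): float =
  ## Estimated ribbon filter size for the same false-positive rate.
  ## The 30% default saving is an assumption taken from the comment's figures.
  bloomEquivalentBitsPerKey * (1.0 - saving)

when isMainModule:
  echo "bloom false-positive rate: ", bloomFalsePositiveRate(9.9)
  echo "ribbon bits per key:       ", ribbonBitsPerKey(9.9)
```

For 9.9 bits per key the idealised rate comes out just under 1%, and the ribbon estimate lands close to the 7 bits per key mentioned in the comment. Real filters deviate somewhat from the idealised formula, so this only confirms that the numbers are in the right range.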

vendor/nim-rocksdb vendored

@@ -1 +1 @@
-Subproject commit e34c8e825cf0d9b3dfdb9bec64b7e8b05de69a24
+Subproject commit 138dadac9c8a46462059bc136c953bb2fa41fbbe
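
Both the `setBottommostCompression` and `setOptimizeFiltersForHits` comments in the first file rely on the last level holding about 90% of the data. The sketch below shows roughly where that figure comes from. It assumes leveled compaction with every level full and each level a fixed multiple of the previous one (RocksDB's default `max_bytes_for_level_multiplier` is 10). `bottomLevelShare` is a hypothetical helper written for this illustration, and a real database will only approximate the idealised shape.

```nim
proc bottomLevelShare(levels: int; multiplier = 10.0): float =
  ## Fraction of all data stored in the last of `levels` full levels when
  ## each level is `multiplier` times larger than the one above it.
  doAssert levels >= 1, "need at least one level"
  var size = 1.0   # relative size of the current level
  var total = 0.0  # running sum of all level sizes
  for _ in 1 .. levels:
    total += size
    size *= multiplier
  # `size` has been scaled one step past the last level, so undo that step.
  (size / multiplier) / total

when isMainModule:
  for n in 1 .. 6:
    echo n, " level(s): ", bottomLevelShare(n)
```

With a multiplier of 10 the share settles at about 0.9 once there are a few levels. That is consistent with compressing only the bottommost level still covering most of the data, and with skipping last-level filters saving most of the filter memory.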