Apply some basic rocksdb options (#2339)
These options, inspired by Nethermind and general internet wisdom, bring the database size down to roughly 2/3 without affecting throughput. In theory, they should also reduce memory usage and/or make more efficient use of whatever memory is already assigned to rocksdb, but this needs verification in a longer test at synced-mainnet sizes. In the meantime, they make testing easier by removing some of the noise the profiler flags as problematic, such as excessive SkipList access (now countered by bloom filters).
Parent: 1b784695d5
Commit: 54f793f946
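For orientation, the sketch below pulls the headline options out of the diff: the memtable bloom filter that removes the SkipList noise and the bottom-most LZ4 compression that shrinks the database. It is a condensed illustration, not the full `init` proc changed below; the `import rocksdb` module name and the hard-coded 64 MiB write buffer are assumptions taken from the diff rather than code from this commit.

import rocksdb # nim-rocksdb bindings; provides the option setters used in the diff

let cfOpts = defaultColFamilyOptions()

# Bound the WAL so a small, rarely-written column family cannot keep it
# growing forever (twice the assumed 64 MiB write buffer, as in the diff)
let writeBufferSize = 64 * 1024 * 1024
cfOpts.setMaxTotalWalSize(2 * writeBufferSize)

# Memtable bloom filter: lookups of data that already sits on disk skip the
# skip-list walk through the memtable - the "SkipList noise" from the message
cfOpts.setMemtableWholeKeyFiltering(true)
cfOpts.setMemtablePrefixBloomSizeRatio(0.1)

# LZ4 on the bottom-most level only: covers roughly 90% of the data and is
# what brings the database size down to about 2/3
cfOpts.setBottommostCompression(Compression.lz4Compression)

# Per-SST-file ribbon filter and two-level index, shared via the table options
let tableOpts = defaultTableOptions()
tableOpts.setFilterPolicy(createRibbonHybrid(9.9))
tableOpts.setIndexType(IndexType.twoLevelIndexSearch)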
@@ -43,12 +43,49 @@ proc init*(
   except OSError, IOError:
     return err((RdbBeCantCreateDataDir, ""))
 
-  let
-    cfOpts = defaultColFamilyOptions()
+  # TODO the configuration options below have not been tuned but are rather
+  # based on gut feeling, guesses and by looking at other clients - it
+  # would make sense to test different settings and combinations once the
+  # data model itself has settled down as their optimal values will depend
+  # on the shape of the data - it'll also be different per column family..
+  let cfOpts = defaultColFamilyOptions()
 
   if opts.writeBufferSize > 0:
     cfOpts.setWriteBufferSize(opts.writeBufferSize)
 
+  # Without this option, the WAL might never get flushed since a small column
+  # family (like the admin CF) with only tiny writes might keep it open - this
+  # negatively affects startup times since the WAL is replayed on every startup.
+  # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L719
+  # Flushing the oldest
+  let writeBufferSize =
+    if opts.writeBufferSize > 0:
+      opts.writeBufferSize
+    else:
+      64 * 1024 * 1024 # TODO read from rocksdb?
+
+  cfOpts.setMaxTotalWalSize(2 * writeBufferSize)
+
+  # When data is written to rocksdb, it is first put in an in-memory table
+  # whose index is a skip list. Since the mem table holds the most recent data,
+  # all reads must go through this skiplist which results in slow lookups for
+  # already-written data.
+  # We enable a bloom filter on the mem table to avoid this lookup in the cases
+  # where the data is actually on disk already (ie wasn't updated recently).
+  # TODO there's also a hashskiplist that has both a hash index and a skip list
+  # which maybe could be used - uses more memory, requires a key prefix
+  # extractor
+  cfOpts.setMemtableWholeKeyFiltering(true)
+  cfOpts.setMemtablePrefixBloomSizeRatio(0.1)
+
+  # LZ4 seems to cut database size to 2/3 roughly, at the time of writing
+  # Using it for the bottom-most level means it applies to 90% of data but
+  # delays compression until data has settled a bit, which seems like a
+  # reasonable tradeoff.
+  # TODO evaluate zstd compression with a trained dictionary
+  # https://github.com/facebook/rocksdb/wiki/Compression
+  cfOpts.setBottommostCompression(Compression.lz4Compression)
+
   let
     cfs = @[initColFamilyDescriptor(AdmCF, cfOpts),
             initColFamilyDescriptor(VtxCF, cfOpts),
@@ -60,15 +97,52 @@ proc init*(
   dbOpts.setMaxBytesForLevelBase(opts.writeBufferSize)
 
   if opts.rowCacheSize > 0:
+    # Good for GET queries, which is what we do most of the time - if we start
+    # using range queries, we should probably give more attention to the block
+    # cache
+    # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L1276
     dbOpts.setRowCache(cacheCreateLRU(opts.rowCacheSize))
 
+  # We mostly look up data we know is there, so we don't need filters at the
+  # last level of the database - this option saves 90% bloom filter memory usage
+  # TODO verify this point
+  # https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md#indexes-and-filter-blocks
+  # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/advanced_options.h#L696
+  dbOpts.setOptimizeFiltersForHits(true)
+
+  let tableOpts = defaultTableOptions()
+  # This bloom filter helps avoid having to read multiple SST files when looking
+  # for a value.
+  # A 9.9-bits-per-key ribbon filter takes ~7 bits per key and has a 1% false
+  # positive rate which feels like a good enough starting point, though this
+  # should be better investigated.
+  # https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter
+  # https://github.com/facebook/rocksdb/blob/d64eac28d32a025770cba641ea04e697f475cdd6/include/rocksdb/filter_policy.h#L208
+  tableOpts.setFilterPolicy(createRibbonHybrid(9.9))
+
   if opts.blockCacheSize > 0:
-    let tableOpts = defaultTableOptions()
     tableOpts.setBlockCache(cacheCreateLRU(opts.blockCacheSize))
-    dbOpts.setBlockBasedTableFactory(tableOpts)
 
+  # Single-level indices might cause long stalls due to their large size -
+  # two-level indexing allows the first level to be kept in memory at all times
+  # while the second level is partitioned resulting in smoother loading
+  # https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters#how-to-use-it
+  tableOpts.setIndexType(IndexType.twoLevelIndexSearch)
+  tableOpts.setPinTopLevelIndexAndFilter(true)
+  tableOpts.setCacheIndexAndFilterBlocksWithHighPriority(true)
+  tableOpts.setPartitionFilters(true) # TODO do we need this?
+
+  # This option adds a small hash index to each data block, presumably speeding
+  # up Get queries (but again not range queries) - takes up space, apparently
+  # a good tradeoff for most workloads
+  # https://github.com/facebook/rocksdb/wiki/Data-Block-Hash-Index
+  tableOpts.setDataBlockIndexType(DataBlockIndexType.binarySearchAndHash)
+  tableOpts.setDataBlockHashRatio(0.75)
+
+  dbOpts.setBlockBasedTableFactory(tableOpts)
+
   # Reserve a family corner for `Aristo` on the database
-  let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies=cfs).valueOr:
+  let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies = cfs).valueOr:
     raiseAssert initFailed & " cannot create base descriptor: " & error
 
   # Initialise column handlers (this stores implicitly `baseDb`)
@@ -1 +1 @@
-Subproject commit e34c8e825cf0d9b3dfdb9bec64b7e8b05de69a24
+Subproject commit 138dadac9c8a46462059bc136c953bb2fa41fbbe