Apply some basic rocksdb options (#2339)
These options, inspired by Nethermind and general internet wisdom, bring the database size down to roughly 2/3 without affecting throughput. In theory, they should also reduce memory usage and/or make more efficient use of whatever memory is already assigned to rocksdb, but this needs verification in a longer test at synced-mainnet sizes. In the meantime, they make testing easier by removing some of the noise the profiler flags as problematic, such as excessive SkipList access (now countered by bloom filters).
Parent: 1b784695d5
Commit: 54f793f946
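For orientation, the sketch below pulls the headline options out of the diff: the memtable bloom filter that removes the SkipList noise and the bottom-most LZ4 compression that shrinks the database. It is a condensed illustration, not the full `init` proc changed below; the `import rocksdb` module name and the hard-coded 64 MiB write buffer are assumptions taken from the diff rather than code from this commit.

import rocksdb # nim-rocksdb bindings; provides the option setters used in the diff

let cfOpts = defaultColFamilyOptions()

# Bound the WAL so a small, rarely-written column family cannot keep it
# growing forever (twice the assumed 64 MiB write buffer, as in the diff)
let writeBufferSize = 64 * 1024 * 1024
cfOpts.setMaxTotalWalSize(2 * writeBufferSize)

# Memtable bloom filter: lookups of data that already sits on disk skip the
# skip-list walk through the memtable - the "SkipList noise" from the message
cfOpts.setMemtableWholeKeyFiltering(true)
cfOpts.setMemtablePrefixBloomSizeRatio(0.1)

# LZ4 on the bottom-most level only: covers roughly 90% of the data and is
# what brings the database size down to about 2/3
cfOpts.setBottommostCompression(Compression.lz4Compression)

# Per-SST-file ribbon filter and two-level index, shared via the table options
let tableOpts = defaultTableOptions()
tableOpts.setFilterPolicy(createRibbonHybrid(9.9))
tableOpts.setIndexType(IndexType.twoLevelIndexSearch)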
@@ -43,12 +43,49 @@ proc init*(
   except OSError, IOError:
     return err((RdbBeCantCreateDataDir, ""))
 
-  let
-    cfOpts = defaultColFamilyOptions()
+  # TODO the configuration options below have not been tuned but are rather
+  # based on gut feeling, guesses and by looking at other clients - it
+  # would make sense to test different settings and combinations once the
+  # data model itself has settled down as their optimal values will depend
+  # on the shape of the data - it'll also be different per column family..
+  let cfOpts = defaultColFamilyOptions()
 
   if opts.writeBufferSize > 0:
     cfOpts.setWriteBufferSize(opts.writeBufferSize)
 
+  # Without this option, the WAL might never get flushed since a small column
+  # family (like the admin CF) with only tiny writes might keep it open - this
+  # negatively affects startup times since the WAL is replayed on every startup.
+  # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L719
+  # Flushing the oldest
+  let writeBufferSize =
+    if opts.writeBufferSize > 0:
+      opts.writeBufferSize
+    else:
+      64 * 1024 * 1024 # TODO read from rocksdb?
+
+  cfOpts.setMaxTotalWalSize(2 * writeBufferSize)
+
+  # When data is written to rocksdb, it is first put in an in-memory table
+  # whose index is a skip list. Since the mem table holds the most recent data,
+  # all reads must go through this skiplist which results in slow lookups for
+  # already-written data.
+  # We enable a bloom filter on the mem table to avoid this lookup in the cases
+  # where the data is actually on disk already (ie wasn't updated recently).
+  # TODO there's also a hashskiplist that has both a hash index and a skip list
+  # which maybe could be used - uses more memory, requires a key prefix
+  # extractor
+  cfOpts.setMemtableWholeKeyFiltering(true)
+  cfOpts.setMemtablePrefixBloomSizeRatio(0.1)
+
+  # LZ4 seems to cut database size to 2/3 roughly, at the time of writing
+  # Using it for the bottom-most level means it applies to 90% of data but
+  # delays compression until data has settled a bit, which seems like a
+  # reasonable tradeoff.
+  # TODO evaluate zstd compression with a trained dictionary
+  # https://github.com/facebook/rocksdb/wiki/Compression
+  cfOpts.setBottommostCompression(Compression.lz4Compression)
+
   let
     cfs = @[initColFamilyDescriptor(AdmCF, cfOpts),
             initColFamilyDescriptor(VtxCF, cfOpts),
@@ -60,15 +97,52 @@ proc init*(
   dbOpts.setMaxBytesForLevelBase(opts.writeBufferSize)
 
   if opts.rowCacheSize > 0:
+    # Good for GET queries, which is what we do most of the time - if we start
+    # using range queries, we should probably give more attention to the block
+    # cache
+    # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/options.h#L1276
     dbOpts.setRowCache(cacheCreateLRU(opts.rowCacheSize))
 
+  # We mostly look up data we know is there, so we don't need filters at the
+  # last level of the database - this option saves 90% bloom filter memory usage
+  # TODO verify this point
+  # https://github.com/EighteenZi/rocksdb_wiki/blob/master/Memory-usage-in-RocksDB.md#indexes-and-filter-blocks
+  # https://github.com/facebook/rocksdb/blob/af50823069818fc127438e39fef91d2486d6e76c/include/rocksdb/advanced_options.h#L696
+  dbOpts.setOptimizeFiltersForHits(true)
+
+  let tableOpts = defaultTableOptions()
+  # This bloom filter helps avoid having to read multiple SST files when looking
+  # for a value.
+  # A 9.9-bits-per-key ribbon filter takes ~7 bits per key and has a 1% false
+  # positive rate which feels like a good enough starting point, though this
+  # should be better investigated.
+  # https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#ribbon-filter
+  # https://github.com/facebook/rocksdb/blob/d64eac28d32a025770cba641ea04e697f475cdd6/include/rocksdb/filter_policy.h#L208
+  tableOpts.setFilterPolicy(createRibbonHybrid(9.9))
+
   if opts.blockCacheSize > 0:
-    let tableOpts = defaultTableOptions()
     tableOpts.setBlockCache(cacheCreateLRU(opts.blockCacheSize))
-    dbOpts.setBlockBasedTableFactory(tableOpts)
 
+  # Single-level indices might cause long stalls due to their large size -
+  # two-level indexing allows the first level to be kept in memory at all times
+  # while the second level is partitioned resulting in smoother loading
+  # https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters#how-to-use-it
+  tableOpts.setIndexType(IndexType.twoLevelIndexSearch)
+  tableOpts.setPinTopLevelIndexAndFilter(true)
+  tableOpts.setCacheIndexAndFilterBlocksWithHighPriority(true)
+  tableOpts.setPartitionFilters(true) # TODO do we need this?
+
+  # This option adds a small hash index to each data block, presumably speeding
+  # up Get queries (but again not range queries) - takes up space, apparently
+  # a good tradeoff for most workloads
+  # https://github.com/facebook/rocksdb/wiki/Data-Block-Hash-Index
+  tableOpts.setDataBlockIndexType(DataBlockIndexType.binarySearchAndHash)
+  tableOpts.setDataBlockHashRatio(0.75)
+
+  dbOpts.setBlockBasedTableFactory(tableOpts)
+
   # Reserve a family corner for `Aristo` on the database
-  let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies=cfs).valueOr:
+  let baseDb = openRocksDb(dataDir, dbOpts, columnFamilies = cfs).valueOr:
     raiseAssert initFailed & " cannot create base descriptor: " & error
 
   # Initialise column handlers (this stores implicitly `baseDb`)
@@ -1 +1 @@
-Subproject commit e34c8e825cf0d9b3dfdb9bec64b7e8b05de69a24
+Subproject commit 138dadac9c8a46462059bc136c953bb2fa41fbbe