# nim-bloom
***NOTE: THIS IMPLEMENTATION IS NOT PEER-REVIEWED YET. PLEASE USE WITH CAUTION.***

A high-performance Bloom filter implementation in Nim offering standard and custom hash function options with different performance characteristics and false positive rates.

## Features

- Fast string element insertion and lookup
- Configurable error rates
- Choice between standard Nim hash and custom MurmurHash3 (128-bit or 32-bit)
- Tunable for different speed and accuracy trade-offs
- Comprehensive test suite and benchmarks

## Usage

Basic usage (defaults to MurmurHash3_128):
```nim
import bloom2

# Initialize with default hash (MurmurHash3_128)
var bf = initializeBloomFilter(capacity = 10000, errorRate = 0.01)

# Or explicitly specify hash type
var bf32 = initializeBloomFilter(
  capacity = 10000, 
  errorRate = 0.01,
  hashType = htMurmur32  # Use 32-bit implementation
)

# Basic operations
bf.insert("test")
assert bf.lookup("test")
```
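
For orientation, `capacity` and `errorRate` are the usual inputs to the standard Bloom filter sizing formulas. The sketch below derives a bit count and a number of hash functions from them. The helper name `sizeFor` is hypothetical, and whether `bloom2` uses exactly these formulas internally is an assumption.

```nim
import std/math

# Hypothetical helper, not part of bloom2: textbook Bloom filter sizing.
#   m = -n * ln(p) / (ln 2)^2   (number of bits)
#   k = (m / n) * ln 2          (number of hash functions)
proc sizeFor(capacity: int, errorRate: float): tuple[bits: int, hashes: int] =
  let m = ceil(-float(capacity) * ln(errorRate) / (ln(2.0) * ln(2.0)))
  let k = max(1, int(round(m / float(capacity) * ln(2.0))))
  (bits: int(m), hashes: k)

when isMainModule:
  let (bits, hashes) = sizeFor(10_000, 0.01)
  echo "bits: ", bits, ", hash functions: ", hashes
```

A lower `errorRate` or a higher `capacity` increases the bit count, which is the memory cost of the filter.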

## Hash Function Selection

1. Use MurmurHash3_128 (default) when:
    - You need the best balance of performance and accuracy
    - Memory isn't severely constrained
    - Working with large datasets
    - False positive rates are important

2. Use MurmurHash3_32 when:
    - Running on 32-bit systems
    - Memory is constrained
    - Working with smaller datasets
    - The string concatenation overhead of the second hash, which raises insertion and lookup times, is acceptable

3. Use NimHash when:
    - Consistency with Nim's hashing is important
    - Working with smaller datasets where performance is less critical
    - You want to benefit from future improvements to Nim's hash functions or faster implementations of them

Nim's Hash Implementation:
  - Default (no flags): Uses FarmHash implementation
  - With `-d:nimStringHash2`: Uses Nim's MurmurHash3_32 implementation
  - Our implementation lets you choose the hash explicitly regardless of compilation flags, and our MurmurHash3_32 is faster because it calls a native C implementation directly

## Performance Characteristics
### For 1M items - Random Strings
```
Insertion Speed:
MurmurHash3_128: ~6.8M ops/sec
MurmurHash3_32:  ~5.9M ops/sec
FarmHash:        ~2.1M ops/sec

False Positive Rates:
MurmurHash3_128: ~0.84%
MurmurHash3_32:  ~0.83%
FarmHash:        ~0.82%
```

These measurements show MurmurHash3_128's balanced performance profile: it has the highest insertion speed, and its false positive rate is within about 0.02 percentage points of the other two.

Performance will vary based on:
- Choice of hash function
- Hardware specifications
- Data size and memory access patterns (inside vs outside cache)
- Compiler optimizations

For detailed benchmarks across different data patterns and sizes, see [benches](benches/).
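
As a rough illustration of how an ops/sec figure like those above can be obtained, here is a minimal timing sketch. It is not the harness in `benches/`: the `opsPerSec` helper is hypothetical, and it times Nim's standard `hash` rather than a filter insertion so that it runs without this library.

```nim
import std/[hashes, monotimes, times]

# Hypothetical harness: runs `op` once per item and returns operations per second.
proc opsPerSec(items: seq[string], op: proc (s: string)): float =
  let start = getMonoTime()
  for s in items:
    op(s)
  # Clamp to 1 ns so a very fast run cannot divide by zero.
  let elapsedNs = max(1'i64, (getMonoTime() - start).inNanoseconds)
  float(items.len) * 1e9 / float(elapsedNs)

when isMainModule:
  var items = newSeq[string](100_000)
  for i in 0 ..< items.len:
    items[i] = "item-" & $i

  var sink: Hash = 0          # accumulates results so the work is not optimized away
  proc hashOne(s: string) =
    sink = sink xor hash(s)

  echo "hash ops/sec: ", opsPerSec(items, hashOne), " (sink: ", sink, ")"
```

Compile with `-d:release` before comparing numbers, since debug builds are much slower.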

## Implementation Details

### Double Hashing Technique
This implementation uses the Kirsch-Mitzenmacher method to generate k hash values from two initial hashes. The implementation varies by hash type:

1. MurmurHash3_128:
```nim
h(i) = abs((hash1 + i * hash2) mod m)
```
- Uses both 64-bit hashes from 128-bit output
- Natural double-hash implementation

2. MurmurHash3_32:
```nim
let baseHash = murmurHash32(item, 0'u32)
let secondHash = murmurHash32(item & " b", 0'u32)
```
- Uses string concatenation by default for the second hash
- Bit rotation for the second hash is much faster than string concatenation and provides sufficient randomness for some use cases, but results in a higher FP rate
- Choose between bit rotation and string concatenation according to your use case

3. Nim's Hash:
```nim
let
  hashA = abs(hash(item)) mod maxValue
  hashB = abs(hash(item & " b")) mod maxValue
# Index for the i-th hash function:
h(i) = abs(hashA + i * hashB) mod maxValue
```
- Based on FarmHash, or on Nim's MurmurHash3_32 if the `-d:nimStringHash2` compilation flag is passed
- Uses string concatenation by default
- Lower FP rate than bit rotation, at the cost of higher insertion and lookup times
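
To make the double hashing scheme concrete, the following is a minimal, self-contained sketch of a filter built on Nim's standard `hash` with string concatenation for the second hash. `MiniBloom`, `initMiniBloom` and `bitIndices` are hypothetical names for illustration, not this library's API. The sketch also swaps in a cast to an unsigned integer where the snippet above uses `abs`, so that negative hash values need no special handling.

```nim
import std/hashes

type MiniBloom = object        # hypothetical type, for illustration only
  bits: seq[bool]
  k: int                       # number of derived hash functions

proc initMiniBloom(m, k: int): MiniBloom =
  MiniBloom(bits: newSeq[bool](m), k: k)

# Yields the k Kirsch-Mitzenmacher indices h(i) = (hashA + i * hashB) mod m.
iterator bitIndices(bf: MiniBloom, item: string): int =
  let m = uint64(bf.bits.len)
  # Reinterpret the signed Hash as unsigned instead of calling abs().
  let hashA = uint64(cast[uint](hash(item))) mod m
  let hashB = uint64(cast[uint](hash(item & " b"))) mod m
  for i in 0 ..< bf.k:
    yield int((hashA + uint64(i) * hashB) mod m)

proc insert(bf: var MiniBloom, item: string) =
  for idx in bf.bitIndices(item):
    bf.bits[idx] = true

proc lookup(bf: MiniBloom, item: string): bool =
  for idx in bf.bitIndices(item):
    if not bf.bits[idx]:
      return false
  true

when isMainModule:
  var bf = initMiniBloom(1024, 7)
  bf.insert("test")
  echo bf.lookup("test")
```

Only two hash computations are made per item, however large `k` is; the remaining indices come from integer arithmetic.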

*Tip:* Bit rotation values can be configured as well. Use odd, ideally prime, rotation amounts for better mixing: 7, 11, 13, 17 for 32-bit; 21, 23, 27, 33 for 64-bit. Smaller rotations provide less mixing but are faster than larger rotations.
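
A minimal sketch of the bit rotation alternative, assuming the second hash is derived from the first with `rotateLeftBits` from `std/bitops`. The helper name `secondHashByRotation` and its default rotation of 13 are hypothetical, and Nim's standard `hash` stands in for `murmurHash32` so the example runs on its own.

```nim
import std/[bitops, hashes]

# Hypothetical helper: derive the second 32-bit hash by rotating the first,
# instead of hashing `item & " b"` a second time.
proc secondHashByRotation(baseHash: uint32, rotation: range[0..32] = 13): uint32 =
  rotateLeftBits(baseHash, rotation)

when isMainModule:
  let item = "test"
  # Keep the low 32 bits of Nim's hash as a stand-in for murmurHash32(item, 0'u32).
  let baseHash = uint32(cast[uint](hash(item)) and 0xFFFF_FFFF'u)
  let secondHash = secondHashByRotation(baseHash)
  echo baseHash, " -> ", secondHash
```

Because the second hash is a fixed function of the first, two items whose base hashes collide also share every derived index, which is consistent with the higher FP rate noted above.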

## Testing

Run the test suite:
```bash
nimble test
```

Initial commit 2024-12-09 13:22:08 +04:00