Understanding Log-Structured Merge Trees through FerrisDB’s implementation.
Difficulty: 📙 Intermediate • Reading time: 25 minutes
Have you ever wondered why your web application slows down when lots of users start writing data at the same time? Or why some databases can handle millions of writes per second while others struggle with thousands?
The problem lies in how traditional databases handle writes. Imagine you’re managing a customer database for an e-commerce site. Every time someone places an order, updates their profile, or adds items to their cart, your database needs to write that information to disk.
Traditional B-tree databases face a fundamental challenge. When you write data, the database has to:

1. Find the correct page on disk (a seek)
2. Read that page into memory
3. Modify it
4. Write the entire page back

Every single write turns into random disk I/O.
Real-world analogy: It’s like having to find the exact right spot in a filing cabinet, pull out the entire folder, add your single document, and put it back - for every single document. If the cabinet is across the building (representing disk storage), this gets incredibly slow!
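To make that cost difference concrete, here is a small standalone Rust sketch (illustrative only, file names invented, not FerrisDB code) contrasting the two I/O patterns: overwriting fixed-size records in place, which needs a seek per write, versus appending every change to the end of a log.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

const RECORD_SIZE: usize = 128;

fn main() -> std::io::Result<()> {
    let updates: [u64; 5] = [7, 3, 9, 1, 5]; // record ids arriving in random order

    // In-place updates: seek to each record's slot, then overwrite it (random I/O).
    let mut in_place = OpenOptions::new()
        .create(true)
        .write(true)
        .open("in_place.db")?;
    for id in updates {
        in_place.seek(SeekFrom::Start(id * RECORD_SIZE as u64))?;
        in_place.write_all(&[0xAB_u8; RECORD_SIZE])?;
    }

    // Log-structured updates: every change just goes to the end (sequential I/O).
    let mut log = OpenOptions::new()
        .create(true)
        .append(true)
        .open("append_only.log")?;
    for id in updates {
        log.write_all(&id.to_le_bytes())?;
        log.write_all(&[0xCD_u8; RECORD_SIZE])?;
    }

    Ok(())
}
```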
LSM-Trees solve this by completely rethinking how databases handle writes.
LSM-Trees (Log-Structured Merge Trees) solve the write performance problem with a simple but revolutionary idea:
Core LSM Principle
“Instead of updating data in place, append all changes and merge them later”
Think of it like a restaurant’s order system: instead of filing every order the moment it arrives, servers jot each one down on a notepad as it comes in, and the orders are organized and filed later during a quiet moment.
This transforms expensive random writes (finding the exact spot) into fast sequential writes (just append to the end).
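Here is a toy standalone sketch of that idea (plain Rust, not FerrisDB's API): every change is appended with a timestamp, reads scan from newest to oldest, and a later merge step keeps only the latest version of each key.

```rust
use std::collections::HashMap;

/// One appended change: (key, timestamp, Some(value) for Put, None for Delete).
type LogEntry = (String, u64, Option<String>);

/// Reads scan the log from newest to oldest and stop at the first match.
fn get(log: &[LogEntry], key: &str) -> Option<String> {
    log.iter()
        .rev()
        .find(|(k, _, _)| k == key)
        .and_then(|(_, _, v)| v.clone())
}

/// "Merge later": keep only the newest version of each key.
fn merge(log: &[LogEntry]) -> Vec<LogEntry> {
    let mut newest: HashMap<&str, &LogEntry> = HashMap::new();
    for entry in log {
        newest.insert(entry.0.as_str(), entry); // later (newer) entries overwrite earlier ones
    }
    let mut merged: Vec<LogEntry> = newest.values().map(|e| (*e).clone()).collect();
    merged.sort();
    merged
}

fn main() {
    let mut log: Vec<LogEntry> = Vec::new();
    // Writes never touch old data: they just append.
    log.push(("cart:42".into(), 1, Some("3 items".into())));
    log.push(("user:7".into(), 2, Some("alice".into())));
    log.push(("cart:42".into(), 3, Some("4 items".into())));

    assert_eq!(get(&log, "cart:42"), Some("4 items".into()));
    println!("merged log: {:?}", merge(&log));
}
```

Real LSM engines split this "log" between memory (the MemTable) and disk (SSTables), but the write path is exactly this append.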
```mermaid
graph LR
    Write[Write Request] --> WAL[WAL<br/>Sequential Write Log]
    WAL --> MemTable[MemTable<br/>In-Memory Skip List]
    MemTable -->|"Flush when full"| SSTable1[SSTable 1<br/>Immutable File]
    MemTable -->|"Flush when full"| SSTable2[SSTable 2<br/>Immutable File]
    SSTable1 -->|Compaction| SSTable3[Merged SSTable]
    SSTable2 -->|Compaction| SSTable3

    style WAL fill:#f9f,stroke:#333,stroke-width:2px
    style MemTable fill:#9ff,stroke:#333,stroke-width:2px
    style SSTable1 fill:#ff9,stroke:#333,stroke-width:2px
    style SSTable2 fill:#ff9,stroke:#333,stroke-width:2px
    style SSTable3 fill:#9f9,stroke:#333,stroke-width:2px
```
Key principles:

- Every write is a sequential append: first to the WAL for durability, then into the in-memory MemTable
- Data on disk is immutable: a full MemTable is flushed to a new SSTable and never edited in place
- Background compaction merges SSTables, discarding stale versions so reads stay manageable
Let’s see how FerrisDB implements each component of the LSM-Tree:
```rust
// ferrisdb-storage/src/memtable/mod.rs:49-67
pub struct MemTable {
    /// Uses Arc for shared ownership in LSM-tree scenarios:
    /// - Storage engine keeps immutable MemTables for reads during flush
    /// - Background threads flush MemTable to SSTable
    /// - Iterators need concurrent access without blocking writes
    skiplist: Arc<SkipList>,
    memory_usage: AtomicUsize,
    max_size: usize,
}

impl MemTable {
    pub fn new(max_size: usize) -> Self {
        Self {
            skiplist: Arc::new(SkipList::new()),
            memory_usage: AtomicUsize::new(0),
            max_size,
        }
    }
}
```
Key design decisions:

- Arc<SkipList> gives shared ownership, so the engine, background flush threads, and iterators can all hold the same (now immutable) MemTable without blocking writers (see the sketch below)
- AtomicUsize tracks memory usage without taking a lock on every write
- max_size sets the threshold at which the MemTable is flushed to an SSTable
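Shared ownership matters most once a full MemTable is frozen for flushing: the engine keeps answering reads from it while a background thread writes it out. A minimal standalone sketch of that pattern, using a plain BTreeMap as a stand-in for the skip list:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;
use std::thread;

fn main() {
    // A frozen (now immutable) memtable, shared via Arc.
    let mut table = BTreeMap::new();
    table.insert("user:7".to_string(), "alice".to_string());
    table.insert("cart:42".to_string(), "4 items".to_string());
    let frozen: Arc<BTreeMap<String, String>> = Arc::new(table);

    // The background "flush" thread gets one handle...
    let for_flush = Arc::clone(&frozen);
    let flusher = thread::spawn(move || {
        // In a real engine this would write a sorted SSTable file.
        for (key, value) in for_flush.iter() {
            println!("flushing {key} => {value}");
        }
    });

    // ...while the read path keeps using another handle concurrently.
    assert_eq!(frozen.get("user:7"), Some(&"alice".to_string()));

    flusher.join().unwrap();
}
```

FerrisDB's Arc<SkipList> serves the same purpose: the flush thread and concurrent readers each hold a cheap clone of the pointer rather than a copy of the data.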
The MemTable is where all writes initially go - think of it as the “notepad” in our restaurant analogy:
```rust
// ferrisdb-storage/src/memtable/mod.rs:81-95
pub fn put(&self, key: Key, value: Value, timestamp: Timestamp) -> Result<()> {
    let size_estimate = key.len() + value.len() + 64; // 64 bytes overhead estimate

    self.skiplist.insert(key, value, timestamp, Operation::Put);

    let new_usage = self
        .memory_usage
        .fetch_add(size_estimate, Ordering::Relaxed);

    if new_usage + size_estimate > self.max_size {
        return Err(Error::MemTableFull);
    }

    Ok(())
}
```
How it works:

1. Estimate the entry's footprint: key length + value length + roughly 64 bytes of overhead
2. Insert the entry into the concurrent skip list
3. Atomically add the estimate to the running memory counter
4. If the total exceeds max_size, return Error::MemTableFull so the engine knows it is time to flush

For example, with a 1 MiB max_size, a put of a 16-byte key and a 100-byte value costs about 180 bytes, so roughly 5,800 such entries fit before the MemTable reports full.
When the MemTable gets full, it’s flushed to an SSTable (Sorted String Table):
```rust
// ferrisdb-storage/src/sstable/mod.rs:179-188
pub struct SSTableEntry {
    /// The internal key (user_key + timestamp)
    pub key: InternalKey,
    /// The value associated with this key version
    pub value: Value,
    /// The operation type (Put/Delete) for this entry
    pub operation: Operation,
}
```
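The ordering of InternalKey is not shown here, but LSM engines commonly sort by user key ascending and then timestamp descending, so the newest version of a key is met first during a scan. A standalone sketch of that convention (an assumption for illustration, not FerrisDB's confirmed layout):

```rust
/// Toy internal key: user key plus the version timestamp.
#[derive(Debug, PartialEq, Eq)]
struct InternalKey {
    user_key: String,
    timestamp: u64,
}

impl Ord for InternalKey {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        // User keys ascending, then timestamps descending (newest first).
        self.user_key
            .cmp(&other.user_key)
            .then(other.timestamp.cmp(&self.timestamp))
    }
}

impl PartialOrd for InternalKey {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let mut keys = vec![
        InternalKey { user_key: "cart:42".into(), timestamp: 1 },
        InternalKey { user_key: "cart:42".into(), timestamp: 3 },
        InternalKey { user_key: "user:7".into(), timestamp: 2 },
    ];
    keys.sort();
    // "cart:42" @ 3 sorts before "cart:42" @ 1: the newest version comes first.
    assert_eq!(keys[0].timestamp, 3);
    println!("{keys:?}");
}
```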
Performance characteristics:

- Write performance improvement: every write is an in-memory skip-list insert plus a sequential WAL append, so the write path never performs random disk I/O
- Read performance trade-off: a read may have to check the MemTable and then several SSTables from newest to oldest, which is exactly what compaction (and, in many engines, bloom filters) keeps under control (see the lookup sketch below)
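The read trade-off is easiest to see in the lookup order: check the MemTable first, then each SSTable from newest to oldest, and stop at the first version found, treating a Delete tombstone as "found but deleted". A simplified standalone sketch of that path (not FerrisDB's actual read code):

```rust
use std::collections::BTreeMap;

/// `None` represents a Delete tombstone; `Some(v)` a Put.
type Table = BTreeMap<String, Option<String>>;

/// Check the MemTable, then SSTables from newest to oldest.
fn get(memtable: &Table, sstables_newest_first: &[Table], key: &str) -> Option<String> {
    if let Some(entry) = memtable.get(key) {
        return entry.clone(); // tombstone => None
    }
    for sstable in sstables_newest_first {
        if let Some(entry) = sstable.get(key) {
            return entry.clone();
        }
    }
    None
}

fn main() {
    let mut memtable = Table::new();
    memtable.insert("cart:42".into(), Some("4 items".into()));

    let mut old_sstable = Table::new();
    old_sstable.insert("cart:42".into(), Some("3 items".into()));
    old_sstable.insert("user:7".into(), None); // deleted key

    // The newest version shadows the older SSTable entry.
    assert_eq!(
        get(&memtable, &[old_sstable.clone()], "cart:42"),
        Some("4 items".into())
    );
    // The tombstone makes the key read as absent.
    assert_eq!(get(&memtable, &[old_sstable], "user:7"), None);
}
```

This is why the number of SSTables on disk directly affects read latency, and why compaction exists at all.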
FerrisDB implements size-tiered compaction:
```rust
// ferrisdb-storage/src/storage_engine.rs (conceptual)
pub fn size_tiered_compaction(&mut self) -> Result<()> {
    // Group SSTables by similar size
    let mut size_tiers: BTreeMap<u64, Vec<SSTable>> = BTreeMap::new();

    for sstable in &self.sstables {
        let size_tier = size_tier_for_size(sstable.metadata.file_size);
        size_tiers.entry(size_tier).or_default().push(sstable.clone());
    }

    // Compact tiers with too many SSTables
    for (_tier_size, sstables) in size_tiers {
        if sstables.len() >= self.config.max_sstables_per_tier {
            self.compact_sstables(sstables)?;
        }
    }

    Ok(())
}
```
How compaction works:

1. Group SSTables into tiers of roughly similar file size
2. When a tier accumulates max_sstables_per_tier files or more, merge them into a single larger SSTable
3. During the merge, only the newest version of each key is kept, and tombstones for deleted keys can be dropped once no older file could still contain the key (sketched below)
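The merge itself can be sketched in a few lines: combine the inputs, keep only the newest version of each key, and drop tombstones once the output covers all older data. A simplified standalone sketch (not FerrisDB's compaction code):

```rust
use std::collections::BTreeMap;

/// (timestamp, Some(value) for Put, None for Delete tombstone).
type Version = (u64, Option<String>);
type Table = BTreeMap<String, Version>;

/// Merge SSTables, keeping only the newest version of each key.
fn compact(inputs: &[Table], drop_tombstones: bool) -> Table {
    let mut merged: Table = BTreeMap::new();
    for table in inputs {
        for (key, version) in table {
            match merged.get(key) {
                // An equal-or-newer version is already present: keep it.
                Some((existing_ts, _)) if *existing_ts >= version.0 => {}
                _ => {
                    merged.insert(key.clone(), version.clone());
                }
            }
        }
    }
    if drop_tombstones {
        // Safe when the merged output covers all older data for these keys.
        merged.retain(|_, version| version.1.is_some());
    }
    merged
}

fn main() {
    let mut older = Table::new();
    older.insert("cart:42".into(), (1, Some("3 items".into())));
    older.insert("user:7".into(), (2, Some("alice".into())));

    let mut newer = Table::new();
    newer.insert("cart:42".into(), (3, Some("4 items".into())));
    newer.insert("user:7".into(), (4, None)); // delete

    let compacted = compact(&[older, newer], true);
    assert_eq!(compacted.get("cart:42"), Some(&(3, Some("4 items".into()))));
    assert!(!compacted.contains_key("user:7")); // tombstone dropped
}
```

Size-tiered compaction simply decides which files get fed into a merge like this, and when.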
Planned optimizations
Research directions
```rust
// Compare sequential vs random writes
let mut memtable = MemTable::new(1024 * 1024);

// Sequential writes (fast)
for i in 0..1000 {
    let key = format!("key_{:06}", i);
    memtable.put(key.into_bytes(), b"value".to_vec(), i)?;
}

// Random writes (still fast in MemTable!)
for i in 0..1000 {
    let key = format!("key_{:06}", rand::random::<u32>());
    memtable.put(key.into_bytes(), b"value".to_vec(), i)?;
}
```
```bash
# Watch memory usage during bulk writes
cargo run --example bulk_insert -- --count 100000 --observe-memory
```
Key metrics to watch:

- MemTable memory usage relative to max_size
- How often flushes happen and how many SSTables accumulate per tier
- Write throughput before and after a flush or compaction kicks in
Debugging techniques: use `cargo run --bin wal-reader` to examine write patterns.

How other databases handle this:
| Database | Compaction Strategy | Key Feature |
|---|---|---|
| Cassandra | Size-tiered (like FerrisDB) | Wide column support |
| LevelDB/RocksDB | Leveled compaction | Better read performance |
| ScyllaDB | Per-core LSM trees | Better parallelism |
Operational concerns:

- Compaction costs I/O and CPU, so write amplification needs monitoring
- Merges temporarily need extra disk space while old and new SSTables coexist
- Flush and compaction thresholds have to be tuned for the workload
- ferrisdb-storage/src/ - Complete LSM-tree implementation
- ferrisdb-storage/src/memtable/mod.rs - In-memory buffer
- ferrisdb-storage/src/sstable/ - Persistent storage format
- ferrisdb-storage/src/storage_engine.rs - Integration test examples

Part of the FerrisDB Learning Journey. Built with ❤️ by a human and an AI.