
LSM-Trees Explained

Understanding Log-Structured Merge Trees through FerrisDB’s implementation.

Difficulty: 📙 Intermediate • Reading time: 25 minutes

Have you ever wondered why your web application slows down when lots of users start writing data at the same time? Or why some databases can handle millions of writes per second while others struggle with thousands?

The problem lies in how traditional databases handle writes. Imagine you’re managing a customer database for an e-commerce site. Every time someone places an order, updates their profile, or adds items to their cart, your database needs to write that information to disk.

Traditional B-tree databases face a fundamental challenge:

When you write data, the database has to:

  1. Find the exact right spot on disk (like finding a specific page in a massive filing cabinet)
  2. Read the entire page into memory
  3. Modify just one small part
  4. Write the entire page back to disk
  5. Update multiple index pages

Real-world analogy: It’s like having to find the exact right spot in a filing cabinet, pull out the entire folder, add your single document, and put it back - for every single document. If the cabinet is across the building (representing disk storage), this gets incredibly slow!
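
To put rough numbers on that (illustrative figures, not FerrisDB measurements): a spinning disk manages on the order of 100 random I/O operations per second but roughly 100 MB/s of sequential throughput, so the same hardware supports vastly more append-style writes. A quick back-of-envelope sketch:

fn main() {
    // Illustrative figures for a spinning disk (not measurements).
    let random_iops = 100.0; // ~100 random seeks per second
    let seq_bandwidth_bytes_per_sec = 100.0 * 1024.0 * 1024.0; // ~100 MB/s sequential
    let record_size_bytes = 128.0; // one small write: key + value + overhead
    let pages_touched_per_write = 3.0; // leaf page plus a couple of index pages

    // In-place updates pay for random page I/O on every write...
    let btree_writes_per_sec = random_iops / pages_touched_per_write;
    // ...while an append-only log is limited mostly by sequential bandwidth.
    let log_writes_per_sec = seq_bandwidth_bytes_per_sec / record_size_bytes;

    println!("B-tree-style random writes:    ~{:.0}/s", btree_writes_per_sec);
    println!("Append-only sequential writes: ~{:.0}/s", log_writes_per_sec);
}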

LSM-Trees solve this by completely rethinking how databases handle writes.

LSM-Trees (Log-Structured Merge Trees) solve the write performance problem with a simple but revolutionary idea:

Core LSM Principle

“Instead of updating data in place, append all changes and merge them later”

Think of it like a restaurant’s order system:

  1. Taking orders (writes): Waiters don’t run to the kitchen after each order. They write orders on a notepad and collect multiple orders first
  2. Batching to kitchen (flush): When the notepad is full, they give all orders to the kitchen at once
  3. Kitchen organization (compaction): The kitchen organizes orders by table and cooking time for efficiency

This transforms expensive random writes (finding the exact spot) into fast sequential writes (just append to the end).

graph LR
    Write[Write Request] --> WAL[WAL<br/>Sequential Write Log]
    WAL --> MemTable[MemTable<br/>In-Memory Skip List]
    MemTable -->|"Flush when full"| SSTable1[SSTable 1<br/>Immutable File]
    MemTable -->|"Flush when full"| SSTable2[SSTable 2<br/>Immutable File]
    SSTable1 -->|Compaction| SSTable3[Merged SSTable]
    SSTable2 -->|Compaction| SSTable3

    style WAL fill:#f9f,stroke:#333,stroke-width:2px
    style MemTable fill:#9ff,stroke:#333,stroke-width:2px
    style SSTable1 fill:#ff9,stroke:#333,stroke-width:2px
    style SSTable2 fill:#ff9,stroke:#333,stroke-width:2px
    style SSTable3 fill:#9f9,stroke:#333,stroke-width:2px

Key principles:

  1. Sequential writes: Like writing in a journal - always append, never erase
  2. Memory buffering: Collect writes in RAM before going to disk (like the notepad)
  3. Background merging: Clean up and organize data when the system isn’t busy
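
The "journal" in principle 1 is exactly what the WAL in the diagram does. Here is a minimal sketch of an append-only log write using only the standard library - it is not FerrisDB's actual WAL format, just the shape of the idea:

use std::fs::OpenOptions;
use std::io::{self, Write};

/// Append one key/value record to a log file (sketch, not FerrisDB's format).
fn append_record(path: &str, key: &[u8], value: &[u8]) -> io::Result<()> {
    let mut log = OpenOptions::new().create(true).append(true).open(path)?;

    // Length-prefix each field so records can be replayed after a crash.
    log.write_all(&(key.len() as u32).to_le_bytes())?;
    log.write_all(key)?;
    log.write_all(&(value.len() as u32).to_le_bytes())?;
    log.write_all(value)?;

    // Durability comes from flushing to disk - still a sequential operation.
    log.sync_all()
}

fn main() -> io::Result<()> {
    append_record("wal.log", b"user:42", b"alice")?;
    Ok(())
}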

Let’s see how FerrisDB implements each component of the LSM-Tree.

Component 1: MemTable - Fast In-Memory Writes

// ferrisdb-storage/src/memtable/mod.rs:49-67
pub struct MemTable {
    /// Uses Arc for shared ownership in LSM-tree scenarios:
    /// - Storage engine keeps immutable MemTables for reads during flush
    /// - Background threads flush MemTable to SSTable
    /// - Iterators need concurrent access without blocking writes
    skiplist: Arc<SkipList>,
    memory_usage: AtomicUsize,
    max_size: usize,
}

impl MemTable {
    pub fn new(max_size: usize) -> Self {
        Self {
            skiplist: Arc::new(SkipList::new()),
            memory_usage: AtomicUsize::new(0),
            max_size,
        }
    }
}

Key design decisions:

  1. Skip List choice: Provides O(log n) performance with lock-free reads (better than B-trees for concurrent access)
  2. Arc wrapping: Allows safe sharing between read operations and background flush threads
  3. Memory tracking: Knows when to flush to disk before running out of RAM
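
To illustrate the Arc decision, here is a simplified sketch that uses a BTreeMap behind a RwLock as a stand-in for the lock-free skip list. The point is only that cloning the Arc lets a background flush thread and foreground readers share the same MemTable safely:

use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // Stand-in for the lock-free skip list: a sorted map behind a lock.
    let memtable = Arc::new(RwLock::new(BTreeMap::new()));
    memtable.write().unwrap().insert(b"key1".to_vec(), b"v1".to_vec());

    // A background "flush" thread gets its own Arc handle...
    let for_flush = Arc::clone(&memtable);
    let flusher = thread::spawn(move || {
        // ...and can iterate a consistent view while writing an SSTable.
        let snapshot = for_flush.read().unwrap();
        for (key, value) in snapshot.iter() {
            let _ = (key, value); // write_to_sstable(key, value) would go here
        }
    });

    // Meanwhile the foreground can keep serving reads from its own handle.
    assert!(memtable.read().unwrap().contains_key(b"key1".as_slice()));
    flusher.join().unwrap();
}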


The MemTable is where all writes initially go - think of it as the “notepad” in our restaurant analogy:

// ferrisdb-storage/src/memtable/mod.rs:81-95
pub fn put(&self, key: Key, value: Value, timestamp: Timestamp) -> Result<()> {
    let size_estimate = key.len() + value.len() + 64; // 64 bytes overhead estimate
    self.skiplist.insert(key, value, timestamp, Operation::Put);
    // fetch_add returns the usage *before* this insert was counted
    let new_usage = self
        .memory_usage
        .fetch_add(size_estimate, Ordering::Relaxed);
    if new_usage + size_estimate > self.max_size {
        return Err(Error::MemTableFull);
    }
    Ok(())
}

How it works:

  1. Immediate write: Data goes straight into memory (like writing on the notepad)
  2. Size tracking: Keeps track of how much memory is used
  3. Full detection: Signals when it’s time to “give orders to the kitchen” (flush to disk)
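
A sketch of what "full detection" buys the caller, using a toy memtable type rather than FerrisDB's API: once a write crosses the size threshold, the caller starts a fresh memtable (in a real engine the old one would be flushed to an SSTable in the background):

const MAX_SIZE: usize = 1024; // tiny threshold so the example actually triggers flushes

struct TinyMemTable {
    entries: Vec<(Vec<u8>, Vec<u8>)>,
    usage: usize,
}

impl TinyMemTable {
    fn new() -> Self {
        Self { entries: Vec::new(), usage: 0 }
    }

    /// Insert and report whether the table is now over its size budget.
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> bool {
        self.usage += key.len() + value.len() + 64; // same style of overhead estimate
        self.entries.push((key, value));
        self.usage > MAX_SIZE
    }
}

fn main() {
    let mut memtable = TinyMemTable::new();
    for i in 0..100 {
        let full = memtable.put(format!("key_{i:04}").into_bytes(), b"value".to_vec());
        if full {
            // In a real engine the full memtable would be handed to a flush
            // thread; here we just start a fresh one.
            println!("flush after write #{}", i + 1);
            memtable = TinyMemTable::new();
        }
    }
}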

Component 2: SSTables - Immutable Sorted Files


When the MemTable gets full, it’s flushed to an SSTable (Sorted String Table):

// ferrisdb-storage/src/sstable/mod.rs:179-188
pub struct SSTableEntry {
    /// The internal key (user_key + timestamp)
    pub key: InternalKey,
    /// The value associated with this key version
    pub value: Value,
    /// The operation type (Put/Delete) for this entry
    pub operation: Operation,
}

Performance characteristics:

  • Time complexity: O(log n) lookups using binary search within blocks
  • Space complexity: Compressed storage on disk with block-based organization
  • I/O patterns: Only reads necessary blocks, not entire files
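
The O(log n) lookup claim comes from keeping entries sorted: within a block, a reader can binary-search instead of scanning. A toy sketch using plain structs rather than FerrisDB's block format:

/// One decoded entry from a (sketch) SSTable block; entries are sorted by key.
struct Entry {
    key: Vec<u8>,
    value: Vec<u8>,
}

/// Because entries inside a block are sorted, a lookup is a binary search:
/// O(log n) in the number of entries rather than a full scan.
fn get<'a>(block: &'a [Entry], key: &[u8]) -> Option<&'a [u8]> {
    block
        .binary_search_by(|entry| entry.key.as_slice().cmp(key))
        .ok()
        .map(|i| block[i].value.as_slice())
}

fn main() {
    let block = vec![
        Entry { key: b"apple".to_vec(), value: b"1".to_vec() },
        Entry { key: b"banana".to_vec(), value: b"2".to_vec() },
        Entry { key: b"cherry".to_vec(), value: b"3".to_vec() },
    ];
    assert_eq!(get(&block, b"banana"), Some(b"2".as_slice()));
    assert_eq!(get(&block, b"durian"), None);
}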

Write Performance Improvement

  • Traditional B-tree: O(log n) random disk I/O operations per write
  • LSM-Tree MemTable: O(log n) in-memory insert plus one sequential WAL append - no random disk I/O on the write path
  • Improvement: ~100x faster on spinning disks, where random I/O dominates the cost of each write

Read Performance Trade-off

  • Best case (key still in the MemTable): O(log n) in-memory lookup
  • Worst case (key only in the oldest SSTable): O(k × log n), where k is the number of levels checked - see the read-path sketch below
  • Range queries: O(n) scan across the sorted data

Why LSM-trees still win overall:

  • Excellent write performance: sequential writes are much faster than random writes
  • High write throughput: multiple operations can be batched efficiently
  • Good compression: sorted, sequential data on disk compresses well
  • Crash recovery: the WAL ensures no data loss on system failures
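
Here is the read-path shape behind that worst case, sketched with BTreeMaps standing in for the MemTable and the SSTables: the newest source containing the key wins, and a miss has to consult every level:

use std::collections::BTreeMap;

type Table = BTreeMap<Vec<u8>, Vec<u8>>;

/// Read path sketch: check the in-memory table first, then SSTables from
/// newest to oldest. The first hit wins, because newer data shadows older data.
fn get(memtable: &Table, sstables_newest_first: &[Table], key: &[u8]) -> Option<Vec<u8>> {
    if let Some(v) = memtable.get(key) {
        return Some(v.clone()); // best case: a single in-memory lookup
    }
    for sstable in sstables_newest_first {
        if let Some(v) = sstable.get(key) {
            return Some(v.clone()); // worst case: every level is consulted
        }
    }
    None
}

fn main() {
    let mut memtable = Table::new();
    memtable.insert(b"hot".to_vec(), b"in-memory".to_vec());

    let mut old = Table::new();
    old.insert(b"cold".to_vec(), b"on-disk".to_vec());

    assert_eq!(get(&memtable, &[old.clone()], b"hot"), Some(b"in-memory".to_vec()));
    assert_eq!(get(&memtable, &[old], b"cold"), Some(b"on-disk".to_vec()));
}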

FerrisDB implements size-tiered compaction:

// ferrisdb-storage/src/storage_engine.rs (conceptual)
pub fn size_tiered_compaction(&mut self) -> Result<()> {
    // Group SSTables by similar size
    let mut size_tiers: BTreeMap<u64, Vec<SSTable>> = BTreeMap::new();
    for sstable in &self.sstables {
        let size_tier = size_tier_for_size(sstable.metadata.file_size);
        size_tiers.entry(size_tier).or_default().push(sstable.clone());
    }

    // Compact tiers with too many SSTables
    for (_tier_size, sstables) in size_tiers {
        if sstables.len() >= self.config.max_sstables_per_tier {
            self.compact_sstables(sstables)?;
        }
    }
    Ok(())
}
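
The size_tier_for_size helper isn't shown in the conceptual snippet above; a plausible (hypothetical) version simply buckets file sizes into power-of-two tiers so similarly sized files land in the same group:

/// Hypothetical helper: bucket SSTable sizes into power-of-two tiers (in MiB)
/// so "similar sized" files end up in the same compaction group.
fn size_tier_for_size(file_size: u64) -> u64 {
    const MIB: u64 = 1024 * 1024;
    // Tiers roughly double in size: <=1 MiB, <=2 MiB, <=4 MiB, ...
    (file_size / MIB).max(1).next_power_of_two()
}

fn main() {
    // Two sub-megabyte files fall into the same tier...
    assert_eq!(size_tier_for_size(800 * 1024), size_tier_for_size(500 * 1024));
    // ...while a much larger file lands in a higher tier.
    assert!(size_tier_for_size(64 * 1024 * 1024) > size_tier_for_size(2 * 1024 * 1024));
}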

How compaction works:

  • Background process: Runs when system isn’t busy serving requests
  • Merge operation: Combines multiple sorted files into one larger sorted file
  • Garbage collection: Removes outdated versions and deleted entries
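
A minimal sketch of that merge-plus-garbage-collection step, with Option<Vec<u8>> standing in for FerrisDB's Put/Delete operations (None playing the role of a delete tombstone):

use std::collections::BTreeMap;

/// Merge sketch: combine several sorted tables into one, keeping only the
/// newest value per key and dropping deleted entries. Tables are ordered
/// newest-first, so the first occurrence of a key wins.
fn compact(tables_newest_first: &[BTreeMap<Vec<u8>, Option<Vec<u8>>>]) -> BTreeMap<Vec<u8>, Vec<u8>> {
    let mut merged: BTreeMap<Vec<u8>, Option<Vec<u8>>> = BTreeMap::new();
    for table in tables_newest_first {
        for (key, value) in table {
            // Only insert if a newer table hasn't already decided this key.
            merged.entry(key.clone()).or_insert_with(|| value.clone());
        }
    }
    // Garbage collection: tombstones (None) are dropped from the output.
    merged
        .into_iter()
        .filter_map(|(k, v)| v.map(|value| (k, value)))
        .collect()
}

fn main() {
    let mut newer = BTreeMap::new();
    newer.insert(b"a".to_vec(), Some(b"2".to_vec())); // newer version of "a"
    newer.insert(b"b".to_vec(), None); // tombstone: "b" was deleted

    let mut older = BTreeMap::new();
    older.insert(b"a".to_vec(), Some(b"1".to_vec()));
    older.insert(b"b".to_vec(), Some(b"stale".to_vec()));
    older.insert(b"c".to_vec(), Some(b"3".to_vec()));

    let merged = compact(&[newer, older]);
    assert_eq!(merged.get(b"a".as_slice()), Some(&b"2".to_vec()));
    assert_eq!(merged.get(b"b".as_slice()), None); // deleted entry removed
    assert_eq!(merged.get(b"c".as_slice()), Some(&b"3".to_vec()));
}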

Planned optimizations

  • Leveled compaction: More sophisticated merging strategy
  • Bloom filters: Skip reading SSTables that definitely don’t contain a key
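
Bloom filters deserve a quick illustration of why they would help: a tiny, always-in-memory structure can answer "definitely not here" for most absent keys, so the read path skips SSTables it would otherwise have to search. A toy version using the standard library's DefaultHasher (real implementations use tuned sizes and better hash functions):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy Bloom filter: k hash functions over a fixed-size bit array.
/// "Definitely not present" answers let the read path skip whole SSTables.
struct BloomFilter {
    bits: Vec<bool>,
    hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits], hashes }
    }

    fn index(&self, key: &[u8], i: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        (key, i).hash(&mut hasher); // derive k different hashes by seeding with i
        (hasher.finish() as usize) % self.bits.len()
    }

    fn insert(&mut self, key: &[u8]) {
        for i in 0..self.hashes {
            let idx = self.index(key, i);
            self.bits[idx] = true;
        }
    }

    /// false => key is definitely absent; true => key *might* be present.
    fn may_contain(&self, key: &[u8]) -> bool {
        (0..self.hashes).all(|i| self.bits[self.index(key, i)])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 3);
    filter.insert(b"key_000001");
    assert!(filter.may_contain(b"key_000001"));
    // Most absent keys are rejected without touching the SSTable on disk.
    if !filter.may_contain(b"key_999999") {
        println!("skip this SSTable entirely");
    }
}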

Research directions

  • Universal compaction: Hybrid approach balancing read and write
  • Partitioned compaction: Parallel compaction for better utilization

To see this in practice, try inserting both sequentially ordered and randomly ordered keys into a MemTable - both land in memory first, so both stay fast:

// Compare sequential vs random writes
let mut memtable = MemTable::new(1024 * 1024);

// Sequential writes (fast)
for i in 0..1000 {
    let key = format!("key_{:06}", i);
    memtable.put(key.into_bytes(), b"value".to_vec(), i)?;
}

// Random writes (still fast in MemTable!)
for i in 0..1000 {
    let key = format!("key_{:06}", rand::random::<u32>());
    memtable.put(key.into_bytes(), b"value".to_vec(), i)?;
}

Key metrics to watch:

  • MemTable memory usage: How close to flush threshold
  • SSTable count per level: Indicates compaction health
  • Write amplification ratio: How much extra work compaction is doing
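
Write amplification is just a ratio you can compute from two counters; the numbers below are illustrative, not FerrisDB measurements:

/// Write amplification: total bytes the storage engine wrote to disk
/// (WAL + flushes + every compaction rewrite) divided by the bytes the
/// application actually asked to write.
fn write_amplification(bytes_written_to_disk: u64, bytes_written_by_app: u64) -> f64 {
    bytes_written_to_disk as f64 / bytes_written_by_app as f64
}

fn main() {
    // Illustrative: the app wrote 1 GB, but WAL + flush + two rounds of
    // compaction rewrote that data several times over.
    let ratio = write_amplification(4 * 1024 * 1024 * 1024, 1024 * 1024 * 1024);
    println!("write amplification: {ratio:.1}x"); // ~4.0x
}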

Debugging techniques:

  • WAL inspection: Use cargo run --bin wal-reader to examine write patterns
  • SSTable analysis: Check key distributions and overlaps between files

How other databases handle this:

Database        | Compaction Strategy          | Key Feature
Cassandra       | Size-tiered (like FerrisDB)  | Wide column support
LevelDB/RocksDB | Leveled compaction           | Better read performance
ScyllaDB        | Per-core LSM trees           | Better parallelism

A brief history:

  1. 1996: “The Log-Structured Merge-Tree” paper (O’Neil et al.) introduces the core concepts
  2. 2006: Google’s Bigtable popularizes LSM-trees for large-scale systems
  3. 2011: LevelDB makes LSM-trees accessible to application developers
  4. 2023: Modern implementations focus on NVMe optimization and cloud storage

Operational concerns:

  • Disk space monitoring: LSM-trees can temporarily use 2x space during compaction
  • Compaction throttling: Avoid overloading I/O during peak hours
  • Backup strategies: Consider SSTable-level backups for faster recovery
Key takeaways:

  1. Sequential writes are much faster than random writes: LSM-trees exploit this by deferring organization
  2. Memory buffering enables write batching: MemTables collect writes before expensive disk operations
  3. Background compaction trades write amplification for read performance: the system does extra work later to maintain good read speeds

Practical guidance:

  • Use LSM-trees when: your application has write-heavy workloads (logging, analytics, time-series data)
  • Consider alternatives when: read performance is more critical than write performance
  • Implementation complexity: moderate - requires careful coordination between components

Further reading:

  • “The Log-Structured Merge-Tree (LSM-Tree)” (O’Neil et al., 1996) - the original LSM-tree paper
  • “Bigtable: A Distributed Storage System for Structured Data” (Chang et al., 2008) - Google’s LSM-tree implementation

Where to look in the code:

  • Primary implementation: ferrisdb-storage/src/ - complete LSM-tree implementation
  • MemTable: ferrisdb-storage/src/memtable/mod.rs - in-memory write buffer
  • SSTable: ferrisdb-storage/src/sstable/ - persistent storage format
  • Tests: ferrisdb-storage/src/storage_engine.rs - integration test examples

Part of the FerrisDB Learning Journey. Built with ❤️ by a human and an AI.