Quick Check
Try answering these questions:
- Why are keys sorted in an SSTable?
- What's the purpose of restart points?
- How does a bloom filter speed up lookups?
- Why are SSTables immutable?
How databases organize and access data on disk for optimal performance.
Difficulty: Intermediate • Reading time: 30 minutes
Imagine you're building a web application that needs to store millions of user profiles, product listings, or transaction records. Your in-memory cache (MemTable) is getting full, and you need to save this data to disk. But here's the challenge:
Real-world problems CRUD developers face:
Traditional approaches have painful trade-offs:
SSTable (Sorted String Table) solves these problems elegantly. Think of it as a phone book for your data - organized, indexed, and easy to search without reading every page.
SSTables are like a well-organized library:
Without SSTables (pile of books):
With SSTables (organized library):
```
┌──────────────┐
│ Data Block 1 │ ← "A-C" authors (sorted books)
├──────────────┤
│ Data Block 2 │ ← "D-F" authors
├──────────────┤
│ Data Block 3 │ ← "G-M" authors
├──────────────┤
│ Data Block 4 │ ← "N-Z" authors
├──────────────┤
│ Index Block  │ ← Card catalog (which shelf has what)
├──────────────┤
│ Bloom Filter │ ← Quick "do we have this?" check
├──────────────┤
│ Footer       │ ← Information desk (where everything is)
└──────────────┘
```
Key principles:
Let's examine how FerrisDB structures SSTable data:
```rust
// ferrisdb-storage/src/sstable/mod.rs:180-204
pub struct SSTableEntry {
    /// The internal key (user_key + timestamp)
    pub key: InternalKey,
    /// The value associated with this key version
    pub value: Value,
}

pub struct InternalKey {
    /// The user-provided key
    pub user_key: UserKey,
    /// When this version was written (for MVCC)
    pub timestamp: u64,
    /// Type of operation (Put or Delete)
    pub key_type: KeyType,
}
```
Each data block in FerrisDB follows this structure:
```rust
// Conceptual structure (not actual code)
struct DataBlock {
    entries: Vec<Entry>,  // Sorted key-value pairs
    restarts: Vec<u32>,   // Restart points for compression
    num_restarts: u32,    // How many restart points
}
```
The magic of restart points:
Instead of storing full keys, we use prefix compression:
```
Full keys:        Compressed:
user:1234    →    user:1234 (restart point)
user:1235    →    5 (shared=8, non_shared=1)
user:1236    →    6 (shared=8, non_shared=1)
```
This reduces storage by 60-80% for typical workloads!
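A minimal sketch of this encoding in Rust; the function names and the `(shared, non_shared, suffix)` tuple are illustrative, not FerrisDB's actual API:

```rust
/// Length of the common prefix of two byte strings.
fn shared_prefix_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}

/// Encode sorted keys as (shared, non_shared, suffix) triples.
/// The first key acts as a restart point, so it stores shared = 0.
fn prefix_compress(keys: &[&[u8]]) -> Vec<(usize, usize, Vec<u8>)> {
    let mut out = Vec::new();
    let mut prev: &[u8] = &[];
    for &key in keys {
        let shared = shared_prefix_len(prev, key);
        out.push((shared, key.len() - shared, key[shared..].to_vec()));
        prev = key;
    }
    out
}

fn main() {
    let keys = [b"user:1234".as_slice(), b"user:1235", b"user:1236"];
    for (shared, non_shared, suffix) in prefix_compress(&keys) {
        println!(
            "shared={} non_shared={} suffix={}",
            shared, non_shared, String::from_utf8_lossy(&suffix)
        );
    }
}
```

Note that a reader must decode forward from the previous restart point, since each suffix only makes sense relative to the key before it.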
Data Entry Encoding
```
[shared_key_len: varint][non_shared_key_len: varint][value_len: varint]
[non_shared_key_bytes][value_bytes]
```
Block Trailer
```
[restart_offset_1][restart_offset_2]...[restart_offset_n][num_restarts: u32]
```
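Because the count sits at the very end, a reader can recover the restart array by working backwards from the block's last four bytes. A sketch, assuming little-endian `u32`s (an assumption here, not necessarily FerrisDB's exact on-disk encoding):

```rust
/// Parse the restart array from the tail of a block laid out as
/// [entries...][restart_offset_1..n][num_restarts: u32], little-endian.
fn parse_restarts(block: &[u8]) -> Vec<u32> {
    let n = block.len();
    let num = u32::from_le_bytes(block[n - 4..].try_into().unwrap()) as usize;
    let start = n - 4 - num * 4; // where the restart array begins
    (0..num)
        .map(|i| {
            let off = start + i * 4;
            u32::from_le_bytes(block[off..off + 4].try_into().unwrap())
        })
        .collect()
}

fn main() {
    // A fake block: 10 bytes of entry data, restart offsets [0, 6], count = 2.
    let mut block = vec![0u8; 10];
    for off in [0u32, 6] {
        block.extend_from_slice(&off.to_le_bytes());
    }
    block.extend_from_slice(&2u32.to_le_bytes());
    println!("{:?}", parse_restarts(&block));
}
```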
Footer Format
```
[index_block_handle: BlockHandle][bloom_filter_handle: BlockHandle][magic_number: u64]
```
Let's say you're storing user session data:
```rust
// Your application code
db.put("session:abc123", SessionData {
    user_id: "user:42",
    expires_at: "2024-12-31",
    permissions: vec!["read", "write"],
});
```
Hereβs how it flows through the system:
The result? A compact, searchable file that can find any key with just 2-3 disk reads!
| Operation | Time Complexity | Disk Reads | Notes |
| --- | --- | --- | --- |
| Point lookup | O(log n) | 1-3 | Bloom filter + index + data block |
| Range scan | O(k) | k/entries_per_block | Sequential reads are fast |
| Key exists? | O(1) | 0-1 | Bloom filter check first |
Problem: 512-byte blocks require excessive disk seeks
Solution: Use 4KB-16KB blocks (matches OS page size)
Problem: Storing full keys wastes space
Solution: Prefix compression within blocks
Problem: 1GB SSTable = 10MB index in memory
Solution: Two-level indexes or partitioned indexes
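A two-level index keeps only one key per partition resident in memory and pulls the rest in on demand. A sketch using `slice::partition_point` for the binary searches; the struct shapes and eager `partitions` field are illustrative (a real implementation would load partitions lazily from disk):

```rust
/// One partition of the index: first keys of a run of data blocks.
struct IndexPartition {
    first_keys: Vec<Vec<u8>>, // first key of each data block, sorted
    block_offsets: Vec<u64>,  // file offset of each data block
}

/// Top level: only the first key of each partition stays hot in memory.
struct TwoLevelIndex {
    partition_first_keys: Vec<Vec<u8>>,
    partitions: Vec<IndexPartition>, // stand-in for lazily loaded partitions
}

impl TwoLevelIndex {
    /// Offset of the data block that may contain `key`, or None if the
    /// key sorts before everything in the table.
    fn find_block(&self, key: &[u8]) -> Option<u64> {
        // Last partition whose first key <= key.
        let p = self
            .partition_first_keys
            .partition_point(|k| k.as_slice() <= key)
            .checked_sub(1)?;
        let part = &self.partitions[p];
        // Last block in that partition whose first key <= key.
        let b = part
            .first_keys
            .partition_point(|k| k.as_slice() <= key)
            .checked_sub(1)?;
        Some(part.block_offsets[b])
    }
}

fn main() {
    let index = TwoLevelIndex {
        partition_first_keys: vec![b"a".to_vec(), b"m".to_vec()],
        partitions: vec![
            IndexPartition {
                first_keys: vec![b"a".to_vec(), b"g".to_vec()],
                block_offsets: vec![0, 100],
            },
            IndexPartition {
                first_keys: vec![b"m".to_vec(), b"t".to_vec()],
                block_offsets: vec![200, 300],
            },
        ],
    };
    println!("{:?}", index.find_block(b"house"));
}
```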
SSTables are the persistent layer of LSM trees:
```
Write Path:
┌───────┐     ┌──────────┐     ┌─────────┐
│ Write │ ──▶ │ MemTable │ ──▶ │ SSTable │
└───────┘     └──────────┘     └─────────┘
                    │ (flush when full)
                    ▼
              ┌─────────┐
              │ SSTable │ (Level 0)
              └─────────┘

Read Path:
┌──────┐     ┌──────────┐     ┌─────────┐
│ Read │ ──▶ │ MemTable │ ──▶ │ SSTable │
└──────┘     └──────────┘     └─────────┘
                              (newest → oldest)
```
Try implementing a simple SSTable writer:
```rust
struct SimpleSSTableWriter {
    file: File,
    current_block: Vec<Entry>,
    index: Vec<IndexEntry>,
}

impl SimpleSSTableWriter {
    fn add(&mut self, key: &[u8], value: &[u8]) {
        // Your code here:
        // 1. Add to current_block
        // 2. If block full, flush it
        // 3. Update index
    }
}
```
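One way the exercise could be filled in. This sketch writes into an in-memory `Vec<u8>` instead of a `File` so it stays self-contained; the `IndexEntry` shape, the 4 KB flush threshold, and the `[klen][vlen][key][value]` entry encoding are all assumptions, not FerrisDB's real format:

```rust
const BLOCK_SIZE: usize = 4096; // assumed flush threshold

struct IndexEntry {
    first_key: Vec<u8>, // first key stored in the block
    offset: u64,        // where the block starts in the file
}

struct SimpleSSTableWriter {
    out: Vec<u8>,                           // stand-in for File
    current_block: Vec<(Vec<u8>, Vec<u8>)>, // buffered (key, value) pairs
    current_size: usize,
    index: Vec<IndexEntry>,
}

impl SimpleSSTableWriter {
    fn new() -> Self {
        Self {
            out: Vec::new(),
            current_block: Vec::new(),
            current_size: 0,
            index: Vec::new(),
        }
    }

    fn add(&mut self, key: &[u8], value: &[u8]) {
        // 1. Add to current block (keys must arrive sorted).
        self.current_block.push((key.to_vec(), value.to_vec()));
        self.current_size += key.len() + value.len();
        // 2. If the block is full, flush it.
        if self.current_size >= BLOCK_SIZE {
            self.flush_block();
        }
    }

    fn flush_block(&mut self) {
        if self.current_block.is_empty() {
            return;
        }
        // 3. Record the block's first key and offset in the index.
        self.index.push(IndexEntry {
            first_key: self.current_block[0].0.clone(),
            offset: self.out.len() as u64,
        });
        // Trivial uncompressed encoding: [klen][vlen][key][value] per entry.
        for (k, v) in self.current_block.drain(..) {
            self.out.extend_from_slice(&(k.len() as u32).to_le_bytes());
            self.out.extend_from_slice(&(v.len() as u32).to_le_bytes());
            self.out.extend_from_slice(&k);
            self.out.extend_from_slice(&v);
        }
        self.current_size = 0;
    }
}

fn main() {
    let mut w = SimpleSSTableWriter::new();
    w.add(b"user:0001", b"alice");
    w.add(b"user:0002", &vec![0u8; BLOCK_SIZE]); // large value forces a flush
    w.flush_block(); // flush any trailing partial block
    println!("blocks indexed: {}", w.index.len());
}
```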
Before reading a block, check if the key might exist:
```rust
// Bloom filter basics
pub struct BloomFilter {
    bits: Vec<bool>,
    hash_count: usize,
}

impl BloomFilter {
    pub fn might_contain(&self, key: &[u8]) -> bool {
        for i in 0..self.hash_count {
            let hash = self.hash_key(key, i);
            let bit_pos = hash % self.bits.len();

            if !self.bits[bit_pos] {
                return false; // Definitely not present
            }
        }
        true // Might be present (or false positive)
    }
}
```
Bloom filter properties:
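The `might_contain` snippet above relies on a `hash_key` helper and has no `insert`. A minimal self-contained sketch using the standard library's `DefaultHasher`, an illustrative choice only; production filters use faster hash functions and a bit-packed array rather than `Vec<bool>`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub struct BloomFilter {
    bits: Vec<bool>,
    hash_count: usize,
}

impl BloomFilter {
    pub fn new(num_bits: usize, hash_count: usize) -> Self {
        Self { bits: vec![false; num_bits], hash_count }
    }

    // Derive the i-th hash by mixing i into the hasher state
    // (illustrative; real filters often use double hashing).
    fn hash_key(&self, key: &[u8], i: usize) -> usize {
        let mut h = DefaultHasher::new();
        i.hash(&mut h);
        key.hash(&mut h);
        h.finish() as usize
    }

    pub fn insert(&mut self, key: &[u8]) {
        for i in 0..self.hash_count {
            let pos = self.hash_key(key, i) % self.bits.len();
            self.bits[pos] = true;
        }
    }

    pub fn might_contain(&self, key: &[u8]) -> bool {
        (0..self.hash_count)
            .all(|i| self.bits[self.hash_key(key, i) % self.bits.len()])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 3);
    filter.insert(b"user:42");
    // Inserted keys always return true: a bloom filter has no false negatives.
    println!("present? {}", filter.might_contain(b"user:42"));
    println!("absent?  {}", filter.might_contain(b"user:999"));
}
```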
FerrisDB can compress blocks before writing:
```rust
// Block compression options
pub enum CompressionType {
    None,
    Snappy, // Fast compression, moderate ratio
    LZ4,    // Very fast, good for real-time
    Zstd,   // Best ratio, slower
}

// Compression happens at block level
fn compress_block(data: &[u8], compression: CompressionType) -> Vec<u8> {
    match compression {
        CompressionType::LZ4 => lz4::compress(data),
        CompressionType::Snappy => snappy::compress(data),
        // ... etc
    }
}
```
Compression trade-offs:
Exercise 1: Analyze SSTable structure
```bash
# Create a sample SSTable
cargo run --example create_sstable -- --entries 1000

# Examine the binary structure
hexdump -C sample.sst | head -50

# Look for the magic number at the end
tail -c 40 sample.sst | hexdump -C
```
Exercise 2: Benchmark lookup performance
```rust
use std::time::Instant;

fn benchmark_sstable_lookups() {
    let reader = SSTableReader::open("sample.sst").unwrap();

    // Random lookups
    let start = Instant::now();
    for _ in 0..1000 {
        let key = format!("user:{}", rand::random::<u32>() % 1000);
        reader.get(key.as_bytes(), u64::MAX);
    }
    println!("Random lookups: {:?}/op", start.elapsed() / 1000);

    // Sequential scan
    let start = Instant::now();
    let mut count = 0;
    for _entry in reader.iter() {
        count += 1;
    }
    println!("Sequential scan: {} entries in {:?}", count, start.elapsed());
}
```
Key metrics to watch:
Debugging techniques:
```bash
cargo run --bin sstable-dump
```
How other databases implement persistent storage:
| Database | Storage Format | Key Features |
| --- | --- | --- |
| LevelDB/RocksDB | SSTable with levels | Leveled compaction |
| Cassandra | SSTables | Wide column support |
| HBase | StoreFiles (SSTable variant) | HDFS integration |
| PostgreSQL | Heap files with B-tree | Different approach |
Operational concerns:
- `ferrisdb-storage/src/sstable/mod.rs` - Data structures
- `ferrisdb-storage/src/sstable/writer.rs` - Creating SSTables
- `ferrisdb-storage/src/sstable/reader.rs` - Querying SSTables
- `ferrisdb-storage/src/sstable/tests.rs` - Usage examples

Part of the FerrisDB Learning Journey. Built with ❤️ by a human and an AI.