WAL and Crash Recovery

How Write-Ahead Logs ensure data durability and enable crash recovery in database systems.

Difficulty: 📗 Beginner • Reading time: 15 minutes

Imagine you’re running an e-commerce website. A customer completes their order, sees the “Order Successful!” message, and then… your server crashes. When it restarts, is their order still there? Or did it vanish into the digital void?

This is the fundamental durability problem that every database must solve. Without proper crash recovery, you risk:

  • Losing writes that users were already told succeeded
  • Leaving data in a half-updated, inconsistent state
  • Breaking your users’ trust in the system

The problem is that computers use two types of storage:

  1. RAM (Memory): Super fast but disappears when power is lost
  2. Disk (Storage): Slower but survives power loss

If your database only writes to memory for speed, all data is lost on crash. If it writes to disk for every operation, it becomes painfully slow. Write-Ahead Logging (WAL) solves this dilemma elegantly.

Write-Ahead Logging follows a simple principle:

WAL Core Principle

“Write changes to a log file BEFORE updating the actual data”

Think of it like a restaurant’s order system:

  1. Taking the order (WAL write): Waiter writes order on paper first
  2. Kitchen preparation (Memory update): Cook starts preparing the meal
  3. Order tracking (Recovery): If cook forgets, the written order still exists

Even if the kitchen catches fire (system crash), the written orders survive, and a new cook can continue where the previous one left off.

User Request → WAL (Disk) → MemTable (RAM) → SSTable (Disk)
     ↓              ↓              ↓                ↓
 "Success"      Durability    Fast Access    Long-term Storage

Let’s understand the problem WAL solves through a familiar example - an ATM withdrawal on a system without a WAL:

1. Customer withdraws $100
2. Update balance in memory: $1000 → $900
3. Power failure! 💥
4. Server restarts
5. Balance shows $1000 (memory lost)
6. Customer has cash but account shows no withdrawal!

Let’s see how FerrisDB implements WAL:

ferrisdb-storage/src/wal/log_entry.rs

pub struct LogEntry {
    pub sequence_number: u64,   // Unique ID for ordering
    pub key: Vec<u8>,           // What was changed
    pub value: Option<Vec<u8>>, // New value (None = delete)
    pub timestamp: u64,         // When it happened
}

// Simplified from FerrisDB code
pub fn append(&mut self, entry: &LogEntry) -> Result<()> {
    // 1. Serialize the entry
    let serialized = bincode::serialize(entry)?;

    // 2. Write length prefix (so we know where entries start/end)
    self.writer.write_u32(serialized.len() as u32)?;

    // 3. Write the actual data
    self.writer.write_all(&serialized)?;

    // 4. Force to disk (fsync) - THIS IS CRITICAL!
    self.writer.sync_all()?;

    Ok(())
}

// How FerrisDB recovers after a crash
pub fn recover(&mut self) -> Result<Vec<LogEntry>> {
    let mut entries = Vec::new();

    loop {
        // Read entry length
        let len = match self.reader.read_u32() {
            Ok(len) => len,
            Err(_) => break, // End of log
        };

        // Read entry data
        let mut buffer = vec![0; len as usize];
        self.reader.read_exact(&mut buffer)?;

        // Deserialize and collect
        let entry: LogEntry = bincode::deserialize(&buffer)?;
        entries.push(entry);
    }

    // Replay all entries to reconstruct state
    for entry in &entries {
        self.apply_to_memtable(entry)?;
    }

    Ok(entries)
}
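
Putting the two pieces together, usage looks roughly like this. The WalWriter::open and WalReader::open constructors are assumptions for illustration, not necessarily FerrisDB’s exact API:

// Illustrative usage - constructor names are assumptions.
let mut wal = WalWriter::open("orders.wal")?;
wal.append(&LogEntry {
    sequence_number: 1,
    key: b"order:1001".to_vec(),
    value: Some(b"status=paid".to_vec()),
    timestamp: 1_700_000_000,
})?;
// ...the process crashes here...

// On restart, replay the log to rebuild in-memory state.
let mut reader = WalReader::open("orders.wal")?;
let entries = reader.recover()?;
assert_eq!(entries.len(), 1); // the order survived the crash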

Write Performance

  • Sequential writes: 100-500 MB/s
  • Latency: 1-10ms per sync
  • Batching: Group commits for efficiency

Recovery Speed

  • Read speed: 500+ MB/s (sequential)
  • Recovery time: ~1 second per GB
  • Parallel replay: Modern DBs use multiple threads (see the sketch below)
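
One way to parallelize replay - a hedged sketch that assumes entries only need per-key ordering (true for simple key-value puts; transactional logs need more care) and uses made-up types rather than FerrisDB’s:

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

// Illustrative entry: since only per-key order matters, entries can be
// partitioned by key hash and each partition replayed on its own thread.
#[derive(Clone)]
struct Entry {
    key: String,
    value: String,
}

fn parallel_replay(entries: Vec<Entry>, shards: usize) -> Vec<HashMap<String, String>> {
    // Partition entries by key hash, preserving per-key log order.
    let mut partitions: Vec<Vec<Entry>> = vec![Vec::new(); shards];
    for e in entries {
        let mut h = DefaultHasher::new();
        e.key.hash(&mut h);
        partitions[(h.finish() as usize) % shards].push(e);
    }

    // Replay each partition concurrently; the last write per key wins.
    thread::scope(|s| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|part| {
                s.spawn(move || {
                    let mut state = HashMap::new();
                    for e in part {
                        state.insert(e.key.clone(), e.value.clone());
                    }
                    state
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

A real engine would shard its memtable the same way, so each thread rebuilds its own shard rather than merging afterwards.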

Instead of syncing after every write, batch multiple writes:

// Inefficient: sync per write
for entry in entries {
    wal.append(&entry)?;
    wal.sync()?; // Slow!
}

// Efficient: group commit
for entry in entries {
    wal.append(&entry)?;
}
wal.sync()?; // One sync for all!

Periodically save a snapshot (checkpoint) so recovery doesn’t have to replay the entire history:

WAL: [Entry1][Entry2][Entry3][CHECKPOINT][Entry4][Entry5]
                                  ↑
                     Recovery starts here, not at Entry1
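
The recovery cut-off is simple to express. Here’s a hedged sketch using an illustrative record type, not FerrisDB’s actual on-disk format:

use std::collections::BTreeMap;

// Illustrative record type; a real WAL uses a binary format with checksums.
enum WalRecord {
    Put { key: String, value: String },
    Checkpoint, // everything before this point is already persisted elsewhere
}

// Replay only what comes after the last checkpoint.
fn recover(log: &[WalRecord]) -> BTreeMap<String, String> {
    let start = log
        .iter()
        .rposition(|r| matches!(r, WalRecord::Checkpoint))
        .map(|i| i + 1)
        .unwrap_or(0);

    let mut state = BTreeMap::new();
    for record in &log[start..] {
        if let WalRecord::Put { key, value } = record {
            state.insert(key.clone(), value.clone());
        }
    }
    state
}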

Reuse old log files to avoid filesystem overhead:

wal.000001.log (full) → Archive or delete
wal.000002.log (active) → Current writes
wal.000003.log (preallocated) → Ready for next rotation
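
The trigger itself can be as simple as a size check - a hedged sketch with made-up names, keyed off the max_file_size_mb setting in the configuration below:

// Hypothetical rotation check: once the active segment grows past the
// configured limit, switch writes to the next (ideally preallocated) file.
fn next_segment_name(active_len_bytes: u64, max_file_size_mb: u64, next_id: u64) -> Option<String> {
    if active_len_bytes >= max_file_size_mb * 1024 * 1024 {
        Some(format!("wal.{:06}.log", next_id)) // e.g. "wal.000004.log"
    } else {
        None
    }
}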

A typical WAL configuration exposes knobs for sync behavior, rotation, and compression:

[wal]
sync_mode = "always" # always | periodic | never
sync_interval_ms = 100 # If periodic
max_file_size_mb = 128 # When to rotate
compression = "none" # none | snappy | zstd

Key metrics to monitor:

  • WAL write latency: Should be < 10ms
  • WAL size: Monitor growth rate
  • Recovery time: Test regularly
  • Sync failures: Must be zero!

Quick Quiz

  1. Why must we sync after WAL writes?
  2. What happens if power fails during a WAL write?
  3. How does group commit improve performance?
  4. When can we safely delete old WAL files?

Try implementing a simple WAL:

struct SimpleWAL {
    file: File,
    entries: Vec<LogEntry>,
}

impl SimpleWAL {
    fn append(&mut self, key: &str, value: &str) -> Result<()> {
        // Your code here:
        // 1. Create LogEntry
        // 2. Serialize to bytes
        // 3. Write to file
        // 4. Sync to disk
        todo!()
    }

    fn recover(&mut self) -> Result<()> {
        // Your code here:
        // 1. Read entries from file
        // 2. Deserialize each one
        // 3. Apply to state
        todo!()
    }
}

Exercise 3: Test crash recovery

# Start a write workload
cargo run --example wal_stress_test &
# Kill it mid-write (simulating crash)
sleep 5 && kill -9 $!
# Run recovery
cargo run --example wal_recovery -- --recover-from test.wal

Key metrics to watch:

  • WAL size: Monitor growth rate and truncation frequency
  • Sync latency: Time spent in fsync() calls
  • Recovery duration: Time to replay WAL after crash

Debugging techniques:

  • WAL inspection: cargo run --bin wal-dump to examine entries
  • Corruption detection: Look for CRC mismatches in logs (see the sketch after this list)
  • Performance profiling: Measure time spent in WAL operations
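
For example, per-entry checksum validation might look like this - a hedged sketch using the crc32fast crate and an illustrative framing, not FerrisDB’s actual on-disk format:

// Assumes each entry is framed as: [len: u32][crc32: u32][payload bytes].
// The framing is illustrative; the CRC comparison itself is the point.
fn entry_is_intact(payload: &[u8], stored_crc: u32) -> bool {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(payload);
    hasher.finalize() == stored_crc
}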

How other databases handle WAL:

Database        | WAL Implementation         | Key Features
PostgreSQL      | WAL with configurable sync | Full ACID compliance
MySQL (InnoDB)  | Redo log                   | Group commit optimization
SQLite          | Journal or WAL mode        | Simpler for embedded use
Redis           | AOF (Append Only File)     | Optional durability

A brief history:

  1. 1992: ARIES paper establishes WAL principles
  2. 2000s: NoSQL databases adopt WAL for durability
  3. 2010: SQLite adds WAL mode for better concurrency
  4. Today: NVMe and persistent memory are changing WAL design

Operational concerns:

  • Disk monitoring: WAL can fill disk quickly under high load
  • Sync tuning: Balance durability vs performance for your use case
  • Backup strategy: Include WAL in backups for point-in-time recovery
  • Separate disks: Put WAL on different disk than data for better I/O

Design takeaways:

  1. Write-ahead principle ensures durability: Log before modifying data structures
  2. Sequential writes are fast: Appending to the log is much faster than random updates
  3. Recovery is straightforward: Just replay the log from the last checkpoint

When to reach for a WAL:

  • Use WAL when: Data durability is critical (financial transactions, user data)
  • Consider alternatives when: Data can be regenerated or loss is acceptable
  • Implementation complexity: Moderate - requires careful handling of I/O and recovery

Further reading:

  • “ARIES: A Transaction Recovery Method” (Mohan et al., 1992) - Foundational WAL concepts
  • “aLSM: Redesigning LSMs for Nonvolatile Memory” (Eisenman et al., 2018) - Modern adaptations

Explore the FerrisDB code:

  • WAL Writer: ferrisdb-storage/src/wal/writer.rs - Core write logic
  • WAL Reader: ferrisdb-storage/src/wal/reader.rs - Recovery implementation
  • Binary Format: ferrisdb-storage/src/wal/log_entry.rs - Entry encoding
  • Tests: ferrisdb-storage/src/wal/ - Test cases showing usage

Remember:

  1. WAL = Durability: Survives crashes by writing to disk first
  2. Sequential = Fast: Appending is much faster than random writes
  3. Recovery = Replay: Reconstruct state by replaying the log
  4. Sync = Critical: Must force to disk, not just OS buffers

Part of the FerrisDB Learning Journey. Built with ❤️ by a human and an AI.