WAL and Crash Recovery

How Write-Ahead Logs ensure data durability and enable crash recovery in database systems.

Difficulty: 📗 Beginner • Reading time: 15 minutes

Imagine you’re running an e-commerce website. A customer completes their order, sees the “Order Successful!” message, and then… your server crashes. When it restarts, is their order still there? Or did it vanish into the digital void?

This is the fundamental durability problem that every database must solve. Without proper crash recovery, you risk:

  • Losing writes that users were already told succeeded
  • Leaving data in a half-updated, inconsistent state
  • Breaking your users’ trust in the system

The problem is that computers use two types of storage:

  1. RAM (Memory): Super fast but disappears when power is lost
  2. Disk (Storage): Slower but survives power loss

If your database only writes to memory for speed, all data is lost on crash. If it writes to disk for every operation, it becomes painfully slow. Write-Ahead Logging (WAL) solves this dilemma elegantly.

Write-Ahead Logging follows a simple principle:

WAL Core Principle

“Write changes to a log file BEFORE updating the actual data”

Think of it like a restaurant’s order system:

  1. Taking the order (WAL write): Waiter writes order on paper first
  2. Kitchen preparation (Memory update): Cook starts preparing the meal
  3. Order tracking (Recovery): If cook forgets, the written order still exists

Even if the kitchen catches fire (system crash), the written orders survive, and a new cook can continue where the previous one left off.

User Request → WAL (Disk) → MemTable (RAM) → SSTable (Disk)
     ↓              ↓              ↓                ↓
 "Success"      Durability    Fast Access    Long-term Storage

Let’s understand the problem WAL solves through a familiar example - an ATM withdrawal on a system without a WAL:

1. Customer withdraws $100
2. Update balance in memory: $1000 → $900
3. Power failure! 💥
4. Server restarts
5. Balance shows $1000 (memory lost)
6. Customer has cash but account shows no withdrawal!

Let’s see how FerrisDB implements WAL:

ferrisdb-storage/src/wal/log_entry.rs

pub struct LogEntry {
    pub sequence_number: u64,   // Unique ID for ordering
    pub key: Vec<u8>,           // What was changed
    pub value: Option<Vec<u8>>, // New value (None = delete)
    pub timestamp: u64,         // When it happened
}

// Simplified from FerrisDB code
pub fn append(&mut self, entry: &LogEntry) -> Result<()> {
    // 1. Serialize the entry
    let serialized = bincode::serialize(entry)?;

    // 2. Write length prefix (so we know where entries start/end)
    self.writer.write_u32(serialized.len() as u32)?;

    // 3. Write the actual data
    self.writer.write_all(&serialized)?;

    // 4. Force to disk (fsync) - THIS IS CRITICAL!
    self.writer.sync_all()?;

    Ok(())
}

// How FerrisDB recovers after a crash
pub fn recover(&mut self) -> Result<Vec<LogEntry>> {
    let mut entries = Vec::new();

    loop {
        // Read entry length
        let len = match self.reader.read_u32() {
            Ok(len) => len,
            Err(_) => break, // End of log
        };

        // Read entry data
        let mut buffer = vec![0; len as usize];
        self.reader.read_exact(&mut buffer)?;

        // Deserialize and collect
        let entry: LogEntry = bincode::deserialize(&buffer)?;
        entries.push(entry);
    }

    // Replay all entries to reconstruct state
    for entry in &entries {
        self.apply_to_memtable(entry)?;
    }

    Ok(entries)
}
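
Putting the two pieces together, usage looks roughly like this. The WalWriter::open and WalReader::open constructors are assumptions for illustration, not necessarily FerrisDB’s exact API:

// Illustrative usage - constructor names are assumptions.
let mut wal = WalWriter::open("orders.wal")?;
wal.append(&LogEntry {
    sequence_number: 1,
    key: b"order:1001".to_vec(),
    value: Some(b"status=paid".to_vec()),
    timestamp: 1_700_000_000,
})?;
// ...the process crashes here...

// On restart, replay the log to rebuild in-memory state.
let mut reader = WalReader::open("orders.wal")?;
let entries = reader.recover()?;
assert_eq!(entries.len(), 1); // the order survived the crash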

Write Performance

  • Sequential writes: 100-500 MB/s
  • Latency: 1-10ms per sync
  • Batching: Group commits for efficiency

Recovery Speed

  • Read speed: 500+ MB/s (sequential)
  • Recovery time: ~1 second per GB
  • Parallel replay: Modern DBs use multiple threads (see the sketch below)
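
One way to parallelize replay - a hedged sketch that assumes entries only need per-key ordering (true for simple key-value puts; transactional logs need more care) and uses made-up types rather than FerrisDB’s:

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::thread;

// Illustrative entry: since only per-key order matters, entries can be
// partitioned by key hash and each partition replayed on its own thread.
#[derive(Clone)]
struct Entry {
    key: String,
    value: String,
}

fn parallel_replay(entries: Vec<Entry>, shards: usize) -> Vec<HashMap<String, String>> {
    // Partition entries by key hash, preserving per-key log order.
    let mut partitions: Vec<Vec<Entry>> = vec![Vec::new(); shards];
    for e in entries {
        let mut h = DefaultHasher::new();
        e.key.hash(&mut h);
        partitions[(h.finish() as usize) % shards].push(e);
    }

    // Replay each partition concurrently; the last write per key wins.
    thread::scope(|s| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|part| {
                s.spawn(move || {
                    let mut state = HashMap::new();
                    for e in part {
                        state.insert(e.key.clone(), e.value.clone());
                    }
                    state
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

A real engine would shard its memtable the same way, so each thread rebuilds its own shard rather than merging afterwards.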

Instead of syncing after every write, batch multiple writes:

// Inefficient: sync per write
for entry in entries {
    wal.append(&entry)?;
    wal.sync()?; // Slow!
}

// Efficient: group commit
for entry in entries {
    wal.append(&entry)?;
}
wal.sync()?; // One sync for all!

Periodically save a snapshot (checkpoint) so recovery doesn’t have to replay the entire history:

WAL: [Entry1][Entry2][Entry3][CHECKPOINT][Entry4][Entry5]
                                  ↑
                     Recovery starts here, not at Entry1
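
The recovery cut-off is simple to express. Here’s a hedged sketch using an illustrative record type, not FerrisDB’s actual on-disk format:

use std::collections::BTreeMap;

// Illustrative record type; a real WAL uses a binary format with checksums.
enum WalRecord {
    Put { key: String, value: String },
    Checkpoint, // everything before this point is already persisted elsewhere
}

// Replay only what comes after the last checkpoint.
fn recover(log: &[WalRecord]) -> BTreeMap<String, String> {
    let start = log
        .iter()
        .rposition(|r| matches!(r, WalRecord::Checkpoint))
        .map(|i| i + 1)
        .unwrap_or(0);

    let mut state = BTreeMap::new();
    for record in &log[start..] {
        if let WalRecord::Put { key, value } = record {
            state.insert(key.clone(), value.clone());
        }
    }
    state
}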

Reuse old log files to avoid filesystem overhead:

wal.000001.log (full) → Archive or delete
wal.000002.log (active) → Current writes
wal.000003.log (preallocated) → Ready for next rotation
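
The trigger itself can be as simple as a size check - a hedged sketch with made-up names, keyed off the max_file_size_mb setting in the configuration below:

// Hypothetical rotation check: once the active segment grows past the
// configured limit, switch writes to the next (ideally preallocated) file.
fn next_segment_name(active_len_bytes: u64, max_file_size_mb: u64, next_id: u64) -> Option<String> {
    if active_len_bytes >= max_file_size_mb * 1024 * 1024 {
        Some(format!("wal.{:06}.log", next_id)) // e.g. "wal.000004.log"
    } else {
        None
    }
}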

A typical WAL configuration exposes knobs for sync behavior, rotation, and compression:

[wal]
sync_mode = "always" # always | periodic | never
sync_interval_ms = 100 # If periodic
max_file_size_mb = 128 # When to rotate
compression = "none" # none | snappy | zstd

Key metrics to monitor:

  • WAL write latency: Should be < 10ms
  • WAL size: Monitor growth rate
  • Recovery time: Test regularly
  • Sync failures: Must be zero!

Quick Quiz

  1. Why must we sync after WAL writes?
  2. What happens if power fails during a WAL write?
  3. How does group commit improve performance?
  4. When can we safely delete old WAL files?

Try implementing a simple WAL:

struct SimpleWAL {
    file: File,
    entries: Vec<LogEntry>,
}

impl SimpleWAL {
    fn append(&mut self, key: &str, value: &str) -> Result<()> {
        // Your code here:
        // 1. Create LogEntry
        // 2. Serialize to bytes
        // 3. Write to file
        // 4. Sync to disk
        todo!()
    }

    fn recover(&mut self) -> Result<()> {
        // Your code here:
        // 1. Read entries from file
        // 2. Deserialize each one
        // 3. Apply to state
        todo!()
    }
}

Exercise 3: Test crash recovery

# Start a write workload
cargo run --example wal_stress_test &
# Kill it mid-write (simulating crash)
sleep 5 && kill -9 $!
# Run recovery
cargo run --example wal_recovery -- --recover-from test.wal

Key metrics to watch:

  • WAL size: Monitor growth rate and truncation frequency
  • Sync latency: Time spent in fsync() calls
  • Recovery duration: Time to replay WAL after crash

Debugging techniques:

  • WAL inspection: cargo run --bin wal-dump to examine entries
  • Corruption detection: Look for CRC mismatches in logs (see the sketch after this list)
  • Performance profiling: Measure time spent in WAL operations
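
For example, per-entry checksum validation might look like this - a hedged sketch using the crc32fast crate and an illustrative framing, not FerrisDB’s actual on-disk format:

// Assumes each entry is framed as: [len: u32][crc32: u32][payload bytes].
// The framing is illustrative; the CRC comparison itself is the point.
fn entry_is_intact(payload: &[u8], stored_crc: u32) -> bool {
    let mut hasher = crc32fast::Hasher::new();
    hasher.update(payload);
    hasher.finalize() == stored_crc
}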

How other databases handle WAL:

Database        | WAL Implementation         | Key Features
PostgreSQL      | WAL with configurable sync | Full ACID compliance
MySQL (InnoDB)  | Redo log                   | Group commit optimization
SQLite          | Journal or WAL mode        | Simpler for embedded use
Redis           | AOF (Append Only File)     | Optional durability

A brief history:

  1. 1992: ARIES paper establishes WAL principles
  2. 2000s: NoSQL databases adopt WAL for durability
  3. 2010: SQLite adds WAL mode for better concurrency
  4. Today: NVMe and persistent memory are changing WAL design

Operational concerns:

  • Disk monitoring: WAL can fill disk quickly under high load
  • Sync tuning: Balance durability vs performance for your use case
  • Backup strategy: Include WAL in backups for point-in-time recovery
  • Separate disks: Put WAL on different disk than data for better I/O

Design takeaways:

  1. Write-ahead principle ensures durability: Log before modifying data structures
  2. Sequential writes are fast: Appending to the log is much faster than random updates
  3. Recovery is straightforward: Just replay the log from the last checkpoint

When to reach for a WAL:

  • Use WAL when: Data durability is critical (financial transactions, user data)
  • Consider alternatives when: Data can be regenerated or loss is acceptable
  • Implementation complexity: Moderate - requires careful handling of I/O and recovery

Further reading:

  • “ARIES: A Transaction Recovery Method” (Mohan et al., 1992) - Foundational WAL concepts
  • “aLSM: Redesigning LSMs for Nonvolatile Memory” (Eisenman et al., 2018) - Modern adaptations

Explore the FerrisDB code:

  • WAL Writer: ferrisdb-storage/src/wal/writer.rs - Core write logic
  • WAL Reader: ferrisdb-storage/src/wal/reader.rs - Recovery implementation
  • Binary Format: ferrisdb-storage/src/wal/log_entry.rs - Entry encoding
  • Tests: ferrisdb-storage/src/wal/ - Test cases showing usage

Remember:

  1. WAL = Durability: Survives crashes by writing to disk first
  2. Sequential = Fast: Appending is much faster than random writes
  3. Recovery = Replay: Reconstruct state by replaying the log
  4. Sync = Critical: Must force to disk, not just OS buffers

Part of the FerrisDB Learning Journey. Built with ❤️ by a human and an AI.