This page documents interesting architectural patterns and advanced concepts we might explore as
FerrisDB evolves from an educational project toward production-ready capabilities.
Concept : Completely decouple storage nodes from compute nodes for independent scaling and cost optimization.
Storage layer : Pure data persistence (database-aware object storage)
Compute layer : Query processing, transactions, caching
Benefits : Independent scaling, cost optimization, multi-tenant isolation
Examples : Snowflake, Amazon Aurora, CockroachDB Serverless
Learning value : Understanding modern cloud-native database architecture patterns.
Shared-nothing architecture:
Each node owns its own data shards
Data locality for performance
Simpler consistency model
Traditional distributed database approach
Shared-storage architecture:
All nodes access a common storage pool
Multiple compute nodes can read/write same data
Enables instant failover and load balancing
Storage becomes the single source of truth
Trade-offs in consistency, performance, and complexity
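The shared-storage idea above can be sketched in a few lines of Rust. This is a minimal illustration, not FerrisDB code: `SharedStorage` and `ComputeNode` are hypothetical names, and an in-memory map behind a mutex stands in for a real storage service. Compute nodes hold no data of their own, so losing one does not lose anything.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical shared storage pool: the single source of truth.
#[derive(Default)]
pub struct SharedStorage {
    data: Mutex<HashMap<String, String>>,
}

/// Hypothetical stateless compute node: it holds only a handle to storage.
pub struct ComputeNode {
    pub name: String,
    storage: Arc<SharedStorage>,
}

impl ComputeNode {
    pub fn new(name: &str, storage: Arc<SharedStorage>) -> Self {
        Self { name: name.to_string(), storage }
    }

    /// Writes go straight to the shared pool.
    pub fn put(&self, key: &str, value: &str) {
        let mut data = self.storage.data.lock().unwrap();
        data.insert(key.to_string(), value.to_string());
    }

    /// Any node can read what any other node wrote.
    pub fn get(&self, key: &str) -> Option<String> {
        let data = self.storage.data.lock().unwrap();
        data.get(key).cloned()
    }
}
```

If node `a` writes a key and is then dropped, node `b` still reads the value, which is the failover property listed above. A real system would also need caching and concurrency control in the compute layer, which this sketch leaves out.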
Log as Database : All operations are immutable log entries
Materialized Views : Derive tables and indexes from the log
Time Travel : Query any point in history naturally
Benefits : Perfect audit trail, simplified backup/restore, event sourcing
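To make "log as database" and time travel concrete, here is a small sketch under simplifying assumptions: the `Op` and `Log` types are hypothetical, the log lives in memory, and a view is rebuilt by replaying entries from the start rather than maintained incrementally.

```rust
use std::collections::BTreeMap;

/// One immutable log entry (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Put { key: String, value: String },
    Delete { key: String },
}

/// Append-only log; an entry's sequence number is its 1-based position.
#[derive(Default)]
pub struct Log {
    entries: Vec<Op>,
}

impl Log {
    /// Appends an entry and returns its sequence number.
    pub fn append(&mut self, op: Op) -> usize {
        self.entries.push(op);
        self.entries.len()
    }

    /// Materializes the key-value view as of `seq` by replaying
    /// the first `seq` entries.
    pub fn view_at(&self, seq: usize) -> BTreeMap<String, String> {
        let mut view = BTreeMap::new();
        for op in self.entries.iter().take(seq) {
            match op {
                Op::Put { key, value } => {
                    view.insert(key.clone(), value.clone());
                }
                Op::Delete { key } => {
                    view.remove(key);
                }
            }
        }
        view
    }

    /// The current state is just the view at the end of the log.
    pub fn latest(&self) -> BTreeMap<String, String> {
        self.view_at(self.entries.len())
    }
}
```

Because nothing is overwritten, `view_at(1)` still returns the first value of a key after later entries have changed or deleted it. Replaying from the beginning costs time proportional to the log length, which is why compaction and incremental view maintenance matter.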
Research areas:
Log compaction strategies
Efficient materialized view maintenance
Query optimization over log structures
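As one example of a compaction strategy, the sketch below keeps only the newest entry per key. The `Entry` alias and `compact` function are hypothetical, and the sketch assumes the whole log is compacted at once, which is the only case where dropping tombstones is safe. It also gives up time travel for the compacted range.

```rust
use std::collections::HashMap;

/// A log entry: `Some(value)` is a write, `None` is a tombstone (delete).
pub type Entry = (String, Option<String>);

/// Keeps only the newest entry for each key, preserving log order,
/// and drops keys whose newest entry is a tombstone.
pub fn compact(log: &[Entry]) -> Vec<Entry> {
    // Record the index of the last entry seen for each key.
    let mut last_index: HashMap<&str, usize> = HashMap::new();
    for (i, (key, _)) in log.iter().enumerate() {
        last_index.insert(key.as_str(), i);
    }

    let mut out = Vec::new();
    for (i, (key, value)) in log.iter().enumerate() {
        let is_newest = last_index.get(key.as_str()) == Some(&i);
        if is_newest && value.is_some() {
            out.push((key.clone(), value.clone()));
        }
    }
    out
}
```

A log with writes to `a`, `b`, `a` again, a delete of `b`, and a write to `c` compacts down to the second write of `a` followed by the write of `c`.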
Goal : Single system handles both OLTP and OLAP workloads efficiently.
Columnar storage : For analytical queries
Row storage : For transactional workloads
Automatic routing : Query optimizer chooses optimal storage format
Real-time analytics : Fresh data available immediately
Dual storage formats with synchronization
Adaptive storage layouts based on access patterns
Vectorized execution for analytical queries
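A toy version of dual storage formats with query routing might look like the following. All type names are hypothetical. The sketch assumes both layouts are written synchronously on every insert, which is the simplest way to get fresh analytical data. A real HTAP system would more likely replicate asynchronously and use a cost-based optimizer rather than a fixed match.

```rust
/// Hypothetical order record used by both layouts.
#[derive(Debug, Clone, PartialEq)]
pub struct Order {
    pub id: u64,
    pub customer: String,
    pub amount_cents: i64,
}

/// Row layout: whole records stored together, suited to point lookups.
#[derive(Default)]
pub struct RowStore {
    rows: Vec<Order>,
}

/// Column layout: one vector per field, suited to scans and aggregates.
#[derive(Default)]
pub struct ColumnStore {
    ids: Vec<u64>,
    customers: Vec<String>,
    amounts_cents: Vec<i64>,
}

/// Dual-format table that keeps both layouts in sync on every write.
#[derive(Default)]
pub struct HtapTable {
    row: RowStore,
    col: ColumnStore,
}

pub enum Query {
    GetById(u64),
    TotalAmount,
}

#[derive(Debug, PartialEq)]
pub enum Answer {
    Row(Option<Order>),
    Total(i64),
}

impl HtapTable {
    pub fn insert(&mut self, order: Order) {
        self.col.ids.push(order.id);
        self.col.customers.push(order.customer.clone());
        self.col.amounts_cents.push(order.amount_cents);
        self.row.rows.push(order);
    }

    /// Routes each query to the layout that suits it.
    pub fn execute(&self, query: Query) -> Answer {
        match query {
            // Transactional lookup: read one full record from the row store.
            Query::GetById(id) => {
                Answer::Row(self.row.rows.iter().find(|o| o.id == id).cloned())
            }
            // Analytical aggregate: scan a single column.
            Query::TotalAmount => Answer::Total(self.col.amounts_cents.iter().sum()),
        }
    }
}
```

The aggregate touches only the `amounts_cents` vector, which is the property that makes columnar layouts attractive for analytical queries.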
Progressive approach : Support multiple data models while learning optimal integration patterns.
Phase 1: Layered Implementation
Document API → LSM Storage Engine
Graph API → LSM Storage Engine
Time Series API → LSM Storage Engine
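The layered approach in Phase 1 amounts to encoding each model onto ordered keys. The sketch below shows this for documents. It assumes a `BTreeMap` as a stand-in for the LSM storage engine, a hypothetical `doc/<id>/<field>` key scheme, and document ids and field names that do not contain `/`.

```rust
use std::collections::BTreeMap;

/// Stand-in for the LSM storage engine: an ordered key-value map.
#[derive(Default)]
pub struct KvEngine {
    map: BTreeMap<String, String>,
}

impl KvEngine {
    pub fn put(&mut self, key: String, value: String) {
        self.map.insert(key, value);
    }

    /// Returns all pairs whose key starts with `prefix`, in key order.
    pub fn scan_prefix(&self, prefix: &str) -> Vec<(String, String)> {
        self.map
            .range(prefix.to_string()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect()
    }
}

/// Hypothetical document API layered on top of the key-value engine.
pub struct DocumentApi<'a> {
    engine: &'a mut KvEngine,
}

impl<'a> DocumentApi<'a> {
    pub fn new(engine: &'a mut KvEngine) -> Self {
        Self { engine }
    }

    /// Stores each field under the key `doc/<id>/<field>`.
    pub fn insert(&mut self, id: &str, fields: &[(&str, &str)]) {
        for (name, value) in fields {
            self.engine.put(format!("doc/{id}/{name}"), value.to_string());
        }
    }

    /// Reassembles a document by scanning its key prefix.
    pub fn get(&self, id: &str) -> BTreeMap<String, String> {
        let prefix = format!("doc/{id}/");
        self.engine
            .scan_prefix(&prefix)
            .into_iter()
            .map(|(k, v)| (k[prefix.len()..].to_string(), v))
            .collect()
    }
}
```

The trailing `/` in the prefix keeps document `1` from picking up the fields of document `10`. A graph or time-series layer would follow the same pattern with a different key encoding.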
Phase 2: Hybrid Integration
Native JSON document support in storage format
Specialized indexing for different models
Cross-model query capabilities
Phase 3: Unified Multi-Model
Storage engine natively understands multiple data types
Atomic transactions across all models
Optimized storage layouts per data type
Models to explore:
Document Store : JSON/BSON with rich querying
Graph Database : Relationships and graph traversals
Time Series : Optimized for metrics and IoT data
Search Engine : Full-text search and indexing
Beyond Raft : Explore alternative coordination mechanisms for better performance.
CRDTs : Conflict-free replicated data types for eventual consistency
Calvin-style : Deterministic transaction scheduling
Clock synchronization : Spanner-style global ordering
Hybrid approaches : Combine techniques based on workload characteristics
Lower latency for global operations
Better partition tolerance
Reduced coordination overhead
Workload-specific optimization
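A grow-only counter (G-Counter) is one of the simplest CRDTs and shows why no coordination is needed: replicas update locally and converge when they merge. The `GCounter` type below is a minimal sketch with hypothetical names. It assumes each replica only ever increments its own slot.

```rust
use std::collections::HashMap;

/// Grow-only counter (G-Counter): one slot per replica.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct GCounter {
    counts: HashMap<String, u64>,
}

impl GCounter {
    /// Increments the slot belonging to `replica`.
    pub fn increment(&mut self, replica: &str, by: u64) {
        *self.counts.entry(replica.to_string()).or_insert(0) += by;
    }

    /// The counter value is the sum over all replica slots.
    pub fn value(&self) -> u64 {
        self.counts.values().sum()
    }

    /// Merge takes the per-replica maximum, which makes it
    /// commutative, associative, and idempotent.
    pub fn merge(&mut self, other: &GCounter) {
        for (replica, &count) in &other.counts {
            let slot = self.counts.entry(replica.clone()).or_insert(0);
            *slot = (*slot).max(count);
        }
    }
}
```

Merging `a` into `b` and `b` into `a` produces the same state, and merging the same state twice changes nothing. Those properties are what allow replicas to exchange state in any order without a consensus round.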
Real-time data processing : Built-in stream processing capabilities.
Change Streams : Real-time data change notifications
Materialized Views : Continuously updated query results
Event Sourcing : Store events, compute state on demand
Stream Integration : Native Kafka/Pulsar compatibility
Use cases:
Real-time analytics and dashboards
Event-driven microservices integration
Live data synchronization between systems
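Change streams and continuously updated views can be sketched with the standard library's channels. `Table`, `Change`, and `SumView` are hypothetical names. The sketch assumes a single process and an in-memory table, with `std::sync::mpsc` standing in for a real stream transport such as Kafka.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, PartialEq)]
pub enum Change {
    Put { key: String, value: i64 },
    Delete { key: String },
}

/// Hypothetical table that publishes every applied change to subscribers.
#[derive(Default)]
pub struct Table {
    data: HashMap<String, i64>,
    subscribers: Vec<Sender<Change>>,
}

impl Table {
    /// Opens a change stream; the caller reads from the returned receiver.
    pub fn subscribe(&mut self) -> Receiver<Change> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    pub fn put(&mut self, key: &str, value: i64) {
        self.data.insert(key.to_string(), value);
        self.publish(Change::Put { key: key.to_string(), value });
    }

    pub fn delete(&mut self, key: &str) {
        // Only publish a delete if the key actually existed.
        if self.data.remove(key).is_some() {
            self.publish(Change::Delete { key: key.to_string() });
        }
    }

    fn publish(&mut self, change: Change) {
        // Drop subscribers whose receiving end has gone away.
        self.subscribers.retain(|tx| tx.send(change.clone()).is_ok());
    }
}

/// Materialized view: a running sum over all values, updated per change.
#[derive(Default)]
pub struct SumView {
    current: HashMap<String, i64>,
    pub total: i64,
}

impl SumView {
    pub fn apply(&mut self, change: Change) {
        match change {
            Change::Put { key, value } => {
                // Adjust by the difference from the previous value, if any.
                let old = self.current.insert(key, value).unwrap_or(0);
                self.total += value - old;
            }
            Change::Delete { key } => {
                self.total -= self.current.remove(&key).unwrap_or(0);
            }
        }
    }
}
```

The view never rescans the table: each change adjusts the total by a delta, which is the core idea behind incremental materialized view maintenance.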
Global distribution : Advanced topology management for worldwide deployments.
Challenges to explore:
Network partition handling
Latency-aware query routing
Compliance and data sovereignty
Cost optimization across regions
Region-aware partitioning : Data gravity and compliance requirements
Cross-region transactions : Global consistency with performance optimization
Cloud portability : Seamless operation across AWS/GCP/Azure
Edge caching : Bringing data closer to users
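Latency-aware routing under a data-sovereignty constraint can be expressed as a filter followed by a minimum. The `Region` struct, its fields, and the region names used in the usage note are hypothetical. The sketch assumes latencies are already measured and that compliance is modeled as a single jurisdiction tag per region.

```rust
/// Hypothetical description of one replica region.
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub name: &'static str,
    pub jurisdiction: &'static str,
    pub latency_ms: u32,
    pub healthy: bool,
}

/// Picks the lowest-latency healthy region, optionally restricted to one
/// jurisdiction for data-sovereignty reasons. Returns `None` when no
/// region qualifies.
pub fn route<'a>(
    regions: &'a [Region],
    required_jurisdiction: Option<&str>,
) -> Option<&'a Region> {
    regions
        .iter()
        .filter(|r| r.healthy)
        .filter(|r| required_jurisdiction.map_or(true, |j| r.jurisdiction == j))
        .min_by_key(|r| r.latency_ms)
}
```

With a healthy `us-east` region at 20 ms, a healthy `eu-west` region at 90 ms, and an unhealthy `eu-central` region at 70 ms, an unrestricted query routes to `us-east`, while a query that must stay in the `EU` jurisdiction routes to `eu-west`. Returning `None` rather than falling back to another jurisdiction is a deliberate choice here: a compliance rule should fail closed.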
Machine learning integration : Systems that optimize themselves based on workload patterns.
Auto-compaction : ML-driven compaction strategies
Query optimization : Learn from historical query patterns
Resource allocation : Dynamic memory/CPU allocation
Anomaly detection : Automatic performance issue detection
Reinforcement learning for database tuning
Workload prediction and preparation
Automated schema optimization
Performance regression detection
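Before reaching for learned models, a statistical baseline is a useful starting point for performance regression detection. The function below is a deliberately simple sketch, not an ML technique: it flags a latency sample that sits more than `threshold` sample standard deviations above the baseline mean. The function name and parameters are hypothetical.

```rust
/// Flags `latest_ms` as a regression when it exceeds the baseline mean by
/// more than `threshold` sample standard deviations. Returns `None` when
/// the baseline has fewer than two samples and variance cannot be estimated.
pub fn is_regression(baseline_ms: &[f64], latest_ms: f64, threshold: f64) -> Option<bool> {
    if baseline_ms.len() < 2 {
        return None;
    }
    let n = baseline_ms.len() as f64;
    let mean = baseline_ms.iter().sum::<f64>() / n;
    // Sample variance (divides by n - 1).
    let variance = baseline_ms
        .iter()
        .map(|x| (x - mean).powi(2))
        .sum::<f64>()
        / (n - 1.0);
    let std_dev = variance.sqrt();
    Some(latest_ms > mean + threshold * std_dev)
}
```

For a baseline of 10, 11, 9, 10, and 10 ms, the mean is 10 ms and the sample standard deviation is about 0.71 ms, so with a threshold of 3 a 15 ms sample is flagged and a 10.5 ms sample is not. A learned approach would replace the fixed threshold with something adapted to the workload.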
Pay-per-query model : True serverless database with instant scaling.
Instant Startup : Cold start in milliseconds
Auto-scaling : Scale to zero, scale to millions
Function Integration : Native serverless function support
Cost Model : Pay only for storage and compute used
Technical challenges:
Warm/cold state management
Connection pooling and management
Resource scheduling and allocation
Billing and metering accuracy
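Two of these challenges, scale-to-zero scheduling and metering accuracy, reduce to small pieces of integer arithmetic. The sketch below uses hypothetical names and prices. It assumes load is measured in queries per second and that compute is billed in whole micro-dollars, rounded up. Integer math avoids the drift that floating-point sums can introduce. Very large inputs could overflow `u64`, which a production version would need to guard against.

```rust
/// Hypothetical autoscaling policy.
pub struct ScalingPolicy {
    /// Queries per second that one instance can serve.
    pub queries_per_instance: u64,
    /// Upper bound on the number of instances.
    pub max_instances: u64,
}

impl ScalingPolicy {
    /// Returns the desired instance count for the observed load.
    /// Zero load yields zero instances (scale to zero).
    pub fn desired_instances(&self, queries_per_second: u64) -> u64 {
        // Treat a misconfigured capacity of 0 as 1 to avoid dividing by zero.
        let capacity = self.queries_per_instance.max(1);
        // Ceiling division.
        let needed = (queries_per_second + capacity - 1) / capacity;
        needed.min(self.max_instances)
    }
}

/// Bills compute time in whole micro-dollars using integer math only,
/// rounding up so that usage is never under-billed.
pub fn compute_cost_microdollars(compute_ms: u64, microdollars_per_second: u64) -> u64 {
    (compute_ms * microdollars_per_second + 999) / 1000
}
```

With a capacity of 100 queries per second per instance and a cap of 8 instances, a load of 250 queries per second asks for 3 instances and an idle system asks for none. The policy says nothing about how quickly an instance can start, which is where the warm/cold state management challenge comes in.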
Ranked by educational value:
1. Log-structured everything - Fundamental paradigm shift
2. HTAP architecture - Combines multiple database concepts
3. Multi-model architecture - Progressive complexity building
4. Separation of storage/compute - Modern cloud patterns
5. Consensus-free coordination - Cutting-edge distributed systems
LSM-Tree Paper : “The Log-Structured Merge-Tree (LSM-Tree)” - O’Neil et al.
Spanner Paper : “Spanner: Google’s Globally-Distributed Database” - Corbett et al.
Calvin Paper : “Calvin: Fast Distributed Transactions for Partitioned Database Systems” - Thomson et al.
FoundationDB : Multi-model with ACID guarantees
YugabyteDB : Multi-model with PostgreSQL compatibility
TiDB : HTAP with TiKV storage and TiFlash analytics
ClickHouse : Columnar database with real-time capabilities
CockroachDB : Global consistency with clock synchronization
Snowflake : Separation of storage and compute
Apache Pinot : Real-time analytics database
TimescaleDB : Time-series on PostgreSQL