Architecture Overview
SierraDB is designed as a distributed, horizontally-scalable event store with a three-tier storage hierarchy optimized for event sourcing workloads.
Core Architecture
Three-Tier Hierarchy
SierraDB organizes data using a structured hierarchy:
Database
├── Buckets (4 by default)
│   ├── Bucket 0
│   │   ├── Partitions (8 per bucket)
│   │   └── Segments (256MB files)
│   ├── Bucket 1
│   └── ...
1. Buckets
- Purpose: Top-level data organization for I/O parallelism
- Default: 4 buckets
- Scaling: More buckets = better parallel I/O on multi-core systems
- Configuration: Must be a power of 2
2. Partitions
- Purpose: Units of distribution and write concurrency
- Default: 32 partitions (8 per bucket)
- Scaling: More partitions = better write parallelism
- Distribution: Events distributed via consistent hashing
3. Segments
- Purpose: Individual storage files with predictable size
- Default: 256MB per segment
- Performance: Consistent write performance regardless of database size
- Rollover: New segment created when the size limit is reached (see the sketch after this list)
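The rollover rule can be summarized in a few lines. The sketch below is illustrative only; the struct and names are hypothetical, not SierraDB's internal API.
// Hypothetical sketch of the segment rollover rule; names are illustrative.
const SEGMENT_SIZE_LIMIT: u64 = 256 * 1024 * 1024; // 256MB default

struct Segment {
    bytes_written: u64,
}

impl Segment {
    // A new segment is opened when appending `event_len` more bytes
    // would push this one past the configured size limit.
    fn needs_rollover(&self, event_len: u64) -> bool {
        self.bytes_written + event_len > SEGMENT_SIZE_LIMIT
    }
}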
Data Distribution
Consistent Hashing
Events are distributed across partitions using consistent hashing:
// Partition determination
let partition_id = hash(partition_key) % partition_count;
Benefits:
- Even distribution across partitions
- Predictable partition assignment
- Enables horizontal scaling
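As a concrete (and hedged) illustration, the calculation above can be written with the standard library's hasher; SierraDB's actual hash function is not specified here.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: the real hash function used by SierraDB may differ.
fn partition_for(partition_key: &str, partition_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    partition_key.hash(&mut hasher);
    hasher.finish() % partition_count // stable within a single configuration
}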
Partition Keys
Events can specify custom partition keys for:
- Cross-stream transactions: Related events in same partition
- Query optimization: Co-locate related data
- Load balancing: Control distribution patterns
Default Partition Key: When no partition key is specified, SierraDB generates a deterministic UUID based on the stream ID:
Uuid::new_v5(&uuid!("219bd637-e279-53e9-9e2b-eabe5d9120cc"), stream_id.as_bytes())
Where 219bd637-e279-53e9-9e2b-eabe5d9120cc is the UUID of:
Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"sierradb.tqwewe.com")
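Putting the two derivations together, a minimal sketch using the uuid crate (with the v5 feature enabled) might look like this; the helper function name is hypothetical, not SierraDB's public API.
use uuid::{uuid, Uuid};

// Namespace UUID from above, i.e. new_v5(NAMESPACE_DNS, "sierradb.tqwewe.com").
const SIERRA_NAMESPACE: Uuid = uuid!("219bd637-e279-53e9-9e2b-eabe5d9120cc");

// Hypothetical helper: derives the default partition key for a stream.
fn default_partition_key(stream_id: &str) -> Uuid {
    Uuid::new_v5(&SIERRA_NAMESPACE, stream_id.as_bytes())
}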
Storage Model
Append-Only Design
SierraDB uses append-only storage for optimal event sourcing:
Segment File Layout:
[Event 1][Event 2][Event 3]...[Event N]
Advantages:
- Sequential writes (optimal for SSDs)
- No update-in-place complexity
- Natural audit trail
- Crash safety
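A minimal sketch of such an append, assuming a simple length-prefixed framing (the actual on-disk event format is not shown here):
use std::fs::OpenOptions;
use std::io::{self, Write};

// Appends one length-prefixed event to a segment file. The framing here
// (u32 length + payload) is an assumption for illustration only.
fn append_event(segment_path: &str, payload: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true) // sequential, append-only writes
        .open(segment_path)?;
    file.write_all(&(payload.len() as u32).to_le_bytes())?;
    file.write_all(payload)?;
    file.sync_data() // flush to disk before acknowledging the write
}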
Index Files
Each segment has accompanying index files:
- Event Index (.eidx): Event ID → Position mapping
- Stream Index (.sidx): Stream → Event positions (includes bloom filter)
- Partition Index (.pidx): Partition sequence → Position
File Structure
data/
  buckets/
    00000/
      segments/
        0000000000/
          data.evts       # Raw append-only event data
          index.eidx      # Event ID index (lookup event by ID)
          partition.pidx  # Partition index (scan events by partition ID)
          stream.sidx     # Stream index (scan events by stream ID)
        0000000001/...
    00001/...
    00002/...
    00003/...
Distributed Architecture
Cluster Coordination
SierraDB uses a distributed architecture similar to Cassandra:
- Peer-to-peer: No single point of failure
- libp2p networking: Modern, extensible networking stack
- Automatic discovery: mDNS and manual peer configuration
- Gossip protocol: Cluster state propagation
Replication
Data is replicated across multiple nodes:
Replication Factor = 3
┌─────────┐     ┌─────────┐     ┌─────────┐
│ Node A  │────▶│ Node B  │────▶│ Node C  │
│ Primary │     │ Replica │     │ Replica │
└─────────┘     └─────────┘     └─────────┘
Features:
- Configurable replication factor
- Quorum-based writes
- Automatic failover
- Consistent hashing for replica placement
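Quorum-based writes follow standard majority arithmetic; the exact commit rule in SierraDB is not spelled out here, so treat the following as a sketch:
// Majority quorum for a given replication factor (e.g. 2 of 3, 3 of 5).
fn write_quorum(replication_factor: usize) -> usize {
    replication_factor / 2 + 1
}

// A write is considered committed once a majority of replicas acknowledge it.
fn is_committed(acks_received: usize, replication_factor: usize) -> bool {
    acks_received >= write_quorum(replication_factor)
}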
Watermark System
SierraDB implements a watermark system for consistency:
- No distributed read quorums: Single-node reads
- Gap-free guarantees: Monotonic sequence numbers
- Consensus coordination: Raft-inspired term system
- Deterministic leadership: Oldest-node algorithm
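A hedged sketch of how a gap-free watermark can be maintained per partition (field names and types are illustrative, not SierraDB's internals):
use std::collections::BTreeSet;

// The watermark is the highest sequence number up to which every event is
// durably replicated with no gaps, so reads at or below it need no quorum.
struct PartitionWatermark {
    watermark: u64,         // highest contiguous, fully replicated sequence
    pending: BTreeSet<u64>, // replicated sequences not yet contiguous
}

impl PartitionWatermark {
    fn ack(&mut self, sequence: u64) {
        self.pending.insert(sequence);
        // Advance only while the next sequence is present, preserving gap-freedom.
        while self.pending.remove(&(self.watermark + 1)) {
            self.watermark += 1;
        }
    }
}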
Performance Characteristics
Write Performance
- Partitioned writes: Parallel processing across partitions
- Sequential I/O: Append-only writes to segments
- Batched replication: Efficient network utilization
- Predictable latency: Consistent regardless of database size
Read Performance
SierraDB optimizes for three primary read patterns:
1. Event Lookup by ID
- Bloom filter → Index file → Data file
- O(1) average-case performance
2. Stream Scanning
- Stream index → Data positions
- Sequential reads within segments
3. Partition Scanning
- Partition index → Sequential scan
- Optimal for replay scenarios
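For the first pattern, the lookup path can be sketched as follows; the in-memory maps below stand in for the on-disk .eidx index and its bloom filter, so this is illustrative rather than SierraDB's actual code.
use std::collections::HashMap;

// Placeholder segment index: real lookups go through the .eidx file and a
// bloom filter rather than in-memory maps.
struct SegmentIndex {
    event_positions: HashMap<u128, usize>, // event ID → byte offset in data.evts
    data: Vec<u8>,                         // stands in for the mapped data file
}

impl SegmentIndex {
    // Bloom-filter stand-in: may report false positives, never false negatives.
    fn might_contain(&self, event_id: u128) -> bool {
        self.event_positions.contains_key(&event_id)
    }

    fn lookup(&self, event_id: u128) -> Option<&[u8]> {
        if !self.might_contain(event_id) {
            return None; // skip this segment without touching its index or data
        }
        let offset = *self.event_positions.get(&event_id)?;
        Some(&self.data[offset..]) // a real reader would decode one framed event here
    }
}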
Memory Management
- Configurable caches: Block cache and index cache
- Memory-mapped I/O: OS-managed memory for large datasets
- Bloom filters: Reduce disk I/O for negative lookups
- Rust ownership: Zero-cost abstractions, no GC pressure
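For memory-mapped I/O, one common approach in Rust is the memmap2 crate; whether SierraDB uses this exact crate is an assumption.
use memmap2::Mmap;
use std::fs::File;

// Maps a segment file read-only and lets the OS page it in on demand.
fn map_segment(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified concurrently while mapped.
    unsafe { Mmap::map(&file) }
}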
Fault Tolerance
Node Failures
- Automatic detection: Health monitoring and failure detection
- Replica promotion: Automatic failover to replicas
- Data recovery: Catch-up replication for rejoining nodes
- Graceful degradation: Partial functionality during failures
Data Integrity
- CRC32C checksums: Corruption detection on every event
- Atomic writes: All-or-nothing segment updates
- Write-ahead logging: Durability guarantees
- Automatic repair: Corrupt segment detection and recovery
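For reference, CRC32C is the Castagnoli polynomial, exposed by the crc crate as CRC_32_ISCSI; where SierraDB stores the checksum relative to each event is not shown here.
use crc::{Crc, CRC_32_ISCSI};

// CRC32C (Castagnoli), the same algorithm used by iSCSI and many storage engines.
const CRC32C: Crc<u32> = Crc::<u32>::new(&CRC_32_ISCSI);

// Recomputes the checksum over an event's bytes and compares it to the stored value.
fn verify_event(payload: &[u8], stored_checksum: u32) -> bool {
    CRC32C.checksum(payload) == stored_checksum
}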
Network Partitions
- CAP theorem trade-off: Favors consistency and partition tolerance (CP)
- Split-brain prevention: Quorum-based decision making
- Graceful degradation: Read-only mode during partitions
Scaling Characteristics
Horizontal Scaling
Single Node  →  3 Nodes  →  10 Nodes  →  100+ Nodes
     │             │            │             │
   Local         Small        Medium      Large Scale
  Storage       Cluster                   Distributed
Scaling factors:
- Linear write throughput scaling
- Increased fault tolerance
- Geographic distribution
- Load distribution
Partition Strategy
// Example partition calculation
let total_partitions = bucket_count * partitions_per_bucket;
let target_partition = hash(stream_id) % total_partitions;
let target_node = consistent_hash_ring.get_node(target_partition);
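A minimal sketch of the ring lookup referenced above, assuming a BTreeMap-based ring keyed by hashed positions (virtual nodes and replica walk order are omitted):
use std::collections::BTreeMap;

// Hashed ring position → node ID. Building the ring (hashing node IDs,
// adding virtual nodes) is omitted from this sketch.
struct HashRing {
    ring: BTreeMap<u64, String>,
}

impl HashRing {
    // The owner of a partition is the first node at or after its ring position,
    // wrapping around to the start of the ring when necessary.
    fn get_node(&self, partition_hash: u64) -> Option<&str> {
        self.ring
            .range(partition_hash..)
            .map(|(_, node)| node.as_str())
            .next()
            .or_else(|| self.ring.values().next().map(String::as_str))
    }
}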
Comparison with Other Systems
vs. Traditional Databases
- Append-only vs. CRUD: Optimized for event patterns
- Horizontal scaling: Built-in vs. bolt-on
- Consistency model: Event ordering vs. ACID transactions
vs. Apache Kafka
- Storage model: Persistent vs. retention-based
- Query patterns: Random access vs. sequential consumption
- Client protocol: Redis-compatible vs. custom
vs. KurrentDB (formerly EventStore)
- Clustering: Peer-to-peer vs. quorum-based clustering
- Protocol: Redis RESP3 vs. custom
- Performance: Native Rust vs. the .NET runtime