
Architecture Overview

SierraDB is designed as a distributed, horizontally scalable event store with a three-tier storage hierarchy optimized for event sourcing workloads.

Core Architecture

Three-Tier Hierarchy

SierraDB organizes data using a structured hierarchy:

Database
├── Buckets (4 by default)
│   ├── Bucket 0
│   │   ├── Partitions (8 per bucket)
│   │   └── Segments (256MB files)
│   ├── Bucket 1
│   └── ...

1. Buckets

  • Purpose: Top-level data organization for I/O parallelism
  • Default: 4 buckets
  • Scaling: More buckets = better parallel I/O on multi-core systems
  • Configuration: Must be power of 2

2. Partitions

  • Purpose: Units of distribution and write concurrency
  • Default: 32 partitions (8 per bucket)
  • Scaling: More partitions = better write parallelism
  • Distribution: Events distributed via consistent hashing

3. Segments

  • Purpose: Individual storage files with predictable size
  • Default: 256MB per segment
  • Performance: Consistent write performance regardless of database size
  • Rollover: New segment created when size limit reached
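
The defaults above can be summarized as a configuration sketch. This is illustrative only; the struct and field names are hypothetical, not SierraDB's actual configuration API:

// Hypothetical configuration sketch, not SierraDB's actual API.
struct StorageConfig {
    bucket_count: u32,          // must be a power of 2 (default: 4)
    partitions_per_bucket: u32, // default: 8, giving 32 partitions total
    segment_size_bytes: u64,    // default: 256 MiB per segment
}

impl Default for StorageConfig {
    fn default() -> Self {
        Self {
            bucket_count: 4,
            partitions_per_bucket: 8,
            segment_size_bytes: 256 * 1024 * 1024,
        }
    }
}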

Data Distribution

Consistent Hashing

Events are distributed across partitions using consistent hashing:

// Partition determination
let partition_id = hash(partition_key) % partition_count;

Benefits:

  • Even distribution across partitions
  • Predictable partition assignment
  • Enables horizontal scaling
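
As a runnable sketch of the calculation above (SierraDB's actual hash function is not specified here, so this example substitutes std's DefaultHasher to show the mechanics):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a partition key to a partition; for a fixed partition count,
// the modulo keeps assignment stable and evenly spread.
fn partition_for(partition_key: &str, partition_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    partition_key.hash(&mut hasher);
    hasher.finish() % partition_count
}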

Partition Keys

Events can specify custom partition keys for:

  • Cross-stream transactions: Related events in same partition
  • Query optimization: Co-locate related data
  • Load balancing: Control distribution patterns

Default Partition Key: When no partition key is specified, SierraDB generates a deterministic UUID based on the stream ID:

Uuid::new_v5(&uuid!("219bd637-e279-53e9-9e2b-eabe5d9120cc"), stream_id.as_bytes())

Where 219bd637-e279-53e9-9e2b-eabe5d9120cc is itself the result of:

Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"sierradb.tqwewe.com")
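
Putting the two steps together, a minimal sketch using the uuid crate (with its v5 feature enabled):

use uuid::Uuid;

// Derive the default partition key for a stream, per the scheme above.
fn default_partition_key(stream_id: &str) -> Uuid {
    // Namespace UUID derived from the project domain;
    // equals 219bd637-e279-53e9-9e2b-eabe5d9120cc.
    let namespace = Uuid::new_v5(&Uuid::NAMESPACE_DNS, b"sierradb.tqwewe.com");
    Uuid::new_v5(&namespace, stream_id.as_bytes())
}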

Storage Model

Append-Only Design

SierraDB uses append-only storage for optimal event sourcing:

Segment File Layout:
[Event 1][Event 2][Event 3]...[Event N]

Advantages:

  • Sequential writes (optimal for SSDs)
  • No update-in-place complexity
  • Natural audit trail
  • Crash safety

Index Files

Each segment has accompanying index files:

  • Event Index (.eidx): Event ID → Position mapping
  • Stream Index (.sidx): Stream → Event positions (includes bloom filter)
  • Partition Index (.pidx): Partition sequence → Position
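
Conceptually, each index maps a key to a byte position in the segment's data file. A hypothetical entry layout, for intuition only (the actual on-disk format may differ):

// Hypothetical index entries; not the real on-disk format.
struct EventIndexEntry {     // .eidx: event ID → position
    event_id: [u8; 16],      // event UUID
    offset: u64,             // byte position in the data file
}

struct PartitionIndexEntry { // .pidx: partition sequence → position
    partition_seq: u64,
    offset: u64,
}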

File Structure

data/
└── buckets/
    ├── 00000/
    │   └── segments/
    │       ├── 0000000000/
    │       │   ├── data.evts       # Raw append-only event data
    │       │   ├── index.eidx      # Event ID index (lookup event by ID)
    │       │   ├── partition.pidx  # Partition index (scan events by partition ID)
    │       │   └── stream.sidx     # Stream index (scan events by stream ID)
    │       └── 0000000001/...
    ├── 00001/...
    ├── 00002/...
    └── 00003/...

Distributed Architecture

Cluster Coordination

SierraDB uses a distributed architecture similar to Cassandra:

  • Peer-to-peer: No single point of failure
  • libp2p networking: Modern, extensible networking stack
  • Automatic discovery: mDNS and manual peer configuration
  • Gossip protocol: Cluster state propagation

Replication

Data is replicated across multiple nodes:

Replication Factor = 3
┌─────────┐    ┌─────────┐    ┌─────────┐
│ Node A  │───▶│ Node B  │───▶│ Node C  │
│ Primary │    │ Replica │    │ Replica │
└─────────┘    └─────────┘    └─────────┘

Features:

  • Configurable replication factor
  • Quorum-based writes
  • Automatic failover
  • Consistent hashing for replica placement
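
With quorum-based writes, a write is acknowledged once a majority of replicas confirm it. A conceptual sketch of the majority calculation:

// Majority quorum size for a given replication factor (conceptual sketch).
fn write_quorum(replication_factor: usize) -> usize {
    replication_factor / 2 + 1
}
// write_quorum(3) == 2: an RF=3 write tolerates one unreachable replica.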

Watermark System

SierraDB implements a watermark system for consistency:

  • No distributed read quorums: Single-node reads
  • Gap-free guarantees: Monotonic sequence numbers
  • Consensus coordination: Raft-inspired term system
  • Deterministic leadership: Oldest-node algorithm
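
The gap-free guarantee can be illustrated with a toy watermark calculation: the watermark only advances while the next sequence number has been confirmed, so readers at or below it never observe gaps. This is a conceptual sketch, not SierraDB's implementation:

use std::collections::BTreeSet;

// Advance the watermark through contiguous confirmed sequence numbers.
fn advance_watermark(mut watermark: u64, confirmed: &BTreeSet<u64>) -> u64 {
    while confirmed.contains(&(watermark + 1)) {
        watermark += 1;
    }
    watermark
}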

Performance Characteristics

Write Performance

  • Partitioned writes: Parallel processing across partitions
  • Sequential I/O: Append-only writes to segments
  • Batched replication: Efficient network utilization
  • Predictable latency: Consistent regardless of database size

Read Performance

SierraDB optimizes for three primary read patterns:

  1. Event Lookup by ID
     • Bloom filter → Index file → Data file
     • O(1) average case performance
  2. Stream Scanning
     • Stream index → Data positions
     • Sequential reads within segments
  3. Partition Scanning
     • Partition index → Sequential scan
     • Optimal for replay scenarios
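
The ID-lookup path can be sketched with toy stand-ins for the internals (these types are illustrative: a real bloom filter is probabilistic, and events would be decoded from a framed record):

use std::collections::{HashMap, HashSet};

// Illustrative stand-ins for SierraDB internals, not the real types.
struct Segment {
    bloom: HashSet<u128>,            // toy bloom filter over event IDs
    event_index: HashMap<u128, u64>, // .eidx: event ID → byte offset
    data: Vec<u8>,                   // .evts contents
}

impl Segment {
    // Lookup flow: bloom filter → index file → data file.
    fn lookup(&self, event_id: u128) -> Option<&[u8]> {
        if !self.bloom.contains(&event_id) {
            return None; // negative lookup answered without disk I/O
        }
        let offset = *self.event_index.get(&event_id)? as usize;
        self.data.get(offset..) // real code would decode one framed event here
    }
}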

Memory Management

  • Configurable caches: Block cache and index cache
  • Memory-mapped I/O: OS-managed memory for large datasets
  • Bloom filters: Reduce disk I/O for negative lookups
  • Rust ownership: Zero-cost abstractions, no GC pressure

Fault Tolerance

Node Failures

  • Automatic detection: Health monitoring and failure detection
  • Replica promotion: Automatic failover to replicas
  • Data recovery: Catch-up replication for rejoining nodes
  • Graceful degradation: Partial functionality during failures

Data Integrity

  • CRC32C checksums: Corruption detection on every event
  • Atomic writes: All-or-nothing segment updates
  • Write-ahead logging: Durability guarantees
  • Automatic repair: Corrupt segment detection and recovery
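
Per-event corruption detection might look like the following sketch, using the crc32c crate (the record framing implied here is illustrative, not SierraDB's actual format):

// Verify an event payload against its stored CRC32C checksum.
fn verify_event(payload: &[u8], stored_checksum: u32) -> bool {
    crc32c::crc32c(payload) == stored_checksum
}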

Network Partitions

  • CAP theorem trade-offs: Consistency and Partition tolerance (CP)
  • Split-brain prevention: Quorum-based decision making
  • Graceful degradation: Read-only mode during partitions

Scaling Characteristics

Horizontal Scaling

Single Node  →  3 Nodes  →  10 Nodes  →  100+ Nodes
     │             │            │             │
   Local         Small        Medium        Large
  Storage       Cluster     Distributed     Scale

Scaling factors:

  • Linear write throughput scaling
  • Increased fault tolerance
  • Geographic distribution
  • Load distribution

Partition Strategy

// Example partition calculation
let total_partitions = bucket_count * partitions_per_bucket;
let target_partition = hash(stream_id) % total_partitions;
let target_node = consistent_hash_ring.get_node(target_partition);
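
A self-contained version of this sketch, with a toy ring (real placement uses a consistent hash ring; modulo placement here keeps the example short):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash64<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Route a stream to a partition, then the partition to a node.
fn route<'a>(stream_id: &str, total_partitions: u64, nodes: &[&'a str]) -> (u64, &'a str) {
    let partition = hash64(&stream_id) % total_partitions;
    let node = nodes[(partition % nodes.len() as u64) as usize];
    (partition, node)
}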

Comparison with Other Systems

vs. Traditional Databases

  • Append-only vs. CRUD: Optimized for event patterns
  • Horizontal scaling: Built-in vs. bolt-on
  • Consistency model: Event ordering vs. ACID transactions

vs. Apache Kafka

  • Storage model: Persistent vs. retention-based
  • Query patterns: Random access vs. sequential consumption
  • Client protocol: Redis-compatible vs. custom

vs. KurrentDB (formerly EventStore)

  • Clustering: Peer-to-peer vs. quorum-based clustering
  • Protocol: Redis RESP3 vs. custom
  • Performance: native Rust (no GC) vs. .NET runtime