EP17: SSTable and Memtable

When your disk is sorted, but your memory is fresh

Jan 31, 2025

What: A data structure and storage pattern that combines in-memory tables (Memtables) with immutable sorted string tables (SSTables) on disk. SSTable is a file containing sorted key-value pairs, while Memtable is its in-memory counterpart. Together, they form the foundation of Log-Structured Merge Trees (LSM Trees) used in modern databases.

A SSTable then is exactly what it sounds like, it is a file which contains a set of arbitrary, sorted key-value pairs inside. It’s a perfect candidate for streaming processing: read stream from disk, process key by key, and write output back to disk. Combining it with a static index that could be loaded into memory, and we’re one disk seek away from the key. So SSTable solves the problem of fast reads from disks by key.

We could have a similar structure (possibly a different one) but in memory to solve for fast random writes. This in-memory table we will call Memtable.

The following describes the interactions between those components (kudos to this article):

On-disk SSTable indexes are always loaded into memory
All writes go directly to the MemTable index
Reads check the MemTable first and then the SSTable indexes
Periodically, the MemTable is flushed to disk as an SSTable
Periodically, on-disk SSTables are merged with a merge sort

Use cases:

Processing large sequential read/write workloads
Database engines (LevelDB, RocksDB)
Storage engines in distributed databases (Cassandra, HBase)
Embedded storage systems (IndexDB in WebKit)
Processing gigabyte to terabyte-sized datasets

Reliability:

Immutable SSTables ensure data integrity
Periodic flushing of Memtable prevents data loss
Sequential writes reduce chances of corruption
Tombstone records handle deletes safely

Scalability:

Handles datasets from megabytes to petabytes
Efficient merging of multiple SSTables
Support for range-based sharding
Compact storage through optional compression

Maintainability:

Simple key-value interface
Clear separation of hot/cold data
Built-in garbage collection through compaction

Performance:

From RocksDB benchmarks:

150k random read operations per second
10^6 write operations per second

Users:

Google (BigTable)
Meta (RocksDB)
Apache Cassandra
Apache HBase
LevelDB

Resources:

A large portion of this article is borrowed from https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
RocksDB doc

Tags: #kv #storage

Disclaimer: Found a mistake or have a different perspective? Please drop a comment! I aim to keep this content accurate and valuable for everyone.

System Design Cheat Sheet

Discussion about this post