<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Databases on Neil's Space</title><link>https://neilmin.com/tags/databases/</link><description>Recent content in Databases on Neil's Space</description><image><title>Neil's Space</title><url>https://neilmin.com/images/papermod-cover.png</url><link>https://neilmin.com/images/papermod-cover.png</link></image><generator>Hugo</generator><language>en-US</language><lastBuildDate>Sat, 13 Jun 2026 07:00:00 -0700</lastBuildDate><atom:link href="https://neilmin.com/tags/databases/index.xml" rel="self" type="application/rss+xml"/><item><title>How RocksDB Works: A Minimal LSM-Tree Primer</title><link>https://neilmin.com/posts/how-rocksdb-works/</link><pubDate>Sat, 13 Jun 2026 07:00:00 -0700</pubDate><guid>https://neilmin.com/posts/how-rocksdb-works/</guid><description>I spent some time really learning how RocksDB works while prepping for interviews, and these are my notes: what RocksDB is, how data gets written and read, what compaction does in the background, and the unavoidable trade-off between the three amplification factors. Not exhaustive — just the core LSM-tree ideas, shared for anyone else trying to get it.</description><content:encoded><![CDATA[<p>While prepping for interviews, I spent some time really digging into how RocksDB works — how its storage engine is designed, how data gets written, and how it gets read back. RocksDB (and the LSM-tree underneath it) is one of those things a lot of people have heard of but can&rsquo;t quite explain — I couldn&rsquo;t either, before I sat down with it. Once it clicked, I wrote up the core ideas as these notes, to share with anyone else trying to get it.</p>
<p>I won&rsquo;t claim this is exhaustive or deeply expert, but I hope it leaves you (and future me) with a clear overall picture of how RocksDB actually works.</p>
<h2 id="what-rocksdb-is">What RocksDB is</h2>
<p>In one line: <strong>an embeddable, persistent key-value store</strong>.</p>
<ul>
<li><strong>Embeddable</strong>: it isn&rsquo;t a standalone server like MySQL — it&rsquo;s a library you compile directly into your program, which cuts out inter-process communication overhead.</li>
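<li><strong>Persistent</strong>: data lives on disk and survives restarts; by default a write-ahead log (covered below) lets it recover writes that were still only in memory when the process crashed.</li>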
<li>Forked from Google&rsquo;s <strong>LevelDB</strong> in 2012, written in C++, optimized specifically for <strong>SSDs</strong> and <strong>write-heavy</strong> workloads. Meta, Microsoft, Netflix, and Uber all use it.</li>
<li>It is <strong>not distributed</strong> — replication and sharding are your job at a higher layer.</li>
</ul>
<p>The operations it exposes are humble: <code>put(key, value)</code> to write, <code>get(key)</code> to read, <code>delete(key)</code> to remove, <code>merge(key, value)</code> to combine, and <code>iterator.seek()</code> for range scans.</p>
<h2 id="the-core-idea-the-lsm-tree">The core idea: the LSM-tree</h2>
<p>Everything in RocksDB is built on the <strong>LSM-tree (Log-Structured Merge-Tree)</strong>.</p>
<p>The core tension it tackles: <strong>disks hate random writes and love sequential ones</strong>. The LSM-tree&rsquo;s trick is to buffer writes in memory, keep them sorted, then flush them to disk sequentially all at once. In other words, it <strong>batches a flood of random writes into sequential writes</strong> — and that&rsquo;s the fundamental reason it writes so fast.</p>
<p>Structurally, data is split across many levels: the top level lives in memory, and below it sit level after level on disk, numbered L0, L1, L2… The deeper you go, the older and larger the data (each level is typically ~10× the one above it).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>memory   ┌──────────────────────────────┐
</span></span><span style="display:flex;"><span>         │  MemTable (writable, sorted)  │  ← new data lands here first
</span></span><span style="display:flex;"><span>         └──────────────────────────────┘
</span></span><span style="display:flex;"><span>- - - - - - - - - - - - - - - - - - - - - -  flush
</span></span><span style="display:flex;"><span>disk     L0   [SST] [SST] [SST]      ← newest; key ranges may overlap across files
</span></span><span style="display:flex;"><span>         L1   [SST][SST][SST][SST]   ← no overlap within a level, and bigger
</span></span><span style="display:flex;"><span>         L2   [SST][SST] ......      ← older and larger the deeper you go (~×10)
</span></span><span style="display:flex;"><span>         ...
</span></span></code></pre></div><p>This structure dates back to 1996 and was designed for write-intensive workloads. Besides RocksDB, Bigtable, HBase, and Cassandra are all LSM-tree based, and MongoDB&rsquo;s WiredTiger engine supports LSM-trees as an option alongside its default B-tree.</p>
<h2 id="writing-how-data-gets-in">Writing: how data gets in</h2>
<p>A single write lands in <strong>two</strong> places at once:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>put(key, value)
</span></span><span style="display:flex;"><span>      │
</span></span><span style="display:flex;"><span>      ├──► WAL       (appended sequentially to disk, for crash safety)
</span></span><span style="display:flex;"><span>      │
</span></span><span style="display:flex;"><span>      └──► MemTable  (kept sorted in memory)
</span></span><span style="display:flex;"><span>                  │  fills up at ~64MB
</span></span><span style="display:flex;"><span>                  ▼
</span></span><span style="display:flex;"><span>            turns read-only; a background thread flushes it to one SST file → L0
</span></span></code></pre></div><p><strong>MemTable</strong>: the in-memory write buffer where every insert, update, and delete goes first. It&rsquo;s kept <strong>sorted by key</strong> internally (the default implementation is a <strong>skip list</strong>), which is what makes the later flush and range queries efficient. One detail: a delete doesn&rsquo;t actually erase anything — it writes a <strong>tombstone</strong> record meaning &ldquo;this key is deleted.&rdquo; The real cleanup is left to compaction later.</p>
<p><strong>WAL (Write-Ahead Log)</strong>: the MemTable is in memory, so a power loss would wipe it. So every write also <strong>appends</strong> a record to a WAL file on disk — key, value, operation type, and a checksum. After a crash, RocksDB replays the WAL to reconstruct the MemTable. Note the WAL is <strong>appended in write order, not sorted</strong> — it&rsquo;s optimizing purely for speed.</p>
<p><strong>Flush</strong>: once a MemTable fills up, it turns read-only and a fresh one takes over; a background thread then flushes the read-only MemTable into a single <strong>SST file</strong> on L0. Once that&rsquo;s done, the corresponding WAL can be discarded. Because the MemTable was already sorted, this flush is one <strong>sequential write</strong> — which is the whole point of the LSM-tree.</p>
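<p>To make the write path concrete, here&rsquo;s a minimal Python sketch of the three pieces above. It is a toy model, not RocksDB&rsquo;s implementation: the class name <code>TinyWritePath</code> and the entry-count limit are my own assumptions (RocksDB sizes the MemTable in bytes and uses a skip list), the WAL is just a list, and the flush happens inline rather than on a background thread.</p>

```python
import bisect
import zlib


class TinyWritePath:
    """Toy model of the write path: WAL append, sorted MemTable, flush to L0."""

    def __init__(self, memtable_limit=4):
        self.wal = []      # stand-in for the on-disk log, kept in arrival order
        self.keys = []     # MemTable keys, kept sorted
        self.values = {}   # MemTable key -> value (None marks a tombstone)
        self.l0 = []       # flushed "SST files", newest last
        self.memtable_limit = memtable_limit  # hypothetical entry-count limit

    def _write(self, op, key, value):
        record = f"{op}:{key}:{value}".encode()
        self.wal.append((record, zlib.crc32(record)))  # append-only, with checksum
        if key not in self.values:
            bisect.insort(self.keys, key)              # keep keys ordered
        self.values[key] = value
        if len(self.keys) >= self.memtable_limit:
            self.flush()

    def put(self, key, value):
        self._write("put", key, value)

    def delete(self, key):
        self._write("delete", key, None)  # tombstone; nothing is erased yet

    def flush(self):
        # The MemTable is already sorted, so the "file" is written in one pass.
        sst = [(k, self.values[k]) for k in self.keys]
        self.l0.append(sst)
        self.keys, self.values = [], {}
        self.wal.clear()  # flushed data no longer needs the log


db = TinyWritePath(memtable_limit=3)
db.put("b", "2")
db.put("a", "1")
db.delete("b")
db.put("c", "3")   # third distinct key: triggers the flush
print(db.l0)       # [[('a', '1'), ('b', None), ('c', '3')]]
```

<p>Note how the keys arrive out of order but the flushed file comes out sorted, and how the delete survives the flush as a tombstone entry instead of removing anything.</p>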
<h2 id="what-an-sst-file-looks-like">What an SST file looks like</h2>
<p>An <strong>SST (Static Sorted Table)</strong> is the file that actually holds data on disk, and it&rsquo;s never modified once written. Inside is a pile of <strong>sorted key-value pairs</strong>, laid out in a carefully designed block format (blocks default to 4KB and can be compressed with Snappy, LZ4, ZSTD, etc.).</p>
<p>An SST is roughly split into a few sections:</p>
<ul>
<li><strong>Data blocks</strong>: the sorted key-value pairs. Since adjacent keys are similar, only the differences need to be stored (delta encoding) to save space.</li>
<li><strong>Index</strong>: records, for each data block, &ldquo;last key → offset in the file,&rdquo; so a lookup can <strong>binary-search</strong> straight to the right block instead of scanning the whole file.</li>
<li><strong>Bloom filter (optional)</strong>: a probabilistic structure that very quickly answers &ldquo;this key is <strong>definitely not</strong> in this file.&rdquo; It may give a false &ldquo;yes,&rdquo; but never a false &ldquo;no&rdquo; — perfect for skipping, on a read, a whole batch of files you don&rsquo;t need to touch.</li>
</ul>
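<p>Here&rsquo;s a rough Python sketch of how those three sections cooperate on a lookup. Everything in it is an assumption made for illustration: <code>TinyBloom</code> and <code>TinySST</code> are made-up names, blocks hold two entries instead of 4KB, and the filter hashes with SHA-256, which is not what RocksDB does.</p>

```python
import bisect
import hashlib


class TinyBloom:
    """Minimal Bloom filter: may say "maybe" wrongly, never says "no" wrongly."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class TinySST:
    """Immutable sorted pairs, split into blocks, with an index and a filter."""

    def __init__(self, sorted_pairs, block_size=2):
        self.blocks = [sorted_pairs[i:i + block_size]
                       for i in range(0, len(sorted_pairs), block_size)]
        self.index = [block[-1][0] for block in self.blocks]  # last key per block
        self.bloom = TinyBloom()
        for key, _value in sorted_pairs:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                            # definitely not in this file
        i = bisect.bisect_left(self.index, key)    # first block whose last key >= key
        if i == len(self.blocks):
            return None                            # key is past the end of the file
        for k, v in self.blocks[i]:                # scan only that one block
            if k == key:
                return v
        return None                                # filter said "maybe", but no


sst = TinySST([("a", 1), ("c", 3), ("e", 5), ("g", 7), ("i", 9)])
print(sst.get("e"), sst.get("d"))  # 5 None
```

<p>The index is what keeps the lookup to a single block, and the filter is what lets most lookups for absent keys return before touching any block at all.</p>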
<h2 id="reading-how-data-gets-found">Reading: how data gets found</h2>
<p>To read a key, you search <strong>newest to oldest</strong>, level by level — newer values sit higher, older ones lower, so the first hit is the latest value:</p>
<ol>
<li>Check the active MemTable;</li>
<li>Then the read-only MemTables not yet flushed;</li>
<li>Then each SST file in L0 (L0 files can overlap in key range, so you have to check them one by one, newest to oldest);</li>
<li>From L1 down, each level has non-overlapping key ranges, so you only need to <strong>locate and check one file per level</strong>.</li>
</ol>
<p>And within a <strong>single SST file</strong>, it&rsquo;s again three steps: first ask the <strong>Bloom filter</strong> whether the key could be present; if the answer is no, skip the file entirely; if the answer is &ldquo;maybe,&rdquo; use the <strong>index</strong> to binary-search to the right data block; finally read that block and look for the key inside it. A false positive from the filter just means that last step comes up empty.</p>
<p>So the cost of a read comes down to how many levels and files you have to wade through — which leads straight into the next section.</p>
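<p>The search order can be sketched in a few lines of Python. This is a simplification under my own assumptions: MemTables and SST files are plain dicts, <code>TOMBSTONE</code> and <code>lookup</code> are hypothetical names, and the per-level file is found by a linear scan where a real engine would binary-search the file boundaries.</p>

```python
TOMBSTONE = object()  # sentinel meaning "this key was deleted"


def lookup(key, memtables, l0_files, deeper_levels):
    """Return the newest value for key, or None if it is absent or deleted.

    memtables:     dicts, newest first (the active one, then read-only ones)
    l0_files:      dicts, newest first; key ranges may overlap
    deeper_levels: list of levels (L1, L2, ...); each level is a list of
                   dicts whose key ranges do not overlap
    """
    def resolve(value):
        return None if value is TOMBSTONE else value

    for table in memtables:
        if key in table:
            return resolve(table[key])
    for sst in l0_files:                 # every L0 file is a candidate
        if key in sst:
            return resolve(sst[key])
    for level in deeper_levels:
        for sst in level:                # at most one file can cover the key
            if min(sst) <= key <= max(sst):
                if key in sst:
                    return resolve(sst[key])
                break                    # covered but missing: go one level down
    return None


memtables = [{"a": "a-new"}, {"b": TOMBSTONE}]
l0_files = [{"b": "b-old", "c": "c0"}]
deeper_levels = [[{"a": "a-old", "b": "b-older"}, {"d": "d1", "f": "f1"}]]

print(lookup("a", memtables, l0_files, deeper_levels))  # a-new
print(lookup("b", memtables, l0_files, deeper_levels))  # None
print(lookup("d", memtables, l0_files, deeper_levels))  # d1
```

<p>The lookup for <code>b</code> is the interesting one: two older values still exist further down, but the tombstone is hit first, so the search stops there and reports the key as gone.</p>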
<h2 id="compaction-the-background-cleanup-that-never-stops">Compaction: the background cleanup that never stops</h2>
<p>As noted, a delete just writes a tombstone, and an update just writes a new value on top of the old one. Over time, the disk fills up with <strong>stale old versions and tombstones</strong>: they waste space <em>and</em> force reads to wade through more files.</p>
<p><strong>Compaction</strong> is the background job that cleans this up: it takes some SST files from one level, merges them with the overlapping files in the next level, <strong>throws away the shadowed old values</strong> (and the tombstones themselves, once no older copy of the key can remain further down), and writes fresh, clean SSTs into the lower level. Since every file is already sorted, the merge uses a <strong>k-way merge</strong>, a scaled-up version of the &ldquo;merge&rdquo; step in merge sort. It all runs on background threads, so it normally doesn&rsquo;t block foreground reads and writes, though writes can be throttled if compaction falls too far behind.</p>
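<p>A k-way merge that keeps only the newest version of each key is short enough to sketch with Python&rsquo;s <code>heapq.merge</code>. The function name <code>compact</code>, the <code>None</code> tombstone marker, and the <code>drop_tombstones</code> flag are assumptions of this sketch; the flag stands in for the case where the output is the last place an older copy of the key could live, so the tombstone is no longer needed.</p>

```python
import heapq

TOMBSTONE = None  # hypothetical marker for a deleted key


def compact(runs, drop_tombstones=False):
    """Merge sorted runs into one sorted run, keeping the newest version per key.

    runs: lists of (key, value) pairs sorted by key, ordered newest run first.
    """
    # Tag each entry with its run's age so that, for equal keys,
    # the entry from the newest run sorts first.
    tagged = [[(key, age, value) for key, value in run]
              for age, run in enumerate(runs)]
    merged = []
    last_key = object()  # matches no real key
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue     # an older, shadowed version of a key we already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue     # the delete has done its job; drop it too
        merged.append((key, value))
    return merged


newer = [("a", "a2"), ("c", TOMBSTONE)]
older = [("a", "a1"), ("b", "b1"), ("c", "c1")]

print(compact([newer, older]))
# [('a', 'a2'), ('b', 'b1'), ('c', None)]
print(compact([newer, older], drop_tombstones=True))
# [('a', 'a2'), ('b', 'b1')]
```

<p>Because each input is already sorted, the merge only ever looks at the head of each run, which is why compaction can stream through large files without loading them fully into memory.</p>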
<p>RocksDB defaults to <strong>leveled compaction</strong>:</p>
<ul>
<li><strong>L0</strong> is special: its files <strong>may overlap</strong> in key range (since they&rsquo;re flushed straight from MemTables); compaction triggers once the L0 file count hits a threshold (4 by default).</li>
<li><strong>L1 and below</strong>: within each level, all files have <strong>non-overlapping</strong> key ranges and are globally ordered; when a level&rsquo;s total size exceeds its target, the excess is merged down into the next level — sometimes cascading down several levels in a chain.</li>
</ul>
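<p>The two triggers above boil down to a bit of arithmetic. In this Python sketch the 256MB base size for L1 and the function names are my assumptions for illustration; the ~10&times; growth per level and the L0 threshold of 4 come from the description above.</p>

```python
def level_targets(base_mb=256, multiplier=10, levels=4):
    """Target size in MB for L1..L{levels}; each level is ~multiplier x the previous."""
    return {f"L{n}": base_mb * multiplier ** (n - 1) for n in range(1, levels + 1)}


def levels_to_compact(l0_file_count, level_sizes_mb, l0_trigger=4, targets=None):
    """Return the levels that currently qualify for compaction."""
    targets = targets or level_targets()
    due = ["L0"] if l0_file_count >= l0_trigger else []   # L0: triggered by file count
    due += [name for name, size in level_sizes_mb.items()
            if size > targets[name]]                      # L1+: triggered by total size
    return due


print(level_targets())
# {'L1': 256, 'L2': 2560, 'L3': 25600, 'L4': 256000}
print(levels_to_compact(l0_file_count=4, level_sizes_mb={"L1": 300, "L2": 1000}))
# ['L0', 'L1']
```

<p>In the second call L1 is over its 256MB target, so its excess gets pushed into L2; if that pushes L2 over 2560MB in turn, you get the cascading chain described above.</p>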
<h2 id="its-all-trade-offs-the-three-amplifications">It&rsquo;s all trade-offs: the three amplifications</h2>
<p>The key to understanding RocksDB tuning (really, all LSM engines) is three <strong>amplification</strong> factors:</p>
<ul>
<li><strong>Space amplification</strong>: disk space actually used ÷ size of the logical data. The more stale versions and tombstones pile up, the higher it gets.</li>
<li><strong>Read amplification</strong>: how many I/O operations a single logical read actually performs. The more levels and files to wade through, the higher it gets.</li>
<li><strong>Write amplification</strong>: how many times a single logical write is actually written. The same piece of data gets rewritten to lower levels over and over during compaction, so this can get large.</li>
</ul>
<p>These three are a game of whack-a-mole: <strong>the more aggressively you compact, the smaller your space and read amplification, but the larger your write amplification</strong> — and vice versa. The right balance depends entirely on your workload, and the knobs are many and interdependent. Even the RocksDB authors admit it&rsquo;s hard to pin down the exact effect of each parameter, and recommend <strong>benchmarking a lot while keeping an eye on those three amplification factors</strong>.</p>
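<p>If you do benchmark, the three factors are just ratios over counters you collect. A small Python sketch, with a made-up function name and entirely hypothetical numbers:</p>

```python
def amplification(logical_bytes, disk_bytes, user_bytes_written,
                  device_bytes_written, logical_reads, device_reads):
    """Compute the three ratios from counters gathered during a benchmark."""
    return {
        "space": disk_bytes / logical_bytes,
        "write": device_bytes_written / user_bytes_written,
        "read": device_reads / logical_reads,
    }


GB = 1024 ** 3
# Hypothetical numbers, for illustration only.
stats = amplification(
    logical_bytes=10 * GB, disk_bytes=12 * GB,
    user_bytes_written=10 * GB, device_bytes_written=150 * GB,
    logical_reads=1_000, device_reads=6_000,
)
print(stats)
```

<p>With these inputs the store uses 1.2&times; the space of the logical data, writes each byte about 15 times, and does about 6 I/Os per read. Tuning means changing a knob, rerunning, and watching how all three move together.</p>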
<blockquote>
<p><strong>An aside: the merge operation</strong></p>
<p>Besides put and delete, RocksDB has <code>merge</code>. When you need to apply lots of <em>incremental</em> updates to a value (say, repeatedly incrementing a counter or appending to a list), the traditional approach is read-modify-write: read it out, change it, write it back, which is clunky. <code>merge</code> lets you write just the <em>increment</em> and hands off the combining to a merge function you define, computing the final value only at read or compaction time. The <strong>upside</strong> is lower write amplification, plus it&rsquo;s thread-safe; the <strong>cost</strong> is that reads get more expensive, because until the increments are consolidated, every read has to recompute them.</p>
</blockquote>
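<p>The idea behind <code>merge</code> fits in a small Python sketch. This is a toy model under my own assumptions, not RocksDB&rsquo;s merge operator interface: <code>TinyMergeStore</code>, <code>consolidate</code>, and the shape of <code>merge_fn</code> are hypothetical names chosen for illustration.</p>

```python
class TinyMergeStore:
    """Toy store where merge() records operands and get() folds them together."""

    def __init__(self, merge_fn):
        self.merge_fn = merge_fn  # user-defined: (existing_or_None, operand) -> value
        self.base = {}            # key -> last full value written by put()
        self.operands = {}        # key -> pending operands, oldest first

    def put(self, key, value):
        self.base[key] = value
        self.operands.pop(key, None)  # a full write supersedes pending operands

    def merge(self, key, operand):
        # Write only the increment; no read of the current value is needed.
        self.operands.setdefault(key, []).append(operand)

    def get(self, key):
        value = self.base.get(key)
        for operand in self.operands.get(key, []):
            value = self.merge_fn(value, operand)  # recomputed on every read
        return value

    def consolidate(self, key):
        # Roughly what compaction does: fold the operands into one stored value.
        self.base[key] = self.get(key)
        self.operands.pop(key, None)


def add_delta(existing, delta):
    return (existing or 0) + delta


store = TinyMergeStore(add_delta)
for _ in range(3):
    store.merge("hits", 1)
print(store.get("hits"))       # 3, folded from three pending operands
store.consolidate("hits")
print(store.operands)          # {}
print(store.get("hits"))       # 3, now read directly
```

<p>The writes never touch the existing value, which is where the savings come from; the price shows up in <code>get</code>, which has to replay every pending operand until something consolidates them.</p>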
<h2 id="the-bits-worth-remembering">The bits worth remembering</h2>
<p>If I keep just one mental map, it&rsquo;s this:</p>
<ul>
<li><strong>RocksDB</strong> = an embeddable, persistent KV store, descended from LevelDB, built on the <strong>LSM-tree</strong>;</li>
<li><strong>Writes</strong>: into the in-memory <strong>MemTable</strong> (sorted) + a sequential <strong>WAL</strong> (crash safety) → once full, flushed to an <strong>SST</strong> file on L0 → <strong>compaction</strong> slowly tidies things downward in the background;</li>
<li><strong>Reads</strong>: search newest to oldest, level by level, using a <strong>Bloom filter</strong> + <strong>index</strong> to skip and locate so you read as few stray files as possible;</li>
<li><strong>The essence</strong>: it trades &ldquo;write amplification&rdquo; for the high throughput of &ldquo;turning random writes into sequential ones&rdquo; — and <strong>between space, read, and write amplification, it&rsquo;s always a trade-off; there&rsquo;s no free lunch</strong>.</li>
</ul>
<p>Hold onto those few lines and the overall shape of RocksDB stands up. The finer details — skip lists, delta encoding, the various compaction strategies, how to tune the knobs — you can dive into whenever you actually need them.</p>
<blockquote>
<p>A lot of my understanding here comes from Artem Krylysov&rsquo;s <a href="https://artem.krylysov.com/blog/2023/04/19/how-rocksdb-works/">How RocksDB Works</a>, which goes into far more depth — highly recommended if you want to go deeper.</p>
</blockquote>
]]></content:encoded></item></channel></rss>