RAG at Scale: Why Quantization Matters

I've been building a RAG system over a terabyte of scanned engineering manuals: tens of millions of pages, chunked into roughly six million searchable pieces. At that scale the interesting problems stop being about embeddings and start being about where the vectors physically live. The single decision that made the system affordable was quantization, and it's a genuinely beautiful piece of engineering, so I want to walk through why it works.

The memory wall

When a vector database answers a query, it doesn't compare your query against all six million vectors.

It walks an HNSW graph, a layered "highway map", comparing the query against a few thousand stored vectors along the way and following the closest ones deeper.

That's how a search touches thousands of vectors instead of millions and still finishes in ~20 milliseconds.

But every one of those thousands of comparisons has to read a vector first, and where that read happens decides everything:

RAM read: ~100 nanoseconds
SSD read (page not cached): ~100 microseconds — about 1,000x slower

So vector search is fast only if the vectors live in RAM.
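
To put rough numbers on that gap, here is a back-of-the-envelope calculation. The 2,000 reads per query is an assumed stand-in for "a few thousand", and reading one vector after another is a simplification; the two latencies are the ones quoted above.

```python
# Back-of-the-envelope: time spent just fetching vectors for one query.
# READS_PER_QUERY is an assumed stand-in for "a few thousand".
READS_PER_QUERY = 2_000
RAM_READ_NS = 100          # ~100 nanoseconds per read from RAM
SSD_READ_NS = 100_000      # ~100 microseconds per uncached SSD read


def fetch_time_ms(reads: int, ns_per_read: int) -> float:
    """Total fetch time in milliseconds, assuming reads happen one after another."""
    return reads * ns_per_read / 1_000_000


ram_ms = fetch_time_ms(READS_PER_QUERY, RAM_READ_NS)
ssd_ms = fetch_time_ms(READS_PER_QUERY, SSD_READ_NS)
print(f"RAM: {ram_ms} ms, SSD: {ssd_ms} ms")
```

Under these assumptions the RAM case costs a fraction of a millisecond, which fits easily inside a ~20 ms budget, while the uncached SSD case blows through it many times over.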

And that runs straight into three facts that collide:

Search reads vectors constantly — one query touches thousands of them.
RAM is ~1,000x faster than disk, so those vectors must be in RAM.
RAM is the expensive resource.

Disk runs about $0.08/GB/month; RAM on a cloud instance effectively costs $2–3/GB/month, call it 30x more.

A float32 embedding is 4 bytes per number. At 768 dimensions that's ~3 KB per vector, and across a large corpus the full-precision index runs to tens or hundreds of gigabytes of vectors you'd need to keep in expensive RAM.

For my corpus:

Item	Estimate
Chunks	~6 million
Dense vectors, full precision (float32)	~18 GB
Same vectors, quantized in RAM (int8)	~4.6 GB
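
The table's numbers follow from simple arithmetic. This sketch counts raw vector bytes only, in decimal gigabytes; the helper name is mine, and graph links, payloads, and allocator overhead are deliberately ignored.

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int) -> float:
    """Raw vector storage in decimal gigabytes (10**9 bytes), ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


NUM_CHUNKS = 6_000_000
DIMS = 768

float32_gb = index_size_gb(NUM_CHUNKS, DIMS, 4)  # 4 bytes per float32 value
int8_gb = index_size_gb(NUM_CHUNKS, DIMS, 1)     # 1 byte per int8 value
print(f"float32: {float32_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```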

Quantization attacks fact #3: shrink the vectors until the whole index fits in a cheap amount of RAM.

It's a compression scheme designed specifically around what similarity search can afford to lose.

How int8 scalar quantization works

A float32 stores a precise decimal in 4 bytes.

An int8 stores one of 256 whole values (−128…127) in 1 byte.

Scalar quantization is just the recipe for mapping decimals onto those 256 buckets and back — at 4x less space.

Step 1 — learn the range

Scan the stored vectors and find the range the values actually occupy.

Embedding values cluster tightly; for a typical normalized embedding almost everything falls between, say, −0.15 and +0.15.

Step 2 — slice the range into 256 steps

scale = (max − min) / 255

step(x) = round((x − min) / scale)      // float32 → int8

approx(s) = min + s × scale             // int8 → approximate float32

With range [−0.15, +0.15] each step is ~0.0012 wide.

The value 0.0231 maps to step 147; reading step 147 back gives ~0.0229.

The error on any single number is at most half a step — about 0.0006.
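
The two formulas translate almost directly into code. This is a minimal sketch, not any particular engine's implementation: `make_quantizer` is a name I made up, inputs are assumed to already lie inside the range, and the step index runs 0..255 (an engine that stores signed int8 could shift it by 128 without changing the precision).

```python
def make_quantizer(lo: float, hi: float):
    """Return (encode, decode, scale) for a 256-step scalar quantizer over [lo, hi]."""
    scale = (hi - lo) / 255

    def encode(x: float) -> int:
        # float -> step index in 0..255 (assumes lo <= x <= hi)
        return round((x - lo) / scale)

    def decode(step: int) -> float:
        # step index -> approximate float
        return lo + step * scale

    return encode, decode, scale


encode, decode, scale = make_quantizer(-0.15, 0.15)
step = encode(0.0231)
print(step, round(decode(step), 4), round(scale, 4))
```

Running it reproduces the worked example: 0.0231 lands on step 147, which decodes to roughly 0.0229, inside the half-step error bound.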

Step 3 — clamp the outliers (the detail that makes it work)

Real embeddings have a few freak values — one dimension in one vector might be 0.9 while 99% of values sit under 0.15.

If you stretch the range to cover that outlier, your 256 steps spread thin and become coarse exactly where all the real data lives.

So you fit the range to the central 99% of values (a quantile: 0.99 setting in Qdrant) and clamp the rare outliers to the edge.

You spend precision where the data is and starve the values that barely exist.

This one parameter is most of why int8 loses so little quality.
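
Here is a small illustration of why the clamp matters. It is a sketch of the idea, not Qdrant's implementation: the quantile estimate is a simple index lookup on sorted values, the data is synthetic (1,000 evenly spaced values plus one outlier at 0.9), and both function names are hypothetical.

```python
def central_range(values, quantile=0.99):
    """Bounds covering the central `quantile` fraction of values (index-based estimate)."""
    ordered = sorted(values)
    tail = (1 - quantile) / 2                       # fraction trimmed from each end
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def encode_clamped(x, lo, hi):
    """Clamp x into [lo, hi], then map it to a step index in 0..255."""
    clamped = min(max(x, lo), hi)
    return round((clamped - lo) / ((hi - lo) / 255))


values = [-0.15 + 0.3 * i / 999 for i in range(1000)] + [0.9]  # one freak outlier
lo, hi = central_range(values, quantile=0.99)
naive_step = (max(values) - min(values)) / 255   # range stretched to cover the outlier
clamped_step = (hi - lo) / 255                   # range fitted to the central 99%
print(f"naive step: {naive_step:.4f}, clamped step: {clamped_step:.4f}")
print(encode_clamped(0.9, lo, hi))               # the outlier lands on the top step
```

On this toy data the stretched range makes every step more than three times wider, which is exactly the coarseness the clamp avoids.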

Step 4 — a free compute win

Similarity is a dot product: multiply element-wise, sum.

CPUs have SIMD instructions that process many numbers per clock tick, and they pack 4x more int8s than float32s into one instruction.

So each comparison is also faster, not just the memory read.

A compute speedup stacked on top of the memory win.

Why throwing away precision doesn't throw away meaning

The instinct is that rounding 4-byte floats down to 1-byte integers must wreck retrieval.

It barely moves it, for four reasons that stack:

1. Errors average out

A distance is a sum over all 768 dimensions.

Each dimension's rounding error is tiny and random in direction (some round up, some down), so across 768 terms they largely cancel.

The error of the total grows like √768, not 768; proportionally it shrinks as dimensions grow.
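
A quick simulation makes the cancellation visible. The assumptions are loud ones: values are drawn uniformly from [−0.15, +0.15], which real embeddings are not, and the similarity is a plain dot product. It compares the error the dot product actually sees against the worst case where every per-dimension error points the same way.

```python
import random

random.seed(0)
DIMS = 768
LO, HI = -0.15, 0.15
SCALE = (HI - LO) / 255


def roundtrip(x):
    """Quantize to one of 256 steps and map back to an approximate float."""
    return LO + round((x - LO) / SCALE) * SCALE


# Synthetic vectors: uniform values are an assumption, not real embedding data.
a = [random.uniform(LO, HI) for _ in range(DIMS)]
b = [random.uniform(LO, HI) for _ in range(DIMS)]

per_dim_errors = [x * y - roundtrip(x) * roundtrip(y) for x, y in zip(a, b)]
total_error = abs(sum(per_dim_errors))            # what the dot product actually sees
worst_case = sum(abs(e) for e in per_dim_errors)  # if every error pointed the same way
print(f"actual: {total_error:.2e}  worst case: {worst_case:.2e}")
```

The actual error should come out as a small fraction of the worst case, because positive and negative per-dimension errors offset each other in the sum.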

2. The precision was fake anyway

The embedding model's own output is noisy — paraphrase a sentence and the vector moves more than quantization moves it.

Digits 4 through 7 of a float32 embedding value sit below the model's noise floor; they carry no semantic signal.

Quantization mostly discards noise.

3. The task is ordinal, not metric

Retrieval only ever asks:

Is A closer than B?

It never asks:

What is the exact distance?

Rounding flips an ordering only when two candidates were nearly tied — a small fraction of comparisons, and usually a tie between two equally good results anyway.

4. There's a safety net

The close calls that do matter get re-checked against the full-precision originals.

Which is the actual architecture.

The two-tier trick: search cheap, verify precise

This is the pattern that makes quantization essentially free in quality terms:

Keep both copies

Quantized int8 vectors pinned in RAM (always_ram: true)
Original float32 vectors on disk (on_disk: true)

Disk is cheap, so storing both costs almost nothing.
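
For concreteness, this is roughly what the collection settings look like as a request body, written as a Python dict. The flags `always_ram`, `on_disk`, and `quantile` are the ones named in this post; the surrounding layout follows Qdrant's REST API as I understand it, and the "Cosine" distance is an assumption. Check the field names against the current Qdrant docs before relying on them.

```python
import json

# Body for a "create collection" request. Layout is my reading of Qdrant's
# REST API; "Cosine" is an assumed distance metric, not stated in this post.
create_collection_body = {
    "vectors": {
        "size": 768,              # embedding dimensionality used in this post
        "distance": "Cosine",     # assumption
        "on_disk": True,          # original float32 vectors stay on disk
    },
    "quantization_config": {
        "scalar": {
            "type": "int8",
            "quantile": 0.99,     # fit the range to the central 99% of values
            "always_ram": True,   # quantized copy pinned in RAM
        }
    },
}

print(json.dumps(create_collection_body, indent=2))
```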

Run the expensive phase on the cheap copy

The HNSW traversal — thousands of comparisons — runs entirely against the in-RAM int8s.

This phase only has to be approximately right: get the truly best results somewhere into the candidate pool.

Oversample to widen the net

With oversampling: 2.0, if you want the top 100 you collect the top 200 from the quantized search.

Insurance in case quantization error ranked a true top-100 result down at #130.

Rescore to settle it

Read the ~200 finalists' original float32 vectors from disk (only ~200 reads — trivial), recompute exact distances, and the genuine top 100 emerge in the right order.
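
The whole search-then-rescore flow fits in a few lines if you swap the HNSW traversal for brute force. That swap is the big simplification here; the function names, the fixed [−0.15, +0.15] range, the dot-product similarity, and the random toy data are likewise assumptions for illustration.

```python
import random

LO, HI = -0.15, 0.15
SCALE = (HI - LO) / 255


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def quantize(vec):
    """Clamp each value to [LO, HI] and map it to a step index in 0..255."""
    return [round((min(max(x, LO), HI) - LO) / SCALE) for x in vec]


def approx_dot(steps_a, steps_b):
    """Dot product computed from step indices only, decoding on the fly."""
    return sum((LO + a * SCALE) * (LO + b * SCALE) for a, b in zip(steps_a, steps_b))


def two_tier_search(query, originals, quantized, k, oversampling=2.0):
    """Rank on the quantized copy, then rescore a wider pool on the originals."""
    pool_size = min(len(originals), int(k * oversampling))
    q_query = quantize(query)
    # Phase 1: approximate ranking using only the quantized vectors.
    # (Brute force here; a real engine would walk the HNSW graph instead.)
    approx_order = sorted(
        range(len(quantized)),
        key=lambda i: approx_dot(q_query, quantized[i]),
        reverse=True,
    )
    pool = approx_order[:pool_size]
    # Phase 2: exact scores for the finalists only, from the full-precision copy.
    return sorted(pool, key=lambda i: dot(query, originals[i]), reverse=True)[:k]


random.seed(1)
originals = [[random.uniform(LO, HI) for _ in range(32)] for _ in range(500)]
quantized = [quantize(v) for v in originals]
query = [random.uniform(LO, HI) for _ in range(32)]
print(two_tier_search(query, originals, quantized, k=5, oversampling=2.0))
```

Only the pool of `k * oversampling` finalists ever touches the full-precision vectors, which is why the rescore step stays cheap even when the originals live on disk.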

Net effect:

Recall measured against an exact full-precision search is typically 99%+
Latency drops (smaller data, SIMD)
RAM drops 4x

The thousands of traversal comparisons now run on the cheap int8 copy, and only about two hundred reads touch the full-precision originals.

Choosing a scheme

Scalar int8 is the workhorse, but it's one of a family, and the trade is always compression vs. quality.

Scheme	Compression	What it keeps	When to use
Scalar (int8)	4x	256 buckets per dim	Default. Works on any model, tiny loss.
Binary	32x	Only the sign of each dim	High-dim (1536–3072) models trained for it; RAM is the hard bottleneck.
Product (PQ)	8–64x	Sub-block → nearest codebook prototype	Hundreds of millions of vectors, where int8 still won't fit.

Binary keeps only whether each dimension is positive or negative — 0.001 and 0.9 both collapse to 1.

Distance becomes a Hamming distance (count differing bits), computed with an XOR followed by a popcount, so it's absurdly fast (often 20–40x).

But it's brutal, and only survives because of Reason 1 (averaging) — which needs lots of dimensions.

Reliable at 1536–3072 dims with binary-friendly models, risky at 768.

Always pair it with heavy oversampling and rescoring.
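
A toy version of binary quantization and Hamming distance looks like this. It packs signs into a single Python integer for simplicity; real engines pack bits into machine words and use hardware popcount, and the function names are mine.

```python
def to_bits(vec):
    """Pack the sign of each dimension into one int (bit i is 1 if vec[i] > 0)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a_bits, b_bits):
    """Count differing bits: XOR marks the differences, then count the ones."""
    return bin(a_bits ^ b_bits).count("1")


a = to_bits([0.001, 0.9, -0.2, -0.05])   # 0.001 and 0.9 both become a 1 bit
b = to_bits([0.3, -0.4, -0.1, 0.02])
print(hamming(a, b))  # → 2 (dimensions 1 and 3 differ in sign)
```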

Product quantization chops each vector into sub-blocks and replaces each block with the ID of its nearest prototype from a learned codebook.

Best compression-to-quality ratio at extreme scale — it's the heart of FAISS and billion-vector systems — but it's slower to build, adds encoding complexity, and is overkill below ~100M vectors.
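
To make the encoding step concrete, here is a tiny sketch with hand-written codebooks. Everything about the sizes is a toy assumption: two sub-blocks of two dimensions with two prototypes each. Real systems learn the codebooks from data (typically with k-means) and use far more prototypes per block; this is not how FAISS is implemented internally.

```python
def nearest(block, prototypes):
    """Index of the prototype closest to `block` by squared Euclidean distance."""
    def dist2(p):
        return sum((x - y) ** 2 for x, y in zip(block, p))
    return min(range(len(prototypes)), key=lambda i: dist2(prototypes[i]))


def pq_encode(vec, codebooks):
    """Split `vec` into len(codebooks) equal sub-blocks; keep one prototype ID per block."""
    block_len = len(vec) // len(codebooks)
    return [
        nearest(vec[j * block_len:(j + 1) * block_len], codebook)
        for j, codebook in enumerate(codebooks)
    ]


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen prototypes."""
    return [x for code, codebook in zip(codes, codebooks) for x in codebook[code]]


# Toy hand-written codebooks (hypothetical values, not learned from data).
codebooks = [
    [[0.0, 0.0], [0.1, 0.1]],     # prototypes for dimensions 0-1
    [[-0.1, 0.0], [0.1, 0.0]],    # prototypes for dimensions 2-3
]
codes = pq_encode([0.09, 0.12, 0.08, -0.01], codebooks)
print(codes, pq_decode(codes, codebooks))
```

The stored form is just the short list of IDs, which is where the compression comes from: the vector is rebuilt by looking the IDs up in the shared codebooks.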

The decision logic is short:

int8 unless you have a specific reason
Binary if you're at high dimensions with a compatible model and RAM is truly the wall
PQ when you're at hundreds of millions of vectors and int8 still doesn't fit

The takeaway

Quantization is a rare case where the "obvious" cost — throwing away three quarters of every number — turns out not to be a real cost, because the thing you threw away was noise the model couldn't see, on a task that only cares about ordering, with a cheap full-precision rescore to catch the few cases that matter.

For my 768-dimension embeddings over six million chunks, int8 was the entire ballgame: a 32 GB-RAM instance does comfortably what would otherwise need something several times larger and several times pricier.

At scale, the model is the easy part — the engineering is figuring out where the vectors live.
