You open claude.ai, chatgpt.com, gemini.google.com, whichever LLM provider you use. You type something:
"What is the capital of France?"
You hit enter. There's a small loading symbol for half a second. Then words start appearing on your screen, one at a time, in sequence. After a moment, the answer finishes. The model has answered your question.
Whether the answer is right, wrong, or hallucinated is none of our concern in this piece.
What we care about is what happens in those seconds between you hitting enter and the answer finishing. The process that produces those words is called inference, and its speed is measured in tokens per second.
This is the journey those tokens take. From the moment you typed your query to the moment the response finished streaming, what actually happened underneath?
Let's go step by step.
By the end you'll see why a sentence as short as "Paris." costs a real amount of money to produce, why long context is structurally expensive, why your tokens get cheaper as more of them flow, and where every inference company in the world is competing for an edge.
Setting the stage
Before we follow our query into the machine, I need to set up what's already there waiting for it.
Our model in this example is Llama 3.1 8B, open-source, from Meta. By "the model" I mean: a folder of files. At runtime, those files get loaded into a GPU's memory. The actual release has a handful of supporting files (license, readme, generation config, etc.), but the load-bearing ones are these:
- 4 .safetensors files (the actual weights)
- tokenizer.json
- config.json
That's it. That's the model.
What the .safetensors files hold
A model is a bunch of weights. A giant collection of numbers learned during training. Llama 3.1 8B has 8.03 billion of them. (For our purposes, training is a black box. Imagine some massive function got trained on the internet and these numbers fell out.)
Each weight is stored as BF16, which is 16 bits or 2 bytes. The 16 bits are split into:
- 1 bit for sign (is the weight positive or negative)
- 8 bits for exponent (the order of magnitude. If the weight is 0.0034, this encodes the "10⁻³" part)
- 7 bits for mantissa (the actual precise digits, the "3.4" part)
There's also FP16, which uses the same 16 bits but allocates them differently (fewer to the exponent, more to the mantissa). Tradeoff: BF16 covers a huge range (~10⁻³⁸ to ~10³⁸) but has lower precision. FP16 has higher precision but a smaller range (about ±65,000). For training, BF16 wins because gradients can get tiny and you need that range. For inference, both work fine and the model is usually shipped in whatever it was trained in.
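Those bit fields are easy to inspect, because BF16 is literally the top 16 bits of a float32. A minimal sketch (pure Python, reinterpreting the bits via struct):

```python
import struct

def bf16_fields(x: float) -> tuple[int, int, int]:
    # BF16 = the top 16 bits of the IEEE-754 float32 pattern:
    # 1 sign bit, 8 exponent bits, 7 mantissa bits.
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16                 # truncate float32 down to BF16
    sign = bits16 >> 15                   # positive or negative
    exponent = (bits16 >> 7) & 0xFF       # order of magnitude, biased by 127
    mantissa = bits16 & 0x7F              # the precise digits
    return sign, exponent, mantissa

sign, exp, man = bf16_fields(0.0034)
# exponent comes back as 118 = 127 - 9, i.e. the "10^-3-ish" 2^-9 part;
# the 7 mantissa bits carry the "1.74..." precise-digits part
```

Reconstructing the value as `(-1)**sign * 2**(exp - 127) * (1 + man / 128)` gives back ~0.0034, minus the precision lost to having only 7 mantissa bits.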
So: 8.03B weights × 2 bytes each = ~16 GB. That's the model file.
It's split across 4 .safetensors files for practical reasons. When the inference server boots up and loads the model into the GPU, having 4 files instead of 1 means parallel loading and partial recovery. If file 3 fails to download, you redo file 3, not the whole 16 GB. Pure logistics.
What the .json files hold
tokenizer.json holds the vocabulary. Every token the model knows, mapped to an integer ID. Things like:
"France" → 9822
" the" → 279
"?" → 30
Llama 3.1 has 128,256 entries in its vocabulary.
When the server boots, this file is loaded into a hash map (string → int) in CPU RAM. Plus the BPE merge rules in a fast lookup structure. Both stay in CPU RAM for the lifetime of the server. Every incoming user query needs to be tokenized, every outgoing response needs to be detokenized, so this stuff is constantly in use.
config.json holds the architectural numbers. How many layers, how big d_model is, how many attention heads, that sort of thing. For Llama 3.1 8B (from the actual config):
- 32 layers
- d_model = 4096
- 32 query heads, 8 KV heads (we'll get to this)
- intermediate dim = 14336
When the server boots, the inference framework (vLLM, TensorRT-LLM, Fireworks's runtime, etc.) reads these values and uses them to compile a computational graph on the GPU. The actual sequence of GPU operations the model will run during inference. After this is built, the file isn't accessed again, though the parsed values stay in memory.
The hardware
The model lives on an H100 GPU. This GPU has 80 GB of HBM (High Bandwidth Memory, a special kind of VRAM built for fast access). The bandwidth between this memory and the GPU's compute units is 3,350 GB/sec.
Quick math: 16 GB of model ÷ 3,350 GB/sec = ~0.0048 seconds to read the entire model once. To produce one output token, the GPU has to read every weight at least once. So the absolute theoretical maximum is around 209 tokens/sec for a single user. Real-world is more like 150 tok/s, since you don't get 100% of the bandwidth. We'll come back to this number. It's the most important number in inference.
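The back-of-the-envelope version of that math, as a sketch:

```python
# Decode ceiling for Llama 3.1 8B on one H100, back of the envelope.
n_weights = 8.03e9
bytes_per_weight = 2                               # BF16
model_bytes = n_weights * bytes_per_weight         # ~16 GB
hbm_bandwidth = 3350e9                             # H100 HBM, bytes/sec

seconds_per_token = model_bytes / hbm_bandwidth    # every weight read once
ceiling = 1 / seconds_per_token                    # theoretical single-user max

print(f"{seconds_per_token * 1e3:.2f} ms/token, {ceiling:.0f} tok/s ceiling")
# → 4.79 ms/token, 209 tok/s ceiling
```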
The model weights live in HBM. The tokenizer lives in CPU RAM. The computational graph is compiled onto the GPU. Everything is ready and waiting.
Now the user hits enter.
Step 1: Tokenization (CPU)
The string "What is the capital of France?" arrives at the server's CPU.
A function (written in Rust or C++, called via Python) runs BPE (Byte-Pair Encoding) on the string. BPE is a deterministic algorithm based on statistical analysis of the training corpus. It chops the string into pieces and looks each piece up in the hash map to get an integer ID.
For our query, simplified, the result might be 7 tokens:
["What", " is", " the", " capital", " of", " France", "?"]
→ [3923, 374, 279, 6864, 315, 9822, 30]
The output is a small array of integers. ~28 bytes total.
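In spirit, the lookup half of this is just a dictionary. A toy sketch with a hypothetical seven-entry vocabulary, using greedy longest-match as a stand-in for the real BPE merge rules (the IDs are the ones from the walkthrough above):

```python
# Tiny stand-in for tokenizer.json's 128,256-entry string -> int map.
VOCAB = {"What": 3923, " is": 374, " the": 279, " capital": 6864,
         " of": 315, " France": 9822, "?": 30}

def tokenize(text: str) -> list[int]:
    # Greedy longest-match lookup. Real BPE applies learned merge rules
    # bottom-up instead, but the output is the same idea: string -> IDs.
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):    # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for text starting at position {i}")
    return ids

print(tokenize("What is the capital of France?"))
# → [3923, 374, 279, 6864, 315, 9822, 30]
```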
PS: the tokenizer is not a model. It's not a neural network. It has no learned weights at inference time. It's pure algorithm + lookup table. Confused me at first because of the name.
This array of integers gets copied from CPU RAM to GPU HBM over the PCIe bus, in microseconds.
Step 2: Embedding lookup (GPU)
The model has an embedding table, a matrix of shape [128256, 4096] sitting in HBM. One row per vocabulary token. Each token gets projected into a 4096-dimensional space, where 4096 different aspects of that token's meaning get captured by where it sits along each axis. Some axes might roughly correspond to "is this a noun," "is this country-related," "is this a question word," and so on. The model learned which axes capture what during training. We don't have to know what each one means, just that 4096 is the model's working representation of meaning.
For each integer in our array, the GPU looks up the corresponding row. All 7 lookups happen in parallel.
The result: a [7, 4096] matrix sitting in HBM. Seven rows, one per input token, each row a 4096-dim vector.
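The lookup itself is plain row indexing. A scaled-down NumPy sketch (the real table is [128256, 4096]; both dims are shrunk here so it runs anywhere):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 8          # stand-ins for 128256 and 4096
embedding = rng.standard_normal((vocab_size, d_model), dtype=np.float32)

token_ids = np.array([3923, 374, 279, 6864, 315, 9822, 30])
x = embedding[token_ids]                 # fancy indexing: 7 parallel row lookups
print(x.shape)                           # → (7, 8); full-size it's (7, 4096)
```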
This matrix is what flows through all 32 transformer layers. Same shape going in, same shape coming out. The numbers inside change at every layer. The shape stays the same.
The matrix enters layer 0 and exits layer 31 with the exact same shape; only the numbers inside change. Shape preservation through transformation is the load-bearing property of the transformer.
Now we hit the layers.
Step 3: Inside one layer
This part took me a while to understand, but it's the meat of the computation, where the actual work happens. The 32 layers are stacked copies of the same structure with different learned weights, so cracking one layer cracks the whole stack.
Each layer has two sub-blocks:
- Attention (mixing information across tokens)
- MLP (processing each token independently)
Each sub-block has:
- An RMSNorm at the start. Why? Because as the matrix flows through layers and gets multiplied by big weight matrices, the values inside it can swing wildly, some dimensions blowing up to thousands, others shrinking near zero. When that happens, the math downstream stops behaving and information gets lost in the noise. RMSNorm fixes this. For each row, it computes the root-mean-square (square every value, take the average, take the square root), then divides every value in the row by that number. Now every row has a controlled magnitude. Then each dimension gets multiplied by a learned scale parameter (one per dimension, learned during training, telling the model how much to amplify or dampen that dimension). Cheap operation. Critical for stability.
- The actual operation (attention or MLP).
- A residual connection at the end. Adds the input back to the output, so the operation's job is to compute adjustments on top of what came in, not replace it.
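The RMSNorm step described above is tiny in code. A sketch, with the learned per-dimension scale passed in:

```python
import numpy as np

def rms_norm(x: np.ndarray, scale: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Per row: divide by the root-mean-square of that row's values,
    # then multiply each dimension by its learned scale parameter.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale

x = np.array([[3000.0, 0.004, -500.0, 12.0]])   # wildly swinging magnitudes
y = rms_norm(x, scale=np.ones(4))
# every row now has mean-square ~1, regardless of input magnitude
```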
Attention sub-block first.
Attention: what it's trying to do
Each row in our [7, 4096] matrix currently represents one token in isolation. The row for "France" right now is the generic "France" vector, the same as it would be in any sentence.
But we don't want generic. We want "France" in the context of this sentence. We want the row for "France" to know that there's a "capital" earlier in the sequence, and that this is a question. We want every token's row to absorb relevant context from the other tokens.
That's what attention does. It mixes information between tokens.
Attention: the math
We start by computing three different "views" of each token, by multiplying our [7, 4096] matrix with three learned weight matrices:
- W_Q of shape [4096, 4096] → produces Q (Query) of shape [7, 4096]
- W_K of shape [4096, 1024] → produces K (Key) of shape [7, 1024]
- W_V of shape [4096, 1024] → produces V (Value) of shape [7, 1024]
Three roles for every token:
- Q: "what am I looking for in other tokens?"
- K: "here's what I'm about, my advertisement, my label"
- V: "if you decide I'm relevant, here's the actual content I'll give you"
You'll notice Q is 4× larger than K and V. This is Grouped-Query Attention (GQA) and it's one of the most important design choices in modern LLMs.
Q has 32 heads of 128 dimensions each (32 × 128 = 4096). K and V have 8 heads of 128 dimensions each (8 × 128 = 1024). The mapping is fixed by architecture: Q heads 0-3 share K head 0 and V head 0, Q heads 4-7 share K head 1 and V head 1, and so on.
Why? Because K and V get cached (we'll get to the KV cache shortly). If we cached all 32 heads of K and V, the cache would be 4× bigger.
Why not also shrink Q? Because Q is the question each token is asking the others, and we want those questions to be precise. Cutting Q's dimensions would dilute the questions, which directly degrades quality. K and V are the answers, and as it turns out, when 4 different Q heads are asking similar enough questions, they can share the same answer-source without losing much. The 32 questions still get asked independently; 4 of them just consult the same K/V pair. Empirically this drops benchmark performance by less than 1% while shrinking the cache 4×. Brutal tradeoff in favor of inference economics. Every modern model adopted it.
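The head-to-group mapping is fixed arithmetic, nothing learned. A sketch:

```python
def kv_head_for(q_head: int, n_q_heads: int = 32, n_kv_heads: int = 8) -> int:
    # Llama 3.1 8B: 32 Q heads share 8 KV heads, in contiguous groups of 4.
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

# Q heads 0-3 -> KV head 0, Q heads 4-7 -> KV head 1, and so on.
print([kv_head_for(h) for h in range(32)])
```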
Now the actual relevance computation
For each head, we compute Q · Kᵀ, a dot product between every Q vector and every K vector. The result is a [7, 7] matrix of scores. Each cell tells us how relevant token j is to token i's question.
Three operations, stacked: scores get computed for every token pair, the upper triangle is masked off (no peeking at the future), and softmax converts each row into an attention distribution. The "France" row tells the model how much of each prior token's V to mix in.
Two important details:
1. Scaling. Before softmax, we divide every score by √128. Why? When you sum 128 paired multiplications, the result has a natural "spread" of about √128 ≈ 11.3. Without scaling, scores hover around ±10, and softmax of values around ±10 collapses to nearly winner-takes-all (the highest score gets ~100% attention, everything else gets ~0%). We lose the ability to attend softly. Dividing by √128 normalizes the spread back to ±1, where softmax behaves well.
2. Causal masking. Token generation is autoregressive. Each token depends on all previous tokens, but never future ones. So token 3 ("capital") shouldn't be allowed to peek at token 5 ("France"). We enforce this by replacing the upper triangle of our [7, 7] matrix with -∞ before softmax. Then softmax converts -∞ to 0, and those positions get exactly zero attention weight.
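You can see the saturation that the scaling step guards against directly. A sketch comparing softmax at the two spreads:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

unscaled = np.array([11.0, 8.0, -10.0])   # raw dot-product scale, ~±10
scaled = unscaled / np.sqrt(128)          # ~±1 after dividing by √128

print(softmax(unscaled))   # winner-takes-all: top score grabs ~95%
print(softmax(scaled))     # soft attention: weight spread across tokens
```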
Then softmax across each row → row-wise probability distribution. Each row tells one token how much of each other token's V to pull in.
For each token, take its softmax row, use it to compute a weighted sum of all V vectors. That's the contextualized output for that token, for that head.
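Putting scores, scaling, mask, softmax, and the weighted sum together for one head, a NumPy sketch (head_dim 128, as in Llama 3.1 8B):

```python
import numpy as np

def causal_attention_head(Q, K, V):
    # Q, K, V: [T, head_dim] for one head. Returns [T, head_dim].
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # [T, T] relevance scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                           # no peeking at the future
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ V                               # weighted sum of Vs

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((7, 128)) for _ in range(3))
out = causal_attention_head(Q, K, V)
# token 0 can only attend to itself, so its output is exactly V[0]
```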
All 32 Q heads run this exact computation in parallel. Different Q values, sometimes shared K/V (because of GQA), but the math is identical and independent across heads. Which is exactly why this maps so well onto GPUs. The hardware loves embarrassingly parallel work like this. Each head produces a [7, 128] output.
Combining the heads
We concatenate all 32 head outputs side by side: 32 × [7, 128] → [7, 4096].
But there's a problem. Each head computed in isolation. Head 0 doesn't know what head 8 was doing. To let the heads' findings interact, we multiply by one more learned weight matrix, W_O of shape [4096, 4096]. This mixes the heads' outputs into a unified representation where every output dimension is a learned blend of all 32 heads' findings.
Result: [7, 4096].
Then the residual connection: we add the original input to this sub-block back to the result. Why? Because if we didn't, the attention block could destroy information from the input. With the residual, attention's job is to compute adjustments on top of the input, not replace it. After 32 layers, the final representation is the original embedding plus 32 layers' worth of adjustments stacked on top. Nothing ever gets erased; everything is layered on.
Output of the attention sub-block: [7, 4096]. Same shape as input.
MLP: what it's trying to do
Attention mixed information between tokens. Now we need to actually process what each token has absorbed. Refine it. Extract patterns. Make it usable.
MLP does this per token, independently. No cross-token mixing. Each row of the matrix gets transformed in parallel with the others, which is why this part runs fast on a GPU.
MLP: the math
First, RMSNorm again (because attention + residual produced unnormalized magnitudes).
Then SwiGLU, which is three weight matrices:
- W_up of shape [4096, 14336]
- W_gate of shape [4096, 14336]
- W_down of shape [14336, 4096]
Step A: project the matrix into a wider space using both W_up and W_gate, in parallel.
input [7, 4096] × W_up → up_proj [7, 14336]
input [7, 4096] × W_gate → gate_proj [7, 14336]
Why expand to 14336 (3.5× the model dim)? More room to compute features. The wider workspace lets the model detect patterns that aren't visible in 4096 dims.
W_up is the content: a richer 14336-dim representation of the token's information.
W_gate is the filter: a 14336-dim mask telling us which parts of the content to keep vs. suppress.
Step B: apply SiLU to gate_proj. SiLU is a smooth non-linear function. Positive values pass through mostly intact, negative values get pushed toward zero.
Step C: element-wise multiply gate (filtered) and up (content). Each position in the gate filters its corresponding position in the content. This is the GLU (Gated Linear Unit), and it's where the model's non-linearity lives.
A temporary excursion into a 3.5× wider workspace. Gate and up project in parallel; SiLU bends the gate; element-wise multiplication combines them. Then W_down compresses everything back to the standard 4096-dim shape the next layer expects.
Why we need this non-linearity at all. If everything in the model were just matrix multiplications stacked on top of each other, the whole 32-layer thing would mathematically collapse into one matrix multiplication. No matter how deep you stack it. And one matrix multiplication can only learn straight-line relationships ("more X, more Y"). Language doesn't work like that. The non-linear step (the SiLU bend) is what lets the model actually learn the messy, conditional, "it depends" patterns that language is full of.
Picture a cluster of points surrounded by a ring of other points. No straight line cleanly separates the inner cluster from the outer ring, but a single curved boundary wraps the inner cluster perfectly. That curve is exactly what non-linearity unlocks. This is why language, full of "it depends" patterns, can't be modeled with straight-line math alone.
Step D: now we need to compress the result back down to 4096 dims (because the next layer expects that shape). For this we use W_down, another learned weight matrix, shape [14336, 4096]. It was trained alongside the rest of the model, so it knows how to take the gated 14336-dim representation and squeeze it back into the standard 4096-dim format without losing the important stuff.
combined [7, 14336] × W_down → output [7, 4096]
Then the residual connection: add the input to this sub-block (the post-attention residual) back to the output.
Result: [7, 4096]. Ready for the next layer.
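Steps A through D plus the residual, in one sketch (the RMSNorm at the start is omitted for brevity, and dims are shrunk from 4096/14336 so it runs anywhere):

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))        # positives pass, negatives -> ~0

def mlp_block(x, W_up, W_gate, W_down):
    up = x @ W_up                        # content, widened to the workspace
    gate = silu(x @ W_gate)              # filter, widened and bent
    return x + (gate * up) @ W_down      # gate*content, compress, add residual

rng = np.random.default_rng(0)
d, hidden = 8, 28                        # stand-ins for 4096 and 14336 (3.5×)
x = rng.standard_normal((7, d))
W_up = rng.standard_normal((d, hidden))
W_gate = rng.standard_normal((d, hidden))
W_down = rng.standard_normal((hidden, d))
y = mlp_block(x, W_up, W_gate, W_down)   # shape preserved: (7, 8)
```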
Where the parameters actually live
Quick aside that matters for understanding optimization:
| Per layer | Matrix | Parameters |
|---|---|---|
| Attention | W_Q, W_K, W_V, W_O | ~42M |
| MLP | W_up, W_gate, W_down | ~176M |
MLP is roughly 4× bigger than attention. Across all 32 layers, MLPs hold roughly 70% of the model's total parameters (about 80% of the layer parameters, before counting embeddings). Which means MLPs also dominate inference compute, and fusing the MLP's three matrix multiplications into a single GPU kernel is apparently one of the biggest performance levers in any modern inference runtime.
Step 4: 32 layers, then the top of the stack
The matrix flows through Layer 0, then Layer 1, then Layer 2... all the way to Layer 31. Same [7, 4096] shape going through every layer. At each layer:
- RMSNorm → Attention → +residual
- RMSNorm → MLP → +residual
Each layer has its own learned weights (different W_Q, W_K, W_V, W_O, W_up, W_gate, W_down, plus its own RMSNorm scale parameters). Different layers tend to learn different jobs. Early layers capture syntactic patterns, middle layers capture semantic relationships, later layers shape the output for prediction. Nobody designed it this way; the specialization emerges from training.
After 32 layers, we have a [7, 4096] matrix that's been deeply contextualized. Every row knows about every other row.
But we only need to predict one token, the next one. So:
- Final RMSNorm on the matrix.
- Take only the last row, the row corresponding to the last input token ("?"). That row holds everything the model has figured out about what should come next. It's [1, 4096].
- Multiply by W_out of shape [4096, 128256]. This is a separate learned weight matrix that projects from the hidden dimension back to vocabulary space. (Some smaller models reuse the embedding table here, called weight tying, but Llama 3.1 8B has W_out as its own matrix.)
- The result is [1, 128256]: one raw score (a logit) per vocabulary token. Loosely, this is "how much does the final hidden state align with each vocab token's vector."
- Softmax across the 128,256 logits → probability distribution.
- Sample: pick a token from the distribution. With greedy sampling (temperature 0), you pick the highest-probability token. With higher temperatures, you flatten the distribution and let lower-probability tokens get picked, which is what makes outputs more "creative." Top-k and top-p restrict sampling to the most probable subset.
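The sampling step at the top of the stack, sketched with temperature and top-k (toy 4-token vocabulary; top-p works the same way but cuts by cumulative probability):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=None, seed=0):
    rng = np.random.default_rng(seed)
    if temperature == 0:
        return int(np.argmax(logits))        # greedy: always the top pick
    z = logits / temperature                 # <1 sharpens, >1 flattens
    if top_k is not None:
        cutoff = np.sort(z)[-top_k]          # keep only the k best scores
        z = np.where(z >= cutoff, z, -np.inf)
    p = np.exp(z - z.max())
    p /= p.sum()                             # softmax -> probabilities
    return int(rng.choice(len(logits), p=p))

logits = np.array([2.0, 8.0, 1.0, 7.5])
print(sample_token(logits, temperature=0))   # → 1 (greedy argmax)
```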
For our query, the highest probability is on " Paris". We sample it.
That ID gets copied back to CPU. The CPU detokenizes it via the same hash map (reverse direction), gets the string " Paris", and streams it to the user.
That's one output token.
Step 5: The decode loop
We're not done. We have to keep generating until the model produces an end-of-sequence token.
For the next token, we take the new sequence (original 7 tokens + " Paris") and we want to predict what comes next.
Naively, we'd run the full forward pass on all 8 tokens again. But that's wasteful. For tokens 1-7, the K and V vectors at every layer are exactly the same as last time. We already computed them. The only new work is computing K and V for " Paris" and running attention with it as the query.
This is the KV cache. We save K and V for every (token, layer) pair the first time they're computed, and reuse them on every subsequent step.
So in the decode step, we feed in only the latest token as a [1, 4096] matrix. It flows through all 32 layers. At each layer, we compute Q, K, V for just this one token, append the new K and V to the cache, and run attention where the new Q reads against all cached K/V.
One forward pass. One new token. Repeat until end-of-sequence.
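One decode step for a single head, sketched with a growing cache. (A hypothetical minimal structure: real servers keep the cache in GPU memory, usually in pages, and in prefill the first 7 tokens' K/V come from one parallel pass rather than a loop.)

```python
import numpy as np

class KVCache:
    def __init__(self, head_dim):
        self.K = np.empty((0, head_dim))
        self.V = np.empty((0, head_dim))
    def append(self, k, v):                   # computed once, reused forever
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_step(q_new, k_new, v_new, cache):
    cache.append(k_new, v_new)                # only the NEW token's K/V computed
    d = q_new.shape[-1]
    scores = cache.K @ q_new / np.sqrt(d)     # 1 query against ALL cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                              # softmax over past positions
    return w @ cache.V                        # contextualized new-token vector

cache = KVCache(head_dim=128)
rng = np.random.default_rng(0)
for _ in range(8):                            # 7 prompt tokens + " Paris"
    out = decode_step(*rng.standard_normal((3, 128)), cache)
```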
Why decode is structurally slower than the first pass
There's a name for that first pass that processes all input tokens in parallel: prefill.
There's a name for each subsequent single-token pass: decode.
These two have completely different performance profiles, and the difference drives almost everything about inference economics.
Prefill processes many tokens in one pass. Each weight read from HBM is amortized across many tokens of computation. The GPU's tensor cores are saturated. Compute-bound. Fast per token.
Decode processes one token per pass. Each weight read from HBM is used for one token of computation. The tensor cores are mostly idle, waiting for weights to arrive. Bandwidth-bound. Slow per token.
For Llama 3.1 8B on H100, baseline (no fancy tricks): prefill processes thousands of tokens per second. Decode produces about 150 per second for a single user. That gap, thousands vs. 150, is the entire reason inference economics look the way they do. (Speculative decoding, which we'll get to, can push the decode number higher by parallelizing parts of the work, but the underlying asymmetry is structural.)
The 209 tok/s ceiling we calculated way back? That was the decode ceiling. Decode is the bottleneck. And every optimization we're about to discuss attacks the decode bottleneck in some form.
Step 6: Attached concepts, where companies optimize
Now that we've walked the lifecycle end-to-end, the optimization landscape starts to make sense. Every technique I've come across (Fireworks, Together, Groq, the silicon-layer challengers) is doing one of these: attacking some specific stage of the lifecycle we just traced.
Each of these is a direct response to something we just covered.
Speculative decoding: attacking the autoregressive bottleneck
The autoregressive constraint says: to generate token N, you need tokens 1 through N-1. So generation is structurally sequential. Decode is slow because of this.
But verification is parallel. If somebody hands the big model a candidate sequence of tokens, the big model can verify all of them in a single parallel forward pass, same shape as prefill.
So: pair the big model with a small draft model in the same family, something in the 1B parameter range, with a compatible tokenizer. The draft model autoregressively generates K candidate tokens (say 4) at high speed because it's small. Then the big model takes prompt + c1 + c2 + c3 + c4 and verifies in one parallel pass.
If the draft was right about all 4: we got 5 tokens for the cost of one big-model forward pass (4 verified + 1 bonus prediction at the next position). 5× speedup.
If the draft was wrong about c1: we accept the big model's prediction at that position (1 token of progress) and discard c2, c3, c4. Same throughput as baseline. No worse.
The big model is the one that decides every token, the draft just proposes guesses. So the final output is the same as if you'd run the big model alone. Pure speedup, no quality loss.
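The accept/reject logic, sketched for the greedy case (production systems verify against full probability distributions via rejection sampling, but the greedy version shows the shape of it):

```python
def accept_draft(draft, target_picks):
    # draft: K candidate tokens from the small model.
    # target_picks: the big model's choice at each of the K+1 positions,
    # all computed in ONE parallel verification pass.
    out = []
    for cand, pick in zip(draft, target_picks):
        if cand == pick:
            out.append(cand)                  # match: a free token
        else:
            out.append(pick)                  # mismatch: take the big model's
            return out                        # token, discard the rest
    out.append(target_picks[len(draft)])      # all matched: bonus token
    return out

print(accept_draft([11, 22, 33, 44], [11, 22, 33, 44, 55]))  # → 5 tokens
print(accept_draft([11, 99, 33, 44], [11, 22, 33, 44, 55]))  # → [11, 22]
```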
Every top-tier inference provider ships some flavor of this. The bet for whoever pulls ahead: better draft models or better self-speculation (Eagle, Medusa) → higher acceptance rates → bigger speedup → cheaper output tokens at the same latency. A genuine moat avenue.
Quantization: attacking the bandwidth ceiling
Decode is bandwidth-bound. The ceiling is bandwidth ÷ model size. The bandwidth is fixed by hardware. The only thing you can change is the model size.
Quantization shrinks each weight from BF16 (2 bytes) to FP8 (1 byte) or INT4 (0.5 bytes). The mechanics: you're not just truncating bits. You compute a scale factor per weight matrix, divide the original weight by that scale, round to the smaller representation, and store both the quantized weight and the scale. At inference time, the GPU dequantizes on the fly.
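The mechanics in miniature, using INT8 (NumPy has no FP8 dtype, but the scale → round → store → dequantize shape is the same):

```python
import numpy as np

def quantize(W):
    scale = np.abs(W).max() / 127.0            # one scale per weight matrix
    q = np.round(W / scale).astype(np.int8)    # stored at 1 byte per weight
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # recomputed on the fly

W = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
q, s = quantize(W)
max_err = np.abs(dequantize(q, s) - W).max()   # small per-weight rounding error
print(q.nbytes, W.nbytes)                      # 1 byte vs 4 per weight here
```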
Smaller model = less to read from HBM per token = higher tok/s.
On H100 specifically, quantization compounds. H100's tensor cores have native FP8 hardware. Physically more compute units packed into the silicon at the lower precision. So FP8 isn't just half the memory; it's also 2× the compute throughput. Both bandwidth AND compute win. This is why Nvidia is racing toward smaller floating-point formats with each generation (FP4 on Blackwell).
Production sweet spot today: FP8. Cuts memory in half, doubles compute throughput on H100, costs less than 1% on benchmarks. INT4 goes further but loses more quality.
KV cache management: attacking long-context economics
The KV cache for Llama 3.1 8B is 128 KB per token, per user. (32 layers × 8 KV heads × 128 dim × 2 (K+V) × 2 bytes = 128 KB.)
At 1,000 tokens, the cache is 128 MB.
At 100,000 tokens, the cache is 12.8 GB.
At full 128K context, the cache is 16 GB, the same size as the entire model.
On an 80 GB H100 (with 16 GB taken by weights, leaving 64 GB), the math is brutal:
| Context length | Cache per user | Concurrent users |
|---|---|---|
| 1K | 128 MB | ~500 |
| 10K | 1.28 GB | ~50 |
| 100K | 12.8 GB | ~5 |
| 128K | 16 GB | ~4 |
Same hardware, ~125× difference in concurrency, purely based on context length.
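The table's arithmetic, spelled out (treating "80 GB minus 16 GB of weights" as a clean 64 GiB cache budget, which is why the counts come out slightly off the rounded table values):

```python
# KV cache per token for Llama 3.1 8B:
layers, kv_heads, head_dim = 32, 8, 128
kv_bytes_per_token = layers * kv_heads * head_dim * 2 * 2   # K+V, BF16
assert kv_bytes_per_token == 128 * 1024                     # 128 KB

hbm_for_cache = 64 * 2**30            # 80 GB H100 minus ~16 GB of weights
for context in (1_000, 10_000, 100_000, 128_000):
    users = hbm_for_cache // (kv_bytes_per_token * context)
    print(f"{context:>7} tokens of context -> ~{users} concurrent users")
```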
This is the load-bearing economic insight for long-context serving. Long context isn't structurally hard because the math is harder. It's hard because the cache hogs HBM that could be serving other users. This is why long-context tiers cost more on every provider's pricing page.
A quick search and asking around tells me these are the main directions the field is attacking this problem (I haven't gone deep on any of these yet, putting them here so I know where to look next):
- PagedAttention (vLLM): manages the KV cache like virtual memory, in fixed-size pages, eliminating fragmentation when users come and go.
- Multi-Latent Attention (DeepSeek's MLA): compresses KV by ~10× via a learned latent projection.
- Sliding window attention: cap the cache at a recent window instead of the full context.
The next big breakthrough here probably wins a generation of long-context products.
Continuous batching: attacking the underutilization
When a single user is in decode, the GPU's tensor cores sit ~95% idle, starved for weights from HBM. All that compute is just wasted.
The fix: serve many users simultaneously. While the weights are streaming from HBM anyway, use them to compute many users' tokens in parallel. Same bandwidth cost, much more useful work.
Naive batching (lock 32 users into a fixed batch and run together until done) breaks under real traffic. Slow users block fast ones, new arrivals wait, slots sit empty. Continuous batching rebuilds the batch every decode step, with users joining and leaving fluidly. Combined with PagedAttention for memory management, this is the standard production approach.
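A toy simulation of the scheduling difference. Hypothetical requests as (id, tokens to generate); the batch is rebuilt every decode step, so finished users free their slot immediately:

```python
from collections import deque

def continuous_batching(requests, max_batch):
    waiting = deque(requests)
    active, remaining = [], {}
    steps = 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()       # new arrival joins mid-flight
            active.append(rid)
            remaining[rid] = n
        steps += 1                           # one decode step: one token
        for rid in list(active):             # per active user, in parallel
            remaining[rid] -= 1
            if remaining[rid] == 0:
                active.remove(rid)           # finished: slot frees instantly
    return steps

# Three users wanting 3, 1, and 2 tokens, two slots:
print(continuous_batching([(0, 3), (1, 1), (2, 2)], max_batch=2))  # → 3
```

Naive fixed batching on the same traffic would run the first pair to completion (3 steps, one slot idle after step 1) and then the third user alone (2 more steps): 5 steps instead of 3.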
The economic impact is huge: cost per token drops ~100× going from batch size 1 to batch size 256. This is the entire reason inference APIs can charge fractions of a dollar per million tokens. Without batching, no one could afford to serve LLMs at consumer prices.
The moat avenue: smarter schedulers. How aggressively to batch, how to balance prefill vs decode, how to handle priority traffic, how to maintain SLAs under bursty load. All real engineering surfaces.
Multi-LoRA serving: building the lock-in
Customers fine-tune models on their narrow data using LoRA (Low-Rank Adaptation). Instead of updating the full 8B parameters, LoRA learns a small "adjustment", typically ~100 MB instead of 16 GB. The customer trains it cheaply on a small GPU.
At serving time, the inference platform keeps the base model in HBM and applies the customer's LoRA on the fly. One base model serves thousands of customers, each getting their personalized fine-tune.
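What "applies the customer's LoRA on the fly" means, sketched (hypothetical shapes; the rank and alpha scaling vary per adapter):

```python
import numpy as np

def lora_forward(x, W_base, A, B, alpha=16.0):
    # W_base: [d, d], shared by every customer, resident in HBM.
    # A: [d, r], B: [r, d] with small rank r -- the per-customer adapter.
    r = A.shape[1]
    return x @ W_base + (x @ A) @ B * (alpha / r)

d, r = 64, 8                          # stand-ins for 4096 and a typical rank
rng = np.random.default_rng(0)
x, W = rng.standard_normal((7, d)), rng.standard_normal((d, d))
A, B = rng.standard_normal((d, r)), rng.standard_normal((r, d))
y = lora_forward(x, W, A, B)
# adapter size: d*r*2 numbers instead of d*d -- why a LoRA is ~100 MB, not 16 GB
```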
This is where business lock-in compounds. Once a customer has accumulated 20+ fine-tuned LoRAs on a particular platform, switching is painful. They'd have to re-export, re-validate, re-tune sampling parameters, re-engineer their pipelines. The lock-in isn't contractual; it's integration depth. Every fine-tune deepens it.
Inference platforms push fine-tuning hard precisely because of this dynamic. The customer-data moat is the deepest layer of the stack.
Multi-GPU orchestration: for models that don't fit
Llama 3.1 8B fits on one H100. Llama 3.1 405B doesn't. It's ~810 GB at BF16. To serve big models, you split across GPUs:
- Tensor parallelism (TP): split each weight matrix across GPUs. High communication cost; only works inside a server with NVLink (~900 GB/s GPU-to-GPU bandwidth).
- Pipeline parallelism (PP): different GPUs hold different layers. Lower communication cost; works across servers.
- Data parallelism (DP): replicate the full model on each GPU; distribute requests. Trivially scalable.
Production deployments of huge models combine all three. This is its own engineering domain, a deep rabbit hole I'm not going down here. But it's where the largest inference workloads compete on orchestration efficiency.
Where I think the moat lives
Walk back through every step we just covered and you can see exactly where companies are placing their bets:
- Tokenization, PCIe transfer, embedding lookup. Boring, mostly solved. No real moat. Everyone does these the same way.
- The actual layer math (RMSNorm, attention, MLP). This is where custom GPU kernels live. FlashAttention, FireAttention, hand-tuned CUDA. Their job: take the same math we walked through and run it on the GPU more efficiently. Fuse multiple operations into one kernel so the GPU touches HBM fewer times, squeeze more useful work out of every byte read. One of the deepest technical moats in this whole stack.
- KV cache mechanics. PagedAttention manages the cache so users coming and going doesn't fragment the memory. MLA compresses the cache itself. The job here: fit more concurrent users on the same GPU, especially at long context. Active research frontier.
- The decode bandwidth ceiling. Quantization shrinks the model so each forward pass touches less HBM. Speculative decoding parallelizes work that was structurally sequential. Both attack the same ceiling from different angles. The wins compound.
- Concurrency. Continuous batching is the scheduler that lets one GPU serve hundreds of users at once instead of one at a time. Without this, no one could afford to serve LLMs at consumer prices. Pure operational engineering excellence; the hidden hero of LLM economics.
- Long context. The cache management techniques above all feed into this. Long context is the bottleneck for the next wave of products (think: agents that hold long state, deep-research systems, code assistants with whole-codebase context). Whoever cracks this cleanly opens up product categories that are currently uneconomical.
- Customer fine-tuning. Multi-LoRA serving lets one base model serve thousands of customers, each with their own specialization. This is where the business moat lives. Once a customer has 20 LoRAs hosted on your platform, switching is genuinely painful for them. Lock-in compounds with every fine-tune.
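The decode bandwidth ceiling is worth making concrete with a back-of-the-envelope sketch. This is a simplification, not a benchmark: it assumes every decode step streams the full weights from HBM exactly once and ignores KV cache traffic, and the ~3.35 TB/s figure is the headline HBM3 bandwidth of an H100 SXM.

```python
# Back-of-the-envelope decode ceiling:
#   tokens/sec <= HBM bandwidth / bytes read per token.
# Assumes each decode step streams the full weights once (ignores KV cache reads).

HBM_BANDWIDTH_GB_S = 3350  # H100 SXM HBM3, approximate headline figure

def decode_ceiling_tok_s(params_billions: float, bytes_per_weight: float) -> float:
    model_gb = params_billions * bytes_per_weight  # weight footprint in GB
    return HBM_BANDWIDTH_GB_S / model_gb           # upper bound for a single stream

# Llama 3.1 8B in BF16 (2 bytes/weight) vs a hypothetical INT4 quant (0.5 bytes/weight)
print(round(decode_ceiling_tok_s(8.03, 2.0)))  # ~209 tok/s ceiling
print(round(decode_ceiling_tok_s(8.03, 0.5)))  # ~834 tok/s: quantization raises the ceiling 4x
```

Real throughput lands below these ceilings, but the shape of the argument is exactly why quantization and speculative decoding both attack the same wall: fewer bytes per generated token.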
The companies that win at inference don't do one of these well. They do all of them well, integrated into a single runtime that compounds the wins. That integrated runtime is the actual product moat. Same model, same hardware, dramatically different economics.
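The multi-LoRA economics above are visible even in a toy sketch. Assuming nothing beyond the LoRA paper's formulation (y = Wx + (α/r)·BAx, with B initialized to zero): the base weight W is shared by every customer, and each adapter is just two skinny matrices. The customer names and dimensions below are made up for illustration.

```python
import numpy as np

d, r = 4096, 16                       # hidden size, LoRA rank (toy values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))       # shared base weight: one copy for ALL customers

def make_adapter():
    # Per-customer LoRA pair: A is (r x d), B is (d x r) and starts at zero,
    # so a fresh adapter is a no-op until it's trained (per the LoRA paper).
    return rng.standard_normal((r, d)) * 0.01, np.zeros((d, r))

adapters = {name: make_adapter() for name in ["acme", "globex"]}  # hypothetical customers

def forward(x, customer, alpha=32):
    A, B = adapters[customer]
    return W @ x + (alpha / r) * (B @ (A @ x))  # y = Wx + (alpha/r)·BAx

# Storage ratio: each adapter is 2*d*r params vs d*d for the base layer
print(2 * d * r / (d * d))  # 0.0078125: each fine-tune adds <1% of the base weights
```

That sub-1% marginal cost per customer is the whole business case: one resident base model, thousands of cheap adapters swapped per request.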
My read on this: when inference companies say things like "we're the fastest" or "we're the cheapest," what they're actually claiming, whether they spell it out or not, is that they have real technical breakthroughs at one or more of these specific angles. Getting more output tokens out of every byte the GPU reads from HBM (so each weight read does more useful work). Fitting more concurrent users into the same KV cache budget. Higher acceptance rates on speculative decoding. More aggressive quantization without sacrificing quality. None of these wins is dramatic in isolation. Stack them together inside one integrated runtime and you get a real edge. On token volumes large enough to matter, that edge is the difference between a viable inference business and a commodity passthrough.
This is the lens I'm planning to use going forward. When I read an inference company's blog post or benchmark claim, I want to ask: which of these specific levers are they actually pulling, and how hard?
Where my knowledge ends (for now)
There's a lot I skipped. Either because it would have made the piece three times longer, or because I haven't gone deep enough on it myself yet. Honest list:
- The math of softmax and RMSNorm. Mentioned what they do, didn't derive them. Original Transformer paper covers softmax. RMSNorm has its own paper from 2019.
- Multi-GPU orchestration in detail. Just named the three approaches (TP, PP, DP). For the actual mechanics, vLLM's docs and the Megatron-LM paper are the canonical reads.
- Continuous batching scheduling. Covered the concept, not the production-engineering detail. The Orca paper is the definitive treatment. vLLM's blog post on PagedAttention is the practical implementation guide.
- Quantization techniques in depth. Covered the conceptual model. For the actual algorithms (GPTQ, AWQ, SmoothQuant), the original papers are the place to go.
- DeepSeek's MLA and other KV compression schemes. Mentioned but not unpacked. Their V3 technical report goes deep.
- Self-speculation methods (EAGLE, Medusa). The modern frontier of speculative decoding.
- The actual GPU kernels. Waved at FlashAttention, FireAttention, custom CUDA. I have not written one. This is the layer I haven't touched yet, and probably the most important layer for understanding why some inference runtimes genuinely outperform others. It's next on my list.
- Training mechanics. Backprop, gradients, optimizer state. Entirely out of scope.
Why I wrote this
I want to break into the inference space, specifically the product side at companies like Fireworks.ai. For that, I need real functional knowledge of what's actually happening underneath. So I sat down and learned it.
This piece is the artifact of that. I went deep over a few days. Read papers, blog posts, model cards, datasheets. Asked a lot of questions. Wrote things down. Got things wrong, fixed them, wrote them down again. The whole thing is anchored to Llama 3.1 8B on an Nvidia H100 because I've found details only stick with me when I'm working through one concrete example end-to-end.
I'm sure there are details I got slightly wrong despite my best efforts. Kernel-level stuff is genuinely on my list to learn next. The other thing I'm doing now is actually testing the products in this space, hoping this understanding helps me form sharper opinions about what's good and what's not, and gives me real context on the current market. I know Cerebras and Groq are attacking inference from the chip level instead of the software-on-Nvidia level, so that's another angle I want to dig into.
If you've worked on inference at any depth and you spot something off, I'd genuinely value the pushback.
— Krithik
References
The load-bearing facts in this piece, with sources:
Llama 3.1 8B
- Model card and config (Meta / HuggingFace)
- Llama 3.1 8B Instruct variant
- HuggingFace Transformers Llama documentation
Nvidia H100
- H100 Tensor Core GPU Datasheet (PDF)
- Nvidia Hopper Architecture In-Depth blog post
- H100 product page
Foundational papers behind the techniques referenced
- Vaswani et al., Attention Is All You Need (the transformer)
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Grouped-Query Attention)
- Shazeer, GLU Variants Improve Transformer (SwiGLU)
- Zhang & Sennrich, Root Mean Square Layer Normalization (RMSNorm)
- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
- Kwon et al., Efficient Memory Management for LLM Serving with PagedAttention (vLLM)
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (continuous batching)
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding
- Hu et al., LoRA: Low-Rank Adaptation of Large Language Models