Building a Transformer: The Complete Forward Pass
Part 3 of 4: Assembling the Decoder-Only Transformer
TL;DR: Parts 1 and 2 gave us attention and position. This article assembles them into a working model. We build the scaffolding: normalization (RMSNorm, and the pre-norm vs post-norm question that decides whether a deep stack trains), residual connections (the gradient highway), and the feed-forward network (expansion, activation, compression). We wrap it into a transformer block, then stack the block into a full decoder-only transformer with a token embedding, a final norm, and a language-model head. We add a KV-cached `generate()` and count the parameters of a real 123.6M model, component by component. Every line is backed by tests at rlvr-from-scratch.
Prerequisites: Part 1: Attention (scaled dot-product + multi-head attention, causal mask, KV-cache) and Part 2: Positional Encoding (RoPE applied to Q and K after the head split).
From Components to a Model
We have two components. In Part 1, attention let every token gather information from every other token. In Part 2, RoPE gave the model a sense of order by rotating queries and keys.
Neither is a transformer. A transformer is what you get when you wrap those components in normalization and residual connections, add a feed-forward network, and stack the result $N$ times, then bookend the stack with a token embedding on one side and a language-model head on the other.
This article builds that, top to bottom, in the order you’d reach for each piece. By the end, integer token ids go in one side and next-token logits come out the other — a complete forward pass through a real model — and a generate() method turns that forward pass into text.
Throughout, every architectural choice has a classical answer (the 2017 Attention Is All You Need design) and a modern one (what Llama, Qwen, and Mistral actually ship). We build modern by default and keep classical one argument away, because the contrast is where the understanding lives.
1. Layer Normalization
Stack dozens of layers and activation magnitudes drift — they grow or shrink as data passes through, gradients follow, and training destabilizes. Normalization holds activations in a sane range at every layer. The question is which normalization, and where to put it.
LayerNorm: the classical choice
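LayerNorm normalizes each token’s feature vector to zero mean and unit variance, then applies a learned scale and shift:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance over the feature dimension. Two learned vectors of size $d_{\text{model}}$: scale $\gamma$ and shift $\beta$.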
RMSNorm: the modern choice
RMSNorm asks whether the mean subtraction earns its keep. Zhang & Sennrich (2019) argued the win comes from re-scaling, not re-centering, so RMSNorm drops both the mean $\mu$ and the shift $\beta$, normalizing by the root-mean-square and applying a learned scale only:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}, \qquad d = d_{\text{model}}$$

How much is actually lost by dropping the mean? Expand the squared root-mean-square:

$$\text{RMS}(x)^2 = \frac{1}{d}\sum_{i=1}^{d} x_i^2 = \sigma^2 + \mu^2$$

where $\mu$ is the mean. So RMSNorm’s denominator and LayerNorm’s differ by exactly one term, $\mu^2$. When the mean is small (which it tends to be once activations have passed through a layer or two), the two are nearly identical. RMSNorm isn’t a different idea from LayerNorm; it’s LayerNorm’s scaling under the assumption that the centering contributes little, and empirically it doesn’t lose much, if anything.
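A quick numeric check makes the identity concrete. The sketch below is plain Python (no PyTorch needed) on a made-up activation vector; the helper names are illustrative and not part of the article's modules. It confirms that mean(x²) equals variance plus squared mean, and that the two denominators only separate once the mean is large.

```python
import math


def moments(x):
    """Return (mean, variance, mean of squares) over the feature dimension."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    mean_sq = sum(v * v for v in x) / len(x)
    return mu, var, mean_sq


def layer_norm_denominator(x, eps=1e-6):
    """sqrt(variance + eps): what LayerNorm divides the centered vector by."""
    _, var, _ = moments(x)
    return math.sqrt(var + eps)


def rms_norm_denominator(x, eps=1e-6):
    """sqrt(mean(x^2) + eps): what RMSNorm divides by."""
    _, _, mean_sq = moments(x)
    return math.sqrt(mean_sq + eps)


x = [0.9, -1.1, 0.4, -0.3, 1.2, -1.0]  # made-up activations with a small mean
mu, var, mean_sq = moments(x)

# The identity: mean(x^2) = variance + mean^2
identity_gap = abs(mean_sq - (var + mu * mu))

# Small mean: the two denominators nearly coincide.
small_mean_gap = abs(rms_norm_denominator(x) - layer_norm_denominator(x))

# Shift every feature by +3 and the mean^2 term dominates the difference.
shifted = [v + 3.0 for v in x]
large_mean_gap = abs(rms_norm_denominator(shifted) - layer_norm_denominator(shifted))

print(f"identity gap      : {identity_gap:.2e}")
print(f"gap, small mean   : {small_mean_gap:.2e}")
print(f"gap, mean near 3  : {large_mean_gap:.2e}")
```

The shifted case is the regime where RMSNorm and LayerNorm would genuinely behave differently; the argument above is that trained activations rarely sit there.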
One learned vector instead of two, no mean to compute, no shift to add. It trains as well as LayerNorm and costs less. Every major open-weight LLM since Llama uses it.
import torch
from torch import Tensor, nn


class RMSNorm(nn.Module):
    def __init__(self, d_model: int, *, eps: float = 1e-6) -> None:
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # scale, no shift
        self.eps = eps

    def forward(self, x: Tensor) -> Tensor:
        mean_sq = x.pow(2).mean(dim=-1, keepdim=True)   # (..., 1)
        rms_inv = torch.rsqrt(mean_sq + self.eps)       # eps INSIDE the sqrt
        return self.gamma * (x * rms_inv)               # (..., d_model)
Key Insight: The epsilon goes inside the square root: `rsqrt(mean_sq + eps)`, not `x / (rms + eps)`. This matches Llama, HuggingFace, and `F.rms_norm`. The placement looks trivial, but it changes the numerics, and getting it wrong silently shifts the outputs of any external checkpoint you load. A from-scratch implementation that can’t load Llama weights isn’t a from-scratch Llama.
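To see how far apart the two epsilon placements can drift, here is a small plain-Python sketch. The vectors and function names are made up for illustration. It compares the scaling factor `1 / sqrt(mean(x²) + eps)` against `1 / (rms + eps)` on a typical-magnitude vector and on a very small one.

```python
import math


def inv_rms_eps_inside(x, eps=1e-6):
    """1 / sqrt(mean(x^2) + eps): eps inside the square root."""
    mean_sq = sum(v * v for v in x) / len(x)
    return 1.0 / math.sqrt(mean_sq + eps)


def inv_rms_eps_outside(x, eps=1e-6):
    """1 / (rms + eps): eps added after the square root."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return 1.0 / (rms + eps)


def relative_gap(x, eps=1e-6):
    """Relative difference between the two scaling factors."""
    inside = inv_rms_eps_inside(x, eps)
    outside = inv_rms_eps_outside(x, eps)
    return abs(inside - outside) / inside


typical = [0.5, -1.0, 2.0, -0.25]      # activations of ordinary magnitude
tiny = [1e-4, -2e-4, 1e-4, -1e-4]      # activations close to zero

typical_gap = relative_gap(typical)
tiny_gap = relative_gap(tiny)

print(f"relative gap, typical activations: {typical_gap:.2e}")
print(f"relative gap, tiny activations   : {tiny_gap:.2e}")
```

For ordinary magnitudes the two conventions agree to within rounding-level differences; for near-zero activations they produce scaling factors that differ by a large multiple, which is where a mismatched convention would show up first.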
Why placement matters: pre-norm vs post-norm
Each sublayer (attention, then the FFN) is wrapped in a normalization and a residual connection. The single most consequential decision in the whole block is the order of those two.
Post-norm (classical) normalizes after the residual add:

$$y = \text{Norm}(x + \text{Sublayer}(x))$$

Pre-norm (modern) normalizes before the sublayer, inside the residual:

$$y = x + \text{Sublayer}(\text{Norm}(x))$$
That difference looks cosmetic. It is not — and the next section, on residual connections, is where you see exactly why.
The module, with LayerNorm kept as the classical baseline, is at norm.py.
2. Residual Connections
A residual connection adds a sublayer’s input to its output:

$$y = x + F(x)$$

He et al. (2015) introduced this to train very deep networks, and the reason is the gradient highway. Consider the backward pass. The derivative of $y$ with respect to $x$ is:

$$\frac{\partial y}{\partial x} = I + \frac{\partial F(x)}{\partial x}$$

That identity term (the $+1$ in the scalar case) is everything. It means the gradient can flow back through the residual path undiminished, no matter how small the sublayer’s own gradient is. Stack 50 layers without residuals and the gradient is multiplied by 50 Jacobians on the way back; if those are small, it vanishes. With residuals, there’s always a path where the gradient is multiplied by 1, straight from the loss to any layer.
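The effect is easy to reproduce with scalars. The sketch below is a toy, not a transformer: each "layer" is assumed to be a linear map F(x) = a·x with a small made-up slope a, so the end-to-end gradient is a plain product. Without residuals the product is of the slopes; with residuals it is a product of (1 + a).

```python
def grad_without_residuals(slopes):
    """d(out)/d(in) for y = a_N * ... * a_1 * x: a product of small slopes."""
    grad = 1.0
    for a in slopes:
        grad *= a
    return grad


def grad_with_residuals(slopes):
    """d(out)/d(in) for stacked y = x + a * x blocks: a product of (1 + a)."""
    grad = 1.0
    for a in slopes:
        grad *= 1.0 + a
    return grad


# 50 toy layers whose own gradients are small (slopes alternate +0.1, -0.1).
slopes = [0.1, -0.1] * 25

plain_grad = grad_without_residuals(slopes)
residual_grad = grad_with_residuals(slopes)

print(f"50 layers, no residuals  : {plain_grad:.3e}")
print(f"50 layers, with residuals: {residual_grad:.3f}")
```

The plain stack's gradient is numerically indistinguishable from zero, while the residual stack's stays on the order of 1, because every factor carries the identity term.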
Now return to pre-norm vs post-norm. The residual highway only works if it stays an identity path. Look at where the norm sits:
- Post-norm: $\text{Norm}(x + \text{Sublayer}(x))$. The norm wraps the entire sum, including the residual. The highway is interrupted at every layer: the gradient has to pass through the normalization’s Jacobian on the way back. Deep post-norm transformers are touchy: they need learning-rate warmup and grow less stable with depth.
- Pre-norm: $x + \text{Sublayer}(\text{Norm}(x))$. The norm only conditions what goes into the sublayer. The residual flows from block input to block output untouched. The highway is unbroken.
Make that concrete by differentiating one block as a map from its input $x$ to its output $y$. Pre-norm gives

$$\frac{\partial y}{\partial x} = I + \frac{\partial\,\text{Sublayer}(\text{Norm}(x))}{\partial x}$$

so back-propagating through $N$ stacked blocks always retains the identity path: the gradient never has to survive a product of $N$ Jacobians to reach an early layer. Post-norm gives

$$\frac{\partial y}{\partial x} = J_{\text{Norm}} \left( I + \frac{\partial\,\text{Sublayer}(x)}{\partial x} \right)$$

where $J_{\text{Norm}}$ is the normalization’s Jacobian. Now every one of the $N$ layers multiplies the gradient by a $J_{\text{Norm}}$, and a product of $N$ such factors can shrink or distort the signal long before it reaches the bottom of the stack. That product is the instability, and it’s why post-norm needs warmup to get started and gets harder to train the deeper you go.
Key Insight: Pre-norm keeps an unbroken identity path from a block’s input to its output; the sublayers add corrections on top, and the norm never sits on the highway. That single structural property — more than the choice of RMSNorm over LayerNorm — is why modern transformers train stably at depth without warmup gymnastics. If you change only one thing from the 2017 design, change this.
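One way to isolate the residual path is to zero out the sublayer and watch what is left. The plain-Python sketch below does that with a gain-free RMSNorm and a made-up input; the function names are illustrative. With the sublayer contributing nothing, a pre-norm block is the exact identity, while a post-norm block still pushes the stream through the norm.

```python
import math


def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (gamma fixed at 1)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def zero_sublayer(x):
    """A sublayer that contributes nothing, so only the residual path remains."""
    return [0.0 for _ in x]


def pre_norm_block(x, sublayer):
    """y = x + Sublayer(Norm(x))"""
    out = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, out)]


def post_norm_block(x, sublayer):
    """y = Norm(x + Sublayer(x))"""
    out = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, out)])


x = [3.0, -4.0, 12.0, 0.5]  # made-up residual stream for one token
pre, post = x, x
for _ in range(6):  # six stacked blocks
    pre = pre_norm_block(pre, zero_sublayer)
    post = post_norm_block(post, zero_sublayer)

print("pre-norm output :", pre)
print("post-norm output:", [round(v, 4) for v in post])
```

The pre-norm stack returns the input unchanged, so its Jacobian along the residual path is the identity. The post-norm stack returns a rescaled vector, which is the norm sitting on the highway.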
The one cost of pre-norm: the residual stream itself is never normalized on the way through, so the model wants a single final norm after the last block. We add it when we assemble the full model.
3. The Feed-Forward Network
Attention mixes information across positions. The feed-forward network (FFN) transforms each position independently — same function applied to every token. It’s where most of a transformer’s parameters live.
Expansion → activation → compression
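The classical FFN is two linear layers with a non-linearity between them. It expands the representation to a wider hidden dimension $d_{\text{ff}}$, applies a pointwise activation, and compresses back to $d_{\text{model}}$:

$$\text{FFN}(x) = W_2\,\text{GELU}(W_1 x)$$

with $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ (biases omitted), typically $d_{\text{ff}} = 4\,d_{\text{model}}$. The expansion is the point: the wider hidden space gives the non-linearity room to compute richer per-token features before projecting back down. GELU, a smooth, probabilistic relative of ReLU, became the standard activation in the GPT/BERT era.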
import torch.nn.functional as F


class GeluFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, *, bias: bool = False) -> None:
        super().__init__()
        self.W_1 = nn.Linear(d_model, d_ff, bias=bias)  # expand
        self.W_2 = nn.Linear(d_ff, d_model, bias=bias)  # compress

    def forward(self, x: Tensor) -> Tensor:
        hidden = F.gelu(self.W_1(x))                    # (B, T, d_ff)
        return self.W_2(hidden)                         # (B, T, d_model)
SwiGLU: the modern variant
Modern models replace the single up-projection with a gated one. There are now two projections into the hidden dimension, an “up” projection producing values and a “gate” projection passed through SiLU, multiplied element-wise before the down-projection:

$$\text{SwiGLU}(x) = W_{\text{down}}\big(\text{SiLU}(W_{\text{gate}}\,x) \odot W_{\text{up}}\,x\big)$$

where $\text{SiLU}(z) = z \cdot \sigma(z)$ and $\sigma$ is the logistic sigmoid. The gate lets the network scale each hidden unit based on the input rather than passing it through a fixed curve. Shazeer (2020) showed this beats the vanilla FFN at matched parameter count.
class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int, *, bias: bool = False) -> None:
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_ff, bias=bias)
        self.W_up = nn.Linear(d_model, d_ff, bias=bias)
        self.W_down = nn.Linear(d_ff, d_model, bias=bias)  # d_ff -> d_model

    def forward(self, x: Tensor) -> Tensor:
        gate = F.silu(self.W_gate(x))                      # (B, T, d_ff)
        up = self.W_up(x)                                  # (B, T, d_ff)
        return self.W_down(gate * up)                      # (B, T, d_model)
Watch the shapes: W_gate and W_up map d_model -> d_ff; W_down maps d_ff -> d_model. Getting that last one backwards — d_model -> d_ff — is the most common from-scratch FFN bug. The hidden tensor is (B, T, d_ff), so the down-projection’s input must be d_ff.
Key Insight: SwiGLU has three weight matrices where the vanilla FFN has two. To keep the parameter budget comparable, the hidden dimension is shrunk to roughly two-thirds of the vanilla width: Llama uses $d_{\text{ff}} \approx \tfrac{8}{3}\,d_{\text{model}}$ instead of $4\,d_{\text{model}}$. The extra matrix buys expressivity; the shrink keeps the comparison honest.
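The budget matching is simple arithmetic, sketched here in plain Python. The helper `swiglu_hidden_dim` is a hypothetical name; its round-up-to-a-multiple step follows the style of Llama's reference code, with `multiple_of=256` taken as an assumption.

```python
def gelu_ffn_params(d_model, d_ff):
    """Two matrices: d_model -> d_ff and d_ff -> d_model (no biases)."""
    return 2 * d_model * d_ff


def swiglu_ffn_params(d_model, d_ff):
    """Three matrices: gate, up (d_model -> d_ff) and down (d_ff -> d_model)."""
    return 3 * d_model * d_ff


def swiglu_hidden_dim(d_model, multiple_of=256):
    """Take 2/3 of the vanilla 4 * d_model width, rounded up to a multiple.

    The rounding granularity is an assumption made for this sketch.
    """
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


d_model = 768
d_ff_vanilla = 4 * d_model                    # 3072
d_ff_swiglu = swiglu_hidden_dim(d_model)      # 2048

vanilla_params = gelu_ffn_params(d_model, d_ff_vanilla)
swiglu_params = swiglu_ffn_params(d_model, d_ff_swiglu)

print(f"vanilla FFN, d_ff={d_ff_vanilla}: {vanilla_params:,} parameters")
print(f"SwiGLU FFN,  d_ff={d_ff_swiglu}: {swiglu_params:,} parameters")
```

At `d_model=768` the two budgets match exactly, because 8/3 × 768 = 2048 is already a multiple of 256. At other widths the rounding makes the match approximate.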
Both FFNs share the constructor signature (d_model, d_ff, *, bias=False), so they drop into the block interchangeably. Module: ffn.py.
4. The Transformer Block
Now we tie the pieces together. A block holds two sublayers — attention and FFN — each with its own norm, threaded through residuals. There’s one loose thread to close first: RoPE.
In Part 2 we built RotaryPositionalEmbedding to rotate Q and K after the head split. But Part 1’s MultiHeadAttention splits heads internally, so there was nowhere to apply it. The fix is a small, additive hook on attention — an optional rope module applied right after the split and right before the KV-cache concatenation:
# Inside MultiHeadAttention.forward, after splitting heads:
if self.rope is not None:
    offset = kv_cache[0].size(2) if kv_cache is not None else 0
    Q, K = self.rope(Q, K, offset=offset)  # both (B, H, T, d_k)
The offset is the subtle part: with a KV-cache, new tokens start at position cache_length, not 0. The cached keys were already rotated at their own step; we rotate only the new ones, then concatenate. Values are never rotated — position lives in the Q·K score. Default rope=None, so attention is unchanged unless you opt in.
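The offset rule can be checked without any tensors. The sketch below models a single RoPE frequency pair as a 2-D rotation in plain Python; `THETA`, the key values, and the function names are made up for illustration. It compares a full pass against cached decoding, once with the correct offset and once with the offset forgotten.

```python
import math

THETA = 0.3  # hypothetical rotation frequency for one 2-D pair


def rotate(vec, position):
    """Rotate a 2-D vector by position * THETA (one RoPE frequency pair)."""
    angle = position * THETA
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)


def rope(vectors, offset=0):
    """Rotate the i-th vector as if it sat at absolute position offset + i."""
    return [rotate(v, offset + i) for i, v in enumerate(vectors)]


def max_abs_diff(a, b):
    """Largest coordinate-wise difference between two lists of 2-D vectors."""
    return max(abs(p - q) for u, v in zip(a, b) for p, q in zip(u, v))


keys = [(1.0, 0.0), (0.5, -0.2), (0.3, 0.8), (-0.7, 0.1)]

# Full pass: all four keys rotated at positions 0..3.
full = rope(keys)

# Cached decoding: three rotated keys sit in the cache, one new key arrives.
cache = rope(keys[:3])
correct = cache + rope(keys[3:], offset=len(cache))  # new key at position 3
wrong = cache + rope(keys[3:], offset=0)             # offset forgotten

print("correct offset, max diff vs full pass:", max_abs_diff(full, correct))
print("missing offset, max diff vs full pass:", round(max_abs_diff(full, wrong), 4))
```

With the offset set to the cache length, the cached path reproduces the full pass; without it, the newest key is rotated as if it were the first token and the keys no longer match.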
With that in place, the block is straightforward:
class TransformerBlock(nn.Module):
    def __init__(
        self, d_model, n_heads, d_ff, *,
        pre_norm=True, norm_cls=RMSNorm, ffn_cls=SwiGLU, rope=None, bias=False,
    ) -> None:
        super().__init__()
        self.pre_norm = pre_norm
        self.attn = MultiHeadAttention(d_model, n_heads, bias=bias, rope=rope)
        self.ffn = ffn_cls(d_model, d_ff, bias=bias)
        self.norm1 = norm_cls(d_model)
        self.norm2 = norm_cls(d_model)

    def forward(self, x, mask=None, kv_cache=None):
        if self.pre_norm:
            normed = self.norm1(x)
            attn_out, _, new_kv = self.attn(normed, normed, normed, mask, kv_cache)
            x = x + attn_out                     # residual
            x = x + self.ffn(self.norm2(x))      # residual
        else:                                    # post-norm
            attn_out, _, new_kv = self.attn(x, x, x, mask, kv_cache)
            x = self.norm1(x + attn_out)
            x = self.norm2(x + self.ffn(x))
        return x, new_kv
The defaults give the modern block. Three keyword arguments switch it to the classical post-norm layout (the example also widens the FFN to 4 × d_model and passes no RoPE):
# Modern (default): pre-norm + RMSNorm + SwiGLU
block = TransformerBlock(d_model=768, n_heads=12, d_ff=2048, rope=rope)
# Classical: post-norm + LayerNorm + GELU
block = TransformerBlock(
    768, 12, 3072, pre_norm=False, norm_cls=LayerNorm, ffn_cls=GeluFFN,
)
That’s the payoff of keeping both: same class, same forward, and you A/B two architectures by swapping constructor arguments. Module: block.py.
5. The Full Decoder-Only Transformer
A decoder-only transformer (the GPT family) is the block, repeated $N$ times, between an embedding and a head:

token ids → embedding → $N$ × block → final norm → LM head → logits
Three things deserve attention in the assembly.
No positional embedding at the input. Because RoPE lives inside attention, we do not add a sinusoidal or learned position vector to the token embeddings. Position enters later, per-head, as rotation. The input is just the token lookup.
One RoPE, shared. The rotation is identical at every depth, so a cos/sin cache per layer would be pure waste. We build a single RotaryPositionalEmbedding and inject the same instance into every block.
Weight tying. The language-model head projects the final hidden state back to vocabulary logits — a (d_model -> vocab) matrix, which is the transpose-shaped twin of the (vocab -> d_model) embedding. Tying them (sharing one weight) is standard since GPT-2: it saves a large matrix and tends to help, since a token’s input and output representations are related.
class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, d_ff, *,
                 max_len=8192, tie_weights=True, bias=False):
        super().__init__()
        self.d_model = d_model    # read by generate() to size the empty caches
        self.n_heads = n_heads
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.rope = RotaryPositionalEmbedding(d_model // n_heads, max_len=max_len)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff, rope=self.rope, bias=bias)
            for _ in range(n_layers)
        )
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_weights:
            self.lm_head.weight = self.token_emb.weight  # share the matrix

    def forward(self, input_ids, kv_caches=None, mask=None):
        _, T = input_ids.shape
        x = self.token_emb(input_ids)                    # (B, T, d_model)
        if mask is None and T > 1:
            mask = causal_mask(T, device=input_ids.device)
        new_caches = []
        for i, block in enumerate(self.blocks):
            cache_i = kv_caches[i] if kv_caches is not None else None
            x, new_cache = block(x, mask=mask, kv_cache=cache_i)
            new_caches.append(new_cache)
        x = self.final_norm(x)                           # the pre-norm tax
        logits = self.lm_head(x)                         # (B, T, vocab_size)
        return logits, new_caches
The forward handles both regimes from one signature. With `kv_caches=None` it’s a plain full pass (training): it builds a causal mask of size $T \times T$ and runs the stack. With caches supplied it’s incremental decoding, which is what `generate()` uses next. Module: transformer.py.
6. The generate() Method
Training processes a whole sequence at once. Generation produces one token at a time, feeding each new token back in. Done naively, every step re-encodes the entire prefix, so the wasted projection work grows quadratically with sequence length. The KV-cache fixes this: store each layer’s keys and values, and at each step compute only the new token’s K and V, then append.
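A back-of-the-envelope count shows the size of the saving. The plain-Python sketch below counts how many token positions are pushed through the stack to produce a given number of new tokens; the function names and the 16-token prompt are made up for illustration, and only the passes strictly needed to produce the tokens are counted.

```python
def token_passes_without_cache(prompt_len, new_tokens):
    """Each step re-encodes the whole prefix: prompt, prompt + 1, prompt + 2, ..."""
    return sum(prompt_len + step for step in range(new_tokens))


def token_passes_with_cache(prompt_len, new_tokens):
    """Prefill the prompt once, then feed one new token per remaining step."""
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 16, 64
naive = token_passes_without_cache(prompt_len, new_tokens)   # 3040
cached = token_passes_with_cache(prompt_len, new_tokens)     # 79

print(f"without cache: {naive} token positions projected")
print(f"with cache   : {cached} token positions projected")
```

The uncached count grows with the square of the generated length, while the cached count grows linearly. Attention over the cached keys still costs time proportional to the sequence length at each step, so the cache removes the repeated projections, not the attention itself.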
The method has two phases — prefill the prompt in one pass, then decode one token at a time:
@torch.no_grad()
def generate(self, input_ids, max_new_tokens, *, temperature=1.0, top_k=None):
    self.eval()
    B = input_ids.size(0)
    d_k = self.d_model // self.n_heads
    ref = self.token_emb.weight  # caches must match the model's device and dtype
    # Start every block with an empty cache so blocks return grown ones.
    caches = [(torch.zeros(B, self.n_heads, 0, d_k, device=ref.device, dtype=ref.dtype),
               torch.zeros(B, self.n_heads, 0, d_k, device=ref.device, dtype=ref.dtype))
              for _ in self.blocks]
    logits, caches = self.forward(input_ids, kv_caches=caches)       # prefill
    generated = input_ids
    for _ in range(max_new_tokens):
        next_logits = logits[:, -1, :]                               # (B, vocab)
        next_token = self._sample(next_logits, temperature, top_k)
        generated = torch.cat([generated, next_token], dim=1)
        logits, caches = self.forward(next_token, kv_caches=caches)  # decode 1
    return generated
Sampling is where you control the output’s character:
- Temperature scales the logits before softmax. Below 1 sharpens the distribution (more deterministic); above 1 flattens it (more diverse). At temperature 0 we take the argmax: pure greedy.
- Top-k restricts sampling to the $k$ most likely tokens, masking the rest to $-\infty$ before softmax. It prevents the long tail of low-probability tokens from occasionally derailing the output.
@staticmethod
def _sample(logits, temperature, top_k):
    if temperature <= 0.0:
        return logits.argmax(dim=-1, keepdim=True)  # greedy
    logits = logits / temperature
    if top_k is not None:
        kth = logits.topk(min(top_k, logits.size(-1)), dim=-1).values[:, -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1)
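To see what the two knobs do to the distribution itself, here is a plain-Python sketch on four made-up logits. `shape_distribution` is a hypothetical helper that mirrors the scale-then-mask-then-softmax order used above, without the final random draw.

```python
import math


def softmax(logits):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shape_distribution(logits, temperature=1.0, top_k=None):
    """Apply temperature, then top-k masking, then softmax."""
    scaled = [v / temperature for v in logits]
    if top_k is not None:
        # The k-th largest scaled logit is the cutoff; smaller ones are masked.
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [v if v >= kth else float("-inf") for v in scaled]
    return softmax(scaled)


logits = [2.0, 1.0, 0.5, -1.0]  # made-up next-token logits

base = shape_distribution(logits)
cold = shape_distribution(logits, temperature=0.5)  # sharper
hot = shape_distribution(logits, temperature=2.0)   # flatter
top2 = shape_distribution(logits, top_k=2)          # only two tokens survive

for name, probs in [("T=1.0", base), ("T=0.5", cold), ("T=2.0", hot), ("top_k=2", top2)]:
    print(f"{name:8s}", [round(p, 3) for p in probs])
```

Lower temperature moves probability mass onto the top token, higher temperature spreads it out, and top-k zeroes everything outside the cutoff before the remaining mass is renormalized.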
The invariant that proves it correct
A KV-cache is easy to write and easy to get subtly wrong — especially the RoPE offset. Shape tests pass either way. There is exactly one test that pins down correctness, end to end:
Decoding token-by-token with a KV-cache must produce the same logits as a single full forward pass, up to floating-point tolerance.
If those agree, then the embedding, every block, every RoPE offset, the final norm, and the head all compose correctly under caching. A wrong offset breaks the agreement instantly — a key rotated for position 3 won’t match the same key rotated for position 5.
def test_incremental_decode_matches_full_forward():
    # Assumes module-level fixtures: model, VOCAB, T, H, D_K, N_LAYERS.
    model.eval()
    ids = torch.randint(0, VOCAB, (1, T))
    full_logits, _ = model(ids)                      # ground truth
    caches = [(torch.zeros(1, H, 0, D_K), torch.zeros(1, H, 0, D_K))
              for _ in range(N_LAYERS)]
    step_logits = []
    for t in range(T):
        lg, caches = model(ids[:, t:t+1], kv_caches=caches)
        step_logits.append(lg)
    incremental = torch.cat(step_logits, dim=1)
    # assert_close needs rtol and atol together; passing only one raises.
    torch.testing.assert_close(full_logits, incremental, rtol=1e-5, atol=1e-5)
This is the load-bearing test of the whole article. Everything else supports it.
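The same invariant can be exercised on a stripped-down attention layer. The plain-Python sketch below is a toy under stated assumptions: one head, no projections, no RoPE, made-up 2-D vectors, and illustrative function names. It checks that growing a key/value cache one position at a time reproduces the causally masked full pass.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def full_causal_pass(qs, ks, vs):
    """Position t attends to positions 0..t (the causal mask, written as a slice)."""
    return [attend(qs[t], ks[: t + 1], vs[: t + 1]) for t in range(len(qs))]


def incremental_pass(qs, ks, vs):
    """Feed one position at a time, appending to a key/value cache."""
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs


qs = [(0.2, -0.1), (1.0, 0.5), (-0.3, 0.8), (0.6, 0.6)]
ks = [(0.4, 0.9), (-0.5, 0.2), (0.7, -0.6), (0.1, 0.3)]
vs = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (-1.0, 2.0)]

full = full_causal_pass(qs, ks, vs)
stepwise = incremental_pass(qs, ks, vs)
worst_gap = max(abs(a - b) for u, v in zip(full, stepwise) for a, b in zip(u, v))

print("max |full - incremental| =", worst_gap)
```

In the real model the comparison has to go through embeddings, RoPE offsets, every block, the final norm, and the head, which is why the end-to-end version needs a tolerance and catches far more than this toy can.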
7. Parameter Counting
“How big is the model?” has an exact answer you can derive from the architecture. With bias=False everywhere, the per-component counts are clean:
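| Component | Formula | Notes |
|---|---|---|
| Token embedding | $V \cdot d_{\text{model}}$ | $V$ is the vocabulary size; shared with the LM head when tied |
| Attention (per block) | $4\,d_{\text{model}}^2$ | $W_Q$, $W_K$, $W_V$, $W_O$, each $d_{\text{model}} \times d_{\text{model}}$ |
| SwiGLU FFN (per block) | $3\,d_{\text{model}}\,d_{\text{ff}}$ | gate + up + down |
| RMSNorm (per block) | $2\,d_{\text{model}}$ | two $\gamma$ vectors |
| Final norm | $d_{\text{model}}$ | one $\gamma$ vector |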
Plug in a concrete GPT-2-small-scale config (vocabulary size $V = 50{,}257$, $d_{\text{model}} = 768$, 12 layers, 12 heads, $d_{\text{ff}} = 2048$) and the numbers come straight from model.num_parameters():
| Component | Count |
|---|---|
| Token embedding ($50{,}257 \times 768$) | 38,597,376 |
| Attention, per block ($4 \times 768^2$) | 2,359,296 |
| SwiGLU FFN, per block ($3 \times 768 \times 2048$) | 4,718,592 |
| Norms, per block ($2 \times 768$) | 1,536 |
| One block | 7,079,424 |
| All 12 blocks | 84,953,088 |
| Final norm | 768 |
| Non-embedding total | 84,953,856 |
| Total (tied head) | 123,551,232 (≈123.6M) |
| Total (untied head) | 162,148,608 |
Two things jump out. First, the FFN is twice the size of attention in every block ($3 \times 768 \times 2048$ vs $4 \times 768^2$): the feed-forward network, not attention, is where the parameters concentrate. Second, weight tying saves 38.6M parameters, about 24% of the untied total, by sharing one matrix between the embedding and the head. The norms, by contrast, are a rounding error: 1,536 parameters per block against millions.
model = Transformer(vocab_size=50257, d_model=768, n_layers=12,
                    n_heads=12, d_ff=2048)
model.num_parameters() # 123,551,232
model.num_parameters(non_embedding=True) # 84,953,856
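The totals can also be reproduced from the formulas alone, without instantiating a model. The sketch below is plain Python; `count_parameters` is a hypothetical helper, not the repository's `num_parameters()`, and it assumes `bias=False` everywhere, SwiGLU FFNs, and two RMSNorms per block as in this article.

```python
def count_parameters(vocab_size, d_model, n_layers, d_ff, *, tie_weights=True):
    """Closed-form parameter count for the bias-free decoder described above."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O
    ffn = 3 * d_model * d_ff                # gate, up, down
    norms = 2 * d_model                     # two RMSNorm scale vectors
    block = attention + ffn + norms
    final_norm = d_model
    non_embedding = n_layers * block + final_norm
    head = 0 if tie_weights else vocab_size * d_model
    return {
        "embedding": embedding,
        "block": block,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding + head,
    }


config = dict(vocab_size=50257, d_model=768, n_layers=12, d_ff=2048)
tied = count_parameters(**config)
untied = count_parameters(**config, tie_weights=False)

print(f"one block     : {tied['block']:,}")
print(f"non-embedding : {tied['non_embedding']:,}")
print(f"total, tied   : {tied['total']:,}")      # 123,551,232
print(f"total, untied : {untied['total']:,}")    # 162,148,608
```

The number of heads does not appear in the formula: splitting `d_model` across more or fewer heads reshapes the attention matrices without changing their size.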
8. Implementation
The model and its components are tested module by module.
| Module | Component | Role |
|---|---|---|
| norm.py | RMSNorm, LayerNorm | Per-sublayer normalization |
| ffn.py | SwiGLU, GeluFFN | Position-wise transformation |
| attention.py | MultiHeadAttention + RoPE hook | Cross-position mixing |
| block.py | TransformerBlock | The repeating unit |
| transformer.py | Transformer | Embedding → stack → norm → head, plus generate() |
Test coverage
Correctness:
- Forward output is `(B, T, vocab_size)`; one cache slot per layer
- Weight tying shares the embedding and head matrix; untied has more parameters
- A single RoPE instance is shared across all blocks
- `num_parameters(non_embedding=True)` matches the total minus the embedding
Generation:
- `generate()` returns `(B, T_prompt + max_new_tokens)` and preserves the prompt
- Greedy decoding (`temperature=0`) is deterministic
The invariant:
- Incremental KV-cached decoding equals the full forward pass — end to end, through embedding, blocks, RoPE, final norm, and head
Training:
- Gradients flow to the embedding, every block’s FFN, and the norms
- Invalid configurations raise `ValueError`
9. Verify Your Understanding
Answer each in your own words before opening it. If you can, you can explain how a transformer is built from its parts — which is the real test.
1. Why does the epsilon go inside the square root in RMSNorm rather than added to the RMS afterward?
Numerically the two are different functions, and the inside convention — rsqrt(mean(x²) + eps) — is the one Llama, HuggingFace, and F.rms_norm use. If your implementation adds eps outside (x / (rms + eps)), the scaling differs slightly, so a model trained or saved with the standard convention won’t behave the same when its weights are loaded into yours. Checkpoint compatibility, not just stability, is at stake.
2. Pre-norm and post-norm contain the same operations. Why does pre-norm train deep stacks more stably?
Because of where the normalization sits relative to the residual. Post-norm wraps the whole sum, Norm(x + Sublayer(x)), so the normalization’s Jacobian sits on the residual path and the gradient must pass through it at every layer. Pre-norm, x + Sublayer(Norm(x)), leaves the residual an unbroken identity path — the +1 in the backward pass survives all the way down. That clean highway is why pre-norm scales to many layers without learning-rate warmup.
3. SwiGLU has three weight matrices to the vanilla FFN's two. Why is its hidden dimension usually shrunk to about 8/3 · d_model?
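To keep the parameter count comparable. A vanilla FFN at $d_{\text{ff}} = 4\,d_{\text{model}}$ has $2 \cdot 4\,d_{\text{model}}^2 = 8\,d_{\text{model}}^2$ FFN parameters. SwiGLU with three matrices at $d_{\text{ff}} = \tfrac{8}{3}\,d_{\text{model}}$ has $3 \cdot \tfrac{8}{3}\,d_{\text{model}}^2 = 8\,d_{\text{model}}^2$, the same budget. The shrink is what makes “SwiGLU beats GELU at matched parameters” an apples-to-apples claim rather than just spending more weights.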
4. During cached generation, what RoPE offset does the new token get at decode step t, and why aren't the cached keys re-rotated?
The offset is the current cache length — the number of tokens already processed. The new token sits at that absolute position, so it’s rotated accordingly. The cached keys were each rotated by their own absolute position when they were the current token at their own step, and RoPE rotation is fixed per position, so re-rotating them would be wrong. Rotate the new K only, then concatenate. (Values are never rotated — position lives in the Q·K score.)
5. With weight tying, how many parameters does the language-model head add to the model — and why is the incremental-equals-full test the one that matters?
Zero. The tied head is the embedding matrix, shared, so it contributes no new parameters (in the 123.6M example, tying saves the full 38.6M the untied head would cost). As for the test: shapes can be right while behavior is silently wrong, especially the RoPE offset under a cache. The incremental-equals-full invariant exercises the entire forward path under caching and fails the instant any position is rotated or masked incorrectly — so it’s the test that actually certifies the model generates correctly.
What’s Next
We have a model that runs a complete forward pass and generates text — but its weights are random. In Part 4: Training From Scratch, we build the training loop: the cross-entropy objective over next-token logits, AdamW with decoupled weight decay, a cosine learning-rate schedule with linear warmup, gradient clipping, and the loss curve that tells you whether any of this learned. From a model that runs to a model that’s trained.
Further Reading
Original Papers:
- Root Mean Square Layer Normalization (Zhang & Sennrich, 2019) — RMSNorm
- Deep Residual Learning for Image Recognition (He et al., 2015) — residual connections
- GLU Variants Improve Transformer (Shazeer, 2020) — SwiGLU
- On Layer Normalization in the Transformer Architecture (Xiong et al., 2020) — pre-norm vs post-norm
Architecture in Practice:
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023) — RMSNorm + SwiGLU + RoPE + pre-norm
- Language Models are Unsupervised Multitask Learners (Radford et al., 2019) — GPT-2, weight tying
Implementation:
- rlvr-from-scratch — the tested model from this article
Cite this reference
Sousa, V. (2026). Building a Transformer: The Complete Forward Pass. vitorsousa.com (Foundation Reference). https://www.vitorsousa.com/foundations//
@article{sousa2026,
title={Building a Transformer: The Complete Forward Pass},
author={Sousa, Vitor},
year={2026},
note={Foundation Reference},
url={https://www.vitorsousa.com/foundations//}
}