Last updated: Apr 24, 2026 ~35 min intermediate
Prerequisites: Attention Is All You Need to Implement, Positional Encoding: Teaching Transformers to Count

Building a Transformer: The Complete Forward Pass

Part 3 of 4: Assembling the Decoder-Only Transformer

TL;DR: Parts 1 and 2 gave us attention and position. This article assembles them into a working model. We build the scaffolding — normalization (RMSNorm, and the pre-norm vs post-norm question that decides whether a deep stack trains), residual connections (the gradient highway), and the feed-forward network (expansion, activation, compression) — wrap it into a transformer block, then stack the block into a full decoder-only transformer with a token embedding, a final norm, and a language-model head. We add a KV-cached generate() and count the parameters of a real 123.6M model, component by component. Every line is backed by tests at rlvr-from-scratch.

Prerequisites: Part 1: Attention (scaled dot-product + multi-head attention, causal mask, KV-cache) and Part 2: Positional Encoding (RoPE applied to Q and K after the head split).


From Components to a Model

We have two components. In Part 1, attention let every token gather information from every other token. In Part 2, RoPE gave the model a sense of order by rotating queries and keys.

Neither is a transformer. A transformer is what you get when you wrap those components in normalization and residual connections, add a feed-forward network, and stack the result $N$ times, then bookend the stack with a token embedding on one side and a language-model head on the other.

This article builds that, top to bottom, in the order you’d reach for each piece. By the end, integer token ids go in one side and next-token logits come out the other — a complete forward pass through a real model — and a generate() method turns that forward pass into text.

Throughout, every architectural choice has a classical answer (the 2017 Attention Is All You Need design) and a modern one (what Llama, Qwen, and Mistral actually ship). We build modern by default and keep classical one argument away, because the contrast is where the understanding lives.


1. Layer Normalization

Stack dozens of layers and activation magnitudes drift — they grow or shrink as data passes through, gradients follow, and training destabilizes. Normalization holds activations in a sane range at every layer. The question is which normalization, and where to put it.

LayerNorm: the classical choice

LayerNorm normalizes each token’s feature vector to zero mean and unit variance, then applies a learned scale and shift:

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance over the feature dimension. There are two learned vectors of size $d_\text{model}$: scale $\gamma$ and shift $\beta$.

RMSNorm: the modern choice

RMSNorm asks whether the mean subtraction earns its keep. Zhang & Sennrich (2019) argued the win comes from re-scaling, not re-centering, so RMSNorm drops both the $-\mu$ and the $+\beta$, normalizing by the root-mean-square and applying a learned scale only:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\text{mean}(x^2) + \epsilon}}$$

How much is actually lost by dropping the mean? Expand the root-mean-square:

$$\text{RMS}(x) = \sqrt{\text{mean}(x^2)} = \sqrt{\text{var}(x) + \bar{x}^2}$$

where $\bar{x}$ is the mean. So RMSNorm’s denominator and LayerNorm’s $\sqrt{\sigma^2}$ differ by exactly one term under the square root: $\bar{x}^2$. When the mean is small (which it tends to be once activations have passed through a layer or two), the two are nearly identical. RMSNorm isn’t a different idea from LayerNorm; it’s LayerNorm’s scaling under the assumption that the centering contributes little, and empirically it doesn’t lose much, if anything.
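
A quick numeric check of that identity, as a minimal sketch in plain Python (no PyTorch needed). The vector `x` is an arbitrary made-up example, and the variance is the population variance (divide by n), which is what LayerNorm's $\sigma^2$ is taken to be here:

```python
import math

def mean(values):
    return sum(values) / len(values)

def rms(values):
    # Root-mean-square: sqrt(mean(x^2)).
    return math.sqrt(mean([v * v for v in values]))

def variance(values):
    # Population variance (divide by n), assumed to match LayerNorm's sigma^2.
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

x = [0.5, -1.2, 0.3, 2.0, -0.9, 0.1]  # arbitrary example activations

lhs = rms(x) ** 2                  # mean(x^2)
rhs = variance(x) + mean(x) ** 2   # var(x) + mean(x)^2
print(f"mean(x^2)          = {lhs:.6f}")
print(f"var(x) + mean(x)^2 = {rhs:.6f}")
print(f"extra term mean^2  = {mean(x) ** 2:.6f}")
```

The first two printed values agree to floating-point precision; the third is the only quantity separating the two normalizers' denominators.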

One learned vector instead of two, no mean to compute, no shift to add. It trains as well as LayerNorm and costs less. Every major open-weight LLM since Llama uses it.

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, *, eps: float = 1e-6) -> None:
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # scale, no shift
        self.eps = eps

    def forward(self, x: Tensor) -> Tensor:
        mean_sq = x.pow(2).mean(dim=-1, keepdim=True)  # (..., 1)
        rms_inv = torch.rsqrt(mean_sq + self.eps)      # eps INSIDE the sqrt
        return self.gamma * (x * rms_inv)              # (..., d_model)

Key Insight: The epsilon goes inside the square root: rsqrt(mean_sq + eps), not x / (rms + eps). This matches Llama, HuggingFace, and F.rms_norm. The placement looks trivial, but it changes the numerics, most visibly when activations are small relative to eps. Get it wrong and an external checkpoint still loads without error, but its activations no longer match the reference implementation. A from-scratch implementation that can’t reproduce Llama’s outputs from Llama’s weights isn’t a from-scratch Llama.
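
To see how far the two placements can drift apart, here is a small sketch in plain Python. The function names and the test vectors are hypothetical, chosen only for illustration, and the learned scale is left out (equivalent to $\gamma = 1$):

```python
import math

def rms_norm_eps_inside(x, eps=1e-6):
    # x / sqrt(mean(x^2) + eps): epsilon inside the square root.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in x]

def rms_norm_eps_outside(x, eps=1e-6):
    # x / (sqrt(mean(x^2)) + eps): epsilon added after the square root.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / (math.sqrt(mean_sq) + eps)
    return [v * scale for v in x]

def max_gap(magnitude):
    # Largest element-wise difference between the two conventions.
    x = [magnitude, -magnitude, 2 * magnitude, -2 * magnitude]
    inside = rms_norm_eps_inside(x)
    outside = rms_norm_eps_outside(x)
    return max(abs(a - b) for a, b in zip(inside, outside))

for magnitude in (1.0, 1e-3, 1e-6):
    print(f"|x| ~ {magnitude:g}: max difference = {max_gap(magnitude):.3e}")
```

For inputs of order one the two conventions are nearly indistinguishable; the gap grows as the activations shrink toward the scale of eps.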

Why placement matters: pre-norm vs post-norm

Each sublayer (attention, then the FFN) is wrapped in a normalization and a residual connection. The single most consequential decision in the whole block is the order of those two.

Post-norm (classical) normalizes after the residual add:

$$x \leftarrow \text{Norm}\big(x + \text{Sublayer}(x)\big)$$

Pre-norm (modern) normalizes before the sublayer, inside the residual:

$$x \leftarrow x + \text{Sublayer}\big(\text{Norm}(x)\big)$$

That difference looks cosmetic. It is not — and the next section, on residual connections, is where you see exactly why.

The module, with LayerNorm kept as the classical baseline, is at norm.py.


2. Residual Connections

A residual connection adds a sublayer’s input to its output:

$$\text{output} = x + \text{Sublayer}(x)$$

He et al. (2015) introduced this to train very deep networks, and the reason is the gradient highway. Consider the backward pass. The derivative of $x + \text{Sublayer}(x)$ with respect to $x$ is:

$$\frac{\partial}{\partial x}\big(x + \text{Sublayer}(x)\big) = 1 + \frac{\partial\, \text{Sublayer}(x)}{\partial x}$$

That $1$ is everything. It means the gradient can flow back through the residual path undiminished, no matter how small the sublayer’s own gradient is. Stack 50 layers without residuals and gradients get multiplied by 50 small numbers on the way back, so they vanish. With residuals, there’s always a path where the gradient is multiplied by 1, straight from the loss to any layer.
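
A scalar toy makes the contrast tangible. This is a sketch under a strong simplifying assumption: every sublayer is reduced to a single number, its local derivative, and the value 0.01 is made up for illustration:

```python
def chain_gradient(sublayer_grads, residual):
    # Multiply the per-layer derivatives d(out)/d(in) along the backward path.
    # With a residual, each layer contributes (1 + g); without one, just g.
    total = 1.0
    for g in sublayer_grads:
        total *= (1.0 + g) if residual else g
    return total

grads = [0.01] * 50  # hypothetical: 50 sublayers, each with a small derivative

plain = chain_gradient(grads, residual=False)
with_skip = chain_gradient(grads, residual=True)
print(f"no residuals:   {plain:.3e}")
print(f"with residuals: {with_skip:.3f}")
```

Without residuals the product collapses toward zero; with them it stays of order one, because each factor carries the identity term.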

Now return to pre-norm vs post-norm. The residual highway only works if it stays an identity path. Look at where the norm sits:

  • Post-norm: $x \leftarrow \text{Norm}(x + \text{Sublayer}(x))$. The norm wraps the entire sum, including the residual. The highway is interrupted at every layer: the gradient has to pass through the normalization’s Jacobian on the way back. Deep post-norm transformers are touchy: they need learning-rate warmup and grow less stable with depth.
  • Pre-norm: $x \leftarrow x + \text{Sublayer}(\text{Norm}(x))$. The norm only conditions what goes into the sublayer. The residual $x$ flows from block input to block output untouched. The highway is unbroken.

Make that concrete by differentiating one block as a map from its input to its output. Pre-norm gives

$$\frac{\partial}{\partial x}\Big(x + F(\text{Norm}(x))\Big) = I + \frac{\partial F(\text{Norm}(x))}{\partial x}$$

so back-propagating through $L$ stacked blocks always retains the identity path: the gradient never has to survive a product of Jacobians to reach an early layer. Post-norm gives

$$\frac{\partial}{\partial x}\,\text{Norm}\big(x + F(x)\big) = J_\text{Norm}\cdot\Big(I + \frac{\partial F(x)}{\partial x}\Big)$$

where $J_\text{Norm}$ is the normalization’s Jacobian. Now every one of the $L$ layers multiplies the gradient by a $J_\text{Norm}$, and a product of $L$ such factors can shrink or distort the signal long before it reaches the bottom of the stack. That product is the instability, and it’s why post-norm needs warmup to get started and gets harder to train the deeper you go.
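
Unrolling the two per-block Jacobians over the whole stack spells out the contrast. This is a sketch under simplifying assumptions: $x_\ell$ denotes the input to block $\ell$ and $J_\ell$ the Jacobian of that block's sublayer path (both are notation introduced here, not the article's), each block is treated as a single sublayer, and the products are ordered with later blocks on the left.

```latex
% Pre-norm:  x_{\ell+1} = x_\ell + F_\ell(\text{Norm}(x_\ell)),
%            J_\ell = \partial F_\ell(\text{Norm}(x_\ell)) / \partial x_\ell
\frac{\partial x_L}{\partial x_0}
  = \prod_{\ell=0}^{L-1} \big( I + J_\ell \big)
  = I + \sum_{\ell=0}^{L-1} J_\ell
      + \big(\text{terms with two or more factors } J_\ell\big)

% Post-norm: x_{\ell+1} = \text{Norm}(x_\ell + F_\ell(x_\ell)),
%            J_\ell = \partial F_\ell(x_\ell) / \partial x_\ell
\frac{\partial x_L}{\partial x_0}
  = \prod_{\ell=0}^{L-1} J_{\text{Norm},\ell} \, \big( I + J_\ell \big)
```

The pre-norm expansion always contains a bare $I$; in the post-norm product, every term carries all $L$ normalization Jacobians.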

Key Insight: Pre-norm keeps an unbroken identity path from a block’s input to its output; the sublayers add corrections on top, and the norm never sits on the highway. That single structural property — more than the choice of RMSNorm over LayerNorm — is why modern transformers train stably at depth without warmup gymnastics. If you change only one thing from the 2017 design, change this.

The one cost of pre-norm: the residual stream itself is never normalized on the way through, so the model wants a single final norm after the last block. We add it when we assemble the full model.


3. The Feed-Forward Network

Attention mixes information across positions. The feed-forward network (FFN) transforms each position independently — same function applied to every token. It’s where most of a transformer’s parameters live.

Expansion → activation → compression

The classical FFN is two linear layers with a non-linearity between them. It expands the representation to a wider hidden dimension $d_\text{ff}$, applies a pointwise activation, and compresses back to $d_\text{model}$:

$$\text{FFN}(x) = \text{GELU}(x W_1)\, W_2$$

with $W_1 \in \mathbb{R}^{d_\text{model} \times d_\text{ff}}$ and $W_2 \in \mathbb{R}^{d_\text{ff} \times d_\text{model}}$, typically $d_\text{ff} = 4\, d_\text{model}$. The expansion is the point: the wider hidden space gives the non-linearity room to compute richer per-token features before projecting back down. GELU, a smooth, probabilistic relative of ReLU, became the standard activation in the GPT/BERT era.

class GeluFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, *, bias: bool = False) -> None:
        super().__init__()
        self.W_1 = nn.Linear(d_model, d_ff, bias=bias)   # expand
        self.W_2 = nn.Linear(d_ff, d_model, bias=bias)   # compress

    def forward(self, x: Tensor) -> Tensor:
        hidden = F.gelu(self.W_1(x))  # (B, T, d_ff)
        return self.W_2(hidden)       # (B, T, d_model)

SwiGLU: the modern variant

Modern models replace the single up-projection with a gated one. There are now two projections into the hidden dimension, an “up” projection producing values and a “gate” projection passed through SiLU, multiplied element-wise before the down-projection:

$$\text{SwiGLU}(x) = \big(\text{SiLU}(x W_\text{gate}) \odot (x W_\text{up})\big)\, W_\text{down}$$

where $\text{SiLU}(x) = x \cdot \sigma(x)$. The gate lets the network scale each hidden unit based on the input rather than passing it through a fixed curve. Shazeer (2020) showed this beats the vanilla FFN at matched parameter count.

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int, *, bias: bool = False) -> None:
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_ff, bias=bias)
        self.W_up = nn.Linear(d_model, d_ff, bias=bias)
        self.W_down = nn.Linear(d_ff, d_model, bias=bias)  # d_ff -> d_model

    def forward(self, x: Tensor) -> Tensor:
        gate = F.silu(self.W_gate(x))  # (B, T, d_ff)
        up = self.W_up(x)              # (B, T, d_ff)
        return self.W_down(gate * up)  # (B, T, d_model)

Watch the shapes: W_gate and W_up map d_model -> d_ff; W_down maps d_ff -> d_model. Getting that last one backwards — d_model -> d_ff — is the most common from-scratch FFN bug. The hidden tensor is (B, T, d_ff), so the down-projection’s input must be d_ff.

Key Insight: SwiGLU has three weight matrices where the vanilla FFN has two. To keep the parameter budget comparable, the hidden dimension is shrunk to roughly $\tfrac{2}{3}$ of the vanilla width: Llama uses $d_\text{ff} \approx \tfrac{8}{3}\,d_\text{model}$ instead of $4\,d_\text{model}$. The extra matrix buys expressivity; the shrink keeps the comparison honest.
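
The budget arithmetic can be sketched in a few lines of plain Python. The helper names are hypothetical, and rounding the hidden width up to a multiple of 256 is an assumption modeled on common practice in Llama-style code, not something this article's ffn.py is known to do:

```python
def swiglu_hidden_dim(d_model, multiple_of=256):
    # Take two thirds of the vanilla 4 * d_model width, then round up to a
    # hardware-friendly multiple (assumed convention, see the note above).
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def gelu_ffn_params(d_model, d_ff):
    return 2 * d_model * d_ff  # W_1 and W_2, no biases

def swiglu_params(d_model, d_ff):
    return 3 * d_model * d_ff  # W_gate, W_up, W_down, no biases

for d_model in (768, 4096):
    d_ff = swiglu_hidden_dim(d_model)
    vanilla = gelu_ffn_params(d_model, 4 * d_model)
    gated = swiglu_params(d_model, d_ff)
    print(f"d_model={d_model}: d_ff={d_ff}, vanilla={vanilla:,}, swiglu={gated:,}")
```

When $\tfrac{8}{3}\,d_\text{model}$ is already a multiple of 256 the two budgets match exactly; otherwise the rounding makes the gated variant slightly larger.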

Both FFNs share the constructor signature (d_model, d_ff, *, bias=False), so they drop into the block interchangeably. Module: ffn.py.


4. The Transformer Block

Now we tie the pieces together. A block holds two sublayers — attention and FFN — each with its own norm, threaded through residuals. There’s one loose thread to close first: RoPE.

In Part 2 we built RotaryPositionalEmbedding to rotate Q and K after the head split. But Part 1’s MultiHeadAttention splits heads internally, so there was nowhere to apply it. The fix is a small, additive hook on attention — an optional rope module applied right after the split and right before the KV-cache concatenation:

# Inside MultiHeadAttention.forward, after splitting heads:
if self.rope is not None:
    offset = kv_cache[0].size(2) if kv_cache is not None else 0
    Q, K = self.rope(Q, K, offset=offset)  # both (B, H, T, d_k)

The offset is the subtle part: with a KV-cache, new tokens start at position cache_length, not 0. The cached keys were already rotated at their own step; we rotate only the new ones, then concatenate. Values are never rotated — position lives in the Q·K score. Default rope=None, so attention is unchanged unless you opt in.
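
The offset logic can be checked without any tensors. Below is a minimal sketch in plain Python; it assumes an interleaved pairing of dimensions and a base of 10000, which may differ from the layout used by Part 2's RotaryPositionalEmbedding, and the key vectors are made up:

```python
import math

def rope_rotate(vec, position, base=10000.0):
    # Rotate consecutive pairs (vec[i], vec[i+1]) by angle position * theta,
    # with theta = base^(-i / d) for even i (interleaved pairing, assumed).
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out.append(vec[i] * c - vec[i + 1] * s)
        out.append(vec[i] * s + vec[i + 1] * c)
    return out

def rotate_sequence(vectors, offset=0):
    # Token j of this chunk sits at absolute position offset + j.
    return [rope_rotate(v, offset + j) for j, v in enumerate(vectors)]

keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.8, -1.0, 0.3], [0.7, -0.4, 0.1, 0.9]]

full = rotate_sequence(keys)                   # whole sequence in one pass
cached = rotate_sequence(keys[:2])             # prefill: positions 0 and 1
new_ok = rotate_sequence(keys[2:], offset=2)   # decode step, offset = cache length
new_bad = rotate_sequence(keys[2:], offset=0)  # offset forgotten

print(cached + new_ok == full)   # cached keys plus correctly offset new key
print(new_bad[0] == full[2])     # new key rotated as if it were at position 0
```

Only the run that passes the cache length as the offset reproduces the full-sequence rotation.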

With that in place, the block is straightforward:

class TransformerBlock(nn.Module):
    def __init__(
        self, d_model, n_heads, d_ff, *,
        pre_norm=True, norm_cls=RMSNorm, ffn_cls=SwiGLU, rope=None, bias=False,
    ) -> None:
        super().__init__()
        self.pre_norm = pre_norm
        self.attn = MultiHeadAttention(d_model, n_heads, bias=bias, rope=rope)
        self.ffn = ffn_cls(d_model, d_ff, bias=bias)
        self.norm1 = norm_cls(d_model)
        self.norm2 = norm_cls(d_model)

    def forward(self, x, mask=None, kv_cache=None):
        if self.pre_norm:
            normed = self.norm1(x)
            attn_out, _, new_kv = self.attn(normed, normed, normed, mask, kv_cache)
            x = x + attn_out                    # residual
            x = x + self.ffn(self.norm2(x))     # residual
        else:  # post-norm
            attn_out, _, new_kv = self.attn(x, x, x, mask, kv_cache)
            x = self.norm1(x + attn_out)
            x = self.norm2(x + self.ffn(x))
        return x, new_kv

The defaults give the modern block. Three arguments recover the 2017 design:

# Modern (default): pre-norm + RMSNorm + SwiGLU
block = TransformerBlock(d_model=768, n_heads=12, d_ff=2048, rope=rope)

# Classical: post-norm + LayerNorm + GELU
block = TransformerBlock(
    768, 12, 3072, pre_norm=False, norm_cls=LayerNorm, ffn_cls=GeluFFN,
)

That’s the payoff of keeping both: same class, same forward, and you A/B two architectures by swapping constructor arguments. Module: block.py.


5. The Full Decoder-Only Transformer

A decoder-only transformer (the GPT family) is the block, $N$ times, between an embedding and a head:

$$\text{ids} \rightarrow \text{embedding} \rightarrow [\text{block}] \times N \rightarrow \text{final norm} \rightarrow \text{LM head} \rightarrow \text{logits}$$

Three things deserve attention in the assembly.

No positional embedding at the input. Because RoPE lives inside attention, we do not add a sinusoidal or learned position vector to the token embeddings. Position enters later, per-head, as rotation. The input is just the token lookup.

One RoPE, shared. The rotation is identical at every depth, so a cos/sin cache per layer would be pure waste. We build a single RotaryPositionalEmbedding and inject the same instance into every block.

Weight tying. The language-model head projects the final hidden state back to vocabulary logits — a (d_model -> vocab) matrix, which is the transpose-shaped twin of the (vocab -> d_model) embedding. Tying them (sharing one weight) is standard since GPT-2: it saves a large matrix and tends to help, since a token’s input and output representations are related.

class Transformer(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, d_ff, *,
                 max_len=8192, tie_weights=True, bias=False):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.rope = RotaryPositionalEmbedding(d_model // n_heads, max_len=max_len)
        self.blocks = nn.ModuleList(
            TransformerBlock(d_model, n_heads, d_ff, rope=self.rope, bias=bias)
            for _ in range(n_layers)
        )
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie_weights:
            self.lm_head.weight = self.token_emb.weight  # share the matrix

    def forward(self, input_ids, kv_caches=None, mask=None):
        _, T = input_ids.shape
        x = self.token_emb(input_ids)                     # (B, T, d_model)
        if mask is None and T > 1:
            mask = causal_mask(T, device=input_ids.device)
        new_caches = []
        for i, block in enumerate(self.blocks):
            cache_i = kv_caches[i] if kv_caches is not None else None
            x, new_cache = block(x, mask=mask, kv_cache=cache_i)
            new_caches.append(new_cache)
        x = self.final_norm(x)                            # the pre-norm tax
        logits = self.lm_head(x)                          # (B, T, vocab_size)
        return logits, new_caches

The forward handles both regimes from one signature. With kv_caches=None it’s a plain full pass (training): it builds a $T \times T$ causal mask and runs the stack. With caches supplied it’s incremental decoding, which is what generate() uses next. Module: transformer.py.


6. The generate() Method

Training processes a whole sequence at once. Generation produces one token at a time, feeding each new token back in. Done naively, every step re-encodes the entire prefix, which adds up to $O(T^2)$ wasted projection work over a sequence. The KV-cache fixes this: store each layer’s keys and values, and at each step compute only the new token’s K and V, then append.
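
A back-of-the-envelope count shows the size of the saving. This sketch counts how many token positions pass through the K/V projections of a single layer; the function names and the prompt and generation lengths are made up for illustration:

```python
def projected_tokens_naive(prompt_len, new_tokens):
    # Every step re-runs the projections over the entire current prefix:
    # prompt_len, prompt_len + 1, ..., prompt_len + new_tokens - 1.
    return sum(prompt_len + step for step in range(new_tokens))

def projected_tokens_cached(prompt_len, new_tokens):
    # Prefill projects the prompt once; each later step projects one token.
    return prompt_len + (new_tokens - 1)

prompt_len, new_tokens = 16, 100  # hypothetical workload
naive = projected_tokens_naive(prompt_len, new_tokens)
cached = projected_tokens_cached(prompt_len, new_tokens)
print(f"naive:  {naive} token projections per layer")
print(f"cached: {cached} token projections per layer")
```

The naive count grows quadratically with the number of generated tokens, while the cached count grows linearly.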

The method has two phases — prefill the prompt in one pass, then decode one token at a time:

@torch.no_grad()
def generate(self, input_ids, max_new_tokens, *, temperature=1.0, top_k=None):
    self.eval()
    B = input_ids.size(0)
    d_k = self.d_model // self.n_heads
    # Start every block with an empty cache so blocks return grown ones.
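    # Allocate on the prompt's device, in the model's dtype, so the empty
    # cache can be concatenated with freshly computed K and V.
    empty = dict(device=input_ids.device, dtype=self.token_emb.weight.dtype)
    caches = [(torch.zeros(B, self.n_heads, 0, d_k, **empty),
               torch.zeros(B, self.n_heads, 0, d_k, **empty)) for _ in self.blocks]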

    logits, caches = self.forward(input_ids, kv_caches=caches)   # prefill
    generated = input_ids
    for _ in range(max_new_tokens):
        next_logits = logits[:, -1, :]                          # (B, vocab)
        next_token = self._sample(next_logits, temperature, top_k)
        generated = torch.cat([generated, next_token], dim=1)
        logits, caches = self.forward(next_token, kv_caches=caches)  # decode 1
    return generated

Sampling is where you control the output’s character:

  • Temperature scales the logits before softmax. Below 1 sharpens the distribution (more deterministic); above 1 flattens it (more diverse). At $0$ we take the argmax, which is pure greedy decoding.
  • Top-k restricts sampling to the $k$ most likely tokens, masking the rest to $-\infty$ before softmax. It prevents the long tail of low-probability tokens from occasionally derailing the output.

@staticmethod
def _sample(logits, temperature, top_k):
    if temperature <= 0.0:
        return logits.argmax(dim=-1, keepdim=True)        # greedy
    logits = logits / temperature
    if top_k is not None:
        kth = logits.topk(min(top_k, logits.size(-1)), dim=-1).values[:, -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)        # (B, 1)
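
To see what the two knobs do to a distribution, here is a small worked example in plain Python. It mirrors the temperature and top-k steps on a Python list rather than a tensor; the helper names and the four logits are made up:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    return [v / temperature for v in logits]

def apply_top_k(logits, k):
    # Keep the k largest logits; mask everything below the k-th to -inf.
    kth = sorted(logits, reverse=True)[min(k, len(logits)) - 1]
    return [v if v >= kth else float("-inf") for v in logits]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

sharp = softmax(apply_temperature(logits, 0.5))
flat = softmax(apply_temperature(logits, 2.0))
top2 = softmax(apply_top_k(logits, 2))

for name, probs in (("T=0.5", sharp), ("T=2.0", flat), ("top-2", top2)):
    print(f"{name}: " + ", ".join(f"{p:.3f}" for p in probs))
```

The low-temperature row concentrates mass on the top token, the high-temperature row spreads it out, and the top-2 row assigns zero probability to everything outside the two best candidates.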

The invariant that proves it correct

A KV-cache is easy to write and easy to get subtly wrong — especially the RoPE offset. Shape tests pass either way. There is exactly one test that pins down correctness, end to end:

Decoding token-by-token with a KV-cache must produce the same logits as a single full forward pass, to within floating-point tolerance.

If those agree, then the embedding, every block, every RoPE offset, the final norm, and the head all compose correctly under caching. A wrong offset breaks the agreement instantly — a key rotated for position 3 won’t match the same key rotated for position 5.

def test_incremental_decode_matches_full_forward():
    model.eval()
    ids = torch.randint(0, VOCAB, (1, T))
    full_logits, _ = model(ids)                      # ground truth

    caches = [(torch.zeros(1, H, 0, D_K), torch.zeros(1, H, 0, D_K))
              for _ in range(N_LAYERS)]
    step_logits = []
    for t in range(T):
        lg, caches = model(ids[:, t:t+1], kv_caches=caches)
        step_logits.append(lg)
    incremental = torch.cat(step_logits, dim=1)

    # assert_close requires rtol and atol to be given together.
    torch.testing.assert_close(full_logits, incremental, rtol=1e-5, atol=1e-5)

This is the load-bearing test of the whole article. Everything else supports it.


7. Parameter Counting

“How big is the model?” has an exact answer you can derive from the architecture. With bias=False everywhere, the per-component counts are clean:

| Component | Formula | Notes |
| --- | --- | --- |
| Token embedding | $V \cdot d$ | Shared with the LM head when tied |
| Attention (per block) | $4 d^2$ | $W^Q, W^K, W^V, W^O$, each $d \times d$ |
| SwiGLU FFN (per block) | $3 \, d \cdot d_\text{ff}$ | gate + up + down |
| RMSNorm (per block) | $2 d$ | two $\gamma$ vectors |
| Final norm | $d$ | one $\gamma$ |

Plug in a concrete GPT-2-small-scale config ($V = 50{,}257$, $d = 768$, $N = 12$ layers, $H = 12$ heads, $d_\text{ff} = 2048 = \tfrac{8}{3}d$) and the numbers come straight from model.num_parameters():

| Component | Count |
| --- | --- |
| Token embedding ($V \cdot d$) | 38,597,376 |
| Attention, per block ($4d^2$) | 2,359,296 |
| SwiGLU FFN, per block ($3 d\, d_\text{ff}$) | 4,718,592 |
| Norms, per block ($2d$) | 1,536 |
| One block | 7,079,424 |
| All 12 blocks | 84,953,088 |
| Final norm | 768 |
| Non-embedding total | 84,953,856 |
| Total (tied head) | 123,551,232 (≈123.6M) |
| Total (untied head) | 162,148,608 |

Two things jump out. First, the FFN is twice the size of attention in every block ($4.7\text{M}$ vs $2.4\text{M}$): the feed-forward network, not attention, is where the parameters concentrate. Second, weight tying saves 38.6M parameters, about 24% of the untied total, by sharing one matrix between the embedding and the head. The norms, by contrast, are a rounding error: 1,536 parameters per block against millions.

model = Transformer(vocab_size=50257, d_model=768, n_layers=12,
                    n_heads=12, d_ff=2048)
model.num_parameters()                      # 123,551,232
model.num_parameters(non_embedding=True)    #  84,953,856
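
The same totals can be reproduced from the closed-form formulas alone, with no model instantiated. This is a sketch in plain Python; `count_parameters` is a hypothetical helper, not part of the article's code, and it assumes bias=False everywhere:

```python
def count_parameters(vocab, d_model, n_layers, d_ff, tied=True):
    # Closed-form counts for the architecture described above.
    embedding = vocab * d_model
    attention = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 3 * d_model * d_ff                     # gate + up + down
    norms = 2 * d_model                          # two gamma vectors per block
    block = attention + ffn + norms
    non_embedding = n_layers * block + d_model   # all blocks plus the final norm
    head = 0 if tied else vocab * d_model        # a tied head adds nothing
    return {
        "block": block,
        "non_embedding": non_embedding,
        "total": embedding + non_embedding + head,
    }

tied = count_parameters(50257, 768, 12, 2048, tied=True)
untied = count_parameters(50257, 768, 12, 2048, tied=False)
print(f"one block:     {tied['block']:,}")
print(f"non-embedding: {tied['non_embedding']:,}")
print(f"total, tied:   {tied['total']:,}")
print(f"total, untied: {untied['total']:,}")
```

Changing any one argument shows immediately which component dominates at a given scale.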

8. Implementation

The model and its components are tested module by module.

| Module | Component | Role |
| --- | --- | --- |
| norm.py | RMSNorm, LayerNorm | Per-sublayer normalization |
| ffn.py | SwiGLU, GeluFFN | Position-wise transformation |
| attention.py | MultiHeadAttention + RoPE hook | Cross-position mixing |
| block.py | TransformerBlock | The repeating unit |
| transformer.py | Transformer | Embedding → stack → norm → head, plus generate() |

Test coverage

Correctness:

  • Forward output is (B, T, vocab_size); one cache slot per layer
  • Weight tying shares the embedding and head matrix; untied has more parameters
  • A single RoPE instance is shared across all blocks
  • num_parameters(non_embedding=True) matches the total minus the embedding

Generation:

  • generate() returns (B, T_prompt + max_new_tokens) and preserves the prompt
  • Greedy decoding (temperature=0) is deterministic

The invariant:

  • Incremental KV-cached decoding equals the full forward pass — end to end, through embedding, blocks, RoPE, final norm, and head

Training:

  • Gradients flow to the embedding, every block’s FFN, and the norms
  • Invalid configurations raise ValueError

9. Verify Your Understanding

Answer each in your own words before opening it. If you can, you can explain how a transformer is built from its parts — which is the real test.

1. Why does the epsilon go inside the square root in RMSNorm rather than added to the RMS afterward?

Numerically the two are different functions, and the inside convention — rsqrt(mean(x²) + eps) — is the one Llama, HuggingFace, and F.rms_norm use. If your implementation adds eps outside (x / (rms + eps)), the scaling differs slightly, so a model trained or saved with the standard convention won’t behave the same when its weights are loaded into yours. Checkpoint compatibility, not just stability, is at stake.

2. Pre-norm and post-norm contain the same operations. Why does pre-norm train deep stacks more stably?

Because of where the normalization sits relative to the residual. Post-norm wraps the whole sum, Norm(x + Sublayer(x)), so the normalization’s Jacobian sits on the residual path and the gradient must pass through it at every layer. Pre-norm, x + Sublayer(Norm(x)), leaves the residual an unbroken identity path — the +1 in the backward pass survives all the way down. That clean highway is why pre-norm scales to many layers without learning-rate warmup.

3. SwiGLU has three weight matrices to the vanilla FFN's two. Why is its hidden dimension usually shrunk to about 8/3 · d_model?

To keep the parameter count comparable. A vanilla FFN at $d_\text{ff} = 4d$ has $2 \cdot d \cdot 4d = 8d^2$ FFN parameters. SwiGLU with three matrices at $d_\text{ff} = \tfrac{8}{3}d$ has $3 \cdot d \cdot \tfrac{8}{3}d = 8d^2$, the same budget. The shrink is what makes “SwiGLU beats GELU at matched parameters” an apples-to-apples claim rather than just spending more weights.

4. During cached generation, what RoPE offset does the new token get at decode step t, and why aren't the cached keys re-rotated?

The offset is the current cache length — the number of tokens already processed. The new token sits at that absolute position, so it’s rotated accordingly. The cached keys were each rotated by their own absolute position when they were the current token at their own step, and RoPE rotation is fixed per position, so re-rotating them would be wrong. Rotate the new K only, then concatenate. (Values are never rotated — position lives in the Q·K score.)

5. With weight tying, how many parameters does the language-model head add to the model — and why is the incremental-equals-full test the one that matters?

Zero. The tied head is the embedding matrix, shared, so it contributes no new parameters (in the 123.6M example, tying saves the full 38.6M the untied head would cost). As for the test: shapes can be right while behavior is silently wrong, especially the RoPE offset under a cache. The incremental-equals-full invariant exercises the entire forward path under caching and fails the instant any position is rotated or masked incorrectly — so it’s the test that actually certifies the model generates correctly.


What’s Next

We have a model that runs a complete forward pass and generates text — but its weights are random. In Part 4: Training From Scratch, we build the training loop: the cross-entropy objective over next-token logits, AdamW with decoupled weight decay, a cosine learning-rate schedule with linear warmup, gradient clipping, and the loss curve that tells you whether any of this learned. From a model that runs to a model that’s trained.



Cite this reference

Sousa, V. (2026). Building a Transformer: The Complete Forward Pass. vitorsousa.com (Foundation Reference). https://www.vitorsousa.com/foundations//

@article{sousa2026,
  title={Building a Transformer: The Complete Forward Pass},
  author={Sousa, Vitor},
  year={2026},
  note={Foundation Reference},
  url={https://www.vitorsousa.com/foundations//}
}
