Last updated: Apr 2, 2026 · ~30 min read · Intermediate
Prerequisites: Attention Is All You Need to Implement

Positional Encoding: Teaching Transformers to Count

Part 2 of 4: Making Order Matter

TL;DR: Attention is permutation equivariant — shuffle the tokens and you get the same output (shuffled). Without position information, “the cat sat on the mat” and “mat the on sat cat the” are identical to the model. This article derives three main approaches to fixing this — sinusoidal encoding (fixed, infinite extrapolation in theory), learned embeddings (flexible, bounded), and Rotary Position Embeddings (RoPE — the modern winner, encoding relative position directly into Q and K via rotation matrices) — plus ALiBi as a comparison point. Full tested implementation at rlvr-from-scratch.

Prerequisites: Part 1: Attention Is All You Need to Implement covers scaled dot-product attention and multi-head attention.


The Position Problem

In Part 1 we built attention from scratch. It works — but it has a fundamental gap.

Consider two inputs:

Input A: ["The", "cat", "sat", "on", "the", "mat"]
Input B: ["mat", "the", "on", "sat", "cat", "The"]

Feed both through our multi-head attention module. The attention weights will differ (different tokens in different positions means different Q, K, V vectors). But here’s the problem: if you permute both the input and the output in the same way, you get the same result. Attention treats its input as a set, not a sequence.

Formally, for any permutation $\pi$:

$$\text{Attention}(\pi(X)) = \pi(\text{Attention}(X))$$

This is called permutation equivariance. It means attention has no concept of “first”, “second”, “last”. Token 0 and token 99 are processed identically — there’s nothing in the computation that distinguishes position.

For language, this is catastrophic. “The dog bit the man” and “The man bit the dog” have the same tokens. Without position, attention can’t tell them apart.

We need to inject position information. The question is how.
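The permutation property is easy to verify numerically. Below is a minimal sketch using bare single-head attention with identity projections (an assumption for brevity; real layers add learned W_Q, W_K, W_V, which does not change the argument):

```python
import torch

def attention(x):
    # scaled dot-product self-attention, identity projections, no position info
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ x

x = torch.randn(6, 8)                # 6 tokens, d_model = 8
perm = torch.randperm(6)             # shuffle the "sentence"

out_then_perm = attention(x)[perm]   # permute the output
perm_then_out = attention(x[perm])   # permute the input

# identical: attention treats the input as a set
print(torch.allclose(out_then_perm, perm_then_out, atol=1e-6))  # True
```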


Sinusoidal Positional Encoding

The Original Approach

Vaswani et al. (2017) proposed adding a fixed signal to each token’s embedding. The signal varies with position and dimension, using sine and cosine functions at different frequencies:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$

where $pos$ is the position in the sequence and $i$ is the dimension-pair index.

Why sin and cos? Why 10000? Why this specific formula?

The Frequency Intuition

Think of each dimension pair $(2i, 2i+1)$ as a clock running at a different speed.

  • Dimensions 0, 1: a fast clock — cycles every few positions
  • Dimensions $d-2$, $d-1$: a slow clock — cycles over tens of thousands of positions

The wavelength for dimension pair $i$ is:

$$\lambda_i = 2\pi \cdot 10000^{2i/d_\text{model}}$$

For $d_\text{model} = 512$:

| Dimension pair | Wavelength | What it captures |
|---|---|---|
| 0, 1 | $2\pi \approx 6.3$ | Very local position (every ~6 tokens) |
| 256, 257 | $\approx 630$ | Paragraph-level position |
| 510, 511 | $\approx 60{,}600$ | Document-level position |

The model gets a multi-scale position signal. Low dimensions distinguish nearby tokens. High dimensions distinguish distant tokens.
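The wavelengths follow directly from the formula above; a quick sketch to reproduce them:

```python
import math

d_model = 512
for dim in (0, 256, 510):                  # first dimension of each pair
    i = dim // 2                           # pair index
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"dims {dim},{dim + 1}: wavelength ~ {wavelength:,.1f}")
```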

Why Sin/Cos Pairs?

The key property: for any fixed offset $k$, $PE_{pos+k}$ can be written as a linear transformation of $PE_{pos}$:

$$\begin{bmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix} \begin{bmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{bmatrix}$$

where $\omega_i = 1 / 10000^{2i/d_\text{model}}$.

This is a rotation matrix. Moving from position $pos$ to position $pos + k$ is a rotation by angle $\omega_i k$ — the same rotation regardless of $pos$. This means the model can learn to attend to relative positions: “the token 3 positions back” is always the same transformation.
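A numeric check of this property (a sketch; the matrix depends only on the offset $k$, so one matrix works for every position):

```python
import math
import torch

d_model = 64
i = 4                                     # pick one dimension pair
omega = 10000.0 ** (-2 * i / d_model)     # its frequency

def pe_pair(pos):
    return torch.tensor([math.sin(omega * pos), math.cos(omega * pos)],
                        dtype=torch.float64)

k = 3  # fixed offset
R = torch.tensor([[math.cos(omega * k),  math.sin(omega * k)],
                  [-math.sin(omega * k), math.cos(omega * k)]],
                 dtype=torch.float64)

# the SAME matrix R maps position pos to pos + k, for any pos
for pos in (0, 7, 1000):
    assert torch.allclose(R @ pe_pair(pos), pe_pair(pos + k), atol=1e-9)
```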

The Implementation

import torch
import math

class SinusoidalPositionalEncoding(torch.nn.Module):
    """
    Fixed sinusoidal positional encoding from "Attention Is All You Need".
    
    No learnable parameters — the encoding is deterministic.
    
    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length to precompute.
    """
    
    def __init__(self, d_model: int, max_len: int = 8192):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for sin/cos pairs"
        
        # =========================================
        # Precompute encoding matrix: (max_len, d_model)
        # =========================================
        pe = torch.zeros(max_len, d_model)
        
        position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)
        
        # Frequency for each dimension pair
        # div_term = 1 / 10000^(2i / d_model) = exp(-2i * log(10000) / d_model)
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )  # (d_model / 2,)
        
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        
        # Register as buffer (not a parameter, but saved with model)
        # Shape: (1, max_len, d_model) for broadcasting over batch
        self.register_buffer("pe", pe.unsqueeze(0))
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, T, d_model)
        Returns:
            x + positional encoding: (B, T, d_model)
        """
        # =========================================
        # Add position signal to input embeddings
        # =========================================
        # self.pe: (1, max_len, d_model) -> slice to (1, T, d_model)
        return x + self.pe[:, :x.size(1), :]

Key implementation detail: We compute the division term in log-space (`exp(-2i * log(10000) / d_model)`) instead of computing `10000^(2i/d_model)` directly. The two are mathematically identical; the log-space form is the conventional, numerically safer formulation and matches reference implementations.

Strengths and Limitations

Strengths:

  • No parameters — nothing to learn, nothing to overfit
  • Extrapolation — in theory, can handle any sequence length (the frequencies are defined for all positions)
  • Relative position encoded — the linear transformation property

Limitations:

  • Extrapolation doesn’t actually work well — while the math supports it, models trained on length $T$ degrade at lengths $> T$ in practice
  • Fixed — can’t adapt to the task
  • Additive injection — position and content share the same representation space, which can create interference

Key Insight: Sinusoidal encoding turns position into a multi-frequency signal. Each dimension pair is a clock at a different speed. The sin/cos pairing enables relative position through rotation — a property that RoPE later exploits much more directly.


Learned Positional Embeddings

The Simplest Approach

Why compute a fixed formula when you can just learn the position representations?

$$PE_{pos} = E_\text{pos}[pos] \quad \text{where } E_\text{pos} \in \mathbb{R}^{T_\text{max} \times d_\text{model}}$$

An embedding table. Position 0 gets one learned vector, position 1 gets another, and so on. This is what GPT-2 and BERT use.

class LearnedPositionalEmbedding(torch.nn.Module):
    """
    Learned positional embedding — a lookup table.
    
    Args:
        max_len: Maximum sequence length.
        d_model: Model dimension.
    """
    
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.embedding = torch.nn.Embedding(max_len, d_model)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (B, T, d_model)
        Returns:
            x + position embeddings: (B, T, d_model)
        """
        T = x.size(1)
        # =========================================
        # Create position indices and look up embeddings
        # =========================================
        positions = torch.arange(T, device=x.device)  # (T,)
        pos_emb = self.embedding(positions)            # (T, d_model)
        return x + pos_emb                             # broadcast over batch

That’s it. An `nn.Embedding` layer with $T_\text{max} \times d_\text{model}$ learnable parameters.

When Learned Beats Sinusoidal

In practice, learned embeddings often perform slightly better than sinusoidal for fixed-length tasks — the model can learn position representations optimized for the actual data distribution.

But there’s a hard ceiling: the model has no embedding for positions beyond $T_\text{max}$. If you train with $T_\text{max} = 512$ and try to process a 1,024-token sequence, you crash. You’d need to either truncate or retrain.
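The hard ceiling is easy to demonstrate (a minimal sketch with assumed sizes $T_\text{max} = 512$, $d_\text{model} = 64$):

```python
import torch

emb = torch.nn.Embedding(512, 64)    # learned table: positions 0..511 only
ok = emb(torch.arange(512))          # in range: works, shape (512, 64)

failed = False
try:
    emb(torch.arange(1024))          # positions 512..1023 have no embedding row
except IndexError:
    failed = True
print(ok.shape, failed)
```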

Comparison

| Property | Sinusoidal | Learned |
|---|---|---|
| Parameters | 0 | $T_\text{max} \times d_\text{model}$ |
| Extrapolation | Theoretically yes, practically weak | No — hard crash beyond $T_\text{max}$ |
| Adaptability | None | Task-specific |
| Relative position | Encoded via rotation property | Not explicitly encoded |
| Used by | Original Transformer | GPT-2, BERT |

Both approaches share a fundamental limitation: they add position to the token embedding, mixing content and position in the same vector. What if we could encode position without this interference?

Key Insight: Learned embeddings trade generality for expressiveness. They work well within their trained range but cannot extrapolate. For modern LLMs that need long-context generalization, this is a dealbreaker.


Rotary Position Embeddings (RoPE)

The Modern Winner

RoPE (Su et al., 2021) is used by LLaMA, Qwen, Mistral, and most modern open-weight LLMs. Instead of adding position to the embedding, RoPE rotates the query and key vectors by a position-dependent angle. The attention score between two tokens then naturally depends on their relative position.

This is the key shift: position goes into Q and K, not into the embedding itself.

The Core Idea

For a 2D vector $[q_0, q_1]$ at position $m$, RoPE applies a rotation by angle $m\theta$:

$$\text{RoPE}(q, m) = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix} \begin{bmatrix} q_0 \\ q_1 \end{bmatrix}$$

When we compute the dot product of a rotated query at position $m$ with a rotated key at position $n$:

$$\text{RoPE}(q, m)^T \, \text{RoPE}(k, n) = q^T R(m)^T R(n) k = q^T R(n - m) k$$

The rotation matrices compose: $R(m)^T R(n) = R(n - m)$. The dot product depends only on the relative position $n - m$, not on the absolute positions.

This is why RoPE works: relative position emerges naturally from the algebra of rotations.
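The composition argument can be checked numerically in 2D (a sketch; the positions and angle are chosen arbitrarily):

```python
import math
import torch

def rot(theta):
    # 2x2 rotation matrix by angle theta
    return torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]],
                        dtype=torch.float64)

theta = 0.3
q = torch.randn(2, dtype=torch.float64)
k = torch.randn(2, dtype=torch.float64)

# same relative offset n - m = 4, different absolute positions
s1 = (rot(5 * theta) @ q) @ (rot(9 * theta) @ k)       # m=5,   n=9
s2 = (rot(100 * theta) @ q) @ (rot(104 * theta) @ k)   # m=100, n=104
assert torch.allclose(s1, s2)   # score unchanged: only n - m matters
```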

Extending to Higher Dimensions

For $d_\text{model} > 2$, RoPE applies independent rotations to pairs of dimensions, each at a different frequency:

$$\theta_i = 10000^{-2i/d_\text{model}} \quad \text{for } i = 0, 1, \ldots, d/2 - 1$$

The full rotation for position $m$ is block-diagonal:

$$R(m) = \begin{bmatrix} R_{\theta_0}(m) & & & \\ & R_{\theta_1}(m) & & \\ & & \ddots & \\ & & & R_{\theta_{d/2-1}}(m) \end{bmatrix}$$

where each $R_{\theta_i}(m)$ is a 2×2 rotation matrix with angle $m\theta_i$.

Notice the frequency formula — it’s the same as sinusoidal encoding. RoPE inherits the multi-scale property: low-frequency pairs capture long-range position, high-frequency pairs capture local position.

Efficient Implementation

We don’t actually construct the rotation matrix. Instead, we use the identity:

$$R_\theta(m) \begin{bmatrix} q_0 \\ q_1 \end{bmatrix} = \begin{bmatrix} q_0 \cos(m\theta) - q_1 \sin(m\theta) \\ q_0 \sin(m\theta) + q_1 \cos(m\theta) \end{bmatrix}$$

This is just element-wise multiply and a swap — no matrix construction needed.

class RotaryPositionalEmbedding(torch.nn.Module):
    """
    Rotary Position Embedding (RoPE).
    
    Applied to Q and K tensors, not to the input embedding.
    Position information enters through rotation, encoding
    relative position in the attention score.
    
    Args:
        d_model: Model dimension (must be even).
        max_len: Maximum sequence length.
        base: Base for frequency computation (default 10000).
    """
    
    def __init__(self, d_model: int, max_len: int = 8192, base: float = 10000.0):
        super().__init__()
        assert d_model % 2 == 0, "d_model must be even for RoPE pairs"
        
        # =========================================
        # Precompute frequencies: θ_i = base^(-2i/d)
        # =========================================
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        self.register_buffer("inv_freq", inv_freq)  # (d_model / 2,)
        
        # Precompute cos/sin for all positions
        self._build_cache(max_len)
    
    def _build_cache(self, max_len: int):
        """Precompute cos and sin values for positions 0..max_len-1."""
        positions = torch.arange(max_len).float()       # (max_len,)
        # Outer product: (max_len,) x (d/2,) -> (max_len, d/2)
        freqs = torch.outer(positions, self.inv_freq)
        # Duplicate for pairs: (max_len, d)
        freqs = torch.cat([freqs, freqs], dim=-1)
        
        self.register_buffer("cos_cached", freqs.cos().unsqueeze(0).unsqueeze(0))
        self.register_buffer("sin_cached", freqs.sin().unsqueeze(0).unsqueeze(0))
        # Shape: (1, 1, max_len, d_model) — broadcastable over B and H
    
    def forward(
        self, q: torch.Tensor, k: torch.Tensor, offset: int = 0
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Apply RoPE to query and key tensors.
        
        Args:
            q: (B, H, T, d_k)
            k: (B, H, T, d_k)
            offset: Position offset for KV-cache (default 0)
            
        Returns:
            q_rotated: (B, H, T, d_k)
            k_rotated: (B, H, T, d_k)
        """
        T = q.size(2)
        
        # =========================================
        # Slice precomputed cos/sin for current positions
        # =========================================
        cos = self.cos_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)
        sin = self.sin_cached[:, :, offset:offset + T, :]  # (1, 1, T, d_k)
        
        # =========================================
        # Apply rotation to Q and K
        # =========================================
        q_rotated = (q * cos) + (self._rotate_half(q) * sin)
        k_rotated = (k * cos) + (self._rotate_half(k) * sin)
        
        return q_rotated, k_rotated
    
    @staticmethod
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
        """
        Rearrange pairs: [x0, x1, x2, x3, ...] -> [-x_{d/2}, ..., x0, x1, ...]
        
        This implements the "swap and negate" part of the rotation.
        """
        d_half = x.shape[-1] // 2
        x1 = x[..., :d_half]
        x2 = x[..., d_half:]
        return torch.cat([-x2, x1], dim=-1)
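A standalone sanity check of the rotate-half formulation (a sketch independent of the class above, using the same half-split pairing; `rope` here is a hypothetical one-vector helper):

```python
import torch

def rotate_half(x):
    d = x.shape[-1] // 2
    return torch.cat([-x[..., d:], x[..., :d]], dim=-1)

def rope(x, pos, inv_freq):
    freqs = pos * inv_freq                        # (d/2,)
    cos = torch.cat([freqs.cos(), freqs.cos()])   # (d,)
    sin = torch.cat([freqs.sin(), freqs.sin()])
    return x * cos + rotate_half(x) * sin

d = 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d, 2).double() / d))
q = torch.randn(d, dtype=torch.float64)
k = torch.randn(d, dtype=torch.float64)

# relative-position property: (m=2, n=7) vs (m=52, n=57), both n - m = 5
s1 = rope(q, 2.0, inv_freq) @ rope(k, 7.0, inv_freq)
s2 = rope(q, 52.0, inv_freq) @ rope(k, 57.0, inv_freq)
assert torch.allclose(s1, s2)

# rotations preserve vector norm
assert torch.allclose(rope(q, 123.0, inv_freq).norm(), q.norm())
```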

How RoPE Integrates with Multi-Head Attention

RoPE modifies MultiHeadAttention.forward() — after splitting heads and before computing attention:

# Inside MultiHeadAttention.forward():
Q = self._split_heads(self.W_Q(query))   # (B, H, T_q, d_k)
K = self._split_heads(self.W_K(key))     # (B, H, T_k, d_k)
V = self._split_heads(self.W_V(value))   # (B, H, T_k, d_k)

# =========================================
# Apply RoPE to Q and K (not V!)
# =========================================
Q, K = self.rope(Q, K, offset=cache_len)

# Then proceed with attention as before
attn_output, weights = scaled_dot_product_attention(Q, K, V, mask)

Note: RoPE is applied to Q and K only — not to V. Position should affect which tokens attend to each other (the attention weights), but not what information they provide (the values).

RoPE and KV-Cache

RoPE integrates naturally with KV-cache. The offset parameter tells RoPE which absolute position the current tokens start at. During incremental decoding:

  • First pass (full sequence): offset=0, rotates all positions
  • Step $t$ (single token): offset=t, rotates by the correct position for the new token
  • Cached K values are already rotated — no re-rotation needed

This is another advantage over additive position encoding, where you’d need to carefully track which positions have already been encoded.
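The offset bookkeeping can be sketched standalone (a hypothetical `rope` helper mirroring the caching logic above, under the same half-split convention): rotating one new token with the right offset matches the full-sequence pass exactly.

```python
import torch

def rope(x, positions, inv_freq):
    # x: (T, d), positions: (T,)
    freqs = positions[:, None] * inv_freq[None, :]         # (T, d/2)
    cos = torch.cat([freqs.cos(), freqs.cos()], dim=-1)    # (T, d)
    sin = torch.cat([freqs.sin(), freqs.sin()], dim=-1)
    half = x.shape[-1] // 2
    rotated = torch.cat([-x[..., half:], x[..., :half]], dim=-1)
    return x * cos + rotated * sin

d = 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, d, 2).double() / d))
x = torch.randn(5, d, dtype=torch.float64)

# full prefill pass (offset 0) vs. decoding token 4 alone with offset=4
full = rope(x, torch.arange(5, dtype=torch.float64), inv_freq)
step = rope(x[4:5], torch.tensor([4.0], dtype=torch.float64), inv_freq)
assert torch.allclose(full[4:5], step)   # identical rotation either way
```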

Key Insight: RoPE encodes relative position as a mathematical property of dot products — not as an additive signal that competes with content. The attention score $q_m^T k_n$ naturally depends on $m - n$ through the rotation algebra. This is why it generalizes better than additive approaches.


ALiBi: Attention with Linear Biases

An Even Simpler Alternative

ALiBi (Press et al., 2022) takes a radically different approach: don’t encode position in the embeddings at all. Instead, add a linear bias directly to the attention scores based on distance.

$$\text{scores}_{ij} = q_i^T k_j - m \cdot |i - j|$$

where $m$ is a head-specific slope. Closer tokens get a smaller penalty; distant tokens get a larger one.

Each attention head gets a different slope, geometrically spaced:

$$m_h = \frac{1}{2^{8h/H}} \quad \text{for head } h = 1, \ldots, H$$

With 8 heads: $m \in \{1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256\}$.
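The geometric slope schedule in one line (matching the formula above for $H = 8$):

```python
n_heads = 8
slopes = [1 / 2 ** (8 * h / n_heads) for h in range(1, n_heads + 1)]
print(slopes)  # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```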

Implementation

class ALiBi(torch.nn.Module):
    """
    Attention with Linear Biases.
    
    No position encoding in embeddings — position enters
    as a bias on attention scores.
    
    Args:
        n_heads: Number of attention heads.
        max_len: Maximum sequence length.
    """
    
    def __init__(self, n_heads: int, max_len: int = 8192):
        super().__init__()
        
        # =========================================
        # Compute head-specific slopes
        # =========================================
        slopes = torch.tensor([
            1.0 / (2 ** (8 * h / n_heads))
            for h in range(1, n_heads + 1)
        ])  # (H,)
        
        # =========================================
        # Precompute distance matrix
        # =========================================
        positions = torch.arange(max_len)
        # |i - j| for all position pairs
        distance = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs().float()
        # (max_len, max_len)
        
        # Bias: (H, max_len, max_len) — negative, penalizes distance
        bias = -slopes.view(-1, 1, 1) * distance.unsqueeze(0)
        
        self.register_buffer("bias", bias.unsqueeze(0))  # (1, H, max_len, max_len)
    
    def forward(self, T: int) -> torch.Tensor:
        """
        Returns ALiBi bias for sequence length T.
        
        Add this to attention scores before softmax.
        
        Returns:
            bias: (1, H, T, T)
        """
        return self.bias[:, :, :T, :T]

Usage is simple — add the bias to scores in the attention function:

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores + alibi.forward(T)  # add position bias
scores = scores + causal_mask        # add causal mask
weights = torch.softmax(scores, dim=-1)

Why ALiBi Works

1. Recency bias is a strong prior. In language, nearby tokens are usually more relevant. ALiBi encodes this directly.

2. Different heads, different horizons. Steep slopes ($m = 1/2$) create heads that focus on very local context. Gentle slopes ($m = 1/256$) create heads that can attend far back.

3. Extrapolation. Because the bias is a simple linear function of distance, it extends naturally to unseen lengths. This is ALiBi’s strongest selling point — it extrapolates better than sinusoidal or learned embeddings.

Limitations

  • No relative position in content — position only affects attention weights, not representations
  • Assumes recency — tasks where distant tokens are equally relevant (e.g., retrieval) may suffer
  • Not used in most modern LLMs — RoPE has largely won for open-weight models

Key Insight: ALiBi is the minimalist approach — position as a bias on attention scores, nothing more. It extrapolates well but sacrifices the richness of position information in representations. Think of it as a strong baseline that RoPE improves upon.


Comparison: Which to Use When

| Property | Sinusoidal | Learned | RoPE | ALiBi |
|---|---|---|---|---|
| Parameters | 0 | $T \times d$ | 0 | 0 |
| Where applied | Add to embeddings | Add to embeddings | Rotate Q, K | Bias on scores |
| Relative position | Via rotation property | Not explicit | Direct (dot product) | Direct (distance) |
| Extrapolation | Weak in practice | None | Good (with NTK-aware scaling) | Best out-of-box |
| Used by | Original Transformer | GPT-2, BERT | LLaMA, Qwen, Mistral | BLOOM, MPT |
| Content-position coupling | Additive (coupled) | Additive (coupled) | Multiplicative (decoupled) | Score-level only |

The Decision

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'12px', 'fontFamily':'monospace' } }}%%
graph TD
    Start{"<b>Building a<br/>transformer?</b>"}
    Start -->|"Modern LLM<br/>(decoder-only)"| RoPE["<b>Use RoPE</b><br/>━━━━━━━━<br/>Industry standard<br/>Relative position<br/>Good extrapolation"]
    Start -->|"Short fixed-length<br/>(BERT-style)"| Learned["<b>Learned Embeddings</b><br/>━━━━━━━━<br/>Simple, effective<br/>within trained range"]
    Start -->|"Need length<br/>extrapolation"| Q2{"<b>Rich position<br/>info needed?</b>"}
    Q2 -->|"Yes"| RoPE2["<b>RoPE + NTK scaling</b><br/>━━━━━━━━<br/>Best of both worlds"]
    Q2 -->|"No, just recency"| ALiBi2["<b>ALiBi</b><br/>━━━━━━━━<br/>Simplest extrapolation"]
    Start -->|"Educational /<br/>from scratch"| Sinusoidal["<b>Sinusoidal</b><br/>━━━━━━━━<br/>Understand the math<br/>then move to RoPE"]
    style Start fill:#334155,stroke:#64748b,color:#e5e7eb
    style RoPE fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px
    style RoPE2 fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px
    style Learned fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2px
    style ALiBi2 fill:#1e293b,stroke:#8b5cf6,color:#e9d5ff,stroke-width:2px
    style Sinusoidal fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style Q2 fill:#334155,stroke:#64748b,color:#e5e7eb
```

For rlvr-from-scratch, we implement all three (sinusoidal, learned, RoPE) and use RoPE as the default — matching modern LLM practice.


Implementation

The full tested implementation is at src/rlvr_from_scratch/model/positional.py.

Module Summary

| Component | Type | Parameters | How it works |
|---|---|---|---|
| SinusoidalPositionalEncoding | Additive | 0 | Precomputed sin/cos buffer added to embeddings |
| LearnedPositionalEmbedding | Additive | $T \times d$ | nn.Embedding lookup added to embeddings |
| RotaryPositionalEmbedding | Multiplicative | 0 | Rotation applied to Q, K after head split |

Test Coverage

Correctness:

  • Sinusoidal: different positions produce different encodings
  • Learned: output shape matches input, gradients flow
  • RoPE: relative position property — $q_m^T k_n$ depends only on $m - n$
  • RoPE: rotation preserves vector norm

Extrapolation:

  • Sinusoidal: produces valid (non-NaN) output beyond trained length
  • Learned: raises error beyond max_len
  • RoPE: produces valid output at arbitrary positions

Integration:

  • Each variant integrates with MultiHeadAttention without shape errors
  • KV-cache works correctly with RoPE offset parameter

Key Takeaways

The Core Problem

Attention is permutation equivariant. Without position information, “the dog bit the man” and “the man bit the dog” are indistinguishable.

Three Approaches

  • Sinusoidal: Fixed multi-frequency signal added to embeddings. Educational, historically important, but superseded.
  • Learned: An embedding table. Simple, effective within range, can’t extrapolate.
  • RoPE: Rotation applied to Q and K. Relative position emerges from dot-product algebra. The modern standard.

Why RoPE Won

  • Decoupled from content — position enters through rotation, not addition
  • Relative by construction — $q_m^T k_n = f(m - n)$, not $f(m, n)$
  • Compatible with KV-cache — rotated keys are cached as-is
  • Extrapolation — extends well with NTK-aware frequency scaling

What’s Next

We now have attention (Part 1) and position encoding (Part 2). In Part 3: Building a Transformer, I assemble the full transformer block: multi-head attention + feed-forward network + layer normalization + residual connections. The architecture goes from components to a complete, trainable model.


Further Reading

Original Papers:

  • Vaswani et al. (2017): Attention Is All You Need (sinusoidal encoding)
  • Su et al. (2021): RoFormer: Enhanced Transformer with Rotary Position Embedding (RoPE)
  • Press et al. (2022): Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)

Implementation:

  • rlvr-from-scratch: src/rlvr_from_scratch/model/positional.py

Cite this reference

Sousa, V. (2026). Positional Encoding: Teaching Transformers to Count. vitorsousa.com (Foundation Reference). https://www.vitorsousa.com/foundations//

@article{sousa2026,
  title={Positional Encoding: Teaching Transformers to Count},
  author={Sousa, Vitor},
  year={2026},
  note={Foundation Reference},
  url={https://www.vitorsousa.com/foundations//}
}
