Technical Digital Garden

Bits

A collection of atomic notes, code snippets, and technical 'cheats' I’ve gathered over the years. These are unpolished references intended for quick utility rather than narrative reading.

Quick-fire references

Scan the grid, filter by utility tags, and grab the snippet you need without diving into long-form posts.

ML Theory

Bias-Variance Decomposition


\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}

  • Bias: model too simple to capture f
  • Variance: model too sensitive to training sample
  • Noise: irreducible — the floor

Why it matters: Every regularization choice (dropout, weight decay, early stopping) trades bias for variance. Knowing which side you’re on tells you which knob to turn.
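
A quick Monte Carlo sketch of the decomposition on a toy problem; the ground truth, noise level, and deliberately-too-simple model (true_f, a degree-1 polyfit) are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)            # known ground truth for the toy setup
sigma = 0.3                             # irreducible noise std
x0 = 1.0                                # evaluation point

preds = []
for _ in range(500):                    # many independent training sets
    x = rng.uniform(0, 2 * np.pi, 20)
    y = true_f(x) + rng.normal(0, sigma, 20)
    coeffs = np.polyfit(x, y, deg=1)    # deliberately simple model -> high bias
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()

# Empirical generalization error at x0 vs the decomposition
errs = [(true_f(x0) + rng.normal(0, sigma) - p) ** 2 for p in preds]
print(np.mean(errs), bias_sq + variance + sigma ** 2)   # the two numbers roughly agree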

Algorithms

Binary Search (Safe Midpoint)

def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2   # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target:  lo = mid + 1
        else:                  hi = mid - 1
    return -1

Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
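
A sketch of the same template bent into bisect_left semantics (first index with arr[i] >= target); hi becomes exclusive and the else branch keeps mid instead of skipping past it:

def bisect_left_like(arr, target):
    lo, hi = 0, len(arr)            # hi is exclusive here
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid                # keep mid: it might be the answer
    return lo                       # first index with arr[lo] >= target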

Deep Learning

Causal Mask as an Additive Tensor

import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)

Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
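
A minimal sketch of that composition, reusing causal_mask from above and assuming a boolean pad_mask of shape (B, T) with True marking padding tokens:

def combined_mask(pad_mask, device):
    # pad_mask: (B, T) bool, True where the token is padding
    B, T = pad_mask.shape
    causal = causal_mask(T, device)                       # (T, T)
    pad = torch.zeros(B, 1, 1, T, device=device)
    pad = pad.masked_fill(pad_mask[:, None, None, :], float("-inf"))
    return causal + pad                                   # broadcasts to (B, 1, T, T)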

Math

Chain Rule for Vectors


y = f(g(x)), \quad \frac{\partial y}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}

Shapes: (k × m) · (m × n) = (k × n).

Why it matters: Backprop is this, applied right-to-left, never materializing full Jacobians — autograd stores each op’s vector-Jacobian product instead.
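
A tiny PyTorch illustration of the vector-Jacobian-product view: the 5×5 Jacobian of the elementwise square is never materialized, autograd just pushes the upstream vector through it.

import torch

x = torch.randn(5, requires_grad=True)
g = x ** 2          # intermediate g(x), elementwise
y = g.sum()         # scalar output f(g)

# backward() pushes dy/dg (a vector of ones here) through each op's VJP;
# the full Jacobian of the elementwise square is never built.
y.backward()
print(x.grad)       # equals 2 * x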

Deep Learning

Why `.contiguous()` After `transpose()`

# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)

# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)

Why it matters: transpose returns a view with permuted strides. view requires contiguous memory. .contiguous() is the copy. You hit this exactly once, then never forget.

Math

Eigendecomposition: What It Buys You


A = Q \Lambda Q^{-1} \quad\Rightarrow\quad A^t = Q \Lambda^t Q^{-1}

For x_{t+1} = A x_t, iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.

Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
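
A quick numpy check of both claims, on a small diagonalizable A chosen to have spectral radius below 1:

import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
eigvals, Q = np.linalg.eig(A)
rho = np.abs(eigvals).max()                          # spectral radius: here < 1, so iterates contract

t = 50
A_t = Q @ np.diag(eigvals ** t) @ np.linalg.inv(Q)   # one diagonal power, not 50 matmuls
x0 = np.array([1.0, 1.0])
print(rho, A_t @ x0)                                 # x_t has shrunk toward the zero vector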

Code

Einops for Tensor Reshaping

from einops import rearrange

# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)

# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")

Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.

Deep Learning

He Initialization

import torch.nn as nn

# For ReLU / GELU activations
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, 2 / fan_in), i.e. std = sqrt(2 / fan_in)

Why it matters: Variance-preserving across ReLU layers (ReLU zeroes half the activations, so the factor of 2 compensates). Xavier (variance ∝ 1/fan_in) is for tanh/sigmoid and lets activations and gradients shrink toward zero in deep ReLU nets.
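
A rough sanity check of the variance argument, assuming a plain stack of matmul + ReLU layers (no norm, no residuals):

import torch
import torch.nn as nn

h_he = h_xavier = torch.randn(1024, 512)
for _ in range(20):
    w_he = torch.empty(512, 512)
    nn.init.kaiming_normal_(w_he, mode="fan_in", nonlinearity="relu")
    w_xa = torch.empty(512, 512)
    nn.init.xavier_normal_(w_xa)
    h_he = torch.relu(h_he @ w_he.T)
    h_xavier = torch.relu(h_xavier @ w_xa.T)

print(h_he.std(), h_xavier.std())   # He holds its scale; Xavier collapses toward zero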

ML Theory

Jensen's Inequality


For convex f: \mathbb{E}[f(X)] \geq f(\mathbb{E}[X]). For concave f (e.g., log): \mathbb{E}[f(X)] \leq f(\mathbb{E}[X]).

Why it matters: The whole derivation of the ELBO / variational inference starts here. Also why \log \mathbb{E}[\cdot] and \mathbb{E}[\log \cdot] aren’t interchangeable — the gap is the KL divergence between the variational posterior and the true posterior.
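
The Jensen step that launches the derivation, written out (q is any variational distribution over the latent z):

\log p(x) = \log \mathbb{E}_{q(z)}\!\left[\tfrac{p(x,z)}{q(z)}\right] \;\geq\; \mathbb{E}_{q(z)}\!\left[\log \tfrac{p(x,z)}{q(z)}\right] = \text{ELBO}, \qquad \log p(x) - \text{ELBO} = \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)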

Deep Learning

KV-Cache in Five Lines

if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)

Why it matters: Autoregressive decoding is O(T²) without this. With it, per-step cost drops to O(1) projection + O(T) attention. Single biggest inference optimization.

ML Theory

Maximum Likelihood = Minimum Cross-Entropy


\arg\max_\theta \sum_i \log p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \;-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_i)

The RHS is cross-entropy between the empirical distribution and the model.

Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.
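
A quick numeric check of the identity in PyTorch, with random logits and integer targets:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)            # (batch, classes)
targets = torch.randint(0, 10, (8,))

ce = F.cross_entropy(logits, targets)
nll = -F.log_softmax(logits, dim=-1)[torch.arange(8), targets].mean()
print(torch.allclose(ce, nll))         # True: cross-entropy loss is the average negative log-likelihood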

Deep Learning

RoPE Applies at `d_k`, Not `d_model`

# WRONG — rotate before head split
x = apply_rope(x, freqs)            # (B, T, d_model)
Q, K, V = split_heads(project(x))   # breaks relative-position property

# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x))   # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)

Why it matters: RoPE encodes relative position through 2D rotations in the per-head subspace. Apply at d_model and heads mix rotations, losing the ⟨q_m, k_n⟩ = f(m−n) property. Every modern LLM rotates at d_k.

Deep Learning

Scaled Dot-Product Attention

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores + mask  # additive: 0.0 or -inf
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Why it matters: The √d_k scaling keeps dot products out of softmax’s saturated regions where gradients vanish. Forget it and a transformer stops training at d_k ≥ 64.

Tools

Seed Everything for Reproducibility

import random, numpy as np, torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Why it matters: Same seed + same code → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.

Deep Learning

Sinusoidal vs Learned vs Rotary

|                           | Sinusoidal      | Learned         | RoPE                      |
|---------------------------|-----------------|-----------------|---------------------------|
| Type                      | Fixed           | Parameter       | Fixed                     |
| Applied at                | Input embedding | Input embedding | Q and K inside attention  |
| Extrapolates past max_len | Yes             | No              | Yes                       |
| Encodes relative position | Weakly          | No              | Explicitly                |

Why it matters: LLaMA, Qwen, Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n — relative position becomes a first-class operation, not a learned approximation.
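
A tiny numeric check of that property in a single 2D subspace, which is the building block RoPE applies to each pair of dimensions:

import numpy as np

def rot(angle):
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)
theta = 0.1

# dot product after rotating q to position m and k to position n
score = lambda m, n: (rot(m * theta) @ q) @ (rot(n * theta) @ k)
print(np.isclose(score(3, 7), score(10, 14)))   # True: only n - m matters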

Math

Softmax + Temperature


\text{softmax}(x_i; \tau) = \frac{e^{x_i / \tau}}{\sum_j e^{x_j / \tau}}

  • τ → 0: argmax (sharp)
  • τ = 1: standard softmax
  • τ → ∞: uniform

Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
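
A minimal sampling sketch with temperature, assuming raw next-token logits (sample_with_temperature is an illustrative helper, not a library call):

import numpy as np

def sample_with_temperature(logits, tau=1.0, seed=None):
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, tau=0.1))   # almost always index 0
print(sample_with_temperature(logits, tau=10.0))  # close to uniform over indices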

Data

Standardize vs Normalize

import numpy as np

# Standardize: mean 0, std 1 — assumes Gaussian-ish
x = (x - x.mean()) / x.std()

# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())

# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))

Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.

Data

Stratified Split

from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.

Math

SVD in One Line


A = U \Sigma V^T

  • U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
  • Columns of V: input directions
  • Columns of U: output directions
  • σᵢ: stretch factor along each direction

Why it matters: Every matrix is a rotation, a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm — this is PCA, LoRA, and half of numerical linear algebra.
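
A short numpy sketch of the rank-k truncation and the Eckart–Young error identity:

import numpy as np

A = np.random.randn(100, 50)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]     # best rank-k approximation (Frobenius norm)
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt((S[k:] ** 2).sum()))      # identical: error = norm of the discarded singular values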

Algorithms

Two Pointers — The Canonical Pattern

def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target:  l += 1
        else:           r -= 1
    return []

Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.

Tools

uv — Fast Python Package Management

# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync  # reproducible install from uv.lock

Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.

Algorithms

Calculating Average Precision (AP) without Sklearn

import numpy as np

def calculate_ap(recalls, precisions):
    # Pad the endpoints, then enforce a monotonically non-increasing precision envelope (all-point interpolation)
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))

    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])

    # Area under the PR curve: sum (recall step) x (interpolated precision) wherever recall changes
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap

Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.

Data

Bessel's Correction in Variance Calculation

import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Population Variance (N)
pop_var = np.var(data)

# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)

Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.

Algorithms

Representative Centroid Selection for Long-Context RAG

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    # Instead of taking the top-K most similar, take K diverse cluster representatives
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Return the actual embeddings closest to each centroid (the centroids themselves are synthetic)
    closest_idx = pairwise_distances_argmin(kmeans.cluster_centers_, embeddings)
    return embeddings[closest_idx]

Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.

Deep Learning

The Log-Sum-Exp Trick for Softmax

import numpy as np

def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))

Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.

Code

Vectorized Covariance Matrix Calculation

import numpy as np

def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Using the dot product is significantly faster than np.cov for large matrices
    return (X_centered.T @ X_centered) / (n - 1)

Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.

Code

Python @dataclass


The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options (example after this list):

  • frozen=True - immutable instances
  • order=True - enables comparison operators
  • slots=True - use __slots__ for memory efficiency
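
A small sketch combining those options (Version is a made-up example class; slots=True needs Python 3.10+):

from dataclasses import dataclass

@dataclass(frozen=True, order=True, slots=True)
class Version:
    major: int
    minor: int

v1, v2 = Version(1, 2), Version(1, 10)
print(v1 < v2)      # True: order=True compares instances as field tuples
# v1.major = 2      # raises FrozenInstanceError because frozen=True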