Technical Digital Garden

Bits

A collection of atomic notes, code snippets, and technical 'cheats' I’ve gathered over the years. These are unpolished references intended for quick utility rather than narrative reading.

Quick-fire references

Scan the grid, filter by utility tags, and grab the snippet you need without diving into long-form posts.

ML Theory

Bias-Variance Decomposition


\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{bias}^2} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}

  • Bias: model too simple to capture f
  • Variance: model too sensitive to training sample
  • Noise: irreducible — the floor

Why it matters: Every regularization choice (dropout, weight decay, early stopping) trades bias for variance. Knowing which side you’re on tells you which knob to turn.
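
A quick Monte Carlo sketch of the decomposition on a toy problem; the ground truth, noise level, and deliberately-too-simple model (true_f, a degree-1 polyfit) are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(x)            # known ground truth for the toy setup
sigma = 0.3                             # irreducible noise std
x0 = 1.0                                # evaluation point

preds = []
for _ in range(500):                    # many independent training sets
    x = rng.uniform(0, 2 * np.pi, 20)
    y = true_f(x) + rng.normal(0, sigma, 20)
    coeffs = np.polyfit(x, y, deg=1)    # deliberately simple model -> high bias
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2
variance = preds.var()

# Empirical generalization error at x0 vs the decomposition
errs = [(true_f(x0) + rng.normal(0, sigma) - p) ** 2 for p in preds]
print(np.mean(errs), bias_sq + variance + sigma ** 2)   # the two numbers roughly agree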

Algorithms

Binary Search (Safe Midpoint)

def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2   # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target:  lo = mid + 1
        else:                  hi = mid - 1
    return -1

Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
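
A sketch of the same template bent into bisect_left semantics (first index with arr[i] >= target); hi becomes exclusive and the else branch keeps mid instead of skipping past it:

def bisect_left_like(arr, target):
    lo, hi = 0, len(arr)            # hi is exclusive here
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid                # keep mid: it might be the answer
    return lo                       # first index with arr[lo] >= target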

Deep Learning

Causal Mask as an Additive Tensor

import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)

Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
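
A minimal sketch of that composition, reusing causal_mask from above and assuming a boolean pad_mask of shape (B, T) with True marking padding tokens:

def combined_mask(pad_mask, device):
    # pad_mask: (B, T) bool, True where the token is padding
    B, T = pad_mask.shape
    causal = causal_mask(T, device)                       # (T, T)
    pad = torch.zeros(B, 1, 1, T, device=device)
    pad = pad.masked_fill(pad_mask[:, None, None, :], float("-inf"))
    return causal + pad                                   # broadcasts to (B, 1, T, T)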

Math

Chain Rule for Vectors


y = f(g(x)), \quad \frac{\partial y}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}

Shapes: (k × m) · (m × n) = (k × n).

Why it matters: Backprop is this, applied right-to-left, never materializing full Jacobians — autograd stores each op’s vector-Jacobian product instead.
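
A tiny PyTorch illustration of the vector-Jacobian-product view: the 5×5 Jacobian of the elementwise square is never materialized, autograd just pushes the upstream vector through it.

import torch

x = torch.randn(5, requires_grad=True)
g = x ** 2          # intermediate g(x), elementwise
y = g.sum()         # scalar output f(g)

# backward() pushes dy/dg (a vector of ones here) through each op's VJP;
# the full Jacobian of the elementwise square is never built.
y.backward()
print(x.grad)       # equals 2 * x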

Deep Learning

Why `.contiguous()` After `transpose()`

# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)

# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)

Why it matters: transpose returns a view with permuted strides. view requires contiguous memory. .contiguous() is the copy. You hit this exactly once, then never forget.

Math

Eigendecomposition: What It Buys You


A = Q \Lambda Q^{-1} \quad\Rightarrow\quad A^t = Q \Lambda^t Q^{-1}

For x_{t+1} = A x_t, iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.

Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
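
A quick numpy check of both claims, on a small diagonalizable A chosen to have spectral radius below 1:

import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])
eigvals, Q = np.linalg.eig(A)
rho = np.abs(eigvals).max()                          # spectral radius: here < 1, so iterates contract

t = 50
A_t = Q @ np.diag(eigvals ** t) @ np.linalg.inv(Q)   # one diagonal power, not 50 matmuls
x0 = np.array([1.0, 1.0])
print(rho, A_t @ x0)                                 # x_t has shrunk toward the zero vector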

Code

Einops for Tensor Reshaping

from einops import rearrange

# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)

# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")

Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.

Deep Learning

He Initialization

import torch.nn as nn

# For ReLU / GELU activations
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, 2 / fan_in), i.e. std = sqrt(2 / fan_in)

Why it matters: Variance-preserving across ReLU layers (ReLU zeroes half the activations, so the factor of 2 compensates). Xavier (variance ∝ 1/fan_in) is for tanh/sigmoid and lets activations and gradients shrink toward zero in deep ReLU nets.
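
A rough sanity check of the variance argument, assuming a plain stack of matmul + ReLU layers (no norm, no residuals):

import torch
import torch.nn as nn

h_he = h_xavier = torch.randn(1024, 512)
for _ in range(20):
    w_he = torch.empty(512, 512)
    nn.init.kaiming_normal_(w_he, mode="fan_in", nonlinearity="relu")
    w_xa = torch.empty(512, 512)
    nn.init.xavier_normal_(w_xa)
    h_he = torch.relu(h_he @ w_he.T)
    h_xavier = torch.relu(h_xavier @ w_xa.T)

print(h_he.std(), h_xavier.std())   # He holds its scale; Xavier collapses toward zero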

ML Theory

Jensen's Inequality


For convex f: \mathbb{E}[f(X)] \geq f(\mathbb{E}[X]). For concave f (e.g., log): \mathbb{E}[f(X)] \leq f(\mathbb{E}[X]).

Why it matters: The whole derivation of the ELBO / variational inference starts here. Also why \log \mathbb{E}[\cdot] and \mathbb{E}[\log \cdot] aren’t interchangeable — the gap is the KL divergence between the variational posterior and the true posterior.
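
The Jensen step that launches the derivation, written out (q is any variational distribution over the latent z):

\log p(x) = \log \mathbb{E}_{q(z)}\!\left[\tfrac{p(x,z)}{q(z)}\right] \;\geq\; \mathbb{E}_{q(z)}\!\left[\log \tfrac{p(x,z)}{q(z)}\right] = \text{ELBO}, \qquad \log p(x) - \text{ELBO} = \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)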

Deep Learning

KV-Cache in Five Lines

if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)

Why it matters: Autoregressive decoding is O(T²) without this. With it, per-step cost drops to O(1) projection + O(T) attention. Single biggest inference optimization.

ML Theory

Maximum Likelihood = Minimum Cross-Entropy


\arg\max_\theta \sum_i \log p_\theta(y_i \mid x_i) \;=\; \arg\min_\theta \;-\frac{1}{N}\sum_i \log p_\theta(y_i \mid x_i)

The RHS is cross-entropy between the empirical distribution and the model.

Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.
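
A quick numeric check of the identity in PyTorch, with random logits and integer targets:

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)            # (batch, classes)
targets = torch.randint(0, 10, (8,))

ce = F.cross_entropy(logits, targets)
nll = -F.log_softmax(logits, dim=-1)[torch.arange(8), targets].mean()
print(torch.allclose(ce, nll))         # True: cross-entropy loss is the average negative log-likelihood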

Deep Learning

RoPE Applies at `d_k`, Not `d_model`

# WRONG — rotate before head split
x = apply_rope(x, freqs)            # (B, T, d_model)
Q, K, V = split_heads(project(x))   # breaks relative-position property

# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x))   # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)

Why it matters: RoPE encodes relative position through 2D rotations in the per-head subspace. Apply at d_model and heads mix rotations, losing the ⟨q_m, k_n⟩ = f(m−n) property. Every modern LLM rotates at d_k.

Deep Learning

Scaled Dot-Product Attention

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores + mask  # additive: 0.0 or -inf
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Why it matters: The √d_k scaling keeps dot products out of softmax’s saturated regions where gradients vanish. Forget it and a transformer stops training at d_k ≥ 64.

Tools

Seed Everything for Reproducibility

import random, numpy as np, torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

Why it matters: Same seed + same code → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.

Deep Learning

Sinusoidal vs Learned vs Rotary

|                           | Sinusoidal      | Learned         | RoPE                      |
|---------------------------|-----------------|-----------------|---------------------------|
| Type                      | Fixed           | Parameter       | Fixed                     |
| Applied at                | Input embedding | Input embedding | Q and K inside attention  |
| Extrapolates past max_len | Yes             | No              | Yes                       |
| Encodes relative position | Weakly          | No              | Explicitly                |

Why it matters: LLaMA, Qwen, Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n — relative position becomes a first-class operation, not a learned approximation.
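
A tiny numeric check of that property in a single 2D subspace, which is the building block RoPE applies to each pair of dimensions:

import numpy as np

def rot(angle):
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)
theta = 0.1

# dot product after rotating q to position m and k to position n
score = lambda m, n: (rot(m * theta) @ q) @ (rot(n * theta) @ k)
print(np.isclose(score(3, 7), score(10, 14)))   # True: only n - m matters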

Math

Softmax + Temperature


\text{softmax}(x_i; \tau) = \frac{e^{x_i / \tau}}{\sum_j e^{x_j / \tau}}

  • τ → 0: argmax (sharp)
  • τ = 1: standard softmax
  • τ → ∞: uniform

Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
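
A minimal sampling sketch with temperature, assuming raw next-token logits (sample_with_temperature is an illustrative helper, not a library call):

import numpy as np

def sample_with_temperature(logits, tau=1.0, seed=None):
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, tau=0.1))   # almost always index 0
print(sample_with_temperature(logits, tau=10.0))  # close to uniform over indices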

Data

Standardize vs Normalize

import numpy as np

# Standardize: mean 0, std 1 — assumes Gaussian-ish
x = (x - x.mean()) / x.std()

# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())

# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))

Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.

Data

Stratified Split

from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.

Math

SVD in One Line


A = U \Sigma V^T

  • U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
  • Columns of V: input directions
  • Columns of U: output directions
  • σᵢ: stretch factor along each direction

Why it matters: Every matrix is a rotation, a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm — this is PCA, LoRA, and half of numerical linear algebra.
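
A short numpy sketch of the rank-k truncation and the Eckart–Young error identity:

import numpy as np

A = np.random.randn(100, 50)
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]     # best rank-k approximation (Frobenius norm)
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt((S[k:] ** 2).sum()))      # identical: error = norm of the discarded singular values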

Algorithms

Two Pointers — The Canonical Pattern

def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target:  l += 1
        else:           r -= 1
    return []

Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.

Tools

uv — Fast Python Package Management

# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync  # reproducible install from uv.lock

Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.

Algorithms

Calculating Average Precision (AP) without Sklearn

import numpy as np

def calculate_ap(recalls, precisions):
    # Pad the endpoints, then enforce a monotonically non-increasing precision envelope (all-point interpolation)
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))

    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])

    # Area under the PR curve: sum (recall step) x (interpolated precision) wherever recall changes
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap

Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.

Data

Bessel's Correction in Variance Calculation

import numpy as np

data = [10, 12, 23, 23, 16, 23, 21, 16]

# Population Variance (N)
pop_var = np.var(data)

# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)

Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.

Algorithms

Representative Centroid Selection for Long-Context RAG

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    # Instead of taking the top-K most similar, take K diverse cluster representatives
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Return the actual embeddings closest to each centroid (the centroids themselves are synthetic)
    closest_idx = pairwise_distances_argmin(kmeans.cluster_centers_, embeddings)
    return embeddings[closest_idx]

Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.

Deep Learning

The Log-Sum-Exp Trick for Softmax

import numpy as np

def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))

Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.

Code

Vectorized Covariance Matrix Calculation

import numpy as np

def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Using the dot product is significantly faster than np.cov for large matrices
    return (X_centered.T @ X_centered) / (n - 1)

Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.

Code

Python @dataclass


The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options (example after this list):

  • frozen=True - immutable instances
  • order=True - enables comparison operators
  • slots=True - use __slots__ for memory efficiency
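
A small sketch combining those options (Version is a made-up example class; slots=True needs Python 3.10+):

from dataclasses import dataclass

@dataclass(frozen=True, order=True, slots=True)
class Version:
    major: int
    minor: int

v1, v2 = Version(1, 2), Version(1, 10)
print(v1 < v2)      # True: order=True compares instances as field tuples
# v1.major = 2      # raises FrozenInstanceError because frozen=True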