Technical Digital Garden
Bits
A collection of atomic notes, code snippets, and technical 'cheats' I’ve gathered over the years. These are unpolished references intended for quick utility rather than narrative reading.
Quick-fire references
Scan the grid, filter by utility tags, and grab the snippet you need without diving into long-form posts.
Bias-Variance Decomposition
- Bias: model too simple to capture f
- Variance: model too sensitive to training sample
- Noise: irreducible — the floor
Why it matters: Every regularization choice (dropout, weight decay, early stopping) trades bias for variance. Knowing which side you’re on tells you which knob to turn.
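A minimal simulation sketch of the decomposition, assuming only numpy and a made-up sin target: refit polynomials on many resampled training sets, then average bias² and variance over a test grid.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # true function
x_test = np.linspace(0, 1, 200)
sigma = 0.3                           # noise std; sigma**2 is the irreducible floor

for degree in (1, 9):                 # underfit vs overfit
    preds = []
    for _ in range(500):              # many independent training samples
        x_tr = rng.uniform(0, 1, 20)
        y_tr = f(x_tr) + rng.normal(0, sigma, 20)
        preds.append(np.polyval(np.polyfit(x_tr, y_tr, degree), x_test))
    preds = np.stack(preds)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree}  bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={sigma**2:.3f}")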
Binary Search (Safe Midpoint)
def bsearch(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = lo + (hi - lo) // 2  # avoids overflow
        if arr[mid] == target: return mid
        if arr[mid] < target: lo = mid + 1
        else: hi = mid - 1
    return -1
Why it matters: (lo + hi) // 2 overflows in languages with fixed-width ints. Python’s fine but interviewers in C++/Java land care. Same template extends to bisect_left / bisect_right.
Causal Mask as an Additive Tensor
import torch

def causal_mask(T, device):
    mask = torch.zeros(T, T, device=device)
    mask = mask.masked_fill(
        torch.triu(torch.ones(T, T, device=device), diagonal=1).bool(),
        float("-inf"),
    )
    return mask  # (T, T), broadcasts to (B, H, T, T)
Why it matters: Additive masks (0.0 / -inf) compose — sum causal + padding and pass one tensor. Multiplicative masks don’t compose cleanly.
Chain Rule for Vectors
For x ∈ ℝⁿ, y = g(x) ∈ ℝᵐ, z = f(y) ∈ ℝᵏ: ∂z/∂x = (∂z/∂y)(∂y/∂x). Shapes: (k × m) · (m × n) = (k × n).
Why it matters: Backprop is this, applied right-to-left, never materializing full Jacobians — autograd stores each op’s vector-Jacobian product instead.
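A small numpy shape check under assumed sizes n=4, m=3, k=2, with a hypothetical two-layer map z = B·tanh(A·x):
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 4, 3, 2
A = rng.normal(size=(m, n))               # y = A x
B = rng.normal(size=(k, m))               # z = B tanh(y)

x = rng.normal(size=n)
y = A @ x
dz_dy = B @ np.diag(1 - np.tanh(y) ** 2)  # (k, m) Jacobian of z w.r.t. y
dy_dx = A                                 # (m, n) Jacobian of y w.r.t. x
dz_dx = dz_dy @ dy_dx                     # (k, m) @ (m, n) = (k, n)
print(dz_dx.shape)                        # (2, 4)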
Why `.contiguous()` After `transpose()`
# RuntimeError: view size is not compatible with input tensor's...
x.transpose(1, 2).view(B, T, d_model)
# Fix: contiguous copy, then view
x.transpose(1, 2).contiguous().view(B, T, d_model)
Why it matters: transpose returns a view with permuted strides. view requires contiguous memory. .contiguous() is the copy. You hit this exactly once, then never forget.
Eigendecomposition: What It Buys You
For A = QΛQ⁻¹, Aᵗ = QΛᵗQ⁻¹: iterating t steps costs one exponentiation of a diagonal matrix instead of t matmuls.
Why it matters: Spectral radius ρ(A) = max|λᵢ| governs stability. |λ| < 1 contracts, |λ| > 1 explodes. This is why RNN gradients vanish or explode — it’s the same theorem.
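A quick numpy check (symmetric A assumed, so the eigendecomposition is real and Q is orthogonal):
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                            # symmetric: real eigvals, orthogonal Q
eigvals, Q = np.linalg.eigh(A)

t = 20
A_t_direct = np.linalg.matrix_power(A, t)    # t matmuls
A_t_eig = (Q * eigvals ** t) @ Q.T           # one diagonal power: Q diag(λᵗ) Qᵀ
print(np.allclose(A_t_direct, A_t_eig))      # True
print(np.abs(eigvals).max())                 # spectral radius ρ(A)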
Einops for Tensor Reshaping
from einops import rearrange
# split heads: (B, T, d_model) -> (B, H, T, d_k)
Q = rearrange(Q, "b t (h d) -> b h t d", h=n_heads)
# merge heads
out = rearrange(attn_out, "b h t d -> b t (h d)")
Why it matters: The operation is readable at the call site — no mental bookkeeping of .view().transpose().contiguous() chains. Dimensions are named. Shape bugs drop to near zero.
He Initialization
# For ReLU / GELU activations
nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
# Equivalent: w ~ N(0, 2 / fan_in), i.e., std = sqrt(2 / fan_in)
Why it matters: Variance-preserving across ReLU layers (ReLU kills half the activations, so ×2 to compensate). Xavier (variance 2 / (fan_in + fan_out)) is for tanh/sigmoid and will underflow gradients in deep ReLU nets.
Jensen's Inequality
For convex f: E[f(X)] ≥ f(E[X]). For concave f (e.g., log): E[f(X)] ≤ f(E[X]).
Why it matters: The whole derivation of the ELBO / variational inference starts here. Also why log p(x) and the ELBO aren’t interchangeable — they differ by the KL to the variational posterior.
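A one-minute numpy sanity check with a lognormal X (an illustrative choice, where log E[X] = σ²/2 and E[log X] = 0):
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

print(np.log(x.mean()))   # log E[X], ~0.5
print(np.log(x).mean())   # E[log X], ~0.0 (smaller, as Jensen predicts for concave log)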
KV-Cache in Five Lines
if kv_cache is not None:
    K_prev, V_prev = kv_cache
    K = torch.cat([K_prev, K], dim=2)  # (B, H, T_prev + T_k, d_k)
    V = torch.cat([V_prev, V], dim=2)
new_kv_cache = (K, V)
Why it matters: Autoregressive decoding is O(T²) without this. With it, per-step cost drops to O(1) projection + O(T) attention. Single biggest inference optimization.
Maximum Likelihood = Minimum Cross-Entropy
argmax_θ (1/N) Σᵢ log p_θ(xᵢ) = argmin_θ H(p̂, p_θ)
The RHS is the cross-entropy between the empirical distribution p̂ and the model p_θ.
Why it matters: “Why cross-entropy loss?” has one answer: it’s MLE for a categorical distribution. Same identity gives you MSE for Gaussians and binary cross-entropy for Bernoullis.
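A tiny numpy check with made-up counts and a hypothetical model q: the average per-sample NLL equals H(p̂, q).
import numpy as np

counts = np.array([5, 3, 2])                 # observed class counts
p_hat = counts / counts.sum()                # empirical distribution
q = np.array([0.6, 0.3, 0.1])                # model's predicted distribution

labels = np.repeat(np.arange(3), counts)     # the raw samples
avg_nll = -np.log(q[labels]).mean()          # MLE objective per sample
cross_entropy = -(p_hat * np.log(q)).sum()   # H(p_hat, q)
print(np.isclose(avg_nll, cross_entropy))    # True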
RoPE Applies at `d_k`, Not `d_model`
# WRONG — rotate before head split
x = apply_rope(x, freqs) # (B, T, d_model)
Q, K, V = split_heads(project(x)) # breaks relative-position property
# RIGHT — split heads first, rotate Q and K per-head
Q, K, V = split_heads(project(x)) # (B, H, T, d_k)
Q = apply_rope(Q, freqs)
K = apply_rope(K, freqs)
Why it matters: RoPE encodes relative position through 2D rotations in the per-head subspace. Apply at d_model and heads mix rotations, losing the ⟨q_m, k_n⟩ = f(m−n) property. Every modern LLM rotates at d_k.
Scaled Dot-Product Attention
import torch
import torch.nn.functional as F
def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (B, H, T, d_k)
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores + mask  # additive: 0.0 or -inf
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights
Why it matters: The √d_k scaling keeps dot products out of softmax’s saturated regions where gradients vanish. Forget it and a transformer stops training at d_k ≥ 64.
Seed Everything for Reproducibility
import random, numpy as np, torch
def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Why it matters: Same seed + same code → same loss curves. If this fails, you have a hidden source of nondeterminism (data loader workers, CUDA nondet ops, etc.). Non-negotiable for debugging training runs.
Sinusoidal vs Learned vs Rotary
| | Sinusoidal | Learned | RoPE |
|---|---|---|---|
| Type | Fixed | Parameter | Fixed |
| Applied at | Input embedding | Input embedding | Q and K inside attention |
| Extrapolates past max_len | Yes | No | Yes |
| Encodes relative position | Weakly | No | Explicitly |
Why it matters: LLaMA, Qwen, Mistral all use RoPE because ⟨RoPE(q, m), RoPE(k, n)⟩ depends only on m−n — relative position becomes a first-class operation, not a learned approximation.
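A minimal sketch of the fixed sinusoidal column, following the standard Transformer formulation (max_len and d_model are placeholder arguments, d_model assumed even):
import torch

def sinusoidal_pe(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)          # (max_len, 1)
    inv_freq = 10000.0 ** (-torch.arange(0, d_model, 2).float() / d_model)  # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)
    pe[:, 1::2] = torch.cos(pos * inv_freq)
    return pe  # fixed, no parameters; added to input embeddings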
Softmax + Temperature
softmax(z; τ)ᵢ = exp(zᵢ/τ) / Σⱼ exp(zⱼ/τ)
- τ → 0: argmax (sharp)
- τ = 1: standard softmax
- τ → ∞: uniform
Why it matters: Sampling temperature in LLMs is this τ. Also appears in knowledge distillation (soft targets) and contrastive learning (InfoNCE).
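A minimal numpy sketch with made-up logits:
import numpy as np

def softmax_t(logits, tau=1.0):
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_t(logits, tau=0.1))     # near one-hot (argmax)
print(softmax_t(logits, tau=1.0))     # standard softmax
print(softmax_t(logits, tau=10.0))    # near uniform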
Standardize vs Normalize
import numpy as np

# Standardize: mean 0, std 1 — assumes Gaussian-ish
x = (x - x.mean()) / x.std()
# Normalize: bound to [0, 1] — for bounded inputs / image pixels
x = (x - x.min()) / (x.max() - x.min())
# Robust: for outlier-heavy data
x = (x - np.median(x)) / (np.quantile(x, 0.75) - np.quantile(x, 0.25))
Why it matters: Wrong choice silently breaks training. Standardize for linear models and anything with L2 regularization. Normalize for fixed-range inputs. Robust when you can’t trust your tails.
Stratified Split
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
Why it matters: With class imbalance, a naive split can put all minority examples in train or test. Stratify preserves class ratios across splits — otherwise your metrics are noise.
SVD in One Line
A = U Σ Vᵀ
- U, V orthogonal; Σ diagonal with σ₁ ≥ σ₂ ≥ … ≥ 0
- Columns of V: input directions
- Columns of U: output directions
- σᵢ: stretch factor along each direction
Why it matters: Every matrix is a rotation, a scaling, and another rotation. Truncating at rank-k gives the best rank-k approximation in Frobenius norm — this is PCA, LoRA, and half of numerical linear algebra.
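A short numpy check of the rank-k claim (random matrix, illustrative sizes): the Frobenius error of the truncation equals the root-sum-square of the dropped singular values.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation (Eckart-Young)
print(np.linalg.norm(A - A_k))                # Frobenius error
print(np.sqrt((S[k:] ** 2).sum()))            # same number: sqrt of the dropped σᵢ²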
Two Pointers — The Canonical Pattern
def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target: return [l, r]
        if s < target: l += 1
        else: r -= 1
    return []
Why it matters: Sorted array + pair/triple problems → two pointers beats hashmap on space. Template reused in 3-sum, container-with-most-water, trapping rain water.
uv — Fast Python Package Management
# Replace pip + venv + pip-tools entirely
uv init my-project
uv add torch transformers
uv run python script.py
uv sync # reproducible install from uv.lock
Why it matters: 10–100× faster than pip. Lockfile built-in. Single binary, no virtualenv activation dance. This is what rlvr-from-scratch uses.
Calculating Average Precision (AP) without Sklearn
import numpy as np
def calculate_ap(recalls, precisions):
    # All-point interpolation: make precision monotonically decreasing from right to left
    m_rec = np.concatenate(([0.0], recalls, [1.0]))
    m_pre = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(m_pre) - 1, 0, -1):
        m_pre[i - 1] = np.maximum(m_pre[i - 1], m_pre[i])
    # Area under the PR curve, summed where recall changes (step integration)
    indices = np.where(m_rec[1:] != m_rec[:-1])[0]
    ap = np.sum((m_rec[indices + 1] - m_rec[indices]) * m_pre[indices + 1])
    return ap
Why it matters: Object detection and retrieval metrics break when you only eyeball curves. Manual AP keeps leaderboard numbers reproducible.
Bessel's Correction in Variance Calculation
import numpy as np
data = [10, 12, 23, 23, 16, 23, 21, 16]
# Population Variance (N)
pop_var = np.var(data)
# Sample Variance (N-1) - The "Unbiased" Estimator
sample_var = np.var(data, ddof=1)
Why it matters: For small samples, dividing by N underestimates the population variance. ddof=1 keeps statistical reporting honest.
Representative Centroid Selection for Long-Context RAG
from sklearn.cluster import KMeans
import numpy as np

def get_representative_embeddings(embeddings, k=5):
    # Instead of taking the top-K most similar, take K diverse cluster representatives
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10)
    kmeans.fit(embeddings)
    # Return the actual vectors closest to each centroid (the centroids themselves are synthetic means)
    dists = np.linalg.norm(embeddings[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=-1)
    return embeddings[dists.argmin(axis=0)]
Why it matters: Mitigates “lost in the middle” issues in RAG by feeding the model diverse context instead of redundant snippets.
The Log-Sum-Exp Trick for Softmax
import numpy as np
def log_sum_exp(x):
    # Subtracting the max prevents overflow when exponentiating large numbers
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

def stable_softmax(x):
    return np.exp(x - log_sum_exp(x))
Why it matters: The log-sum-exp pattern prevents NaN or Inf when logits are large, keeping gradients finite during backprop.
Vectorized Covariance Matrix Calculation
import numpy as np
def fast_covariance(X):
    # X is an (n_samples, n_features) matrix
    n = X.shape[0]
    X_centered = X - X.mean(axis=0)
    # Using the dot product is significantly faster than np.cov for large matrices
    return (X_centered.T @ X_centered) / (n - 1)
Why it matters: Center once, multiply once. Large feature banks compute faster when you skip Python loops and lean on vectorized math.
Python @dataclass
The @dataclass decorator auto-generates __init__, __repr__, and __eq__:
from dataclasses import dataclass
@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"
Useful options:
- frozen=True - immutable instances
- order=True - enables comparison operators
- slots=True - use __slots__ for memory efficiency
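A quick usage sketch of those options (hypothetical Version class; slots=True needs Python 3.10+):
from dataclasses import dataclass

@dataclass(frozen=True, order=True, slots=True)
class Version:
    major: int
    minor: int

v1, v2 = Version(1, 2), Version(1, 10)
print(v1 < v2)      # True: order=True compares field by field
# v1.minor = 3      # would raise FrozenInstanceError because frozen=True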