Quick Knowledge

Bits & Flashcards

Small, digestible pieces of knowledge - concepts, formulas, code patterns, and tools that I find useful and worth remembering.

Click any card to flip it and reveal the details. Think of it as my digital flashcard collection for ML, math, and engineering.

Knowledge Cards

Filter by category or search to find specific concepts, patterns, or tools.

Deep Learning

Attention Mechanism

Attention computes a weighted sum of values based on query-key similarity:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The scaling factor \sqrt{d_k} prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
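
As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking or batching); the shapes and random inputs are illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # similarity scores scaled by sqrt(d_k), then a row-wise softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the values

Q = K = V = np.random.randn(4, 8)   # 4 tokens, d_k = 8 (illustrative)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)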

Math

Softmax Function

Softmax converts a vector of real numbers into a probability distribution:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Properties:

  • Output sums to 1
  • All values are positive
  • Preserves relative ordering
  • Temperature parameter \tau controls sharpness: \frac{e^{x_i/\tau}}{\sum_j e^{x_j/\tau}}
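
A minimal NumPy sketch of softmax with the temperature parameter, using the standard max-subtraction trick for numerical stability (the input vector is illustrative):

import numpy as np

def softmax(x, tau=1.0):
    # subtracting the max doesn't change the result but avoids overflow
    z = np.exp((x - np.max(x)) / tau)
    return z / z.sum()

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))            # sums to 1, ordering preserved
print(softmax(x, tau=0.5))   # lower temperature -> sharper distribution
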
Code

Python @dataclass

The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options:

  • frozen=True - immutable instances
  • order=True - enables comparison operators
  • slots=True - use __slots__ for memory efficiency
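
A small sketch of those options in combination (the Version class is just an illustrative example; slots=True requires Python 3.10+):

from dataclasses import dataclass

@dataclass(frozen=True, order=True, slots=True)
class Version:
    major: int
    minor: int

v1, v2 = Version(1, 2), Version(1, 10)
print(v1 < v2)               # True - order=True generates comparison operators
print(v1 == Version(1, 2))   # True - __eq__ is auto-generated
# v1.major = 2  -> raises FrozenInstanceError because frozen=True
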
Deep Learning

Transformer Architecture

The Transformer consists of stacked encoder/decoder blocks:

Encoder block:

  1. Multi-head self-attention
  2. Add & Norm (residual connection)
  3. Feed-forward network
  4. Add & Norm

Key innovations:

  • Positional encoding (no recurrence)
  • Multi-head attention (parallel attention)
  • Layer normalization
  • Residual connections throughout
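
A compact sketch of one encoder block, assuming PyTorch and its built-in nn.MultiheadAttention; the dimensions are the base-model defaults from the original paper, and this post-norm variant is illustrative rather than a definitive implementation:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1-2: multi-head self-attention, then Add & Norm (residual)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # 3-4: feed-forward network, then Add & Norm
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 10, 512)      # (batch, sequence, d_model)
print(EncoderBlock()(x).shape)   # torch.Size([2, 10, 512])
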
ML Theory

Gradient Descent

Gradient descent updates parameters to minimize a loss function:

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)

Variants:

  • Batch GD: Uses all data (stable but slow)
  • SGD: Uses one sample (noisy but fast)
  • Mini-batch: Uses subset (balanced)

Learning rate \eta controls step size - too high causes divergence, too low causes slow convergence.
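
A NumPy sketch of the mini-batch variant on a least-squares problem; the synthetic data, batch size, and learning rate are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
eta, batch_size = 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ theta - yb)  # gradient of the MSE loss
    theta -= eta * grad                               # theta <- theta - eta * grad
print(theta)  # close to [1.5, -2.0, 0.5]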

Tools

Docker Basics

Essential Docker commands:

# Build image
docker build -t myapp .

# Run container
docker run -d -p 8080:80 myapp

# List running containers
docker ps

# Stop container
docker stop <container_id>

Dockerfile basics:

  • FROM - base image
  • COPY - add files
  • RUN - execute commands
  • CMD - default command
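
Putting those four directives together, a minimal Dockerfile sketch (the Python base image, app.py, and the installed package are illustrative assumptions):

# base image
FROM python:3.12-slim
# add files into the image
COPY app.py /app/app.py
# execute a command at build time
RUN pip install --no-cache-dir requests
# default command when the container starts
CMD ["python", "/app/app.py"]
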
Algorithms

Binary Search

Binary search finds an element in a sorted array in O(log n):

def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

Tip: Use left + (right - left) // 2 to avoid integer overflow in fixed-width integer languages (Python's integers are arbitrary precision, so the simpler form is safe here).
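
Quick usage check on an illustrative sorted list:

print(binary_search([1, 3, 5, 7, 9], 7))   # 3
print(binary_search([1, 3, 5, 7, 9], 4))   # -1 (not found)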

Data

Data Normalization

Common normalization techniques:

Min-Max Scaling (range [0,1]): x' = \frac{x - x_{min}}{x_{max} - x_{min}}

Z-Score Standardization (mean=0, std=1): x' = \frac{x - \mu}{\sigma}

When to use:

  • Min-Max: bounded features, neural networks
  • Z-Score: Gaussian-like data, SVMs, linear regression
  • Robust scaling: data with outliers
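
A NumPy sketch of the three scalers above; robust scaling here uses the median and interquartile range, which is one common convention, and the data is illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # small sample with an outlier

min_max = (x - x.min()) / (x.max() - x.min())   # maps to [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, std 1
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)                  # less sensitive to the outlier

print(min_max, z_score, robust, sep="\n")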