Quick Knowledge

Bits & Flashcards

Small, digestible pieces of knowledge - concepts, formulas, code patterns, and tools that I find useful and worth remembering.

Click any card to flip it and reveal the details. Think of it as my digital flashcard collection for ML, math, and engineering.

Knowledge Cards

Filter by category or search to find specific concepts, patterns, or tools.

Deep Learning

Attention Mechanism

Attention computes a weighted sum of values based on query-key similarity:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The scaling factor \sqrt{d_k} prevents the dot products from growing too large, which would push softmax into regions with extremely small gradients.
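
As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product attention (no masking or batching); the shapes and random inputs are illustrative.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # similarity scores scaled by sqrt(d_k), then a row-wise softmax
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted sum of the values

Q = K = V = np.random.randn(4, 8)   # 4 tokens, d_k = 8 (illustrative)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)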

Math

Softmax Function

Softmax converts a vector of real numbers into a probability distribution:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Properties:

  • Output sums to 1
  • All values are positive
  • Preserves relative ordering
  • Temperature parameter \tau controls sharpness: \frac{e^{x_i/\tau}}{\sum_j e^{x_j/\tau}}
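
A minimal NumPy sketch of softmax with the temperature parameter, using the standard max-subtraction trick for numerical stability (the input vector is illustrative):

import numpy as np

def softmax(x, tau=1.0):
    # subtracting the max doesn't change the result but avoids overflow
    z = np.exp((x - np.max(x)) / tau)
    return z / z.sum()

x = np.array([2.0, 1.0, 0.1])
print(softmax(x))            # sums to 1, ordering preserved
print(softmax(x, tau=0.5))   # lower temperature -> sharper distribution
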
Code

Python @dataclass

The @dataclass decorator auto-generates __init__, __repr__, and __eq__:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "origin"

Useful options:

  • frozen=True - immutable instances
  • order=True - enables comparison operators
  • slots=True - use __slots__ for memory efficiency
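
A small sketch of those options in combination (the Version class is just an illustrative example; slots=True requires Python 3.10+):

from dataclasses import dataclass

@dataclass(frozen=True, order=True, slots=True)
class Version:
    major: int
    minor: int

v1, v2 = Version(1, 2), Version(1, 10)
print(v1 < v2)               # True - order=True generates comparison operators
print(v1 == Version(1, 2))   # True - __eq__ is auto-generated
# v1.major = 2  -> raises FrozenInstanceError because frozen=True
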
Deep Learning

Transformer Architecture

The Transformer consists of stacked encoder/decoder blocks:

Encoder block:

  1. Multi-head self-attention
  2. Add & Norm (residual connection)
  3. Feed-forward network
  4. Add & Norm

Key innovations:

  • Positional encoding (no recurrence)
  • Multi-head attention (parallel attention)
  • Layer normalization
  • Residual connections throughout
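
A compact sketch of one encoder block, assuming PyTorch and its built-in nn.MultiheadAttention; the dimensions are the base-model defaults from the original paper, and this post-norm variant is illustrative rather than a definitive implementation:

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # 1-2: multi-head self-attention, then Add & Norm (residual)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # 3-4: feed-forward network, then Add & Norm
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 10, 512)      # (batch, sequence, d_model)
print(EncoderBlock()(x).shape)   # torch.Size([2, 10, 512])
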
ML Theory

Gradient Descent

Gradient descent updates parameters to minimize a loss function:

\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)

Variants:

  • Batch GD: Uses all data (stable but slow)
  • SGD: Uses one sample (noisy but fast)
  • Mini-batch: Uses subset (balanced)

Learning rate \eta controls step size - too high causes divergence, too low causes slow convergence.
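
A NumPy sketch of the mini-batch variant on a least-squares problem; the synthetic data, batch size, and learning rate are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
eta, batch_size = 0.1, 32
for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 / batch_size * Xb.T @ (Xb @ theta - yb)  # gradient of the MSE loss
    theta -= eta * grad                               # theta <- theta - eta * grad
print(theta)  # close to [1.5, -2.0, 0.5]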

Tools

Docker Basics

Essential Docker commands:

# Build image
docker build -t myapp .

# Run container
docker run -d -p 8080:80 myapp

# List running containers
docker ps

# Stop container
docker stop <container_id>

Dockerfile basics:

  • FROM - base image
  • COPY - add files
  • RUN - execute commands
  • CMD - default command
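
Putting those four directives together, a minimal Dockerfile sketch (the Python base image, app.py, and the installed package are illustrative assumptions):

# base image
FROM python:3.12-slim
# add files into the image
COPY app.py /app/app.py
# execute a command at build time
RUN pip install --no-cache-dir requests
# default command when the container starts
CMD ["python", "/app/app.py"]
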
Algorithms

Binary Search

Binary search finds an element in a sorted array in O(log n):

def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

Tip: Use left + (right - left) // 2 to avoid integer overflow in fixed-width integer languages (Python's integers are arbitrary precision, so the simpler form is safe here).
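
Quick usage check on an illustrative sorted list:

print(binary_search([1, 3, 5, 7, 9], 7))   # 3
print(binary_search([1, 3, 5, 7, 9], 4))   # -1 (not found)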

Data

Data Normalization

Common normalization techniques:

Min-Max Scaling (range [0,1]): x' = \frac{x - x_{min}}{x_{max} - x_{min}}

Z-Score Standardization (mean=0, std=1): x' = \frac{x - \mu}{\sigma}

When to use:

  • Min-Max: bounded features, neural networks
  • Z-Score: Gaussian-like data, SVMs, linear regression
  • Robust scaling: data with outliers
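
A NumPy sketch of the three scalers above; robust scaling here uses the median and interquartile range, which is one common convention, and the data is illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # small sample with an outlier

min_max = (x - x.min()) / (x.max() - x.min())   # maps to [0, 1]
z_score = (x - x.mean()) / x.std()              # mean 0, std 1
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)                  # less sensitive to the outlier

print(min_max, z_score, robust, sep="\n")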