~18 min By Vitor Sousa

Contextual Bandit Theory: Regret Bounds and Exploration

Part 2 of 5: Theory That Informs Practice

TL;DR: Contextual bandits minimize regret—cumulative opportunity cost versus the optimal policy. Optimal algorithms achieve O(√(dT log T)) regret for d-dimensional contexts. The exploration-exploitation tradeoff is fundamental: confidence bounds (UCB) and posterior sampling (Thompson) provide principled solutions. This math directly guides algorithm selection and hyperparameter tuning.

Reading time: ~18 minutes


Introduction: Why Math Matters for Practice

You don’t need to derive regret bounds to deploy bandits in production. But understanding the mathematics directly informs practical decisions:

  • Algorithm selection: Why LinUCB for linear rewards? Why does Thompson Sampling often win empirically?
  • Hyperparameter tuning: What does α control in LinUCB? How to set it?
  • Debugging: Why is my bandit not converging? (Check if regret is sublinear)
  • Feature engineering: Why does dimensionality hurt? (√d penalty in regret bounds)

This post builds intuition for the theory that makes contextual bandits work.


The Contextual Bandit Problem (Formal)

Setup

At each round t = 1, 2, …, T:

  1. Environment reveals context xₜ ∈ X (context space)
  2. Agent chooses action aₜ ∈ A from available action set
  3. Environment generates reward rₜ ~ P(r | xₜ, aₜ) from unknown distribution
  4. Agent observes reward rₜ for chosen action (but not rewards for unchosen actions)

Goal

Learn a policy π: X → A that maximizes expected cumulative reward:

V(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} r_t \mid \pi \right]

Key Insight: Partial Feedback

We only observe rewards for actions we take. This is the partial feedback problem—we never know what would have happened if we’d chosen differently (the counterfactual outcome).

Example:

Context: User is 25-year-old in San Francisco
Action chosen: Recommend Product A
Reward observed: Click (reward = 1)

Unknown counterfactuals:
- What if we'd shown Product B? Maybe reward = 0 (worse)
- What if we'd shown Product C? Maybe reward = 1 (equally good)

We'll never know unless we try those actions in similar contexts.

This partial observability is why exploration is necessary—we must try actions to learn their value.
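To make the data we actually get to learn from concrete, here is a minimal sketch of a single logged bandit interaction. The uniform-random policy, the simulated user response, and the field names are illustrative placeholders, not a prescribed schema; logging the propensity (the probability of the chosen action) is optional here but pays off later for offline evaluation.

```python
from dataclasses import dataclass
import random

@dataclass
class BanditLog:
    context: dict      # features observed before acting
    action: str        # the single action we chose
    reward: float      # reward for that action only
    propensity: float  # probability we chose it (useful for off-policy evaluation)

def simulate_user_response(context: dict, action: str) -> float:
    # Stand-in for the real environment; in production this is the user's actual reaction.
    return float(random.random() < 0.1)

def act_and_log(context: dict, actions: list) -> BanditLog:
    action = random.choice(actions)                    # placeholder uniform-random policy
    reward = simulate_user_response(context, action)
    # Note what is NOT here: rewards for the actions we did not take.
    return BanditLog(context, action, reward, propensity=1.0 / len(actions))

log = act_and_log({"age": 25, "city": "SF"}, ["product_a", "product_b", "product_c"])
print(log)
```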


Regret: The Fundamental Performance Metric

Since we don’t know the true reward distributions, we can’t compute V(π) directly. Instead, we measure regret—how much worse we do compared to the optimal policy.

Definition

Let π* be the optimal policy that always chooses the best action for each context. The regret after T rounds is:

R(T) = \sum_{t=1}^{T} r_t^* - \sum_{t=1}^{T} r_t, \qquad \mathbb{E}[R(T)] = V(\pi^*) - V(\pi)

where rₜ* is the reward we would have gotten from the optimal action at time t.

Interpretation

Regret measures cumulative opportunity cost. If R(T) = 100 after 1000 rounds, we earned 100 less total reward than optimal.

Important: We want algorithms that minimize regret growth.

Sublinear Regret

An algorithm is good if regret grows sublinearly in T:

R(T) = o(T)

Equivalently, average regret goes to zero:

\frac{R(T)}{T} \to 0 \text{ as } T \to \infty

This means we’re converging to optimal performance. Linear regret R(T) = Θ(T) would mean we’re making a constant mistake forever—we never learn.
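You can check this yourself with a short simulation. The sketch below runs ε-greedy with a decaying exploration rate (ε_t = t^(-1/3), an arbitrary but common choice) on a toy two-armed Bernoulli problem and prints R(T)/T at a few checkpoints; the average regret should shrink as T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.6])          # arm 1 is the optimal arm
counts = np.zeros(2)
sums = np.zeros(2)
cumulative_regret = 0.0

for t in range(1, 100_001):
    epsilon = t ** (-1 / 3)                      # decaying exploration rate
    if rng.random() < epsilon or counts.min() == 0:
        arm = int(rng.integers(2))               # explore
    else:
        arm = int(np.argmax(sums / counts))      # exploit the empirical best
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward
    cumulative_regret += true_means.max() - true_means[arm]   # pseudo-regret vs true best
    if t in (1_000, 10_000, 100_000):
        print(f"T={t:>7}  R(T)={cumulative_regret:7.1f}  R(T)/T={cumulative_regret / t:.4f}")
```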

Typical Regret Bounds

| Algorithm | Regret Bound | Interpretation |
| --- | --- | --- |
| ε-greedy | O(T^(2/3)) | Suboptimal. Needs a decaying ε; a fixed exploration rate gives linear regret. |
| UCB | O(√(T log T)) | Near-optimal for MAB. Confidence-based exploration. |
| Thompson Sampling | O(√T) | Optimal for many problems. Posterior sampling. |
| LinUCB | O(√(dT log T)) | Linear contextual bandit. Pays a √d factor for d dimensions. |

The √T bound is essentially optimal—you can’t do better in the worst case. This comes from a fundamental information-theoretic tradeoff.

Visual Comparison of Regret Growth

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph TB
    subgraph Regret["Cumulative Regret Over Time"]
        direction LR
        Bad["Linear O&#40;T&#41;<br/>━━━━━━━━<br/>Never learns<br/>Constant mistakes<br/>❌ Unacceptable"]
        Subopt["Sublinear O&#40;T^2/3&#41;<br/>━━━━━━━━<br/>ε-greedy<br/>Learning but slow<br/>⚠️ Suboptimal"]
        Good["O&#40;√T log T&#41;<br/>━━━━━━━━<br/>UCB, LinUCB<br/>Near-optimal<br/>✓ Production-ready"]
        Optimal["O&#40;√T&#41;<br/>━━━━━━━━<br/>Thompson Sampling<br/>Optimal bound<br/>✓✓ Best theoretical"]
    end
    Bad -.worse.-> Subopt
    Subopt -.worse.-> Good
    Good -.worse.-> Optimal
    style Bad fill:#1e293b,stroke:#ef4444,color:#fecaca,stroke-width:2px
    style Subopt fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style Good fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style Optimal fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px
    style Regret fill:none,stroke:#64748b,stroke-width:2px
    linkStyle 0,1,2 stroke:#64748b,stroke-width:1px,stroke-dasharray:3
```

Practical interpretation:

After 10,000 rounds (treating each bound as an exact count with unit constants):

  • Linear O(T): ~10,000 cumulative regret (never learned)
  • ε-greedy O(T^(2/3)): ~464 cumulative regret
  • UCB O(√(T log T)): ~300 cumulative regret
  • Thompson O(√T): ~100 cumulative regret

The difference compounds over time. In production with millions of decisions, UCB/Thompson save significant value vs ε-greedy.
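These numbers come from plugging T = 10,000 into each bound with all constants set to 1; the snippet below reproduces them (purely illustrative, since real constants and problem difficulty matter).

```python
import math

T = 10_000
print("linear      :", T)                                   # O(T)
print("eps-greedy  :", round(T ** (2 / 3)))                 # O(T^(2/3)) ≈ 464
print("UCB         :", round(math.sqrt(T * math.log(T))))   # O(sqrt(T log T)) ≈ 303
print("Thompson    :", round(math.sqrt(T)))                 # O(sqrt(T)) = 100
```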


Reward Models: Assumptions About Structure

Real-world reward functions are complex. We make modeling assumptions to make learning tractable.

Linear Reward Model

Assumption: Reward is a linear function of context features:

r(x, a) = \theta_a^T x + \epsilon

where:

  • θₐ ∈ ℝᵈ is a weight vector for action a
  • x ∈ ℝᵈ is the context vector
  • ε ~ N(0, σ²) is noise

When this is reasonable:

✅ Features are well-engineered (embeddings, hand-crafted)
✅ Relationships are approximately linear in feature space
✅ Interpretability is important (coefficients show feature importance)

Limitations:

❌ Misses complex nonlinear interactions
❌ Requires good feature engineering upfront
❌ Can underfit when true function is highly nonlinear

Example: Content Recommendation

Context x: [user_age, user_tenure, content_freshness, 
            content_popularity, topic_match]
Action a: show_content (binary)

Linear model: 
r(x, show) = θ₁·age + θ₂·tenure + θ₃·freshness + 
             θ₄·popularity + θ₅·topic_match

Interpretation: If θ₃ = 0.8, each unit increase in freshness 
                adds 0.8 to expected reward
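A quick sketch of how such a linear model might be fit from logged data for one action, using the ridge-regression estimate that reappears later in LinUCB; the feature values, true coefficients, and sample size are all made up for illustration.

```python
import numpy as np

def fit_theta(X: np.ndarray, r: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Ridge estimate theta = (X^T X + lam I)^(-1) X^T r for one action."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

rng = np.random.default_rng(1)
# columns: [age, tenure, freshness, popularity, topic_match] (standardized, synthetic)
X_show = rng.normal(size=(500, 5))
true_theta = np.array([0.1, 0.05, 0.8, 0.3, 0.5])
r_show = X_show @ true_theta + rng.normal(scale=0.1, size=500)

theta_hat = fit_theta(X_show, r_show)
print(np.round(theta_hat, 2))   # theta_hat[2] should land near 0.8, the freshness weight above
```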

Generalized Linear Models (GLMs)

Assumption: Reward follows an exponential family distribution with linear predictor:

\mathbb{E}[r \mid x, a] = g^{-1}(\theta_a^T x)

where g is a link function.

Common cases:

Logistic regression (binary rewards):

\mathbb{P}(r = 1 \mid x, a) = \sigma(\theta_a^T x) = \frac{1}{1 + e^{-\theta_a^T x}}

Use when reward is click/no-click, conversion/no-conversion.

Poisson regression (count rewards):

\mathbb{E}[r \mid x, a] = e^{\theta_a^T x}

Use when reward is number of purchases, page views, etc.

Advantage: Captures bounded or discrete reward structures while maintaining interpretability.
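As a tiny illustration of the logistic case, the snippet below scores one context with a hypothetical coefficient vector; θₐ here is assumed, not fitted.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

theta_a = np.array([0.2, -0.1, 0.9])   # hypothetical coefficients for one action
x = np.array([1.0, 0.5, 0.3])          # context features
p_click = sigmoid(theta_a @ x)         # P(r = 1 | x, a)
print(f"predicted click probability: {p_click:.3f}")
```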

Nonlinear Models (Neural Bandits)

Assumption: Reward is a complex nonlinear function:

r(x, a) = f_{\theta_a}(x) + \epsilon

where f is a neural network with parameters θₐ.

When this is necessary:

✅ Complex feature interactions that linear models can’t capture
✅ High-dimensional raw inputs (images, text)
✅ Sufficient data to train deep models

Challenges:

❌ Requires much more data
❌ Less interpretable (black box)
❌ Harder to quantify uncertainty (needed for exploration)
❌ Computationally expensive
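For completeness, here is a minimal sketch of what a neural reward model for one action can look like, written in PyTorch; the layer sizes are arbitrary, and it deliberately omits the hard part (uncertainty estimation for exploration), which is the subject of Part 4.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP approximating f_theta(x) for a single action."""
    def __init__(self, d: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)   # predicted reward per context

model = RewardNet(d=128)
x = torch.randn(32, 128)                 # a batch of 32 contexts
predicted_reward = model(x)              # shape: (32,)
```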

When to use each:

| Context Dim | Reward Structure | Recommended Model |
| --- | --- | --- |
| d < 50 | Linear or mildly nonlinear | Linear |
| d = 50-500 | Linear with interactions | Linear + feature engineering |
| d > 500 | Linear | Linear with regularization (large λ) |
| Any d | Highly nonlinear | Neural network |
| Images, text | Complex patterns | Neural (CNN, Transformer) |

The Exploration-Exploitation Tradeoff

This is the fundamental dilemma in bandit problems.

The Tension

Exploitation: Choose the action that has performed best so far. Maximize immediate reward based on current knowledge.

Exploration: Try actions with uncertain rewards. Gather information that might lead to better decisions later.

The conflict: Exploration sacrifices short-term reward for long-term learning. Exploitation maximizes current reward but risks missing better alternatives.

Toy Example

Two actions: A and B
Current estimates: 
  μ̂_A = 0.8 (based on 100 trials)
  μ̂_B = 0.6 (based on 10 trials)

Pure exploitation: Always choose A (highest estimate)
Problem: B's estimate is uncertain. True μ_B might be 0.9, 
         but we have sparse data.

Pure exploration: Choose randomly to gather data
Problem: While gathering data, we're not maximizing reward

Optimal balance: Choose A most of the time, occasionally try B 
                 until confident about its value
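A quick numeric check of this intuition, using the Hoeffding-style confidence radius √(2 log t / n) that the UCB section below formalizes (trial counts taken from the toy example above):

```python
import math

def hoeffding_radius(n: int, t: int) -> float:
    return math.sqrt(2 * math.log(t) / n)

t = 110   # total rounds so far: 100 pulls of A + 10 of B
for name, mean, n in [("A", 0.8, 100), ("B", 0.6, 10)]:
    radius = hoeffding_radius(n, t)
    print(f"{name}: estimate {mean:.2f} ± {radius:.2f}  ->  upper bound {mean + radius:.2f}")
# B's interval is roughly 3x wider, so its upper bound exceeds A's
# despite the lower estimate: B is still worth trying occasionally.
```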

Mathematical Formalization

Define the exploitation gap as the difference between estimated best and true best:

\Delta_t = \max_a \mathbb{E}[r \mid x_t, a] - \mathbb{E}[r \mid x_t, a_t]

And the exploration benefit as information gain:

I_t = \text{Reduction in uncertainty about } \theta

Optimal policies balance these:

a_t = \arg\max_a \left\{ \mathbb{E}[r \mid x_t, a] + \lambda \cdot I(a) \right\}

where λ controls exploration intensity.

Visual Representation

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'fontSize':'14px', 'fontFamily':'monospace', 'nodeBorder':'2px' } }}%%
graph LR
    Exploit[Exploitation<br/>━━━━━━━━<br/>Use current knowledge<br/>Maximize immediate reward<br/>Risk: Missing better options]
    Explore[Exploration<br/>━━━━━━━━<br/>Gather information<br/>Reduce uncertainty<br/>Cost: Lower immediate reward]
    Balance[Optimal Balance<br/>━━━━━━━━<br/>Exploit high-confidence actions<br/>Explore uncertain alternatives<br/>Minimize regret over time]
    Exploit -.tradeoff.-> Balance
    Explore -.tradeoff.-> Balance
    style Exploit fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px
    style Explore fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px
    style Balance fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px
    linkStyle 0,1 stroke:#64748b,stroke-width:2px,stroke-dasharray:5
```

Why This Matters in Practice

If you over-exploit (ε = 0.01):

  • Converge quickly to locally optimal solution
  • Miss potentially better alternatives
  • Regret grows linearly if you lock onto a suboptimal action

If you over-explore (ε = 0.5):

  • Waste traffic on clearly inferior actions
  • Slow convergence to optimal policy
  • Users see more bad experiences

Optimal strategy: Exploration rate should decrease over time. Early on, uncertainty is high—explore more. Later, confidence grows—exploit more.
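One simple way to encode that schedule is a decaying ε; the exponent and the 1% floor below are arbitrary choices for illustration, not recommendations.

```python
def epsilon_schedule(t: int, floor: float = 0.01) -> float:
    """Exploration rate that decays like t^(-1/3), never dropping below `floor`."""
    return max(floor, t ** (-1 / 3))

for t in (1, 10, 100, 1_000, 10_000, 100_000):
    print(t, round(epsilon_schedule(t), 3))   # e.g. 10 -> 0.464, decaying toward the floor
```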


The Confidence-Bound Principle

Many algorithms use optimism under uncertainty: when unsure about an action’s value, assume it might be great.

Upper Confidence Bound (UCB)

Intuition: Add a confidence bonus to each action’s estimate. Choose the action with the highest upper confidence bound.

a_t = \arg\max_a \left\{ \hat{\mu}_a + \sqrt{\frac{2 \log t}{n_a}} \right\}

where:

  • μ̂ₐ = estimated mean reward for action a
  • nₐ = number of times action a has been tried
  • The √ term = confidence radius

Why This Works

  • Well-explored actions (large nₐ): Confidence radius is small. Trust the estimate μ̂ₐ.
  • Under-explored actions (small nₐ): Confidence radius is large. UCB is optimistic.
  • This creates natural exploration: uncertain actions get inflated values and get tried.
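Putting the pieces together, a compact UCB1 sketch for the context-free case might look like this (the toy Bernoulli environment at the bottom is only there to exercise it):

```python
import math
import random

class UCB1:
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm                               # try every arm once first
        ucb = [m + math.sqrt(2 * math.log(self.t) / n)   # estimate + confidence bonus
               for m, n in zip(self.means, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]   # incremental mean

# Toy usage on a 3-armed Bernoulli problem
bandit = UCB1(3)
true_p = [0.2, 0.5, 0.6]
for _ in range(10_000):
    arm = bandit.select()
    bandit.update(arm, float(random.random() < true_p[arm]))
print(bandit.counts)    # most pulls should concentrate on arm 2
```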

The Math Behind the Confidence Radius

The radius √(2 log t / nₐ) comes from Hoeffding’s inequality:

\mathbb{P}\left(|\hat{\mu}_a - \mu_a| > \sqrt{\frac{2 \log t}{n_a}}\right) \leq \frac{1}{t^4}

With high probability (1 - 1/t⁴), the true mean is within the confidence radius. By using the upper bound, we ensure we don’t prematurely dismiss good actions.

Visual Intuition for UCB

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph LR
    subgraph Action_A[" "]
        A_Est[Estimate: 0.5<br/>Trials: 100<br/>Confidence: ±0.1]
        A_UCB[UCB: 0.6<br/>✓ Well-explored<br/>Small bonus]
    end
    subgraph Action_B[" "]
        B_Est[Estimate: 0.4<br/>Trials: 10<br/>Confidence: ±0.3]
        B_UCB[UCB: 0.7<br/>⚡ Uncertain<br/>Large bonus]
    end
    A_Est --> A_UCB
    B_Est --> B_UCB
    B_UCB -.selected.-> Winner[Action B<br/>chosen despite<br/>lower estimate]
    style Action_A fill:none,stroke:none
    style Action_B fill:none,stroke:none
    style A_Est fill:#334155,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style A_UCB fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style B_Est fill:#334155,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style B_UCB fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px
    style Winner fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2px
    linkStyle 2 stroke:#06b6d4,stroke-width:2px,stroke-dasharray:5
```

Key insight: Action B has lower estimate (0.4 vs 0.5) but higher UCB (0.7 vs 0.6) due to uncertainty. UCB chooses B to reduce uncertainty—exactly what we want.

LinUCB: Extending to Contextual Bandits

For linear contextual bandits, UCB extends to:

a_t = \arg\max_a \left\{ \theta_a^T x_t + \alpha \sqrt{x_t^T A_a^{-1} x_t} \right\}

where:

  • Aₐ = XₐᵀXₐ + λI (regularized Gram matrix)
  • α = exploration parameter (typically 0.1-2.0)
  • λ = regularization parameter (typically 1.0)

The confidence term √(xᵀAₐ⁻¹x) is the width of the confidence ellipsoid around θ̂ₐ in the direction of x; it captures how uncertain the parameter estimate θₐ is for the current context.

Novel contexts (far from previously seen data) have large confidence radii, encouraging exploration in new regions of the feature space.
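A minimal sketch of the disjoint-model LinUCB selection and update rules described above; α and λ are the tuning knobs from the bullet list, and the matrix inverse is recomputed naively for clarity rather than maintained incrementally.

```python
import numpy as np

class LinUCB:
    def __init__(self, n_actions: int, d: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha
        self.A = [lam * np.eye(d) for _ in range(n_actions)]   # A_a = X_a^T X_a + lam I
        self.b = [np.zeros(d) for _ in range(n_actions)]       # b_a = X_a^T r_a

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                        # ridge estimate for this action
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, action: int, x: np.ndarray, reward: float) -> None:
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```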


Thompson Sampling: Bayesian Perspective

Instead of confidence bounds, Thompson Sampling uses probability matching.

Algorithm

  1. Maintain posterior distribution P(θₐ | data) for each action’s parameters
  2. Sample θ̃ₐ ~ P(θₐ | data) for each action
  3. Choose action with highest sampled reward: aₜ = argmaxₐ θ̃ₐᵀxₜ
  4. Observe reward, update posteriors

Probability Matching

The probability of choosing action a equals the probability that a is optimal given current knowledge:

\mathbb{P}(\text{choose } a) = \mathbb{P}(a \text{ is optimal} \mid \text{data})

This naturally balances exploration and exploitation through Bayesian updating:

  • Early: Posteriors are diffuse → high probability of sampling various actions → exploration
  • Late: Posteriors concentrate → high probability of sampling best action → exploitation

No manual tuning of exploration rate needed. The uncertainty drives exploration organically.

Why It Works

Thompson Sampling achieves optimal O(√T) regret bounds for many problems while being simpler to implement than UCB variants. The Bayesian framework naturally handles uncertainty.

Example: Bernoulli Bandits

For binary rewards (click/no-click), use Beta-Bernoulli conjugacy:

Prior: θₐ ~ Beta(1, 1) (uniform)

Likelihood: r | θₐ ~ Bernoulli(θₐ)

Posterior after observing data:

\theta_a \mid \text{data} \sim \text{Beta}(\alpha_a, \beta_a)

where:

  • αₐ = 1 + number of successes (clicks)
  • βₐ = 1 + number of failures (no clicks)

To choose action: Sample θ̃ₐ ~ Beta(αₐ, βₐ) for each action, pick argmax θ̃ₐ.
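In code, the whole algorithm for the Bernoulli case fits in a few lines; this sketch keeps integer success/failure counts exactly as described above.

```python
import random

class BetaBernoulliTS:
    def __init__(self, n_arms: int):
        self.alpha = [1] * n_arms    # 1 + successes (clicks)
        self.beta = [1] * n_arms     # 1 + failures (no clicks)

    def select(self) -> int:
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))   # pick the arm with the highest sampled mean

    def update(self, arm: int, reward: int) -> None:
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```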

For Linear Contextual Bandits

Assume Gaussian prior and noise:

Prior: θₐ ~ N(0, σ²I)
Noise: ε ~ N(0, σ²)

Posterior:

\theta_a \mid D_a \sim \mathcal{N}(\hat{\theta}_a, \Sigma_a)

where:

  • \hat{\theta}_a = (X^T X + \lambda I)^{-1} X^T r (ridge regression estimate)
  • \Sigma_a = \sigma^2 (X^T X + \lambda I)^{-1} (posterior covariance)

To choose an action: sample \tilde{\theta}_a \sim \mathcal{N}(\hat{\theta}_a, \Sigma_a), compute \tilde{r}_a = \tilde{\theta}_a^T x_t, and pick a_t = \arg\max_a \tilde{r}_a.
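A sketch of that sampling step, reusing the same Aₐ and bₐ statistics as LinUCB; σ is assumed known and the prior scale is folded into λ, both simplifications.

```python
import numpy as np

def linear_ts_choose(x, A_list, b_list, sigma: float = 1.0, rng=None) -> int:
    """A_list[a] = X_a^T X_a + lam*I and b_list[a] = X_a^T r_a, as in LinUCB."""
    rng = rng or np.random.default_rng()
    scores = []
    for A_a, b_a in zip(A_list, b_list):
        A_inv = np.linalg.inv(A_a)
        theta_hat = A_inv @ b_a                                  # posterior (ridge) mean
        cov = sigma ** 2 * A_inv                                 # posterior covariance
        theta_tilde = rng.multivariate_normal(theta_hat, cov)    # one posterior sample
        scores.append(theta_tilde @ x)
    return int(np.argmax(scores))

# Toy usage with untrained (prior-only) statistics for 3 actions in 5 dimensions
d = 5
A_list = [np.eye(d) for _ in range(3)]
b_list = [np.zeros(d) for _ in range(3)]
print(linear_ts_choose(np.ones(d), A_list, b_list))
```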


The Fundamental Limits: Lower Bounds

Can we do better than O(√T) regret? No (in the worst case).

Theorem (Lai & Robbins, 1985)

For any algorithm, there exist problem instances where regret is at least:

R(T) = \Omega(\log T)

for stochastic bandits (MAB).

For contextual bandits with d-dimensional context and linear rewards:

R(T) = \Omega(\sqrt{dT})

Implication

Algorithms achieving O(√(dT log T)) are near-optimal: the √(dT) term matches the lower bound, and the remaining log T factor is only logarithmic slack.

Intuition for the √T Bound

You need to explore each action enough times to distinguish it from the optimal one. If the gap between rewards is Δ, you need roughly O(1/Δ²) samples per action. With K actions, total exploration cost is O(K/Δ²).

If you don’t know Δ in advance and must be robust to small gaps, worst-case regret is O(√(KT)).

This isn’t just theoretical—it explains why bandit algorithms can’t magically achieve zero regret. There’s a fundamental cost to learning under uncertainty.


Key Takeaways

Essential concepts:

Regret R(T) = cumulative opportunity cost vs optimal policy

  • Goal: Achieve sublinear regret R(T) = o(T)
  • Means average regret → 0 as T → ∞ (we’re learning)

Optimal algorithms achieve O(√(dT log T)) for d-dimensional contexts

  • The √T term is unavoidable (information-theoretic limit)
  • The √d factor is the cost of personalization
  • Linear regret O(T) means algorithm never learns (failure)

Exploration-exploitation is fundamental—no free lunch

  • Explore: Try uncertain actions to gather information
  • Exploit: Choose known best action to maximize reward
  • Tradeoff: Short-term reward vs long-term learning

Two principled approaches to balance:

  • UCB: Optimism under uncertainty (confidence-based)
  • Thompson: Posterior sampling (Bayesian)
  • Both achieve near-optimal regret bounds

Reward model matters:

  • Linear: Fast, interpretable, works for d < 500
  • GLM: Handles bounded/discrete rewards
  • Neural: Required for high-dim or complex nonlinear

Practical implications:

| If You See… | It Means… | Action |
| --- | --- | --- |
| Regret growing linearly | Algorithm not learning | Increase exploration (ε or α) |
| High variance in rewards | Unstable estimates | Increase regularization (λ) |
| Slow convergence | Under-exploration | Check confidence bounds |
| Many dimensions (d > 100) | √d penalty hurts | Feature selection or neural |

When to worry about math:

Do worry: Algorithm selection, tuning α/λ, understanding regret curves
Don’t over-optimize: Chasing last 1% theoretical gain—empirical performance matters more



Article series

Adaptive Optimization at Scale: Contextual Bandits from Theory to Production

Part 2 of 5

  1. Part 1 When to Use Contextual Bandits: The Decision Framework
  2. Part 2 Contextual Bandit Theory: Regret Bounds and Exploration
  3. Part 3 Implementing Contextual Bandits: Complete Algorithm Guide
  4. Part 4 Neural Contextual Bandits for High-Dimensional Data
  5. Part 5 Deploying Contextual Bandits: Production Guide and Offline Evaluation
