~18 min By Vitor Sousa

Contextual Bandit Theory: Regret Bounds and Exploration

Part 2 of 5: Theory That Informs Practice

TL;DR: Contextual bandits minimize regret—cumulative opportunity cost versus the optimal policy. Optimal algorithms achieve O(√(dT log T)) regret for d-dimensional contexts. The exploration-exploitation tradeoff is fundamental: confidence bounds (UCB) and posterior sampling (Thompson) provide principled solutions. This math directly guides algorithm selection and hyperparameter tuning.

Reading time: ~18 minutes


Introduction: Why Math Matters for Practice

You don’t need to derive regret bounds to deploy bandits in production. But understanding the mathematics directly informs practical decisions:

  • Algorithm selection: Why LinUCB for linear rewards? Why does Thompson Sampling often win empirically?
  • Hyperparameter tuning: What does α control in LinUCB? How to set it?
  • Debugging: Why is my bandit not converging? (Check if regret is sublinear)
  • Feature engineering: Why does dimensionality hurt? (√d penalty in regret bounds)

This post builds intuition for the theory that makes contextual bandits work.


The Contextual Bandit Problem (Formal)

Setup

At each round t = 1, 2, …, T:

  1. Environment reveals context xₜ ∈ X (context space)
  2. Agent chooses action aₜ ∈ A from available action set
  3. Environment generates reward rₜ ~ P(r | xₜ, aₜ) from unknown distribution
  4. Agent observes reward rₜ for chosen action (but not rewards for unchosen actions)

Goal

Learn a policy π: X → A that maximizes expected cumulative reward:

V(\pi) = \mathbb{E}\left[\sum_{t=1}^{T} r_t \mid \pi \right]

Key Insight: Partial Feedback

We only observe rewards for actions we take. This is the partial feedback problem—we never know what would have happened if we’d chosen differently (the counterfactual outcome).

Example:

Context: User is 25-year-old in San Francisco
Action chosen: Recommend Product A
Reward observed: Click (reward = 1)

Unknown counterfactuals:
- What if we'd shown Product B? Maybe reward = 0 (worse)
- What if we'd shown Product C? Maybe reward = 1 (equally good)

We'll never know unless we try those actions in similar contexts.

This partial observability is why exploration is necessary—we must try actions to learn their value.
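To make the data we actually get to learn from concrete, here is a minimal sketch of a single logged bandit interaction. The uniform-random policy, the simulated user response, and the field names are illustrative placeholders, not a prescribed schema; logging the propensity (the probability of the chosen action) is optional here but pays off later for offline evaluation.

```python
from dataclasses import dataclass
import random

@dataclass
class BanditLog:
    context: dict      # features observed before acting
    action: str        # the single action we chose
    reward: float      # reward for that action only
    propensity: float  # probability we chose it (useful for off-policy evaluation)

def simulate_user_response(context: dict, action: str) -> float:
    # Stand-in for the real environment; in production this is the user's actual reaction.
    return float(random.random() < 0.1)

def act_and_log(context: dict, actions: list) -> BanditLog:
    action = random.choice(actions)                    # placeholder uniform-random policy
    reward = simulate_user_response(context, action)
    # Note what is NOT here: rewards for the actions we did not take.
    return BanditLog(context, action, reward, propensity=1.0 / len(actions))

log = act_and_log({"age": 25, "city": "SF"}, ["product_a", "product_b", "product_c"])
print(log)
```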


Regret: The Fundamental Performance Metric

Since we don’t know the true reward distributions, we can’t compute V(π) directly. Instead, we measure regret—how much worse we do compared to the optimal policy.

Definition

Let π* be the optimal policy that always chooses the best action for each context. The regret after T rounds is:

R(T) = \sum_{t=1}^{T} r_t^* - \sum_{t=1}^{T} r_t, \qquad \mathbb{E}[R(T)] = V(\pi^*) - V(\pi)

where rₜ* is the reward we would have gotten from the optimal action at time t.

Interpretation

Regret measures cumulative opportunity cost. If R(T) = 100 after 1000 rounds, we earned 100 less total reward than optimal.

Important: We want algorithms that minimize regret growth.

Sublinear Regret

An algorithm is good if regret grows sublinearly in T:

R(T) = o(T)

Equivalently, average regret goes to zero:

\frac{R(T)}{T} \to 0 \text{ as } T \to \infty

This means we’re converging to optimal performance. Linear regret R(T) = Θ(T) would mean we’re making a constant mistake forever—we never learn.
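You can check this yourself with a short simulation. The sketch below runs ε-greedy with a decaying exploration rate (ε_t = t^(-1/3), an arbitrary but common choice) on a toy two-armed Bernoulli problem and prints R(T)/T at a few checkpoints; the average regret should shrink as T grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.6])          # arm 1 is the optimal arm
counts = np.zeros(2)
sums = np.zeros(2)
cumulative_regret = 0.0

for t in range(1, 100_001):
    epsilon = t ** (-1 / 3)                      # decaying exploration rate
    if rng.random() < epsilon or counts.min() == 0:
        arm = int(rng.integers(2))               # explore
    else:
        arm = int(np.argmax(sums / counts))      # exploit the empirical best
    reward = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += reward
    cumulative_regret += true_means.max() - true_means[arm]   # pseudo-regret vs true best
    if t in (1_000, 10_000, 100_000):
        print(f"T={t:>7}  R(T)={cumulative_regret:7.1f}  R(T)/T={cumulative_regret / t:.4f}")
```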

Typical Regret Bounds

| Algorithm | Regret Bound | Interpretation |
| --- | --- | --- |
| ε-greedy | O(T^(2/3)) | Suboptimal. Needs a decaying ε; a fixed exploration rate gives linear regret. |
| UCB | O(√(T log T)) | Near-optimal for MAB. Confidence-based exploration. |
| Thompson Sampling | O(√T) | Optimal for many problems. Posterior sampling. |
| LinUCB | O(√(dT log T)) | Linear contextual bandit. Pays a √d factor for d dimensions. |

The √T bound is essentially optimal—you can’t do better in the worst case. This comes from a fundamental information-theoretic tradeoff.

Visual Comparison of Regret Growth

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph TB
    subgraph Regret["Cumulative Regret Over Time"]
        direction LR
        Bad["Linear O&#40;T&#41;<br/>━━━━━━━━<br/>Never learns<br/>Constant mistakes<br/>❌ Unacceptable"]
        Subopt["Sublinear O&#40;T^2/3&#41;<br/>━━━━━━━━<br/>ε-greedy<br/>Learning but slow<br/>⚠️ Suboptimal"]
        Good["O&#40;√T log T&#41;<br/>━━━━━━━━<br/>UCB, LinUCB<br/>Near-optimal<br/>✓ Production-ready"]
        Optimal["O&#40;√T&#41;<br/>━━━━━━━━<br/>Thompson Sampling<br/>Optimal bound<br/>✓✓ Best theoretical"]
    end
    Bad -.worse.-> Subopt
    Subopt -.worse.-> Good
    Good -.worse.-> Optimal
    style Bad fill:#1e293b,stroke:#ef4444,color:#fecaca,stroke-width:2px
    style Subopt fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style Good fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style Optimal fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px
    style Regret fill:none,stroke:#64748b,stroke-width:2px
    linkStyle 0,1,2 stroke:#64748b,stroke-width:1px,stroke-dasharray:3
```

Practical interpretation:

After 10,000 rounds (treating each bound as an exact count with unit constants):

  • Linear O(T): ~10,000 cumulative regret (never learned)
  • ε-greedy O(T^(2/3)): ~464 cumulative regret
  • UCB O(√(T log T)): ~300 cumulative regret
  • Thompson O(√T): ~100 cumulative regret

The difference compounds over time. In production with millions of decisions, UCB/Thompson save significant value vs ε-greedy.
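These numbers come from plugging T = 10,000 into each bound with all constants set to 1; the snippet below reproduces them (purely illustrative, since real constants and problem difficulty matter).

```python
import math

T = 10_000
print("linear      :", T)                                   # O(T)
print("eps-greedy  :", round(T ** (2 / 3)))                 # O(T^(2/3)) ≈ 464
print("UCB         :", round(math.sqrt(T * math.log(T))))   # O(sqrt(T log T)) ≈ 303
print("Thompson    :", round(math.sqrt(T)))                 # O(sqrt(T)) = 100
```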


Reward Models: Assumptions About Structure

Real-world reward functions are complex. We make modeling assumptions to make learning tractable.

Linear Reward Model

Assumption: Reward is a linear function of context features:

r(x, a) = \theta_a^T x + \epsilon

where:

  • θₐ ∈ ℝᵈ is a weight vector for action a
  • x ∈ ℝᵈ is the context vector
  • ε ~ N(0, σ²) is noise

When this is reasonable:

✅ Features are well-engineered (embeddings, hand-crafted)
✅ Relationships are approximately linear in feature space
✅ Interpretability is important (coefficients show feature importance)

Limitations:

❌ Misses complex nonlinear interactions
❌ Requires good feature engineering upfront
❌ Can underfit when true function is highly nonlinear

Example: Content Recommendation

Context x: [user_age, user_tenure, content_freshness, 
            content_popularity, topic_match]
Action a: show_content (binary)

Linear model: 
r(x, show) = θ₁·age + θ₂·tenure + θ₃·freshness + 
             θ₄·popularity + θ₅·topic_match

Interpretation: If θ₃ = 0.8, each unit increase in freshness 
                adds 0.8 to expected reward
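A quick sketch of how such a linear model might be fit from logged data for one action, using the ridge-regression estimate that reappears later in LinUCB; the feature values, true coefficients, and sample size are all made up for illustration.

```python
import numpy as np

def fit_theta(X: np.ndarray, r: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Ridge estimate theta = (X^T X + lam I)^(-1) X^T r for one action."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ r)

rng = np.random.default_rng(1)
# columns: [age, tenure, freshness, popularity, topic_match] (standardized, synthetic)
X_show = rng.normal(size=(500, 5))
true_theta = np.array([0.1, 0.05, 0.8, 0.3, 0.5])
r_show = X_show @ true_theta + rng.normal(scale=0.1, size=500)

theta_hat = fit_theta(X_show, r_show)
print(np.round(theta_hat, 2))   # theta_hat[2] should land near 0.8, the freshness weight above
```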

Generalized Linear Models (GLMs)

Assumption: Reward follows an exponential family distribution with linear predictor:

\mathbb{E}[r \mid x, a] = g^{-1}(\theta_a^T x)

where g is a link function.

Common cases:

Logistic regression (binary rewards):

\mathbb{P}(r = 1 \mid x, a) = \sigma(\theta_a^T x) = \frac{1}{1 + e^{-\theta_a^T x}}

Use when reward is click/no-click, conversion/no-conversion.

Poisson regression (count rewards):

\mathbb{E}[r \mid x, a] = e^{\theta_a^T x}

Use when reward is number of purchases, page views, etc.

Advantage: Captures bounded or discrete reward structures while maintaining interpretability.
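As a tiny illustration of the logistic case, the snippet below scores one context with a hypothetical coefficient vector; θₐ here is assumed, not fitted.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

theta_a = np.array([0.2, -0.1, 0.9])   # hypothetical coefficients for one action
x = np.array([1.0, 0.5, 0.3])          # context features
p_click = sigmoid(theta_a @ x)         # P(r = 1 | x, a)
print(f"predicted click probability: {p_click:.3f}")
```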

Nonlinear Models (Neural Bandits)

Assumption: Reward is a complex nonlinear function:

r(x, a) = f_{\theta_a}(x) + \epsilon

where f is a neural network with parameters θₐ.

When this is necessary:

✅ Complex feature interactions that linear models can’t capture
✅ High-dimensional raw inputs (images, text)
✅ Sufficient data to train deep models

Challenges:

❌ Requires much more data
❌ Less interpretable (black box)
❌ Harder to quantify uncertainty (needed for exploration)
❌ Computationally expensive
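For completeness, here is a minimal sketch of what a neural reward model for one action can look like, written in PyTorch; the layer sizes are arbitrary, and it deliberately omits the hard part (uncertainty estimation for exploration), which is the subject of Part 4.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP approximating f_theta(x) for a single action."""
    def __init__(self, d: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)   # predicted reward per context

model = RewardNet(d=128)
x = torch.randn(32, 128)                 # a batch of 32 contexts
predicted_reward = model(x)              # shape: (32,)
```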

When to use each:

| Context Dim | Reward Structure | Recommended Model |
| --- | --- | --- |
| d < 50 | Linear or mildly nonlinear | Linear |
| d = 50-500 | Linear with interactions | Linear + feature engineering |
| d > 500 | Linear | Linear with regularization (large λ) |
| Any d | Highly nonlinear | Neural network |
| Images, text | Complex patterns | Neural (CNN, Transformer) |

The Exploration-Exploitation Tradeoff

This is the fundamental dilemma in bandit problems.

The Tension

Exploitation: Choose the action that has performed best so far. Maximize immediate reward based on current knowledge.

Exploration: Try actions with uncertain rewards. Gather information that might lead to better decisions later.

The conflict: Exploration sacrifices short-term reward for long-term learning. Exploitation maximizes current reward but risks missing better alternatives.

Toy Example

Two actions: A and B
Current estimates: 
  μ̂_A = 0.8 (based on 100 trials)
  μ̂_B = 0.6 (based on 10 trials)

Pure exploitation: Always choose A (highest estimate)
Problem: B's estimate is uncertain. True μ_B might be 0.9, 
         but we have sparse data.

Pure exploration: Choose randomly to gather data
Problem: While gathering data, we're not maximizing reward

Optimal balance: Choose A most of the time, occasionally try B 
                 until confident about its value
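A quick numeric check of this intuition, using the Hoeffding-style confidence radius √(2 log t / n) that the UCB section below formalizes (trial counts taken from the toy example above):

```python
import math

def hoeffding_radius(n: int, t: int) -> float:
    return math.sqrt(2 * math.log(t) / n)

t = 110   # total rounds so far: 100 pulls of A + 10 of B
for name, mean, n in [("A", 0.8, 100), ("B", 0.6, 10)]:
    radius = hoeffding_radius(n, t)
    print(f"{name}: estimate {mean:.2f} ± {radius:.2f}  ->  upper bound {mean + radius:.2f}")
# B's interval is roughly 3x wider, so its upper bound exceeds A's
# despite the lower estimate: B is still worth trying occasionally.
```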

Mathematical Formalization

Define the exploitation gap as the difference between estimated best and true best:

\Delta_t = \max_a \mathbb{E}[r \mid x_t, a] - \mathbb{E}[r \mid x_t, a_t]

And the exploration benefit as information gain:

I_t = \text{Reduction in uncertainty about } \theta

Optimal policies balance these:

a_t = \arg\max_a \left\{ \mathbb{E}[r \mid x_t, a] + \lambda \cdot I(a) \right\}

where λ controls exploration intensity.

Visual Representation

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'fontSize':'14px', 'fontFamily':'monospace', 'nodeBorder':'2px' } }}%%
graph LR
    Exploit[Exploitation<br/>━━━━━━━━<br/>Use current knowledge<br/>Maximize immediate reward<br/>Risk: Missing better options]
    Explore[Exploration<br/>━━━━━━━━<br/>Gather information<br/>Reduce uncertainty<br/>Cost: Lower immediate reward]
    Balance[Optimal Balance<br/>━━━━━━━━<br/>Exploit high-confidence actions<br/>Explore uncertain alternatives<br/>Minimize regret over time]
    Exploit -.tradeoff.-> Balance
    Explore -.tradeoff.-> Balance
    style Exploit fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px
    style Explore fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px
    style Balance fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px
    linkStyle 0,1 stroke:#64748b,stroke-width:2px,stroke-dasharray:5
```

Why This Matters in Practice

If you over-exploit (ε = 0.01):

  • Converge quickly to locally optimal solution
  • Miss potentially better alternatives
  • Regret grows linearly if you lock onto a suboptimal action

If you over-explore (ε = 0.5):

  • Waste traffic on clearly inferior actions
  • Slow convergence to optimal policy
  • Users see more bad experiences

Optimal strategy: Exploration rate should decrease over time. Early on, uncertainty is high—explore more. Later, confidence grows—exploit more.
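One simple way to encode that schedule is a decaying ε; the exponent and the 1% floor below are arbitrary choices for illustration, not recommendations.

```python
def epsilon_schedule(t: int, floor: float = 0.01) -> float:
    """Exploration rate that decays like t^(-1/3), never dropping below `floor`."""
    return max(floor, t ** (-1 / 3))

for t in (1, 10, 100, 1_000, 10_000, 100_000):
    print(t, round(epsilon_schedule(t), 3))   # e.g. 10 -> 0.464, decaying toward the floor
```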


The Confidence-Bound Principle

Many algorithms use optimism under uncertainty: when unsure about an action’s value, assume it might be great.

Upper Confidence Bound (UCB)

Intuition: Add a confidence bonus to each action’s estimate. Choose the action with the highest upper confidence bound.

a_t = \arg\max_a \left\{ \hat{\mu}_a + \sqrt{\frac{2 \log t}{n_a}} \right\}

where:

  • μ̂ₐ = estimated mean reward for action a
  • nₐ = number of times action a has been tried
  • The √ term = confidence radius

Why This Works

  • Well-explored actions (large nₐ): Confidence radius is small. Trust the estimate μ̂ₐ.
  • Under-explored actions (small nₐ): Confidence radius is large. UCB is optimistic.
  • This creates natural exploration: uncertain actions get inflated values and get tried.
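Putting the pieces together, a compact UCB1 sketch for the context-free case might look like this (the toy Bernoulli environment at the bottom is only there to exercise it):

```python
import math
import random

class UCB1:
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm                               # try every arm once first
        ucb = [m + math.sqrt(2 * math.log(self.t) / n)   # estimate + confidence bonus
               for m, n in zip(self.means, self.counts)]
        return ucb.index(max(ucb))

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]   # incremental mean

# Toy usage on a 3-armed Bernoulli problem
bandit = UCB1(3)
true_p = [0.2, 0.5, 0.6]
for _ in range(10_000):
    arm = bandit.select()
    bandit.update(arm, float(random.random() < true_p[arm]))
print(bandit.counts)    # most pulls should concentrate on arm 2
```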

The Math Behind the Confidence Radius

The radius √(2 log t / nₐ) comes from Hoeffding’s inequality:

\mathbb{P}\left(|\hat{\mu}_a - \mu_a| > \sqrt{\frac{2 \log t}{n_a}}\right) \leq \frac{1}{t^4}

With high probability (1 - 1/t⁴), the true mean is within the confidence radius. By using the upper bound, we ensure we don’t prematurely dismiss good actions.

Visual Intuition for UCB

```mermaid
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'fontSize':'14px', 'fontFamily':'monospace' } }}%%
graph LR
    subgraph Action_A[" "]
        A_Est[Estimate: 0.5<br/>Trials: 100<br/>Confidence: ±0.1]
        A_UCB[UCB: 0.6<br/>✓ Well-explored<br/>Small bonus]
    end
    subgraph Action_B[" "]
        B_Est[Estimate: 0.4<br/>Trials: 10<br/>Confidence: ±0.3]
        B_UCB[UCB: 0.7<br/>⚡ Uncertain<br/>Large bonus]
    end
    A_Est --> A_UCB
    B_Est --> B_UCB
    B_UCB -.selected.-> Winner[Action B<br/>chosen despite<br/>lower estimate]
    style Action_A fill:none,stroke:none
    style Action_B fill:none,stroke:none
    style A_Est fill:#334155,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style A_UCB fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px
    style B_Est fill:#334155,stroke:#f59e0b,color:#fde68a,stroke-width:2px
    style B_UCB fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px
    style Winner fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2px
    linkStyle 2 stroke:#06b6d4,stroke-width:2px,stroke-dasharray:5
```

Key insight: Action B has lower estimate (0.4 vs 0.5) but higher UCB (0.7 vs 0.6) due to uncertainty. UCB chooses B to reduce uncertainty—exactly what we want.

LinUCB: Extending to Contextual Bandits

For linear contextual bandits, UCB extends to:

a_t = \arg\max_a \left\{ \theta_a^T x_t + \alpha \sqrt{x_t^T A_a^{-1} x_t} \right\}

where:

  • Aₐ = XₐᵀXₐ + λI (regularized Gram matrix)
  • α = exploration parameter (typically 0.1-2.0)
  • λ = regularization parameter (typically 1.0)

The confidence term √(xᵀAₐ⁻¹x) is the width of the confidence ellipsoid around θ̂ₐ in the direction of x; it captures how uncertain the parameter estimate θₐ is for the current context.

Novel contexts (far from previously seen data) have large confidence radii, encouraging exploration in new regions of the feature space.
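A minimal sketch of the disjoint-model LinUCB selection and update rules described above; α and λ are the tuning knobs from the bullet list, and the matrix inverse is recomputed naively for clarity rather than maintained incrementally.

```python
import numpy as np

class LinUCB:
    def __init__(self, n_actions: int, d: int, alpha: float = 1.0, lam: float = 1.0):
        self.alpha = alpha
        self.A = [lam * np.eye(d) for _ in range(n_actions)]   # A_a = X_a^T X_a + lam I
        self.b = [np.zeros(d) for _ in range(n_actions)]       # b_a = X_a^T r_a

    def select(self, x: np.ndarray) -> int:
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                        # ridge estimate for this action
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, action: int, x: np.ndarray, reward: float) -> None:
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```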


Thompson Sampling: Bayesian Perspective

Instead of confidence bounds, Thompson Sampling uses probability matching.

Algorithm

  1. Maintain posterior distribution P(θₐ | data) for each action’s parameters
  2. Sample θ̃ₐ ~ P(θₐ | data) for each action
  3. Choose action with highest sampled reward: aₜ = argmaxₐ θ̃ₐᵀxₜ
  4. Observe reward, update posteriors

Probability Matching

The probability of choosing action a equals the probability that a is optimal given current knowledge:

\mathbb{P}(\text{choose } a) = \mathbb{P}(a \text{ is optimal} \mid \text{data})

This naturally balances exploration and exploitation through Bayesian updating:

  • Early: Posteriors are diffuse → high probability of sampling various actions → exploration
  • Late: Posteriors concentrate → high probability of sampling best action → exploitation

No manual tuning of exploration rate needed. The uncertainty drives exploration organically.

Why It Works

Thompson Sampling achieves optimal O(√T) regret bounds for many problems while being simpler to implement than UCB variants. The Bayesian framework naturally handles uncertainty.

Example: Bernoulli Bandits

For binary rewards (click/no-click), use Beta-Bernoulli conjugacy:

Prior: θₐ ~ Beta(1, 1) (uniform)

Likelihood: r | θₐ ~ Bernoulli(θₐ)

Posterior after observing data:

\theta_a \mid \text{data} \sim \text{Beta}(\alpha_a, \beta_a)

where:

  • αₐ = 1 + number of successes (clicks)
  • βₐ = 1 + number of failures (no clicks)

To choose action: Sample θ̃ₐ ~ Beta(αₐ, βₐ) for each action, pick argmax θ̃ₐ.
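In code, the whole algorithm for the Bernoulli case fits in a few lines; this sketch keeps integer success/failure counts exactly as described above.

```python
import random

class BetaBernoulliTS:
    def __init__(self, n_arms: int):
        self.alpha = [1] * n_arms    # 1 + successes (clicks)
        self.beta = [1] * n_arms     # 1 + failures (no clicks)

    def select(self) -> int:
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))   # pick the arm with the highest sampled mean

    def update(self, arm: int, reward: int) -> None:
        if reward:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```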

For Linear Contextual Bandits

Assume Gaussian prior and noise:

Prior: θₐ ~ N(0, σ²I)
Noise: ε ~ N(0, σ²)

Posterior:

\theta_a \mid D_a \sim \mathcal{N}(\hat{\theta}_a, \Sigma_a)

where:

  • \hat{\theta}_a = (X^T X + \lambda I)^{-1} X^T r (ridge regression estimate)
  • \Sigma_a = \sigma^2 (X^T X + \lambda I)^{-1} (posterior covariance)

To choose an action: sample \tilde{\theta}_a \sim \mathcal{N}(\hat{\theta}_a, \Sigma_a), compute \tilde{r}_a = \tilde{\theta}_a^T x_t, and pick a_t = \arg\max_a \tilde{r}_a.
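A sketch of that sampling step, reusing the same Aₐ and bₐ statistics as LinUCB; σ is assumed known and the prior scale is folded into λ, both simplifications.

```python
import numpy as np

def linear_ts_choose(x, A_list, b_list, sigma: float = 1.0, rng=None) -> int:
    """A_list[a] = X_a^T X_a + lam*I and b_list[a] = X_a^T r_a, as in LinUCB."""
    rng = rng or np.random.default_rng()
    scores = []
    for A_a, b_a in zip(A_list, b_list):
        A_inv = np.linalg.inv(A_a)
        theta_hat = A_inv @ b_a                                  # posterior (ridge) mean
        cov = sigma ** 2 * A_inv                                 # posterior covariance
        theta_tilde = rng.multivariate_normal(theta_hat, cov)    # one posterior sample
        scores.append(theta_tilde @ x)
    return int(np.argmax(scores))

# Toy usage with untrained (prior-only) statistics for 3 actions in 5 dimensions
d = 5
A_list = [np.eye(d) for _ in range(3)]
b_list = [np.zeros(d) for _ in range(3)]
print(linear_ts_choose(np.ones(d), A_list, b_list))
```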


The Fundamental Limits: Lower Bounds

Can we do better than O(√T) regret? No (in the worst case).

Theorem (Lai & Robbins, 1985)

For any algorithm, there exist problem instances where regret is at least:

R(T) = \Omega(\log T)

for stochastic bandits (MAB).

For contextual bandits with d-dimensional context and linear rewards:

R(T) = \Omega(\sqrt{dT})

Implication

Algorithms achieving O(√(dT log T)) are near-optimal: the √(dT) term matches the lower bound, and the remaining log T factor is only logarithmic slack.

Intuition for the √T Bound

You need to explore each action enough times to distinguish it from the optimal one. If the gap between rewards is Δ, you need roughly O(1/Δ²) samples per action. With K actions, total exploration cost is O(K/Δ²).

If you don’t know Δ in advance and must be robust to small gaps, worst-case regret is O(√(KT)).

This isn’t just theoretical—it explains why bandit algorithms can’t magically achieve zero regret. There’s a fundamental cost to learning under uncertainty.


Key Takeaways

Essential concepts:

Regret R(T) = cumulative opportunity cost vs optimal policy

  • Goal: Achieve sublinear regret R(T) = o(T)
  • Means average regret → 0 as T → ∞ (we’re learning)

Optimal algorithms achieve O(√(dT log T)) for d-dimensional contexts

  • The √T term is unavoidable (information-theoretic limit)
  • The √d factor is the cost of personalization
  • Linear regret O(T) means algorithm never learns (failure)

Exploration-exploitation is fundamental—no free lunch

  • Explore: Try uncertain actions to gather information
  • Exploit: Choose known best action to maximize reward
  • Tradeoff: Short-term reward vs long-term learning

Two principled approaches to balance:

  • UCB: Optimism under uncertainty (confidence-based)
  • Thompson: Posterior sampling (Bayesian)
  • Both achieve near-optimal regret bounds

Reward model matters:

  • Linear: Fast, interpretable, works for d < 500
  • GLM: Handles bounded/discrete rewards
  • Neural: Required for high-dim or complex nonlinear

Practical implications:

| If You See… | It Means… | Action |
| --- | --- | --- |
| Regret growing linearly | Algorithm not learning | Increase exploration (ε or α) |
| High variance in rewards | Unstable estimates | Increase regularization (λ) |
| Slow convergence | Under-exploration | Check confidence bounds |
| Many dimensions (d > 100) | √d penalty hurts | Feature selection or neural |

When to worry about math:

Do worry: Algorithm selection, tuning α/λ, understanding regret curves
Don’t over-optimize: Chasing last 1% theoretical gain—empirical performance matters more



Article series

Adaptive Optimization at Scale: Contextual Bandits from Theory to Production

Part 2 of 5

  1. Part 1 When to Use Contextual Bandits: The Decision Framework
  2. Part 2 Contextual Bandit Theory: Regret Bounds and Exploration
  3. Part 3 Implementing Contextual Bandits: Complete Algorithm Guide
  4. Part 4 Neural Contextual Bandits for High-Dimensional Data
  5. Part 5 Deploying Contextual Bandits: Production Guide and Offline Evaluation
