By Vitor Sousa

Reinforcement Learning Foundations for LLM Alignment

Part 1 of 4: The Mathematical Foundations

TL;DR: Before understanding PPO, GRPO, or any modern LLM alignment algorithm, you need solid RL foundations. This article covers MDPs, policy gradients, the variance problem, value functions, Bellman equations, and advantage estimation—all through the lens of language model fine-tuning. By the end, you’ll understand why these algorithms are designed the way they are.

Reading time: ~35 minutes


Why RL for Language Models?

You’ve trained a language model on terabytes of text. It can complete sentences, answer questions, even write code. But it also happily generates harmful content, confidently states falsehoods, and ignores user preferences. Supervised fine-tuning on curated examples helps, but you can’t enumerate every possible good response.

What you need is a way to optimize for “goodness” directly—to tell the model “responses like this are better than responses like that” and have it learn the pattern. This is exactly what reinforcement learning provides.

The insight behind RLHF (Reinforcement Learning from Human Feedback) is powerful: instead of showing the model correct outputs, we show it a reward signal indicating how good its outputs are. The model then learns to maximize this reward.

But RL for LLMs isn’t straightforward. The action space is enormous (vocabulary size per token), episodes are variable length, rewards are sparse (often only at generation end), and we need stable training for billion-parameter models.

Understanding the foundations deeply will help you see why algorithms like PPO and GRPO make specific design choices—and when those choices might not be right for your problem.


Table of Contents

  1. The MDP Framework
  2. Policies and the Objective
  3. Policy Gradients: Learning Without a Model
  4. The Variance Problem
  5. Value Functions and Bellman Equations
  6. Advantage Functions: The Key Insight
  7. Actor-Critic Methods
  8. From General RL to LLM Fine-Tuning
  9. Key Takeaways

The MDP Framework

Reinforcement learning formalizes sequential decision-making through Markov Decision Processes (MDPs). Understanding this framework precisely will help you see how LLM generation maps onto RL concepts.

Definition

An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

| Component | Symbol | Description |
| --- | --- | --- |
| State space | $\mathcal{S}$ | All possible situations the agent can be in |
| Action space | $\mathcal{A}$ | All possible actions the agent can take |
| Transition dynamics | $P(s' \mid s, a)$ | Probability of reaching state $s'$ from state $s$ after action $a$ |
| Reward function | $R(s, a, s')$ | Immediate reward for the transition $(s, a, s')$ |
| Discount factor | $\gamma \in [0, 1]$ | How much to weight future vs. immediate rewards |

The Markov Property

The “Markov” in MDP refers to the Markov property: the future depends only on the current state, not on how we got there.

$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)$$

This is crucial for tractability—we don’t need to remember the entire history, just the current state.

The Agent-Environment Loop

At each timestep $t$:

  1. Agent observes state $s_t$
  2. Agent selects action $a_t$ according to its policy
  3. Environment transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
  4. Agent receives reward $r_t = R(s_t, a_t, s_{t+1})$
  5. Repeat until terminal state or horizon
[Diagram: the RL interaction loop. The agent (policy $\pi(a \mid s)$) emits action $a_t$ to the environment (dynamics $P(s' \mid s, a)$), which returns feedback $s_{t+1}$ and $r_t$ to the agent.]

LLM Generation as an MDP

For language model fine-tuning, we can map concepts as follows:

| RL Concept | LLM Equivalent |
| --- | --- |
| State $s_t$ | Prompt + tokens generated so far: $(q, o_1, \ldots, o_{t-1})$ |
| Action $a_t$ | Next token $o_t$ from vocabulary $\mathcal{V}$ |
| Transition | Deterministic: append the chosen token to the sequence |
| Reward | Often sparse: $r_t = 0$ for $t < T$, $r_T = R(q, o)$ at the end |
| Episode | One complete generation from prompt to EOS token |

The state space is enormous (all possible token sequences up to context length), and the action space is the vocabulary size (typically 32K-128K tokens). This scale motivates many of the algorithmic choices we’ll see.

Key Insight: The Markov property technically holds for LLM generation—the next token distribution depends only on the current sequence, not on how we generated it. However, the state itself encodes the full history, so we’re not actually discarding information.
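To make the mapping concrete, here is a minimal Python sketch of one generation episode viewed as an MDP. The `policy_sample_token` and `reward_model_score` callables are hypothetical stand-ins for a real policy and reward model, not part of any specific library.

```python
# Minimal sketch: LLM generation as an MDP (hypothetical helper callables).
def generate_episode(prompt_tokens, policy_sample_token, reward_model_score,
                     eos_id, max_new_tokens=64):
    """Roll out one episode: state = prompt + tokens so far, action = next token."""
    state = list(prompt_tokens)                 # s_0 is the prompt
    trajectory = []                             # (state, action, reward) triples
    for _ in range(max_new_tokens):
        action = policy_sample_token(state)     # a_t ~ pi_theta(. | s_t)
        next_state = state + [action]           # deterministic transition: append token
        done = (action == eos_id) or (len(next_state) - len(prompt_tokens) >= max_new_tokens)
        reward = reward_model_score(prompt_tokens, next_state) if done else 0.0  # sparse reward
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory
```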


Policies and the Objective

What is a Policy?

A policy $\pi$ maps states to actions. It can be:

  • Deterministic: $a = \pi(s)$ — always the same action in state $s$
  • Stochastic: $a \sim \pi(a \mid s)$ — a probability distribution over actions

For LLMs, the policy is inherently stochastic: $\pi_\theta(o_t \mid s_t) = \pi_\theta(o_t \mid q, o_{<t})$ is the probability of token $o_t$ given the context. The parameters $\theta$ are the model weights.

The Return

The return $G_t$ is the cumulative (discounted) reward from time $t$ onwards:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$$

The discount factor $\gamma$ serves multiple purposes:

  • Mathematical: Ensures the sum converges for infinite horizons
  • Practical: Encodes preference for immediate vs. delayed rewards
  • LLM setting: Often $\gamma = 1$ since episodes are finite
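The return is easy to compute in code. Here is a short plain-Python sketch that computes $G_t$ for every timestep with a single backward pass; the example rewards are illustrative:

```python
def discounted_returns(rewards, gamma=1.0):
    """G_t = sum_k gamma^k * r_{t+k}, computed backward over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Sparse terminal reward: with gamma = 1 every token shares the same return.
print(discounted_returns([0, 0, 0, 1.0], gamma=1.0))  # [1.0, 1.0, 1.0, 1.0]
print(discounted_returns([0, 0, 0, 1.0], gamma=0.9))  # approx. [0.729, 0.81, 0.9, 1.0]
```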

The RL Objective

The goal of RL is to find a policy that maximizes expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right] = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory sampled by following policy $\pi_\theta$.

The expectation is over:

  • Initial state distribution $s_0 \sim \rho_0$
  • Actions from the policy $a_t \sim \pi_\theta(a \mid s_t)$
  • Transitions from the dynamics $s_{t+1} \sim P(s' \mid s_t, a_t)$

For LLMs Specifically

In the LLM setting, the objective simplifies to:

$$J(\theta) = \mathbb{E}_{q \sim P(Q),\; o \sim \pi_\theta(O \mid q)}[R(q, o)]$$

where:

  • $q$ is a prompt from the prompt distribution $P(Q)$
  • $o = (o_1, \ldots, o_T)$ is a complete response sampled from the model
  • $R(q, o)$ is the reward (typically from a reward model)

With sparse rewards (only at episode end), there’s no discounting to worry about within an episode.
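As a sanity check, this objective can be estimated by simple Monte Carlo sampling. The sketch below assumes two illustrative interfaces: `sample_response(q)` draws $o \sim \pi_\theta(\cdot \mid q)$ and `reward_model(q, o)` returns $R(q, o)$; both names are hypothetical.

```python
def estimate_objective(prompts, sample_response, reward_model, samples_per_prompt=4):
    """Monte Carlo estimate of J(theta) = E_q E_{o ~ pi_theta}[ R(q, o) ]."""
    total, count = 0.0, 0
    for q in prompts:
        for _ in range(samples_per_prompt):
            o = sample_response(q)          # o ~ pi_theta(. | q)
            total += reward_model(q, o)     # R(q, o)
            count += 1
    return total / count
```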


Policy Gradients: Learning Without a Model

How do we optimize $J(\theta)$? We can’t compute it analytically—it requires integrating over all possible trajectories. But we can estimate its gradient and use gradient ascent.

The Policy Gradient Theorem

The policy gradient theorem (Sutton et al., 1999) provides a remarkable result: we can estimate $\nabla_\theta J(\theta)$ using only samples, without knowing the transition dynamics $P$.

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right]$$

Let’s unpack this:

  • $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ is the score function—the direction that increases the probability of action $a_t$
  • $G_t$ is the return from time $t$—how good the outcome was
  • The product says: “increase probability of actions that led to high returns”

Intuition: Credit Assignment

The gradient has an elegant interpretation:

  1. Sample a trajectory by running the policy
  2. For each action taken, compute how much total reward followed
  3. If reward was high, increase that action’s probability
  4. If reward was low, decrease that action’s probability

This is the core of policy gradient methods: reinforce good actions, discourage bad ones.

The REINFORCE Algorithm

The simplest policy gradient algorithm is REINFORCE (Williams, 1992):

Algorithm: REINFORCE

For each episode:
    1. Sample trajectory τ = (s₀, a₀, r₀, ..., s_T, a_T, r_T) using π_θ
    
    2. For t = 0 to T:
        Compute return: G_t = Σₖ γᵏ r_{t+k}
    
    3. Compute gradient estimate:
        ∇̂ = Σₜ ∇_θ log π_θ(aₜ|sₜ) · Gₜ
    
    4. Update: θ ← θ + α∇̂
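In an autodiff framework, the update in step 4 is usually written as a surrogate loss whose gradient equals the REINFORCE estimate. Here is a minimal PyTorch sketch with a toy linear policy; the 4-dimensional state and 3-action setup are illustrative, not an LLM.

```python
import torch

policy = torch.nn.Linear(4, 3)   # toy policy: 4-dim state -> 3 action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_update(states, actions, returns):
    """states: (T, 4) float, actions: (T,) long, returns: (T,) float, all torch tensors."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    # Surrogate loss: minimizing it performs gradient ascent on sum_t log pi(a_t|s_t) * G_t
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```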

Why This Works: The Log-Derivative Trick

The derivation uses the log-derivative trick:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \cdot \nabla_\theta \log \pi_\theta(a \mid s)$$

This converts the gradient of an expectation into an expectation we can estimate by sampling:

$$\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)] = \mathbb{E}_{a \sim \pi_\theta}\left[f(a) \cdot \nabla_\theta \log \pi_\theta(a \mid s)\right]$$

The full derivation for trajectories is more involved but follows the same principle.

Key Insight: Policy gradients are model-free—we don’t need to know $P(s' \mid s, a)$. We only need to sample from the environment and observe rewards. This is crucial for LLMs where the “environment” is just string concatenation.


The Variance Problem

REINFORCE is elegant but has a critical flaw: high variance. The gradient estimates are so noisy that learning is impractically slow.

Sources of Variance

Consider what affects the gradient estimate $\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t$:

  1. Stochastic policy: Different action samples give different gradients
  2. Stochastic transitions: Same action can lead to different states
  3. Long horizons: $G_t$ includes all future rewards, accumulating randomness
  4. Sparse rewards: Most of the signal comes from rare events

A Concrete Example

Suppose you’re training an LLM to solve math problems. The reward is +1 for correct, 0 for incorrect.

Episode 1: Model generates a correct solution. Every token gets gradient proportional to +1.

Episode 2: Model generates an incorrect solution. Every token gets gradient proportional to 0.

The problem: In episode 1, all tokens get positive reinforcement—including lucky guesses, unnecessary steps, and tokens that happened to work but weren’t essential. The gradient doesn’t distinguish “this token was crucial” from “this token was irrelevant.”

Variance Reduction with Baselines

A key insight: we can subtract any baseline $b(s_t)$ that doesn’t depend on the action without changing the expected gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t))\right]$$

Why doesn’t this change the expectation?

$$\mathbb{E}_{a \sim \pi}\left[\nabla_\theta \log \pi_\theta(a \mid s) \cdot b(s)\right] = b(s) \cdot \mathbb{E}_{a \sim \pi}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s) \cdot \nabla_\theta 1 = 0$$

The baseline can be anything that doesn’t depend on the action—but choosing it well dramatically reduces variance.
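Here is a small numerical check of the derivation above for a softmax policy over three actions (NumPy, toy numbers): the baseline-weighted score has zero expectation because $\sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)                     # logits of a softmax policy over 3 actions
pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(a):
    """d/dtheta log softmax(theta)[a] = onehot(a) - pi."""
    return np.eye(3)[a] - pi

b = 5.0                                        # any action-independent baseline
expected = sum(pi[a] * grad_log_pi(a) * b for a in range(3))
print(np.round(expected, 12))                  # approx. [0. 0. 0.]
```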

Optimal Baseline

The variance-minimizing baseline is approximately the expected return from state $s_t$:

$$b^*(s_t) \approx \mathbb{E}[G_t \mid s_t]$$

This is the value function $V(s_t)$, which we’ll explore next.

Intuition: Relative Performance

With a good baseline, the gradient update becomes:

$$\nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \big(G_t - V(s_t)\big)$$

This says:

  • If $G_t > V(s_t)$: the action was better than expected → increase its probability
  • If $G_t < V(s_t)$: the action was worse than expected → decrease its probability
  • If $G_t = V(s_t)$: the action was exactly as expected → no change

This relative signal is far more informative than absolute returns.

[Diagram: without a baseline (REINFORCE), returns of $G = 100$, $95$, and $0$ all yield large positive or zero gradients; with a baseline $V(s) = 97$, the same episodes yield advantages of $+3$, $-2$, and $-97$, a relative signal that reduces variance.]

Key Insight: The baseline transforms absolute returns into relative performance measures. This is the foundation of advantage-based methods—and explains why GRPO’s group normalization works.


Value Functions and Bellman Equations

To use $V(s)$ as a baseline, we need to estimate it. Value functions are central to RL and worth understanding deeply.

State Value Function

The state value function $V^\pi(s)$ is the expected return starting from state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s\right]$$

This answers: “How good is it to be in state $s$?”

Action Value Function

The action value function (or Q-function) $Q^\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s, a_t = a\right]$$

This answers: “How good is it to take action $a$ in state $s$?”

Relationship Between V and Q

The value function is the expected Q-value under the policy:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi}[Q^\pi(s, a)] = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$

The Bellman Equation

Value functions satisfy a recursive relationship called the Bellman equation:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[r(s, a, s') + \gamma V^\pi(s')\right]$$

This says: “The value of a state equals the expected immediate reward plus the discounted value of the next state.”

For Q-functions:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')\right]$$

Why Bellman Equations Matter

The Bellman equation enables bootstrapping: we can estimate $V(s)$ using our current estimate of $V(s')$, without waiting for the episode to end.

Monte Carlo: Wait for the episode to end, compute $G_t = \sum_k \gamma^k r_{t+k}$, then update $V(s_t) \leftarrow V(s_t) + \alpha\,(G_t - V(s_t))$

Temporal Difference (TD): After one step, update using $V(s_t) \leftarrow V(s_t) + \alpha\,(r_t + \gamma V(s_{t+1}) - V(s_t))$

TD learning has lower variance (one-step update vs. full episode) but introduces bias (it uses an estimate of $V(s_{t+1})$ instead of the true expected return).
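The two update rules, written as code for a tabular value function (a plain-Python sketch; `V` is a dict mapping states to value estimates, and the states and rewards are illustrative):

```python
def mc_update(V, s_t, G_t, alpha=0.1):
    """Monte Carlo: move V(s_t) toward the observed full return G_t."""
    V[s_t] += alpha * (G_t - V[s_t])

def td_update(V, s_t, r_t, s_next, alpha=0.1, gamma=1.0):
    """TD(0): move V(s_t) toward the bootstrapped target r_t + gamma * V(s_{t+1})."""
    V[s_t] += alpha * (r_t + gamma * V[s_next] - V[s_t])

V = {"s0": 0.0, "s1": 0.5}
td_update(V, "s0", r_t=0.0, s_next="s1")   # uses the estimate V("s1") immediately
mc_update(V, "s0", G_t=1.0)                # waits until the full return is known
```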

TD Error

The TD error is the difference between our estimate and the bootstrapped target:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

If our value function is accurate, $\mathbb{E}[\delta_t] = 0$. The TD error measures how “surprised” we are by what happened.

Key Insight: Bellman equations let us learn value functions incrementally, updating after each step rather than waiting for episodes to complete. This is especially valuable for long episodes like LLM generation.


Advantage Functions: The Key Insight

The advantage function combines $V$ and $Q$ to measure how much better an action is than average:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$$

Interpretation

  • $A(s, a) > 0$: action $a$ is better than the average action in state $s$
  • $A(s, a) < 0$: action $a$ is worse than average
  • $A(s, a) = 0$: action $a$ is exactly average

The advantage answers the crucial question: “Is this action better or worse than what I’d typically do?”

Why Advantages for Policy Gradients?

Using advantages in the policy gradient gives:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A^\pi(s_t, a_t)\right]$$

This is optimal for variance reduction because:

  1. $\mathbb{E}_a[A(s, a)] = 0$ by construction (advantages are centered)
  2. The signal focuses on relative action quality
  3. It separates “was the action good?” from “was the state good?”

Estimating Advantages

We rarely know $A^\pi$ exactly. Common estimators:

1. Monte Carlo Advantage: $\hat{A}_t = G_t - V(s_t)$. Use actual returns minus the estimated value. Unbiased but high variance.

2. TD Advantage: $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta_t$. The one-step TD error. Low variance but biased.

3. Generalized Advantage Estimation (GAE): $\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}$. Interpolates between MC ($\lambda = 1$) and TD ($\lambda = 0$). We’ll explore GAE deeply in Part 2; a sketch of all three estimators follows below.
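The sketch below computes all three estimators in NumPy, assuming `values[t]` holds a learned estimate of $V(s_t)$ and `values[T]` is the value after the terminal step (taken as 0 here); the numbers are illustrative.

```python
import numpy as np

def td_errors(rewards, values, gamma=1.0):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    r, v = np.asarray(rewards, float), np.asarray(values, float)
    return r + gamma * v[1:] - v[:-1]

def gae(rewards, values, gamma=1.0, lam=0.95):
    """A_t = sum_k (gamma * lambda)^k * delta_{t+k}, via a backward recursion."""
    deltas = td_errors(rewards, values, gamma)
    adv, running = np.zeros_like(deltas), 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]             # sparse terminal reward
values  = [0.4, 0.6, 0.8, 0.0]        # V(s_0), V(s_1), V(s_2), V(terminal) = 0
print(gae(rewards, values, lam=0.0))  # TD advantages: the one-step deltas
print(gae(rewards, values, lam=1.0))  # Monte Carlo advantages: G_t - V(s_t)
```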

Properties of Good Advantage Estimators

| Property | Description |
| --- | --- |
| Low bias | $\mathbb{E}[\hat{A}] \approx A$ — estimates are accurate on average |
| Low variance | Estimates don’t fluctuate wildly between samples |
| Computational efficiency | Can be computed without excessive overhead |

There’s typically a bias-variance tradeoff: lower variance estimators introduce more bias (by relying on learned value functions).

Key Insight: The advantage function is the “right” quantity for policy gradients. It tells us exactly what we want to know: was this action better or worse than average? All modern policy optimization methods (PPO, GRPO, etc.) are fundamentally about estimating advantages well.


Actor-Critic Methods

Actor-critic methods combine policy gradients (the “actor”) with learned value functions (the “critic”). This architecture underlies PPO and most modern RL algorithms.

The Two Components

Actor: The policy $\pi_\theta(a \mid s)$ that selects actions

  • Parametrized by $\theta$
  • Optimized via policy gradients
  • Goal: maximize expected return

Critic: The value function $V_\psi(s)$ or $Q_\psi(s, a)$

  • Parametrized by $\psi$
  • Optimized via Bellman equation (TD learning)
  • Goal: accurately estimate expected returns

Why Two Networks?

Each component benefits the other:

  1. Critic helps actor: Provides baseline/advantage estimates for lower-variance policy gradients
  2. Actor helps critic: Generates on-policy data for value function training
%%{init: { 'theme':'base', 'themeVariables': { 'primaryColor':'#0b1220', 'primaryTextColor':'#e5e7eb', 'primaryBorderColor':'#10b981', 'lineColor':'#06b6d4', 'secondaryColor':'#0f172a', 'tertiaryColor':'#1e293b', 'fontSize':'12px', 'fontFamily':'monospace' } }}%% graph TB subgraph AC["<b>Actor-Critic Architecture</b>"] direction TB S["<b>State s_t</b>"] subgraph Networks["Neural Networks"] direction LR Actor["<b>Actor π_θ</b><br/>━━━━━━━━<br/>Policy network<br/>Outputs P(a|s)"] Critic["<b>Critic V_ψ</b><br/>━━━━━━━━<br/>Value network<br/>Outputs V(s)"] end subgraph Environment["Environment Interaction"] direction LR Action["Action<br/><i>a_t ~ π(·|s)</i>"] Env["Environment"] Reward["Reward<br/><i>r_t</i>"] NextS["Next State<br/><i>s_t+1</i>"] end Adv["<b>Advantage</b><br/>━━━━━━━━<br/>A = r_t + γV(s_t+1) − V(s_t)"] subgraph Updates["Parameter Updates"] direction LR ActorUpdate["<b>Actor Update</b><br/>━━━━━━━━<br/>θ ← θ + α∇_θ log π · A"] CriticUpdate["<b>Critic Update</b><br/>━━━━━━━━<br/>ψ ← ψ − α∇_ψ (V − target)²"] end end S --> Actor S --> Critic Actor --> Action Action --> Env Env --> Reward Env --> NextS Critic --> Adv Reward --> Adv NextS -.->|"V(s_t+1)"| Adv Adv --> ActorUpdate Adv --> CriticUpdate ActorUpdate -.->|"update θ"| Actor CriticUpdate -.->|"update ψ"| Critic style S fill:#0f172a,stroke:#8b5cf6,color:#c4b5fd,stroke-width:2px style Actor fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2.5px style Critic fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2.5px style Action fill:#334155,stroke:#64748b,color:#e2e8f0,stroke-width:1.5px style Env fill:#334155,stroke:#64748b,color:#e2e8f0,stroke-width:1.5px style Reward fill:#334155,stroke:#64748b,color:#e2e8f0,stroke-width:1.5px style NextS fill:#334155,stroke:#64748b,color:#e2e8f0,stroke-width:1.5px style Adv fill:#1e293b,stroke:#f59e0b,color:#fde68a,stroke-width:2.5px style ActorUpdate fill:#1e293b,stroke:#10b981,color:#d1fae5,stroke-width:2px style CriticUpdate fill:#1e293b,stroke:#06b6d4,color:#cffafe,stroke-width:2px style Networks fill:#0f172a,stroke:#475569,color:#94a3b8,stroke-width:1px style Environment fill:#0f172a,stroke:#475569,color:#94a3b8,stroke-width:1px style Updates fill:#0f172a,stroke:#475569,color:#94a3b8,stroke-width:1px style AC fill:none,stroke:#334155,color:#94a3b8,stroke-width:1px

The A2C Algorithm

Advantage Actor-Critic (A2C) is a foundational actor-critic algorithm:

Algorithm: A2C

For each batch of experience:
    1. Collect trajectories using current policy π_θ
    
    2. Compute advantages:
        For each timestep t:
            δ_t = r_t + γV_ψ(s_{t+1}) - V_ψ(s_t)  # TD error
            (Or use GAE for multi-step advantages)
    
    3. Update critic (minimize value loss):
        L_critic = Σ_t (V_ψ(s_t) - G_t)²  # Or TD targets
        ψ ← ψ - α_critic ∇_ψ L_critic
    
    4. Update actor (maximize policy objective):
        L_actor = Σ_t log π_θ(a_t|s_t) · Â_t
        θ ← θ + α_actor ∇_θ L_actor
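The same two updates in a minimal PyTorch sketch, with separate toy actor and critic networks; the 4-dimensional state, 3 actions, and the 0.5 loss weighting are illustrative choices, and for an LLM the actor would be the language model itself.

```python
import torch

actor = torch.nn.Linear(4, 3)    # toy policy head: state -> action logits
critic = torch.nn.Linear(4, 1)   # toy value head: state -> V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def a2c_update(states, actions, td_targets):
    """states: (T, 4), actions: (T,), td_targets: (T,) e.g. r_t + gamma * V(s_{t+1})."""
    values = critic(states).squeeze(-1)
    advantages = (td_targets - values).detach()   # treated as a constant in the actor loss
    log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_probs * advantages).sum()
    critic_loss = ((values - td_targets) ** 2).sum()
    loss = actor_loss + 0.5 * critic_loss         # joint objective with a value-loss weight
    opt.zero_grad()
    loss.backward()
    opt.step()
    return actor_loss.item(), critic_loss.item()
```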

On-Policy vs. Off-Policy

On-policy methods (A2C, PPO, GRPO) require data from the current policy:

  • Samples from $\pi_\theta$ are used to update $\pi_\theta$
  • Must discard data after each update
  • Simpler theory but less sample efficient

Off-policy methods (DQN, SAC) can reuse old data:

  • Samples from any policy can update $\pi_\theta$
  • Requires importance sampling or replay buffers
  • More sample efficient but more complex

LLM fine-tuning typically uses on-policy methods because:

  1. The state space (text) is too large for replay buffers
  2. On-policy methods are more stable for large models
  3. Sample efficiency matters less when each “sample” is a full text generation

Challenges in Actor-Critic

  1. Training instability: Actor and critic must improve together; if one diverges, both fail
  2. Sample efficiency: On-policy methods need fresh samples after each update
  3. Hyperparameter sensitivity: Learning rates, advantage estimation, entropy bonuses all interact
  4. Scale: For LLMs, the critic is another massive network to train

Key Insight: Actor-critic methods are powerful but complex. The critic provides essential variance reduction, but at the cost of doubled model capacity and potential instability. This motivates GRPO’s approach of eliminating the critic entirely.


From General RL to LLM Fine-Tuning

Let’s map everything we’ve learned to the specific setting of language model alignment.

The LLM-RL Correspondence

| General RL | LLM Fine-Tuning |
| --- | --- |
| State $s$ | Prompt + generated tokens $(q, o_{<t})$ |
| Action $a$ | Next token $o_t \in \mathcal{V}$ |
| Policy $\pi(a \mid s)$ | Next-token distribution $\pi_\theta(o_t \mid q, o_{<t})$ |
| Transition $P(s' \mid s, a)$ | Deterministic: append the chosen token to the context |
| Reward $R$ | Reward model $r_\phi(q, o)$ (often sparse) |
| Episode | One complete generation |
| Value function $V(s)$ | Expected reward from a partial generation |

Unique Challenges for LLMs

1. Enormous Action Space

Vocabulary sizes of 32K-128K tokens mean:

  • Can’t enumerate all actions
  • Must use function approximation (the LLM itself)
  • Exploration is implicit in stochastic sampling

2. Sparse Rewards

Typically, reward comes only at generation end:

  • $r_t = 0$ for $t < T$
  • $r_T = R(q, o)$ — the reward model score

This makes credit assignment hard: which tokens were responsible for the reward?

3. Variable-Length Episodes

Generations can be 10 tokens or 1000 tokens:

  • Can’t use fixed-horizon methods
  • Must handle EOS token properly
  • Padding and masking complexities

4. The Reference Model Constraint

Unlike standard RL, we don’t want to maximize reward unconditionally. We want to improve while staying close to a reference model:

$$\max_\theta\; \mathbb{E}[R(q, o)] - \beta \cdot \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$

This prevents:

  • Reward hacking: Finding exploits in the reward model
  • Mode collapse: Generating the same high-reward response always
  • Capability loss: Forgetting useful behaviors from pretraining
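In practice, the KL term is often applied as a per-token penalty folded into the reward. Here is a sketch of that shaping, assuming detached per-token log-probabilities from the current policy and the frozen reference model; the function name and the value of $\beta$ are illustrative.

```python
import torch

def kl_shaped_rewards(logp_policy, logp_ref, terminal_reward, beta=0.05):
    """logp_policy, logp_ref: (T,) detached log pi_theta(o_t|q,o_<t) and log pi_ref(o_t|q,o_<t).
    Returns per-token rewards: a KL penalty at every token plus R(q, o) at the final token."""
    rewards = -beta * (logp_policy - logp_ref)   # sampled per-token estimate of -beta * KL
    rewards[-1] = rewards[-1] + terminal_reward  # sparse reward model score at the end
    return rewards
```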

5. Scale

Training billion-parameter models requires:

  • Massive memory for model weights, gradients, optimizer states
  • Distributed training across many GPUs
  • Careful numerical stability

The RLHF Pipeline

[Diagram: the RLHF training pipeline. An SFT model (supervised fine-tuned on demonstrations) initializes the policy and reference; a reward model trained on human preference comparisons provides the reward signal; RL training (PPO, GRPO, etc.) optimizes the policy into the aligned model.]

What’s Coming Next

With these foundations, we’re ready to understand PPO (Part 2):

  • How clipping provides stable updates
  • How GAE estimates advantages
  • Why it requires four models
  • And why that’s a problem

Then GRPO (Part 3) will show how to eliminate the critic by clever use of group statistics.

Finally, GDPO (Part 4) will address multi-reward settings where GRPO falls short.


Key Takeaways

The MDP Framework

  • RL formalizes sequential decisions as states, actions, rewards
  • Markov property: future depends only on current state
  • LLM generation maps naturally to this framework

Policy Gradients

  • We can optimize expected reward using gradient ascent
  • The policy gradient theorem enables learning without knowing dynamics
  • REINFORCE is simple but high-variance

The Variance Problem

  • Raw policy gradients are too noisy for practical use
  • Baselines reduce variance without changing expected gradient
  • The optimal baseline is approximately the value function

Value Functions

  • $V(s)$ = expected return from state $s$
  • $Q(s, a)$ = expected return from state $s$ after taking action $a$
  • Bellman equations enable bootstrapped learning

Advantage Functions

  • $A(s, a) = Q(s, a) - V(s)$ measures relative action quality
  • Advantages are centered: $\mathbb{E}_a[A(s, a)] = 0$
  • This is the key quantity for policy optimization

Actor-Critic

  • Actor (policy) + Critic (value function) work together
  • Critic provides variance reduction for actor updates
  • Doubles the model capacity required

LLM-Specific Considerations

  • Sparse rewards make credit assignment hard
  • Reference model constraint prevents reward hacking
  • Scale demands memory-efficient algorithms

What’s Next

In Part 2: PPO for Language Models, we’ll see how Proximal Policy Optimization addresses many challenges:

  • Trust regions for stable updates
  • GAE for flexible advantage estimation
  • Clipping for simplicity

But we’ll also see PPO’s costs: the four-model architecture that strains GPU memory, and the complexity that makes implementation tricky. This sets the stage for GRPO’s elegant simplification.

