GDPO: Multi-Reward RL Done Right
Part 4 of 4: The Multi-Reward Frontier
TL;DR: GRPO works beautifully for single rewards, but breaks down with multiple rewards. When you normalize the sum of rewards, distinct reward combinations collapse to identical advantages—you lose training signal resolution. GDPO (Group reward-Decoupled normalization Policy Optimization) fixes this by normalizing each reward independently, then combining. Simple change, significant improvement for tool calling, math reasoning, and any multi-objective alignment task.
Reading time: ~25 minutes
Prerequisites: Part 3: GRPO covers group-relative advantage estimation.
The Multi-Reward Reality
Modern LLM training rarely optimizes a single reward. Consider these common scenarios:
Tool Calling:
- Format reward: Does the output follow the required JSON schema?
- Correctness reward: Are the tool calls semantically correct?
Math Reasoning:
- Format reward: Does the output use `<think>` and `<answer>` tags?
- Correctness reward: Is the final answer mathematically correct?
- Integer reward: Is the answer an integer (not a float)?
Coding:
- Execution reward: Does the code run without errors?
- Correctness reward: Does it pass test cases?
- Style reward: Does it follow conventions?
The naive approach: sum the rewards and apply GRPO. This doesn’t work well.
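To make the naive setup concrete, here is a small sketch with two toy reward functions (the functions and the example responses are hypothetical stand-ins for illustration, not taken from the GDPO paper):

```python
# Two toy reward functions; real ones would check a JSON schema or a ground-truth answer.
def format_reward(response: str) -> float:
    return 1.0 if "<answer>" in response and "</answer>" in response else 0.0

def correctness_reward(response: str) -> float:
    return 1.0 if "42" in response else 0.0

weights = [1.0, 1.0]

# Naive combination: collapse everything into one scalar before any normalization.
wrong_but_formatted = "<answer>41</answer>"
right_but_unformatted = "The answer is 42"
for resp in (wrong_but_formatted, right_but_unformatted):
    combined = sum(w * r(resp) for w, r in zip(weights, (format_reward, correctness_reward)))
    print(combined)  # 1.0 both times: two very different failure modes, one number
```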
Table of Contents
- The Advantage Collapse Problem
- Mathematical Analysis
- GDPO: The Solution
- Implementation
- Experimental Results
- When to Use GDPO vs GRPO
- Production Patterns
- Key Takeaways
The Advantage Collapse Problem
GRPO with Multiple Rewards: The Naive Approach
Suppose we have two rewards: format ($r^{\text{format}}$) and correctness ($r^{\text{correct}}$). The naive approach:
- Combine rewards: $R_i = w_1\, r_i^{\text{format}} + w_2\, r_i^{\text{correct}}$
- Apply GRPO normalization: $A_i = \dfrac{R_i - \text{mean}(\{R_j\}_{j=1}^{G})}{\text{std}(\{R_j\}_{j=1}^{G})}$
This is exactly what happens if you use standard GRPO with a multi-reward setup.
The Collapse
Consider four outputs with these reward combinations:
| Output | Format | Correctness | Combined |
|---|---|---|---|
| $o_1$ | 1.0 | 0.0 | 1.0 |
| $o_2$ | 0.0 | 1.0 | 1.0 |
| $o_3$ | 1.0 | 1.0 | 2.0 |
| $o_4$ | 0.0 | 0.0 | 0.0 |
Now apply GRPO normalization to the combined rewards (group mean 1.0, population standard deviation ≈ 0.71):
| Output | Combined | Advantage |
|---|---|---|
| $o_1$ | 1.0 | 0.0 |
| $o_2$ | 1.0 | 0.0 |
| $o_3$ | 2.0 | +1.41 |
| $o_4$ | 0.0 | -1.41 |
The problem: Outputs $o_1$ and $o_2$ get identical advantages (zero), despite having completely different reward profiles:
- $o_1$: Good format, bad correctness
- $o_2$: Bad format, good correctness
The normalization has collapsed these distinct cases into the same training signal.
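To see the collapse concretely, here is a minimal check of the two tables above. It uses the population standard deviation to match the worked example; note that torch's default `.std()` is the unbiased estimator, so a real implementation's numbers differ slightly:

```python
import torch

# Rewards for the four outputs in the table above: [format, correctness].
rewards = torch.tensor([
    [1.0, 0.0],  # o1: good format, wrong answer
    [0.0, 1.0],  # o2: broken format, right answer
    [1.0, 1.0],  # o3: both good
    [0.0, 0.0],  # o4: both bad
])

combined = rewards.sum(dim=1)                     # tensor([1., 1., 2., 0.])
mu = combined.mean()                              # 1.0
sigma = ((combined - mu) ** 2).mean().sqrt()      # population std ≈ 0.71
adv = (combined - mu) / sigma
print(adv)  # tensor([ 0.0000,  0.0000,  1.4142, -1.4142]) -- o1 and o2 are indistinguishable
```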
Why This Matters
With advantage collapse:
- Lost resolution: The model can’t distinguish “format good, correctness bad” from “format bad, correctness good”
- Suboptimal convergence: Training signal is weaker than it could be
- Training instability: In extreme cases, can cause early training failure
Key Insight: When rewards measure different aspects (format vs. correctness), combining them before normalization destroys information about which aspect each output excels at.
Mathematical Analysis
Let’s formalize why advantage collapse happens.
GRPO’s Combined Normalization
Given $K$ reward functions $r^{(1)}, \dots, r^{(K)}$ with weights $w_1, \dots, w_K$, GRPO computes:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}, \qquad R_i = \sum_{k=1}^{K} w_k\, r_i^{(k)}$$

where $\mu_R = \text{mean}(\{R_j\}_{j=1}^{G})$ and $\sigma_R = \text{std}(\{R_j\}_{j=1}^{G})$ are computed over the group of $G$ outputs.
The Collapse Condition
Outputs $o_i$ and $o_j$ have identical advantages when:

$$\sum_{k=1}^{K} w_k\, r_i^{(k)} = \sum_{k=1}^{K} w_k\, r_j^{(k)}$$
This happens whenever different reward combinations sum to the same value—regardless of which specific rewards contributed.
Information Loss
Consider the space of possible reward vectors in $\mathbb{R}^K$ ($K$ rewards). GRPO projects this space to $\mathbb{R}$ (the weighted sum), then normalizes.
Information lost: All reward vectors mapping to the same sum become indistinguishable.
For $K = 2$ rewards with equal weights, the level sets are lines in 2D:
- $(1.0,\ 0.0)$, $(0.0,\ 1.0)$, and $(0.5,\ 0.5)$ all map to the sum $1.0$
- These represent very different behaviors, but get identical treatment
Variance Collapse
Even more problematic: when rewards are highly correlated or when one dominates, the variance of combined rewards shrinks.
Example: If format reward is always 1 (all outputs follow format), then:
- Variance comes only from correctness
- Format reward contributes nothing to the training signal
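A small sketch of this degenerate case, assuming a made-up group of four outputs that all satisfy the format:

```python
import torch

# Hypothetical group of 4 outputs where format is already solved everywhere.
format_r  = torch.tensor([1.0, 1.0, 1.0, 1.0])
correct_r = torch.tensor([0.0, 1.0, 1.0, 0.0])

combined = format_r + correct_r                       # tensor([1., 2., 2., 1.])
adv = (combined - combined.mean()) / (combined.std() + 1e-4)

# The constant format reward only shifts the mean; after normalization the
# training signal is identical to using the correctness reward alone.
adv_correct_only = (correct_r - correct_r.mean()) / (correct_r.std() + 1e-4)
print(torch.allclose(adv, adv_correct_only))          # True
```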
GDPO: The Solution
Group reward-Decoupled normalization Policy Optimization (GDPO) normalizes each reward independently before combining.
The GDPO Algorithm
For each reward $k \in \{1, \dots, K\}$ and each output $o_i$ in a group of $G$:

Step 1: Normalize each reward independently

$$A_i^{(k)} = \frac{r_i^{(k)} - \mu_k}{\sigma_k}$$

where $\mu_k = \text{mean}(\{r_j^{(k)}\}_{j=1}^{G})$ and $\sigma_k = \text{std}(\{r_j^{(k)}\}_{j=1}^{G})$.

Step 2: Combine normalized advantages

$$\tilde{A}_i = \sum_{k=1}^{K} w_k\, A_i^{(k)}$$

Step 3: Final batch normalization

$$A_i = \frac{\tilde{A}_i - \text{mean}(\{\tilde{A}_j\})}{\text{std}(\{\tilde{A}_j\})}$$
Why This Works
Decoupled normalization preserves the individual reward signals. Going back to our example:
| Output | $r^{\text{format}}$ | $r^{\text{correct}}$ | $A^{\text{format}}$ | $A^{\text{correct}}$ | $\tilde{A}$ |
|---|---|---|---|---|---|
| $o_1$ | 1.0 | 0.0 | +1.0 | -1.0 | 0.0 |
| $o_2$ | 0.0 | 1.0 | -1.0 | +1.0 | 0.0 |
| $o_3$ | 1.0 | 1.0 | +1.0 | +1.0 | +2.0 |
| $o_4$ | 0.0 | 0.0 | -1.0 | -1.0 | -2.0 |
Now $o_1$ and $o_2$ still have the same final advantage (after Step 3), but the per-reward advantages that produce it are different:
For $o_1$: the gradient encourages improving correctness (negative $A^{\text{correct}}$) while maintaining format (positive $A^{\text{format}}$)
For $o_2$: the gradient encourages improving format (negative $A^{\text{format}}$) while maintaining correctness (positive $A^{\text{correct}}$)
The model learns which specific reward to improve for each output.
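Running the same four outputs through decoupled normalization reproduces the table above. This is a minimal sketch of Steps 1 and 2; it uses the population standard deviation to match the ±1.0 values and drops the epsilon term since no reward is constant within the group (the article's `gdpo_advantages` uses torch's unbiased std plus an epsilon):

```python
import torch

rewards = torch.tensor([
    [1.0, 0.0],  # o1
    [0.0, 1.0],  # o2
    [1.0, 1.0],  # o3
    [0.0, 0.0],  # o4
])

def pop_normalize(x: torch.Tensor) -> torch.Tensor:
    mu = x.mean()
    sigma = ((x - mu) ** 2).mean().sqrt()   # population std; nonzero for both columns here
    return (x - mu) / sigma

# Step 1: normalize each reward column independently.
per_reward_adv = torch.stack([pop_normalize(rewards[:, k]) for k in range(2)], dim=1)
print(per_reward_adv)
# tensor([[ 1., -1.],
#         [-1.,  1.],
#         [ 1.,  1.],
#         [-1., -1.]])   -- o1 and o2 now carry opposite per-reward signals

# Step 2: combine with equal weights.
combined = per_reward_adv.sum(dim=1)        # tensor([ 0.,  0.,  2., -2.])
# Step 3 (final batch normalization) rescales these values but keeps the ordering.
```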
Key Insight: GDPO’s decoupled normalization ensures each reward contributes meaningfully to the gradient, regardless of its scale or correlation with other rewards.
Implementation
GDPO is a straightforward modification to GRPO. Here’s the key difference:
GRPO (Original)
def grpo_advantages(rewards_per_func, reward_weights, num_generations):
"""
GRPO: Normalize combined rewards.
Args:
rewards_per_func: [batch, K] rewards from K reward functions
reward_weights: [K] weights for each reward
num_generations: G, outputs per prompt
"""
# Combine rewards first
rewards = (rewards_per_func * reward_weights).sum(dim=1) # [batch]
# Then normalize
mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
std_grouped = rewards.view(-1, num_generations).std(dim=1)
mean_grouped = mean_grouped.repeat_interleave(num_generations)
std_grouped = std_grouped.repeat_interleave(num_generations)
advantages = (rewards - mean_grouped) / (std_grouped + 1e-4)
return advantages
GDPO (Decoupled)
def gdpo_advantages(rewards_per_func, reward_weights, num_generations):
"""
GDPO: Normalize each reward independently, then combine.
Args:
rewards_per_func: [batch, K] rewards from K reward functions
reward_weights: [K] weights for each reward
num_generations: G, outputs per prompt
"""
K = len(reward_weights)
device = rewards_per_func.device
# Handle NaN values
rewards_per_func = torch.nan_to_num(rewards_per_func)
all_reward_advantages = []
# Step 1: Normalize each reward independently
for k in range(K):
reward_k = rewards_per_func[:, k] # [batch]
# Group-wise normalization for this reward
mean_k = reward_k.view(-1, num_generations).mean(dim=1)
std_k = reward_k.view(-1, num_generations).std(dim=1)
mean_k = mean_k.repeat_interleave(num_generations)
std_k = std_k.repeat_interleave(num_generations)
advantage_k = (reward_k - mean_k) / (std_k + 1e-4)
all_reward_advantages.append(advantage_k)
# Step 2: Combine normalized advantages
combined_advantages = torch.stack(all_reward_advantages, dim=1) # [batch, K]
pre_bn_advantages = (combined_advantages * reward_weights).sum(dim=1) # [batch]
# Step 3: Final batch normalization
bn_mean = pre_bn_advantages.mean()
bn_std = pre_bn_advantages.std()
advantages = (pre_bn_advantages - bn_mean) / (bn_std + 1e-4)
return advantages
Complete GDPO Trainer
"""
Group reward-Decoupled normalization Policy Optimization (GDPO)
===============================================================
Extends GRPO for multi-reward settings by normalizing each
reward independently before combining.
Reference: NVIDIA (Liu et al., 2026)
"""
import torch
import torch.nn.functional as F
from torch import Tensor
from dataclasses import dataclass
from typing import List, Tuple, Callable
@dataclass
class GDPOConfig:
"""GDPO hyperparameters."""
group_size: int = 64
clip_epsilon: float = 0.2
kl_coef: float = 0.04
learning_rate: float = 1e-6
max_grad_norm: float = 1.0
eps: float = 1e-4
def compute_gdpo_advantages(
rewards_list: List[Tensor],
reward_weights: List[float],
eps: float = 1e-4,
) -> Tuple[Tensor, List[Tensor]]:
"""
Compute GDPO advantages with decoupled normalization.
Args:
rewards_list: List of [G] tensors, one per reward function
reward_weights: Weights for each reward
eps: Numerical stability constant
Returns:
advantages: [G] final normalized advantages
per_reward_advantages: List of [G] tensors for logging
"""
K = len(rewards_list)
G = rewards_list[0].shape[0]
device = rewards_list[0].device
# Step 1: Normalize each reward independently
normalized_rewards = []
for k in range(K):
r_k = rewards_list[k]
mean_k = r_k.mean()
std_k = r_k.std() + eps
normalized_k = (r_k - mean_k) / std_k
normalized_rewards.append(normalized_k)
# Step 2: Combine with weights
combined = torch.zeros(G, device=device)
for k in range(K):
combined += reward_weights[k] * normalized_rewards[k]
# Step 3: Final batch normalization
final_mean = combined.mean()
final_std = combined.std() + eps
advantages = (combined - final_mean) / final_std
return advantages, normalized_rewards
class GDPOTrainer:
"""
GDPO trainer for multi-reward LLM alignment.
Key difference from GRPO: each reward is normalized
independently before combining, preserving signal resolution.
"""
def __init__(
self,
policy_model: torch.nn.Module,
reference_model: torch.nn.Module,
reward_functions: List[Callable],
reward_weights: List[float],
tokenizer,
config: GDPOConfig,
):
self.policy = policy_model
self.reference = reference_model
self.reward_fns = reward_functions
self.reward_weights = torch.tensor(reward_weights)
self.tokenizer = tokenizer
self.config = config
assert len(reward_functions) == len(reward_weights), \
"Must have same number of reward functions and weights"
# Freeze reference
for param in self.reference.parameters():
param.requires_grad = False
self.optimizer = torch.optim.AdamW(
self.policy.parameters(),
lr=config.learning_rate,
)
@torch.no_grad()
def sample_and_score(
self,
prompt_text: str,
prompt_ids: Tensor,
attention_mask: Tensor,
) -> Tuple[Tensor, List[Tensor], Tensor]:
"""
Sample G outputs and compute all rewards.
Returns:
output_ids: [G, seq_len] generated sequences
rewards_list: List of [G] tensors, one per reward function
output_masks: [G, seq_len] attention masks
"""
G = self.config.group_size
prompt_len = prompt_ids.shape[1]
# Expand for batch generation
prompt_ids = prompt_ids.expand(G, -1)
attention_mask = attention_mask.expand(G, -1)
# Sample
self.policy.eval()
outputs = self.policy.generate(
input_ids=prompt_ids,
attention_mask=attention_mask,
max_new_tokens=512,
do_sample=True,
temperature=1.0,
pad_token_id=self.tokenizer.pad_token_id,
)
output_ids = outputs
output_masks = (output_ids != self.tokenizer.pad_token_id).long()
# Score with each reward function
rewards_list = []
for reward_fn in self.reward_fns:
rewards_k = []
for i in range(G):
response_ids = output_ids[i, prompt_len:]
response_text = self.tokenizer.decode(
response_ids, skip_special_tokens=True
)
r = reward_fn(prompt_text, response_text)
rewards_k.append(r)
rewards_list.append(
torch.tensor(rewards_k, device=output_ids.device)
)
return output_ids, rewards_list, output_masks
def compute_log_probs(
self,
model: torch.nn.Module,
input_ids: Tensor,
attention_mask: Tensor,
response_start: int,
) -> Tensor:
"""Compute per-token log probs for response."""
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
log_probs = F.log_softmax(shift_logits, dim=-1)
token_log_probs = log_probs.gather(
dim=-1, index=shift_labels.unsqueeze(-1)
).squeeze(-1)
return token_log_probs[:, response_start - 1:]
def train_step(
self,
prompt_text: str,
prompt_ids: Tensor,
attention_mask: Tensor,
) -> dict:
"""
Single GDPO training step.
"""
prompt_len = prompt_ids.shape[1]
# Sample and score
output_ids, rewards_list, output_masks = self.sample_and_score(
prompt_text, prompt_ids, attention_mask
)
# Compute GDPO advantages
advantages, per_reward_adv = compute_gdpo_advantages(
rewards_list,
self.reward_weights.tolist(),
self.config.eps,
)
# Get log probs
with torch.no_grad():
old_log_probs = self.compute_log_probs(
self.policy, output_ids, output_masks, prompt_len
)
ref_log_probs = self.compute_log_probs(
self.reference, output_ids, output_masks, prompt_len
)
# Update
self.policy.train()
policy_log_probs = self.compute_log_probs(
self.policy, output_ids, output_masks, prompt_len
)
response_mask = output_masks[:, prompt_len:]
# Compute loss (same as GRPO)
loss, metrics = self._compute_loss(
policy_log_probs,
old_log_probs,
ref_log_probs,
advantages,
response_mask,
)
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.policy.parameters(),
self.config.max_grad_norm
)
self.optimizer.step()
# Add per-reward metrics
for k, rewards_k in enumerate(rewards_list):
metrics[f"reward_{k}_mean"] = rewards_k.mean().item()
metrics[f"reward_{k}_std"] = rewards_k.std().item()
metrics[f"adv_{k}_mean"] = per_reward_adv[k].mean().item()
return metrics
def _compute_loss(
self,
policy_log_probs: Tensor,
old_log_probs: Tensor,
ref_log_probs: Tensor,
advantages: Tensor,
mask: Tensor,
) -> Tuple[Tensor, dict]:
"""Compute GDPO loss with clipping and KL."""
G, seq_len = policy_log_probs.shape
# Importance ratio
ratio = torch.exp(policy_log_probs - old_log_probs)
# Expand advantages
adv_expanded = advantages.unsqueeze(1).expand(-1, seq_len)
# Clipped objective
clipped = torch.clamp(
ratio,
1 - self.config.clip_epsilon,
1 + self.config.clip_epsilon
)
policy_loss = -torch.min(
ratio * adv_expanded,
clipped * adv_expanded
)
# KL penalty
log_ratio = ref_log_probs - policy_log_probs
kl = torch.exp(log_ratio) - log_ratio - 1
# Combine
token_loss = policy_loss + self.config.kl_coef * kl
# Mask and average
masked_loss = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
loss = masked_loss.mean()
with torch.no_grad():
clip_frac = ((ratio - 1).abs() > self.config.clip_epsilon).float()
clip_frac = (clip_frac * mask).sum() / mask.sum()
metrics = {
"loss": loss.item(),
"mean_kl": (kl * mask).sum().item() / mask.sum().item(),
"clip_fraction": clip_frac.item(),
"mean_advantage": advantages.mean().item(),
}
return loss, metrics
Experimental Results
The GDPO paper demonstrates consistent improvements across multiple tasks.
Tool Calling (2 Rewards)
Training Qwen2.5-1.5B-Instruct with format + correctness rewards:
| Method | Format Compliance | Tool Accuracy | Training Stability |
|---|---|---|---|
| GRPO | 78% | 61% | Some oscillation |
| GDPO | 94% | 73% | Stable convergence |
GDPO achieves better performance on both metrics with more stable training curves.
Math Reasoning (3 Rewards)
Training on GSM8K with format + correctness + integer rewards:
| Method | Format | Accuracy | Integer | Overall |
|---|---|---|---|---|
| GRPO | 82% | 45% | 71% | 42% |
| GDPO | 96% | 52% | 89% | 51% |
The three-reward setting amplifies GDPO’s advantage—more rewards mean more potential for collapse.
Code Generation
Similar patterns: GDPO consistently outperforms GRPO on execution, correctness, and style metrics.
Key Observations
- More rewards → bigger gap: The advantage collapse problem worsens with more rewards
- Training stability: GDPO shows smoother loss curves and fewer failures
- Complementary rewards: When rewards measure different aspects, GDPO’s decoupling is most beneficial
Key Insight: GDPO’s improvement scales with the number of rewards. For single-reward settings, use GRPO. For multi-reward, GDPO is the clear choice.
When to Use GDPO vs GRPO
Use GRPO When:
- Single reward: No multi-reward setting
- Highly correlated rewards: Rewards that move together don’t benefit from decoupling
- Simplicity preferred: GRPO is simpler to implement
Use GDPO When:
- Multiple rewards: Two or more distinct reward functions
- Complementary rewards: Rewards measure different aspects (format vs. correctness)
- Training instability with GRPO: If GRPO shows oscillation or failure, try GDPO
- Production multi-objective alignment: Default for real-world multi-reward scenarios
Production Patterns
Pattern 1: Format + Correctness
The most common multi-reward setup:
def format_reward(prompt: str, response: str) -> float:
"""Binary: follows required structure?"""
has_think = "<think>" in response and "</think>" in response
has_answer = "<answer>" in response and "</answer>" in response
return 1.0 if (has_think and has_answer) else 0.0
def correctness_reward(prompt: str, response: str) -> float:
"""Continuous: semantically correct?"""
# Extract answer and check against ground truth
# Returns value in [-1, 1]
...
reward_fns = [format_reward, correctness_reward]
reward_weights = [1.0, 1.0] # Equal weighting
Use GDPO—format and correctness are complementary.
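The correctness reward above is left elided because it is task-specific. One hedged way to fill it in for a math-style task with a single reference answer might look like the sketch below; the answer-extraction regex and the `GROUND_TRUTH` lookup are illustrative assumptions, not part of the original pattern:

```python
import re
from typing import Dict, Optional

# Hypothetical reference answers keyed by prompt; in practice these would come
# from the dataset batch rather than a module-level dict.
GROUND_TRUTH: Dict[str, str] = {}

def correctness_reward(prompt: str, response: str) -> float:
    """Sketch: +1 if the extracted answer matches the reference, -1 otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -1.0  # no parseable answer
    predicted = match.group(1).strip()
    target: Optional[str] = GROUND_TRUTH.get(prompt)
    if target is None:
        return 0.0   # no reference available for this prompt
    return 1.0 if predicted == target.strip() else -1.0
```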
Pattern 2: Hierarchical Rewards
When one reward is prerequisite for another:
def hierarchical_correctness(prompt: str, response: str) -> float:
"""Only evaluate correctness if format is satisfied."""
if format_reward(prompt, response) < 0.5:
return 0.0 # Can't evaluate without proper format
return base_correctness_reward(prompt, response)
reward_fns = [format_reward, hierarchical_correctness]
GDPO still helps, but consider the dependency structure.
Pattern 3: Weighted Multi-Objective
Different importance for different rewards:
reward_fns = [safety_reward, helpfulness_reward, conciseness_reward]
reward_weights = [2.0, 1.0, 0.5] # Safety prioritized
GDPO’s per-reward normalization ensures each contributes proportionally.
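One consequence of per-reward normalization worth seeing explicitly: the raw scale of a reward stops mattering, so only its weight controls its influence. A minimal sketch, assuming the `compute_gdpo_advantages` helper defined earlier in this article is in scope and using made-up reward values:

```python
import torch

torch.manual_seed(0)
group_size = 8
safety = torch.randint(0, 2, (group_size,)).float()    # binary reward in {0, 1}
helpfulness = torch.rand(group_size) * 100.0            # continuous reward in [0, 100]

# Rescaling helpfulness by 100x changes nothing: with combined normalization the
# large-scale reward would dominate, but with decoupled normalization only the
# weights [2.0, 1.0] determine each reward's influence on the final advantages.
adv, _ = compute_gdpo_advantages([safety, helpfulness], [2.0, 1.0], eps=1e-8)
adv_rescaled, _ = compute_gdpo_advantages([safety, helpfulness / 100.0], [2.0, 1.0], eps=1e-8)
print(torch.allclose(adv, adv_rescaled, atol=1e-5))     # True
```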
Pattern 4: Dynamic Weights
Adjusting weights during training:
def get_weights(step: int) -> List[float]:
"""Start with format emphasis, shift to correctness."""
format_w = max(0.5, 1.0 - step / 10000)
correct_w = min(1.5, 0.5 + step / 10000)
return [format_w, correct_w]
GDPO handles this naturally since normalization is per-reward.
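For reference, evaluating the schedule at a few steps shows the intended shift; the per-step weight list would then be passed to the advantage computation in place of a fixed `reward_weights` (a quick check of the function above, nothing more):

```python
# Early steps emphasize format; by step 10000 correctness dominates.
for step in (0, 5000, 10000):
    print(step, get_weights(step))
# 0     [1.0, 0.5]
# 5000  [0.5, 1.0]
# 10000 [0.5, 1.5]
```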
Key Takeaways
The Problem: Advantage Collapse
When GRPO normalizes combined rewards, distinct reward combinations can collapse to identical advantages, losing training signal resolution.
The Solution: GDPO
Normalize each reward independently, then combine:
# GRPO (problematic)
combined = sum(w * r for w, r in zip(weights, rewards))
advantage = (combined - mean) / std
# GDPO (preserves signal)
normalized = [(r - mean(r)) / std(r) for r in rewards]
combined = sum(w * n for w, n in zip(weights, normalized))
advantage = (combined - mean) / std # Final batch norm
When to Use GDPO
- Multiple rewards measuring different aspects
- Tool calling, math reasoning, code generation
- Any multi-objective LLM alignment
Implementation
GDPO is a drop-in replacement for GRPO’s advantage computation. Minimal code change, significant improvement.
The Broader Lesson
As LLM training becomes more sophisticated, we’ll see more multi-reward settings. Understanding how normalization affects training signal is crucial for designing effective alignment pipelines.
Series Summary
Over four articles, we’ve built a complete understanding of policy optimization for LLMs:
| Part | Topic | Key Contribution |
|---|---|---|
| 1 | RL Foundations | MDPs, policy gradients, variance reduction, value functions |
| 2 | PPO | Clipped objectives, GAE, the four-model architecture |
| 3 | GRPO | Group-relative advantages, eliminating the value network |
| 4 | GDPO | Multi-reward settings, decoupled normalization |
Each algorithm addresses specific limitations of its predecessor:
- PPO addresses policy gradient variance
- GRPO addresses PPO’s memory overhead
- GDPO addresses GRPO’s multi-reward limitations
Understanding this progression helps you choose the right tool for your alignment task.
Further Reading
GDPO:
- GDPO: Group reward-Decoupled Normalization Policy Optimization (Liu et al., 2026)
- GitHub: NVlabs/GDPO
Foundations:
- DeepSeekMath — GRPO origin
- PPO — The baseline
Article series
Policy Optimization for LLMs: From Fundamentals to Production
- Part 1 Reinforcement Learning Foundations for LLM Alignment
- Part 2 PPO for Language Models: The RLHF Workhorse
- Part 3 GRPO: Eliminating the Value Network
- Part 4 GDPO: Multi-Reward RL Done Right
Cite this article
Sousa, V. (2026). GDPO: Multi-Reward RL Done Right. vitorsousa.com. https://www.vitorsousa.com/blog//
@article{sousa2026,
title={GDPO: Multi-Reward RL Done Right},
author={Sousa, Vitor},
year={2026},
url={https://www.vitorsousa.com/blog//}
}