By Vitor Sousa

GDPO: Multi-Reward RL Done Right

Part 4 of 4: The Multi-Reward Frontier

TL;DR: GRPO works beautifully for single rewards, but breaks down with multiple rewards. When you normalize the sum of rewards, distinct reward combinations collapse to identical advantages—you lose training signal resolution. GDPO (Group reward-Decoupled normalization Policy Optimization) fixes this by normalizing each reward independently, then combining. Simple change, significant improvement for tool calling, math reasoning, and any multi-objective alignment task.

Reading time: ~25 minutes

Prerequisites: Part 3: GRPO covers group-relative advantage estimation.


The Multi-Reward Reality

Modern LLM training rarely optimizes a single reward. Consider these common scenarios:

Tool Calling:

  • Format reward: Does the output follow the required JSON schema?
  • Correctness reward: Are the tool calls semantically correct?

Math Reasoning:

  • Format reward: Does the output use <think> and <answer> tags?
  • Correctness reward: Is the final answer mathematically correct?
  • Integer reward: Is the answer an integer (not a float)?

Coding:

  • Execution reward: Does the code run without errors?
  • Correctness reward: Does it pass test cases?
  • Style reward: Does it follow conventions?

The naive approach: sum the rewards and apply GRPO. This doesn’t work well.


Table of Contents

  1. The Advantage Collapse Problem
  2. Mathematical Analysis
  3. GDPO: The Solution
  4. Implementation
  5. Experimental Results
  6. When to Use GDPO vs GRPO
  7. Production Patterns
  8. Key Takeaways

The Advantage Collapse Problem

GRPO with Multiple Rewards: The Naive Approach

Suppose we have two rewards: format ($r^f$) and correctness ($r^c$). The naive approach:

  1. Combine rewards: $r_i = w_f \cdot r_i^f + w_c \cdot r_i^c$
  2. Apply GRPO normalization: $\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$

This is exactly what happens if you use standard GRPO with a multi-reward setup.

The Collapse

Consider four outputs with these reward combinations:

| Output | Format $r^f$ | Correctness $r^c$ | Combined $r = r^f + r^c$ |
|--------|--------------|-------------------|--------------------------|
| $o_1$  | 1.0          | 0.0               | 1.0                      |
| $o_2$  | 0.0          | 1.0               | 1.0                      |
| $o_3$  | 1.0          | 1.0               | 2.0                      |
| $o_4$  | 0.0          | 0.0               | 0.0                      |

Now apply GRPO normalization to combined rewards:

  • $\text{mean}(\mathbf{r}) = (1 + 1 + 2 + 0) / 4 = 1.0$
  • $\text{std}(\mathbf{r}) = \sqrt{\big((1-1)^2 + (1-1)^2 + (2-1)^2 + (0-1)^2\big)/4} = \sqrt{0.5} \approx 0.71$

| Output | Combined $r$ | Advantage $\hat{A}$    |
|--------|--------------|------------------------|
| $o_1$  | 1.0          | $(1 - 1)/0.71 = 0$     |
| $o_2$  | 1.0          | $(1 - 1)/0.71 = 0$     |
| $o_3$  | 2.0          | $(2 - 1)/0.71 = 1.41$  |
| $o_4$  | 0.0          | $(0 - 1)/0.71 = -1.41$ |

The problem: Outputs $o_1$ and $o_2$ get identical advantages (zero), despite having completely different reward profiles:

  • $o_1$: good format, bad correctness
  • $o_2$: bad format, good correctness

The normalization has collapsed these distinct cases into the same training signal.
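
To make the arithmetic concrete, here is a minimal sketch of the collapse (illustrative code, not from the paper; it uses the population standard deviation, `unbiased=False`, to match the numbers above):

import torch

# Rewards for the four outputs: columns are [format, correctness]
rewards = torch.tensor([
    [1.0, 0.0],  # o1: good format, bad correctness
    [0.0, 1.0],  # o2: bad format, good correctness
    [1.0, 1.0],  # o3: both good
    [0.0, 0.0],  # o4: both bad
])

# GRPO: combine first (equal weights), then normalize once
combined = rewards.sum(dim=1)  # tensor([1., 1., 2., 0.])
advantages = (combined - combined.mean()) / combined.std(unbiased=False)
print(advantages)  # tensor([ 0.0000,  0.0000,  1.4142, -1.4142])

o1 and o2 come out identical, exactly as in the table.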

Why This Matters

With advantage collapse:

  1. Lost resolution: The model can’t distinguish “format good, correctness bad” from “format bad, correctness good”
  2. Suboptimal convergence: Training signal is weaker than it could be
  3. Training instability: In extreme cases, collapse can cause early training failure
Figure: GRPO vs. GDPO on $o_1$ (format ✓, correctness ✗) and $o_2$ (format ✗, correctness ✓). GRPO assigns both the same advantage and cannot distinguish the two failure modes; GDPO keeps the per-reward advantages ($\hat{A}_f = +1, \hat{A}_c = -1$ vs. $\hat{A}_f = -1, \hat{A}_c = +1$) distinct, so format and correctness are learned independently.

Key Insight: When rewards measure different aspects (format vs. correctness), combining them before normalization destroys information about which aspect each output excels at.


Mathematical Analysis

Let’s formalize why advantage collapse happens.

GRPO’s Combined Normalization

Given $K$ reward functions with weights $\{w_k\}$, GRPO computes:

$$r_i = \sum_{k=1}^{K} w_k \cdot r_i^{(k)}$$

$$\hat{A}_i^{\text{GRPO}} = \frac{r_i - \mu_r}{\sigma_r}$$

where $\mu_r = \text{mean}(\mathbf{r})$ and $\sigma_r = \text{std}(\mathbf{r})$.

The Collapse Condition

Outputs $o_i$ and $o_j$ have identical advantages when:

$$\sum_{k=1}^{K} w_k \cdot r_i^{(k)} = \sum_{k=1}^{K} w_k \cdot r_j^{(k)}$$

This happens whenever different reward combinations sum to the same value—regardless of which specific rewards contributed.

Information Loss

Consider the space of possible reward vectors in $\mathbb{R}^K$ ($K$ rewards). GRPO projects this to $\mathbb{R}^1$ (the weighted sum), then normalizes.

Information lost: All reward vectors mapping to the same sum become indistinguishable.

For $K = 2$ rewards with equal weights, the level sets are lines in 2D:

  • $(0, 2)$, $(2, 0)$, and $(1, 1)$ all map to sum $= 2$
  • These represent very different behaviors, but get identical treatment

Variance Collapse

Even more problematic: when one reward is nearly constant, or when one reward's scale dominates the others, the remaining rewards' contribution to the combined variance shrinks toward zero.

Example (see the sketch after this list): If the format reward is always 1 (all outputs follow the format), then:

  • $r_i = 1 \cdot w_f + r_i^c \cdot w_c$
  • Variance comes only from correctness
  • The format reward contributes nothing to the training signal
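
Here is a minimal sketch of this degenerate case (illustrative values, not from the paper): the constant format reward shifts the mean but adds nothing to the variance, so the combined advantage is identical to normalizing correctness alone.

import torch

format_r = torch.ones(4)                       # every output follows the format
correct_r = torch.tensor([0.0, 1.0, 1.0, 0.0])

combined = 1.0 * format_r + 1.0 * correct_r    # w_f = w_c = 1
adv = (combined - combined.mean()) / (combined.std(unbiased=False) + 1e-4)

adv_correct_only = (correct_r - correct_r.mean()) / (correct_r.std(unbiased=False) + 1e-4)
print(torch.allclose(adv, adv_correct_only))   # True: format has vanished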

GDPO: The Solution

Group reward-Decoupled normalization Policy Optimization (GDPO) normalizes each reward independently before combining.

The GDPO Algorithm

For each reward $k \in \{1, \ldots, K\}$:

Step 1: Normalize each reward independently

$$\hat{A}_i^{(k)} = \frac{r_i^{(k)} - \mu^{(k)}}{\sigma^{(k)}}$$

where $\mu^{(k)} = \text{mean}(\mathbf{r}^{(k)})$ and $\sigma^{(k)} = \text{std}(\mathbf{r}^{(k)})$.

Step 2: Combine normalized advantages

$$\hat{A}_i^{\text{pre}} = \sum_{k=1}^{K} w_k \cdot \hat{A}_i^{(k)}$$

Step 3: Final batch normalization

$$\hat{A}_i^{\text{GDPO}} = \frac{\hat{A}_i^{\text{pre}} - \mu_{\text{pre}}}{\sigma_{\text{pre}}}$$

Why This Works

Decoupled normalization preserves individual reward signals. Going back to our example:

| Output | $r^f$ | $r^c$ | $\hat{A}^f$ | $\hat{A}^c$ | $\hat{A}^{\text{pre}}$ |
|--------|-------|-------|-------------|-------------|------------------------|
| $o_1$  | 1.0   | 0.0   | +1.0        | -1.0        | 0.0                    |
| $o_2$  | 0.0   | 1.0   | -1.0        | +1.0        | 0.0                    |
| $o_3$  | 1.0   | 1.0   | +1.0        | +1.0        | +2.0                   |
| $o_4$  | 0.0   | 0.0   | -1.0        | -1.0        | -2.0                   |

In this perfectly symmetric example, $o_1$ and $o_2$ still end up with the same final advantage (after step 3), but the signal they carry is different:

For $o_1$: the advantage decomposition favors improving correctness (negative $\hat{A}^c$) while maintaining format (positive $\hat{A}^f$)

For $o_2$: it favors improving format (negative $\hat{A}^f$) while maintaining correctness (positive $\hat{A}^c$)

In realistic batches, reward profiles are rarely this symmetric, so the decoupled advantages stay distinct instead of collapsing: the model learns which specific reward to improve for each output.
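
The table can be reproduced in a few lines (a sketch using the population standard deviation; torch's default unbiased estimator would give ±0.87 instead of ±1, but the structure is the same):

import torch

rewards = torch.tensor([
    [1.0, 0.0],  # o1
    [0.0, 1.0],  # o2
    [1.0, 1.0],  # o3
    [0.0, 0.0],  # o4
])

# Step 1: normalize each reward column independently
mu = rewards.mean(dim=0)
sigma = rewards.std(dim=0, unbiased=False)
per_reward_adv = (rewards - mu) / sigma
# [[ 1., -1.], [-1.,  1.], [ 1.,  1.], [-1., -1.]]

# Step 2: combine with equal weights
pre = per_reward_adv.sum(dim=1)  # tensor([ 0.,  0.,  2., -2.])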

Visual Comparison

Figure: GRPO sums raw rewards ($r = w_1 r_1 + w_2 r_2$) and normalizes once ($\hat{A} = (r - \mu)/\sigma$). GDPO normalizes each reward first ($\hat{A}_k = (r_k - \mu_k)/\sigma_k$), combines the normalized advantages ($\hat{A}_{\text{pre}} = w_1 \hat{A}_1 + w_2 \hat{A}_2$), then applies a final batch normalization.

Key Insight: GDPO’s decoupled normalization ensures each reward contributes meaningfully to the gradient, regardless of its scale or correlation with other rewards.


Implementation

GDPO is a straightforward modification to GRPO. Here’s the key difference:

GRPO (Original)

def grpo_advantages(rewards_per_func, reward_weights, num_generations):
    """
    GRPO: Normalize combined rewards.
    
    Args:
        rewards_per_func: [batch, K] rewards from K reward functions
        reward_weights: [K] weights for each reward
        num_generations: G, outputs per prompt
    """
    # Combine rewards first
    rewards = (rewards_per_func * reward_weights).sum(dim=1)  # [batch]
    
    # Then normalize
    mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
    std_grouped = rewards.view(-1, num_generations).std(dim=1)
    
    mean_grouped = mean_grouped.repeat_interleave(num_generations)
    std_grouped = std_grouped.repeat_interleave(num_generations)
    
    advantages = (rewards - mean_grouped) / (std_grouped + 1e-4)
    
    return advantages

GDPO (Decoupled)

import torch

def gdpo_advantages(rewards_per_func, reward_weights, num_generations):
    """
    GDPO: Normalize each reward independently, then combine.
    
    Args:
        rewards_per_func: [batch, K] rewards from K reward functions
        reward_weights: [K] tensor of weights for each reward
        num_generations: G, outputs per prompt
    """
    K = len(reward_weights)
    
    # Replace NaNs (e.g., from a reward function that failed to score) with 0
    rewards_per_func = torch.nan_to_num(rewards_per_func)
    
    all_reward_advantages = []
    
    # Step 1: Normalize each reward independently
    for k in range(K):
        reward_k = rewards_per_func[:, k]  # [batch]
        
        # Group-wise normalization for this reward
        mean_k = reward_k.view(-1, num_generations).mean(dim=1)
        std_k = reward_k.view(-1, num_generations).std(dim=1)
        
        mean_k = mean_k.repeat_interleave(num_generations)
        std_k = std_k.repeat_interleave(num_generations)
        
        advantage_k = (reward_k - mean_k) / (std_k + 1e-4)
        all_reward_advantages.append(advantage_k)
    
    # Step 2: Combine normalized advantages
    combined_advantages = torch.stack(all_reward_advantages, dim=1)  # [batch, K]
    pre_bn_advantages = (combined_advantages * reward_weights).sum(dim=1)  # [batch]
    
    # Step 3: Final batch normalization
    bn_mean = pre_bn_advantages.mean()
    bn_std = pre_bn_advantages.std()
    
    advantages = (pre_bn_advantages - bn_mean) / (bn_std + 1e-4)
    
    return advantages
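
A quick side-by-side, assuming `grpo_advantages` and `gdpo_advantages` defined above are in scope (illustrative reward values; the correctness column runs 0 to 10 to show scale dominance):

import torch

rewards_per_func = torch.tensor([
    [1.0,  0.0],
    [0.0, 10.0],
    [1.0, 10.0],
    [0.0,  0.0],
])
weights = torch.tensor([1.0, 1.0])

print(grpo_advantages(rewards_per_func, weights, num_generations=4))
# Correctness dominates: flipping format barely moves the advantage.
print(gdpo_advantages(rewards_per_func, weights, num_generations=4))
# Each reward is normalized to unit scale first, so format and
# correctness contribute equally to the combined advantage.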

Complete GDPO Trainer

"""
Group reward-Decoupled normalization Policy Optimization (GDPO)
===============================================================

Extends GRPO for multi-reward settings by normalizing each
reward independently before combining.

Reference: NVIDIA (Liu et al., 2026)
"""

import torch
import torch.nn.functional as F
from torch import Tensor
from dataclasses import dataclass
from typing import List, Tuple, Callable


@dataclass
class GDPOConfig:
    """GDPO hyperparameters."""
    group_size: int = 64
    clip_epsilon: float = 0.2
    kl_coef: float = 0.04
    learning_rate: float = 1e-6
    max_grad_norm: float = 1.0
    eps: float = 1e-4


def compute_gdpo_advantages(
    rewards_list: List[Tensor],
    reward_weights: List[float],
    eps: float = 1e-4,
) -> Tuple[Tensor, List[Tensor]]:
    """
    Compute GDPO advantages with decoupled normalization.
    
    Args:
        rewards_list: List of [G] tensors, one per reward function
        reward_weights: Weights for each reward
        eps: Numerical stability constant
        
    Returns:
        advantages: [G] final normalized advantages
        per_reward_advantages: List of [G] tensors for logging
    """
    K = len(rewards_list)
    G = rewards_list[0].shape[0]
    device = rewards_list[0].device
    
    # Step 1: Normalize each reward independently
    normalized_rewards = []
    for k in range(K):
        r_k = rewards_list[k]
        mean_k = r_k.mean()
        std_k = r_k.std() + eps
        normalized_k = (r_k - mean_k) / std_k
        normalized_rewards.append(normalized_k)
    
    # Step 2: Combine with weights
    combined = torch.zeros(G, device=device)
    for k in range(K):
        combined += reward_weights[k] * normalized_rewards[k]
    
    # Step 3: Final batch normalization
    final_mean = combined.mean()
    final_std = combined.std() + eps
    advantages = (combined - final_mean) / final_std
    
    return advantages, normalized_rewards


class GDPOTrainer:
    """
    GDPO trainer for multi-reward LLM alignment.
    
    Key difference from GRPO: each reward is normalized
    independently before combining, preserving signal resolution.
    """
    
    def __init__(
        self,
        policy_model: torch.nn.Module,
        reference_model: torch.nn.Module,
        reward_functions: List[Callable],
        reward_weights: List[float],
        tokenizer,
        config: GDPOConfig,
    ):
        self.policy = policy_model
        self.reference = reference_model
        self.reward_fns = reward_functions
        self.reward_weights = torch.tensor(reward_weights)
        self.tokenizer = tokenizer
        self.config = config
        
        assert len(reward_functions) == len(reward_weights), \
            "Must have same number of reward functions and weights"
        
        # Freeze reference
        for param in self.reference.parameters():
            param.requires_grad = False
        
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config.learning_rate,
        )
    
    @torch.no_grad()
    def sample_and_score(
        self,
        prompt_text: str,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> Tuple[Tensor, List[Tensor], Tensor]:
        """
        Sample G outputs and compute all rewards.
        
        Returns:
            output_ids: [G, seq_len] generated sequences
            rewards_list: List of [G] tensors, one per reward function
            output_masks: [G, seq_len] attention masks
        """
        G = self.config.group_size
        prompt_len = prompt_ids.shape[1]
        
        # Expand for batch generation
        prompt_ids = prompt_ids.expand(G, -1)
        attention_mask = attention_mask.expand(G, -1)
        
        # Sample
        self.policy.eval()
        outputs = self.policy.generate(
            input_ids=prompt_ids,
            attention_mask=attention_mask,
            max_new_tokens=512,
            do_sample=True,
            temperature=1.0,
            pad_token_id=self.tokenizer.pad_token_id,
        )
        
        output_ids = outputs
        output_masks = (output_ids != self.tokenizer.pad_token_id).long()
        
        # Score with each reward function
        rewards_list = []
        for reward_fn in self.reward_fns:
            rewards_k = []
            for i in range(G):
                response_ids = output_ids[i, prompt_len:]
                response_text = self.tokenizer.decode(
                    response_ids, skip_special_tokens=True
                )
                r = reward_fn(prompt_text, response_text)
                rewards_k.append(r)
            rewards_list.append(
                torch.tensor(rewards_k, device=output_ids.device)
            )
        
        return output_ids, rewards_list, output_masks
    
    def compute_log_probs(
        self,
        model: torch.nn.Module,
        input_ids: Tensor,
        attention_mask: Tensor,
        response_start: int,
    ) -> Tensor:
        """Compute per-token log probs for response."""
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        
        log_probs = F.log_softmax(shift_logits, dim=-1)
        token_log_probs = log_probs.gather(
            dim=-1, index=shift_labels.unsqueeze(-1)
        ).squeeze(-1)
        
        return token_log_probs[:, response_start - 1:]
    
    def train_step(
        self,
        prompt_text: str,
        prompt_ids: Tensor,
        attention_mask: Tensor,
    ) -> dict:
        """
        Single GDPO training step.
        """
        prompt_len = prompt_ids.shape[1]
        
        # Sample and score
        output_ids, rewards_list, output_masks = self.sample_and_score(
            prompt_text, prompt_ids, attention_mask
        )
        
        # Compute GDPO advantages
        advantages, per_reward_adv = compute_gdpo_advantages(
            rewards_list,
            self.reward_weights.tolist(),
            self.config.eps,
        )
        
        # Get log probs
        with torch.no_grad():
            old_log_probs = self.compute_log_probs(
                self.policy, output_ids, output_masks, prompt_len
            )
            ref_log_probs = self.compute_log_probs(
                self.reference, output_ids, output_masks, prompt_len
            )
        
        # Update
        self.policy.train()
        policy_log_probs = self.compute_log_probs(
            self.policy, output_ids, output_masks, prompt_len
        )
        
        response_mask = output_masks[:, prompt_len:]
        
        # Compute loss (same as GRPO)
        loss, metrics = self._compute_loss(
            policy_log_probs,
            old_log_probs,
            ref_log_probs,
            advantages,
            response_mask,
        )
        
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            self.policy.parameters(), 
            self.config.max_grad_norm
        )
        self.optimizer.step()
        
        # Add per-reward metrics
        for k, rewards_k in enumerate(rewards_list):
            metrics[f"reward_{k}_mean"] = rewards_k.mean().item()
            metrics[f"reward_{k}_std"] = rewards_k.std().item()
            metrics[f"adv_{k}_mean"] = per_reward_adv[k].mean().item()
        
        return metrics
    
    def _compute_loss(
        self,
        policy_log_probs: Tensor,
        old_log_probs: Tensor,
        ref_log_probs: Tensor,
        advantages: Tensor,
        mask: Tensor,
    ) -> Tuple[Tensor, dict]:
        """Compute GDPO loss with clipping and KL."""
        G, seq_len = policy_log_probs.shape
        
        # Importance ratio
        ratio = torch.exp(policy_log_probs - old_log_probs)
        
        # Expand advantages
        adv_expanded = advantages.unsqueeze(1).expand(-1, seq_len)
        
        # Clipped objective
        clipped = torch.clamp(
            ratio,
            1 - self.config.clip_epsilon,
            1 + self.config.clip_epsilon
        )
        
        policy_loss = -torch.min(
            ratio * adv_expanded,
            clipped * adv_expanded
        )
        
        # KL penalty
        log_ratio = ref_log_probs - policy_log_probs
        kl = torch.exp(log_ratio) - log_ratio - 1
        
        # Combine
        token_loss = policy_loss + self.config.kl_coef * kl
        
        # Mask and average
        masked_loss = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        loss = masked_loss.mean()
        
        with torch.no_grad():
            clip_frac = ((ratio - 1).abs() > self.config.clip_epsilon).float()
            clip_frac = (clip_frac * mask).sum() / mask.sum()
        
        metrics = {
            "loss": loss.item(),
            "mean_kl": (kl * mask).sum().item() / mask.sum().item(),
            "clip_fraction": clip_frac.item(),
            "mean_advantage": advantages.mean().item(),
        }
        
        return loss, metrics
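
A hypothetical wiring sketch for the trainer (checkpoint name, prompt, and reward functions are placeholders; assumes Hugging Face transformers and the reward functions from the Production Patterns section below):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy = AutoModelForCausalLM.from_pretrained(model_name)
reference = AutoModelForCausalLM.from_pretrained(model_name)

trainer = GDPOTrainer(
    policy_model=policy,
    reference_model=reference,
    reward_functions=[format_reward, correctness_reward],
    reward_weights=[1.0, 1.0],
    tokenizer=tokenizer,
    config=GDPOConfig(group_size=8),  # small group for a smoke test
)

prompt = "What is 17 * 23? Answer inside <answer> tags."
enc = tokenizer(prompt, return_tensors="pt")
metrics = trainer.train_step(prompt, enc["input_ids"], enc["attention_mask"])
print(metrics)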

Experimental Results

The GDPO paper demonstrates consistent improvements across multiple tasks.

Tool Calling (2 Rewards)

Training Qwen2.5-1.5B-Instruct with format + correctness rewards:

| Method | Format Compliance | Tool Accuracy | Training Stability |
|--------|-------------------|---------------|--------------------|
| GRPO   | 78%               | 61%           | Some oscillation   |
| GDPO   | 94%               | 73%           | Stable convergence |

GDPO achieves better performance on both metrics with more stable training curves.

Math Reasoning (3 Rewards)

Training on GSM8K with format + correctness + integer rewards:

| Method | Format | Accuracy | Integer | Overall |
|--------|--------|----------|---------|---------|
| GRPO   | 82%    | 45%      | 71%     | 42%     |
| GDPO   | 96%    | 52%      | 89%     | 51%     |

The three-reward setting amplifies GDPO’s advantage—more rewards mean more potential for collapse.

Code Generation

Similar patterns: GDPO consistently outperforms GRPO on execution, correctness, and style metrics.

Key Observations

  1. More rewards → bigger gap: The advantage collapse problem worsens with more rewards
  2. Training stability: GDPO shows smoother loss curves and fewer failures
  3. Complementary rewards: When rewards measure different aspects, GDPO’s decoupling is most beneficial

Key Insight: GDPO’s improvement scales with the number of rewards. For single-reward settings, use GRPO. For multi-reward, GDPO is the clear choice.


When to Use GDPO vs GRPO

Use GRPO When:

  • Single reward: No multi-reward setting
  • Highly correlated rewards: Rewards that move together don’t benefit from decoupling
  • Simplicity preferred: GRPO is simpler to implement

Use GDPO When:

  • Multiple rewards: Two or more distinct reward functions
  • Complementary rewards: Rewards measure different aspects (format vs. correctness)
  • Training instability with GRPO: If GRPO shows oscillation or failure, try GDPO
  • Production multi-objective alignment: Default for real-world multi-reward scenarios

Decision Tree

Figure (decision tree): One reward → use GRPO. Multiple complementary rewards (different aspects) → use GDPO. Multiple highly correlated rewards → stick with GRPO if training is stable; try GDPO if it isn't.

Production Patterns

Pattern 1: Format + Correctness

The most common multi-reward setup:

def format_reward(prompt: str, response: str) -> float:
    """Binary: follows required structure?"""
    has_think = "<think>" in response and "</think>" in response
    has_answer = "<answer>" in response and "</answer>" in response
    return 1.0 if (has_think and has_answer) else 0.0

def correctness_reward(prompt: str, response: str) -> float:
    """Continuous: semantically correct?"""
    # Extract answer and check against ground truth
    # Returns value in [-1, 1]
    ...

reward_fns = [format_reward, correctness_reward]
reward_weights = [1.0, 1.0]  # Equal weighting

Use GDPO—format and correctness are complementary.
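
For example, a toy check of the format reward:

good = "<think>12 * 4 = 48</think><answer>48</answer>"
bad = "The answer is 48."

print(format_reward("", good))  # 1.0
print(format_reward("", bad))   # 0.0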

Pattern 2: Hierarchical Rewards

When one reward is prerequisite for another:

def hierarchical_correctness(prompt: str, response: str) -> float:
    """Only evaluate correctness if format is satisfied."""
    if format_reward(prompt, response) < 0.5:
        return 0.0  # Can't evaluate without proper format
    return base_correctness_reward(prompt, response)

reward_fns = [format_reward, hierarchical_correctness]

GDPO still helps, but consider the dependency structure.

Pattern 3: Weighted Multi-Objective

Different importance for different rewards:

reward_fns = [safety_reward, helpfulness_reward, conciseness_reward]
reward_weights = [2.0, 1.0, 0.5]  # Safety prioritized

GDPO’s per-reward normalization ensures each contributes proportionally.

Pattern 4: Dynamic Weights

Adjusting weights during training:

from typing import List

def get_weights(step: int) -> List[float]:
    """Start with format emphasis, shift to correctness."""
    format_w = max(0.5, 1.0 - step / 10000)
    correct_w = min(1.5, 0.5 + step / 10000)
    return [format_w, correct_w]

GDPO handles this naturally since normalization is per-reward.
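
For instance, sampling the schedule at a few steps:

for step in [0, 2500, 5000, 10000]:
    print(step, get_weights(step))
# 0      [1.0, 0.5]
# 2500   [0.75, 0.75]
# 5000   [0.5, 1.0]
# 10000  [0.5, 1.5]

One caveat: the GDPOTrainer sketch above fixes reward_weights at construction, so dynamic weights would mean passing the current weights into compute_gdpo_advantages each step rather than reading the stored tensor.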


Key Takeaways

The Problem: Advantage Collapse

When GRPO normalizes combined rewards, distinct reward combinations can collapse to identical advantages, losing training signal resolution.

The Solution: GDPO

Normalize each reward independently, then combine:

# GRPO (problematic)
combined = sum(w * r for w, r in zip(weights, rewards))
advantage = (combined - mean) / std

# GDPO (preserves signal)
normalized = [(r - mean(r)) / std(r) for r in rewards]
combined = sum(w * n for w, n in zip(weights, normalized))
advantage = (combined - mean) / std  # Final batch norm

When to Use GDPO

  • Multiple rewards measuring different aspects
  • Tool calling, math reasoning, code generation
  • Any multi-objective LLM alignment

Implementation

GDPO is a drop-in replacement for GRPO’s advantage computation. Minimal code change, significant improvement.

The Broader Lesson

As LLM training becomes more sophisticated, we’ll see more multi-reward settings. Understanding how normalization affects training signal is crucial for designing effective alignment pipelines.


Series Summary

Over four articles, we’ve built a complete understanding of policy optimization for LLMs:

| Part | Topic          | Key Contribution                                             |
|------|----------------|--------------------------------------------------------------|
| 1    | RL Foundations | MDPs, policy gradients, variance reduction, value functions  |
| 2    | PPO            | Clipped objectives, GAE, the four-model architecture         |
| 3    | GRPO           | Group-relative advantages, eliminating the value network     |
| 4    | GDPO           | Multi-reward settings, decoupled normalization               |

Each algorithm addresses specific limitations of its predecessor:

  • PPO addresses policy gradient variance
  • GRPO addresses PPO’s memory overhead
  • GDPO addresses GRPO’s multi-reward limitations

Understanding this progression helps you choose the right tool for your alignment task.



Article series

Policy Optimization for LLMs: From Fundamentals to Production

Part 4 of 4

  1. Part 1 Reinforcement Learning Foundations for LLM Alignment
  2. Part 2 PPO for Language Models: The RLHF Workhorse
  3. Part 3 GRPO: Eliminating the Value Network
  4. Part 4 GDPO: Multi-Reward RL Done Right

Cite this article

Sousa, V. (2026). GDPO: Multi-Reward RL Done Right. vitorsousa.com. https://www.vitorsousa.com/blog//

@article{sousa2026,
  title={GDPO: Multi-Reward RL Done Right},
  author={Sousa, Vitor},
  year={2026},
  url={https://www.vitorsousa.com/blog//}
}
