GDPO: Multi-Reward RL Done Right
Part 4 of 4: The Multi-Reward Frontier
TL;DR: GRPO works beautifully for single rewards, but breaks down with multiple rewards. When you normalize the sum of rewards, distinct reward combinations collapse to identical advantages—you lose training signal resolution. GDPO (Group reward-Decoupled normalization Policy Optimization) fixes this by normalizing each reward independently, then combining. Simple change, significant improvement for tool calling, math reasoning, and any multi-objective alignment task.
Reading time: ~25 minutes
Prerequisites: Part 3: GRPO covers group-relative advantage estimation.
The Multi-Reward Reality
Modern LLM training rarely optimizes a single reward. Consider these common scenarios:
Tool Calling:
- Format reward: Does the output follow the required JSON schema?
- Correctness reward: Are the tool calls semantically correct?
Math Reasoning:
- Format reward: Does the output use `<think>` and `<answer>` tags?
- Correctness reward: Is the final answer mathematically correct?
- Integer reward: Is the answer an integer (not a float)?
Coding:
- Execution reward: Does the code run without errors?
- Correctness reward: Does it pass test cases?
- Style reward: Does it follow conventions?
The naive approach: sum the rewards and apply GRPO. This doesn’t work well.
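To make the naive setup concrete, here is a small sketch with two toy reward functions (the functions and the example responses are hypothetical stand-ins for illustration, not taken from the GDPO paper):

```python
# Two toy reward functions; real ones would check a JSON schema or a ground-truth answer.
def format_reward(response: str) -> float:
    return 1.0 if "<answer>" in response and "</answer>" in response else 0.0

def correctness_reward(response: str) -> float:
    return 1.0 if "42" in response else 0.0

weights = [1.0, 1.0]

# Naive combination: collapse everything into one scalar before any normalization.
wrong_but_formatted = "<answer>41</answer>"
right_but_unformatted = "The answer is 42"
for resp in (wrong_but_formatted, right_but_unformatted):
    combined = sum(w * r(resp) for w, r in zip(weights, (format_reward, correctness_reward)))
    print(combined)  # 1.0 both times: two very different failure modes, one number
```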
Table of Contents
- The Advantage Collapse Problem
- Mathematical Analysis
- GDPO: The Solution
- Implementation
- Experimental Results
- When to Use GDPO vs GRPO
- Production Patterns
- Key Takeaways
The Advantage Collapse Problem
GRPO with Multiple Rewards: The Naive Approach
Suppose we have two rewards: format ($r^{\text{format}}$) and correctness ($r^{\text{correct}}$). The naive approach:
- Combine rewards: $R_i = w_1\, r_i^{\text{format}} + w_2\, r_i^{\text{correct}}$
- Apply GRPO normalization: $A_i = \dfrac{R_i - \text{mean}(\{R_j\}_{j=1}^{G})}{\text{std}(\{R_j\}_{j=1}^{G})}$
This is exactly what happens if you use standard GRPO with a multi-reward setup.
The Collapse
Consider four outputs with these reward combinations:
| Output | Format | Correctness | Combined |
|---|---|---|---|
| $o_1$ | 1.0 | 0.0 | 1.0 |
| $o_2$ | 0.0 | 1.0 | 1.0 |
| $o_3$ | 1.0 | 1.0 | 2.0 |
| $o_4$ | 0.0 | 0.0 | 0.0 |
Now apply GRPO normalization to the combined rewards (group mean 1.0, population standard deviation ≈ 0.71):
| Output | Combined | Advantage |
|---|---|---|
| $o_1$ | 1.0 | 0.0 |
| $o_2$ | 1.0 | 0.0 |
| $o_3$ | 2.0 | +1.41 |
| $o_4$ | 0.0 | -1.41 |
The problem: Outputs $o_1$ and $o_2$ get identical advantages (zero), despite having completely different reward profiles:
- $o_1$: Good format, bad correctness
- $o_2$: Bad format, good correctness
The normalization has collapsed these distinct cases into the same training signal.
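To see the collapse concretely, here is a minimal check of the two tables above. It uses the population standard deviation to match the worked example; note that torch's default `.std()` is the unbiased estimator, so a real implementation's numbers differ slightly:

```python
import torch

# Rewards for the four outputs in the table above: [format, correctness].
rewards = torch.tensor([
    [1.0, 0.0],  # o1: good format, wrong answer
    [0.0, 1.0],  # o2: broken format, right answer
    [1.0, 1.0],  # o3: both good
    [0.0, 0.0],  # o4: both bad
])

combined = rewards.sum(dim=1)                     # tensor([1., 1., 2., 0.])
mu = combined.mean()                              # 1.0
sigma = ((combined - mu) ** 2).mean().sqrt()      # population std ≈ 0.71
adv = (combined - mu) / sigma
print(adv)  # tensor([ 0.0000,  0.0000,  1.4142, -1.4142]) -- o1 and o2 are indistinguishable
```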
Why This Matters
With advantage collapse:
- Lost resolution: The model can’t distinguish “format good, correctness bad” from “format bad, correctness good”
- Suboptimal convergence: Training signal is weaker than it could be
- Training instability: In extreme cases, can cause early training failure
Key Insight: When rewards measure different aspects (format vs. correctness), combining them before normalization destroys information about which aspect each output excels at.
Mathematical Analysis
Let’s formalize why advantage collapse happens.
GRPO’s Combined Normalization
Given $K$ reward functions $r^{(1)}, \dots, r^{(K)}$ with weights $w_1, \dots, w_K$, GRPO computes:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}, \qquad R_i = \sum_{k=1}^{K} w_k\, r_i^{(k)}$$

where $\mu_R = \text{mean}(\{R_j\}_{j=1}^{G})$ and $\sigma_R = \text{std}(\{R_j\}_{j=1}^{G})$ are computed over the group of $G$ outputs.
The Collapse Condition
Outputs $o_i$ and $o_j$ have identical advantages when:

$$\sum_{k=1}^{K} w_k\, r_i^{(k)} = \sum_{k=1}^{K} w_k\, r_j^{(k)}$$
This happens whenever different reward combinations sum to the same value—regardless of which specific rewards contributed.
Information Loss
Consider the space of possible reward vectors in $\mathbb{R}^K$ ($K$ rewards). GRPO projects this space to $\mathbb{R}$ (the weighted sum), then normalizes.
Information lost: All reward vectors mapping to the same sum become indistinguishable.
For $K = 2$ rewards with equal weights, the level sets are lines in 2D:
- $(1.0,\ 0.0)$, $(0.0,\ 1.0)$, and $(0.5,\ 0.5)$ all map to the sum $1.0$
- These represent very different behaviors, but get identical treatment
Variance Collapse
Even more problematic: when rewards are highly correlated or when one dominates, the variance of combined rewards shrinks.
Example: If format reward is always 1 (all outputs follow format), then:
- Variance comes only from correctness
- Format reward contributes nothing to the training signal
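A small sketch of this degenerate case, assuming a made-up group of four outputs that all satisfy the format:

```python
import torch

# Hypothetical group of 4 outputs where format is already solved everywhere.
format_r  = torch.tensor([1.0, 1.0, 1.0, 1.0])
correct_r = torch.tensor([0.0, 1.0, 1.0, 0.0])

combined = format_r + correct_r                       # tensor([1., 2., 2., 1.])
adv = (combined - combined.mean()) / (combined.std() + 1e-4)

# The constant format reward only shifts the mean; after normalization the
# training signal is identical to using the correctness reward alone.
adv_correct_only = (correct_r - correct_r.mean()) / (correct_r.std() + 1e-4)
print(torch.allclose(adv, adv_correct_only))          # True
```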
GDPO: The Solution
Group reward-Decoupled normalization Policy Optimization (GDPO) normalizes each reward independently before combining.
The GDPO Algorithm
For each reward $k \in \{1, \dots, K\}$ and each output $o_i$ in a group of $G$:

Step 1: Normalize each reward independently

$$A_i^{(k)} = \frac{r_i^{(k)} - \mu_k}{\sigma_k}$$

where $\mu_k = \text{mean}(\{r_j^{(k)}\}_{j=1}^{G})$ and $\sigma_k = \text{std}(\{r_j^{(k)}\}_{j=1}^{G})$.

Step 2: Combine normalized advantages

$$\tilde{A}_i = \sum_{k=1}^{K} w_k\, A_i^{(k)}$$

Step 3: Final batch normalization

$$A_i = \frac{\tilde{A}_i - \text{mean}(\{\tilde{A}_j\})}{\text{std}(\{\tilde{A}_j\})}$$
Why This Works
Decoupled normalization preserves the individual reward signals. Going back to our example:
| Output | $r^{\text{format}}$ | $r^{\text{correct}}$ | $A^{\text{format}}$ | $A^{\text{correct}}$ | $\tilde{A}$ |
|---|---|---|---|---|---|
| $o_1$ | 1.0 | 0.0 | +1.0 | -1.0 | 0.0 |
| $o_2$ | 0.0 | 1.0 | -1.0 | +1.0 | 0.0 |
| $o_3$ | 1.0 | 1.0 | +1.0 | +1.0 | +2.0 |
| $o_4$ | 0.0 | 0.0 | -1.0 | -1.0 | -2.0 |
Now $o_1$ and $o_2$ still have the same final advantage (after Step 3), but the per-reward advantages that produce it are different:
For $o_1$: the gradient encourages improving correctness (negative $A^{\text{correct}}$) while maintaining format (positive $A^{\text{format}}$)
For $o_2$: the gradient encourages improving format (negative $A^{\text{format}}$) while maintaining correctness (positive $A^{\text{correct}}$)
The model learns which specific reward to improve for each output.
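Running the same four outputs through decoupled normalization reproduces the table above. This is a minimal sketch of Steps 1 and 2; it uses the population standard deviation to match the ±1.0 values and drops the epsilon term since no reward is constant within the group (the article's `gdpo_advantages` uses torch's unbiased std plus an epsilon):

```python
import torch

rewards = torch.tensor([
    [1.0, 0.0],  # o1
    [0.0, 1.0],  # o2
    [1.0, 1.0],  # o3
    [0.0, 0.0],  # o4
])

def pop_normalize(x: torch.Tensor) -> torch.Tensor:
    mu = x.mean()
    sigma = ((x - mu) ** 2).mean().sqrt()   # population std; nonzero for both columns here
    return (x - mu) / sigma

# Step 1: normalize each reward column independently.
per_reward_adv = torch.stack([pop_normalize(rewards[:, k]) for k in range(2)], dim=1)
print(per_reward_adv)
# tensor([[ 1., -1.],
#         [-1.,  1.],
#         [ 1.,  1.],
#         [-1., -1.]])   -- o1 and o2 now carry opposite per-reward signals

# Step 2: combine with equal weights.
combined = per_reward_adv.sum(dim=1)        # tensor([ 0.,  0.,  2., -2.])
# Step 3 (final batch normalization) rescales these values but keeps the ordering.
```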
Key Insight: GDPO’s decoupled normalization ensures each reward contributes meaningfully to the gradient, regardless of its scale or correlation with other rewards.
Implementation
GDPO is a straightforward modification to GRPO. Here’s the key difference:
GRPO (Original)
def grpo_advantages(rewards_per_func, reward_weights, num_generations):
"""
GRPO: Normalize combined rewards.
Args:
rewards_per_func: [batch, K] rewards from K reward functions
reward_weights: [K] weights for each reward
num_generations: G, outputs per prompt
"""
# Combine rewards first
rewards = (rewards_per_func * reward_weights).sum(dim=1) # [batch]
# Then normalize
mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
std_grouped = rewards.view(-1, num_generations).std(dim=1)
mean_grouped = mean_grouped.repeat_interleave(num_generations)
std_grouped = std_grouped.repeat_interleave(num_generations)
advantages = (rewards - mean_grouped) / (std_grouped + 1e-4)
return advantages
GDPO (Decoupled)
def gdpo_advantages(rewards_per_func, reward_weights, num_generations):
"""
GDPO: Normalize each reward independently, then combine.
Args:
rewards_per_func: [batch, K] rewards from K reward functions
reward_weights: [K] weights for each reward
num_generations: G, outputs per prompt
"""
K = len(reward_weights)
device = rewards_per_func.device
# Handle NaN values
rewards_per_func = torch.nan_to_num(rewards_per_func)
all_reward_advantages = []
# Step 1: Normalize each reward independently
for k in range(K):
reward_k = rewards_per_func[:, k] # [batch]
# Group-wise normalization for this reward
mean_k = reward_k.view(-1, num_generations).mean(dim=1)
std_k = reward_k.view(-1, num_generations).std(dim=1)
mean_k = mean_k.repeat_interleave(num_generations)
std_k = std_k.repeat_interleave(num_generations)
advantage_k = (reward_k - mean_k) / (std_k + 1e-4)
all_reward_advantages.append(advantage_k)
# Step 2: Combine normalized advantages
combined_advantages = torch.stack(all_reward_advantages, dim=1) # [batch, K]
pre_bn_advantages = (combined_advantages * reward_weights).sum(dim=1) # [batch]
# Step 3: Final batch normalization
bn_mean = pre_bn_advantages.mean()
bn_std = pre_bn_advantages.std()
advantages = (pre_bn_advantages - bn_mean) / (bn_std + 1e-4)
return advantages
Complete GDPO Trainer
"""
Group reward-Decoupled normalization Policy Optimization (GDPO)
===============================================================
Extends GRPO for multi-reward settings by normalizing each
reward independently before combining.
Reference: NVIDIA (Liu et al., 2026)
"""
import torch
import torch.nn.functional as F
from torch import Tensor
from dataclasses import dataclass
from typing import List, Tuple, Callable
@dataclass
class GDPOConfig:
"""GDPO hyperparameters."""
group_size: int = 64
clip_epsilon: float = 0.2
kl_coef: float = 0.04
learning_rate: float = 1e-6
max_grad_norm: float = 1.0
eps: float = 1e-4
def compute_gdpo_advantages(
rewards_list: List[Tensor],
reward_weights: List[float],
eps: float = 1e-4,
) -> Tuple[Tensor, List[Tensor]]:
"""
Compute GDPO advantages with decoupled normalization.
Args:
rewards_list: List of [G] tensors, one per reward function
reward_weights: Weights for each reward
eps: Numerical stability constant
Returns:
advantages: [G] final normalized advantages
per_reward_advantages: List of [G] tensors for logging
"""
K = len(rewards_list)
G = rewards_list[0].shape[0]
device = rewards_list[0].device
# Step 1: Normalize each reward independently
normalized_rewards = []
for k in range(K):
r_k = rewards_list[k]
mean_k = r_k.mean()
std_k = r_k.std() + eps
normalized_k = (r_k - mean_k) / std_k
normalized_rewards.append(normalized_k)
# Step 2: Combine with weights
combined = torch.zeros(G, device=device)
for k in range(K):
combined += reward_weights[k] * normalized_rewards[k]
# Step 3: Final batch normalization
final_mean = combined.mean()
final_std = combined.std() + eps
advantages = (combined - final_mean) / final_std
return advantages, normalized_rewards
class GDPOTrainer:
"""
GDPO trainer for multi-reward LLM alignment.
Key difference from GRPO: each reward is normalized
independently before combining, preserving signal resolution.
"""
def __init__(
self,
policy_model: torch.nn.Module,
reference_model: torch.nn.Module,
reward_functions: List[Callable],
reward_weights: List[float],
tokenizer,
config: GDPOConfig,
):
self.policy = policy_model
self.reference = reference_model
self.reward_fns = reward_functions
self.reward_weights = torch.tensor(reward_weights)
self.tokenizer = tokenizer
self.config = config
assert len(reward_functions) == len(reward_weights), \
"Must have same number of reward functions and weights"
# Freeze reference
for param in self.reference.parameters():
param.requires_grad = False
self.optimizer = torch.optim.AdamW(
self.policy.parameters(),
lr=config.learning_rate,
)
@torch.no_grad()
def sample_and_score(
self,
prompt_text: str,
prompt_ids: Tensor,
attention_mask: Tensor,
) -> Tuple[Tensor, List[Tensor], Tensor]:
"""
Sample G outputs and compute all rewards.
Returns:
output_ids: [G, seq_len] generated sequences
rewards_list: List of [G] tensors, one per reward function
output_masks: [G, seq_len] attention masks
"""
G = self.config.group_size
prompt_len = prompt_ids.shape[1]
# Expand for batch generation
prompt_ids = prompt_ids.expand(G, -1)
attention_mask = attention_mask.expand(G, -1)
# Sample
self.policy.eval()
outputs = self.policy.generate(
input_ids=prompt_ids,
attention_mask=attention_mask,
max_new_tokens=512,
do_sample=True,
temperature=1.0,
pad_token_id=self.tokenizer.pad_token_id,
)
output_ids = outputs
output_masks = (output_ids != self.tokenizer.pad_token_id).long()
# Score with each reward function
rewards_list = []
for reward_fn in self.reward_fns:
rewards_k = []
for i in range(G):
response_ids = output_ids[i, prompt_len:]
response_text = self.tokenizer.decode(
response_ids, skip_special_tokens=True
)
r = reward_fn(prompt_text, response_text)
rewards_k.append(r)
rewards_list.append(
torch.tensor(rewards_k, device=output_ids.device)
)
return output_ids, rewards_list, output_masks
def compute_log_probs(
self,
model: torch.nn.Module,
input_ids: Tensor,
attention_mask: Tensor,
response_start: int,
) -> Tensor:
"""Compute per-token log probs for response."""
outputs = model(input_ids, attention_mask=attention_mask)
logits = outputs.logits
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
log_probs = F.log_softmax(shift_logits, dim=-1)
token_log_probs = log_probs.gather(
dim=-1, index=shift_labels.unsqueeze(-1)
).squeeze(-1)
return token_log_probs[:, response_start - 1:]
def train_step(
self,
prompt_text: str,
prompt_ids: Tensor,
attention_mask: Tensor,
) -> dict:
"""
Single GDPO training step.
"""
prompt_len = prompt_ids.shape[1]
# Sample and score
output_ids, rewards_list, output_masks = self.sample_and_score(
prompt_text, prompt_ids, attention_mask
)
# Compute GDPO advantages
advantages, per_reward_adv = compute_gdpo_advantages(
rewards_list,
self.reward_weights.tolist(),
self.config.eps,
)
# Get log probs
with torch.no_grad():
old_log_probs = self.compute_log_probs(
self.policy, output_ids, output_masks, prompt_len
)
ref_log_probs = self.compute_log_probs(
self.reference, output_ids, output_masks, prompt_len
)
# Update
self.policy.train()
policy_log_probs = self.compute_log_probs(
self.policy, output_ids, output_masks, prompt_len
)
response_mask = output_masks[:, prompt_len:]
# Compute loss (same as GRPO)
loss, metrics = self._compute_loss(
policy_log_probs,
old_log_probs,
ref_log_probs,
advantages,
response_mask,
)
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.policy.parameters(),
self.config.max_grad_norm
)
self.optimizer.step()
# Add per-reward metrics
for k, rewards_k in enumerate(rewards_list):
metrics[f"reward_{k}_mean"] = rewards_k.mean().item()
metrics[f"reward_{k}_std"] = rewards_k.std().item()
metrics[f"adv_{k}_mean"] = per_reward_adv[k].mean().item()
return metrics
def _compute_loss(
self,
policy_log_probs: Tensor,
old_log_probs: Tensor,
ref_log_probs: Tensor,
advantages: Tensor,
mask: Tensor,
) -> Tuple[Tensor, dict]:
"""Compute GDPO loss with clipping and KL."""
G, seq_len = policy_log_probs.shape
# Importance ratio
ratio = torch.exp(policy_log_probs - old_log_probs)
# Expand advantages
adv_expanded = advantages.unsqueeze(1).expand(-1, seq_len)
# Clipped objective
clipped = torch.clamp(
ratio,
1 - self.config.clip_epsilon,
1 + self.config.clip_epsilon
)
policy_loss = -torch.min(
ratio * adv_expanded,
clipped * adv_expanded
)
# KL penalty
log_ratio = ref_log_probs - policy_log_probs
kl = torch.exp(log_ratio) - log_ratio - 1
# Combine
token_loss = policy_loss + self.config.kl_coef * kl
# Mask and average
masked_loss = (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
loss = masked_loss.mean()
with torch.no_grad():
clip_frac = ((ratio - 1).abs() > self.config.clip_epsilon).float()
clip_frac = (clip_frac * mask).sum() / mask.sum()
metrics = {
"loss": loss.item(),
"mean_kl": (kl * mask).sum().item() / mask.sum().item(),
"clip_fraction": clip_frac.item(),
"mean_advantage": advantages.mean().item(),
}
return loss, metrics
Experimental Results
The GDPO paper demonstrates consistent improvements across multiple tasks.
Tool Calling (2 Rewards)
Training Qwen2.5-1.5B-Instruct with format + correctness rewards:
| Method | Format Compliance | Tool Accuracy | Training Stability |
|---|---|---|---|
| GRPO | 78% | 61% | Some oscillation |
| GDPO | 94% | 73% | Stable convergence |
GDPO achieves better performance on both metrics with more stable training curves.
Math Reasoning (3 Rewards)
Training on GSM8K with format + correctness + integer rewards:
| Method | Format | Accuracy | Integer | Overall |
|---|---|---|---|---|
| GRPO | 82% | 45% | 71% | 42% |
| GDPO | 96% | 52% | 89% | 51% |
The three-reward setting amplifies GDPO’s advantage—more rewards mean more potential for collapse.
Code Generation
Similar patterns: GDPO consistently outperforms GRPO on execution, correctness, and style metrics.
Key Observations
- More rewards → bigger gap: The advantage collapse problem worsens with more rewards
- Training stability: GDPO shows smoother loss curves and fewer failures
- Complementary rewards: When rewards measure different aspects, GDPO’s decoupling is most beneficial
Key Insight: GDPO’s improvement scales with the number of rewards. For single-reward settings, use GRPO. For multi-reward, GDPO is the clear choice.
When to Use GDPO vs GRPO
Use GRPO When:
- Single reward: No multi-reward setting
- Highly correlated rewards: Rewards that move together don’t benefit from decoupling
- Simplicity preferred: GRPO is simpler to implement
Use GDPO When:
- Multiple rewards: Two or more distinct reward functions
- Complementary rewards: Rewards measure different aspects (format vs. correctness)
- Training instability with GRPO: If GRPO shows oscillation or failure, try GDPO
- Production multi-objective alignment: Default for real-world multi-reward scenarios
Production Patterns
Pattern 1: Format + Correctness
The most common multi-reward setup:
def format_reward(prompt: str, response: str) -> float:
"""Binary: follows required structure?"""
has_think = "<think>" in response and "</think>" in response
has_answer = "<answer>" in response and "</answer>" in response
return 1.0 if (has_think and has_answer) else 0.0
def correctness_reward(prompt: str, response: str) -> float:
"""Continuous: semantically correct?"""
# Extract answer and check against ground truth
# Returns value in [-1, 1]
...
reward_fns = [format_reward, correctness_reward]
reward_weights = [1.0, 1.0] # Equal weighting
Use GDPO—format and correctness are complementary.
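The correctness reward above is left elided because it is task-specific. One hedged way to fill it in for a math-style task with a single reference answer might look like the sketch below; the answer-extraction regex and the `GROUND_TRUTH` lookup are illustrative assumptions, not part of the original pattern:

```python
import re
from typing import Dict, Optional

# Hypothetical reference answers keyed by prompt; in practice these would come
# from the dataset batch rather than a module-level dict.
GROUND_TRUTH: Dict[str, str] = {}

def correctness_reward(prompt: str, response: str) -> float:
    """Sketch: +1 if the extracted answer matches the reference, -1 otherwise."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return -1.0  # no parseable answer
    predicted = match.group(1).strip()
    target: Optional[str] = GROUND_TRUTH.get(prompt)
    if target is None:
        return 0.0   # no reference available for this prompt
    return 1.0 if predicted == target.strip() else -1.0
```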
Pattern 2: Hierarchical Rewards
When one reward is prerequisite for another:
def hierarchical_correctness(prompt: str, response: str) -> float:
"""Only evaluate correctness if format is satisfied."""
if format_reward(prompt, response) < 0.5:
return 0.0 # Can't evaluate without proper format
return base_correctness_reward(prompt, response)
reward_fns = [format_reward, hierarchical_correctness]
GDPO still helps, but consider the dependency structure.
Pattern 3: Weighted Multi-Objective
Different importance for different rewards:
reward_fns = [safety_reward, helpfulness_reward, conciseness_reward]
reward_weights = [2.0, 1.0, 0.5] # Safety prioritized
GDPO’s per-reward normalization ensures each contributes proportionally.
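One consequence of per-reward normalization worth seeing explicitly: the raw scale of a reward stops mattering, so only its weight controls its influence. A minimal sketch, assuming the `compute_gdpo_advantages` helper defined earlier in this article is in scope and using made-up reward values:

```python
import torch

torch.manual_seed(0)
group_size = 8
safety = torch.randint(0, 2, (group_size,)).float()    # binary reward in {0, 1}
helpfulness = torch.rand(group_size) * 100.0            # continuous reward in [0, 100]

# Rescaling helpfulness by 100x changes nothing: with combined normalization the
# large-scale reward would dominate, but with decoupled normalization only the
# weights [2.0, 1.0] determine each reward's influence on the final advantages.
adv, _ = compute_gdpo_advantages([safety, helpfulness], [2.0, 1.0], eps=1e-8)
adv_rescaled, _ = compute_gdpo_advantages([safety, helpfulness / 100.0], [2.0, 1.0], eps=1e-8)
print(torch.allclose(adv, adv_rescaled, atol=1e-5))     # True
```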
Pattern 4: Dynamic Weights
Adjusting weights during training:
def get_weights(step: int) -> List[float]:
"""Start with format emphasis, shift to correctness."""
format_w = max(0.5, 1.0 - step / 10000)
correct_w = min(1.5, 0.5 + step / 10000)
return [format_w, correct_w]
GDPO handles this naturally since normalization is per-reward.
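For reference, evaluating the schedule at a few steps shows the intended shift; the per-step weight list would then be passed to the advantage computation in place of a fixed `reward_weights` (a quick check of the function above, nothing more):

```python
# Early steps emphasize format; by step 10000 correctness dominates.
for step in (0, 5000, 10000):
    print(step, get_weights(step))
# 0     [1.0, 0.5]
# 5000  [0.5, 1.0]
# 10000 [0.5, 1.5]
```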
Key Takeaways
The Problem: Advantage Collapse
When GRPO normalizes combined rewards, distinct reward combinations can collapse to identical advantages, losing training signal resolution.
The Solution: GDPO
Normalize each reward independently, then combine:
# GRPO (problematic)
combined = sum(w * r for w, r in zip(weights, rewards))
advantage = (combined - mean) / std
# GDPO (preserves signal)
normalized = [(r - mean(r)) / std(r) for r in rewards]
combined = sum(w * n for w, n in zip(weights, normalized))
advantage = (combined - mean) / std # Final batch norm
When to Use GDPO
- Multiple rewards measuring different aspects
- Tool calling, math reasoning, code generation
- Any multi-objective LLM alignment
Implementation
GDPO is a drop-in replacement for GRPO’s advantage computation. Minimal code change, significant improvement.
The Broader Lesson
As LLM training becomes more sophisticated, we’ll see more multi-reward settings. Understanding how normalization affects training signal is crucial for designing effective alignment pipelines.
Series Summary
Over four articles, we’ve built a complete understanding of policy optimization for LLMs:
| Part | Topic | Key Contribution |
|---|---|---|
| 1 | RL Foundations | MDPs, policy gradients, variance reduction, value functions |
| 2 | PPO | Clipped objectives, GAE, the four-model architecture |
| 3 | GRPO | Group-relative advantages, eliminating the value network |
| 4 | GDPO | Multi-reward settings, decoupled normalization |
Each algorithm addresses specific limitations of its predecessor:
- PPO addresses policy gradient variance
- GRPO addresses PPO’s memory overhead
- GDPO addresses GRPO’s multi-reward limitations
Understanding this progression helps you choose the right tool for your alignment task.
Further Reading
GDPO:
- GDPO: Group reward-Decoupled Normalization Policy Optimization (Liu et al., 2026)
- GitHub: NVlabs/GDPO
Foundations:
- DeepSeekMath — GRPO origin
- PPO — The baseline
Article series
Policy Optimization for LLMs: From Fundamentals to Production
- Part 1 Reinforcement Learning Foundations for LLM Alignment
- Part 2 PPO for Language Models: The RLHF Workhorse
- Part 3 GRPO: Eliminating the Value Network
- Part 4 GDPO: Multi-Reward RL Done Right
Cite this article
Sousa, V. (2026). GDPO: Multi-Reward RL Done Right. vitorsousa.com. https://www.vitorsousa.com/blog//
@article{sousa2026,
title={GDPO: Multi-Reward RL Done Right},
author={Sousa, Vitor},
year={2026},
url={https://www.vitorsousa.com/blog//}
}