~18 min By Vitor Sousa
Part 5 of 5 in Policy Optimization for LLMs: From Fundamentals to Production
Advanced. Prerequisites: Reinforcement Learning Foundations for LLM Alignment, PPO for Language Models: The RLHF Workhorse, GRPO: Eliminating the Value Network, GDPO: Multi-Reward RL Done Right

After GDPO: A Map of the GRPO Family

Part 5 of 5: The Map

TL;DR: DAPO, Dr. GRPO, GSPO, GDPO, CISPO, VAPO, GiGPO, GMPO, REINFORCE++… it looks like GRPO fragmented into a zoo. It didn’t. Every post-GRPO method keeps the same skeleton — drop the critic, compute a group-relative advantage — and varies along a small set of design axes: token vs. sequence granularity, clip width, length and standard-deviation normalization, KL on or off, how advantages are normalized, and how groups are sampled. Hold the axes and the zoo collapses into a handful of deliberate trade-offs. This post maps the four variants worth treating in depth, gives you a decision guide for when to reach for each, and equips you to reason about the next variant you meet — not just the ones listed here.


Prerequisites: This is the capstone of the series. It assumes you already know GRPO. If not, start with Part 3: GRPO — and ideally Part 2: PPO and Part 4: GDPO — then come back. This post sits on top of those; it does not re-derive them.


The Zoo Is Real

Open any RL-for-LLMs reading list from the last year and you will drown in acronyms. DAPO. Dr. GRPO. GSPO. GDPO. CISPO. GMPO. VAPO. GiGPO. REINFORCE++. There is even a curated index whose entire job is to keep up with the post-R1 GRPO family. The natural reaction is the one everyone has: the field fragmented, and I now have ten algorithms to learn instead of one.

That reaction is wrong, and it is wrong in a useful way. These are not ten algorithms. They are one algorithm, GRPO, plus a handful of engineering debts it left on the table, each being paid down by a different paper. Once you can name those debts, every “new” method announces itself by which debt it pays and how. You stop memorizing the zoo and start reading coordinates on a map.

This post is that map. It is deliberately a synthesis, not a listicle: the value is in the axes and the judgement, not in summaries you could get from each paper’s abstract.


The Shared Skeleton (What Nobody Changes)

Start with what every member of the family keeps. Strip away the branding and they all make GRPO’s two core moves:

  1. Drop the critic. No learned value network $V_\psi$. PPO’s four-model architecture loses a model, and with it roughly a third of the memory.
  2. Compute a group-relative advantage. For a prompt $q$, sample a group of $G$ completions $\{o_1, \ldots, o_G\}$ from the old policy, score them, and use the group statistics as the baseline:

$$\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}, \qquad \mathbf{r} = \{r_1, \ldots, r_G\}.$$

That is the whole skeleton. At heart it is REINFORCE with a baseline — the mean reward of the group is the baseline — which is exactly the lineage Ahmadian et al. (2024) argued we should return to for RLHF. Everything is then plugged into a PPO-style clipped surrogate:

$$\mathcal{L}_{i,t} = \min\left[\rho_{i,t}\, \hat{A}_{i,t}, \; \text{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_{i,t}\right], \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t}\,|\,q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\,|\,q, o_{i,<t})}.$$

Every variant in this post inherits these two equations. What they argue about is the fine print — and the fine print turns out to be a short list.
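To make the two equations concrete, here is a minimal pure-Python sketch of the skeleton for a single group. The function names and the `eps` stabilizer are illustrative assumptions, and the choice of population standard deviation is also an assumption, since implementations differ on that detail.

```python
import math
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r_i - mean) / std."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, libraries differ
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO-style objective: min(rho * A, clip(rho) * A)."""
    rho = math.exp(logp_new - logp_old)
    rho_clipped = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * advantage, rho_clipped * advantage)

# One prompt, four completions scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Every token of completion i shares that completion's advantage.
token_objective = clipped_surrogate(math.log(0.15), math.log(0.10), advantages[0])
print(advantages, token_objective)
```

In the example, the token's ratio is 1.5, so with a positive advantage the clip at 1.2 is what the objective returns.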


The Axes of Variation

This is the intellectual core of the post. Spend time here; the variants below are just coordinates in the space these axes define.

Granularity. Are the importance ratio $\rho$ and its clipping computed per token or per sequence? GRPO is token-level. A single rare token can produce a wildly large or small ratio, and that noise compounds, especially in mixture-of-experts models, where the active experts can change between the sampling policy and the training policy. Sequence-level methods average the signal over the whole completion so the unit of optimization matches the unit of reward.

Clip width. PPO/GRPO clip symmetrically, $[1-\varepsilon, 1+\varepsilon]$. But the clip ceiling quietly caps upside: it limits how much probability mass a good-but-currently-unlikely token can gain in one step. Those low-probability tokens are often the interesting ones: the “wait,” “let me reconsider,” fork-and-reflect tokens that long chain-of-thought reasoning depends on. Widening or de-symmetrizing the clip (“clip-higher”) controls the explore/exploit edge and is the main lever against entropy collapse, where the policy sharpens into a narrow, repetitive mode and stops learning.
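The ceiling's effect on unlikely tokens is easy to see numerically. The sketch below is illustrative: the function names are hypothetical, and the values 0.2 and 0.28 are example settings for the lower and upper clip widths, not a recommendation.

```python
def clip_higher_surrogate(rho, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with separate lower and upper clip widths."""
    rho_clipped = max(1.0 - eps_low, min(rho, 1.0 + eps_high))
    return min(rho * advantage, rho_clipped * advantage)

def max_prob_with_gradient(p_old, eps_high):
    """Largest new probability a positively rewarded token can reach
    before the upper clip zeroes out its gradient."""
    return min(1.0, p_old * (1.0 + eps_high))

for p_old in (0.01, 0.9):
    symmetric = max_prob_with_gradient(p_old, eps_high=0.2)
    higher = max_prob_with_gradient(p_old, eps_high=0.28)
    print(f"p_old={p_old}: symmetric cap={symmetric:.4f}, clip-higher cap={higher:.4f}")
```

A token at probability 0.9 hits the cap of 1.0 either way, while a token at 0.01 gains extra headroom only when the ceiling is raised, which is the asymmetry the paragraph above describes.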

Length and std normalization. GRPO divides the per-response loss by response length and divides the advantage by the group’s standard deviation. Both are harmless-looking conveniences that bias the gradient. Length normalization makes the optimizer prefer longer wrong answers (more tokens over which to spread a negative advantage means a smaller per-token penalty); std normalization over-weights prompts where rewards barely vary, which are the ones that are nearly always solved or nearly always failed. These biases are why responses balloon during training without getting better.
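The length bias is plain arithmetic. In this small sketch (the helper name is hypothetical), two wrong answers receive the same negative advantage, and the only difference is whether the per-response loss is divided by the response length.

```python
def per_token_weight(advantage, length, length_normalize):
    """Weight that each token's gradient term receives from one response."""
    return advantage / length if length_normalize else advantage

short_wrong, long_wrong = 10, 100  # token counts of two wrong answers
advantage = -1.0                   # same negative advantage for both

with_norm = [per_token_weight(advantage, n, True) for n in (short_wrong, long_wrong)]
without_norm = [per_token_weight(advantage, n, False) for n in (short_wrong, long_wrong)]

print(with_norm)     # [-0.1, -0.01]
print(without_norm)  # [-1.0, -1.0]
```

With length normalization, each token of the 100-token wrong answer is penalized ten times less than each token of the 10-token one, so being wrong at length is cheaper per token.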

KL term. Keep the KL penalty to the reference model, or drop it? KL anchors the policy and prevents drift, but it costs a reference model in memory and can hold the policy back when you want it to move far from the base model (as in R1-Zero-style training from a base checkpoint). Many recent recipes drop it entirely.
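For reference, a per-token KL penalty of the kind GRPO-style recipes use can be sketched as below. The estimator shown is the common low-variance "k3" form; the function names and the `beta` value are illustrative assumptions, and setting `beta` to zero corresponds to dropping the term (which also removes the need to keep a reference model in memory).

```python
import math

def k3_kl_estimate(logp_policy, logp_ref):
    """Per-token estimate of KL(policy || ref): r - log(r) - 1, with r = p_ref / p_policy.
    It is non-negative and equals zero when the two log-probs match."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - log_r - 1.0

def penalized_objective(surrogate, logp_policy, logp_ref, beta):
    """Clipped surrogate minus a weighted KL penalty; beta=0 drops the KL term."""
    return surrogate - beta * k3_kl_estimate(logp_policy, logp_ref)

kl = k3_kl_estimate(math.log(0.6), math.log(0.3))
print(kl)
print(penalized_objective(1.0, math.log(0.6), math.log(0.3), beta=0.04))
print(penalized_objective(1.0, math.log(0.6), math.log(0.3), beta=0.0))
```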

Advantage normalization. Per-group versus global (batch-wide) normalization, and — crucially in multi-objective settings — single-reward versus multi-reward. When several rewards are summed and then normalized together, distinct reward combinations can collapse to identical advantages, blinding the policy to one objective. This is the failure Part 4 is built around.

Sampling. Static groups versus dynamic sampling. A group where every completion is correct (or every one is wrong) has zero reward variance, so every advantage in it is zero and it contributes no gradient; it is wasted compute. Dynamically filtering or resampling degenerate groups keeps every batch informative.
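A minimal sketch of the filtering idea, assuming a hypothetical `sample_group` callable that returns the rewards of one freshly sampled group. Real systems resample prompts and completions; here a fixed stream of reward lists stands in for that.

```python
from itertools import cycle

def has_signal(rewards):
    """A group carries gradient only if its rewards are not all identical."""
    return max(rewards) != min(rewards)

def fill_batch(sample_group, batch_size, max_attempts=100):
    """Keep sampling groups until the batch holds batch_size informative ones."""
    batch = []
    for _ in range(max_attempts):
        if len(batch) == batch_size:
            break
        rewards = sample_group()
        if has_signal(rewards):
            batch.append(rewards)
    return batch

# Stand-in for rollouts: all-correct, all-wrong, then two mixed groups.
stream = cycle([[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1], [0, 1, 0, 0]])
batch = fill_batch(lambda: next(stream), batch_size=2)
print(batch)  # [[1, 0, 0, 1], [0, 1, 0, 0]]
```

The `max_attempts` cap is a design choice for the sketch: it bounds the extra sampling cost when most prompts are already solved or still out of reach.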

Six axes. That is the entire design space the family lives in. Here is the same idea as a picture — each axis, and the variant that pushes hardest along it:

Figure: The six axes, and the variant that moves along each. Each track runs from GRPO's default to the variant that pushes hardest along it.

  • Granularity: token-level (GRPO) → sequence-level (GSPO)
  • Clip width: tight, symmetric (GRPO) → wide, clip-higher (DAPO)
  • Length + std norm: on, biased (GRPO) → removed, unbiased (Dr. GRPO)
  • Advantage norm: pooled rewards (GRPO) → decoupled per reward (GDPO)
  • KL term: kept (GRPO) → dropped (DAPO / Dr. GRPO)
  • Sampling: static groups (GRPO) → dynamic sampling (DAPO)

Note that DAPO appears on three axes: it is not one trick but a bundle of axis-moves shipped together.

Two patterns jump out of this picture. First, most variants move along one axis — they are surgical. Second, DAPO moves along several at once; it is less a single idea than a curated bundle. Keep both observations in mind as we walk the variants.

What do these axes look like in a training run? The two biases that motivate Dr. GRPO and DAPO show up as characteristic curve shapes:

Figure (schematic): two curves plotted against training steps. Response length: GRPO inflates, Dr. GRPO stays flat. Policy entropy: symmetric clip collapses, clip-higher holds.

Schematic, not from any single paper; redrawn to show the shapes. Length inflation and entropy collapse are the two failure curves the family was built to flatten.

The Variants: Problem → Change → When

With the axes in hand, here are the four worth knowing in depth, plus the frontier cluster. Read each as a triple: the problem it noticed, the change it made, and the situation that should make you reach for it.

| Variant | Core change vs. GRPO | Problem it fixes | Reach for it when… |
|---|---|---|---|
| GRPO | none (the baseline) | PPO’s critic cost | you want a sane critic-free default |
| Dr. GRPO | removes length + std normalization → unbiased gradient | response-length inflation; token waste | responses balloon; you want token efficiency |
| DAPO | clip-higher + dynamic sampling + token-level loss + overlong reward shaping | entropy collapse, reward noise, long-CoT instability | scaling up; long chains-of-thought; training diverging |
| GSPO | sequence-level importance ratio + sequence-level clipping | noisy token ratios; MoE instability | MoE models; large-scale stability |
| GDPO | normalize each reward independently before combining | advantage collapse under multiple rewards | multi-objective alignment (tool use, multi-criteria) |
| CISPO | clips importance-sampling weights, not token updates | low-prob “fork”/reflection tokens being clipped out | you’re losing reflective “aha” tokens |

Dr. GRPO — debias the gradient

Understanding R1-Zero-Like Training: A Critical Perspective (Liu et al., 2025) noticed that GRPO’s length and standard-deviation normalization are not neutral conveniences; they are biases. The length term systematically favors longer wrong answers, which is exactly the runaway-verbosity pathology practitioners kept seeing. The fix is almost embarrassingly small: delete both normalizers. No division by response length, no division by group std. The gradient becomes unbiased, responses stop inflating, and the model is markedly more token-efficient. The method is built on the lightweight Oat framework and comes with a clean reference implementation. Reach for it the moment you see length climbing without accuracy following.
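The std half of the fix can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, and population std with a small `eps` is one common choice among several.

```python
from statistics import fmean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style: subtract the group mean, then divide by the group std."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style: subtract the group mean only, no std division."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]

nearly_solved = [1, 1, 1, 1, 1, 1, 1, 0]  # 7 of 8 correct: low reward variance
balanced = [1, 1, 1, 1, 0, 0, 0, 0]       # 4 of 8 correct: high reward variance

for name, rewards in (("nearly_solved", nearly_solved), ("balanced", balanced)):
    scale = max(abs(a) for a in grpo_advantages(rewards)) / max(
        abs(a) for a in dr_grpo_advantages(rewards)
    )
    print(f"{name}: std division scales advantages by about {scale:.2f}x")
```

The low-variance group is scaled up by roughly 3x while the balanced group is scaled by 2x, which is the over-weighting of low-variance prompts that removing the std term eliminates.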

DAPO — the production bundle

DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., 2025) is the one that is really four things at once; the name stands for Decoupled Clip and Dynamic sAmpling Policy Optimization. It ships clip-higher (de-symmetrize the clip ceiling to keep low-probability exploratory tokens alive and fight entropy collapse), dynamic sampling (drop degenerate all-correct/all-wrong groups so every batch carries gradient), token-level policy-gradient loss (so long completions are not under-weighted in long-CoT training), and overlong reward shaping (soften the penalty for responses that run past the length budget, reducing reward noise). Together they reach 50 points on AIME 2024 with a Qwen2.5-32B base, with code and data fully open on top of verl. DAPO is what you reach for when you are scaling up real long-CoT training and instability (diverging loss, collapsing entropy) is the thing standing between you and a good model.
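Of the four pieces, overlong reward shaping is the least self-explanatory, so here is a hedged sketch of a soft length penalty in that spirit: no penalty up to a soft threshold, a linear ramp across a buffer zone, and a flat penalty past the hard limit. The function name, parameter names, and the specific lengths are illustrative assumptions.

```python
def soft_overlong_penalty(length, max_len, buffer_len):
    """Length penalty added to the task reward.
    0 up to (max_len - buffer_len), linear down to -1 at max_len, -1 beyond."""
    soft_start = max_len - buffer_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / buffer_len
    return -1.0

# Illustrative budget: 16,384 tokens free of penalty plus a 4,096-token buffer.
MAX_LEN, BUFFER_LEN = 20480, 4096
for n in (8000, 16384, 18432, 20480, 25000):
    print(n, soft_overlong_penalty(n, MAX_LEN, BUFFER_LEN))
```

The ramp is what reduces reward noise: a response that runs slightly long is nudged rather than scored the same as one that never terminates.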

GSPO — match the unit of optimization to the unit of reward

Group Sequence Policy Optimization (Zheng et al., 2025, the Qwen team) made a single sharp observation: the reward is assigned to a sequence, so why is the importance ratio computed per token? Token-level ratios are noisy, and that noise is catastrophic for mixture-of-experts models, where the set of active experts can differ between the sampling and training passes. GSPO defines the importance ratio on the sequence likelihood and clips at the sequence level. The result stabilizes MoE RL training — enough that it underpins the latest Qwen3 models. Reach for it when you are training MoE models or chasing large-scale stability, and when token-level ratios are giving you grief. (Implementations live in verl and TRL; the Qwen write-up is the readable intro.)
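A small numeric sketch shows why the sequence-level ratio is calmer. It assumes the length-normalized form, the exponential of the mean token log-ratio; the function names and the toy log-probabilities are illustrative.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios, as token-level GRPO uses them."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence likelihood ratio: exp(mean token log-ratio)."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Ten tokens; nine are unchanged and one became three times more likely.
logp_old = [-1.0] * 10
logp_new = [-1.0] * 9 + [-1.0 + math.log(3.0)]

print(max(token_ratios(logp_new, logp_old)))  # about 3.0: one extreme token ratio
print(sequence_ratio(logp_new, logp_old))     # about 1.12: the sequence barely moved
```

One outlier token dominates the token-level picture, while the sequence-level ratio reflects that the completion as a whole is nearly on-policy.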

GDPO — keep multiple rewards from colliding

This is Part 4, so I will be brief. GDPO: Group reward-Decoupled Normalization Policy Optimization (Liu et al., 2026, NVIDIA) attacks the multi-reward case. Sum several rewards and normalize the total, and different reward combinations can map to the same advantage — the policy goes blind to one objective. GDPO normalizes each reward independently before combining, preserving the resolution of every signal. Reach for it whenever you are optimizing more than one reward at once — tool calling with format-plus-correctness, math with format-plus-answer-plus-integer, coding with execution-plus-tests-plus-style. (If you suspect this is already happening to you, the advantage-collapse checklist is a quick diagnostic.)
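The collapse is easiest to see with two binary rewards and two rollouts per group. The sketch below is a simplified illustration with hypothetical function names; it shows only the per-reward normalization step and omits any further batch-level normalization a full implementation might apply.

```python
from statistics import fmean, pstdev

def standardize(values, eps=1e-8):
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def pooled_advantages(reward_rows):
    """Sum the rewards per rollout, then standardize the totals."""
    return standardize([sum(row) for row in reward_rows])

def decoupled_advantages(reward_rows):
    """Standardize each reward column on its own, then sum per rollout."""
    columns = [standardize(col) for col in zip(*reward_rows)]
    return [sum(vals) for vals in zip(*columns)]

# Each row is one rollout: (format_reward, correctness_reward).
group_a = [(0, 0), (1, 0)]  # second rollout improves format only
group_b = [(0, 0), (1, 1)]  # second rollout improves format and correctness

for name, group in (("A", group_a), ("B", group_b)):
    pooled = [round(a, 3) for a in pooled_advantages(group)]
    decoupled = [round(a, 3) for a in decoupled_advantages(group)]
    print(name, "pooled:", pooled, "decoupled:", decoupled)
# A pooled: [-1.0, 1.0] decoupled: [-1.0, 1.0]
# B pooled: [-1.0, 1.0] decoupled: [-2.0, 2.0]
```

Pooled normalization gives both groups identical advantages, so getting the answer right earns nothing extra; decoupling keeps the second reward visible.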

The frontier cluster

Beyond the four, a frontier worth knowing as coordinates, not as separate disciplines:

  • CISPO, from the MiniMax-M1 technical report (2025), clips the importance-sampling weights rather than the token updates. The motivation is precisely the clip-width axis above: standard clipping silently discards the low-probability fork-and-reflect tokens, so CISPO keeps every token contributing and reports a 2× speedup over DAPO in their setting. Reach for it when you suspect your clip is eating your model’s “aha” moments.
  • SPO (Single-stream Policy Optimization, Xu & Ding, 2025) goes group-free: it replaces the on-the-fly group baseline with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch. It is a glimpse of where the family heads once you decide the group itself is the bottleneck.
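The CISPO point can be made concrete by comparing the coefficient each objective places on a token's log-probability gradient. This is a sketch under assumptions: the function names are hypothetical, the clip widths are example values, and the CISPO form assumes the clipped importance weight is treated as a constant (stop-gradient) multiplying the advantage and the log-probability.

```python
def ppo_clip_grad_weight(rho, advantage, eps=0.2):
    """Coefficient on grad(log pi) under the PPO-clip objective.
    It is zero whenever the clipped branch is the active one."""
    if advantage > 0 and rho > 1.0 + eps:
        return 0.0
    if advantage < 0 and rho < 1.0 - eps:
        return 0.0
    return rho * advantage

def cispo_grad_weight(rho, advantage, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad(log pi) when only the importance weight is clipped."""
    weight = max(1.0 - eps_low, min(rho, 1.0 + eps_high))
    return weight * advantage

# A rare "reflection" token whose probability already rose 1.5x, with positive advantage.
rho, advantage = 1.5, 1.0
print(ppo_clip_grad_weight(rho, advantage))  # 0.0: the token stops contributing
print(cispo_grad_weight(rho, advantage))     # 1.2: bounded weight, still contributing
```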

You can now place the rest of the zoo — VAPO, GMPO, GiGPO, REINFORCE++ — without my help: each one moves along one or two of the six axes. That is the entire point of the map.


A Decision Guide

The variants reduce to a flowchart. Start from the symptom, not the acronym.

Running GRPO and hitting a wall? Diagnose the symptom, then read across.

  • Responses ballooning, token waste → Dr. GRPO: debias length + std
  • Long-CoT at scale, entropy collapsing → DAPO: clip-higher + dynamic sampling
  • MoE or large-scale instability → GSPO: sequence-level ratio + clip
  • Multiple rewards colliding → GDPO: decouple reward normalization
  • Losing reflection / "aha" tokens → CISPO: clip the IS weights, not tokens
  • No specific failure? Plain GRPO (or REINFORCE-with-baseline / RLOO) is a strong, simple default.

These are not mutually exclusive: DAPO already bundles several of these moves, and verl lets you compose them.

The honest version of this flowchart has a caveat: the moves compose. You can run Dr. GRPO’s debiasing and clip-higher and drop KL at the same time; that is roughly what the minimal from-scratch repos do by default. The flowchart tells you which lever a symptom points to, not that you may pull only one.


“Is This Just PPO With Extra Steps?”

A fair skeptic will push back: if everything is REINFORCE-with-a-baseline wearing a PPO surrogate, did we really invent anything?

Two uncomfortable findings say the skeptic is mostly right, and that is worth internalizing.

First, the on-policy story is a myth. Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm (Yao et al., 2025) derives group-relative REINFORCE from first principles and shows it admits a native off-policy interpretation — the within-group mean baseline does not require the on-policy assumption everyone attaches to it. Where this lands is sobering: the convergence point of the whole family is something close to “PPO with a global advantage and no critic.” The variants are different routes to the same neighborhood, not different destinations.

Second, and more deflating for algorithm enthusiasts: The Art of Scaling RL Compute for LLMs (Khatri, Madaan et al., 2025) — a 400,000+ GPU-hour study — finds that loss aggregation, normalization, curriculum, and the choice of off-policy algorithm primarily move compute efficiency, not the asymptotic performance ceiling. The ceiling is set elsewhere: by the data and the reward. Algorithms get you to the ceiling faster; they rarely raise it.

That is the judgement this map is really teaching. The axes are real and worth tuning: they decide whether your run is stable, efficient, and reaches the ceiling at all. But if you are choosing a variant hoping it will lift you past the ceiling your data and reward set, you have the wrong lever in your hand.


The Production Lens

Which points straight at the thing that actually decides outcomes: every variant on this map optimizes against a reward, and every reward is a proxy.

In math and code you get lucky — a verifier exists. The answer is right or it is not; the unit tests pass or they do not. Reinforcement learning with verifiable rewards (RLVR) works there precisely because the proxy is nearly the real thing. Step outside that world — helpfulness, faithfulness, tone, “did the agent actually accomplish the task” — and there is no clean verifier. You build one (a reward model, an LLM judge, a heuristic), and the optimizer’s job becomes finding its seams. Whatever gap exists between your proxy and what you actually want, a good optimizer will widen it. That is reward hacking, and a better optimizer hacks a flawed reward faster.

So the practical ranking is the opposite of the one the zoo implies. Outside math and code, the bottleneck is almost never the optimizer; it is the verifier. Picking GSPO over DAPO is a second-order decision. Building a reward your optimizer cannot game is the first-order one — and it is the genuinely hard, under-glamorous engineering that decides whether any of these algorithms help you. Hold this as a principle: spend your scrutiny on the reward, then let the axes tune the run.


Feeling the Differences on One GPU

You do not need a cluster to build intuition for the axes — you need a small model and an afternoon. The deltas between these methods are legible at tiny scale, and a handful of repos are built to make them legible:

  • GRPO-Zero implements GRPO from scratch with DAPO’s improvements (token-level loss, no KL, overlong filtering) on the Countdown task. It is the single best repo to dissect if you want to see the axis-moves in code rather than read about them.
  • simple_GRPO is ~200 lines across two files and trains in under an hour on a single A800 — teaching-oriented, nothing hidden.
  • TinyZero reproduces R1-Zero-style training on verl for under $30, a clean baseline to fork.
  • Unsloth’s GRPO notebooks get the R1-Zero “aha” on 5–7 GB of VRAM via QLoRA — about as close to a laptop as this gets.

For the research framing of why small-model RL is even worth your time, Reinforcement Learning for Reasoning in Small LLMs (Dang & Ngo, 2025) trained a 1.5B model to 46.7% on AIME 2024 for roughly $42 on 4×A40 in 24 hours — and was candid about what broke. If you would rather run the full family under one roof, verl implements almost all of it (PPO, GRPO, GSPO, DAPO, Dr. GRPO, RLOO, REINFORCE++, PRIME) with verifiable rewards; TRL is the gentlest on-ramp via its GRPOTrainer. This is exactly the territory my own rlvr-from-scratch project lives in.


Where the Family Is Heading

The axes will keep being the right lens, even as the methods drift past the ones named here. Three directions are already visible. Group-free methods like SPO ask whether the group baseline — the move that defined GRPO — is itself the next bottleneck, replacing it with a persistent value tracker. Agentic and hierarchical advantages push group-relative credit assignment down to steps and sub-trajectories for tool-using, multi-turn agents. And verifier-free RL tries to escape the proxy problem from the other side — learning a reward signal without a hand-built verifier at all, which, if it works, attacks the actual ceiling rather than the path to it.

None of these will look like a clean break. Each will arrive as another coordinate: a different granularity, a different clip, a different normalization, a different way to sample. GRPO never fragmented into a zoo. It exposed a design space — and once you can read the axes, you are not learning the next algorithm so much as recognizing where on the map it sits.

That is the whole series in one sentence: from PPO’s critic, to GRPO’s group baseline, to GDPO’s decoupled rewards, the field has been walking a small set of axes the entire time. You can see the full arc on the series page.


References

Foundations (linked, not re-derived here):

  • Schulman et al. (2017), Proximal Policy Optimization Algorithms. arXiv:1707.06347
  • Ahmadian et al. (2024), Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740
  • Shao et al. (2024), DeepSeekMath (GRPO origin). arXiv:2402.03300
  • DeepSeek-AI / Guo et al. (2025), DeepSeek-R1. arXiv:2501.12948

The four core variants:

  • Liu et al. (2025), Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO). arXiv:2503.20783
  • Yu et al. (2025), DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476
  • Zheng et al. (2025), Group Sequence Policy Optimization (GSPO). arXiv:2507.18071
  • Liu et al. (2026), GDPO: Group reward-Decoupled Normalization Policy Optimization. arXiv:2601.05242

Frontier and critical perspective:

  • MiniMax (2025), MiniMax-M1 (introduces CISPO). arXiv:2506.13585
  • Yao et al. (2025), Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm. arXiv:2509.24203
  • Khatri, Madaan et al. (2025), The Art of Scaling RL Compute for LLMs. arXiv:2510.13786
  • Xu & Ding (2025), Single-stream Policy Optimization (SPO). arXiv:2509.13232

Small-model / single-GPU:


Cite this article

Sousa, V. (2026). After GDPO: A Map of the GRPO Family. vitorsousa.com. https://www.vitorsousa.com/blog//

@article{sousa2026,
  title={After GDPO: A Map of the GRPO Family},
  author={Sousa, Vitor},
  year={2026},
  url={https://www.vitorsousa.com/blog//}
}
