February 15, 2025

Understanding GRPO and how it is changing LLM training

DeepSeek-R1's breakthrough reinforcement learning approach

Reinforcement learning for large language models has hit a critical evolution point with Group Relative Policy Optimization (GRPO). This breakthrough approach tackles the fundamental inefficiencies that have plagued traditional RL methods like PPO when applied to massive language models. By eliminating separate value networks and introducing relative scoring mechanisms, GRPO creates a more resource-efficient path to improved model performance—particularly for mathematical reasoning and problem-solving capabilities.

The technique generates multiple outputs for each prompt and uses their collective performance as a baseline for advantage calculation. This group-based approach replaces the traditional value function, reducing memory requirements by approximately 50% while maintaining or improving training effectiveness. The elegant formula normalizes rewards within each batch, creating a stable learning signal without the computational overhead of a separate critic network.

For AI product teams struggling with computational constraints during model fine-tuning, GRPO offers a practical solution that balances resource efficiency with performance gains. DeepSeek’s R1 model demonstrates these benefits, achieving significant improvements on mathematical reasoning benchmarks while using substantially fewer computational resources than traditional methods would require.

In this article, we will discuss the following:

  1. Core GRPO mechanisms and value function elimination
  2. Statistical advantages of relative scoring approaches
  3. Implementation requirements and technical specifications
  4. DeepSeek-R1's architecture and performance improvements
  5. Reference model configuration and hyperparameter optimization
  6. Reward function engineering for effective training

GRPO fundamentals and RL context in LLM training

The core principles of GRPO

Group Relative Policy Optimization (GRPO) represents a significant advancement in reinforcement learning for large language models. It addresses fundamental limitations in traditional RL methods when applied to LLMs. GRPO eliminates the need for a separate value network. This reduces memory usage and computational requirements substantially.

The algorithm works by generating multiple outputs for each input prompt. It then uses the mean reward of these responses as a baseline. This group-based approach creates a more efficient and stable training process compared to standard PPO implementations.

Value function challenges in LLM training

Traditional PPO implementations face several critical challenges when applied to language models. The value function typically requires another neural network of comparable size to the policy model. This doubles memory requirements during training.

The value model struggles to accurately estimate expected rewards for partial sequences. This leads to high variance in advantage estimates. For LLMs, this problem becomes particularly acute due to the sparse nature of rewards, which often only occur at the end of a sequence.

Value models can also be difficult to train effectively. They often fail to generalize well across diverse prompt types and reasoning paths.

GRPO’s relative scoring approach

GRPO's innovation lies in its relative advantage estimation approach. It samples multiple responses for each prompt and computes advantages based on their relative performance within the group. The formula is simple:

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)

where r_i is the reward of the i-th response and G is the group size.

This normalization process eliminates the need for a separate critic network. It provides a stable baseline for each prompt context. The approach aligns naturally with comparative reward models, which evaluate responses relative to each other.

GRPO also integrates KL divergence directly into the loss function. This prevents the policy from drifting too far from a reference model. The result is a more efficient training process that maintains stability while improving mathematical reasoning and problem-solving capabilities. Let's explore how the elimination of the value function represents GRPO's key breakthrough in the reinforcement learning landscape.

Value function elimination: GRPO's key innovation

Replacing the value function with group sampling

Value function elimination represents GRPO's fundamental breakthrough in policy optimization. Unlike traditional PPO, which requires a separate value function model to estimate state values, GRPO removes this requirement entirely. Instead, it generates multiple outputs for each input and uses their relative performance as a baseline for advantage estimation.

The approach is elegantly simple. For each input question, GRPO samples multiple responses from the policy model. It then calculates the average reward of these responses and uses this as the baseline. Any response with rewards above this average receives a positive advantage, while those below receive a negative advantage.
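
As a concrete illustration, the following minimal sketch (plain NumPy, with made-up reward values) shows how the group's mean reward becomes the baseline and how each response's advantage falls out of it:

```python
import numpy as np

# Rewards for G responses sampled from the policy for one prompt (illustrative values).
group_rewards = np.array([0.2, 0.9, 0.5, 0.1])

baseline = group_rewards.mean()          # group mean serves as the baseline
advantages = group_rewards - baseline    # above-average responses get positive advantages

print(advantages)  # -> [-0.225  0.475  0.075 -0.325]
```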

Statistical benefits of relative scoring

This group-relative approach offers significant statistical advantages. By normalizing rewards within each group, GRPO effectively eliminates the need for absolute reward scaling. The advantage calculation becomes:

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)

This normalization reduces variance in the advantage estimates, making training more stable. It also aligns naturally with comparative reward modeling, where models are typically trained on preference pairs rather than absolute scores.

Technical and resource advantages

The elimination of the value function dramatically reduces computational costs. In traditional PPO implementations, the value network is typically another neural network of comparable size to the policy model. By removing this requirement, GRPO cuts memory usage by nearly half.

This simplification makes GRPO particularly valuable for training large language models with significant resource constraints. DeepSeek’s success with the R1 and DeepSeek-Math models demonstrates how this approach enables efficient training while maintaining high performance.

KL divergence integration

Another key innovation in GRPO is the direct integration of the KL divergence term into the loss function rather than adding it to the reward. This adjustment helps stabilize training further by preventing the policy from deviating too far from the reference model.

The simplified objective becomes:

J_GRPO(θ) = E[ (1/G) Σ_i ( min(ρ_i · A_i, clip(ρ_i, 1 - ε, 1 + ε) · A_i) - β · D_KL(π_θ || π_ref) ) ]

where ρ_i = π_θ(o_i | q) / π_θ_old(o_i | q) is the probability ratio for response o_i, A_i is the group-relative advantage, ε is the clipping parameter, and β is the KL coefficient.

This approach maintains the conservative update philosophy of PPO while streamlining the implementation.
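
For readers who prefer code to notation, here is a rough PyTorch-style sketch of this objective. It assumes per-response log-probabilities and precomputed group-relative advantages are already available; the names and the per-response (rather than per-token) granularity are simplifications, not the DeepSeek implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Clipped surrogate objective with a KL penalty toward the reference model.

    logp_new, logp_old, logp_ref: summed log-probs of each sampled response under the
    current, old, and frozen reference policies; advantages: group-relative A_i values.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = torch.min(unclipped, clipped)

    # Per-sample KL estimate between the current policy and the reference.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # Maximize the surrogate, penalize divergence from the reference model.
    return -(surrogate - beta * kl).mean()
```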

A diagram comparing PPO architecture (which includes policy and value networks) with GRPO architecture (which features only a policy network and group sampling), emphasizing the memory and computational savings. | Source: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Now that we understand how GRPO eliminates the value function, let's examine the specific methodology behind its group-based advantage calculation.

Group-based advantage calculation methodology

The foundation of GRPO's efficiency

Group Relative Policy Optimization (GRPO) fundamentally changes how advantage estimation works in reinforcement learning for large language models. It eliminates the need for a separate value function network by using group-based sampling to establish baselines. This approach significantly reduces memory usage while maintaining effective training signals.

Calculating relative advantages

The core innovation of GRPO lies in its advantage calculation formula:

A_i = (r_i - mean(G)) / std(G)

Where:

  • r_i is the reward for a specific response
  • mean(G) is the average reward across all responses in the group
  • std(G) is the standard deviation of rewards within the group

This z-score normalization creates a relative performance metric that indicates how much better or worse each response is compared to others generated for the same input. Positive advantages indicate above-average responses, while negative advantages highlight underperforming ones.

Comparative analysis with PPO

Unlike traditional PPO, which requires maintaining a separate value network of comparable size to the policy model, GRPO leverages group statistics to estimate advantages. This distinction yields several benefits:

  • Memory efficiency: No additional neural network parameters for value estimation
  • Reduced variance: Group-normalization stabilizes training by providing robust baselines
  • Alignment with reward modeling: Better matches the comparative nature of reward model training

Mathematical formulation

The complete advantage calculation process in GRPO includes:

  1. Generating G responses for each input query using the current policy
  2. Computing rewards for each response using a reward function
  3. Calculating group statistics (mean and standard deviation)
  4. Normalizing each reward relative to the group

This approach creates a self-contained system where each batch establishes its own performance baseline, enabling more robust training signals without the computational overhead of a separate critic model.

The methodology’s elegance lies in its simplicity: using statistical normalization to replace an entire neural network component.
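
The four steps above can be collapsed into a short per-prompt routine. In the sketch below, `policy.generate` and `reward_fn` are hypothetical stand-ins for whatever sampling and scoring machinery a given setup provides:

```python
import numpy as np

def group_advantages(prompt, policy, reward_fn, G=64):
    """Sketch of GRPO's per-prompt advantage pipeline (placeholder sampling/reward APIs)."""
    # 1. Generate G responses for the input query using the current policy.
    responses = [policy.generate(prompt) for _ in range(G)]

    # 2. Compute a scalar reward for each response.
    rewards = np.array([reward_fn(prompt, r) for r in responses])

    # 3. Calculate group statistics.
    mean, std = rewards.mean(), rewards.std()

    # 4. Normalize each reward relative to the group (z-score).
    advantages = (rewards - mean) / (std + 1e-8)  # epsilon guards against zero variance
    return responses, advantages
```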

The impact on training dynamics

By normalizing advantages within groups, GRPO creates a more stable learning environment. Rewards that might be noisy or inconsistent across different inputs become meaningful when viewed relative to other responses for the same prompt. This helps the model focus on relative improvements rather than chasing absolute reward values that might vary widely across different problems.

The group-based approach also naturally encourages exploration, as the model discovers which response patterns consistently outperform others in the same context. This creates a more nuanced learning signal than traditional advantage estimation, particularly valuable when learning complex reasoning tasks.

GRPO's approach to advantage calculation represents a significant advancement in applying reinforcement learning to language models, making it more accessible for research and development. With this methodological foundation established, we can now explore how GRPO implements multiple output sampling and KL divergence for optimal training.

Multiple output sampling and KL divergence implementation

The GRPO advantage estimation approach

GRPO directly incorporates multiple output sampling to establish a foundation for advantage estimation. For each prompt, the model generates several outputs which are then scored using a reward model. This sampling replaces the traditional value network, reducing memory consumption by approximately 50% compared to PPO implementations.

The advantage for each output is calculated by normalizing its reward against the group mean and standard deviation:

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)

This normalization creates a zero-centered metric that indicates how much better or worse each output performs compared to the group average. Outputs with positive advantages are reinforced, while those with negative advantages are discouraged.

KL divergence as a stability constraint

GRPO integrates KL divergence directly into its loss function rather than including it as part of the reward signal, as in PPO. This term measures how much the current policy deviates from the reference model:

D_KL(π_θ || π_ref) = π_ref(o_i | q) / π_θ(o_i | q) - log( π_ref(o_i | q) / π_θ(o_i | q) ) - 1

This estimator is always non-negative and equals zero only when the two policies agree on the sampled output.

The KL divergence term serves as a penalty that prevents the model from drifting too far from its initial capabilities. This implementation helps maintain stability during training while still allowing the model to improve through reinforcement.
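
To see why this works as a penalty, the small sketch below evaluates the estimator from the formula above at a few policy-to-reference probability ratios: it is zero when the two policies agree and grows as they drift apart (the ratios are illustrative):

```python
import math

def kl_penalty(p_ref_over_p_theta: float) -> float:
    """Per-sample KL estimate: ratio - log(ratio) - 1, non-negative by construction."""
    r = p_ref_over_p_theta
    return r - math.log(r) - 1.0

for r in (1.0, 1.2, 2.0, 0.5):
    print(f"ratio={r:.1f} -> KL penalty = {kl_penalty(r):.4f}")
# ratio=1.0 -> 0.0000, ratio=1.2 -> 0.0177, ratio=2.0 -> 0.3069, ratio=0.5 -> 0.1931
```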

Implementation requirements

Key implementation parameters include:

  • Group size: Typically 64 samples per question
  • Clipping parameter (ε): Often set to 0.2
  • KL coefficient (β): Usually around 0.04
  • Learning rate: Approximately 1e-6

A single update per exploration stage is generally recommended to ensure training stability. DeepSeek-R1’s implementation balances computational efficiency with effective learning by optimizing how multiple outputs are handled and integrated into the training process.
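
Collecting these settings in one place makes runs easier to reproduce. The dataclass below is an illustrative sketch rather than any particular library's configuration API:

```python
from dataclasses import dataclass

@dataclass
class GRPOHyperparams:
    group_size: int = 64              # responses sampled per question
    clip_epsilon: float = 0.2         # PPO-style clipping parameter
    kl_coeff: float = 0.04            # weight of the KL penalty toward the reference model
    learning_rate: float = 1e-6       # conservative policy learning rate
    updates_per_exploration: int = 1  # single update per exploration stage

config = GRPOHyperparams()
```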

Using this approach, DeepSeek-R1 demonstrated significant improvements on benchmarks, including increasing GSM8K accuracy from 82.9% to 88.2% and MATH scores from 46.8% to 51.7%, all while requiring fewer computational resources than traditional methods. Let's examine how DeepSeek-R1 specifically implemented this architecture to achieve these impressive results.

DeepSeek-R1's GRPO training architecture

Core framework and optimization

DeepSeek-R1 utilizes Group Relative Policy Optimization (GRPO) as its reinforcement learning backbone. GRPO improves upon traditional Proximal Policy Optimization (PPO) by eliminating the need for a separate value function model. This architectural choice significantly reduces memory usage while maintaining training effectiveness.

The key innovation in GRPO is its approach to advantage calculation. For each prompt, the model generates multiple responses (64 samples per question) and uses the mean reward of these responses as a baseline. This group-based method normalizes rewards within batches, creating a relative scoring system that drives optimization.

Implementation specifics

DeepSeek implemented GRPO with the following parameters:

  • Learning rate for policy model: 1e-6
  • KL coefficient: 0.04
  • Samples per question: 64
  • Maximum sequence length: 1024
  • Batch size: 1024
  • Single update per exploration stage

This configuration achieved remarkable improvements on mathematical reasoning benchmarks:

  • GSM8K: Enhanced from 82.9% to 88.2%
  • MATH: Boosted from 46.8% to 51.7%
  • CMATH (out-of-domain): Improved from 84.6% to 88.8%

Computational efficiency

Unlike PPO, which requires both a policy network and a critic network, GRPO's critic-free architecture substantially reduces computational overhead. Traditional approaches would double memory requirements, making GRPO particularly valuable for training large language models with limited resources.

The algorithm directly incorporates the KL divergence term into the loss function rather than adding it to the reward signal. This approach helps maintain policy stability while allowing for effective learning. By avoiding the heavy memory footprint of a separate value function, DeepSeek-R1 could focus computational resources on optimizing reasoning capabilities.

Reward structure

The training utilized a dual-focused reward system:

  1. Accuracy rewards: evaluated solution correctness
  2. Format rewards: encouraged proper reasoning structure using <think> and <answer> tags

This reward combination guided the model to develop structured reasoning patterns while maintaining accuracy, proving that well-designed reward functions can effectively shape complex reasoning behaviors without requiring step-by-step supervision.
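
A rule-based version of this dual reward can be sketched as follows. The tag parsing, answer matching, and weights here are illustrative assumptions, not DeepSeek's published implementation:

```python
import re

THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning and answer in the expected tags."""
    return 1.0 if THINK_ANSWER.search(response) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """1.0 if the extracted answer matches the reference (string match for illustration)."""
    match = THINK_ANSWER.search(response)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(response: str, reference_answer: str) -> float:
    # Illustrative weighting: correctness dominates, formatting nudges structure.
    return 1.0 * accuracy_reward(response, reference_answer) + 0.2 * format_reward(response)
```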

Technical implementation

GRPO’s mathematical foundation relies on calculating relative advantages through normalization. Each response's advantage is computed by comparing its reward to the group mean. This approach reduces variance in advantage estimates while eliminating the need for a separate value function network.

This elegant simplification represents a significant advancement in reinforcement learning for large language models, offering both computational efficiency and performance gains. To fully implement GRPO in your own systems, understanding proper reference model configuration is essential for maintaining training stability.

Diagram of GRPO architecture showing multiple response generation, reward calculation, and advantage estimation without a value function | Source: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

GRPO implementation: Reference model configuration

Setting up the reference model

The reference model is an anchor in GRPO training, preventing drastic policy shifts. Configure it properly by maintaining a frozen copy of the pre-RL model. Set a KL coefficient of 0.04 to balance exploration and stability. This parameter controls the regularization strength that keeps your policy model from deviating too far from the reference.
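
In PyTorch terms, "a frozen copy of the pre-RL model" amounts to snapshotting the policy and disabling gradients on the copy, roughly as in this sketch:

```python
import copy
import torch

def make_reference_model(policy_model: torch.nn.Module) -> torch.nn.Module:
    """Freeze a snapshot of the pre-RL policy to serve as the GRPO reference model."""
    ref_model = copy.deepcopy(policy_model)
    ref_model.eval()                  # disable dropout and other training-time behavior
    for p in ref_model.parameters():
        p.requires_grad_(False)       # the reference never receives gradient updates
    return ref_model

KL_COEFF = 0.04  # regularization strength keeping the policy near the reference
```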

KL divergence integration

Unlike PPO, GRPO incorporates KL divergence directly in the loss function rather than in the reward signal. This approach provides better training stability and more controlled updates. Implement this by adding the KL penalty term to the policy loss:

loss = policy_loss + β · D_KL(π_θ || π_ref), with β = 0.04

Batch size and update frequency

Use a substantial batch size of 1024 for stable training. Perform only a single update per exploration stage to maintain consistency. This prevents the policy from changing too dramatically between data collection phases, which is critical for GRPO’s stability.

Reference model refresh strategy

Consider occasionally refreshing your reference model during extended training. While the initial reference provides stability, periodic updates can prevent the training policy from becoming too constrained by an increasingly outdated reference point. This balance is especially important for long-running GRPO implementations.

Hyperparameter sensitivity

The reference model configuration is particularly sensitive to two hyperparameters: learning rate and KL coefficient. Start with a conservative learning rate of 1e-6 for the policy model. This prevents early divergence while the model develops effective reasoning patterns through group-relative advantage estimation. With the technical framework in place, the effectiveness of GRPO depends heavily on crafting appropriate reward functions to guide the learning process.

Reward function engineering for GRPO

Understanding reward design in GRPO

Group Relative Policy Optimization (GRPO) relies heavily on well-crafted reward functions to guide model behavior. These functions serve as the core feedback mechanism that steers language models toward desired reasoning capabilities. Effective reward engineering requires balancing multiple factors to ensure stable training and optimal performance.

Types of reward functions

Three main reward classes are typically implemented in GRPO systems:

  1. Accuracy rewards: evaluate whether responses contain correct results or solutions
  2. Format rewards: ensure responses follow specified structures (e.g., reasoning inside <think> tags)
  3. Language consistency rewards: maintain coherent language usage throughout responses

Many implementations combine these rewards through weighted summation to create a comprehensive feedback signal. The weight assigned to each component significantly impacts the behaviors the model prioritizes during training.

Normalization techniques

Reward normalization is crucial for GRPO stability. Raw rewards often vary widely in magnitude, potentially destabilizing training. Effective approaches include:

  • Z-score normalization: Transforms rewards relative to group mean and standard deviation
  • Min-max scaling: Bounds rewards within a predefined range
  • Clipping: Restricts extreme reward values to prevent gradient explosions

One key GRPO insight:

Standard normalization decreases variance by using group statistics rather than requiring a separate value network, significantly reducing memory requirements during training.
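
The three normalization techniques listed above can be written compactly as shown below; the clip bound and target range are illustrative defaults, not values from the paper:

```python
import numpy as np

def zscore(rewards: np.ndarray) -> np.ndarray:
    """Normalize rewards against the group mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def minmax(rewards: np.ndarray, low: float = 0.0, high: float = 1.0) -> np.ndarray:
    """Rescale rewards into a predefined [low, high] range."""
    span = rewards.max() - rewards.min()
    return low + (rewards - rewards.min()) / (span + 1e-8) * (high - low)

def clip(rewards: np.ndarray, bound: float = 5.0) -> np.ndarray:
    """Restrict extreme reward values to keep gradients bounded."""
    return np.clip(rewards, -bound, bound)
```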

Reward scaling considerations

The degree of reward scaling dramatically affects training dynamics:

  1. Aggressive scaling amplifies small differences between responses
  2. Conservative scaling promotes more stable, gradual learning
  3. Task-specific scaling better addresses domain requirements

For instance, DeepSeek-R1 employed distinct scaling strategies for mathematical reasoning versus code generation tasks. The paper notes that "optimization across tasks required carefully calibrated reward coefficients to avoid unintended behaviors."

Avoiding reward hacking

A critical challenge in reward engineering is preventing the model from exploiting loopholes. This requires:

  • Regular evaluation of generated responses for unexpected patterns
  • Iterative refinement of reward functions to close exploits
  • Inclusion of penalties for artificially gaming the system

The success of DeepSeek’s approach demonstrates that well-designed rule-based rewards can effectively drive sophisticated reasoning without creating the incentives for reward hacking that neural reward models sometimes enable.

Conclusion

GRPO represents a significant advancement in reinforcement learning for LLMs by addressing fundamental inefficiencies in traditional approaches. By eliminating the need for separate value networks and introducing group-based advantage calculation, this technique reduces memory requirements by approximately 50% while improving model performance on complex reasoning tasks.

The key technical takeaways include the elegant simplicity of calculating advantages using group statistics, the direct integration of KL divergence into the loss function rather than the reward signal, and the effective use of multiple output sampling to establish performance baselines. These innovations provide a more stable and efficient training framework that can be implemented with relatively straightforward adjustments to existing RL pipelines.

For product teams, GRPO offers a path to developing more capable AI products with fewer computational resources, potentially accelerating release cycles and reducing infrastructure costs. Engineers should consider starting with conservative hyperparameters (learning rate ~1e-6, KL coefficient ~0.04) and carefully designed reward functions that balance accuracy with desired formatting behaviors. Leadership should recognize GRPO's strategic value in maximizing return on AI infrastructure investments while enabling more sophisticated reasoning capabilities in their language model applications.
