Introduction
DeepSeek-R1, an open-source model, achieved performance comparable to OpenAI’s o1 model on reasoning tasks. If that wasn’t already impressive, newer open-source models such as DeepScaleR-1.5B-Preview by Agentica, part of the Berkeley AI Research and Sky Computing Lab, surpass o1’s performance on benchmarks like AIME2024. This level of performance on AIME2024 is particularly significant because mathematical reasoning has traditionally been a weakness for language models.
Trained on 40,000 math problems over 3,800 A100 GPU hours, DeepScaleR uses a novel iterative context lengthening scheme. The training process involves curating a high-quality dataset, using an Outcome Reward Model (ORM) instead of a Process Reward Model (PRM), and progressively expanding the context window from 8K to 24K tokens. This approach allows the model to learn effective reasoning patterns at shorter contexts before scaling to longer ones, significantly improving efficiency. The authors have open-sourced their dataset, code, and training logs to further research in scaling intelligence with RL.
Prerequisites
There are two sections to this article: (1) an overview of the model and (2) its implementation. The overview requires some familiarity with LLM training and evaluation. Knowledge of reinforcement learning algorithms like Proximal Policy Optimization (PPO) and machine learning concepts like loss functions would be helpful. The implementation, where we run the model on DigitalOcean’s GPU Droplets, is very straightforward and requires minimal coding experience. Feel free to skip the overview section if you’re only interested in running the model.
Model Performance
Here’s a quote outlining DeepScaleR’s performance from Agentica’s blog post, “DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL”: “Our results validate this: RL scaling improved AIME accuracy from 28.9% to 43.1%! These findings suggest that neither SFT nor RL alone is sufficient. Instead, by combining high-quality SFT distillation with RL scaling, we can truly unlock the reasoning potential of LLMs.”
What’s the significance of 43.1% Pass@1 accuracy on AIME2024 (a +14.3% improvement over the base model)?
AIME2024 (American Invitational Mathematics Examination) is a highly regarded benchmark derived from a prestigious high school mathematics competition. You’ll see this benchmark come up a lot when evaluating reasoning models on their problem-solving abilities.
A 43.1% Pass@1 accuracy means that, in a single attempt, the model correctly solves 43.1% of the AIME problems it encounters. The 14.3% improvement over the base model (from 28.9% to 43.1%) demonstrates that combining SFT distillation with RL scaling is a promising approach.
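As a quick illustration (toy numbers, not actual AIME results), Pass@1 here is simply the fraction of problems answered correctly on a first and only attempt:

# Toy illustration of Pass@1: one graded attempt per problem.
first_attempt_correct = [True, False, True, True, False]  # hypothetical grading results
pass_at_1 = sum(first_attempt_correct) / len(first_attempt_correct)
print(f"Pass@1: {pass_at_1:.1%}")  # 60.0% on this toy set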
Training Recipe
Group Relative Policy Optimization (GRPO)
GRPO builds upon PPO (Proximal Policy Optimization), a widely used reinforcement learning algorithm, and introduces two important changes. The first key addition involves normalizing the advantage function across all samples generated from the same prompt. The advantage function quantifies how much better a particular action is compared to the expected value. By normalizing these advantages across samples from the same prompt, GRPO creates a more consistent scale for comparing the relative benefits of different actions, which is particularly valuable when handling multiple samples from the same starting point.
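For intuition, here is a minimal sketch (NumPy, not the authors’ training code) of normalizing rewards within a group of completions sampled from one prompt to obtain per-sample advantages:

import numpy as np

# Minimal sketch: advantages as rewards standardized within one prompt's group.
def group_normalized_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt: two correct (reward 1), two incorrect (reward 0)
print(group_normalized_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]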
The second major addition incorporates KL (Kullback-Leibler) divergence regularization on top of PPO’s surrogate loss function. KL divergence measures the difference between two probability distributions. By adding it as a regularization term to PPO’s existing loss function, GRPO helps prevent the new policy from drifting too far from the old one (policy drift).
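Putting the two changes together, the objective looks roughly like the following sketch (PyTorch; the clip range and KL coefficient are illustrative assumptions, not DeepScaleR’s exact hyperparameters):

import torch

# Hedged sketch of a GRPO-style objective: PPO's clipped surrogate term plus a
# KL penalty toward a frozen reference policy.
def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Non-negative per-token KL estimate against the reference policy
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    # Maximize the surrogate while penalizing drift; negate to get a loss to minimize
    return -(surrogate - beta * kl).mean()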
Reward Function
The reward function takes a binary approach to evaluating responses: a reward value of 1 is assigned to answers that pass both LaTeX and Sympy validation checks, indicating correct mathematical formatting and content.
Conversely, a reward value of 0 is assigned to answers that are either incorrect or fail to meet the formatting requirements. The system deliberately excludes partial rewards and process reward models (PRMs), and does not provide intermediate feedback during evaluation.
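A minimal sketch of such a binary outcome reward, assuming the final answer has already been extracted from the response (helper names and parsing details are illustrative, not the authors’ code):

from sympy import simplify, sympify

# Hedged sketch: reward 1 if the extracted answer is mathematically equivalent
# to the reference answer, 0 for anything incorrect or unparsable.
def binary_reward(model_answer: str, reference_answer: str) -> int:
    try:
        predicted = sympify(model_answer)
        expected = sympify(reference_answer)
        # Equivalent expressions have a difference that simplifies to zero.
        return int(simplify(predicted - expected) == 0)
    except Exception:
        return 0

print(binary_reward("1/2", "0.5"))    # -> 1
print(binary_reward("sqrt(2)", "2"))  # -> 0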
Iterative Context Lengthening
Training large reasoning models with reinforcement learning requires significant computational resources. To address this challenge, the authors used an adaptive training strategy in which they started with shorter contexts and gradually increased the length as the model’s performance improved. This optimization reduces both the financial cost and the overall training duration.
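Sketched as a training schedule, it might look like this (the 8K and 24K endpoints come from the description above; the intermediate stage and the stated goals are illustrative assumptions):

# Hedged sketch of an iterative context-lengthening schedule.
context_schedule = [
    {"max_context_tokens": 8_192,  "goal": "learn concise, effective reasoning cheaply"},
    {"max_context_tokens": 16_384, "goal": "allow longer chains of thought"},
    {"max_context_tokens": 24_576, "goal": "final stage for the hardest problems"},
]

for stage in context_schedule:
    # In a real RL run, each stage resumes from the previous checkpoint and
    # trains until reward and response-length statistics plateau.
    print(f"Training with max context = {stage['max_context_tokens']} tokens: {stage['goal']}")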
Implementation
DeepScaleR can be served with high-performance inference systems such as vLLM, Hugging Face Text Generation Inference (TGI), SGLang, and TensorRT-LLM. In this tutorial, we will show you how to run DeepScaleR-1.5B with the vLLM inference engine on DigitalOcean GPU Droplets.
Step 1: Set up a GPU Droplet in a Jupyter Notebook environment
To set up a GPU Droplet in a Jupyter Notebook environment, follow this tutorial: “Setting Up the GPU Droplet Environment for AI/ML Coding - Jupyter Labs”.
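If vLLM is not already installed in your notebook environment (an assumption; some images ship with it preinstalled), install it in a cell before continuing:

!pip install vllm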
Step 2: Import the Model
Initialize the language model by creating an LLM object, specifying the model path as “agentica-org/DeepScaleR-1.5B-Preview”.
from vllm import LLM, SamplingParams
llm = LLM(model="agentica-org/DeepScaleR-1.5B-Preview")
Step 3: Format the Prompt and Sampling Parameters
Define your input prompt (replace “[enter prompt here]” with your actual prompt text). Set the sampling parameters in a SamplingParams object, including arguments like temperature, top_p, and max_tokens.
prompt = ["enter prompt here"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
Step 4: Generate Your Results
To generate text using the llm.generate() function, start by calling the function and passing in the parameters we defined in the previous step (prompt and sampling_params). Then, iterate through each output produced by the generation process. From each output, extract the original prompt and the corresponding generated text. Finally, print both the prompt and the generated text for you to evaluate.
outputs = llm.generate(prompt, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
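For example, here is a hypothetical end-to-end run that reuses the objects defined above with a simple competition-style prompt (the prompt text and sampling values are illustrative, not from the original article):

# Hypothetical usage of the snippets above with a sample math prompt.
prompt = ["What is the sum of the first 100 positive integers? "
          "Please reason step by step and put your final answer in \\boxed{}."]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

outputs = llm.generate(prompt, sampling_params)
for output in outputs:
    print(output.outputs[0].text)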
Conclusion
And there you have it. We hope this article gives you the background you need to understand the innovation behind DeepScaleR and some exposure to the current paradigm of reasoning models. We encourage you to experiment with this reasoning model and others for your desired use case.
Check out this tutorial for inspiration: A Comprehensive Guide to Fine-Tuning Reasoning Models: Fine-Tuning DeepSeek-R1 on Medical CoT with DigitalOcean’s GPU Droplets
References
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL (Agentica blog): https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
agentica-org/DeepScaleR-1.5B-Preview · Hugging Face
GitHub - agentica-project/deepscaler: Democratizing Reinforcement Learning for LLMs
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Quickstart — vLLM