170 Commits

Autor SHA1 Mensagem Data
Edward Beeching 715c8787fb add back grad accumulations steps (#612) 2025-04-17 16:41:39 +02:00
lewtun 4c9b0f25d9 Fix TP once and for all :) (#613)
* Update evaluation.py

* Fix import
2025-04-17 15:25:59 +02:00
lewtun 14a81d2bd4 Update evaluation.py (#611) 2025-04-17 11:11:49 +02:00
lewtun 5112bfc401 Fix SFT for base models (#604)
* Fix pad token bug in SFT

* Add ChatML default

* Clean up

* Refactor grpo model load

* Add doc

* Bump deepspeed
2025-04-16 11:45:50 +02:00
lewtun bcbb1da401 Update evaluation.py (#608) 2025-04-16 10:37:38 +02:00
lewtun 8eb1b7860a Set DP=1 due to vLLM <> LightEval hanging (#600)
* Update evaluate.slurm

* Disable DP

* Fix
2025-04-16 10:24:33 +02:00
lewtun 8cf42663fd Clean up recipes (#596) 2025-04-11 20:09:15 +02:00
Edward Beeching 068f13f236 Hotfix bin reward (#597)
* add WIP code GRPO configs

* hotfix bin reward

* remove unwanted files

* remote configs
2025-04-11 17:45:38 +02:00
lewtun 04dbf21989 Bump TRL and vLLM (#595)
* Bump TRL and vLLM

* Fix style

* Bump liger

* Add liger
2025-04-11 16:32:33 +02:00
Edward Beeching c1eadaa097 E2B Router bug fixes (#592)
* fix eval system prompt

* style

* fix a rare issue where the execution is None

* fixes a bug in the e2b router
2025-04-11 14:04:59 +02:00
Edward Beeching 3a0e89678c Fix eval system prompt (#591)
* fix eval system prompt

* style
2025-04-11 11:23:06 +02:00
Shenghang Tsai 2a7bb45f05 Update README.md (#590) 2025-04-10 13:11:35 +02:00
lewtun bf08f56849 [WIP] Bump lighteval with proper pass@1 (#584)
* Bump lighteval with proper pass@1

* Bump lighteval

* Update AIME24
2025-04-08 20:53:34 +02:00
Edward Beeching 1b3bf043dc Adds a E2B router server that executes batches of scripts (#561)
* adds a dedicated e2b server to handle batches of requests

* fix reward tests

* update slow reward

* style

* updates e2b router to be more generic

* refactor

* refactoring

* licence, cleanup

* update tests

* style

* fix import when e2b not present

* style

* rename sandbox file

* rename to RoutedSandbox

* update readme

* nits

* nits2

* unlimited max time

* update logs path
2025-04-07 21:01:06 +02:00
lewtun 2636a2130f Add WandB groups to logging (#573) 2025-04-02 15:48:59 +02:00
lewtun ca8664df1c Fix missing prompt columns in recipes (#574) 2025-04-02 15:48:48 +02:00
lewtun 4f5b21e21d Fix accuracy reward for math (#566)
* Fix accuracy reward for math

* Add typing

* Add unit test

* Return None for invalid samples

* Fix order of answers

* Fix type

* Use None for non-verifiable answers
2025-04-01 12:04:26 +02:00
Edward Beeching 9915e06f1e Async code reward fixes (#546)
* expose num parallel code executions

* add e2b benchmarking script

* adds new parallel code execution with better execption handling

* style

* update default

* increase sandbox timeout

* Add pretty table and Sandbox IDs

* Add Sandbox ID

* fix merge

---------

Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
2025-03-28 14:08:15 +01:00
Zhou Shao 1802bec75f fix dataset parsing error (#540)
* fix dataset parsing error

support defined question field to fix errors when datasets' question field is not 'problem'

* add question field config

add script_args: question field

* refactor: datasets prompt column

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-03-28 13:17:04 +01:00
lewtun 4ec555b0c8 Restore single-node instructions to run GRPO (#549) 2025-03-27 10:29:07 +01:00
lewtun 8000dd2384 [WIP] RL goes brrr (#533)
* Fix vLLM recipes

* Add vllm server to Slurm

* Add overlap across srun

* Fix NUM_NODES

* Refactor TP to script

* fix train script to work withnew  GRPO

* lewis nits

* bump trl, transformers

---------

Co-authored-by: edbeeching <edbeeching@gmail.com>
2025-03-24 15:15:02 +01:00
Edward Beeching 86f9471f8e Fixes missing exception in run_script (#532)
* adds binary code reward, refactors grpo with get_reward_funcs

* adds return type to the function

* fix exception in run_script causes batch of rewards to be zero

* style
2025-03-24 13:36:48 +01:00
Zhou Shao 9409dca675 fix get_reward_funcs bug (#535)
change the input from `script_args.reward_funcs` to `script_args`
2025-03-22 15:33:21 +01:00
Edward Beeching af487204ca Adds binary code reward (#528)
* adds binary code reward, refactors grpo with get_reward_funcs

* adds return type to the function

* add get_reward_funcs test

* remote type hint

* move script args to another file

* update test
2025-03-21 12:53:38 +01:00
Guilherme Penedo 7835979801 adds support for running GRPO on IOI problems (#495)
* adds support for running GRPO on IOI problems

* nit

* bugfixes + recipe

* added piston info and readme changes

* readme updates

* run isort to fix checks

* Update src/open_r1/rewards.py

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>

* adding ioi test

* fix merge issues with python slow tests

* style

* generalize piston workers

* generalize readme

* fix extract code

* finalize slow tests

---------

Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: edbeeching <edbeeching@gmail.com>
2025-03-21 08:48:00 +01:00
koskotheim d436b7b9c0 fix typo (#507) 2025-03-15 20:56:14 +01:00
Edward Beeching 5dcfae8979 Fixes bug with async code reward (#504)
* adds slow test for code reward

* fixes bug in setting language and the output parsing

* style

* removed redundant comment

* removed exeception as e

* remove rewards

* removed whitespace

* more whitespace

* remove need for loop with asyncio.run

* nits

* fix type error with e2n AsyncSandbox
2025-03-13 22:54:15 +01:00
lewtun d5922af8ce Add OlympicCoder recipes (#505)
* Add OlympicCoder recipes

* Fix configs

* Add FSDP config
2025-03-13 19:08:34 +01:00
Agus 9890a8d992 Run e2b async sandbox by default (#484)
* Run e2b async sandbox by default

* Remove unnecessary rewards

* Fix run_sync variable

* Run linters

* Let only async version

* Remove unused Sandbox

---------

Co-authored-by: agus <agustin.piqueres@huggingface.com>
2025-03-08 15:41:15 +01:00
A-transformer 88a1b002c1 missing ' (#479)
missing '
2025-03-07 15:44:15 +01:00
A-transformer 6660a477ec Remove unimplemented 'format_deepseek' from reward_funcs documentation (#480)
Removed 'format_deepseek' from GRPOScriptArguments help string as it is not implemented in REWARD_FUNCS_REGISTRY and format_reward already covers DeepSeek-style formatting needs.
2025-03-06 10:45:50 +01:00
lewtun 3b5d6603bf Add citation and acknowledgements (#481)
* Update README.md

* Update README.md

* Update README.md
2025-03-05 20:23:57 +01:00
lewtun a465641ec7 Fix make evaluate (#470) 2025-03-04 14:25:58 +01:00
lewtun 299446902d Enable decontamination on dataset configs (#460) 2025-03-04 09:22:01 +01:00
lewtun 44cb13d4ba Fix vLLM (#464) 2025-03-03 17:25:30 +01:00
A-transformer 4b4c377f27 fix typo (#459)
fix typo
2025-03-03 15:33:23 +01:00
lewtun 45ccf60109 Remove dataset_configs from YAML recipes (#461) 2025-03-03 13:54:58 +01:00
Marco Z c7733d3fa4 update makefile and readme (#449)
Co-authored-by: Marco Zocca <marco.zocca@unfoldml.com>
2025-03-01 15:08:30 +01:00
Edward Beeching 8782fa6e90 bump lighteval, expose the lcb_v4 benchmark (#441) 2025-02-26 17:59:44 +01:00
Edward Beeching a20666d5b5 Bumps TRL (#437) 2025-02-26 10:35:50 +01:00
elie 3ba56c1c3d Add config sft smollm (#425)
* add sft recipe

* add smollm sft

* max_length modif 1

* max_length modif 2
2025-02-25 21:45:59 +01:00
Nile Zhou d036a1b341 fix reward verify err (#430) 2025-02-25 16:15:00 +01:00
Edward Beeching 11beb9a4dc Updates evals to run with ddp=8 for small models (#428)
Currently the logic for calculating num_gpus considers eval in the TP setting, for the Qwen 7b models this retuns 4. However for smaller models we can use DDP and fix the num_gpus at 8
2025-02-25 16:00:13 +01:00
Agus 7188001281 Add script to decontaminate datasets against benchmark datasets (#416)
* Add script to decontaminate datasets against benchmark datasets

* Add docs for the decontamination script

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update scripts/decontaminate.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update scripts/decontaminate.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update scripts/decontaminate.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update scripts/decontaminate.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Update scripts/decontaminate.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Add license header and attribution to the authors

---------

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-02-24 19:54:44 +01:00
Edward Beeching 0c3ef8372e updates max_seq_length to max length due to a bug in trl (#419) 2025-02-24 17:27:56 +01:00
lewtun 566cfd1a44 Align format reward with R1 traces and add reward function to count think / answer tags (#418)
* Fix tests

* Tune

* Add reward

* Apply suggestions from code review

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-24 17:16:40 +01:00
elie 5355687e6c add sft recipe (#415) 2025-02-24 15:43:12 +01:00
lewtun 3f9d75a595 Bump Liger kernel (#399)
Needed to enable SFT training via https://github.com/huggingface/trl/pull/2874
2025-02-23 17:44:03 +01:00
lewtun eeca246b07 Update prompt template and sampling parameters for evaluation (#392)
* Pin t

* Pin t

* Set top p

* C

* Tune math prompt

* Improve math prompt

* Update tables
2025-02-22 15:21:01 +01:00
lewtun 49d9b741a5 Pin dependencies (#393) 2025-02-22 14:46:09 +01:00