Edward Beeching
715c8787fb
add back grad accumulations steps ( #612 )
2025-04-17 16:41:39 +02:00
lewtun
4c9b0f25d9
Fix TP once and for all :) ( #613 )
...
* Update evaluation.py
* Fix import
2025-04-17 15:25:59 +02:00
lewtun
14a81d2bd4
Update evaluation.py ( #611 )
2025-04-17 11:11:49 +02:00
lewtun
5112bfc401
Fix SFT for base models ( #604 )
...
* Fix pad token bug in SFT
* Add ChatML default
* Clean up
* Refactor grpo model load
* Add doc
* Bump deepspeed
2025-04-16 11:45:50 +02:00
lewtun
bcbb1da401
Update evaluation.py ( #608 )
2025-04-16 10:37:38 +02:00
lewtun
8eb1b7860a
Set DP=1 due to vLLM <> LightEval hanging ( #600 )
...
* Update evaluate.slurm
* Disable DP
* Fix
2025-04-16 10:24:33 +02:00
lewtun
8cf42663fd
Clean up recipes ( #596 )
2025-04-11 20:09:15 +02:00
Edward Beeching
068f13f236
Hotfix bin reward ( #597 )
...
* add WIP code GRPO configs
* hotfix bin reward
* remove unwanted files
* remote configs
2025-04-11 17:45:38 +02:00
lewtun
04dbf21989
Bump TRL and vLLM ( #595 )
...
* Bump TRL and vLLM
* Fix style
* Bump liger
* Add liger
2025-04-11 16:32:33 +02:00
Edward Beeching
c1eadaa097
E2B Router bug fixes ( #592 )
...
* fix eval system prompt
* style
* fix a rare issue where the execution is None
* fixes a bug in the e2b router
2025-04-11 14:04:59 +02:00
Edward Beeching
3a0e89678c
Fix eval system prompt ( #591 )
...
* fix eval system prompt
* style
2025-04-11 11:23:06 +02:00
Shenghang Tsai
2a7bb45f05
Update README.md ( #590 )
2025-04-10 13:11:35 +02:00
lewtun
bf08f56849
[WIP] Bump lighteval with proper pass@1 ( #584 )
...
* Bump lighteval with proper pass@1
* Bump lighteval
* Update AIME24
2025-04-08 20:53:34 +02:00
Edward Beeching
1b3bf043dc
Adds an E2B router server that executes batches of scripts ( #561 )
...
* adds a dedicated e2b server to handle batches of requests
* fix reward tests
* update slow reward
* style
* updates e2b router to be more generic
* refactor
* refactoring
* licence, cleanup
* update tests
* style
* fix import when e2b not present
* style
* rename sandbox file
* rename to RoutedSandbox
* update readme
* nits
* nits2
* unlimited max time
* update logs path
2025-04-07 21:01:06 +02:00
lewtun
2636a2130f
Add WandB groups to logging ( #573 )
2025-04-02 15:48:59 +02:00
lewtun
ca8664df1c
Fix missing prompt columns in recipes ( #574 )
2025-04-02 15:48:48 +02:00
lewtun
4f5b21e21d
Fix accuracy reward for math ( #566 )
...
* Fix accuracy reward for math
* Add typing
* Add unit test
* Return None for invalid samples
* Fix order of answers
* Fix type
* Use None for non-verifiable answers
2025-04-01 12:04:26 +02:00
Edward Beeching
9915e06f1e
Async code reward fixes ( #546 )
...
* expose num parallel code executions
* add e2b benchmarking script
* adds new parallel code execution with better exception handling
* style
* update default
* increase sandbox timeout
* Add pretty table and Sandbox IDs
* Add Sandbox ID
* fix merge
---------
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
2025-03-28 14:08:15 +01:00
Zhou Shao
1802bec75f
fix dataset parsing error ( #540 )
...
* fix dataset parsing error
support defined question field to fix errors when datasets' question field is not 'problem'
* add question field config
add script_args: question field
* refactor: datasets prompt column
---------
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-03-28 13:17:04 +01:00
lewtun
4ec555b0c8
Restore single-node instructions to run GRPO ( #549 )
2025-03-27 10:29:07 +01:00
lewtun
8000dd2384
[WIP] RL goes brrr ( #533 )
...
* Fix vLLM recipes
* Add vllm server to Slurm
* Add overlap across srun
* Fix NUM_NODES
* Refactor TP to script
* fix train script to work with new GRPO
* lewis nits
* bump trl, transformers
---------
Co-authored-by: edbeeching <edbeeching@gmail.com>
2025-03-24 15:15:02 +01:00
Edward Beeching
86f9471f8e
Fixes missing exception in run_script ( #532 )
...
* adds binary code reward, refactors grpo with get_reward_funcs
* adds return type to the function
* fix exception in run_script that causes batch of rewards to be zero
* style
2025-03-24 13:36:48 +01:00
Zhou Shao
9409dca675
fix get_reward_funcs bug ( #535 )
...
change the input from `script_args.reward_funcs` to `script_args`
2025-03-22 15:33:21 +01:00
Edward Beeching
af487204ca
Adds binary code reward ( #528 )
...
* adds binary code reward, refactors grpo with get_reward_funcs
* adds return type to the function
* add get_reward_funcs test
* remove type hint
* move script args to another file
* update test
2025-03-21 12:53:38 +01:00
Guilherme Penedo
7835979801
adds support for running GRPO on IOI problems ( #495 )
...
* adds support for running GRPO on IOI problems
* nit
* bugfixes + recipe
* added piston info and readme changes
* readme updates
* run isort to fix checks
* Update src/open_r1/rewards.py
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
* adding ioi test
* fix merge issues with python slow tests
* style
* generalize piston workers
* generalize readme
* fix extract code
* finalize slow tests
---------
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: edbeeching <edbeeching@gmail.com>
2025-03-21 08:48:00 +01:00
koskotheim
d436b7b9c0
fix typo ( #507 )
2025-03-15 20:56:14 +01:00
Edward Beeching
5dcfae8979
Fixes bug with async code reward ( #504 )
...
* adds slow test for code reward
* fixes bug in setting language and the output parsing
* style
* removed redundant comment
* removed exception as e
* remove rewards
* removed whitespace
* more whitespace
* remove need for loop with asyncio.run
* nits
* fix type error with e2b AsyncSandbox
2025-03-13 22:54:15 +01:00
lewtun
d5922af8ce
Add OlympicCoder recipes ( #505 )
...
* Add OlympicCoder recipes
* Fix configs
* Add FSDP config
2025-03-13 19:08:34 +01:00
Agus
9890a8d992
Run e2b async sandbox by default ( #484 )
...
* Run e2b async sandbox by default
* Remove unnecessary rewards
* Fix run_sync variable
* Run linters
* Keep only async version
* Remove unused Sandbox
---------
Co-authored-by: agus <agustin.piqueres@huggingface.com>
2025-03-08 15:41:15 +01:00
A-transformer
88a1b002c1
missing ' ( #479 )
...
missing '
2025-03-07 15:44:15 +01:00
A-transformer
6660a477ec
Remove unimplemented 'format_deepseek' from reward_funcs documentation ( #480 )
...
Removed 'format_deepseek' from GRPOScriptArguments help string as it is not implemented in REWARD_FUNCS_REGISTRY and format_reward already covers DeepSeek-style formatting needs.
2025-03-06 10:45:50 +01:00
lewtun
3b5d6603bf
Add citation and acknowledgements ( #481 )
...
* Update README.md
* Update README.md
* Update README.md
2025-03-05 20:23:57 +01:00
lewtun
a465641ec7
Fix make evaluate ( #470 )
2025-03-04 14:25:58 +01:00
lewtun
299446902d
Enable decontamination on dataset configs ( #460 )
2025-03-04 09:22:01 +01:00
lewtun
44cb13d4ba
Fix vLLM ( #464 )
2025-03-03 17:25:30 +01:00
A-transformer
4b4c377f27
fix typo ( #459 )
...
fix typo
2025-03-03 15:33:23 +01:00
lewtun
45ccf60109
Remove dataset_configs from YAML recipes ( #461 )
2025-03-03 13:54:58 +01:00
Marco Z
c7733d3fa4
update makefile and readme ( #449 )
...
Co-authored-by: Marco Zocca <marco.zocca@unfoldml.com>
2025-03-01 15:08:30 +01:00
Edward Beeching
8782fa6e90
bump lighteval, expose the lcb_v4 benchmark ( #441 )
2025-02-26 17:59:44 +01:00
Edward Beeching
a20666d5b5
Bumps TRL ( #437 )
2025-02-26 10:35:50 +01:00
elie
3ba56c1c3d
Add config sft smollm ( #425 )
...
* add sft recipe
* add smollm sft
* max_length modif 1
* max_length modif 2
2025-02-25 21:45:59 +01:00
Nile Zhou
d036a1b341
fix reward verify err ( #430 )
2025-02-25 16:15:00 +01:00
Edward Beeching
11beb9a4dc
Updates evals to run with ddp=8 for small models ( #428 )
...
Currently the logic for calculating num_gpus considers eval in the TP setting; for the Qwen 7b models this returns 4. However, for smaller models we can use DDP and fix num_gpus at 8.
2025-02-25 16:00:13 +01:00
Agus
7188001281
Add script to decontaminate datasets against benchmark datasets ( #416 )
...
* Add script to decontaminate datasets against benchmark datasets
* Add docs for the decontamination script
* Update README.md
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update README.md
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update README.md
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update scripts/decontaminate.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update scripts/decontaminate.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update scripts/decontaminate.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update scripts/decontaminate.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Update scripts/decontaminate.py
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* Add license header and attribution to the authors
---------
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
2025-02-24 19:54:44 +01:00
Edward Beeching
0c3ef8372e
updates max_seq_length to max length due to a bug in trl ( #419 )
2025-02-24 17:27:56 +01:00
lewtun
566cfd1a44
Align format reward with R1 traces and add reward function to count think / answer tags ( #418 )
...
* Fix tests
* Tune
* Add reward
* Apply suggestions from code review
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---------
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-02-24 17:16:40 +01:00
elie
5355687e6c
add sft recipe ( #415 )
2025-02-24 15:43:12 +01:00
lewtun
3f9d75a595
Bump Liger kernel ( #399 )
...
Needed to enable SFT training via https://github.com/huggingface/trl/pull/2874
2025-02-23 17:44:03 +01:00
lewtun
eeca246b07
Update prompt template and sampling parameters for evaluation ( #392 )
...
* Pin t
* Pin t
* Set top p
* C
* Tune math prompt
* Improve math prompt
* Update tables
2025-02-22 15:21:01 +01:00
lewtun
49d9b741a5
Pin dependencies ( #393 )
2025-02-22 14:46:09 +01:00