Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Zhoujun Cheng¹*, Shibo Hao¹*, Tianyang Liu¹*, Fan Zhou², Yutao Xie¹, Feng Yao¹

Yuexin Bian¹, Yonghao Zhuang³, Nilabjo Dey⁴, Yuheng Zha¹, Yi Gu¹, Kun Zhou¹

Yuqi Wang², Yuan Li³, Richard Fan², Jianshu She², Chengqian Gao², Abulhair Saparov⁴

Haonan Li², Taylor W. Killian², Mikhail Yurochkin², Zhengzhong Liu², Eric P. Xing²,³, Zhiting Hu¹

¹UC San Diego ²MBZUAI ³Carnegie Mellon University ⁴Purdue University

*Equal Contribution

Abstract

Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains (Math, Code, Science, Logic, Simulation, and Tabular), each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B/32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming the best baselines by 7.9% and 6.7% (absolute, in average score) on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release the data, models, and training and evaluation code in our code repository to facilitate general-purpose reasoning research.

Cross-Domain Reasoning Transfer

To understand how reasoning capabilities generalize with RL, we conducted controlled experiments using Guru. We compared RL training on a single reasoning domain against training on a mixed-domain corpus, using an experimental subset, Guru-18K, that contains 3K samples from each of the six domains.
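As a rough illustration of how a balanced subset such as Guru-18K could be drawn, the sketch below samples a fixed number of examples from each domain. The record layout (a `domain` key), the function name `sample_balanced_subset`, and the seed are assumptions made for illustration; the source only states that 3K samples are taken from each of the six domains.

```python
import random
from collections import defaultdict

DOMAINS = ["math", "code", "science", "logic", "simulation", "tabular"]

def sample_balanced_subset(examples, per_domain=3000, seed=0):
    """Draw `per_domain` examples from every domain, without replacement."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for ex in examples:
        by_domain[ex["domain"]].append(ex)

    subset = []
    for domain in DOMAINS:
        pool = by_domain[domain]
        if len(pool) < per_domain:
            raise ValueError(f"{domain}: only {len(pool)} examples available")
        subset.extend(rng.sample(pool, per_domain))

    rng.shuffle(subset)  # interleave domains for mixed-domain training
    return subset

# Toy usage: 10 examples per domain, 3 drawn from each.
toy = [{"domain": d, "id": i} for d in DOMAINS for i in range(10)]
mixed = sample_balanced_subset(toy, per_domain=3)
print(len(mixed))  # 18
```

With `per_domain=3000` and six domains, the same call would yield an 18K-example subset; a single-domain run would simply filter this subset by `domain`.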

Differential Transferability

Analysis of Cross-Domain Reasoning Transfer

Math, Code, and Science benchmarks improved consistently and substantially when models were trained on other domains, possibly because these domains are heavily represented in pretraining data. The other domains showed limited cross-domain gains. Within Math and Code, easier tasks showed positive transfer more readily than challenging benchmarks in the same domain. Training on a uniformly mixed dataset often matched or exceeded single-domain performance.

Reward and Response-Length Dynamics

In single-domain training, outputs for Code, Logic, and Tabular tasks became shorter, while outputs for Science and Math became more verbose. Joint training led to a steep initial rise in reward and could alter these length dynamics.

Effects of Training Data Difficulty

	Math (in-domain)			Code & Tabular (cross-domain)
Math training data	MATH500	AMC	AIME24	HumanEval	LiveCodeBench	HiTab	MultiHiertt
Baseline	75.8	52.1	15.8	82.3	11.1	56.5	32.0
Harder	78.6	58.4	21.7	73.1	10.7	53.5	35.5
Δ (Harder - Baseline)	+2.8	+6.3	+5.9	-9.2	-0.4	-3.0	+3.5

Training on harder math data improved in-domain math performance (for example, +5.9 on AIME24) but could degrade performance on easier cross-domain tasks (for example, -9.2 on HumanEval). For beneficial cross-domain transfer, a balanced distribution of difficulties or explicit inclusion of cross-domain data may be more effective.
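The difference row in the table can be reproduced with a few lines of arithmetic. The sketch below assumes, as the text implies, that the second data row is the run trained on harder math data and that each delta is the second row minus the first; the variable names are illustrative.

```python
benchmarks = ["MATH500", "AMC", "AIME24", "HumanEval", "LiveCodeBench", "HiTab", "MultiHiertt"]
first_row = [75.8, 52.1, 15.8, 82.3, 11.1, 56.5, 32.0]   # comparison run
second_row = [78.6, 58.4, 21.7, 73.1, 10.7, 53.5, 35.5]  # assumed: harder math data

# Round to one decimal to match the precision reported in the table.
deltas = {
    name: round(hard - base, 1)
    for name, base, hard in zip(benchmarks, first_row, second_row)
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

All three math benchmarks move up, while three of the four cross-domain benchmarks move down, which is the pattern the paragraph above describes.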

Data Construction

Data Sourcing: Curating datasets across the Math, Code, Science, Logic, Simulation, and Tabular domains

Deduplication: Removing overlapping content via substring matching (27.2% reduction for Math, 7.5% for Code)

Reward Design: Domain-specific verification (rule-based, execution-based, and model-based)

Heuristic Filtering: Removing noise and controlling complexity with uniform sampling

Difficulty Filtering: Selecting samples based on model performance gaps for appropriate challenge levels

Final Result: 92K curated examples
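Two of the steps above, deduplication and difficulty filtering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the normalization, the quadratic comparison, the function names, and the pass-rate thresholds (`max_weak`, `min_strong`) are all hypothetical, since the source only says that substring matching and model performance gaps are used.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not hide overlap."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup_by_substring(prompts):
    """Drop a prompt when its normalized text is contained in a longer
    (or identical) prompt that has already been kept.

    Quadratic in the number of prompts: fine for a sketch, too slow for a full corpus.
    """
    # Longest first, so the longer member of each overlapping pair is retained.
    pairs = sorted(((normalize(p), p) for p in prompts),
                   key=lambda pair: len(pair[0]), reverse=True)
    kept, kept_norm = [], []
    for norm, prompt in pairs:
        if any(norm in other for other in kept_norm):
            continue
        kept.append(prompt)
        kept_norm.append(norm)
    return kept

def keep_by_difficulty(weak_pass_rate, strong_pass_rate, max_weak=0.9, min_strong=0.0):
    """Hypothetical rule: drop samples a weaker model almost always solves (too easy)
    and samples that even a stronger model never solves (too hard or noisy)."""
    too_easy = weak_pass_rate > max_weak
    unsolved = strong_pass_rate <= min_strong
    return not (too_easy or unsolved)

prompts = [
    "What is 2 + 2?",
    "Q: What is 2 + 2?  Answer with a number.",
    "Sort a list of integers.",
]
kept = dedup_by_substring(prompts)
print(len(kept))  # 2
```

Keeping the longer prompt of an overlapping pair is one possible tie-break; a real pipeline might instead prefer the example with the more reliable reference answer.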

Experiment Results

We trained 7B and 32B models on the full Guru dataset to demonstrate the practical impact of multi-domain data. We used verl as the RL training framework and GRPO as the RL algorithm. The 7B model was trained for 2 epochs on 4 nodes (8 Hopper GPUs each), and the 32B model for 2 epochs on 16 nodes.
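For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization in plain Python; it is not verl's implementation, and the group size, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group.

    `rewards` holds the scalar rewards of all responses sampled for one prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt with a binary verifiable reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage. When every response in a group gets the same reward, all advantages are zero and the prompt contributes no learning signal, which is one reason filtering out samples that are too easy or too hard matters for RL training.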

Domain	Benchmark	Guru 7B	General Reasoner 7B	ORZ 7B	SimpleRL 7B	Guru 32B	ORZ 32B	SimpleRL 32B
Math	AIME24 (avg@32)	17.50	17.08	16.25	15.60	34.89	47.50	27.20
Math	MATH500	77.25	70.40	80.80	87.00	86.00	89.80	89.60
Code	LiveCodeBench (avg@4)	16.49	8.51	5.47	6.72	29.30	22.04	19.80
Code	HumanEval (avg@4)	82.62	61.12	67.38	58.08	90.85	84.30	81.25
Code	MBPP	70.00	39.80	48.40	49.60	78.80	74.20	76.75
Science	GPQA-diamond (avg@4)	40.78	38.64	37.63	35.98	50.63	55.67	46.46
Science	SuperGPQA	31.80	30.64	29.75	27.29	43.60	46.05	37.73
Logic	ARC-AGI (avg@4)	3.31	0.75	0.00	0.50	7.63	2.31	5.25
Logic	Zebra Puzzle (avg@4)	39.40	0.07	1.00	0.62	45.21	0.54	1.16
Simulation	CodeI/O (avg@4)	15.63	7.13	5.13	6.63	12.63	3.75	9.75
Simulation	CruxEval-I	61.72	63.63	69.38	56.25	80.63	71.13	72.63
Simulation	CruxEval-O	71.28	56.50	65.88	58.31	88.75	82.38	67.75
Tabular	FinQA	34.70	34.33	37.60	35.10	46.14	45.20	45.41
Tabular	HiTab	74.20	54.40	54.10	50.40	82.00	63.30	69.00
Tabular	MultiHiertt (avg@4)	44.94	31.62	38.10	37.57	55.28	52.83	52.83
Others	IFEval	35.81	39.56	32.72	36.69	55.45	38.26	55.27
Others	LiveBench	18.57	19.76	12.64	15.20	34.30	28.78	28.33
Average Score		43.29	33.76	35.42	33.97	54.24	47.53	46.25
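The Average Score row appears to be the unweighted mean of the 17 benchmark scores in each column. That reading is inferred from the numbers rather than stated in the source, and the short check below reproduces it for the two Guru columns, along with the margins over the best baseline average at each scale.

```python
# Per-benchmark scores copied from the table, in row order.
guru_7b = [17.50, 77.25, 16.49, 82.62, 70.00, 40.78, 31.80, 3.31, 39.40,
           15.63, 61.72, 71.28, 34.70, 74.20, 44.94, 35.81, 18.57]
guru_32b = [34.89, 86.00, 29.30, 90.85, 78.80, 50.63, 43.60, 7.63, 45.21,
            12.63, 80.63, 88.75, 46.14, 82.00, 55.28, 55.45, 34.30]

def average(scores):
    """Unweighted mean, rounded to the two decimals used in the table."""
    return round(sum(scores) / len(scores), 2)

avg_7b = average(guru_7b)
avg_32b = average(guru_32b)

# Best baseline averages reported in the table: ORZ 7B and ORZ 32B.
margin_7b = round(avg_7b - 35.42, 2)
margin_32b = round(avg_32b - 47.53, 2)

print(avg_7b, avg_32b)        # 43.29 54.24
print(margin_7b, margin_32b)  # 7.87 6.71
```

The margins of 7.87 and 6.71 round to the 7.9 and 6.7 quoted in the abstract, so those figures are absolute differences in average score rather than relative improvements.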

Pass@k Curves

Pass@k behavior is highly task-dependent: improvements on math tasks (e.g., AIME) may largely leverage existing base-model capabilities, whereas tasks like Zebra Puzzle show a genuine expansion of reasoning ability. Model scale also matters: the larger 32B models show more consistent gains than the 7B models. Decoding hyperparameters also significantly affect Pass@k, with higher temperature and top-p increasing exploration and improving performance at larger k. Together, these observations suggest that Pass@k reflects both model and sampling dynamics and should be interpreted cautiously.
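For reference, Pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct. The source does not say which estimator its curves use, so the sketch below is an assumption about the standard definition, not a description of the authors' evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved in 2 of 32 samples looks very different at k=1 and k=8.
print(pass_at_k(32, 2, 1))  # 0.0625
print(round(pass_at_k(32, 2, 8), 4))  # 0.4435
```

The example shows why sampling settings matter: a problem the model solves only rarely contributes little at k=1 but substantially at larger k, so anything that increases sample diversity can raise the tail of the curve.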

BibTeX

@misc{cheng2025revisiting,
  title = {Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective},
  author = {Zhoujun Cheng and Shibo Hao and Tianyang Liu and Fan Zhou and Yutao Xie and Feng Yao and Yuexin Bian and Yonghao Zhuang and Nilabjo Dey and Yuheng Zha and Yi Gu and Kun Zhou and Yuqi Wang and Yuan Li and Richard Fan and Jianshu She and Chengqian Gao and Abulhair Saparov and Haonan Li and Taylor W. Killian and Mikhail Yurochkin and Zhengzhong Liu and Eric P. Xing and Zhiting Hu},
  journal = {arXiv preprint arXiv:2506.14965},
  year = {2025},
  doi = {10.48550/arXiv.2506.14965},
  url = {https://arxiv.org/abs/2506.14965}
}