#247 Rho-1: Not All Tokens Are What You Need for Pretraining
Previous language model pre-training methods uniformly apply a next-token prediction loss to all training tokens, but not all tokens in a corpus are equally important for language model training. This paper examines the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, a new language model called Rho-1 is introduced. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach scores tokens using a reference model and then trains the language model with a focused loss on the tokens with higher scores. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields absolute improvements in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and Rho-1-7B achieve state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when continually pretrained on 80B general tokens, Rho-1 achieves a 6.8% average improvement across 15 diverse tasks, increasing both the data efficiency and the performance of language model pre-training.
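To make the SLM idea concrete, here is a minimal PyTorch sketch of the selective loss, assuming the paper's description: score each token by its excess loss (target-model loss minus reference-model loss) and back-propagate only through the top fraction of tokens. The function name, `keep_ratio` parameter, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(target_logits, reference_logits, labels, keep_ratio=0.6):
    """Cross-entropy averaged over the top `keep_ratio` fraction of tokens,
    ranked by excess loss = target token loss - reference token loss.
    (Illustrative sketch; not the official Rho-1 code.)"""
    vocab = target_logits.size(-1)

    # Per-token next-token-prediction losses for both models (no reduction).
    target_loss = F.cross_entropy(
        target_logits.view(-1, vocab), labels.view(-1),
        reduction="none", ignore_index=-100)
    with torch.no_grad():
        ref_loss = F.cross_entropy(
            reference_logits.view(-1, vocab), labels.view(-1),
            reduction="none", ignore_index=-100)

    # Tokens where the target model lags the reference the most are kept.
    excess = (target_loss - ref_loss).detach()
    valid = (labels.view(-1) != -100)
    excess = excess.masked_fill(~valid, float("-inf"))
    k = max(1, int(keep_ratio * valid.sum().item()))
    _, top_idx = torch.topk(excess, k)

    # Back-propagate only through the selected high-score tokens.
    return target_loss[top_idx].mean()
```

In this sketch, the reference model's logits would come from a small model trained on curated, in-distribution data, and `keep_ratio` corresponds to the fraction of tokens kept for the focused loss.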
In this video, I talk about the following: What is Selective Language Modeling? How are Rho-1 models trained? How does Rho-1 perform?
For more details, please look at https://proceedings.neurips.cc/paper_files/paper/2024/file/3322a9a72a1707de14badd5e552ff466-Paper-Conference.pdf and https://github.com/microsoft/rho
Lin, Zhenghao, Zhibin Gou, Yeyun Gong, Xiao Liu, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. "Not all tokens are what you need for pretraining." Advances in Neural Information Processing Systems 37 (2024): 29029-29063.