InternLM3 Open Source: Achieving High-Performance Models with 4T Data
Written by InternLM Team
On January 15, Shanghai AI Lab announced a major upgrade to the InternLM model with the release of InternLM3. By refining its data framework, InternLM3 significantly improved data efficiency and achieved a leap in IQPT (Intelligence Quality per Token).
InternLM3-8B-Instruct, trained on only 4T tokens of data, outperforms other open-source models of similar scale while cutting training costs by over 75%. Additionally, for the first time, InternLM3 integrates routine conversational capabilities with deep thinking in a general-purpose model, enabling it to handle a wider range of real-world scenarios.
Demo page: https://internlm-chat.intern-ai.org.cn
HuggingFace: https://huggingface.co/internlm/internlm3-8b-instruct
GitHub: https://github.com/InternLM/InternLM
High IQPT drives high-performance reasoning
Data is a key driver for enhancing large model capabilities. Currently, most popular open-source models rely on expanding the scale of pretraining data to improve performance, with datasets typically approaching 20T tokens. This approach, however, leads to a linear increase in training costs and raises industry-wide concerns about data bottlenecks and the sustainability of the Scaling Law.
Our research team believes that improving data quality offers far greater benefits than merely increasing data volume. At the core of data quality is the data's IQPT (Intelligence Quality per Token), i.e., the logic, complexity, and inspiration embedded in the thinking process reflected in the data. To this end, we propose a large-scale data streamlining framework, which substantially improves the quality of training data.
In practice, InternLM3 matched the performance of popular open-source models trained on 18T tokens while using only 4T tokens of pretraining data. Enhancing model performance by improving data IQPT offers a new research paradigm for breaking through the limits of the Scaling Law.
To better evaluate the impact of data IQPT, we quantified it as the ratio of a model's average benchmark performance to the amount of its training data. This provides a measure of the "return on investment" of large model training data. Benchmarked against Llama3.1, a leading open-source model of similar scale, InternLM3's data IQPT is more than four times higher.
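The ratio defined above can be made concrete with a short arithmetic sketch. The benchmark averages below are made-up placeholders (the article does not report them); only the token counts (4T for InternLM3, roughly 15T for Llama3.1) follow the text's framing.

```python
# Illustrative IQPT ("return on investment") calculation.
# Scores are hypothetical placeholders, NOT reported benchmark numbers.
def iqpt(avg_benchmark_score: float, train_tokens_trillions: float) -> float:
    """Average benchmark performance per trillion training tokens."""
    return avg_benchmark_score / train_tokens_trillions

internlm3 = iqpt(70.0, 4.0)   # hypothetical average score, 4T tokens
llama31 = iqpt(65.0, 15.0)    # hypothetical average score, ~15T tokens

ratio = internlm3 / llama31   # with these placeholders, roughly 4x
```

With comparable benchmark averages, the 4T-vs-15T gap alone drives most of the multiple, which is the point of the metric: it rewards data efficiency, not raw data volume.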
Through the data streamlining framework, we significantly improved data efficiency for InternLM3, achieving a substantial increase in IQPT. This framework consists of two key components:
- Intelligent Data Processing: To enable fine-grained data handling, we partitioned the data into millions of domains. Using agent self-evolution techniques, we implemented large-scale automated quality checks, reflected on misclassified samples, and applied customized processing to each domain.
- Synthesis of High-Value Data: By combining general and specialized models, we rapidly iterated on synthesis algorithms and selected data to train specialized models. Through mining source material from vast amounts of natural data, improved tree-search strategies, and multi-dimensional quality validation, we synthesized a large volume of rich, high-quality data.
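The per-domain processing described above can be illustrated with a toy sketch. This is not the actual InternLM pipeline: the router, the quality heuristic, and the thresholds are all invented placeholders, standing in for the agent-based classifiers and automated quality checks the text describes.

```python
# Toy illustration of per-domain data streamlining (NOT the real pipeline):
# route each document to a domain, then apply a domain-specific quality bar.
from collections import defaultdict

def route_to_domain(doc: str) -> str:
    # Placeholder router; the real system partitions data into millions
    # of domains using self-evolving agent classifiers.
    if "def " in doc or "return" in doc:
        return "code"
    if any(ch.isdigit() for ch in doc):
        return "math"
    return "general"

def quality_score(doc: str) -> float:
    # Placeholder heuristic; the real checks are large-scale automated
    # agent reviews with reflection on misclassifications.
    return min(len(doc.split()) / 10, 1.0)

THRESHOLDS = {"code": 0.3, "math": 0.5, "general": 0.7}  # assumed values

def streamline(corpus):
    kept = defaultdict(list)
    for doc in corpus:
        domain = route_to_domain(doc)
        if quality_score(doc) >= THRESHOLDS[domain]:
            kept[domain].append(doc)
    return kept

docs = [
    "def add(a, b): return a + b",  # routed to "code", passes
    "2 + 2 equals 4",               # routed to "math", passes
    "hi",                           # routed to "general", filtered out
]
result = streamline(docs)
```

The design point the toy captures is that quality is judged relative to each domain rather than with one global filter, which is what makes fine-grained, customized processing possible.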
Using the open-source OpenCompass evaluation framework, we evaluated InternLM3 and other models with a reproducible, unified methodology. The evaluation covered more than ten authoritative benchmark sets, including CMMLU and GPQA, spanning performance dimensions such as reasoning, mathematics, programming, instruction following, long-context understanding, dialogue, and overall capability. The results show that InternLM3 outperforms comparable open-source models on most benchmarks, with overall performance closely matching GPT-4o-mini.
Fusion of Deep Thinking and General Conversation
Exploring general artificial intelligence through the "general-specialized integration" approach relies on enhancing deep reasoning and domain-generalization capabilities in tandem. With InternLM3, deep reasoning and general conversational abilities are, for the first time, integrated into a single general-purpose model, enabling it to handle a broader range of real-world scenarios.
Because the data styles of deep reasoning and general conversation differ substantially, the industry has typically built specialized models for reasoning tasks. Previously, the Shanghai AI Lab introduced InternThinker, a high-performing reasoning model capable of long-form reasoning with self-reflection and correction during inference, which outperformed o1-preview on mathematical competition benchmarks.
Following the "general-specialized integration" approach, we explored methods for training on a fusion of different data types, enabling the model to combine general conversational and deep reasoning abilities seamlessly. By leveraging system prompts, users can switch the model between conversational and reasoning modes with a single command, endowing the general-purpose model with deep thinking capabilities.
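The system-prompt switch can be sketched as below. The prompt wording and the `build_messages` helper are illustrative assumptions, not InternLM3's actual prompt; the point is that the same model weights serve both modes, with only the system message changing.

```python
# Sketch of mode switching via system prompts. The prompt text below is
# a hypothetical placeholder, NOT InternLM3's documented system prompt.
THINKING_SYSTEM_PROMPT = (
    "You are an expert reasoner. Think step by step before "
    "giving your final answer."
)

def build_messages(user_query: str, deep_thinking: bool = False):
    """Build a chat-template message list; the system prompt alone
    flips the model between conversational and reasoning modes."""
    messages = []
    if deep_thinking:
        messages.append({"role": "system", "content": THINKING_SYSTEM_PROMPT})
    messages.append({"role": "user", "content": user_query})
    return messages

# Conversational mode: no special system prompt.
chat = build_messages("Tell me about the Shanghai AI Lab.")

# Deep reasoning mode: prepend the thinking system prompt.
think = build_messages("Prove that sqrt(2) is irrational.", deep_thinking=True)
```

In practice the resulting `messages` list would be fed to the model through a chat template (e.g. `tokenizer.apply_chat_template(messages, ...)` in Hugging Face Transformers); no retraining or model swap is needed to change modes.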
In the post-training phase, we developed task-driven and knowledge-system-driven synthetic data strategies, including instruction annotation and synthesis based on the World Knowledge Tree and high-quality response generation using multi-agent techniques. By maximizing the potential of both real and synthetic user instructions, we categorized multi-task scenarios at fine granularity, creating hundreds of thousands of high-quality fine-tuning instructions that significantly improved the model's conversational experience.
As shown in the diagram below, during inference tasks, users can switch InternLM3 from general conversation mode to deep reasoning mode with a single click.