In the early morning of July 19 (Beijing time), Meta open-sourced LLaMA-2, the upgraded version of LLaMA. All three model sizes, from 7B to 70B parameters, are open and free for commercial use. Let's take a quick look at the exciting new features of LLaMA-2.
LLaMA-2 is currently available in three sizes, with 7 billion, 13 billion, and 70 billion parameters. Unlike LLaMA-1, in addition to the base models, fine-tuned chat versions of all sizes have also been open-sourced, ready for direct use in dialogue applications. Filling out an application form is still required to obtain the download link. Meta officially provides both the original format and the HuggingFace format, for a total of 12 models, to meet various needs.
According to the official introduction, LLaMA-2 used 40% more training data than LLaMA-1 in the pre-training stage, for a total of 2 trillion tokens.
In performance evaluations, LLaMA-2 outperformed its predecessor across various dimensions, including academic, reasoning, knowledge, and comprehension abilities. For instance, LLaMA-2-7B achieved a 21% improvement on MMLU, a widely used academic benchmark for large models, nearly doubled its performance on the GSM8K math reasoning benchmark, and improved by 12% on the TriviaQA knowledge benchmark.
Compared to LLaMA-1, LLaMA-2 scales the model up to 70 billion parameters. With the larger training corpus, LLaMA-2 (70B) achieved substantial performance improvements, coming close to ChatGPT on multiple evaluation sets.
The paper also compared LLaMA-2 (70B) against closed-source models (GPT-3.5, GPT-4, PaLM, PaLM-2). ChatGPT scored 70.0 on MMLU, almost on par with LLaMA-2 (70B)'s 68.9. With the community's continued efforts, open-source models should keep closing the gap with ChatGPT.
According to the paper, LLaMA-2 retains the overall architecture of LLaMA-1, increases the context length from 2,048 to 4,096 tokens, and introduces Grouped-Query Attention (GQA) to improve inference efficiency.
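GQA cuts inference cost by letting several query heads share a single key/value head, which shrinks the KV cache that must be kept in memory during generation. A minimal NumPy sketch of the idea (shapes and names are illustrative, not Meta's implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Grouped-query attention: n_q_heads query heads share
    n_kv_heads key/value heads (n_kv_heads divides n_q_heads).
    q: (n_q_heads, seq, d)    k, v: (n_kv_heads, seq, d)
    """
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads           # query heads per KV head
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                       # index of the shared KV head
        scores = q[h] @ k[kv].T / np.sqrt(d)  # scaled dot-product
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)  # softmax
        out[h] = attn @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads to cache
v = rng.standard_normal((2, 4, 16))
print(grouped_query_attention(q, k, v).shape)  # (8, 4, 16)
```

With 8 query heads but only 2 KV heads, the KV cache here is a quarter of the multi-head-attention size, which is the point of the technique.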
LLaMA-2-Chat, the focus of this upgrade, builds superior dialogue capabilities through Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). In addition, LLaMA-2 introduces Ghost Attention to improve the model's multi-turn dialogue capability.
During the SFT phase, LLaMA-2 carried out a series of data-quality improvements on top of open-source instruction fine-tuning datasets. With a curated set of 27,540 high-quality examples, LLaMA-2 unlocks excellent dialogue capabilities.
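One implementation detail of SFT for dialogue described in the paper is that the loss on user-prompt tokens is zeroed out, so the model is trained only to produce the assistant's reply. A minimal sketch of that masked loss, with made-up numbers:

```python
def masked_nll(token_logprobs, is_response):
    """Average negative log-likelihood over response tokens only.
    Prompt tokens are masked (0) so gradients come solely from
    the assistant's reply tokens (1).
    token_logprobs: log p(token_i) under the model
    """
    total = sum(-lp for lp, m in zip(token_logprobs, is_response) if m)
    return total / sum(is_response)

# toy sequence: 3 prompt tokens (masked) + 2 response tokens
lps  = [-0.1, -0.2, -0.3, -0.5, -0.7]
mask = [0, 0, 0, 1, 1]
print(round(masked_nll(lps, mask), 6))  # 0.6, the mean of 0.5 and 0.7
```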
In the RLHF phase, LLaMA-2 collects user preference data by asking annotators to compare responses from two models, and additionally collects a dedicated set of safety preference data. On this preference data, LLaMA-2 trains Reward Models that align with human preferences along two axes: helpfulness and safety.
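A reward model on such pairwise comparisons is typically trained with a ranking loss; the LLaMA-2 paper uses a binary ranking loss with a margin term that reflects how strongly the annotator preferred one answer. A minimal sketch (function names and numbers are illustrative):

```python
import math

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected - margin).
    The margin pushes the scores of clearly-preferred responses
    further above the rejected ones.
    """
    z = r_chosen - r_rejected - margin
    return -math.log(1.0 / (1.0 + math.exp(-z)))

# loss is small when the preferred response out-scores the rejected one,
# and large when the ordering is violated
print(ranking_loss(2.0, 0.5))  # small loss: preference respected
print(ranking_loss(0.5, 2.0))  # large loss: preference violated
```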
After these two stages of fine-tuning the base model, LLaMA-2-Chat not only retains the base model's capabilities but also follows human instructions better in dialogue, producing helpful and safe responses.
LLaMA-2 conducted comprehensive research on the safety aspect, carried out systematic work at various stages such as pre-training, fine-tuning, and safety evaluation, and overall improved the model’s safety capabilities.
During the pre-training stage, LLaMA-2 removed personal-privacy-related data in accordance with relevant regulations, and systematically analyzed biases and toxicity in the pre-training data. In the fine-tuning stage, LLaMA-2 introduced three techniques:
- Fine-tuning that aligns with safety guidelines
- Safety-oriented RLHF
- Safety-oriented Context Distillation
In addition, LLaMA-2 introduced Red Teaming, which further strengthens the model's safety by simulating adversarial attacks. LLaMA-2 has also open-sourced its categorization and annotation guidelines for safety, which should encourage more valuable academic work on improving the safety of large models.
After the above steps, LLaMA-2 achieved excellent results on two safety benchmarks, TruthfulQA and ToxiGen, even surpassing ChatGPT on the ToxiGen toxicity benchmark.