Fine-tuning Llama2 takes less than 200 lines of code!

OpenMMLab
Jul 27, 2023


Last week, Meta AI released their next-generation large language model, Llama 2. They open-sourced the model along with the training and inference scripts, and the model is even available in a Hugging Face version. They have clearly kept regular users in mind while open-sourcing conscientiously, which is incredibly cool. As soon as I heard the news, I rushed to the official repository, llama-recipes, planning to try out the Llama 2 training process.

During my first run, it was clear that the code release had been somewhat rushed, and I ran into a few minor issues. The model stopped converging after only a few iterations. After carefully reviewing the code, I found a small mistake: the epoch-based scheduler was being updated as if it were step-based, which made the learning rate decay far too quickly. After only a few iterations, the learning rate was almost zero. So I quickly reported the issue to the official team: https://github.com/facebookresearch/llama-recipes/issues/27
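To see why this matters, here is a minimal, self-contained sketch (not the official llama-recipes code) of what happens when an epoch-based StepLR, configured with the step_size=1, gamma=0.85 schedule used later in this post, is stepped once per iteration:

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# A dummy parameter, just to build an optimizer for this illustration.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=1e-4)
# Meant to decay the learning rate once per *epoch*.
scheduler = StepLR(optimizer, step_size=1, gamma=0.85)

for _ in range(50):
    optimizer.step()
    scheduler.step()  # the bug: stepping once per *iteration*

# After only 50 iterations the lr is roughly 3e-8, effectively zero.
print(scheduler.get_last_lr())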

The official response was incredibly fast and they fixed the issue the very next day: https://github.com/facebookresearch/llama-recipes/pull/28

Anyone who has run into the same issue can simply pull the latest code to resolve it.

After resolving this minor issue, Llama 2 trained normally. Great work, Meta AI! star++

After this minor incident, I had a good handle on the Llama 2 training process. As stated in the paper, the model is trained with FSDP. Wait, FSDP? Doesn't MMEngine v0.8.0 also support FSDP training? So I implemented the Llama 2 training process on top of MMEngine's new features. See the complete training example at: https://github.com/open-mmlab/mmengine/tree/main/examples/llama2

Implementing the Dataset

We directly referred to the implementation of the Alpaca dataset in llama-recipes; a simplified sketch is shown below.
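For reference, here is a simplified sketch of what such a dataset can look like. It is not the exact llama-recipes implementation; it assumes an Alpaca-style JSON file with instruction and output fields, masks the prompt tokens out of the loss, and pads everything to a fixed length so it works with default_data_collator:

import json

import torch
from torch.utils.data import Dataset


class AlpacaDataset(Dataset):
    """Simplified Alpaca-style instruction-tuning dataset."""

    def __init__(self, tokenizer, data_path, max_length=512):
        self.tokenizer = tokenizer
        self.max_length = max_length
        with open(data_path) as f:
            self.samples = json.load(f)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        prompt = (f"### Instruction:\n{sample['instruction']}\n\n"
                  f"### Response:\n")
        answer = sample['output'] + self.tokenizer.eos_token
        prompt_ids = self.tokenizer.encode(prompt, add_special_tokens=True)
        answer_ids = self.tokenizer.encode(answer, add_special_tokens=False)

        input_ids = (prompt_ids + answer_ids)[:self.max_length]
        # Compute the loss on the response only: mask prompt tokens with -100.
        labels = ([-100] * len(prompt_ids) + answer_ids)[:self.max_length]

        # Pad to a fixed length (the pad_token is added to the tokenizer
        # before the dataset is built, see the model preparation below).
        pad_len = self.max_length - len(input_ids)
        attention_mask = [1] * len(input_ids) + [0] * pad_len
        input_ids = input_ids + [self.tokenizer.pad_token_id] * pad_len
        labels = labels + [-100] * pad_len

        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            'labels': torch.tensor(labels),
        }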

Building the FSDPStrategy

The constructor of FSDPStrategy initializes the distributed environment, random seeds, and other environment variables, so it needs to be called first. Strategy is a feature introduced in MMEngine v0.8.0, aimed at solving some of the issues of large-model training. For a detailed explanation of Strategy, stay tuned for subsequent articles~

strategy = FSDPStrategy(
    model_wrapper=dict(
        # Wrap each LlamaDecoderLayer as a separate FSDP unit.
        auto_wrap_policy=partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls={LlamaDecoderLayer})),
    # Gather a full (unsharded) state dict when saving checkpoints.
    state_dict_cfg='full',
    env_kwargs=dict(randomness=dict(seed=42)))

Building the dataloader and model

The configuration is copied directly from the official repo. Note that the official repo trains the full set of parameters in bf16 by default, so mixed-precision training is not needed.

# Prepare model
tokenizer = LlamaTokenizer.from_pretrained(args.checkpoint)
tokenizer.add_special_tokens({'pad_token': '<PAD>'})
model = LlamaForCausalLM.from_pretrained(args.checkpoint)
model.to(torch.bfloat16)
model.train()

# Prepare dataset
train_dataset = AlpacaDataset(
    tokenizer=tokenizer, data_path=args.data_root)
train_dataloader = DataLoader(
    train_dataset,
    batch_size=args.batch_size,
    sampler=DefaultSampler(train_dataset, seed=0),
    collate_fn=default_data_collator,
    drop_last=True)

Preparing optimizer and scheduler

The configuration aligns with the official repo, using AdamW and StepLR. The model, optimizer, and scheduler are then passed to the strategy, which handles the FSDP-related logic.

optim_cfg = dict(
    optimizer=dict(type=AdamW, lr=1e-4, weight_decay=0.0),
    accumulative_counts=ORI_BATCH_SIZE / args.batch_size)
scheduler_cfgs = [dict(type=StepLR, step_size=1, gamma=0.85)]
model, optimizer, schedulers = strategy.prepare(
    model,
    optim_wrapper=optim_cfg,
    param_scheduler=scheduler_cfgs,
    dispatch_kwargs=dict(max_iters=max_iters, max_epochs=args.max_epoch))

Customizing the training loop

By using the strategy, we can break away from the Runner and implement the training logic freely. Doesn't it feel similar to native PyTorch?

for epoch in range(args.max_epoch):
    for idx, inputs in enumerate(train_dataloader):
        # Convert inputs to target device.
        inputs = apply_to(inputs, lambda m: isinstance(m, torch.Tensor),
                          lambda m: m.cuda())

        loss = model(**inputs).loss
        optimizer.update_params(loss)

        max_memory = torch.cuda.max_memory_allocated()
        strategy.logger.info(f'Epoch: {epoch+1}/{args.max_epoch}, '
                             f'Iter: {idx+1}/{epoch_length}, '
                             f'Loss: {loss.item():.3f}, '
                             f'Lr: {optimizer.get_lr()["lr"][0]:.6f} '
                             f'Memory: {max_memory/1e9:.3f}G')
        visualizer.add_scalars({'loss': loss.item()})

        torch.cuda.reset_peak_memory_stats()

    for scheduler in schedulers:
        scheduler.step()

    save_dir = f'{args.output_dir}/epoch_{epoch+1}'
    state_dict = model.state_dict()

    if is_main_process():
        model.save_pretrained(save_dir, state_dict=state_dict)
        tokenizer.save_pretrained(save_dir)

However, leaving the Runner also has its drawbacks: we have to update the learning rate, print and record logs, and save weights manually.

In conclusion

Interested users are welcome to try out the training examples in MMEngine; we would love to hear your feedback. If you are interested in DeepSpeed or ColossalAI, we will provide fine-tuning examples based on them as soon as possible.

MMEngine: https://github.com/open-mmlab/mmengine
