Deploy Llama-2 models easily with LMDeploy!

OpenMMLab
4 min read · Aug 2, 2023


This article will guide you on how to quickly deploy the Llama-2 models with LMDeploy.

Three Llama-2 models have been open-sourced so far: 7B, 13B, and 70B. Compared to Llama-1, the 7B and 13B architectures are unchanged, while the 70B replaces Multi-Head Attention with Grouped-Query Attention. Overall, it’s not too difficult to support. Let’s get started!

The LMDeploy’s Journey with Llama-2

Getting Started: 7B/13B

Meta provides Llama-2 7B and 13B chat models with a context window of 4096 tokens. Since they share the same structure as Llama, all we need to do is add the Llama-2 chat template to LMDeploy.

Tip: LMDeploy can deploy any language model that shares the Llama or Llama-2 structure. Feel free to submit PRs adding their chat templates to LMDeploy :)
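
For reference, the chat format such a template has to produce is the well-known Llama-2 prompt layout, written out below as a plain Python helper for illustration only (this is not LMDeploy’s internal template API):

def build_llama2_prompt(system, history, user):
    """Build a Llama-2 chat prompt. `history` is a list of (user, assistant) pairs."""
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for past_user, past_assistant in history:
        prompt += f"{past_user} [/INST] {past_assistant} </s><s>[INST] "
    prompt += f"{user} [/INST]"
    return prompt

print(build_llama2_prompt("You are a helpful assistant.", [], "Hello!"))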

The installation of LMDeploy is very simple:

pip install lmdeploy

By following the steps below, you will be able to interact with it via the command line:

python3 -m lmdeploy.serve.turbomind.deploy llama2 /the/path/of/llama-2-chat-7b-hf
python3 -m lmdeploy.turbomind.chat ./workspace

Launch the triton inference server to serve the model:

tritonserver --model-repository=./workspace/model_repository/ --allow-grpc=1 --grpc-port=33337
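
Once the server is up, you can sanity-check it with the official tritonclient package. The port below matches the command above; the exact model names exposed depend on the converted ./workspace, so this snippet simply lists the repository index rather than hard-coding one:

import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint started above.
client = grpcclient.InferenceServerClient(url="localhost:33337")
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("models      :", client.get_model_repository_index())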

If you want to use the webui chat window, you can do the following:

python3 -m lmdeploy.app {tritonserver_ip_addr}:33337

Open http://localhost:6006 in your browser, and you can chat with the AI assistant online.

LMDeploy delivers outstanding inference performance, outperforming similar open-source projects on both output token throughput and request throughput. Output token throughput measures token generation speed with fixed input and output lengths, while request throughput measures the number of requests served per minute in a realistic conversation scenario.

The diagram above shows output token throughput when the input and output token lengths are (2048, 2048). Overall, LMDeploy is about 5%–15% higher than DeepSpeed and outperforms the official Facebook Llama-2 inference implementation by up to 5x.

In terms of request throughput, LMDeploy is about 30% higher than vLLM.

Advancing: 70B

Llama-2 70B uses GQA (Grouped-Query Attention). As shown in the following diagram, GQA divides the query heads into groups, each of which shares a single key head and value head. When the number of groups equals the number of query heads, it degenerates to MHA (Multi-Head Attention); when there is only one group, it becomes MQA (Multi-Query Attention).

According to the literature, GQA is close to MHA in terms of model capability while being as efficient as MQA in terms of inference speed.
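
The sketch below illustrates the mechanism in a few lines of PyTorch: the k/v tensors only carry n_kv_heads heads and are expanded so that each group of query heads shares one of them. The head counts are illustrative, not those of any particular Llama-2 checkpoint:

import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 2, 16, 128
n_heads, n_kv_heads = 8, 2               # group size = 8 // 2 = 4
group = n_heads // n_kv_heads

q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # only n_kv_heads are stored
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Repeat each shared k/v head so every query head in a group attends to it.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-1, -2)) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v      # (batch, n_heads, seq_len, head_dim)
print(out.shape)

Setting n_kv_heads = n_heads recovers MHA, and n_kv_heads = 1 recovers MQA.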

An auto-regressive model using MHA maintains a large k/v cache during inference. Its memory overhead is:

batch * max_seq_len * n_heads * head_dim * sizeof(half) * 2

For GQA, the k/v cache overhead becomes:

batch * max_seq_len * n_kv_heads * head_dim * sizeof(half) * 2

Here, n_heads / n_kv_heads is the group size. As you can see, GQA reduces the k/v cache to 1/group of its MHA size, which is very beneficial for attention, a memory-bound computation.
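
Plugging illustrative numbers into the two formulas makes the saving concrete. The 70B head counts used here (64 query heads, 8 k/v heads, head_dim 128) are the commonly reported Llama-2 70B configuration and are taken as an assumption; note that the formulas as written cover a single transformer layer:

batch, max_seq_len, head_dim = 32, 4096, 128
n_heads, n_kv_heads = 64, 8
sizeof_half = 2  # bytes in a half-precision value

mha_cache = batch * max_seq_len * n_heads * head_dim * sizeof_half * 2
gqa_cache = batch * max_seq_len * n_kv_heads * head_dim * sizeof_half * 2

print(f"MHA k/v cache per layer: {mha_cache / 2**30:.1f} GiB")    # 4.0 GiB
print(f"GQA k/v cache per layer: {gqa_cache / 2**30:.1f} GiB")    # 0.5 GiB
print(f"group size (reduction factor): {n_heads // n_kv_heads}")  # 8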

LMDeploy has implemented GQA and supports tensor parallelism. The deployment method is similar to that of 7B: you just need to set the tensor-parallelism parameter to 8 when converting the model. For more details, please refer to the serving documentation.

LMDeploy’s Special Features

Interactive Mode Inference: No More Paying for Conversation History

In multi-turn conversation scenarios, most inference engines require users to send the prompt as well as the past conversation history to the server. This means that users have to pay for the history in each round of the conversation. LMDeploy can cache all attention k/v of the conversation, thus avoiding repetitive processing of historical conversations. We call this procedure interactive mode, which can greatly reduce the latency of generating the first token, especially for long conversation history.
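
The toy sketch below is not LMDeploy’s API; it only illustrates the accounting. With a per-session k/v cache, each turn has to prefill just the newly typed prompt instead of the entire history:

class InteractiveSession:
    """Toy model of interactive-mode inference (illustration only)."""

    def __init__(self):
        self.cached_tokens = 0  # tokens whose attention k/v is already cached

    def turn(self, new_prompt_tokens, generated_tokens):
        # Without caching, the whole history would be re-encoded every turn.
        without_cache = self.cached_tokens + new_prompt_tokens
        # With the session cache, only the new prompt tokens are processed.
        with_cache = new_prompt_tokens
        self.cached_tokens += new_prompt_tokens + generated_tokens
        return with_cache, without_cache

session = InteractiveSession()
for prompt_len, reply_len in [(50, 80), (40, 60), (30, 100)]:
    processed, reprocessed = session.turn(prompt_len, reply_len)
    print(f"prefill {processed} tokens (vs {reprocessed} if history were resent)")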

Persistent Batch: The key to high throughput

LMDeploy models the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process. To put it simply:

- The persistent batch has N pre-configured batch slots.
- Requests join the batch when free slots are available. A batch slot is released and can be reused once the generation of the requested tokens finishes.
- The batch grows or shrinks automatically to minimize unnecessary computation.
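
The toy loop below sketches this idea (conceptual only, not LMDeploy’s implementation): N slots live for the whole serving process, waiting requests occupy free slots, and a slot returns to the pool as soon as its request finishes generating:

from collections import deque

N_SLOTS = 4
free_slots = deque(range(N_SLOTS))
active = {}  # slot id -> (request name, decoding steps left)
waiting = deque([("req-a", 3), ("req-b", 1), ("req-c", 2), ("req-d", 2), ("req-e", 1)])

step = 0
while waiting or active:
    # Admit waiting requests while free slots are available.
    while waiting and free_slots:
        slot = free_slots.popleft()
        active[slot] = waiting.popleft()
    # One decoding step over the whole batch; finished requests release their slot.
    for slot in list(active):
        name, remaining = active[slot]
        remaining -= 1
        if remaining == 0:
            del active[slot]
            free_slots.append(slot)
        else:
            active[slot] = (name, remaining)
    step += 1
    print(f"step {step}: active={len(active)} free={len(free_slots)} waiting={len(waiting)}")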

Conclusion

Other exciting features of LMDeploy are under active development. Feel free to follow our project at https://github.com/InternLM/lmdeploy for the latest updates!
