Faster and More Efficient 4-bit Quantized LLM Inference
LMDeploy has released an exciting new feature: 4-bit quantization and inference. This not only trims the model's memory footprint down to roughly 40% of what FP16 inference requires, but, thanks to highly optimized kernels, inference performance is not compromised either. On the contrary, it is more than three times as fast as FP16 inference on a GeForce RTX 4090.
We benchmarked both the Llama-2-7B-chat and Llama-2-13B-chat models with 4-bit quantization and FP16 precision respectively. The throughput for generating completion tokens was measured with a single prompt token and 512 generated tokens per response. All results were measured for single-batch inference.
As shown in the diagram, 4-bit inference with LMDeploy is 3.16 times faster than FP16 inference, and it outperforms other strong competitors by a margin of around 30% to 80%.
As for memory overhead, we tested context window sizes of 1024, 2048, and 4096. The 4-bit 7B model fits easily on a single GeForce RTX 3060.
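For a rough intuition about why the context window drives memory usage, here is a back-of-envelope estimate of the FP16 k/v cache size, assuming the public Llama-2-7B architecture (32 layers, 32 heads, head dim 128); weights and activations come on top of this:
# Rough FP16 k/v-cache size estimate for Llama-2-7B (assumed architecture:
# 32 layers, 32 attention heads, head dim 128). Illustrative only; the real
# footprint depends on the runtime's allocator and cache layout.
def kv_cache_bytes(seq_len, num_layers=32, num_heads=32, head_dim=128, dtype_bytes=2):
    # factor 2 accounts for both the key and the value tensors
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

for ctx in (1024, 2048, 4096):
    print(f"context {ctx}: ~{kv_cache_bytes(ctx) / 1024**3:.2f} GiB")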
For more detailed test results, please refer to the benchmark section of this article.
Quick Start
Installation
The minimum requirement for 4-bit LLM inference with LMDeploy on NVIDIA graphics cards is compute capability 8.0 (sm80) or higher, which covers GPUs such as the A10, A100, and GeForce RTX 30/40 series.
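One way to check whether your GPU qualifies is to query its compute capability through PyTorch (assuming a CUDA-enabled PyTorch installation):
import torch

# sm80 corresponds to CUDA compute capability 8.0; Ampere (A10, A100,
# RTX 30 series) and Ada (RTX 40 series) GPUs all report major version >= 8.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
assert major >= 8, "4-bit inference with LMDeploy requires sm80 or newer"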
Before proceeding with inference, please ensure that lmdeploy (>= v0.0.5) is installed:
pip install lmdeploy
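A quick way to confirm the installed version from Python, using only the standard library (this simple check assumes a plain x.y.z version string):
# Confirm that the installed lmdeploy meets the minimum version requirement.
from importlib.metadata import version

installed = version("lmdeploy")
print(f"lmdeploy {installed}")
assert tuple(int(x) for x in installed.split(".")[:3]) >= (0, 0, 5), \
    "lmdeploy >= v0.0.5 is required for 4-bit inference"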
Get 4-bit quantized model
You can visit LMDeploy’s model zoo to download pre-quantized 4-bit models.
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
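If you prefer downloading from Python instead of git, the same repository can be fetched with huggingface_hub's snapshot_download (a minimal sketch, assuming a recent huggingface_hub is installed):
# Download the pre-quantized 4-bit weights into ./llama2-chat-7b-w4
from huggingface_hub import snapshot_download

snapshot_download(repo_id="lmdeploy/llama2-chat-7b-w4",
                  local_dir="./llama2-chat-7b-w4")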
Alternatively, you can quantize the model weights to 4-bit by following the instructions presented in this guide.
Inference
## Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
--model-name llama2 \
--model-path ./llama2-chat-7b-w4 \
--model-format awq \
--group-size 128
## inference
python3 -m lmdeploy.turbomind.chat ./workspace
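The --group-size 128 option means group-wise quantization: every 128 consecutive weights share one scale and one zero point. The snippet below is a conceptual sketch of asymmetric 4-bit group quantization and dequantization, not LMDeploy's actual AWQ kernel:
import torch

def quantize_groupwise(w, group_size=128):
    # Each group of group_size weights shares one scale and one zero point.
    # Real AWQ kernels additionally pack two 4-bit values per byte and fuse
    # dequantization into the matmul; this is for illustration only.
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=1, keepdim=True).values
    w_max = g.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0   # 4 bits -> levels 0..15
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 15)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero, shape):
    return ((q - zero) * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())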
Serve with gradio
If you wish to interact with the model via a web UI, start the gradio server as shown below:
python3 -m lmdeploy.serve.gradio.app ./workspace --server-ip {ip_addr} --server-port {port}
Subsequently, open http://{ip_addr}:{port} in your browser and interact with the model.
Benchmark
LMDeploy uses two evaluation metrics to measure the performance of the inference API: completion token throughput (also known as output token throughput) and request throughput. The former measures the speed of generating new tokens given fixed numbers of prompt and completion tokens, while the latter measures the number of requests processed per minute on real dialogue data.
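In essence, both metrics are simple ratios. The sketch below uses made-up placeholder numbers purely for illustration; it is not the actual profiling code:
# Placeholder helpers illustrating the two metrics; the numbers in the
# example calls are invented and are not measured results.
def completion_token_throughput(tokens_generated, decode_seconds):
    # Output tokens produced per second during generation.
    return tokens_generated / decode_seconds

def request_throughput(requests_finished, total_seconds):
    # Requests completed per minute over the whole benchmark run.
    return requests_finished / (total_seconds / 60.0)

print(completion_token_throughput(512, 3.2))   # e.g. one 512-token completion
print(request_throughput(1000, 600.0))         # e.g. 1000 requests in 10 minutes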
Completion token throughput
We used lmdeploy's profile_generation.py to test the token generation throughput and memory usage of the 4-bit and 16-bit Llama-2-7B-chat models at different batch sizes on an A100-80G. The number of prompt tokens and completion tokens was set to 1 and 512, respectively.
The comparison results for throughput and GPU memory usage are as follows:
Request throughput
We also tested the request throughput of the 4-bit and 16-bit Llama-2-7B-chat models with lmdeploy's profile_throughput.py on an A100-80G GPU. The comparison results are as follows:
More
In addition to 4-bit weight quantization, LMDeploy also supports INT8 quantization of the k/v cache. We believe that combining the two will further improve inference performance. A more detailed performance evaluation will be reported soon. Follow https://github.com/InternLM/lmdeploy to stay up to date with the latest news!
The more you star, the more you get :)