Model Deployment (I): A Brief Introduction to Model Deployment

OpenMMLab
Apr 10, 2023

Many members of the OpenMMLab community face confusion regarding the deployment of OpenMMLab models. To address this issue, an open source model deployment toolbox, MMDeploy, has been developed. MMDeploy facilitates the deployment process, bridging the gap between models and applications. In this technical article, we will use MMDeploy to provide a step-by-step tutorial on model deployment. We will cover the following topics:

  • The definition of the intermediate representation ONNX
  • Converting a PyTorch model to an ONNX model
  • Using inference engines such as ONNX Runtime and TensorRT
  • Common problems in the PyTorch-ONNX-ONNX Runtime/TensorRT deployment pipeline and their solutions
  • The C/C++ inference SDK of MMDeploy

By going through this tutorial, we hope that readers will be able to deploy their own PyTorch models with ONNX Runtime/TensorRT and gain proficiency in deploying various OpenMMLab computer vision models on different inference engines using MMDeploy.

We assume that readers have a basic understanding of PyTorch and are familiar with Python; no prior knowledge of model deployment is required.

In this technical article, we will deploy a simple super-resolution model and explore concepts such as intermediate representation and inference engine.

Getting Acquainted with Model Deployment

In software engineering, deployment refers to the process of putting developed software into use, including steps such as environment configuration and software installation. Similarly, for deep learning models, model deployment refers to the process of running a trained model in a specific environment. Compared with software deployment, model deployment brings additional difficulties:

  1. The environment required to run the model is difficult to configure. Deep learning models are usually written in a framework such as PyTorch or TensorFlow. Because of their size and dependencies, these frameworks are often unsuitable for installation in production environments such as mobile phones and development boards.
  2. Deep learning models are usually large, and a large amount of computing power is required to run them in real time, so their operating efficiency needs to be optimized.

Because of these difficulties, model deployment cannot be accomplished by simple environment configuration and installation. After several years of exploration in industry and academia, a popular pipeline for model deployment has emerged:

To deploy a model to a certain environment, developers first use any deep learning framework to define the network structure and determine its parameters through training. The structure and parameters of the model are then converted into an intermediate representation that only describes the network, and some graph-level optimizations are performed on this intermediate representation. Finally, an inference engine, written in a hardware-oriented high-performance programming framework (such as CUDA or OpenCL), converts the intermediate representation into a specific file format and runs the model efficiently on the corresponding hardware platform.

This pipeline solves two major problems in model deployment. First, because the intermediate representation sits between the deep learning framework and the inference engine, developers no longer need to worry about running various complex frameworks in the new environment. Second, the graph-level optimizations on the intermediate representation and the low-level optimizations inside the inference engine greatly improve the model's operating efficiency.

Now, let's start with a "Hello World" model deployment program to learn more about model deployment!

Deploy your first model

Generate a PyTorch model

Referring to the official PyTorch model deployment tutorial, let's implement a super-resolution model with PyTorch and deploy it onto the ONNX Runtime inference engine.

First of all, we need to create a development environment with a PyTorch codebase. We strongly recommend using conda to manage Python libraries. With conda, you can initialize a PyTorch environment with the following commands:

# Create a virtual environment named 'deploy' with Python 3.7 pre-installed
conda create -n deploy python=3.7 -y
# Enter the virtual environment
conda activate deploy
# Install a CPU version of PyTorch
conda install pytorch torchvision cpuonly -c pytorch

If your device supports CUDA, we recommend using a GPU version of PyTorch after configuring the CUDA environment. In that case, change the above PyTorch installation command into:

# Install a CUDA 11.3 version of PyTorch
# Please refer to the official PyTorch installation tutorial if you use another CUDA version
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch

This tutorial will use some other third-party libraries. You can install these libraries with the following commands:

# Install ONNX Runtime, ONNX, OpenCV
pip install onnxruntime onnx opencv-python

After everything is configured, create a new empty folder. All scripts in this article can be run with the following Jupyter Notebook.

script1.ipynb

In the new folder, use the code below to create a script that runs the super-resolution model.

import os

import cv2
import numpy as np
import requests
import torch
import torch.onnx
from torch import nn


class SuperResolutionNet(nn.Module):

    def __init__(self, upscale_factor):
        super().__init__()
        self.upscale_factor = upscale_factor
        self.img_upsampler = nn.Upsample(
            scale_factor=self.upscale_factor,
            mode='bicubic',
            align_corners=False)

        self.conv1 = nn.Conv2d(3, 64, kernel_size=9, padding=4)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=1, padding=0)
        self.conv3 = nn.Conv2d(32, 3, kernel_size=5, padding=2)

        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.img_upsampler(x)
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return out


# Download checkpoint and test image
urls = ['https://download.openmmlab.com/mmediting/restorers/srcnn/srcnn_x4k915_1x16_1000k_div2k_20200608-4186f232.pth',
        'https://raw.githubusercontent.com/open-mmlab/mmediting/master/tests/data/face/000001.png']
names = ['srcnn.pth', 'face.png']
for url, name in zip(urls, names):
    if not os.path.exists(name):
        open(name, 'wb').write(requests.get(url).content)


def init_torch_model():
    torch_model = SuperResolutionNet(upscale_factor=3)

    state_dict = torch.load('srcnn.pth')['state_dict']

    # Adapt the checkpoint: strip the leading module prefix from each key
    for old_key in list(state_dict.keys()):
        new_key = '.'.join(old_key.split('.')[1:])
        state_dict[new_key] = state_dict.pop(old_key)

    torch_model.load_state_dict(state_dict)
    torch_model.eval()
    return torch_model


model = init_torch_model()
input_img = cv2.imread('face.png').astype(np.float32)

# HWC to NCHW
input_img = np.transpose(input_img, [2, 0, 1])
input_img = np.expand_dims(input_img, 0)

# Inference
torch_output = model(torch.from_numpy(input_img)).detach().numpy()

# NCHW to HWC
torch_output = np.squeeze(torch_output, 0)
torch_output = np.clip(torch_output, 0, 255)
torch_output = np.transpose(torch_output, [1, 2, 0]).astype(np.uint8)

# Save the result image
cv2.imwrite("face_torch.png", torch_output)

In this script, we create a classic super-resolution network, SRCNN. SRCNN first upsamples the image to the target resolution and then processes it with three convolutional layers. For convenience, we skip the training step and directly download the model weights and the test image. (Since the weight layout of SRCNN in MMEditing differs from the model we defined, we rename the keys of the state dict to fit our model.) To turn the model output into a correct image, we convert it to HWC format and clamp each channel's color values to the range 0-255. If the script works properly, a super-resolution photo of a face will be saved as "face_torch.png".

After the PyTorch model is tested correctly, let’s officially start deploying the model. Our next task is to convert the PyTorch model into a model described by the intermediate representation ONNX.

Intermediate Representation: ONNX

Before introducing ONNX, let's look at the essence of a neural network's structure. A neural network actually just describes a data computation process, and its structure can be represented by a computation graph. For example, 'a+b' can be represented by the following computation graph:
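Alongside the figure, we can also see such a graph in code using PyTorch's tracing facility, which we will meet again when exporting to ONNX. A minimal sketch, not part of the original tutorial:

import torch

def add(a, b):
    return a + b

# Run the function once with example inputs and record its computation graph
traced = torch.jit.trace(add, (torch.rand(2), torch.rand(2)))
print(traced.graph)  # shows a tiny graph with a single add node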

To speed up computation, some frameworks describe neural networks with static "compile-before-execute" graphs. The disadvantage of static graphs is that they make it difficult to describe control flow (such as if-else branches and for loops): introducing control statements directly produces different computation graphs for different inputs. For example, if 'a=a+b' is executed n times in a loop, a different computation graph is generated for each n:
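We can reproduce this behavior with tracing. In the hedged sketch below (again not from the original tutorial), a loop running n times is traced for two values of n, and each value yields a differently sized static graph because the loop is unrolled:

import torch
from torch import nn

class LoopAdd(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n

    def forward(self, a, b):
        # A Python-level loop: tracing cannot record it as control flow
        for _ in range(self.n):
            a = a + b
        return a

# Each n produces a different static graph with n add nodes unrolled
for n in (2, 4):
    traced = torch.jit.trace(LoopAdd(n), (torch.rand(2), torch.rand(2)))
    print(f'n={n}:\n{traced.graph}')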

ONNX (Open Neural Network Exchange) was released by Facebook and Microsoft together in 2017. It is a standard format for describing computation graphs. At present, under the joint maintenance of several institutions, ONNX has been integrated with various deep learning frameworks and inference engines. Therefore, ONNX is regarded as a bridge from deep learning frameworks to inference engines, much like the intermediate language of a compiler. Because the compatibility of the various frameworks differs, we usually only use ONNX to represent static graphs, which are easier to deploy.

Let’s convert the PyTorch model to an ONNX model with the following code:

x = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    torch.onnx.export(
        model,
        x,
        "srcnn.onnx",
        opset_version=11,
        input_names=['input'],
        output_names=['output'])

Here, torch.onnx.export is PyTorch's built-in function for converting a model to ONNX format.

Let's first look at the first three mandatory parameters: the model to be converted, any set of inputs for the model, and the filename of the exported ONNX file.

When converting a model, it is easy to see why the original model and the output file name are required, but why do we need to provide a set of inputs? This relates to how ONNX conversion works. Converting a PyTorch model to an ONNX model is essentially a translation between languages. One intuitive idea is to thoroughly parse the code of the original model like a compiler, recording all control flow.

But as mentioned earlier, we usually only use ONNX to record static graphs that do not contain control flow. Thus, PyTorch provides a model conversion method called trace: given a set of inputs, the model is actually executed once, and the computation graph corresponding to this set of inputs is recorded and saved in ONNX format. torch.onnx.export uses this tracing export method, which is why it needs a set of inputs to run the model. Our test image is three-channel and 256x256 in size, so we construct a random tensor of the same shape here.

Among the remaining parameters, opset_version indicates the version of the ONNX operator set. As deep learning develops, new operators keep appearing, and ONNX regularly releases new operator sets to support them; at the time of writing, 15 versions had been released. We set opset_version=11, meaning the 11th ONNX operator set is used, because the bicubic (double cubic) interpolation in SRCNN is only supported from opset 11 onward. The last two parameters, input_names and output_names, are the names of the input and output tensors, which we will use later.
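As a side note, you can check the highest operator set version known to your installed onnx package with the snippet below (a small sketch, not from the original tutorial):

import onnx

# The highest ONNX operator set version supported by the installed package
print(onnx.defs.onnx_opset_version())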

If the above code runs successfully, an ONNX model file “srcnn.onnx” will be added to the directory. We can use the following script to verify that the model file is correct:

import onnx

onnx_model = onnx.load("srcnn.onnx")
try:
    onnx.checker.check_model(onnx_model)
except Exception:
    print("Model incorrect")
else:
    print("Model correct")

Here, the onnx.load function reads an ONNX model from disk, and onnx.checker.check_model checks whether the model format is correct, raising an exception if it is not. If the model is well formed, "Model correct" is printed to the console.

Then, let's take a look at the actual structure of the ONNX model. We can use Netron (an open source model visualization tool) to visualize it. Drag the srcnn.onnx file from your local file system into the site, and you will see the following visualization:

Click input or output to view the basic information of the ONNX model, including the model version and the names and data types of the model's inputs and outputs.

Click an operator node to see its specific information. For example, clicking on the first Conv shows:

Each operator records three types of information: operator attributes, graph structure, and weights.

  • Operator attributes are the entries under attributes in the figure. For convolution, they include the convolution kernel size (kernel_shape), the stride (strides), and so on. These attributes will eventually be used to instantiate a specific operator.
  • Graph structure information refers to the name of the operator node in the computation graph and its adjacent edges. For the convolution in the figure, the operator node is called Conv_2, its input data is called 11, and its output data is called 12. From the graph structure information of all operator nodes, the computation graph of the network can be completely reconstructed.
  • Weight information refers to the trained weights stored by the operators. For convolution, this includes the weight values of the convolution kernel and the bias values. Click the plus sign behind conv1.weight to see the contents of the weights. The sketch after this list shows how to read the same information programmatically.
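Besides Netron, the same three types of information can be read with the onnx Python API. A minimal sketch, assuming "srcnn.onnx" is in the current directory (field names follow the ONNX protobuf definition):

import onnx

model = onnx.load('srcnn.onnx')
graph = model.graph

# Graph structure and operator attributes of every node
for node in graph.node:
    print(node.name, node.op_type,
          'inputs:', list(node.input), 'outputs:', list(node.output),
          'attributes:', [attr.name for attr in node.attribute])

# Weight information: initializer names and shapes (e.g. conv1.weight)
for init in graph.initializer:
    print(init.name, list(init.dims))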

Now, we have the ONNX model of SRCNN. Let’s see how we can finally put this model to work.

Inference Engine: ONNX Runtime

ONNX Runtime is a cross-platform machine learning inference accelerator maintained by Microsoft; it is the "inference engine" we mentioned earlier. ONNX Runtime directly serves ONNX: it can read and run .onnx files without converting them to another format. In other words, for the deployment pipeline 'PyTorch - ONNX - ONNX Runtime', as soon as you get the .onnx file onto the target device and run it with ONNX Runtime, the model deployment is done.

So far, we have converted the model written in PyTorch into an ONNX model and visually checked its correctness. Now let's complete the last step of model deployment: running the model with ONNX Runtime.

ONNX Runtime provides a Python interface. Continuing from the earlier script, we can add the following code to run the model:

import onnxruntime

# Create an inference session from the ONNX model file
ort_session = onnxruntime.InferenceSession("srcnn.onnx")
# Run inference: the first argument lists the output names, the second maps
# input names to numpy arrays
ort_inputs = {'input': input_img}
ort_output = ort_session.run(['output'], ort_inputs)[0]

# NCHW to HWC, clamp to [0, 255], and save the result
ort_output = np.squeeze(ort_output, 0)
ort_output = np.clip(ort_output, 0, 255)
ort_output = np.transpose(ort_output, [1, 2, 0]).astype(np.uint8)
cv2.imwrite("face_ort.png", ort_output)

Excluding the post-processing operations, only three lines of this code relate to ONNX Runtime. Let's briefly walk through them. onnxruntime.InferenceSession creates an ONNX Runtime inference session; its parameter is the path of the ONNX model file to be used for inference.

The session's run method performs model inference. Its first parameter is a list of output tensor names, and its second parameter is a dictionary of input values, whose keys are input tensor names and whose values are numpy arrays. The names of the input and output tensors must correspond to the input_names and output_names set in torch.onnx.export.

If the code works correctly, another super-resolution photo will be saved as "face_ort.png". This picture is exactly the same as the "face_torch.png" obtained earlier, which shows that ONNX Runtime has successfully run the SRCNN model. The model deployment is complete!
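To confirm that the two images really are identical, we can also compare the two outputs numerically. A small sanity check, not part of the original tutorial (it reuses torch_output and ort_output from the cells above):

import numpy as np

# Cast to a wider integer type before subtracting to avoid uint8 wrap-around
diff = np.abs(torch_output.astype(np.int16) - ort_output.astype(np.int16))
print('max pixel difference:', diff.max())  # expected to be 0 (or at most 1)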

If a user wants to perform super-resolution later, the "srcnn.onnx" file is all they need: with a Python environment where ONNX Runtime is installed, the model can be run with just a few lines of code. Alternatively, we can use ONNX Runtime to build an application that executes the model directly; then we only need to ship the ONNX model file to users and let them select it in the application.
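As a sketch of what such reusable code could look like, here is a hypothetical helper (the function name and signature are our own, not from MMDeploy or the original tutorial) that queries the input and output names from the session instead of hardcoding them:

import cv2
import numpy as np
import onnxruntime

def upscale_image(onnx_path: str, img_path: str, out_path: str) -> None:
    """Run a single-input, single-output super-resolution ONNX model."""
    sess = onnxruntime.InferenceSession(onnx_path)
    # Query tensor names from the model instead of hardcoding them
    input_name = sess.get_inputs()[0].name
    output_name = sess.get_outputs()[0].name

    img = cv2.imread(img_path).astype(np.float32)
    img = np.expand_dims(np.transpose(img, [2, 0, 1]), 0)  # HWC -> NCHW

    out = sess.run([output_name], {input_name: img})[0]

    out = np.clip(np.squeeze(out, 0), 0, 255)  # NCHW -> CHW, clamp to [0, 255]
    cv2.imwrite(out_path, np.transpose(out, [1, 2, 0]).astype(np.uint8))

# Example: upscale_image('srcnn.onnx', 'face.png', 'face_app.png')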

Summary

In this tutorial, we leverage established model deployment tools to facilitate the deployment of an initial version of the super-resolution model SRCNN. However, as the model architecture becomes increasingly complex in practical application scenarios, challenges will arise. In the upcoming tutorial, we will “upgrade” this super-resolution model to enable dynamic input support.

Does this article feel a little difficult to fully grasp? Don't worry! There is a lot to learn about model deployment, and to keep the examples simple, this tutorial touches on many topics that will be covered in the following articles. For now, it is enough to keep the following highlights in mind:

  • Model deployment refers to the process of running a trained model in a specific environment. It needs to solve two major problems: poor framework compatibility and slow model execution.
  • The common model deployment pipeline is 'Deep Learning Framework - Intermediate Representation - Inference Engine', and ONNX is a popular intermediate representation.
  • A deep learning model is essentially a computation graph. When models are deployed, they are typically converted into static graphs, that is, graphs without control flow (branch and loop statements).
  • The PyTorch framework has built-in support for ONNX: construct a random set of inputs and call torch.onnx.export on the model to convert it from PyTorch to ONNX.
  • The inference engine ONNX Runtime has built-in support for ONNX models. Given a .onnx file, model inference can be done with just the Python API of ONNX Runtime.

To bring deep learning algorithms into production, the challenging step of model deployment is unavoidable. That is why we developed the open source codebase MMDeploy, which implements the deployment of OpenMMLab models for multiple vision tasks such as object detection, image segmentation, and super-resolution.

It supports multiple inference engines, including ONNX Runtime, TensorRT, ncnn, openppl, and OpenVINO. In the following tutorials, we will introduce more model deployment techniques and how they are used in MMDeploy. We hope you will follow our upcoming tutorials, keep an eye on MMDeploy, and contribute to it.
