Thoroughly evaluating AX620A from the perspective of the security industry

OpenMMLab
Aug 10, 2023


We’ll thoroughly evaluate AX620A from the perspective of the security and defense business, acting as a third-party Deploee. We’ll test and assess AX620A’s performance across dimensions such as model design, inference, and the SDK, and provide the corresponding test source code and running logs. We hope this information provides valuable insight for chip selection.

OpenMMLab Platform Link

Product Introduction

AX620A is the second-generation vision chip launched by AXERA, configured with a 4-core Cortex-A7 CPU and an NPU delivering 3.6 Tops@int8.

In int4 mode, AX620A’s computing power rises to 14.4 Tops. However, int4 imposes specific requirements on model design, and the official open-source documentation has not clarified its usage, so we left it out of this test.

A bare compute unit cannot be tested in practice, so we chose the Maix-III AXera-Pi, a development board built by Sipeed around the AX620A chip. As of August 7, 2023, the retail price of the core board is under $40. The full board (core board plus baseboard) comes with a USB 3.0 baseboard, a WIFI module, an Ethernet interface, a camera, and a 5-inch display, making it fully functional and convenient to work with.

After acquiring the device, we first tested its CPU performance using megpeak, a tool for measuring peak computational performance that supports ARM, x86, and OpenCL.

In this test, we modified CMakeLists.txt and adjusted the gcc compilation flags for the Cortex-A7 target.

Here’s what we got:

there are 4 cores, currently use core id :0

bandwidth: 1.861453 Gbps
padal throughput: 5.411672 ns 0.739143 GFlops latency: 6.405562 ns :
padd throughput: 1.509807 ns 2.649345 GFlops latency: 5.024015 ns :
mla_s32 throughput: 5.275521 ns 1.516438 GFlops latency: 5.290761 ns :
mlal_s8 throughput: 2.923057 ns 5.473721 GFlops latency: 5.025521 ns :
mlal_s16 throughput: 2.770953 ns 2.887093 GFlops latency: 5.106042 ns :
mlal_s16_lane throughput: 2.765276 ns 2.893020 GFlops latency: 5.027750 ns :
mla_f32 throughput: 5.354490 ns 1.494073 GFlops latency: 10.047442 ns :
mul_s32 throughput: 5.393568 ns 0.741624 GFlops latency: 5.274667 ns :
mul_f32 throughput: 5.387370 ns 0.742477 GFlops latency: 5.398443 ns :
cvt throughput: 5.377729 ns 0.743808 GFlops latency: 5.352896 ns :
qrdmulh throughput: 5.275443 ns 0.758230 GFlops latency: 5.353959 ns :
rshl throughput: 2.763766 ns 1.447301 GFlops latency: 5.023833 ns :

The results we obtained are:

  • The measured memory bandwidth is 1.86 Gbps; the device itself provides 1.2 GB of memory
  • For int8 multiplication, the 4-core CPU can deliver about 22 GFlops in total, 3.7 times the fp32 multiplication figure

Users can estimate the execution time of image-processing code from these numbers, assuming memory movement and computation are well optimized and do not interfere with each other: divide the amount of computation by 1.49 GFlops when the workload is dominated by fp32 arithmetic, or divide the data volume by the memory bandwidth when the main operation is memory copying.
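
To make this concrete, here is a minimal Python sketch of such an estimate. The constants come from the megpeak results above; the convolution example and the reading of the bandwidth unit are our own assumptions, not measurements from this test.

# Back-of-envelope runtime estimate from the megpeak numbers above.
FP32_GFLOPS = 1.49        # measured mla_f32 throughput on one core
BYTES_PER_SEC = 1.86e9    # treating the reported 1.86 Gbps as ~1.86 GB/s

def estimate_ms(flops: float = 0.0, bytes_moved: float = 0.0) -> float:
    """Return whichever bound dominates, compute or memory, in ms."""
    compute_ms = flops / (FP32_GFLOPS * 1e9) * 1e3
    memory_ms = bytes_moved / BYTES_PER_SEC * 1e3
    return max(compute_ms, memory_ms)

# Example: a 3x3 fp32 convolution, 3 -> 16 channels, over a 1080x1920 image
flops = 2 * 3 * 3 * 3 * 16 * 1080 * 1920   # 2 * k*k * cin * cout * H * W
print(f"~{estimate_ms(flops=flops):.0f} ms if compute-bound on one core")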

Next, we evaluate the NPU performance. AX620A supports a 1_1 mode, in which half of the computing power is used for night-vision enhancement and the other half for AI computation. Since security scenarios usually require enhanced image quality at night, all subsequent tests enable 1_1 mode; if you need to compare QPS with other similar chips, multiply AX620A’s results by 2.

Single Operator Test

AX620A’s model conversion tool is called pulsar, a python script shipped inside a docker image. Pulsar currently supports 46 kinds of onnx operators; the complete support list can be found here: onnx support list.

Testing Process

Although we cannot test every operator, operators can be broken down into basic operations, much as GEMM can be decomposed into multiple GEPP/GEBP kernels. Based on their computation patterns, we grouped these 46 onnx operators into 9 categories and selected one operator from each category for testing.

Next, we used torch2onnx to generate corresponding onnx models for these 9 operators.

Because the operators take many parameters, and to keep the total test time reasonable, we fixed the input shape of conv at 224x224 and tested the other operators at three scales: 112x112, 384x256, and 1080x1920. This yielded 170 single-operator onnx models; the torch code that generates them can be found here: github link.
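
As an illustration of the generation step (the exact test code lives in the repository linked above), a single-operator model can be exported roughly as follows; the operator choice and file name here are illustrative:

# Minimal sketch: export one conv operator to onnx with torch.
import torch

class SingleConv(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # one 3x3 conv, the kind of basic operation each category reduces to
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return self.conv(x)

model = SingleConv().eval()
dummy = torch.randn(1, 3, 224, 224)  # conv input fixed at 224x224 in the test
torch.onnx.export(model, dummy, "conv3x3_224.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=11)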

Then, we used pulsar to convert these onnx operators. 139 models converted successfully; 28 failed during conversion with clear error logs, including one softmax conversion that hung.

Finally, for the models that ran successfully, we measured each operator’s energy-efficiency ratio, an indicator of the number of MACs (multiply-accumulates) completed per microsecond, and sorted the operators by that ratio; the full table is linked at the end of this section.
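
For reference, the ratio is obtained by dividing an operator’s theoretical MAC count by its measured latency. A small sketch with hypothetical numbers (the conv parameters and latency below are placeholders, not measured values):

# Energy-efficiency ratio = MACs completed per microsecond.
def conv_macs(h, w, cin, cout, k):
    # each output pixel needs k*k*cin multiply-accumulates per output channel
    return h * w * cin * cout * k * k

macs = conv_macs(224, 224, 3, 16, 3)  # conv3x3, 3 -> 16 channels, 224x224
latency_us = 150.0                    # hypothetical measured latency
print(f"{macs / latency_us:.0f} MACs/us")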

Result Analysis

The test results are consistent with the official Efficient Operator Design Guidelines, and we observed several additional phenomena:

  • When stride=2, conv7x7 becomes more efficient than conv3x3
  • Using conv1x1 instead of gemm makes better use of the NPU
  • Binary operations such as add carry non-trivial overhead

The complete test results, including the conversion process logs, statistical tables, and execution scripts, can be found here: https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/opr_test.md

Users can adjust model parameters according to the mac_util and efficiency columns in the table.

Model Testing

The hardware model library contains 640 onnx models exported from OpenMMLab algorithms, covering tasks such as 3D detection, segmentation, keypoint recognition, and OCR. They are well suited to testing the completeness of a vision chip’s software stack.

Since AX620A does not support dynamic shape, it cannot run models such as mmdet3d-voxel, mmaction, and LLaMa. We therefore selected 318 fixed-input-size onnx models for testing.
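
As an illustration of how fixed-input-size models can be filtered, here is a minimal sketch using the onnx package; the file name is hypothetical and this is not the exact script we used:

# Keep only models whose every input dimension is a fixed positive integer,
# since AX620A does not support dynamic shape.
import onnx

def has_static_inputs(path: str) -> bool:
    model = onnx.load(path)
    for inp in model.graph.input:
        for d in inp.type.tensor_type.shape.dim:
            # a symbolic dim_param or non-positive dim_value means dynamic
            if d.dim_param or d.dim_value <= 0:
                return False
    return True

print(has_static_inputs("resnet50.onnx"))  # file name is illustrative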

However, some models failed to convert because certain operators are not implemented; in several cases, many operators were missing at once.

In the end, we successfully converted 60 models. Take the running time of the resnet series on AX620A as an example.

Compared with resnet50’s runtime of over 100 ms on the Jetson Nano, the AXera-Pi is about 5 times faster at roughly half the retail price, a clear cost-performance advantage.

The onnx models used for testing can be searched and downloaded in the hardware model library. Execution logs and results have been published at the following link: https://github.com/tpoisonooo/deploee-benchmark/blob/main/ax620a/model_test.md

SDK Evaluation

A vision SDK is usually composed of multiple pipelines: its inputs are images or videos, and its outputs are structured data. Since image decoding is rarely the performance bottleneck, developing a vision SDK mainly requires attention to video decoding, image operations, and pipeline compatibility.

Decoding

In the security field, the most commonly used video formats are h.264 and h.265, mostly in main or high profile, with a standard size of 1088x1920. Although the videos used in our tests run at 60 fps, this does not affect the final conclusion.

$ ffmpeg -i 1088x1920.h264
..
Stream #0:0: Video: h264 (High 10), yuv420p10le(progressive), 1920x1088 [SAR 136:135 DAR 16:9], 57 fps, 59.94 tbr, 1200k tbn, 119.88 tbc

We made slight modifications to the ax-pipeline source code to test peak decoding speed under different conditions, such as output scaling, cropping, and flipping. In real business, video width and height are not fixed; on some chip implementations, correcting the video output can reduce decoding speed, while smaller video sizes can speed decoding up, so these cases must be covered when testing decoding speed.

In our tests, the video decoding speed of AX620A stayed almost stable at 60 fps and was unaffected by the image-processing operations.

Image Processing Support

The second factor affecting pipeline throughput is image-processing speed. Take common face recognition as an example: before recognition, the face image must be rectified, and whether the CPU and NPU can efficiently perform this perspective transformation may determine the pipeline’s maximum throughput. The image-processing operators that AX620A provides in IVPS (Video Image Processing Subsystem) are listed in the documentation, which users can consult at the following link: https://github.com/sipeed/axpi_bsp_sdk/tree/main/docs

Due to the lack of WarpPerspective and TopK, deploying a complete face-recognition pipeline may require adjusting the image-rectification and feature-engineering implementations.
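
Since IVPS lacks WarpPerspective, one workaround is to rectify faces on the CPU instead. A minimal sketch using OpenCV (our assumption as a fallback, not part of the AX620A SDK):

# CPU fallback for face alignment when the NPU-side IVPS cannot help.
import cv2
import numpy as np

def align_face(image, src_pts, dst_size=(112, 112)):
    """Warp a detected face quad to a canonical crop on the CPU."""
    w, h = dst_size
    dst_pts = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    m = cv2.getPerspectiveTransform(np.float32(src_pts), dst_pts)
    return cv2.warpPerspective(image, m, dst_size)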

Conclusion

From the perspective of the security industry, we have conducted a comprehensive test of AX620A, covering the CPU, operators, models, and the SDK, and have provided detailed run logs and scripts throughout.

According to the experimental results, we have reached the following conclusions:

1. Cost Performance: ★★★★★

AX620A performs strongly. Compared with the Jetson Nano, its contemporary, it achieves more than a 5-fold performance improvement at less than half the retail price, while using only 50% of its own computing power.

2. Usability: ★★★★☆

Pulsar ships as a Docker image, so users can run it directly without complicated installation or configuration. AX620A also comes with complete samples and documentation, giving users good support. However, the lack of chip-architecture documentation leaves AX620A slightly short on transparency.

3. Model Compatibility: ★★★☆☆

AX620A does not support dynamic shape, and a few models could not be converted within 2 hours, leaving room for improvement in compatibility.

4. CV Operators and Decoding Support: ★★★☆☆

AX620A meets basic computer-vision operator and decoding needs, but the API comes with usage restrictions.

If AXERA further optimizes pulsar and improves model compatibility, AX620A has the potential to become an excellent vision NPU chip. We have learned through industry channels that these issues have been addressed in AXERA’s third-generation vision chip, AX650N, and we look forward to its performance.
