Newest User Guide for MMAction2 — Master Action Recognition Today!

OpenMMLab
Feb 27, 2023

In September 2022, MMAction2 V1.0 was refactored on top of the MMEngine foundational library, bringing comprehensive optimizations to its architecture and functionality. After more than four months of polishing, what does the latest V1.0 rc3 version bring? Come and explore!

The latest MMAction2 V1.0 rc3 version has brought many new features, including:

  • Latest SOTA video understanding algorithms.
  • Enhance skeleton action recognition with rich motion modalities.
  • Inferencer: get model inference done in just one line of code.
  • Model Zoo upgraded: better baselines, higher starting points.
  • Omni-Source technology: boost video model training with image datasets.
  • Deploy spatiotemporal detection models with our exportation tool.

Try now: https://github.com/open-mmlab/mmaction2/tree/1.x

1. Latest SOTA video understanding algorithms

MMAction2 V1.0 adds support for a number of recent models and datasets in the field of video understanding, including:

  • Video Swin Transformer (CVPR’2022)
  • VideoMAE (NeurIPS’2022)
  • C2D (CVPR’2018)
  • MViT V2 (CVPR’2022)
  • STGCN++ (ArXiv’2022)
  • UniFormer V1 (ICLR’2022) and V2 (ArXiv’2022)
  • AVA-Kinetics dataset for spatiotemporal action detection

2. Enhance skeleton action recognition with rich motion modalities

MMAction2 previously supported only the joint and bone modalities; V1.0 adds support for the joint-motion and bone-motion modalities [4] as well. We trained and evaluated all four modalities on STGCN, 2s-AGCN, and STGCN++ using NTU60 2D and 3D keypoint data.
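These modalities are straightforward to derive from the joint coordinates: a bone is the vector from a joint to its parent in the skeleton graph, and the motion streams are frame-to-frame differences. The sketch below illustrates the idea with NumPy; the function name, array layout, and parent list are illustrative assumptions, not the exact MMAction2 pipeline.

import numpy as np
# joints: (T, V, C) keypoints — T frames, V joints, C coordinates (2 or 3).
# parents[v] is the index of joint v's parent in the skeleton graph (the root points to itself).
def to_modalities(joints, parents):
    bone = joints - joints[:, parents, :]          # bone = joint minus parent joint
    joint_motion = np.zeros_like(joints)
    joint_motion[:-1] = joints[1:] - joints[:-1]   # temporal difference of joints
    bone_motion = np.zeros_like(bone)
    bone_motion[:-1] = bone[1:] - bone[:-1]        # temporal difference of bones
    return {'joint': joints, 'bone': bone,
            'joint_motion': joint_motion, 'bone_motion': bone_motion}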

3. Inferencer: get model inference done in just one line of code

Inferencer is a fast inference interface introduced in MMAction2 V1.0 rc3. Once initialized with a model name, Inferencer can perform model inference in a single line of code.

from mmaction.apis.inferencers import MMAction2Inferencer
# Initialize from a model alias; the config and checkpoint are resolved via the metafile.
inferencer = MMAction2Inferencer(rec='tsn')
result = inferencer('demo/demo.mp4')  # path to the input video

Additionally, inference can be run from the command line with a single command.

python demo/demo_inferencer.py demo/demo.mp4 --rec tsn --vid-out-dir demo_out

Visualization results of the fast inference API.

Compared to the previous inference script, Inferencer has the following advantages:

a. Simpler. Based on the model information saved in the metafile, the checkpoint can be omitted, and a short alias can be used in place of a long config file name.

b. More powerful. The fast inference interface supports visualization and can save inference results in multiple ways.

c. More unified. Inferencer provides a single interface for the various tasks in MMAction2, and the interface is consistent across OpenMMLab algorithm libraries, making models truly usable out of the box.
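To see how these points translate to code, here is a minimal sketch; the vid_out_dir keyword mirrors the --vid-out-dir flag of the demo script above and is an assumption rather than a documented signature, so check the API reference for the exact parameter names.

from mmaction.apis.inferencers import MMAction2Inferencer
# (a) The 'tsn' alias stands in for the full config path and checkpoint.
inferencer = MMAction2Inferencer(rec='tsn')
# (b) Save the visualized prediction while also getting the raw results back;
#     vid_out_dir is assumed to mirror the demo's --vid-out-dir flag.
results = inferencer('demo/demo.mp4', vid_out_dir='demo_out')
print(results)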

For more detailed usage, please refer to the relevant documentation. If you want to implement the inference interface by yourself, you can also refer to the MMEngine inference interface design document.

4. Model Zoo upgraded: better baselines, higher starting points

a. ImageNet/Kinetics pre-training

We use the improved pre-trained weights that TorchVision obtained with its new training recipes [https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/] to update the initialization of a series of video classification models, and achieve better results. The following table shows some of the results:
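For context on where these weights come from, TorchVision exposes the recipe-improved checkpoints through its weight enums. The snippet below only illustrates fetching such weights with torchvision's public API; it is not MMAction2's actual config-based initialization.

import torchvision.models as tvm
# IMAGENET1K_V2 weights were trained with TorchVision's new recipes ("latest primitives");
# IMAGENET1K_V1 are the original weights.
backbone = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V2)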

b. Repeat Augment

Repeat Augment was originally proposed as a data augmentation method for ImageNet training and has since been used in a series of recent video Transformer works. When a video is decoded during training, we draw multiple random clips from it (usually 2–4) and use each as a training sample. This improves the model’s generalization ability and reduces the I/O pressure of video decoding, as pointed out by Kaiming He et al. [1]. We support Repeat Augment in V1.0 and used this technique when training MViT [2]. The table below compares the performance before and after using Repeat Augment.
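A minimal sketch of the sampling idea (not MMAction2's actual pipeline): decode a video once, then draw several random clips from it, each of which is augmented and used as an independent training sample.

import random
def repeat_augment(frames, num_repeats=2, clip_len=16):
    # frames: a list (or array) of decoded frames from one video.
    # Each decode yields num_repeats clips, so the expensive decode is amortized.
    clips = []
    max_start = max(len(frames) - clip_len, 0)
    for _ in range(num_repeats):
        start = random.randint(0, max_start)
        clips.append(frames[start:start + clip_len])
    return clips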

5. Omni-Source technology: boost video model training with image datasets

a. Motivation of Omni-Source

Omni-Source, originating from [3], improves the generalization ability of video recognition models by training them with additional image datasets. Specifically, the spatial scenes in videos do not vary much, and a medium-sized video dataset (such as Kinetics400) provides relatively little 2D spatial information compared with an image dataset like ImageNet, so models easily overfit to the 2D spatial features of the training set. Simply enlarging the video dataset would increase training costs substantially (video classification is already several times more expensive than image classification), so training on video and image datasets jointly is a promising alternative.

b. Improvement strategy

As shown in the figure below, the original Omni-Source [3] stacks copies of an image into a “static” video and feeds it to the video model, which significantly increases computation. We instead want an image input to cost only as much as a 2D model. Taking SlowOnly ResNet50 as an example: when the input is a video, the model runs as usual; when the input is an image, each 1x3x3 and 1x1x1 convolution is reshaped into its 2D counterpart, and each 3x1x1 convolution is first summed over the temporal axis and then degraded into a 2D convolution. In this way, the 3D SlowOnly ResNet50 is converted into a 2D ResNet50 that can accept image inputs directly, as sketched below.
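A hedged sketch of this conversion in plain PyTorch (MMAction2's actual implementation may differ): a Conv3d weight has shape (out_c, in_c, kT, kH, kW); summing it over the temporal axis yields a Conv2d weight, which for a 1xKhxKw kernel is just a squeeze and for a 3x1x1 kernel implements the "sum over time, then degrade to 2D" rule.

import torch.nn as nn
def conv3d_to_conv2d(conv3d: nn.Conv3d) -> nn.Conv2d:
    w = conv3d.weight.data                      # (out_c, in_c, kT, kH, kW)
    conv2d = nn.Conv2d(conv3d.in_channels, conv3d.out_channels,
                       kernel_size=w.shape[3:],
                       stride=conv3d.stride[1:],
                       padding=conv3d.padding[1:],
                       bias=conv3d.bias is not None)
    conv2d.weight.data.copy_(w.sum(dim=2))      # collapse the temporal kernel dimension
    if conv3d.bias is not None:
        conv2d.bias.data.copy_(conv3d.bias.data)
    return conv2d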

c. Performance Results

Taking the SlowOnly R50 8x8 model as an example, the performance of the three training schemes is compared below. Omni-Source training makes effective use of the additional ImageNet data and improves performance significantly.

6. Deploy spatiotemporal detection models with our exportation tool

We now support exporting spatiotemporal detection models based on Fast R-CNN to the ONNX format. Given the model’s configuration file CONFIG_FILE and its weights CHECKPOINT, the model can be exported to an ONNX inference file OUTPUT_FILE as follows:

python3 tools/deployment/export_onnx_stdet.py \
CONFIG_FILE \
CHECKPOINT \
--output_file OUTPUT_FILE

For example:

python3 tools/deployment/export_onnx_stdet.py \
configs/detection/ava/slowonly_kinetics400-pretrained-r101_8xb16-8x8x1-20e_ava21-rgb.py \
slowonly_omnisource_pretrained_r101_8x8x1_20e_ava_rgb_20201217-16378594.pth \
--output_file slowonly_kinetics400-pretrained-r101_8xb16-8x8x1-20e_ava21-rgb.onnx

We provide an example in the MMAction2 documentation of performing spatiotemporal detection inference with the converted ONNX file:

https://github.com/hukkai/mmaction2/tree/onnx-demo/demo#spatiotemporal-action-detection-onnx-video-demo
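To sanity-check the exported file before running the full demo, onnxruntime can load it and list the expected inputs and outputs (the names and shapes printed are whatever the export produced; nothing is assumed here):

import onnxruntime as ort
# Load the exported spatiotemporal detection model and inspect its I/O signature.
sess = ort.InferenceSession('slowonly_kinetics400-pretrained-r101_8xb16-8x8x1-20e_ava21-rgb.onnx')
for inp in sess.get_inputs():
    print('input :', inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print('output:', out.name, out.shape, out.type)
# sess.run(None, {...}) then performs inference with matching NumPy arrays.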

Inference results based on the converted ONNX model.
Inference results based on PyTorch model.

[1] Masked Autoencoders As Spatiotemporal Learners

[2] Multiscale Vision Transformers

[3] Omni-sourced Webly-supervised Learning for Video Recognition

[4] Skeleton-Based Action Recognition with Multi-Stream Adaptive Graph Convolutional Networks

Come join the OpenMMLab Discord community at https://discord.gg/raweFPmdzG to collaborate and stay up to date with the latest news. We’re excited to welcome developers from all over the world to share their expertise and ideas. And don’t forget to keep an eye on our Medium account for more content down the road. Thank you for your support!
