MMEval: A unified evaluation library for multiple machine learning libraries
Introduction
At the 2022 World Artificial Intelligence Conference (WAIC) in Shanghai, OpenMMLab released the new OpenMMLab 2.0 vision algorithm system based on the next-generation training architecture: MMEngine.
MMEngine provides a powerful and flexible training engine, as well as common training techniques to meet users’ diverse needs for model training. For model evaluation needs, MMEngine also provides Metric and Evaluator modules, which are used by downstream libraries to implement task-specific metrics through inheritance.
OpenMMLab is the most comprehensive open-source algorithm system for computer vision in the era of deep learning, currently covering 30+ research directions, each with its own task-specific evaluation metrics. We want to unify these metrics so that they can serve more users in an easier and more open way. Therefore, we consolidated the evaluation metrics from the original OpenMMLab algorithm libraries and developed a unified, open, cross-framework evaluation library: MMEval.
MMEval GitHub home page: https://github.com/open-mmlab/mmeval
MMEval documentation: https://mmeval.readthedocs.io/en/latest
MMEval introduction
MMEval is a cross-framework evaluation library for machine learning algorithms. It provides efficient and accurate distributed evaluation and supports multiple machine learning frameworks. Its main features are:
- Provides rich evaluation metrics for various computer vision tasks (natural language processing metrics are in progress)
- Supports a variety of distributed communication libraries for efficient and accurate distributed evaluation
- Supports a variety of machine learning frameworks and automatically dispatches the corresponding implementation based on the inputs
The architecture of MMEval is shown in the following figure.
Compared with existing open-source evaluation libraries such as Lightning-AI/metrics, huggingface/evaluate, and the recently released pytorch/torcheval, MMEval differs mainly in two respects: 1) it provides more comprehensive evaluation metrics for computer vision, and 2) it provides cross-framework evaluation.
MMEval currently provides 20+ metrics covering tasks such as classification, object detection, image segmentation, point cloud segmentation, keypoint detection, optical flow estimation, etc. The supported metrics of MMEval can be viewed in the support matrix in the documentation: https://mmeval.readthedocs.io/en/latest/get_started/support_matrix.html
MMEval installation and usage
MMEval requires Python 3.6+ and can be installed via pip:
pip install mmeval
There are two ways to use MMEval's metrics, taking Accuracy as an example:
from mmeval import Accuracy
import numpy as np
accuracy = Accuracy()
The first way is to directly call the instantiated Accuracy object to calculate the metric:
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}
The second way is to calculate the metric after accumulating data from multiple batches.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    accuracy.add(predicts, labels)

accuracy.compute()
# {'top1': ...}
The evaluation metrics in MMEval also support distributed evaluation. For how to use it, see the distributed evaluation tutorial: https://mmeval.readthedocs.io/en/latest/tutorials/dist_evaluation.html
Multiple distributed communication backends
During evaluation, inference is usually run on each GPU over a subset of the dataset in a data-parallel fashion to speed up evaluation. However, the metrics computed on each data subset cannot always be combined into the metric for the whole dataset by simply averaging the per-subset results. A common practice is therefore to save the inference results, or intermediate metric results, obtained on each GPU during distributed evaluation, perform an all-gather operation across all processes, and finally compute the metric over the entire evaluation dataset.
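As a rough illustration of this gather-then-compute pattern (a minimal sketch, not MMEval's internal implementation), the snippet below assumes the per-process intermediate results are hypothetical (num_correct, num_samples) pairs and gathers them with torch.distributed.all_gather_object before computing top-1 accuracy for the whole dataset:

import torch.distributed as dist

def distributed_accuracy(partial_results):
    """Gather per-process intermediate results, then compute top-1 accuracy.

    `partial_results` is a list of (num_correct, num_samples) pairs
    accumulated on the current process (a hypothetical intermediate format).
    """
    if dist.is_available() and dist.is_initialized():
        # One slot per process; all_gather_object handles any picklable object.
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, partial_results)
        # Flatten the per-process lists into one list covering the whole dataset.
        partial_results = [pair for rank_results in gathered for pair in rank_results]
    num_correct = sum(correct for correct, _ in partial_results)
    num_samples = sum(total for _, total in partial_results)
    return {'top1': num_correct / num_samples}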
The distributed communication requirements of MMEval during distributed evaluation mainly include the following:
- All-gather the intermediate metric results saved in each process
- Broadcast the metric results computed by the rank 0 process to all processes
In order to flexibly support multiple distributed communication libraries, MMEval abstracts the above distributed communication requirements and defines a distributed communication interface, BaseDistBackend, whose design is shown in the following figure.
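As a rough approximation of what such an interface looks like (the exact property and method names below are assumptions for illustration, not MMEval's actual definitions), a distributed communication backend can be abstracted as follows:

from abc import ABCMeta, abstractmethod
from typing import Any, List

class BaseDistBackend(metaclass=ABCMeta):
    """Abstraction of a distributed communication backend (approximate sketch)."""

    @property
    @abstractmethod
    def is_initialized(self) -> bool:
        """Whether the distributed environment has been set up."""

    @property
    @abstractmethod
    def rank(self) -> int:
        """The rank of the current process."""

    @property
    @abstractmethod
    def world_size(self) -> int:
        """The total number of processes."""

    @abstractmethod
    def all_gather_object(self, obj: Any) -> List[Any]:
        """Gather a picklable object from every process into a list."""

    @abstractmethod
    def broadcast_object(self, obj: Any, src: int = 0) -> Any:
        """Broadcast a picklable object from the src rank to all processes."""

Implementing a new backend then amounts to providing these few properties and methods on top of a specific communication library.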
A number of distributed communication backends have been implemented in MMEval, as shown in the following table.
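As a usage sketch, a backend could be selected when instantiating a metric; note that the dist_backend argument name and the 'torch_cuda' backend name below are assumptions for illustration, so check the MMEval documentation for the exact options:

from mmeval import Accuracy

# The dist_backend argument and the 'torch_cuda' backend name are assumptions
# for illustration; refer to the MMEval docs for the exact names available.
accuracy = Accuracy(dist_backend='torch_cuda')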
Multiple Machine Learning Framework Support
MMEval aims to support multiple machine learning frameworks. One of the simplest solutions would be to have all evaluation metrics support NumPy, which would cover most evaluation needs, since every machine learning framework has Tensor data types that can be converted to NumPy arrays.
However, this approach can run into problems in some cases:
- Some commonly used operators, such as topk, are not yet implemented in NumPy, which can slow down metric computation.
- It can be time-consuming to move a large number of Tensors from CUDA devices to CPU memory.
- NumPy arrays are not automatically differentiable.
To address these issues, MMEval's metrics provide computation implementations for specific machine learning frameworks. To dispatch among these different implementations, MMEval uses a multiple dispatch mechanism based on type annotations, which dynamically selects the computation method according to the input data types.
A simple example of type-annotation-based multiple dispatch is as follows:
from mmeval.core import dispatch

@dispatch
def compute(x: int, y: int):
    print('this is int')

@dispatch
def compute(x: str, y: str):
    print('this is str')

compute(1, 1)
# this is int
compute('1', '1')
# this is str
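Extending the same idea, the core computation of a metric can be dispatched on framework-specific tensor types. The snippet below is a hedged sketch of this pattern (assuming NumPy and PyTorch are both installed; top1_correct is an illustrative function, not a real MMEval metric):

import numpy as np
import torch
from mmeval.core import dispatch

@dispatch
def top1_correct(preds: np.ndarray, labels: np.ndarray):
    # NumPy implementation: count exact matches on the CPU.
    return int((preds == labels).sum())

@dispatch
def top1_correct(preds: torch.Tensor, labels: torch.Tensor):
    # PyTorch implementation: works on CPU or CUDA tensors without
    # converting to NumPy first.
    return int((preds == labels).sum().item())

top1_correct(np.array([0, 2, 1, 3]), np.array([0, 1, 2, 3]))           # 2
top1_correct(torch.tensor([0, 2, 1, 3]), torch.tensor([0, 1, 2, 3]))   # 2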
Vision
Training and evaluation are two very important phases in the process of experimenting and producing machine learning models.
While MMEngine already provides a flexible and powerful training architecture, MMEval aims to provide a unified and open model evaluation library: unified, so that it meets the model evaluation needs of different tasks across different domains; open, so that it is decoupled from any single machine learning framework and can provide evaluation capabilities to different machine learning framework ecosystems in a more open way.
At present, MMEval is still at an early stage: many evaluation metrics are still being added, and some architecture designs may not yet be mature. In the coming period, MMEval will continue to iterate and improve in the following two directions:
- Continuously add metrics and extend to more tasks such as NLP, speech, recommendation systems, etc.
- Support more machine learning frameworks, and explore new ways to support multiple machine learning frameworks
Although MMEval has been released, a large number of evaluation metrics from the original OpenMMLab algorithm libraries have not yet been migrated to MMEval. We have compiled a list of these missing metrics and released a community task, and we welcome all community partners to join us in building MMEval: https://github.com/open-mmlab/mmeval/issues/50