MMEval: A unified evaluation library for multiple machine learning libraries
Introduction
At the 2022 World Artificial Intelligence Conference (WAIC) in Shanghai, OpenMMLab released the new OpenMMLab 2.0 vision algorithm system based on the next-generation training architecture: MMEngine.
MMEngine provides a powerful and flexible training engine, as well as common training techniques to meet users’ diverse needs for model training. For model evaluation needs, MMEngine also provides Metric and Evaluator modules, which are used by downstream libraries to implement task-specific metrics through inheritance.
OpenMMLab is the most comprehensive open-source algorithm system for computer vision in the era of deep learning, currently covering 30+ research directions, each with its own task-specific evaluation metrics. We want to unify these metrics so that they can serve more users in an easier and more open way. Therefore, we consolidated the evaluation metrics from the original OpenMMLab algorithm libraries and developed a unified, open, cross-framework evaluation library: MMEval.
MMEval GitHub home page: https://github.com/open-mmlab/mmeval
MMEval documentation: https://mmeval.readthedocs.io/en/latest
MMEval introduction
MMEval is a cross-framework evaluation library for machine learning algorithms. It provides efficient and accurate distributed evaluation and supports multiple machine learning frameworks. Its main features are:
- Provides rich evaluation metrics for various computer vision tasks (natural language processing metrics are in progress)
- Supports a variety of distributed communication libraries for efficient and accurate distributed evaluation
- Supports a variety of machine learning frameworks and automatically dispatches the corresponding implementation based on the inputs
The architecture of MMEval is shown in the following figure.
Compared with existing open-source evaluation libraries such as Lightning-AI/metrics, huggingface/evaluate, and the recently released pytorch/torcheval, MMEval differs mainly in two respects: 1) it provides more comprehensive evaluation metrics for computer vision, and 2) it provides cross-framework evaluation.
MMEval currently provides 20+ metrics covering tasks such as classification, object detection, image segmentation, point cloud segmentation, keypoint detection, optical flow estimation, etc. The supported metrics of MMEval can be viewed in the support matrix in the documentation: https://mmeval.readthedocs.io/en/latest/get_started/support_matrix.html
MMEval installation and usage
MMEval requires Python 3.6+ and can be installed via pip:
pip install mmeval
There are two ways to use MMEval's metrics, taking Accuracy as an example:
from mmeval import Accuracy
import numpy as np
accuracy = Accuracy()
The first way is to directly call the instantiated Accuracy object to calculate the metric:
labels = np.asarray([0, 1, 2, 3])
preds = np.asarray([0, 2, 1, 3])
accuracy(preds, labels)
# {'top1': 0.5}
The second way is to calculate the metric after accumulating data from multiple batches.
for i in range(10):
    labels = np.random.randint(0, 4, size=(100, ))
    predicts = np.random.randint(0, 4, size=(100, ))
    accuracy.add(predicts, labels)

accuracy.compute()
# {'top1': ...}
The evaluation metrics in MMEval also support distributed evaluation. For how to use it, see the distributed evaluation tutorial: https://mmeval.readthedocs.io/en/latest/tutorials/dist_evaluation.html
Multiple distributed communication backends
During evaluation, inference is usually run on each GPU over a subset of the dataset in a data-parallel fashion to speed up evaluation. However, the metrics computed on each data subset cannot always be combined into the metric for the whole dataset by simply averaging the per-subset results. A common practice is therefore to save the inference results, or intermediate metric results, obtained on each GPU during distributed evaluation, perform an all-gather operation across all processes, and finally compute the metric over the entire evaluation dataset.
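As a rough illustration of this gather-then-compute pattern (a minimal sketch, not MMEval's internal implementation), the snippet below assumes the per-process intermediate results are hypothetical (num_correct, num_samples) pairs and gathers them with torch.distributed.all_gather_object before computing top-1 accuracy for the whole dataset:

import torch.distributed as dist

def distributed_accuracy(partial_results):
    """Gather per-process intermediate results, then compute top-1 accuracy.

    `partial_results` is a list of (num_correct, num_samples) pairs
    accumulated on the current process (a hypothetical intermediate format).
    """
    if dist.is_available() and dist.is_initialized():
        # One slot per process; all_gather_object handles any picklable object.
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, partial_results)
        # Flatten the per-process lists into one list covering the whole dataset.
        partial_results = [pair for rank_results in gathered for pair in rank_results]
    num_correct = sum(correct for correct, _ in partial_results)
    num_samples = sum(total for _, total in partial_results)
    return {'top1': num_correct / num_samples}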
The distributed communication requirements of MMEval during distributed evaluation mainly include the following:
- All-gather the intermediate metric results saved in each process
- Broadcast the metric results computed by the rank 0 process to all processes
In order to flexibly support multiple distributed communication libraries, MMEval abstracts the above distributed communication requirements and defines a distributed communication interface, BaseDistBackend, whose design is shown in the following figure.
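As a rough approximation of what such an interface looks like (the exact property and method names below are assumptions for illustration, not MMEval's actual definitions), a distributed communication backend can be abstracted as follows:

from abc import ABCMeta, abstractmethod
from typing import Any, List

class BaseDistBackend(metaclass=ABCMeta):
    """Abstraction of a distributed communication backend (approximate sketch)."""

    @property
    @abstractmethod
    def is_initialized(self) -> bool:
        """Whether the distributed environment has been set up."""

    @property
    @abstractmethod
    def rank(self) -> int:
        """The rank of the current process."""

    @property
    @abstractmethod
    def world_size(self) -> int:
        """The total number of processes."""

    @abstractmethod
    def all_gather_object(self, obj: Any) -> List[Any]:
        """Gather a picklable object from every process into a list."""

    @abstractmethod
    def broadcast_object(self, obj: Any, src: int = 0) -> Any:
        """Broadcast a picklable object from the src rank to all processes."""

Implementing a new backend then amounts to providing these few properties and methods on top of a specific communication library.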
A number of distributed communication backends have been implemented in MMEval, as shown in the following table.
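As a usage sketch, a backend could be selected when instantiating a metric; note that the dist_backend argument name and the 'torch_cuda' backend name below are assumptions for illustration, so check the MMEval documentation for the exact options:

from mmeval import Accuracy

# The dist_backend argument and the 'torch_cuda' backend name are assumptions
# for illustration; refer to the MMEval docs for the exact names available.
accuracy = Accuracy(dist_backend='torch_cuda')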
Multiple Machine Learning Framework Support
MMEval aims to support multiple machine learning frameworks. One of the simplest solutions would be to have all evaluation metrics support NumPy, which would cover most evaluation needs, since every machine learning framework has Tensor data types that can be converted to NumPy arrays.
However, this approach can run into problems in some cases:
- Some commonly used operators, such as topk, are not yet implemented in NumPy, which can slow down metric computation.
- It can be time-consuming to move a large number of Tensors from CUDA devices to CPU memory.
- NumPy arrays are not automatically differentiable.
To address these issues, MMEval's metrics provide computation implementations for specific machine learning frameworks. To dispatch among these different implementations, MMEval uses a multiple dispatch mechanism based on type annotations, which dynamically selects the computation method according to the input data types.
A simple example of type-annotation-based multiple dispatch is as follows:
from mmeval.core import dispatch

@dispatch
def compute(x: int, y: int):
    print('this is int')

@dispatch
def compute(x: str, y: str):
    print('this is str')

compute(1, 1)
# this is int
compute('1', '1')
# this is str
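Extending the same idea, the core computation of a metric can be dispatched on framework-specific tensor types. The snippet below is a hedged sketch of this pattern (assuming NumPy and PyTorch are both installed; top1_correct is an illustrative function, not a real MMEval metric):

import numpy as np
import torch
from mmeval.core import dispatch

@dispatch
def top1_correct(preds: np.ndarray, labels: np.ndarray):
    # NumPy implementation: count exact matches on the CPU.
    return int((preds == labels).sum())

@dispatch
def top1_correct(preds: torch.Tensor, labels: torch.Tensor):
    # PyTorch implementation: works on CPU or CUDA tensors without
    # converting to NumPy first.
    return int((preds == labels).sum().item())

top1_correct(np.array([0, 2, 1, 3]), np.array([0, 1, 2, 3]))           # 2
top1_correct(torch.tensor([0, 2, 1, 3]), torch.tensor([0, 1, 2, 3]))   # 2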
Vision
Training and evaluation are two very important phases in the process of experimenting and producing machine learning models.
While MMEngine already provides a flexible and powerful training architecture, MMEval aims to provide a unified and open model evaluation library: unified, so that it meets the model evaluation needs of different tasks across different domains; open, so that it is decoupled from any single machine learning framework and can provide evaluation capabilities to different machine learning framework ecosystems in a more open way.
At present, MMEval is still at an early stage: many evaluation metrics are still being added, and some architecture designs may not yet be mature. In the coming period, MMEval will continue to iterate and improve in the following two directions:
- Continuously add metrics and extend to more tasks such as NLP, speech, recommendation systems, etc.
- Support more machine learning frameworks, and explore new ways to support multiple machine learning frameworks
Although MMEval has been released, a large number of evaluation metrics from the original OpenMMLab algorithm libraries have not yet been migrated to MMEval. We have compiled a list of these missing metrics and released a community task, and we welcome all community partners to join us in building MMEval: https://github.com/open-mmlab/mmeval/issues/50