Benchmarking the multi-modal capability of Bard with MMBench

OpenMMLab
4 min read · Aug 18, 2023


In March 2023, Google launched Bard, a lightweight and optimized version of LaMDA built on the Transformer architecture. Like ChatGPT, Bard is a closed-source model that serves users through a web UI. In July 2023, Google announced an update to Bard that enables it to process image input. To provide an overview of Bard's multi-modal ability, we evaluate it on the test split of MMBench, as described below, and compare it with other state-of-the-art VLMs.

Project: https://opencompass.org.cn/MMBench

Evaluation Setting

The test split of MMBench includes 1,798 questions. During testing, we find that Bard refuses to process images containing human faces. For a fair comparison, we remove the questions that Bard refuses to answer and discard the questions that evaluate the four human-related capabilities (Image Emotion, Identity Reasoning, Social Relation, and Action Recognition) in the test split. After filtering, we obtain a subset of 1,226 samples covering 16 leaf ability dimensions.
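For reference, below is a minimal sketch of this filtering step in Python. The column names (category, index) and the refused_ids set are illustrative assumptions, not the exact fields or code used in MMBench.

```python
import pandas as pd

# Hypothetical names for the four human-related leaf abilities.
HUMAN_RELATED = {
    "image_emotion", "identity_reasoning",
    "social_relation", "action_recognition",
}

def build_bard_subset(test_tsv, refused_ids):
    """Drop human-related capabilities and the questions Bard refused to answer."""
    df = pd.read_csv(test_tsv, sep="\t")
    df = df[~df["category"].isin(HUMAN_RELATED)]  # remove 4 human-related abilities
    df = df[~df["index"].isin(refused_ids)]       # remove questions Bard refused
    return df                                      # ~1,226 samples in our run
```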

Quantitative Results

We compare Bard with two state-of-the-art VLMs that perform well on MMBench, namely Shikra and Otter-I. The results are shown in the figure below. Bard attains an impressive overall accuracy of 51%, positioning itself among the top-tier VLMs proposed to date. Notably, Bard excels at questions that involve common sense reasoning: it achieves 62.3% accuracy on Nature Relation questions and 45.2% accuracy on Physical Relation questions, outperforming its counterparts, e.g., Otter-I and Shikra, by a substantial margin. Meanwhile, Bard's performance is comparatively lower on tasks requiring spatial perception, such as Spatial Relationship and Object Localization. This observation aligns with expectations, considering that Shikra incorporates visual grounding tasks into its training data to enhance its localization capabilities, a facet potentially not integrated into Bard's training process.
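For readers reproducing such comparisons, the per-ability numbers above boil down to a simple grouped accuracy. Here is a minimal sketch, assuming each evaluated record carries a leaf-ability label together with the predicted and ground-truth options (the ability, prediction, and answer keys are assumed names, not the exact schema).

```python
from collections import defaultdict

def accuracy_by_ability(records):
    """records: iterable of dicts with 'ability', 'prediction', 'answer' keys (assumed schema)."""
    hit, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["ability"]] += 1
        hit[r["ability"]] += int(r["prediction"] == r["answer"])
    per_ability = {k: hit[k] / total[k] for k in total}   # accuracy per leaf ability
    overall = sum(hit.values()) / sum(total.values())     # overall accuracy
    return per_ability, overall
```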

Qualitative Results

To complement the quantitative analysis, we also provide some qualitative examples of Bard. Some good cases are shown in the figure below. In the left-hand example, Bard adeptly processes an intricate scene, distills the key information, and arrives at a reasonable conclusion. Notably, the majority of VLMs we tested fail to deliver the correct response to this particular question. In the right-hand example, Bard recognizes the correct concept from the cartoon, sidestepping any potential confusion arising from the harmonious interaction between a snake and a mouse. This highlights Bard's exceptional common sense reasoning ability.

In the next figure, we present illustrative examples that highlight Bard's performance shortcomings. These instances originate from the image style and image quality tasks. The former requires the model to discern image categories, while the latter involves assessing visual attributes, such as brightness, across a pair of images. A shared characteristic of these tasks is that the image content matters little to the task's objective. Bard performs poorly on both tasks, achieving 50% and 7% accuracy, respectively. The accompanying tables in these cases show Bard's tendency to focus excessively on semantic concepts and depicted objects in the provided text and image, leaving it struggling to address questions about holistic styles and attributes.

Bard Provides Well-Structured Responses

Last but not least, across all the aforementioned examples, Bard consistently delivers well-structured responses, frequently using bullet-point lists and tables to enhance clarity. Moreover, for the majority of questions, Bard adheres to a consistent response format: it first presents the predicted option, then offers a detailed rationale, and finally explains why the alternative choices are incorrect. As a chatbot, Bard undeniably stands out as one of the most capable multi-modal chatbots to date.
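Since Bard typically states the chosen option first, a lightweight pattern match is usually enough to extract its prediction before falling back to heavier matching. The snippet below is an illustrative heuristic, not the official MMBench answer-extraction code.

```python
import re

def extract_option(response):
    """Heuristic: pick up patterns like 'The answer is (B)' or a leading '(B)' / 'B.'."""
    text = response.strip()
    # "The answer is B" / "The answer is (C)"
    m = re.search(r"answer is\s*\(?([A-D])\)?", text)
    if m:
        return m.group(1)
    # Response that leads with the option letter, e.g. "B. The image shows ..."
    m = re.match(r"\(?([A-D])[.)\s:]", text)
    if m:
        return m.group(1)
    return None  # fall back to heavier matching, e.g., an LLM-based choice matcher
```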

A Brief Intro of MMBench

MMBench is a multi-modality benchmark released in early July 2023. It includes ~3,000 multiple-choice questions that evaluate over 20 different multi-modal capabilities. Since the benchmark's release, more than 200 submissions have been received, and the leaderboard now covers 15 VLMs. More information:

Project: https://opencompass.org.cn/MMBench

Paper: https://arxiv.org/pdf/2307.06281.pdf

Codebase: https://github.com/InternLM/opencompass
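To give a flavor of how an MMBench-style multiple-choice question can be turned into a prompt for a VLM, here is a minimal sketch; the template and the sample question are illustrative assumptions and may differ from the exact prompt used on the leaderboard.

```python
def format_mc_prompt(question, options, hint=None):
    """Build a multiple-choice prompt; the exact MMBench template may differ."""
    lines = []
    if hint:
        lines.append(f"Hint: {hint}")
    lines.append(f"Question: {question}")
    lines.append("Options:")
    lines.extend(f"{key}. {text}" for key, text in options.items())
    lines.append("Please select the correct answer from the options above.")
    return "\n".join(lines)

# Illustrative usage with a made-up Nature Relation style question.
prompt = format_mc_prompt(
    "Which relation best describes the two animals in the image?",
    {"A": "predation", "B": "mutualism", "C": "competition", "D": "parasitism"},
)
```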

