
M3CoT: A Novel Benchmark for
Multi-Domain Multi-step Multi-modal Chain-of-Thought


Harbin Institute of Technology
Central South University
Shanghai AI Laboratory

Abstract

Multi-modal Chain-of-Thought (MCoT) requires models to leverage knowledge from both textual and visual modalities for step-by-step reasoning, and has gained increasing attention. Nevertheless, current MCoT benchmarks still face several challenges: (1) absence of visual modal reasoning, (2) single-step visual modal reasoning, and (3) domain missing, thereby hindering the development of MCoT. Motivated by this, we introduce a novel benchmark (M3CoT) to address the above challenges, advancing the multi-domain, multi-step, and multi-modal CoT. Additionally, we conduct a thorough evaluation involving abundant MCoT approaches on Vision Large Language Models (VLLMs). Furthermore, we highlight that current VLLMs still struggle to reason correctly on M3CoT, and there remains a large gap between VLLM and human performance on M3CoT, despite their superior results on previous MCoT benchmarks. To our knowledge, we take the first meaningful step toward the multi-domain, multi-step, and multi-modal scenario in MCoT. We hope that M3CoT will serve as a valuable resource, providing a pioneering foundation for multi-domain, multi-step, multi-modal chain-of-thought research.

Motivation

We find that the existing benchmarks exhibit three major drawbacks:

Absence of visual modal reasoning As shown in Figure (a), the model can successfully produce the rationale and answer solely based on the textual context "supports the plant", which cannot truly reflect the multi-modal CoT ability of the model.

Single-step visual modal reasoning As illustrated in Figure (b), the model only requires a single reasoning step over the "feather" object to predict the correct rationale and answer, which does not cover the complex multi-step CoT scenario.

Domain missing Commonsense and mathematics are important domains for evaluating multi-modal CoT, but current benchmarks lack these topics, hindering comprehensive evaluation of multi-modal CoT.



For more details, you can explore the dataset and check the visualizations here: Explore and Visualizations.

Our dataset is distributed under the CC BY-NC-SA (Attribution-NonCommercial-ShareAlike) license. You can download our dataset from M3CoT (Google Drive), or check out our GitHub repository.

💡 The M3CoT dataset is now available at HuggingFace Datasets!
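For convenience, here is a minimal sketch of loading the dataset with the `datasets` library. The repository identifier and the field names shown in the comments are assumptions based on the public release and may differ from the actual schema.

```python
# Minimal sketch: load M3CoT from HuggingFace Datasets.
# The repo id "LightChen2333/M3CoT" and field names are assumptions.
from datasets import load_dataset

dataset = load_dataset("LightChen2333/M3CoT")  # assumed repo id

print(dataset)                 # expected splits: train / validation / test
sample = dataset["train"][0]
print(sample.keys())           # e.g., question, choices, rationale, answer, image (assumed)
```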



Annotation

Specifically, to address the first issue, we directly remove samples whose final answer can be inferred without the need for images.
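As an illustration, below is a minimal sketch of one way such text-only answerability filtering could be automated. The `answer_without_image` function is a hypothetical stand-in for a text-only QA model; this is not the pipeline actually used to build M3CoT.

```python
# Hypothetical sketch of text-only answerability filtering; NOT the authors' actual pipeline.
import random

def answer_without_image(question: str, choices: list[str]) -> str:
    # Stand-in for a real text-only QA model (e.g., an LLM queried without the image).
    return random.choice(choices)

def filter_text_only_answerable(samples: list[dict]) -> list[dict]:
    """Keep only samples whose answer a text-only model fails to recover."""
    kept = []
    for s in samples:
        prediction = answer_without_image(s["question"], s["choices"])
        if prediction != s["answer"]:  # the image appears to be required
            kept.append(s)
    return kept
```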

To tackle the second issue, we manually annotate and select multi-step multi-modal samples. Specifically, we provide expert annotators with the textual context and rationales without images, and ask them to determine whether the multi-step reasoning can be resolved solely from the textual context. Subsequently, we present the images to the experts to ascertain whether multi-step reasoning occurs across the textual and visual modalities.

To solve the third issue, we explore LLM-guided augmentation to synthesize multi-step MCoT data for the commonsense and mathematics domains.



M3CoT is randomly divided into three splits: train, validation, and test, containing 7,973, 1,127, and 2,359 examples, respectively.

Compared to ScienceQA, M3CoT demands more intricate reasoning, with an average length of 293.93, much higher than ScienceQA's 47.66.
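A minimal sketch for reproducing these basic statistics is shown below. It assumes the dataset is loaded as in the earlier example, that each sample stores its reasoning chain under a "rationale" field, and that length is measured by simple whitespace tokenization; the paper's exact counting method may differ.

```python
# Sketch: split sizes and average rationale length (field name and tokenization are assumptions).
from datasets import load_dataset

dataset = load_dataset("LightChen2333/M3CoT")  # assumed repo id

for split in ("train", "validation", "test"):
    print(split, len(dataset[split]))          # expected: 7,973 / 1,127 / 2,359

train = dataset["train"]
avg_len = sum(len(s["rationale"].split()) for s in train) / len(train)
print(f"average rationale length: {avg_len:.2f}")
```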

Evaluation

We evaluate various VLLMs on M3CoT, including Kosmos-2, InstructBLIP, LLaVA-V1.5, CogVLM, Gemini, and GPT-4V. In addition, we explore several prompting strategies. Specifically, we utilize the Direct approach, which submits samples in the format required by each VLLM; CoT, which appends "Let's think step-by-step!"; Desp-CoT, which first prompts for an image description; and CCoT, which elicits a richer scene description in graph format.
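As an illustration, the sketch below shows how these four prompting strategies could be instantiated as templates; the exact wording used in the paper's evaluation may differ from these assumptions.

```python
# Illustrative prompt templates for the four strategies (wording is an assumption).
def build_prompt(strategy: str, question: str, choices: list[str]) -> list[str]:
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    base = f"Question: {question}\nOptions:\n{options}"
    if strategy == "direct":
        # Direct: ask for the answer with no explicit reasoning instruction.
        return [f"{base}\nAnswer:"]
    if strategy == "cot":
        # CoT: standard zero-shot chain-of-thought trigger.
        return [f"{base}\nLet's think step-by-step!"]
    if strategy == "desp-cot":
        # Desp-CoT: first elicit an image description, then reason over it.
        return ["Describe the image in detail.",
                f"{base}\nUsing the description above, let's think step-by-step!"]
    if strategy == "ccot":
        # CCoT: first elicit a scene graph, then reason over it.
        return ["Generate a scene graph for the image, covering objects, "
                "attributes, and relationships.",
                f"{base}\nUsing the scene graph above, let's think step-by-step!"]
    raise ValueError(f"unknown strategy: {strategy}")
```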



As shown in Figure (a), VLLMs have achieved impressive performance in single-step reasoning. However, compared with the single-step MCoT data in ScienceQA, the multi-step MCoT data in M3CoT shows at least a 29.06% performance decrease (Figure (a)). To further understand how model reasoning changes with the number of steps, we calculate accuracy for different numbers of reasoning steps. As shown in Figure (b), as the number of reasoning steps increases, model performance decreases significantly. In Figure (c), the minimal overlap of rationale semantic distributions between the datasets further shows that multi-step MCoT is an out-of-distribution (OOD) problem compared with single-step MCoT. Overall, we attribute the low performance to the multi-step complexity of M3CoT.
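A minimal sketch of the per-step accuracy analysis is shown below; counting steps as sentences in the reference rationale and the field names used are assumptions, not the paper's exact procedure.

```python
# Sketch: accuracy grouped by number of reasoning steps (step counting is an assumption).
from collections import defaultdict

def count_steps(rationale: str) -> int:
    # Approximate reasoning steps by the number of sentences in the rationale.
    return sum(1 for s in rationale.split(".") if s.strip())

def accuracy_by_steps(samples: list[dict], predictions: list[str]) -> dict[int, float]:
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        steps = count_steps(sample["rationale"])
        total[steps] += 1
        correct[steps] += int(pred == sample["answer"])
    return {k: correct[k] / total[k] for k in sorted(total)}
```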



We observe that rationale quality incrementally improves M3CoT performance and markedly impacts accuracy on CoT tasks.



Finetuning on various VLLMs

Finetuning on M3CoT can result in better performance

The table reveals that our benchmark's training set significantly enhances model performance. It enables traditional vision-language models (VLMs) to surpass zero-shot VLLMs, demonstrating the value of our dataset in boosting VLM effectiveness.

Citation


@inproceedings{chen-etal-2024-m3cot,
  title = "M$^3$CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought",
  author = "Chen, Qiguang  and
    Qin, Libo  and
    Zhang, Jin  and
    Chen, Zhi  and
    Xu, Xiao  and
    Che, Wanxiang",
  booktitle = "Proc. of ACL",
  year = "2024",
}
  

Contact

If you have any questions or suggestions, please email Qiguang Chen or open an issue on GitHub.

Acknowledgement

This website is adapted from Nerfies and LLaVA-RLHF, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of Qwen-VL and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.