Leaderboard - M3CoT

Evaluation of different methods on the full test split (2,359 questions). The accuracies for each category and the overall average are reported below.

😀 You are invited to contribute your results on the M3CoT test split! Please send your result scores to this email or open a new issue at the GitHub repository.

⚠️⚠️⚠️ Caveat: The data in the leaderboard is collected manually from existing papers. It may contain errors, ambiguities caused by differing interpretations, or gaps where a paper did not report a number. Please double-check the data before using it, and contact us at this email if you find any errors or have suggestions. We appreciate your contributions and feedback.

| # | Model | Prompt | Setting | #Size | Backbone | Link | Lang | Natural | Social | Physical | SocialCS | Temporal | Algebra | Geometry | Theory | Total |
|---|-------|--------|---------|-------|----------|------|------|---------|--------|----------|----------|----------|---------|----------|--------|-------|

Prompting strategies (a prompt-construction sketch follows this list):

  • Direct: the sample is submitted in the VLLM's required input format and the model is asked to answer directly, without any reasoning prompt
  • CoT: chain-of-thought prompting with "Let's think step-by-step!"
  • Desp-CoT: chain-of-thought prompting preceded by a prompt that first asks the model to describe the image
  • CCoT: chain-of-thought prompting preceded by a richer, scene-graph-style description of the image
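
The sketch below illustrates how these four strategies differ when constructing the text prompt that accompanies an image. It is a minimal, hypothetical example: the wording of the Direct, Desp-CoT, and CCoT instructions is illustrative rather than the benchmark's official templates, and Desp-CoT/CCoT are typically run as two-stage prompts (description first, then reasoning) rather than a single string.

```python
# Hypothetical sketch of the four prompting strategies; the instruction
# wording is illustrative, not the official M3CoT templates.

def build_prompt(question: str, options: list[str], strategy: str = "Direct") -> str:
    """Assemble the text prompt that accompanies the image for a VLLM."""
    base = question + "\nOptions: " + " ".join(
        f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)
    )
    if strategy == "Direct":
        # Ask for the answer letter only, with no reasoning.
        return base + "\nAnswer with the option letter directly."
    if strategy == "CoT":
        # Zero-shot chain-of-thought trigger.
        return base + "\nLet's think step-by-step!"
    if strategy == "Desp-CoT":
        # First elicit an image description, then reason over it.
        return ("Describe the image in detail.\n" + base
                + "\nLet's think step-by-step!")
    if strategy == "CCoT":
        # Elicit a scene-graph-style description before reasoning.
        return ("Generate a scene graph for the image, listing objects, "
                "attributes, and relations.\n" + base
                + "\nLet's think step-by-step!")
    raise ValueError(f"Unknown strategy: {strategy}")

print(build_prompt("What force moves the sled downhill?", ["gravity", "friction"], "CoT"))
```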
Settings:

  • Zero-shot: The model is evaluated in a zero-shot setting on M3CoT
  • Tool-Usage: The model is evaluated in a tool-augmented setting on M3CoT
  • Fine-tuning: The model is fine-tuned on M3CoT
  • -: Not available
  • #Size: Total number of parameters in the model

Accuracies for the different question categories:

  • Lang: questions from the language science domain
  • Natural: questions from the natural science domain
  • Social: questions from the social science domain
  • Physical: physical commonsense questions
  • SocialCS: social commonsense questions
  • Temporal: temporal commonsense questions
  • Algebra: algebra (mathematics) questions
  • Geometry: geometry (mathematics) questions
  • Theory: theory (mathematics) questions
  • Total: all questions (the overall average accuracy; see the aggregation sketch below)
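
For reference, the sketch below shows one way the per-category accuracies and the Total column could be aggregated from per-question results. The record format (a category label and a correct flag per test example) is an assumption for illustration, not the official scoring script.

```python
# Hypothetical aggregation of leaderboard scores; the record format is an
# assumption, not the official M3CoT evaluation code.
from collections import defaultdict

def leaderboard_scores(records):
    """records: iterable of dicts like {"category": "Algebra", "correct": True}."""
    hits, counts = defaultdict(int), defaultdict(int)
    for r in records:
        counts[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    scores = {c: hits[c] / counts[c] for c in counts}
    # "Total" is computed over all questions, per the description above.
    scores["Total"] = sum(hits.values()) / sum(counts.values())
    return scores

example = [
    {"category": "Algebra", "correct": True},
    {"category": "Algebra", "correct": False},
    {"category": "Geometry", "correct": True},
]
print(leaderboard_scores(example))  # {'Algebra': 0.5, 'Geometry': 1.0, 'Total': 0.666...}
```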