M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope
Introduction
We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.
📌 Updates
- [2025.07.14] 🔥 Our Technical Report is now public on arXiv.
- [2025.07.11] 🔥 We release M2-Reasoning on 🤗 Hugging Face and 🤖 ModelScope.
Key Features
- A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates 294.2K high-quality reasoning samples.
- A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals.
- Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
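To make the idea of task-specific rewards more concrete, here is a minimal sketch of what two verifiable rewards could look like. It assumes responses follow the `<answer>...</answer>` and `\boxed{}` format requested by the system prompt in the usage example. The function names, the exact-match rule for multiple-choice items, and the linear decay with a 10% tolerance for numeric items are illustrative assumptions, not the reward formulation from the technical report.

```python
import re
from typing import Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response: str) -> Optional[str]:
    """Return the boxed text inside <answer>...</answer>, or the raw answer text if unboxed."""
    answer = ANSWER_RE.search(response)
    if answer is None:
        return None
    boxed = BOXED_RE.search(answer.group(1))
    text = boxed.group(1) if boxed else answer.group(1)
    return text.strip()


def choice_reward(response: str, target: str) -> float:
    """Hypothetical multiple-choice reward: 1.0 on a case-insensitive exact match, else 0.0."""
    predicted = extract_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.upper() == target.upper() else 0.0


def numeric_reward(response: str, target: float, tolerance: float = 0.1) -> float:
    """Hypothetical numeric reward: decays linearly with relative error, 0.0 beyond the tolerance."""
    predicted = extract_answer(response)
    try:
        value = float(predicted)
    except (TypeError, ValueError):
        return 0.0  # missing or non-numeric answers earn nothing
    relative_error = abs(value - target) / max(abs(target), 1e-8)
    return max(0.0, 1.0 - relative_error / tolerance)
```

A graded numeric reward gives partial credit to near-miss estimates (for example, distances), whereas a binary reward suits discrete choices. Which shape fits which task is a design decision this sketch does not settle.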
Evaluation
We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:
- General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
|---|---|---|---|---|---|---|---|
| Base-Scale General Models | |||||||
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| Base-Scale Reasoning Models | |||||||
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| Our Models | |||||||
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
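The Avg. (Δ) column for our models can be reproduced from the table itself. The sketch below assumes that Avg. is the unweighted mean of the six benchmark scores rounded to one decimal, and that Δ is the gain over the reported average of the Base Model row. Both readings are inferred from the numbers rather than stated in the text.

```python
# Per-benchmark scores copied from the table above, in column order:
# MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista.
SCORES = {
    "M2-Reasoning-CI-7B": [71.7, 29.2, 42.1, 25.0, 42.8, 46.8],
    "M2-Reasoning-7B": [75.0, 31.5, 44.7, 26.8, 41.8, 50.0],
}
BASE_MODEL_AVG = 35.5  # reported Avg. of the "Base Model" row


def average(scores):
    """Unweighted mean, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


def summarize(scores, baseline_avg):
    """Return (average, delta versus the baseline average)."""
    avg = average(scores)
    delta = round(avg - baseline_avg, 1)
    return avg, delta


for name, scores in SCORES.items():
    avg, delta = summarize(scores, BASE_MODEL_AVG)
    print(f"{name}: {avg} ({delta:+.1f})")
```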
- Spatial Reasoning: We assess this capability using two benchmarks: CV-Bench and VSI-Bench.
- CV-Bench:
| Models | Count | Relation | Depth | Distance | Avg. |
|---|---|---|---|---|---|
| Large-Scale Models | |||||
| GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| Base-Scale Models | |||||
| InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| Our Models | |||||
| M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |
- VSI-Bench:
| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Large-Scale Models | |||||||||
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| Base-Scale Models | |||||||||
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
| Our Models | |||||||||
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |
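For readers who want to compare rows programmatically, the following sketch parses a pipe-delimited table like the one above and reports the top-scoring model per column. The helper names are hypothetical, only a subset of rows is embedded to keep the example short, and a first column named `Models` is assumed.

```python
TABLE = """
| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Large-Scale Models | |||||||||
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| Base-Scale Models | |||||||||
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |
"""


def split_row(line):
    """Split one table line into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_table(table):
    """Return one dict per data row, skipping the separator and group-label rows."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---| separator
        cells = split_row(line)
        if len(cells) != len(header):
            continue  # group labels such as "Base-Scale Models" have no scores
        rows.append(dict(zip(header, cells)))
    return rows


def best_per_column(rows, name_key="Models"):
    """Map each score column to (model, score) for the highest numeric entry."""
    best = {}
    for row in rows:
        for column, cell in row.items():
            if column == name_key or cell == "-":
                continue  # "-" marks a score that was not reported
            score = float(cell)
            if column not in best or score > best[column][1]:
                best[column] = (row[name_key], score)
    return best


if __name__ == "__main__":
    for column, (model, score) in best_per_column(parse_table(TABLE)).items():
        print(f"{column}: {model} ({score})")
```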
Model Downloads
You can download the model from both Hugging Face and ModelScope.
If you're in mainland China, we strongly recommend downloading the model from ModelScope.
Example Usage
The basic environment is python=3.10, torch=2.6.0+cu124, and transformers=4.49.0.
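Before running the example, it can help to confirm that the installed versions match the ones listed above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and treating the listed versions as exact targets (rather than minimums) is an assumption.

```python
import sys
from importlib import metadata

# Versions listed in this README, treated here as a known-good reference.
EXPECTED_PACKAGES = {"torch": "2.6.0", "transformers": "4.49.0"}
EXPECTED_PYTHON = (3, 10)


def base_version(version):
    """Drop a local build tag such as '+cu124' so '2.6.0+cu124' compares as '2.6.0'."""
    return version.split("+", 1)[0]


def check_environment():
    """Return a list of mismatch messages; an empty list means everything matches."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"python: found {found}, expected 3.10")
    for package, expected in EXPECTED_PACKAGES.items():
        try:
            found = base_version(metadata.version(package))
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed, expected {expected}")
            continue
        if found != expected:
            problems.append(f"{package}: found {found}, expected {expected}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment matches the README"]:
        print(line)
```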
We provide a small example of how to use this repo.
```python
import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(self,
        model_name_or_path,
        device="cuda",
        max_pixels=None,
        min_pixels=None,
        video_max_pixels=768 * 28 * 28,
        video_min_pixels=128 * 28 * 28,
        generation_config=None
    ):
        super().__init__()
        self.model_name_or_path = model_name_or_path

        self.device = device
        self.device_map = device

        # Fall back to the default video pixel budgets when None is passed explicitly.
        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        # Override the processor's image pixel limits only when they are given.
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }

        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention 2 and switch to eval mode.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)
        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Render the chat messages into a single prompt string.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )

        image_inputs, video_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        # Cast visual inputs to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        # Strip the prompt tokens so that only the newly generated tokens are decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)


'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,
```