MMaDA: Multimodal Large Diffusion Language Models
MMaDA is a new class of multimodal diffusion foundation models designed to perform strongly across three areas: textual reasoning, multimodal understanding, and text-to-image generation. The models and code are released openly on Hugging Face.
Three key innovations set this approach apart:
- Unified diffusion architecture: MMaDA adopts a single diffusion architecture with a shared probabilistic formulation and a modality-agnostic design. This removes the need for modality-specific components and lets different data types be integrated and processed seamlessly; a minimal sketch of such a shared objective appears after this list.
- Mixed long chain-of-thought (CoT) fine-tuning: This strategy curates one CoT format shared across modalities, aligning reasoning traces between the textual and visual domains. It provides a cold start for the final reinforcement learning (RL) stage and improves the model's ability to handle complex problems from the outset; an illustrative rendering of such a format is sketched after this list.
- Unified policy-gradient-based RL algorithm (UniGRPO): UniGRPO is a policy-gradient RL method designed specifically for diffusion foundation models. It unifies post-training with diversified reward modeling, delivering consistent performance gains across both reasoning and generation tasks; a group-relative loss in this spirit is sketched below.
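To make the shared probabilistic formulation concrete, here is a minimal sketch of a mask-based discrete diffusion training step over one unified token sequence (text and image tokens drawn from a shared vocabulary). The function, the loss weighting, and the `MASK_ID` placeholder are illustrative assumptions, not the project's actual implementation.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder id for the special [MASK] token (assumption)

def masked_diffusion_loss(model, tokens, t):
    """One training step of mask-based discrete diffusion on a unified
    sequence of discrete tokens (text and image tokens from one shared
    vocabulary), so a single objective covers every modality.

    tokens: (batch, seq_len) clean token ids
    t:      (batch,) masking ratios in (0, 1], one per example
    """
    # Forward process: independently mask each position with probability t
    mask = torch.rand(tokens.shape, device=tokens.device) < t[:, None]
    noisy = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    # The model predicts the clean tokens at every position
    logits = model(noisy)  # (batch, seq_len, vocab_size)

    # Cross-entropy only on masked positions, reweighted by 1 / t
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    per_example = (ce * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (per_example / t).mean()
```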
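The unified CoT format can likewise be pictured as a single template applied to every task. The delimiter strings and field names below are hypothetical placeholders chosen only to illustrate one reasoning layout shared across modalities; they are not the paper's actual special tokens.

```python
def format_unified_cot(task, prompt, reasoning, result):
    """Render one training example in a single chain-of-thought template
    shared by textual reasoning, multimodal understanding, and
    text-to-image tasks. Delimiters here are illustrative placeholders."""
    return (
        f"<|task|>{task}"
        f"<|prompt|>{prompt}"
        f"<|reasoning|>{reasoning}"
        f"<|result|>{result}"
    )

# A textual-reasoning sample and a text-to-image sample share one layout:
print(format_unified_cot(
    task="text_reasoning",
    prompt="What is 17 * 24?",
    reasoning="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    result="408",
))
print(format_unified_cot(
    task="text_to_image",
    prompt="A red cube on a blue table",
    reasoning="The scene needs a red cube resting centered on a blue tabletop.",
    result="<image tokens>",
))
```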
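UniGRPO itself is not reproduced here, but the flavor of a group-relative policy-gradient update driven by diversified rewards can be sketched as follows. The function, its arguments, and the clipping constant are assumptions; the paper's exact objective, masking schedule, and any KL regularization are omitted.

```python
import torch

def unigrpo_style_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Group-relative policy-gradient surrogate in the spirit of UniGRPO
    (a sketch only; not the paper's exact objective).

    logp_new: (group,) sequence log-likelihoods under the current policy,
              approximated for a diffusion model from its masked-token
              prediction loss
    logp_old: (group,) the same quantity under the policy that sampled
              the responses
    rewards:  (group,) scalars from diversified reward models
              (answer correctness, image-text alignment, preferences, ...)
    """
    # Group-relative advantage: normalize rewards within the sampled group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped, importance-weighted surrogate
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```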
Experimental results position MMaDA-8B as a unified multimodal foundation model with strong generalization. It surpasses LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These results illustrate how effectively MMaDA bridges the gap between pretraining and post-training in unified diffusion architectures.
| Domain | MMaDA-8B's Performance Claim | Compared Against Models |
| --- | --- | --- |
| Textual reasoning | Surpasses | LLaMA-3-7B, Qwen2-7B |
| Multimodal understanding | Outperforms | Show-o, SEED-X |
| Text-to-image generation | Excels over | SDXL, Janus |
MMaDA provides several checkpoints that correspond to different training stages:
- MMaDA-8B-Base: Available after pretraining and instruction tuning, this version offers basic text generation, image generation, image captioning, and thinking abilities. It is open-sourced on Hugging Face and has roughly 8.08B parameters. Its training covers pretraining on ImageNet (Stage 1.1), pretraining on an image-text dataset (Stage 1.2), and text instruction following (Stage 1.3). A loading sketch appears after this list.
- MMaDA-8B-MixCoT (coming soon): This version is fine-tuned with mixed long chain-of-thought (CoT) data and is intended to provide strong textual, multimodal, and image-generation reasoning. Its training comprises Mix-CoT fine-tuning with text reasoning (Stage 2.1) followed by multimodal reasoning (Stage 2.2). It was expected to be released about two weeks after the 2025-05-22 update.
- MMaDA-8B-Max (coming soon): This version is trained with UniGRPO reinforcement learning and is expected to excel at complex reasoning and high-quality visual generation. The UniGRPO RL training stage (Stage 3) is scheduled for release once the code migration to OpenRLHF is complete. It was expected to be released about one month after the 2025-05-22 update.
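As a rough starting point, the open-sourced Base checkpoint can presumably be pulled from the Hub along the following lines. The `AutoModel`/`AutoTokenizer` pattern with `trust_remote_code` is an assumption; the project's own inference scripts may rely on a dedicated model class and preprocessing pipeline instead.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repository id of the open-sourced Base checkpoint; the loading pattern
# below is an assumption, not necessarily what the project's scripts use.
repo_id = "Gen-Verse/MMaDA-8B-Base"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()
```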
For inference, MMaDA supports text generation with a semi-autoregressive sampling scheme, while multimodal generation uses non-autoregressive diffusion denoising; a sampling sketch follows below. Inference scripts are provided for text generation, multimodal generation, and text-to-image generation. The training pipeline spans the stages described above, including pretraining and Mix-CoT fine-tuning.
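The semi-autoregressive idea, producing the answer block by block while each block is filled in through iterative diffusion denoising, can be sketched as follows. Every name here (`semi_autoregressive_sample`, `MASK_ID`, the confidence-based unmasking rule) is an illustrative assumption rather than the repository's actual sampling code.

```python
import torch

MASK_ID = 0  # placeholder id for the special [MASK] token (assumption)

@torch.no_grad()
def semi_autoregressive_sample(model, prompt_ids, gen_len=256,
                               block_len=32, steps_per_block=16):
    """Generate text block by block; inside each block, masked positions
    are filled over several denoising steps, committing the most
    confident predictions first (names and rule are illustrative)."""
    device = prompt_ids.device
    seq = torch.cat(
        [prompt_ids, torch.full((1, gen_len), MASK_ID, device=device)], dim=1
    )
    prompt_len = prompt_ids.shape[1]

    for block_start in range(prompt_len, prompt_len + gen_len, block_len):
        block = slice(block_start, block_start + block_len)
        for step in range(steps_per_block):
            logits = model(seq)                    # (1, seq_len, vocab_size)
            conf, pred = logits.softmax(-1).max(-1)

            masked = seq[:, block] == MASK_ID
            if not masked.any():
                break
            # Ignore positions that are already committed
            conf_block = torch.where(
                masked, conf[:, block], torch.full_like(conf[:, block], -1.0)
            )
            # Commit a fraction of the block each step, most confident first
            k = max(1, int(masked.sum()) // (steps_per_block - step))
            top = conf_block.topk(k, dim=-1).indices
            seq[:, block].scatter_(1, top, pred[:, block].gather(1, top))
    return seq[:, prompt_len:]
```

Multimodal and image generation would instead denoise all target positions jointly, without the block-by-block ordering.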
The research paper introducing MMaDA was submitted to arXiv on May 21, 2025, and is also available via Hugging Face Papers. The code and trained models are open-sourced on Hugging Face and in the Gen-Verse/MMaDA GitHub repository, with Python as the repository's main language, and an online demo is hosted on Hugging Face Spaces. At the time of writing, the GitHub repository has 460 stars and 11 forks. The listed authors are Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang.