After four months at the top, DeepSeek-R1, the leading open-weights large language model, may be overtaken by Alibaba's new model family.

Qwen3: Act Quicker, Think Deeper
Overview
Qwen3 is the newest member of the Qwen family of large language models. The flagship model, Qwen3-235B-A22B, outperforms DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro on coding, math, and general-capability benchmarks. Qwen3-30B-A3B, a small MoE model, outperforms QwQ-32B even though QwQ-32B uses ten times as many activated parameters, and even a small model like Qwen3-4B can rival Qwen2.5-72B-Instruct.
Two MoE models are released with open weights: Qwen3-235B-A22B, a large model with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B, a smaller model with 30 billion total parameters and 3 billion activated parameters.
Six dense models are also open-weighted under the Apache 2.0 license: Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B.
| Models | Layers | Heads (Q / KV) | Tie Embedding | Context Length |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 64 | 64 / 8 | No | 128K |
| Models | Layers | Heads (Q / KV) | # Experts (Total / Activated) | Context Length |
| --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 94 | 64 / 4 | 128 / 8 | 128K |
The post-trained models (such as Qwen3-30B-A3B) and their pre-trained counterparts (such as Qwen3-30B-A3B-Base) are now available on platforms like Hugging Face, ModelScope, and Kaggle. For deployment, the Qwen team recommends frameworks such as SGLang and vLLM; for local use, tools like Ollama, LMStudio, MLX, llama.cpp, and KTransformers are recommended. Whether in development, production, or research settings, these options make it easy to integrate Qwen3 into existing workflows.
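For example, here is a minimal sketch of running Qwen3-30B-A3B locally with the Hugging Face `transformers` library, assuming a version of `transformers` recent enough to include Qwen3 support (the prompt is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# Load the tokenizer and model; device_map="auto" places weights on available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Build a chat prompt from a single user turn.
messages = [{"role": "user", "content": "Briefly explain mixture-of-experts models."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```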
Qwen3's objective is to empower researchers, developers, and organizations worldwide to build innovative solutions with these state-of-the-art models.
Feel free to try out Qwen3 on Qwen Chat Web (chat.qwen.ai) and in the mobile app!
Important Features
Hybrid Thinking Modes
Qwen3 models introduce a hybrid approach to problem solving, supporting two modes:
- Thinking Mode: The model reasons step by step before delivering the final answer. This is ideal for complex problems that require deeper deliberation.
- Non-Thinking Mode: The model responds quickly, almost instantly, which suits simpler queries where speed matters more than depth.
This design gives Qwen3 smooth, scalable performance gains that track directly with the computational reasoning budget. It also lets users configure task-specific budgets, striking a better balance between inference quality and cost efficiency; the sketch below shows how the mode is selected in practice.
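Here is a minimal sketch of selecting the mode at the prompt level, assuming the conventions documented in the Qwen3 model cards: a hard switch via the chat template's `enable_thinking` argument, and soft `/think` and `/no_think` tags that override the mode for a single turn:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")

# Hard switch: enable_thinking sets the default mode for the conversation.
fast_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # non-thinking mode: quick, near-instant answers
)

# Soft switch: with enable_thinking=True, a /no_think (or /think) tag in a
# user turn overrides the mode for that turn in multi-turn conversations.
mixed_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this paragraph. /no_think"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # thinking mode: reason step by step by default
)
```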
Multilingual Support
Qwen3 models support 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, letting users around the world benefit from the models' power.
Improved Agentic Capabilities
The Qwen3 models have been optimized for coding and agentic capabilities, with strengthened support for MCP (Model Context Protocol). Examples of how Qwen3 thinks and interacts with the world are given below.
Comparison to Qwen2.5
Compared to Qwen2.5, the pretraining dataset for Qwen3 has been greatly expanded. At over 36 trillion tokens spanning 119 languages and dialects, Qwen3 uses nearly twice as many tokens as Qwen2.5, which was pretrained on 18 trillion. To build this corpus, Qwen2.5-VL was used to extract text from documents, with Qwen2.5 improving the quality of the extracted content. To increase the amount of math and code data, Qwen2.5-Math and Qwen2.5-Coder generated synthetic data, including textbooks, question-answer pairs, and code snippets.
Qwen3 Pre-training

The pre-training process consists of three stages. In the first stage (S1), the model was pretrained on more than 30 trillion tokens with a context length of 4K tokens, giving it general knowledge and fundamental linguistic skills. In the second stage (S2), the dataset was enriched with more knowledge-intensive data, such as STEM, coding, and reasoning material, and the model was pretrained on an additional 5 trillion tokens. In the final stage, high-quality long-context data was used to extend the context length to 32K tokens, ensuring the model can handle longer inputs effectively.
Thanks to improvements in model architecture, more training data, and more efficient training techniques, the Qwen3 dense base models perform on par with Qwen2.5 base models that have more parameters: Qwen3-1.7B/4B/8B/14B/32B-Base perform similarly to Qwen2.5-3B/7B/14B/32B/72B-Base, respectively. Notably, the Qwen3 dense base models even outperform the larger Qwen2.5 models in domains like STEM, coding, and reasoning. As for the Qwen3-MoE base models, they match the performance of Qwen2.5 dense base models while using just 10% of the active parameters, significantly reducing both training and inference costs.
Post-training
A four-stage training pipeline was used to create the hybrid model, which can both reason step by step and respond quickly. The pipeline consists of long chain-of-thought (CoT) cold start, reasoning-based reinforcement learning (RL), thinking-mode fusion, and general RL.
In the first stage, the models were fine-tuned on diverse long CoT data covering a range of tasks and domains, including coding, mathematics, logical reasoning, and STEM problems, with the aim of teaching the model basic reasoning skills. The second stage focused on scaling up compute for reinforcement learning, using rule-based rewards to improve the model's capacity for exploration and exploitation.
In the third stage, the thinking model was fine-tuned on a combination of long CoT data and commonly used instruction-tuning data to fold in non-thinking capabilities. This data was generated by the enhanced thinking model from the second stage, ensuring a seamless blend of reasoning and quick response. In the fourth stage, reinforcement learning was applied across more than 20 general-domain tasks, including instruction following, format following, and agent capabilities, to strengthen the model's general abilities and correct undesirable behaviors.
Agentic Usage
Qwen3 excels at tool calling. To make full use of its agentic capabilities, Qwen-Agent is recommended: its internal encapsulation of tool-calling templates and tool-calling parsers greatly reduces coding complexity.
Available tools can be specified via an MCP configuration file, through Qwen-Agent's integrated tools, or by integrating additional tools yourself, as the sketch below shows.
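Here is a minimal sketch following the usage pattern from Qwen-Agent's documentation; the endpoint URL, the MCP server entries, and the example query are illustrative placeholders:

```python
from qwen_agent.agents import Assistant

# Point the agent at an OpenAI-compatible endpoint serving Qwen3
# (e.g. a local vLLM or SGLang server; URL and model name are placeholders).
llm_cfg = {
    "model": "Qwen3-30B-A3B",
    "model_server": "http://localhost:8000/v1",
    "api_key": "EMPTY",
}

# Tools: an MCP configuration block plus a Qwen-Agent built-in tool.
tools = [
    {
        "mcpServers": {
            "time": {
                "command": "uvx",
                "args": ["mcp-server-time", "--local-timezone=Asia/Shanghai"],
            },
            "fetch": {
                "command": "uvx",
                "args": ["mcp-server-fetch"],
            },
        }
    },
    "code_interpreter",
]

# Create the agent and stream a response; Qwen-Agent handles the
# tool-calling templates and parsers internally.
bot = Assistant(llm=llm_cfg, function_list=tools)
messages = [{"role": "user", "content": "What time is it in New York right now?"}]
for responses in bot.run(messages=messages):
    pass  # each iteration yields the accumulated response so far
print(responses)
```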