Friday, March 28, 2025

Boost DeepSeek R1 Distill Performance With AMD & NexaQuant

Use NexaQuant on an AMD client to boost the performance and reasoning capabilities of 4-bit DeepSeek R1 Distill models.

[Image: NexaQuant on an AMD client. Image credit: AMD]

Nexa AI today announced NexaQuant versions of two DeepSeek R1 distilled models: DeepSeek R1 Distill Qwen 1.5B and DeepSeek R1 Distill Llama 8B.

Well-known quantization techniques, such as the llama.cpp-based Q4_K_M, let large language models drastically reduce their memory footprint while, as a trade-off, usually incurring only a low perplexity loss for dense models. However, for models (dense or MoE) that rely on Chain of Thought traces, even a moderate perplexity loss can damage reasoning capability. According to Nexa AI, NexaQuants recover this lost reasoning capability (relative to full 16-bit precision) while keeping the 4-bit quantization and its speed advantage.
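As a rough illustration of why 4-bit quantization is attractive in the first place, the back-of-the-envelope sketch below estimates model size at different precisions. The bits-per-weight figure for Q4_K_M is an approximation (some tensors are stored at higher precision), not a measured value:

```python
# Rough memory-footprint estimate for an 8B-parameter model at
# different precisions. Q4_K_M averages roughly ~4.5 bits/weight in
# practice because some tensors are kept at higher precision; these
# numbers are illustrative approximations, not measured file sizes.
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n_params = 8e9                         # DeepSeek R1 Distill Llama 8B
fp16 = model_size_gb(n_params, 16)     # full 16-bit precision, ~16 GB
q4_k_m = model_size_gb(n_params, 4.5)  # ~4.5 GB

print(f"FP16:   {fp16:.1f} GB")
print(f"Q4_K_M: {q4_k_m:.1f} GB ({fp16 / q4_k_m:.1f}x smaller)")
```

This is why a quantization scheme that keeps the 4-bit footprint while recovering accuracy is the interesting part of the claim.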

Compared with their full 16-bit counterparts, the Q4_K_M-quantized DeepSeek R1 Distill models perform marginally worse on LLM benchmarks such as GPQA and AIME24 (with the exception of AIME24 on the Llama 3 8B distill, which performs noticeably worse). One solution would be to switch to a Q6 or Q8 quantization, but that makes the model run more slowly and use more memory.

According to Nexa AI, NexaQuants recover this accuracy loss while maintaining 4-bit quantization, using a patented quantization technique. In principle, this means users can benefit from both speed and accuracy.

DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant

Overview

DeepSeek-R1 has been in the news for being completely open-source while competing with OpenAI's o1 reasoning model. Many users prefer to run it locally to preserve offline access, lower latency, and protect data privacy. However, fitting such a huge model onto personal devices usually requires quantization (e.g., Q4_K_M), which frequently costs accuracy (up to ~22% accuracy loss) and undermines the advantages of a local reasoning model.

NexaQuant resolves this trade-off by reducing the DeepSeek R1 Distilled model to one-fourth of its initial size without sacrificing accuracy. In tests on an HP OmniBook AI PC with an AMD Ryzen AI 9 HX 370 processor, the NexaQuant version decoded at 66.40 tokens per second with a peak RAM usage of only 1,228 MB, compared with 25.28 tokens per second and 3,788 MB of RAM for the unquantized version, while maintaining full-precision model accuracy.
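The relative gains implied by those figures can be computed directly; the numbers below are the ones reported above for the HP OmniBook test system, not new measurements:

```python
# Relative gains computed from the figures reported for the HP OmniBook
# AI PC with an AMD Ryzen AI 9 HX 370 (NexaQuant 4-bit vs. unquantized).
nexaquant_tps, baseline_tps = 66.40, 25.28  # decode tokens/second
nexaquant_ram, baseline_ram = 1228, 3788    # peak RAM usage, MB

speedup = nexaquant_tps / baseline_tps        # ~2.6x faster decoding
ram_reduction = baseline_ram / nexaquant_ram  # ~3.1x lower peak RAM

print(f"Decode speedup: {speedup:.2f}x")
print(f"RAM reduction:  {ram_reduction:.2f}x")
```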

NexaQuant Use Case Illustration

This is a comparison of how NexaQuant 4-bit and a typical Q4_K_M respond to a classic investment banking brainteaser. NexaQuant maintains accuracy while reducing the model file to a quarter of its size.

Prompt: A Typical Brainteaser Question in Investment Banking:

A rectangular 6×8 chocolate bar is composed of tiny 1×1 pieces. You wish to divide it into forty-eight pieces. Each break splits a single piece vertically or horizontally; you cannot break two pieces at once. What is the minimum number of breaks required?

Correct Response: 47
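The answer follows from a simple invariant: every break turns one piece into two, so the piece count rises by exactly one per break. A short sketch of that reasoning:

```python
# Each break splits one piece into two, so the piece count increases by
# exactly one per break. Starting from 1 piece, reaching rows*cols
# pieces therefore takes rows*cols - 1 breaks, regardless of the order
# in which the breaks are made.
def min_breaks(rows: int, cols: int) -> int:
    target_pieces = rows * cols  # number of 1x1 pieces wanted
    return target_pieces - 1     # start with 1 piece, +1 per break

print(min_breaks(6, 8))  # 47
```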

DeepSeek-R1-Distill-Llama-8B-NexaQuant

[Image: DeepSeek-R1-Distill-Llama-8B-NexaQuant. Image credit: Hugging Face]

The overview is identical to that of the DeepSeek-R1-Distill-Qwen-1.5B-NexaQuant above.

NexaQuant Use Case Illustration

This is a comparison of how NexaQuant 4-bit and a typical Q4_K_M respond to a classic investment banking brainteaser. NexaQuant maintains accuracy while reducing the model file to a quarter of its size.

Prompt: A Typical Brainteaser Question in Investment Banking:

A stick is broken into three pieces by selecting two points at random along its length. What is the probability that the three pieces can form a triangle?

Correct Response: 1/4
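The 1/4 answer can be checked empirically: the three pieces form a triangle exactly when no piece is longer than half the stick. A Monte Carlo sketch:

```python
import random

# Monte Carlo check of the triangle probability: break a unit stick at
# two uniform random points; the three pieces form a triangle iff every
# piece is shorter than 1/2 (no piece may exceed the sum of the others).
def triangle_probability(trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = sorted((rng.random(), rng.random()))
        a, b, c = x, y - x, 1 - y
        if a < 0.5 and b < 0.5 and c < 0.5:
            hits += 1
    return hits / trials

print(triangle_probability(200_000))  # ~0.25
```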


How to use a Radeon graphics card or AMD Ryzen processor with NexaQuants

LM Studio is recommended for all of your LLM needs.

  • Install LM Studio by downloading it from lmstudio.ai/ryzenai.
  • Paste the Hugging Face URL of one of the NexaQuant models above into the Discover tab.
  • Wait for the model download to complete.
  • Return to the chat tab, choose the model from the drop-down menu, and make sure "manually choose parameters" is selected.
  • Maximize the GPU offload layers.
  • Load the model and start chatting!

Developers can also use the NexaQuant versions of the DeepSeek R1 Distill models mentioned above in llama.cpp- or GGUF-based applications, where, according to the figures Nexa AI provided, they typically deliver better performance.
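For llama.cpp-based workflows, a NexaQuant GGUF can be loaded like any other GGUF model. A minimal sketch, with assumptions flagged: the file name below is illustrative (check the Hugging Face repository for the actual GGUF file name), and `-ngl 99` assumes the model fits in GPU memory.

```shell
# Run a NexaQuant GGUF with llama.cpp's llama-cli.
# The .gguf file name is illustrative -- check the Hugging Face
# repository for the actual file name before downloading.
# -m  : path to the GGUF model file
# -ngl: number of layers to offload to the GPU (99 = effectively all)
# -p  : the prompt to run
./llama-cli \
  -m DeepSeek-R1-Distill-Llama-8B-NexaQuant.gguf \
  -ngl 99 \
  -p "How many breaks are needed to split a 6x8 chocolate bar into 48 pieces?"
```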

Conclusions

All performance and/or cost-reduction claims are provided by Nexa AI; AMD has not independently tested them. Both cost advantages and performance depend on numerous factors. The results presented here are specific to Nexa AI and may not be typical. GD-181.

GD-97: Links to third-party sites are provided for convenience; AMD is not responsible for their content, and no endorsement is implied.

GD-220e: Ryzen AI is defined as the combination of a dedicated AI engine, the AMD Radeon graphics engine, and Ryzen processor cores that enable AI capabilities. OEM and ISV enablement is required, and certain AI features may not yet be optimized for Ryzen AI processors. Ryzen AI is compatible with:

  • AMD Ryzen 7040 and 8040 Series processors and Ryzen PRO 7040/8040 Series processors, except the Ryzen 5 7540U, 8540U, 7440U, and 8440U;
  • AMD Ryzen AI 300 and Ryzen AI PRO 300 Series processors;
  • all AMD Ryzen 8000G Series desktop processors, except the Ryzen 5 8500G/GE and Ryzen 3 8300G/GE;
  • AMD Ryzen 200 Series and Ryzen PRO 200 Series processors, except the Ryzen 5 220 and Ryzen 3 210;
  • AMD Ryzen AI Max Series and Ryzen AI PRO Max Series processors.

Please confirm feature availability with your system manufacturer before purchase.

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She holds a postgraduate degree in business administration and is an enthusiast of Artificial Intelligence.