AMD Releases AMD-135M, its First Small Language Model
Large language models (LLMs) such as GPT-4 and Llama have attracted considerable interest in the rapidly evolving field of artificial intelligence because of their remarkable capabilities in natural language generation and understanding. Small language models (SLMs), however, are becoming increasingly important in the AI community and offer distinct advantages for specific use cases.
AMD is introducing AMD-135M with speculative decoding, its first small language model. This work reflects AMD's commitment to an open approach to AI, one that promotes more innovative, ethical, and inclusive technological progress and helps ensure that its benefits are more widely shared and its challenges more cooperatively addressed.
AMD-135M Models
AMD-135M is AMD's first small language model.
Part of the Llama family, it comes in two variants: AMD-Llama-135M and AMD-Llama-135M-code. The base model was trained entirely from scratch on AMD Instinct MI250 accelerators using 670B tokens.
- AMD-Llama-135M: Pretrained from scratch on 670B tokens of general data using four MI250 nodes, each with four MI250 accelerators (eight virtual GPU cards with 64 GB of memory each). Pretraining took six full days.
- AMD-Llama-135M-code: Fine-tuned from AMD-Llama-135M on an additional 20B tokens of code data to make it more precise at code generation. The fine-tuning took four full days on four MI250 accelerators.
- Code Dataset: To fine-tune the 135M pretrained model, AMD used the Python portion of the StarCoder dataset. StarCoder comprises 783 GB of code across 86 programming languages, along with GitHub issues, Jupyter notebooks, and GitHub commits, totaling over 250B tokens; only the Python subset was used here (a loading sketch follows below).
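For illustration, the snippet below is a minimal sketch of streaming the Python portion of StarCoder with the Hugging Face datasets library. The repository name "bigcode/starcoderdata", the data_dir value, and the "content" field are assumptions based on how the StarCoder data is commonly published, not AMD's training pipeline; the dataset is gated, so its terms must be accepted on Hugging Face first.

```python
# Minimal sketch: stream the Python subset of the StarCoder dataset.
# Repository name, data_dir, and field names are assumptions; verify against
# the dataset card. Streaming avoids downloading the full corpus.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/starcoderdata",   # assumed dataset repository
    data_dir="python",         # restrict to the Python portion
    split="train",
    streaming=True,
)

# Peek at the code content of a single example.
sample = next(iter(ds))
print(sample["content"][:200])
```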
To enable developers to reproduce the model and to support training of further SLMs and LLMs, AMD has open-sourced the training code, dataset, and model weights.
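As a quick start, the following is a minimal sketch of loading the released weights with Hugging Face transformers. The model ID "amd/AMD-Llama-135M" and the prompt are assumptions and should be checked against the published model card.

```python
# Minimal sketch: load the released AMD-Llama-135M weights and generate text.
# The model ID "amd/AMD-Llama-135M" is assumed; verify against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/AMD-Llama-135M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Small language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```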
Enhancement using Speculative Decoding
Inference in large language models typically uses an autoregressive approach. The primary drawback of this approach is that each forward pass produces only one token, which limits overall inference speed and memory-access efficiency.
Speculative decoding was developed to address this issue. The core idea is to use a small draft model to generate a set of candidate tokens and then verify them with a larger target model. Because each forward pass can now yield several tokens without sacrificing output quality, this approach substantially reduces the amount of memory access required and delivers significant speedups.
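To make the idea concrete, here is a minimal, greedy-only sketch of the draft-then-verify loop. The model IDs, the draft length k, and the prompt are illustrative assumptions; production implementations (including sampling-based acceptance rules and KV-cache reuse) are more involved than this sketch.

```python
# Greedy speculative decoding sketch: the small draft model proposes k tokens,
# and the larger target model verifies them in a single forward pass.
# Model IDs are assumptions; draft and target must share a compatible tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_id = "amd/AMD-Llama-135M-code"     # assumed draft model ID
target_id = "codellama/CodeLlama-7b-hf"  # assumed target model ID
tok = AutoTokenizer.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)
target = AutoModelForCausalLM.from_pretrained(target_id)

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    # 1. Draft model cheaply proposes up to k tokens, one per forward pass.
    proposal = draft.generate(ids, max_new_tokens=k, do_sample=False,
                              pad_token_id=tok.eos_token_id)
    n_drafted = proposal.shape[1] - ids.shape[1]
    # 2. Target model scores the whole proposal in a single forward pass.
    logits = target(proposal).logits
    # Greedy tokens the target would emit at each drafted position.
    verify = logits[0, -n_drafted - 1:-1].argmax(-1)
    drafted = proposal[0, -n_drafted:]
    # 3. Accept the longest prefix on which draft and target agree.
    n_accept = int((verify == drafted).long().cumprod(0).sum())
    accepted = drafted[:n_accept]
    # 4. Take one guaranteed token from the target: the correction at the
    #    first mismatch, or a bonus token if every draft token was accepted.
    next_tok = logits[0, ids.shape[1] + n_accept - 1].argmax(-1, keepdim=True)
    return torch.cat([ids[0], accepted, next_tok]).unsqueeze(0)

ids = tok("def quicksort(arr):", return_tensors="pt").input_ids
for _ in range(8):  # each step may emit several tokens at once
    ids = speculative_step(ids)
print(tok.decode(ids[0], skip_special_tokens=True))
```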
Acceleration of Inference Performance
AMD evaluated inference performance with and without speculative decoding on the Instinct MI250 accelerator for data centers and on the Ryzen AI processor (with NPU) for AI PCs, using AMD-Llama-135M-code as a draft model for CodeLlama-7b. For the specific configurations tested, speculative decoding with AMD-Llama-135M-code as the draft model yielded speedups on the Instinct MI250 accelerator, the Ryzen AI CPU, and the Ryzen AI NPU compared with inference without it. On selected AMD systems, the AMD-135M SLM thus provides an end-to-end workflow that covers both training and inference.
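The snippet below is a rough sketch of how such a with/without comparison could be run using Hugging Face transformers' assisted generation (the assistant_model argument), which implements speculative decoding. The model IDs, prompt, and token budget are illustrative assumptions, not AMD's benchmark setup, and the 7B target model requires substantial memory.

```python
# Rough sketch: compare decoding throughput with and without a draft model,
# using transformers' assisted generation. Model IDs and settings are assumed.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "codellama/CodeLlama-7b-hf"  # assumed target model
draft_id = "amd/AMD-Llama-135M-code"     # assumed draft model
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id).to(device)
draft = AutoModelForCausalLM.from_pretrained(draft_id).to(device)

inputs = tok("def binary_search(arr, x):", return_tensors="pt").to(device)

def tokens_per_second(assistant=None):
    # Time a single greedy generation run; assistant=None is plain decoding.
    start = time.perf_counter()
    out = target.generate(**inputs, max_new_tokens=128, do_sample=False,
                          assistant_model=assistant)
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs.input_ids.shape[1]) / elapsed

print(f"autoregressive: {tokens_per_second():.1f} tok/s")
print(f"speculative   : {tokens_per_second(assistant=draft):.1f} tok/s")
```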
In summary
The AMD-135M SLM establishes a complete workflow, covering both training and inference, on AMD GPU accelerators and Ryzen AI processors. By offering a reference implementation that follows best practices for model construction, pretraining, and deployment on AMD platforms, the model helps developers achieve strong performance both in the data center and on power-constrained edge devices such as AI PCs. AMD is committed to contributing new models to the open-source community and looks forward to the ideas that emerge from it.