Introducing the First AMD Small Language Model (SLM): AMD-Llama-135M and AMD-Llama-135M-code Drive AI Development
Introduction
The rapid advancement of artificial intelligence, particularly large language models (LLMs), has attracted enormous attention. These models, from ChatGPT through GPT-4 and Llama, have demonstrated unprecedented capabilities in natural language generation, processing, and understanding. Small language models, meanwhile, are becoming increasingly important in the AI community because of the particular advantages they offer for specific use cases.
Why Create Your Own SLMs (Small Language Models)?
In the fast-growing field of artificial intelligence, LLMs such as GPT-4 and Llama 3.1 have raised the bar for performance and capability. Although LLMs matter, small language models (SLMs) make a strong case of their own: they offer a practical solution that balances performance against operational constraints.
Furthermore, training LLMs typically requires large clusters of expensive GPUs, and data requirements grow rapidly with model size. Even a well-trained LLM can be difficult to run efficiently on a client device with very limited compute.
Because of these factors, many community developers struggle to balance the computing resources and datasets available to them against the goal of improving model performance and accuracy. SLMs offer an alternative to resource-intensive training and inference, helping to substantially reduce hardware, memory, and power costs.
Innovations in SLM AI Models
AMD is releasing AMD-135M, its first SLM with speculative decoding. AMD-Llama-135M and AMD-Llama-135M-code, the first two small language models in this family, are built on the Llama architecture and were trained entirely on AMD Instinct MI250 accelerators using 690B tokens. The training code, dataset, and weights are released as open source so that developers can reproduce the model and help train further SLMs and LLMs.
Building and Deploying the Model
Training this model involved two main stages: first, the AMD-Llama-135M model was trained from scratch on 670B tokens of general data; then a second training pass on 20B tokens of code data produced AMD-Llama-135M-code. AMD-Llama-135M-code was then used as a draft model for CodeLlama-7b (a pretrained open-source transformer model for code generation created by Meta and available on Hugging Face). In deployment, this yielded an average speedup of roughly 2-3x across all of the AMD hardware platforms examined.
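As a rough illustration of this deployment path, the sketch below pairs a small draft model with a larger target model using the Hugging Face transformers assisted-generation API. The model IDs (amd/AMD-Llama-135M-code, codellama/CodeLlama-7b-hf) are assumptions and may need adjusting to the actual repository names.

```python
# Minimal sketch: speculative decoding with a small draft model assisting a
# larger target model via Hugging Face transformers assisted generation.
# Model IDs are assumptions, not confirmed repository names.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "codellama/CodeLlama-7b-hf"   # large target model (assumed ID)
draft_id = "amd/AMD-Llama-135M-code"      # small draft model (assumed ID)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.float16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "def quicksort(arr):"
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# assistant_model enables speculative decoding: the draft proposes tokens,
# and the target verifies them in a single forward pass.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```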
AMD benchmarked against several open-source models of similar size to demonstrate that AMD-135M performs comparably to well-known models on the market. The results show that AMD-135M reaches state-of-the-art performance for its size on tasks such as Hellaswag, SciQ, and ARC-Easy, outperforming the Llama-68M and Llama-160M models. Additionally, as the chart below illustrates, its performance on Hellaswag, WinoGrande, SciQ, MMLU, and ARC-Easy is comparable to that of GPT2-124M and OPT-125M.
AMD-135M Model Outperforms Publicly Available Small Language Models on Specific Tasks.
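A comparison of this kind can be reproduced with EleutherAI's lm-evaluation-harness. The sketch below assumes the model is published under the Hugging Face ID amd/AMD-Llama-135M and uses the harness's standard task names; it is not the exact evaluation setup AMD used.

```python
# Minimal sketch: evaluating a small model on the benchmark tasks named above
# using lm-evaluation-harness (pip install lm-eval). Model ID is an assumption.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=amd/AMD-Llama-135M",
    tasks=["hellaswag", "sciq", "arc_easy", "winogrande", "mmlu"],
    batch_size=8,
)

# Print the headline metric for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```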
Pretraining
AMD-Llama-135M: The model was trained from scratch on 670B tokens of general data using MI250 accelerators. Pretraining AMD-Llama-135M took six full days on four MI250 nodes, each equipped with four MI250 accelerators (presented as eight virtual GPUs per node, each with 64 GB of memory).
Pretrain Dataset: The SlimPajama and Project Gutenberg datasets were used to pretrain the 135M model; Project Gutenberg alone contains over 70,000 free ebooks. Together, these sources amount to 670 billion tokens.
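As a rough sketch of how a corpus on this scale can be inspected, the snippet below streams SlimPajama from Hugging Face and counts tokens over a small sample, illustrating how a token budget like 670B is measured. The dataset and tokenizer IDs are assumptions, not AMD's exact pipeline.

```python
# Minimal sketch: streaming a pretraining corpus and counting tokens on a
# small sample. Dataset and tokenizer IDs are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amd/AMD-Llama-135M")  # assumed ID
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

token_count = 0
for i, example in enumerate(stream):
    token_count += len(tokenizer(example["text"]).input_ids)
    if i == 999:  # sample only the first 1,000 documents
        break

print(f"Tokens in first 1,000 documents: {token_count:,}")
```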
Code Finetuning
AMD-Llama-135M-code: AMD further finetuned AMD-Llama-135M on an additional 20B tokens of code data to improve its accuracy on code and enable code-focused use. Tuning AMD-Llama-135M-code took four full days on four MI250 accelerators.
Code Dataset: The Python portion of the StarCoder dataset was used to finetune the 135M pretrained model. The StarCoder dataset, which includes GitHub issues, Jupyter notebooks, and GitHub commits, totals over 250B tokens and comprises 783GB of code spanning 86 programming languages. AMD focused specifically on Python and extracted the Python subset, which contains around 20B tokens.
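The snippet below is a minimal sketch of how such a Python subset can be pulled from the publicly hosted StarCoder training data (bigcode/starcoderdata). The data_dir layout and field names are assumptions based on the public dataset, not AMD's exact extraction pipeline.

```python
# Minimal sketch: restricting the StarCoder training data to its Python folder.
# Accessing bigcode/starcoderdata may require accepting its terms on Hugging Face.
from datasets import load_dataset

python_subset = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",   # keep only the Python language folder (assumed layout)
    split="train",
    streaming=True,
)

# Peek at a couple of examples; source code is stored in the "content" field.
for example in python_subset.take(2):
    print(example["content"][:200])
```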
Speculative Decoding
Large language models typically use an autoregressive approach for inference. The main drawback of this approach is that each forward pass produces only one token, which limits overall inference speed and memory access efficiency.
Speculative decoding was developed to address this issue. The core idea is to use a small draft model to generate a set of candidate tokens and then verify them with a larger target model. By producing several tokens per forward pass without sacrificing output quality, this approach dramatically reduces memory access and can deliver substantial speedups.
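The sketch below illustrates the idea with a simplified greedy variant: the draft model proposes k tokens, the target model scores them all in one forward pass, and the longest prefix agreeing with the target's own greedy choices is kept. This is an illustration of the general technique, not AMD's implementation; production code also handles sampling and KV caching.

```python
# Minimal sketch of one greedy speculative-decoding step. Both models are
# Hugging Face causal LMs sharing a tokenizer; model loading is omitted.
import torch

def speculative_step(target, draft, input_ids, k=4):
    # 1) Draft k candidate tokens autoregressively with the small model.
    draft_out = draft.generate(input_ids, max_new_tokens=k, do_sample=False)
    candidates = draft_out[:, input_ids.shape[1]:]

    # 2) Verify all candidates with a single forward pass of the target model.
    logits = target(torch.cat([input_ids, candidates], dim=1)).logits
    # Target's greedy prediction at each candidate position.
    preds = logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

    # 3) Accept candidates until the first disagreement, then substitute the
    #    target's own token at that position.
    accepted = []
    for i in range(k):
        if candidates[0, i].item() == preds[0, i].item():
            accepted.append(candidates[0, i].item())
        else:
            accepted.append(preds[0, i].item())
            break
    else:
        # All k accepted: the target's next-token prediction comes for free.
        accepted.append(logits[0, -1, :].argmax().item())

    new_tokens = torch.tensor([accepted], device=input_ids.device)
    return torch.cat([input_ids, new_tokens], dim=1)
```

On average more than one token is committed per target forward pass, which is where the 2-3x deployment speedup reported above comes from.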