Thursday, November 21, 2024

What Is Fine-Tuning, and Which Methods Deliver the Best AI Performance?


What Is Fine-Tuning?

In machine learning, fine-tuning is the process of adapting a pre-trained model to particular tasks or use cases. It has become a standard deep learning technique, especially for adapting the foundation models that power generative artificial intelligence.

How Does Fine-Tuning Work?

When fine-tuning, a pre-trained model’s weights are used as the starting point for further training on a smaller dataset of examples that more closely match the particular tasks and use cases the model will be applied to. Fine-tuning usually involves supervised learning, but it may also incorporate semi-supervised, self-supervised, or reinforcement learning.


The datasets used to fine-tune the pre-trained model convey the particular domain knowledge, style, tasks, or use cases being targeted. For example:

  • A large language model (LLM) that has already been trained on general language might be refined for coding using a fresh dataset containing pertinent programming queries and sample code for each one.
  • With more labeled training samples, an image classification model that has been trained to recognize certain bird species may be trained to identify new species.
  • By using example texts that reflect a certain writing style, self-supervised learning may teach an LLM how to write in that manner.

When the situation calls for supervised learning but few appropriate labeled examples exist, semi-supervised learning, a type of machine learning that combines labeled and unlabeled data, is beneficial. Semi-supervised fine-tuning has shown promising results for both computer vision and NLP tasks, and it eases the burden of obtaining a sufficient quantity of labeled data.
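To make the basic supervised workflow concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The checkpoint name, dataset, and hyperparameters below are illustrative assumptions, not prescriptions:

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "bert-base-uncased"  # assumed general-purpose checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small labeled dataset standing in for your task-specific examples
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Further training from the pre-trained weights on the smaller dataset
args = TrainingArguments(output_dir="ft-demo", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()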

Fine-Tuning Techniques

Fine-tuning can update the weights of the whole network, but for practical reasons this is not always done. A wide range of alternative techniques update only a subset of the model parameters; these are collectively referred to as parameter-efficient fine-tuning (PEFT). PEFT approaches, discussed later in this section, reduce computing demands and help minimize catastrophic forgetting (the phenomenon where fine-tuning causes the loss or destabilization of the model’s core knowledge), typically without significant sacrifices in performance.

Because there are many fine-tuning techniques, each with many associated variables, achieving optimal model performance usually requires multiple iterations over training strategies and setups, adjusting datasets and hyperparameters such as batch size, learning rate, and regularization terms until a satisfactory outcome is reached on whichever metrics matter most for your use case.


Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning, like pre-training, is computationally intensive. For contemporary deep learning models with hundreds of millions or billions of parameters, it is typically too expensive and impractical.

Parameter-efficient fine-tuning (PEFT) encompasses a range of techniques that decrease the number of trainable parameters needed to adapt a large pre-trained model to downstream applications, greatly reducing the computing and memory resources required. In NLP applications, PEFT approaches also tend to be more stable than full fine-tuning.

Partial fine-tuning

Partial fine-tuning, also known as selective fine-tuning, updates only the pre-trained parameters most important to model performance on downstream tasks in order to reduce computing costs. The remaining parameters are “frozen,” preventing changes.

The most intuitive partial fine-tuning method updates only the neural network’s outer layers. In most model architectures, the inner layers (those closest to the input layer) capture only broad, generic features. For example, in a CNN used for image classification, early layers discern edges and textures, and each subsequent layer captures progressively higher-level features until the final classification is predicted.

The more similar the new task (for which the model is being fine-tuned) is to the original task, the more useful the inner layers’ pre-trained weights will be for it, and the fewer layers need to be updated.

Other partial fine-tuning strategies include updating only the model’s layer-wide bias terms rather than the node weights, and “sparse” fine-tuning, which updates only a chosen subset of the model weights.
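As a rough illustration, the following PyTorch sketch freezes a pre-trained image classifier’s inner layers and replaces only its output head, echoing the bird-species example above; the torchvision model and the class count are assumptions for demonstration:

import torch
from torch import nn, optim
from torchvision import models

num_new_species = 10  # hypothetical number of new bird-species classes

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained CNN
for param in model.parameters():
    param.requires_grad = False  # freeze every pre-trained layer

# Replace the output layer; newly created parameters are trainable by default
model.fc = nn.Linear(model.fc.in_features, num_new_species)

# Optimize only the parameters that remain trainable
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)

# Bias-only variant: update layer bias terms instead of node weights
# for name, param in model.named_parameters():
#     param.requires_grad = name.endswith("bias")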

Additive fine-tuning

Additive approaches add new layers or parameters to a pre-trained model, freeze the existing weights, and train only the added components. This maintains model stability by preserving the pre-trained weights.

This may increase training time, but it decreases GPU memory requirements, since there are fewer gradients and optimizer states to store: Lialin et al. found that training all model parameters requires 12–20 times more GPU memory than the model weights alone. Quantizing the frozen model weights, which reduces the precision of the model parameters in a way akin to lowering an audio file’s bitrate, conserves even more memory.

Additive approaches include prompt tuning. It is comparable to prompt engineering, which involves customizing “hard prompts”, human-written prompts in natural language, to direct the model toward the intended output, for example by specifying a tone or supplying examples for few-shot learning. In prompt tuning, AI-authored “soft prompts”, learnable embedding vectors, are concatenated to the user’s hard prompt. Rather than retraining the model, prompt tuning freezes the model weights and trains the soft prompt instead. Fast and efficient, prompt tuning lets models switch between tasks more readily, though at some cost to interpretability.
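As one possible realization, the Hugging Face peft library exposes prompt tuning directly; in this sketch the base checkpoint and the number of virtual (soft-prompt) tokens are arbitrary choices:

from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # assumed base checkpoint

# Prepend 20 learnable soft-prompt embeddings; the base model stays frozen
config = PromptTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)
peft_model = get_peft_model(base_model, config)

peft_model.print_trainable_parameters()  # only the soft-prompt embeddings train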

Adapters

In another subclass of additive fine-tuning, adapter modules, new task-specific layers added to the neural network, are trained instead of the frozen model weights. The original adapter paper evaluated results on the BERT masked language model and found that adapters matched the performance of full fine-tuning while adding only 3.6 percent as many parameters per task.
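A minimal PyTorch sketch of such a module, under the common bottleneck design of down-projection, nonlinearity, up-projection, and a residual connection (the hidden and bottleneck sizes are illustrative):

import torch
from torch import nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted alongside frozen pre-trained layers."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # up-projection
        nn.init.zeros_(self.up.weight)  # start near-identity so the
        nn.init.zeros_(self.up.bias)    # pre-trained behavior is preserved

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual connection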

Reparameterization

Reparameterization-based approaches such as LoRA use low-rank transformations of high-dimensional matrices (such as a transformer model’s large matrix of pre-trained weights). These low-rank representations capture the low-dimensional structure of the model weights while excluding unimportant higher-dimensional information, drastically lowering the number of trainable parameters. This greatly accelerates fine-tuning and reduces the memory needed for model updates.

Rather than optimizing the matrix of model weights directly, LoRA optimizes a delta weight matrix that is injected into the model. That weight update matrix is represented as two smaller (lower-rank) matrices, lowering the number of parameters to update, speeding up fine-tuning, and reducing the memory needed for model updates. The pre-trained model weights themselves remain frozen.

Since LoRA optimizes and stores the delta between pre-trained weights and fine-tuned weights, task-specific LoRAs can be “swapped in” to adapt the pre-trained model, whose parameters remain unchanged, to a given use case.
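The arithmetic behind the savings is easy to check: a 4096 × 4096 weight matrix holds about 16.8 million parameters, while a rank-8 LoRA update for it trains only 2 × 4096 × 8 ≈ 65,000, roughly 0.4 percent. A minimal PyTorch sketch of a LoRA-wrapped linear layer (the rank and scaling values are illustrative):

import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # pre-trained weights stay frozen
        # Delta W = B @ A, stored as two small matrices instead of one big one
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale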

QLoRA quantizes the transformer model’s weights before applying LoRA, further reducing computational and memory demands.
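Using the Hugging Face transformers and peft libraries, a QLoRA-style setup might look like the following sketch; the checkpoint, target modules, and hyperparameters are assumptions for illustration:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model with 4-bit quantized weights
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m",
                                             quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters to the attention projections
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)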

Common fine-tuning use cases

Fine-tuning may be used to customize a model, to augment it with domain knowledge, or to extend it to new tasks and domains.

  • Customizing style: Models may be tuned to match a brand’s tone, from intricate behavioral patterns and distinctive illustration styles down to simple touches like beginning each exchange with a friendly greeting.
  • Specialization: LLMs may apply their broad language abilities to specialized tasks. Meta’s Llama 2 family, for instance, includes base foundation models, chatbot-tuned variants (Llama-2-chat), and code-tuned variants.
  • Adding domain-specific knowledge: LLMs are pre-trained on vast datasets, but they are not omniscient. In legal, financial, or medical settings, where specialized, esoteric terminology may not have been well represented during pre-training, extra training samples can help ground the base model.
  • Few-shot learning: Models with strong generalist knowledge may be fine-tuned for more specific classification tasks using relatively few examples.
  • Addressing edge cases: Your model may need to handle situations not covered in pre-training. Fine-tuning on annotated examples of such scenarios helps ensure they are handled properly.
  • Incorporating proprietary data: Your organization may have a proprietary data pipeline that is highly relevant to your use case. Fine-tuning lets you build this knowledge into the model without training from scratch.

Fine-Tuning Large Language Models (LLMs)

A crucial step in the LLM development cycle is fine-tuning, which allows the linguistic capabilities of base foundation models to be adapted for a range of applications, from chatbots to coding to other creative and technical domains.

LLMs are pre-trained with self-supervised learning on a vast corpus of unlabeled text. Autoregressive language models, such as OpenAI’s GPT, Google’s Gemini, and Meta’s Llama models, are trained to predict the next word or words in a sequence until it is complete. In pre-training, models are given the beginning of a sample sentence from the training data and asked to predict each successive word until the end of the sample; the actual next word in the original phrase serves as the ground truth for each prediction.
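In code, this pre-training objective amounts to shifting the token sequence by one position and scoring each prediction against the actual next token; a minimal sketch, assuming logits already produced by some autoregressive model:

import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); tokens: (batch, seq_len)."""
    # Positions 0..n-2 predict tokens 1..n-1; the real next word in the
    # original sequence serves as the ground truth for each prediction.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )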

Although this pre-training yields strong text generation abilities, it does not confer a true grasp of the user’s intent. Fundamentally, autoregressive LLMs do not respond to a prompt so much as append text to it. Without explicit direction in the form of prompt engineering, a pre-trained LLM (that has not been fine-tuned) merely predicts, in a grammatically coherent manner, what the next word(s) in the sequence begun by the prompt might be.

Given the prompt “Teach me how to make a resume,” such an LLM might respond with “using Microsoft Word.” That is a valid way to complete the sentence, but it does nothing to further the user’s goal. The model may already hold substantial knowledge about writing resumes thanks to pertinent material in its pre-training corpus, but without fine-tuning this knowledge may never surface.

Fine-tuning foundation models is therefore essential both to make them genuinely fit for real-world applications and to tailor them to your or your company’s distinctive tone and use cases.
