The Complete Guide to Using the Intel Gaudi 2 Accelerator with CLIP
The initial version of the article appeared on roboflow.com.
The CLIP architecture underpins much of modern computer vision. Image and video categorisation, retrieval augmented generation (RAG), image similarity computation, and more can be built on Contrastive Language Image Pretraining embedding models.
OpenAI trained and released multiple public checkpoints on large datasets using its CLIP architecture. Since OpenAI released the first, Apple, Meta AI, and others have trained their own CLIP models. These models are usually trained for general use rather than a specific use case.
Using the Intel Gaudi 2 accelerator with Hugging Face Transformers and the Optimum Habana optimisations, you can train a projection layer for a bespoke CLIP model. This lets the CLIP model encode domain-specific concepts.
Our post shows how to train a CLIP projection layer with the Intel Gaudi 2 accelerator. We will also show how to prepare the model for deployment.
Let’s begin now!
What’s CLIP?
OpenAI created Contrastive Language Image Pretraining (CLIP), a multimodal vision model architecture. CLIP calculates image and text embeddings. CLIP models learn from image-text pairs: using these pairs, an embedding model learns correlations between the contents of an image and its text caption.
Many enterprise applications benefit from CLIP models. For instance, CLIP can help:
- Classify images of parts on an assembly line
- Classify videos in a media archive
- Moderate image content in real time, at scale
- Deduplicate images before large-scale model training
- And more
CLIP models can run at many frames per second, depending on your hardware, and run best on AI-specific hardware like the Intel Gaudi 2 accelerator. In roughly 20 minutes, a single Intel Gaudi 2 accelerator computed 66,211 CLIP vectors. This speed is enough for many real-time applications, and additional accelerators can boost performance further.
The checkpoint released by OpenAI covers many use cases. However, for more specialised use cases, or for use cases that involve enterprise data existing large models have never seen, off-the-shelf CLIP models fall short.
This is where training your own CLIP model helps.
Training a CLIP Model on Intel Gaudi 2 Accelerator
By training your own CLIP model, you can categorise images however you need: for example, to distinguish between components you manufacture, classify product defects, or identify landmarks.
Training a CLIP model requires:
- An image dataset, and
- Detailed captions for the images.
Captions should describe each image adequately in a few sentences.
Hugging Face Transformers supports vision-language model training, including Contrastive Language Image Pretraining. Hugging Face partnered with Intel to improve training and inference on the Intel Gaudi 2 accelerator through the Optimum Habana extensions to Transformers. This guide trains a model using Transformers and the Optimum Habana example scripts.
This article will use the COCO dataset, which has over 100,000 captioned images for training.
Step 1: Download and Configure Dataset
A CLIP-like model requires an image dataset containing captions for each image. These captions should be detailed enough for the model to understand an image.
This guide will use the COCO dataset, which has captions for more than 100,000 images.
The Hugging Face team maintains a script for training a CLIP model on the Intel Gaudi 2 accelerator. The script accepts a dataset in COCO JSON format and uses it to train a CLIP model.
This guide uses the default dataset, but you can use any dataset in COCO JSON format. Visit the Microsoft COCO dataset website to learn more about COCO JSON.
In COCO JSON format, each image in your dataset should have these fields:
["image_id", "caption_id", "caption", "height", "width", "file_name", "coco_url", "image_path", "id"]
You need train, test, and validation splits.
For this guide, download the Contrastive Language Image Pretraining dataset to your Intel Gaudi 2 accelerator machine. Run the following to begin:
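A minimal sketch of one way to download COCO 2017 into a `data` folder, following the layout the Hugging Face contrastive image-text example expects; adjust the paths to your environment:

```bash
mkdir data && cd data
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/test2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
wget http://images.cocodataset.org/annotations/image_info_test2017.zip
cd ..
```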
The model training script requires the Hugging Face Transformers library, which you must install:
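For example, with pip (the datasets library is typically needed by the example script as well):

```bash
pip install transformers datasets
```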
After downloading the dataset, train your CLIP model.
Step 2: Download Optimum Habana Training Script
Next, download the training scripts. You can get them from the Optimum Habana GitHub repository.
Clone the Optimum Habana GitHub repository:
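The repository lives on GitHub under huggingface/optimum-habana:

```bash
git clone https://github.com/huggingface/optimum-habana.git
```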
Go to examples/contrastive-image-text:
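Assuming you cloned into the default folder name:

```bash
cd optimum-habana/examples/contrastive-image-text
```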
Next, install project requirements:
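A sketch of the install step; the example folder ships a requirements.txt, and you will also need the optimum-habana package (exact package names and versions may vary with your SynapseAI release):

```bash
pip install -r requirements.txt
pip install optimum-habana
```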
All of the training scripts used in this guide live in this folder, so stay in it for the rest of the guide.
Step 3: Model Stub Creation
Hugging Face provides pretrained text and vision encoders; only the model's projection layer is trained on your dataset. Create a Python file and add code that downloads the weights and model configuration:
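A minimal sketch based on the VisionTextDualEncoder example in Hugging Face Transformers; the base checkpoints ("openai/clip-vit-base-patch32" for the vision tower, "roberta-base" for the text tower) and the "clip-roberta" output folder are assumptions you can swap out:

```python
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

# Combine a pretrained CLIP vision encoder with a pretrained text encoder.
# Only the projection layers on top of the two towers are newly initialised.
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32", "roberta-base"
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)

# Save the stub so the training script can load it via --model_name_or_path.
model.save_pretrained("clip-roberta")
processor.save_pretrained("clip-roberta")
```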
Step 4: Model Training
You can train your model on one or more HPUs.
Use one of these commands to train your model:
Train with a Single HPU
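A sketch of a single-HPU run, adapted from the Optimum Habana contrastive image-text example; the COCO loader (ydshieh/coco_dataset_script), the hyperparameters, and some flags are assumptions, so check the example's README for the exact invocation for your version:

```bash
python run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name 2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --learning_rate 5e-5 \
    --overwrite_output_dir \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name Habana/clip \
    --throughput_warmup_steps 3 \
    --bf16
```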
Train with Multiple HPUs
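And a corresponding multi-HPU sketch using the gaudi_spawn.py launcher from the repository's examples folder, here assuming 8 HPUs and MPI; the remaining arguments mirror the single-HPU run above:

```bash
python ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name 2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train --do_eval \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --learning_rate 5e-5 \
    --overwrite_output_dir \
    --use_habana \
    --use_lazy_mode \
    --use_hpu_graphs_for_inference \
    --gaudi_config_name Habana/clip \
    --throughput_warmup_steps 3 \
    --bf16
```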
Intel Gaudi 2 Accelerator CLIP Training Benchmark
We tested Contrastive Language Image Pretraining training on the Intel Gaudi 2 accelerator using the COCO dataset on a single HPU, measuring the wall-clock time of the full training job. Because the dataset was downloaded before training, this time includes dataset initialisation but not the download.
Training for three epochs over the COCO dataset took 15 minutes and 1 second.
Remember, this technique does not train a CLIP model from scratch. Instead, it tunes a projection layer on your dataset to help Contrastive Language Image Pretraining learn new concepts from your data.
Deploying CLIP on the Intel Gaudi 2 Accelerator
After training a model, how do you deploy it to production? Hugging Face Transformers includes Intel Gaudi 2 AI acceleration for both training and inference, and the Optimum Habana project accelerates Contrastive Language Image Pretraining with Intel Gaudi 2 accelerator operations to speed up both.
This section discusses how to deploy a CLIP-like model using the model we trained earlier.
Step 1: Create a CLIP Inference Script
Deploying CLIP on an Intel Gaudi 2 accelerator requires an inference script that calculates Contrastive Language Image Pretraining vectors with your model.
This script can be modified for zero-shot image classification, video classification, dataset deduplication, image search, and more. In the Prepare the Model for Production section of this guide, we cover some of these use cases, with links to Roboflow tutorials.
Create a script that calculates CLIP vectors with your model:
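A minimal sketch, assuming the fine-tuned checkpoint was saved to ./clip-roberta-finetuned and that the Habana PyTorch bridge (habana_frameworks) is installed on the machine; the image path and text prompts are placeholders:

```python
import torch
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

# Importing the Habana bridge registers the "hpu" device with PyTorch.
import habana_frameworks.torch.core as htcore  # noqa: F401

device = torch.device("hpu")

# Load the fine-tuned model and processor saved by the training run.
model = VisionTextDualEncoderModel.from_pretrained("./clip-roberta-finetuned")
processor = VisionTextDualEncoderProcessor.from_pretrained("./clip-roberta-finetuned")
model = model.to(device).eval()

image = Image.open("example.jpg")
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image,
    return_tensors="pt",
    padding=True,
).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# CLIP vectors for the image and for each text prompt.
image_embeds = outputs.image_embeds
text_embeds = outputs.text_embeds
print(image_embeds.shape, text_embeds.shape)
```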
Step 2: Prepare the Model for Production
After configuring a model, you can build custom logic around it to solve a business problem.
There are many enterprise applications of Contrastive Language Image Pretraining. For example, you can use CLIP to:
- Deduplicate images in a dataset to prepare it for training. Deduplication prevents duplicate images from appearing in both the train and validation sets after splitting, which can lower training quality. It also prevents your model from training on the same image several times, which is inefficient.
- Build a semantic image search engine. You can search a library of images using text or image queries, which is useful for finding relevant images in large datasets. A news organisation could use semantic image search to find photos related to an article.
- Classify images by assigning labels to them. Consumer-facing applications such as wildlife photo identification, content moderation, and more can use image classification; see the zero-shot classification sketch after this list.
- Check for specific scenes in a video. This is ideal for media indexing, where you want to find the timestamps at which particular content appears on screen. Video classification can also detect NSFW content.
- Retrieve images for RAG-based systems, including systems that connect to large multimodal models (LMMs) such as GPT-4 with Vision.
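As an illustration of the classification use case above, here is a minimal zero-shot classification sketch built on the fine-tuned model saved earlier; the checkpoint path, image file, and candidate labels are assumptions:

```python
import torch
from PIL import Image
from transformers import VisionTextDualEncoderModel, VisionTextDualEncoderProcessor

# Load the fine-tuned checkpoint saved by the training run (assumed path).
model = VisionTextDualEncoderModel.from_pretrained("./clip-roberta-finetuned").eval()
processor = VisionTextDualEncoderProcessor.from_pretrained("./clip-roberta-finetuned")

# Candidate labels written as short captions (placeholders for your own classes).
labels = ["a scratched part", "a dented part", "an undamaged part"]
image = Image.open("part.jpg")
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax converts them
# into a probability distribution over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print(labels[probs.argmax().item()])
```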
The Roboflow team has created several tutorials on enterprise applications of Contrastive Language Image Pretraining with the Intel Gaudi 2 accelerator:
- Build a CLIP Image Search Engine with Intel Gaudi 2 AI Accelerators
- Build CLIP Enterprise Datasets for Multimodal Model Training with Intel Gaudi 2 AI Accelerators
- Multimodal CLIP Video Analysis with Intel Gaudi 2 AI Accelerators
Intel Gaudi 2 Accelerator CLIP Inference Benchmark
To test inference performance, we computed CLIP vectors for 66,211 images using a single Intel Gaudi 2 accelerator. We used the out-of-the-box CLIP model rather than a customised model, whose performance may differ, to show as a baseline how the Transformers model will perform on your system.
In this benchmark, a single Intel Gaudi 2 AI accelerator calculated CLIP vectors for 66,211 images in 20 minutes and 11 seconds using the default Contrastive Language Image Pretraining weights. That works out to roughly 3,280 CLIP vectors per minute, or about 55 per second.
This throughput supports both batch processing and real-time CLIP vector calculation.