Vision Transformers (ViTs) are revolutionising video analytics at the edge.
A Vision Transformer (ViT) is an AI model that applies the transformer architecture to common computer vision tasks such as semantic image segmentation, object detection, and image classification. Since its introduction, the transformer architecture has dominated Natural Language Processing (NLP), most notably in models like GPT that underpin ChatGPT and other chatbots.
Although transformer models are now the industry standard in NLP, their first applications in computer vision (CV) were more limited and often involved combining them with, or replacing parts of, convolutional neural networks (CNNs). ViTs are a noteworthy advance because they demonstrate that a pure transformer applied directly to sequences of image patches can achieve remarkably high performance on image classification tasks.
How Vision Transformers Work
ViTs process images quite differently from conventional CNNs. Rather than treating an input image as a structured grid of pixels and applying convolutional layers, a ViT model represents it as a sequence of fixed-size image patches, comparable to the sequence of word embeddings used when transformers are applied to text.
The general architecture involves the following steps:
- Splitting the image into fixed-size patches.
- Flattening each patch.
- Projecting the flattened patches into lower-dimensional linear embeddings.
- Adding positional embeddings, which let the model learn the relative position of the patches and recover the image’s spatial structure.
- Feeding this sequence of embeddings into a transformer encoder.
- Passing the output of the final transformer block to a classification head, usually a fully connected layer, to classify the image. This head may use an MLP with one hidden layer during pre-training and a single linear layer during fine-tuning (see the sketch after this list).
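To make these steps concrete, here is a minimal PyTorch sketch of a ViT-style classifier. It is illustrative only: the class name SimpleViT, the default dimensions, and the use of torch.nn.TransformerEncoder in place of a hand-written encoder are choices made for this example, not the original implementation.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT-style classifier: patchify -> embed -> encode -> classify."""
    def __init__(self, image_size=224, patch_size=16, embed_dim=768,
                 depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to splitting the image into
        # patches and applying a shared linear projection to each one.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learned positional embeddings: one per patch plus the [CLS] token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)   # pre-LayerNorm, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)   # classification head

    def forward(self, x):                     # x: (B, 3, H, W)
        x = self.patch_embed(x)               # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)      # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])             # classify from the [CLS] token

logits = SimpleViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 1000)
```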
Self-Attention Is the Key Mechanism
Inherited from its NLP roots, the self-attention mechanism is a fundamental part of the ViT architecture. It is essential for capturing contextual information and long-range dependencies in the input, allowing the model to focus on different regions of the input according to how relevant they are to the task at hand.
Self-attention computes a weighted sum of the input, with the weights determined by how similar the input features are to one another. By assigning more weight to relevant features, this weighting helps the model capture more informative representations. It measures the pairwise interactions between entities (image patches) to determine hierarchy and alignment within the data, which also contributes to the robustness of visual networks.
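As a rough illustration of that weighted sum, the sketch below implements plain scaled dot-product self-attention in PyTorch. The function name and the toy projection matrices are assumptions made for this example, not code from any particular ViT library.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of patch embeddings.

    x: (B, N, D) patch embeddings; w_q, w_k, w_v: (D, D) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5  # pairwise patch similarities
    weights = F.softmax(scores, dim=-1)                   # (B, N, N) attention map
    return weights @ v, weights                           # weighted sum of values + weights

# Toy usage: 2 images, 196 patches, 64-dimensional embeddings.
x = torch.randn(2, 196, 64)
w = [torch.randn(64, 64) / 8 for _ in range(3)]
out, attn = self_attention(x, *w)   # out: (2, 196, 64), attn: (2, 196, 196)
```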
The transformer encoder processes these patches through a stack of transformer blocks. Each block typically has two sub-layers: a multi-head self-attention layer and a feed-forward layer (also known as a Multi-Layer Perceptron, or MLP). Multi-head attention extends the self-attention mechanism so the model can attend to several parts of the input sequence at once. To stabilise training, Layer Normalization is usually applied before each sub-layer and residual connections are added around each one.
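A hand-rolled version of one such block might look like the following pre-norm sketch, which assumes PyTorch's nn.MultiheadAttention for the attention sub-layer; the class name EncoderBlock and the default sizes are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One ViT encoder block: pre-LayerNorm, multi-head self-attention,
    an MLP (feed-forward) sub-layer, and residual connections around both."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                   # x: (B, N, D)
        h = self.norm1(x)                                   # LayerNorm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual connection
        x = x + self.mlp(self.norm2(x))                     # second residual around the MLP
        return x

x = torch.randn(2, 197, 768)            # [CLS] token + 196 patch tokens
print(EncoderBlock()(x).shape)          # torch.Size([2, 197, 768])
```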
Thanks to the self-attention layer, ViTs can integrate information globally across the whole image. This is a significant distinction from CNNs, which rely on local connectivity and build up a global understanding hierarchically. The global approach enables ViTs to semantically relate distant details within an image.
Attention Maps:
Attention maps visualise the attention weights computed between each patch and every other patch. They show how important different parts of an image are to the model’s learnt representations. Visualising these maps, often as heatmaps, gives insight into which regions of the image matter most for a particular task.
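One common way to obtain such a heatmap, sketched below under the assumption that a Hugging Face ViT checkpoint is available (google/vit-base-patch16-224 is used here as an example, and example.jpg is a placeholder input), is to average the last layer's attention weights for the [CLS] token over all heads and reshape them to the 14x14 patch grid.

```python
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

name = "google/vit-base-patch16-224"          # example checkpoint
processor = ViTImageProcessor.from_pretrained(name)
model = ViTModel.from_pretrained(name)

image = Image.open("example.jpg")             # placeholder input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Last layer's attention has shape (batch, heads, tokens, tokens). Average over
# heads, take the [CLS] token's attention to the 196 patch tokens, reshape to 14x14.
attn = outputs.attentions[-1].mean(dim=1)[0, 0, 1:].reshape(14, 14).numpy()
plt.imshow(attn, cmap="hot")                  # heatmap of patch importance
plt.show()
```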
Vision Transformers Vs Convolutional Neural Networks (CNNs)
ViTs are frequently compared with CNNs, which have long been the state of the art (SOTA) for a variety of computer vision tasks such as image classification.
Architecture and Processing:
CNNs treat images as structured grids of pixels, using convolutional layers and pooling operations to extract localised features and build a hierarchical global understanding. ViTs, by contrast, treat an image as a sequence of patches processed by self-attention mechanisms, doing away with convolutions entirely.
Attention/Connectivity:
CNNs rely on local connectivity and hierarchical generalisation. ViTs employ self-attention, a global strategy that considers information from the whole image at once, which lets them model long-range dependencies more accurately.
Inductive Bias:
ViTs generally have weaker inductive biases than CNNs. CNNs build in locality and translation invariance by design; ViTs must learn these properties from the data.
Computational Efficiency:
ViT models can be more computationally efficient than CNNs, sometimes requiring significantly fewer pre-training resources. Compared with SOTA CNNs, they can reach similar or higher accuracy using roughly four times fewer computational resources. Their global self-attention strategy also maps well onto GPUs and other parallel processing architectures.
Data Dependency:
Because of their weaker inductive bias, ViTs rely on very large training datasets to reach high performance. When trained on large datasets (more than 14 million images), they can outperform CNNs; trained from scratch on mid-sized datasets such as ImageNet, however, they may still fall short of comparably sized CNN alternatives like ResNet. Training on smaller datasets therefore often calls for data augmentation and model regularisation (AugReg), as sketched below.
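A minimal sketch of such an augmentation-plus-regularisation recipe is shown here, using torchvision's RandAugment and label smoothing; the specific settings are illustrative and not taken from any particular paper.

```python
import torch
from torchvision import transforms

# Illustrative "AugReg"-style ingredients for mid-sized datasets:
# stronger augmentation on the inputs plus regularisation in the loss.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),   # policy-based augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # regularisation
```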
Optimisation:
CNNs are typically simpler to optimise than ViTs.
History and Performance
ViTs achieved state-of-the-art accuracy with improved efficiency, contributing to recent advances in computer vision, and they have demonstrated competitive performance across a range of applications. For instance, the CSWin Transformer, a ViT variant, has outperformed earlier SOTA models such as the Swin Transformer on the ImageNet-1K classification, COCO detection, and ADE20K semantic segmentation benchmarks.
The Google Research Brain Team introduced the Vision Transformer architecture in a paper presented at ICLR 2021. Its creation is part of a timeline of transformer advances that began with the 2017 NLP transformer proposal. Important milestones include DETR (2020), iGPT (2020), the original ViT (2020), its application to further vision tasks (2020), and the many ViT variants that have appeared since 2021, such as DeiT, PVT, TNT, Swin, and CSWin.
Research teams frequently share pre-trained ViT models and fine-tuning code on GitHub, and these models are commonly pre-trained on large datasets such as ImageNet and ImageNet-21k.
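Fine-tuning one of these released checkpoints might look roughly like the following sketch, which assumes the Hugging Face transformers library and uses google/vit-base-patch16-224-in21k as an example ImageNet-21k checkpoint; the batch of random tensors stands in for real, preprocessed images.

```python
import torch
from transformers import ViTForImageClassification

# Example checkpoint pre-trained on ImageNet-21k; swap in any published ViT weights.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,                      # new classification head for the target task
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
pixel_values = torch.randn(4, 3, 224, 224)   # stand-in for a processed image batch
labels = torch.randint(0, 10, (4,))          # stand-in labels

outputs = model(pixel_values=pixel_values, labels=labels)   # loss computed internally
outputs.loss.backward()
optimizer.step()
```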
Use cases and applications
Vision transformers are used across a wide range of computer vision tasks and industries, including:
- Image recognition: image classification, object detection, segmentation, and action recognition.
- Generative modelling and multi-modal tasks: visual grounding, visual question answering, and visual reasoning.
- Video processing: activity recognition and video forecasting.
- Image enhancement: super-resolution and colorization.
- 3D analysis: point cloud classification and segmentation.
Industry applications include healthcare (e.g., medical image diagnosis), smart cities, manufacturing, critical infrastructure, retail (for object recognition), and image captioning to assist blind and visually impaired people. CrossViT, a cross-attention vision transformer for image classification, has proven well suited to medical imaging.
ViTs have the potential to become a general-purpose learning method that works across different types of data. Just as transformers revolutionised NLP, their promise lies in uncovering hidden rules and contextual relationships in data.
Challenges
Despite their potential, ViTs have a number of obstacles to overcome:
Architecture Design:
Designing optimal ViT architectures remains an open question.
Data Dependency & Generalisation:
Because their inductive biases are weaker than CNNs’, ViTs rely heavily on large training datasets, and data quality strongly affects generalisation and robustness.
Robustness:
Several studies demonstrate privacy-preserving and attack-resistant image classification, but robustness remains difficult to generalise.
Interpretability:
Fully understanding why transformers perform so well on visual tasks remains difficult.
Efficiency:
Building transformer models efficient enough for deployment on resource-constrained devices remains challenging.
Performance on Particular Tasks:
In some cases, using a pure ViT backbone directly for tasks such as object detection has not outperformed CNN-based results.
Technical Knowledge & Tools:
Since ViTs are still relatively new, integrating them may call for more technical expertise than with more well-known CNNs. Additionally, the availability of supporting libraries and tools is still developing.
Hyperparameter Tuning:
Research is ongoing to determine how architectural choices and hyperparameter tuning impact accuracy and efficiency in comparison to CNNs.
Because ViTs are still a relatively new technology, research continues into how they work and how best to realise their potential.