What Is NanoVLM? Key Features, Components And Architecture


The NanoVLM project focuses on vision-language models (VLMs) for NVIDIA Jetson devices, specifically the Orin Nano. For these models, the objective is to increase processing speed and decrease memory use in order to reach interactive performance. The documentation covers supported VLM families, benchmarks, and setup requirements, such as compatible Jetson devices and JetPack versions. It also describes a number of use cases, including processing video sequences, live stream analysis, and multimodal chat via web user interfaces or the command line.

What is nanoVLM?

nanoVLM is described as the simplest and fastest repository for training and fine-tuning small Vision-Language Models (VLMs).

Hugging Face created this compact educational framework. Its main objective is to democratize the building of vision-language models by offering a straightforward, PyTorch-based library. Drawing inspiration from projects such as Andrej Karpathy’s nanoGPT, nanoVLM prioritizes readability, modularity, and transparency without compromising practical applicability. The core model definition and training logic of nanoVLM amount to roughly 750 lines of code, plus some extra boilerplate for parameter loading and reporting.

Components and Architecture

Fundamentally, nanoVLM is a modular multimodal architecture consisting of a vision encoder, a modality projection mechanism, and a lightweight language decoder. The vision encoder is SigLIP-B/16, a transformer-based backbone chosen for robust feature extraction from images.

  • This visual backbone converts input images into embeddings that the language model can work with.
  • On the textual side, the model uses SmolLM2, an efficient and readable causal decoder-style transformer.
  • A simple projection layer handles the fusion between vision and language, aligning the image embeddings with the input space of the language model.
  • The integration is transparent, readable, and easy to modify, which makes it well suited for rapid prototyping and educational use.

The working code structure breaks down into the VLM itself (~100 lines), the Language Decoder (~250 lines), the Modality Projection (~50 lines), the Vision Backbone (~150 lines), and a basic training loop (~200 lines).
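To make the modality projection concrete, here is a minimal, self-contained PyTorch sketch of the general idea: a linear layer that maps vision-encoder patch embeddings into the language model's embedding space. The class name, the dimensions, and the composition comments below are illustrative assumptions, not nanoVLM's actual code.

```python
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Illustrative sketch: project vision-encoder features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(image_feats)

# Hypothetical end-to-end flow mirroring the description above:
#   patch_embeds  = vision_encoder(image)             # SigLIP-B/16 backbone
#   visual_tokens = projection(patch_embeds)          # align to the LM input space
#   logits        = language_decoder([visual_tokens; text_embeds])  # SmolLM2 decoder

if __name__ == "__main__":
    proj = ModalityProjection(vision_dim=768, lm_dim=576)  # example widths only, not verified model configs
    dummy = torch.randn(1, 196, 768)                       # one image, 196 patch embeddings
    print(proj(dummy).shape)                               # torch.Size([1, 196, 576])
```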

Size and Performance

Using HuggingFaceTB/SmolLM2-135M and SigLIP-B/16-224 (85M) as backbones yields a 222M-parameter nanoVLM. This released version is available as nanoVLM-222M.

Despite its small size and simplicity, nanoVLM delivers competitive results. Trained for roughly 6 hours on a single H100 GPU with roughly 1.7M samples from the the_cauldron dataset, the 222M model reached 35.3% accuracy on the MMStar benchmark. This performance, achieved with fewer parameters and less compute, is reported to be comparable to larger models such as SmolVLM-256M.

NanoVLM’s efficiency makes it appropriate for resource-constrained environments, including educational institutions or developers working from a single workstation.

Key Features and Philosophy

NanoVLM is a straightforward yet effective platform for getting started with VLMs.

  • It lets users experiment with different configurations and parameters to explore the potential and efficiency of small VLMs.
  • Transparency is a distinguishing characteristic: components are minimally abstracted and clearly defined, which helps users follow the logic and data flow. This makes it well suited for reproducibility research and educational purposes.
  • Because it is modular and forward-compatible, users can swap out the vision encoder, the decoder, or the projection mechanism, which provides a foundation for exploring several lines of inquiry.

Getting Started and Usage

Users can get started after cloning the repository and setting up the environment. Although pip can be used, uv is the recommended package manager. Dependencies include torch, numpy, torchvision, pillow, datasets, huggingface-hub, transformers, and wandb.

nanoVLM includes convenient helpers for loading models from and saving them to the Hugging Face Hub. Pretrained weights can be loaded from a Hub repository such as “lusxvr/nanoVLM-222M” using the VisionLanguageModel.from_pretrained() method.
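As a sketch of what that looks like in practice (the import path below is an assumption based on the repository layout described earlier; check the repo for the exact module name):

```python
import torch
from models.vision_language_model import VisionLanguageModel  # assumed import path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download weights and config from the Hub repository and build the model.
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M").to(device)
model.eval()
```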

Trained models can be pushed to the Hub with model.push_to_hub(), which creates a model card (README.md) and saves the weights (model.safetensors) and configuration (config.json). Repositories are public by default, although they can be made private.

Models can also be saved and loaded locally by passing a local path to save_pretrained() and VisionLanguageModel.from_pretrained().
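A minimal sketch of the local round trip and the Hub push, using the method names described above (the repository id below is a placeholder, and any extra options are assumptions; check the method signatures in the repo):

```python
# Save to a local directory (writes config.json and model.safetensors).
model.save_pretrained("checkpoints/my-nanovlm")

# Reload from that local path.
model = VisionLanguageModel.from_pretrained("checkpoints/my-nanovlm")

# Push to the Hub; this also generates a model card (README.md).
# "your-username/my-nanovlm" is a placeholder repo id.
model.push_to_hub("your-username/my-nanovlm")
```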

A generate.py script is supplied for testing a trained model. The example shows how to supply an image and the query “What is this?” to obtain a result describing a cat.
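For illustration, the flow generate.py implements looks roughly like the sketch below. The helper names (get_tokenizer, get_image_processor), the config attributes, and the generate() signature are assumptions made for readability, not verified API; the supplied script is the authoritative reference.

```python
import torch
from PIL import Image
from models.vision_language_model import VisionLanguageModel    # assumed import path
from data.processors import get_tokenizer, get_image_processor  # hypothetical helpers

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M").to(device).eval()

tokenizer = get_tokenizer(model.cfg.lm_tokenizer)                # assumed config attributes
image_processor = get_image_processor(model.cfg.vit_img_size)

image = image_processor(Image.open("cat.jpg").convert("RGB")).unsqueeze(0).to(device)
prompt_ids = tokenizer("What is this?", return_tensors="pt").input_ids.to(device)

with torch.no_grad():
    out_ids = model.generate(prompt_ids, image, max_new_tokens=30)    # assumed signature
print(tokenizer.batch_decode(out_ids, skip_special_tokens=True)[0])   # expected: a description of a cat
```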

NanoVLM is also listed in the Models section of the NVIDIA Jetson AI Lab, although that content concentrates on using the NanoLLM library to optimize various VLMs (such as LLaVA, VILA, and Obsidian) for Jetson devices. This suggests that the optimization techniques for small VLMs explored in nanoVLM can benefit Jetson as well as other platforms.

Training

nanoVLM can be trained with the provided train.py script, which uses the defaults in models/config.py. Training typically includes logging to WANDB.

VRAM Requirements

Understanding VRAM requirements is important when planning training.

  • Benchmarks of the default 222M model on a single NVIDIA H100 GPU show that peak VRAM use grows with batch size.
  • After the model has been loaded, the allocated VRAM is roughly 870.53 MB.
  • Peak VRAM during training ranges from approximately 4.5 GB at batch size 1 to about 65 GB at batch size 256.
  • Training with a batch size of 512 caused an out-of-memory (OOM) error, peaking at about 80 GB before failing.
  • The main conclusions are that training requires at least ~4.5 GB of VRAM, and that training with a batch size of 16 requires about 8 GB.
  • Changes to sequence lengths or model architecture lead to different VRAM requirements.
  • A measure_vram.py script is supplied to check VRAM requirements on a particular system and setup (a generic illustration of this kind of measurement follows below).
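The supplied script is the supported way to measure this; purely as a generic illustration (not the contents of measure_vram.py), peak VRAM for a single training step can be checked with standard PyTorch utilities:

```python
import torch

def peak_vram_gb_for_step(model, batch, compute_loss, optimizer, device="cuda"):
    """Run one training step and return the peak allocated VRAM in GB.

    Generic sketch: `compute_loss(model, batch)` is a hypothetical callable that
    runs the forward pass and returns a scalar loss for the model in question.
    """
    torch.cuda.reset_peak_memory_stats(device)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 1024**3
```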

Contributions and the Community

Contributions to nanoVLM are welcome.

Contribution guidelines stress keeping the implementation in pure PyTorch; contributions that introduce dependencies such as transformers, DeepSpeed, Trainer, or Accelerate will not be accepted. The first step for a new feature idea is to open an issue for discussion, while bug fixes can be submitted as pull requests.

Areas flagged for future work on the roadmap include data packing, multi-GPU training, multi-image support, image splitting, and integration with VLMEvalKit. Because the project is integrated into the Hugging Face ecosystem, it can be used alongside Transformers, Datasets, and Inference Endpoints.

In summary

In short, nanoVLM is a Hugging Face project that offers a straightforward, readable, and modular PyTorch framework for building and experimenting with small VLMs. It is intended for efficient use and educational purposes, with clear paths for training, generation, and integration into the Hugging Face ecosystem.