Friday, March 28, 2025

Llama 3.2 11B Vision Instruct: Future Of AI Vision Models

Llama 3.2 11B Vision Instruct

The Llama 3.2-Vision collection of multimodal large language models (MLLMs) includes 11B and 90B pretrained and instruction-tuned image reasoning generative models. The instruction-tuned Llama 3.2 11B Vision models excel at visual recognition, image reasoning, captioning, and answering general questions about an image. On common industry benchmarks, the models outperform many available open source and closed multimodal models.

Llama-3.2-11B-Vision-Instruct architecture

Llama 3.2-Vision is built on the text-only Llama 3.1, an auto-regressive language model with an optimized transformer architecture. The instruction-tuned versions are aligned with human preferences for helpfulness and safety using supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). To support image recognition tasks, Llama 3.2-Vision adds a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter feeds image encoder representations into the core LLM through cross-attention layers.
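To make the cross-attention adapter idea concrete, here is a minimal, illustrative PyTorch sketch. The module name, hidden sizes, token counts, and gating scheme are hypothetical choices for illustration only and do not reproduce Meta's actual implementation.

```python
# Illustrative sketch (not Meta's code) of injecting image-encoder features
# into a text model's hidden states via a gated cross-attention adapter.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    def __init__(self, text_dim: int = 4096, vision_dim: int = 1280, n_heads: int = 32):
        super().__init__()
        # Project image-encoder outputs into the language model's hidden size.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        # Text hidden states (queries) attend over the projected image tokens.
        self.cross_attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)
        # Gate initialized near zero so the adapter starts as a no-op and the
        # pretrained text model's behaviour is preserved early in training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        img = self.vision_proj(image_features)                  # (B, num_img_tokens, text_dim)
        attn_out, _ = self.cross_attn(self.norm(text_hidden), img, img)
        return text_hidden + torch.tanh(self.gate) * attn_out   # gated residual connection

# Example: image features (hypothetical token count) conditioning 16 text positions.
adapter = CrossAttentionAdapter()
text_hidden = torch.randn(1, 16, 4096)
image_features = torch.randn(1, 1601, 1280)
out = adapter(text_hidden, image_features)  # shape (1, 16, 4096)
```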


| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Data volume | Knowledge cutoff |
| Llama 3.2-Vision | (Image, text) pairs | 11B (10.6) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |
| Llama 3.2-Vision | (Image, text) pairs | 90B (88.8) | Text + Image | Text | 128k | Yes | 6B (image, text) pairs | December 2023 |

Llama 3.2 11B Vision Instruct Supported Languages

  • For text-only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 was trained on a broader collection of languages than these eight. Please note that image + text applications support English only.
  • Developers may fine-tune Llama 3.2 models for languages beyond these, provided they comply with the Community License and Acceptable Use Policy. Developers remain responsible for deploying safely and responsibly, including in any additional languages.
  • Llama 3.2 Model Family: token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.

Intended Use

Llama 3.2-Vision is intended for commercial and research use. The instruction-tuned models are intended for visual recognition, image reasoning, captioning, and assistant-like chat about images, whereas the pretrained models can be adapted for a variety of other image reasoning tasks. Because Llama 3.2-Vision can take both images and text as input, additional use cases include:

  • Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture, understands your question about it, and answers.
  • Document Visual Question Answering (DocVQA): Imagine a computer that understands both the text and the layout of a document, such as a map or contract, and answers questions about it directly from the image.
  • Image Captioning: Image captioning extracts the details, understands the scene, and writes a sentence or two that tells the story.
  • Image-Text Retrieval: Image-text retrieval matches images with descriptions of them, like a search engine, but one that understands both pictures and words.
  • Visual Grounding: Visual grounding links what we see to what we say. By understanding how language references specific parts of an image, AI models can pinpoint objects or regions from natural language descriptions.

How to use Llama-3.2-11B-Vision-Instruct

The model repository provides Llama 3.2 11B Vision Instruct in two formats: one for use with the Transformers library and one for the original llama codebase.

Use with transformers

Starting with transformers >= 4.45.0, you can run inference using conversational messages that may include an image.

Make sure to update your transformers installation via pip install --upgrade transformers.
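The snippet below is a minimal sketch of that flow using the Transformers multimodal chat-template API. The image URL and the prompt are placeholders, so substitute your own; access to the gated model on Hugging Face is assumed.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in bfloat16 and shard it across the available devices.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image URL; replace with your own image.
url = "https://example.com/sample.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Conversational message that mixes an image with a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```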

Hardware and Software

Pretraining used custom training libraries, Meta's custom-built GPU cluster, and production infrastructure. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.

The table below shows that training used 2.02M GPU hours on H100-80GB (TDP 700W) hardware. Power consumption is the peak power capacity per GPU device, adjusted for power usage efficiency, and training time is the total GPU time needed to train each model.

The estimated location-based greenhouse gas emissions for training were 584 tonnes CO2eq. Meta has had net zero greenhouse gas emissions worldwide since 2020 and matched 100% of its power use with renewable energy, thus its market-based greenhouse gas emissions for training were 0 tonnes CO2eq.

| | Training Time (GPU hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |
| Llama 3.2-Vision 11B Instruct | Stage 1 pretraining: 147K H100 hours; Stage 2 annealing: 98K H100 hours; SFT: 896 H100 hours; RLHF: 224 H100 hours | 700 | 71 | 0 |
| Llama 3.2-Vision 90B Instruct | Stage 1 pretraining: 885K H100 hours; Stage 2 annealing: 885K H100 hours; SFT: 3072 H100 hours; RLHF: 2048 H100 hours | 700 | 513 | 0 |
| Total | 2.02M | | 584 | 0 |
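As a rough sanity check on these figures, the sketch below converts the reported GPU hours and per-device TDP into an energy estimate and the grid carbon intensity that would reconcile it with the reported 584 tonnes CO2eq. The PUE value is an assumed placeholder, not a figure stated in this article.

```python
# Back-of-envelope check of the reported training-energy figures.
GPU_HOURS = 2.02e6           # total H100-80GB GPU hours (from the table above)
TDP_KW = 0.700               # peak power per GPU: 700 W
ASSUMED_PUE = 1.1            # hypothetical power-usage-effectiveness factor
REPORTED_TONNES_CO2EQ = 584  # location-based emissions reported for training

energy_mwh = GPU_HOURS * TDP_KW * ASSUMED_PUE / 1000
implied_intensity = REPORTED_TONNES_CO2EQ * 1000 / (energy_mwh * 1000)  # kg CO2eq per kWh

print(f"Estimated training energy: {energy_mwh:,.0f} MWh")
print(f"Implied grid carbon intensity: {implied_intensity:.2f} kg CO2eq/kWh")
```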

Training Data

In brief, Llama 3.2-Vision was pretrained on 6B (image, text) pairs. The instruction tuning data combines public vision instruction datasets with over 3M synthetically generated examples.

Llama 3.2 Instruct

The main objectives of safety fine-tuning are to reduce the workload on developers deploying safe AI systems by giving them a readily available, safe, and capable model suited to a variety of applications, and to give the research community a useful resource for studying the robustness of safety fine-tuning. You can read more about the safety mitigations implemented in Llama 3 in the Llama 3 paper.

Fine-Tuning Data: To mitigate potential safety risks, Meta employs a multi-faceted approach to data collection, combining human-generated data from vendors with synthetic data. To improve data quality control, researchers developed several large language model (LLM)-based classifiers that allow careful selection of high-quality prompts and responses.

Refusals and Tone: Building on the work begun with Llama 3, strong emphasis was placed on model refusals to benign prompts as well as on refusal tone. The safety data strategy included both adversarial and borderline prompts, and safety data responses were adjusted to follow tone guidelines.

Llama 3.2 Systems

Safety as a System: Llama 3.2 and other large language models are not intended to be used in isolation; rather, they should be deployed as part of a larger AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems.

Safeguards are essential for achieving the right helpfulness-safety alignment, for mitigating the security and safety risks inherent in the system, and for integrating the model or system with external tools. As part of the responsible release approach, Meta provides the community with safeguards such as Llama Guard, Prompt Guard, and Code Shield, which developers should deploy together with Llama models or other LLMs.
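The system-level pattern is easy to sketch. In the example below, generate and moderate are hypothetical callables standing in for the Llama 3.2 model and a safeguard classifier such as Llama Guard; the sketch only illustrates the input/output screening flow, not any specific safeguard API.

```python
# Minimal sketch of "safety as a system": wrap the generation model with
# input and output checks supplied by a separate safeguard classifier.
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."

def guarded_chat(
    user_message: str,
    generate: Callable[[str], str],   # e.g. a call into the Llama 3.2 model
    moderate: Callable[[str], bool],  # hypothetical: returns True when content is unsafe
) -> str:
    # 1. Screen the incoming prompt before it reaches the model.
    if moderate(user_message):
        return REFUSAL
    # 2. Generate a candidate response with the underlying LLM.
    response = generate(user_message)
    # 3. Screen the model output before returning it to the user.
    if moderate(response):
        return REFUSAL
    return response
```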

New Use Cases and Capabilities

Technological Advancement: Llama releases typically introduce new capabilities that require specific considerations in addition to the best practices that apply to all generative AI use cases.

Image Reasoning: The multimodal (text and image) input capabilities of the Llama 3.2-Vision models enable image reasoning applications. As part of the responsible release process, Meta took dedicated measures, including evaluations and mitigations, to address the risk of the models being able to uniquely identify people in images.

As with other LLM risks, models may not always be robust to adversarial prompts. To identify and reduce these risks, developers should evaluate identification and other applicable risks in the context of their applications, and should consider integrating Llama Guard 3-11B-Vision into their system or applying other mitigations as appropriate.

Evaluations

Scaled Evaluations: Researchers built dedicated adversarial evaluation datasets and evaluated systems composed of Llama models and Purple Llama safeguards that filter input prompts and output responses. Because evaluating in the context of the application is crucial, it is advisable to build a dedicated evaluation dataset for your use case.

Red teaming: Meta conducted recurring red teaming exercises to discover risks through adversarial prompting and used the lessons learned to improve its benchmarks and safety tuning datasets. The team partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models might lead to unintended harm for society.

Based on these discussions, a set of adversarial goals was developed for the red team to attempt, such as extracting harmful information or reprogramming the model to act in potentially harmful ways. The red team included experts in cybersecurity, adversarial machine learning, responsible AI, and integrity, as well as multilingual content specialists with backgrounds in integrity issues in specific geographic markets.
