Saturday, April 12, 2025

Pegasus 1.2: High-Performance Video Language Model

Pegasus 1.2 sets a new standard for long-form video AI with top-tier accuracy and minimal lag. Designed for commercial use, it supports efficient video querying at scale.

The video understanding company TwelveLabs and Amazon Web Services (AWS) recently announced that Amazon Bedrock will soon offer TwelveLabs’ cutting-edge multimodal foundation models, Marengo and Pegasus. With Amazon Bedrock, a fully managed service, developers can access top-performing models from leading AI companies through a single API. With seamless access to TwelveLabs’ sophisticated video understanding capabilities, backed by AWS’s security, privacy, and performance, developers and businesses will be able to transform how they search, analyse, and generate insights from video content. AWS is the first cloud provider to offer TwelveLabs models.
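Once the models are live, access should follow the standard Bedrock pattern via a single API call. The sketch below uses boto3’s bedrock-runtime client; the model ID and request-body fields are assumptions, since TwelveLabs’ Bedrock identifiers had not been published at the time of writing.

```python
# Sketch: invoking a TwelveLabs model through Amazon Bedrock with boto3.
# The model ID and request-body schema below are assumptions; check the
# Bedrock model catalog once the TwelveLabs models are available.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.invoke_model(
    modelId="twelvelabs.pegasus-1-2-v1:0",  # hypothetical identifier
    body=json.dumps({
        "prompt": "Summarize the key events in this video.",
        "videoS3Uri": "s3://my-bucket/videos/keynote.mp4",  # hypothetical field
    }),
)
print(json.loads(response["body"].read()))
```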

Read more on Marengo 2.7: Breakthrough in Multimodal Video Understanding

Introducing Pegasus 1.2

Unlike many academic settings, real-world video applications face two distinct obstacles:

  • Real-world videos can be anywhere from a few seconds to many hours long.
  • Precise temporal comprehension is necessary.

In response to these commercial demands, TwelveLabs is launching Pegasus 1.2, a major advancement in industry-grade video language models. Pegasus 1.2 achieves state-of-the-art results in interpreting lengthy videos: the model can handle videos up to an hour in length with minimal latency, low cost, and best-in-class accuracy. Additionally, its embedding storage caches indexed videos, making it possible to query the same video repeatedly, faster and more affordably each time.

Pegasus 1.2 is therefore a state-of-the-art solution whose intelligent, targeted system design delivers business value, performing exceptionally well where production-grade video processing pipelines need it most.
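In practice, this index-once, query-many pattern looks roughly like the sketch below, written against the TwelveLabs Python SDK. The engine name and options shown are assumptions and may differ across SDK versions:

```python
# Minimal sketch of the index-once, query-many workflow with the
# TwelveLabs Python SDK (pip install twelvelabs). Engine names and
# option values are assumptions; consult the SDK docs for exact values.
from twelvelabs import TwelveLabs

client = TwelveLabs(api_key="tlk_...")  # your API key

# Index the video once: Pegasus embeddings are computed and cached here.
index = client.index.create(
    name="demo-index",
    engines=[{"name": "pegasus1.2", "options": ["visual", "audio"]}],
)
task = client.task.create(index_id=index.id, file="keynote.mp4")
task.wait_for_done()  # blocks until indexing finishes

# Every later query reuses the cached embeddings, so repeated
# generation over the same video stays fast and cheap.
for prompt in ["Summarize the video.", "List the product announcements."]:
    res = client.generate.text(video_id=task.video_id, prompt=prompt)
    print(res.data)
```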

Best-in-class video language model for long videos

Although handling long videos is essential in business contexts, processing time, and consequently extended time-to-value, is a major issue. A typical video processing/inference system quickly becomes unable to handle orders of magnitude more frames as input videos grow longer, making it practically unsuitable for widespread adoption and commercial use. A system built for commercial expectations must also provide accurate answers to input prompts and questions while reasoning over considerably longer time horizons.

Latency

To assess Pegasus 1.2’s speed, TwelveLabs compared time-to-first-token (TTFT) against frontier model APIs, GPT-4o and Gemini 1.5 Pro, on videos ranging from 3 to 60 minutes in duration. Thanks to its video-focused model architecture and optimised inference system, Pegasus 1.2 shows consistently low time-to-first-token latency for videos up to 15 minutes long, and it responds more quickly than the other APIs on longer content.
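TTFT is simply the delay between sending a request and receiving the first streamed token. A generic, vendor-neutral way to time it is sketched below; stream_answer is a stand-in for whichever streaming SDK is being benchmarked:

```python
# Generic sketch for measuring time-to-first-token (TTFT) against a
# streaming model API. `stream_answer` is a stand-in for a vendor SDK
# call that yields tokens as they arrive.
import time
from typing import Callable, Iterator

def measure_ttft(stream_answer: Callable[[str], Iterator[str]], prompt: str) -> float:
    """Return seconds elapsed until the first token arrives."""
    start = time.perf_counter()
    for _token in stream_answer(prompt):
        return time.perf_counter() - start  # stop at the first token
    raise RuntimeError("stream produced no tokens")

# Example usage with a dummy stream standing in for a real API:
def dummy_stream(prompt: str) -> Iterator[str]:
    time.sleep(0.5)  # simulated model latency
    yield "First"
    yield "token"

print(f"TTFT: {measure_ttft(dummy_stream, 'Summarize the video.'):.3f}s")
```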

Performance

Using a subset of the Video-MME dataset containing videos longer than 30 minutes (VideoMME-Long), TwelveLabs compared Pegasus 1.2’s performance against the same set of frontier model APIs. Pegasus 1.2 outperforms all the other flagship APIs, establishing a new state of the art.

Pricing

For commercial video processing, Pegasus 1.2 offers best-in-class performance without the high cost. Rather than attempting to handle everything, TwelveLabs has concentrated on mastering long videos and precise temporal understanding. This targeted approach has produced a highly optimized system that offers exceptional performance at a competitive price point.

Even better, the system can handle repeated video-to-text generation without incurring significant costs. Pegasus 1.2 creates rich video embeddings when videos are indexed and stores them in a database for subsequent API requests, enabling customers to generate continuously at very low cost. For instance, Google Gemini 1.5 Pro’s cache costs $4.50 per hour of storage for 1 million tokens, roughly the token count for an hour of video. TwelveLabs’ embedding storage, by contrast, is roughly 36,000x cheaper at just $0.09 per video hour per month (see the worked comparison below). Customers with enormous video archives who need to understand all of it inexpensively benefit greatly from this design.
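The roughly 36,000x figure follows from putting both prices on a per-month basis, assuming about 730 hours in a month and, as stated above, roughly 1 million tokens per hour of video:

```python
# Worked comparison of the per-month cost of keeping one hour of video
# "hot" for repeated querying, using the figures quoted above.
HOURS_PER_MONTH = 730  # ~24 * 30.4 days

gemini_cache_per_hour = 4.50   # $ per hour of storage for ~1M tokens (1h of video)
gemini_monthly = gemini_cache_per_hour * HOURS_PER_MONTH  # $3,285.00

pegasus_monthly = 0.09         # $ per video hour per month

print(f"Gemini 1.5 Pro cache: ${gemini_monthly:,.2f}/month")
print(f"Pegasus 1.2 storage:  ${pegasus_monthly:.2f}/month")
print(f"Ratio: {gemini_monthly / pegasus_monthly:,.0f}x cheaper")  # ~36,500x
```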

Model Overview & Limitations

Architecture

Pegasus 1.2’s encoder-decoder architecture, tailored for thorough video comprehension, has three main parts: a video encoder, a video tokenizer, and a large language model. This architecture preserves computing efficiency while allowing comprehensive analysis of visual and textual data.

Combined, these elements form a coherent system that can comprehend both long-term contextual information and fine-grained details. The architecture shows that, with careful design decisions and creative answers to basic problems in multimodal processing, small models can achieve advanced video understanding.
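TwelveLabs has not published Pegasus 1.2’s internals, but the high-level data flow described above can be sketched as follows. Every function name and stage boundary here is illustrative, not the actual implementation:

```python
# Illustrative sketch of the encoder-decoder flow described above:
# frames -> video encoder -> video tokenizer -> LLM -> text.
# All components are hypothetical stand-ins for unpublished internals.

def encode_video(frames):
    """Video encoder: map raw frames to dense visual features."""
    ...

def tokenize_video(features):
    """Video tokenizer: compress features into a short token sequence;
    this compression is what keeps hour-long videos tractable."""
    ...

def generate_answer(video_tokens, prompt):
    """Large language model: reason jointly over video tokens and text."""
    ...

def pegasus_pipeline(frames, prompt):
    return generate_answer(tokenize_video(encode_video(frames)), prompt)
```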

Limitations

Safety and Biases

Pegasus 1.2 has safety features, but like any AI model, it risks producing content that could be deemed offensive or dangerous if sufficient supervision and regulation aren’t in place. TwelveLabs continues to study the safety and ethical considerations for video foundation models; a thorough assessment and ethics report will be made available as it carries out further testing and collects feedback.

Hallucinations

Occasionally, Pegasus 1.2 may generate erroneous output. Although improvements since Pegasus 1.1 have reduced hallucinations, users should be aware of this limitation, particularly for tasks requiring a high degree of precision and factual accuracy.
