The video understanding firm TwelveLabs and Amazon Web Services (AWS) recently announced that Amazon Bedrock will soon offer TwelveLabs’ cutting-edge multimodal foundation models, Marengo and Pegasus. Amazon Bedrock is a fully managed service that gives developers access to top-performing models from leading artificial intelligence firms through a single API. With seamless access to TwelveLabs’ sophisticated video understanding capabilities, backed by AWS’s security, privacy, and performance, developers and businesses will be able to transform how they search, analyse, and generate insights from video content. AWS is the first cloud provider to offer TwelveLabs models.
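For developers wondering what that single-API access looks like in practice, the sketch below uses the AWS SDK for Python (boto3) to call a model through the Bedrock runtime. The model identifier and request payload shown are placeholders (assumptions), since TwelveLabs’ exact model IDs and request schemas on Bedrock had not been published at the time of the announcement.

```python
# Hedged sketch: invoking a video-understanding model via the Amazon Bedrock
# runtime API with boto3. MODEL_ID and the payload fields are placeholders /
# assumptions, not confirmed identifiers or a confirmed request schema.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

MODEL_ID = "twelvelabs.marengo-2-7"  # placeholder model identifier

payload = {
    "inputType": "video",
    "videoSource": {"s3Uri": "s3://example-bucket/clips/demo.mp4"},  # example input
}

response = bedrock.invoke_model(
    modelId=MODEL_ID,
    body=json.dumps(payload),
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result)
```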
Introduction to Marengo 2.7
TwelveLabs is introducing Marengo 2.7, a state-of-the-art multimodal embedding model that outperforms Marengo 2.6 by more than 15%.
An overview of multi-vector video representation
Video material is intrinsically more complex and nuanced than text, where a single word embedding can often capture semantic meaning successfully. A video clip typically combines visual elements (objects, scenes, and activities), temporal dynamics (motion, transitions), auditory elements (speech, background noise, music), and frequently textual information (overlays, subtitles). Conventional single-vector approaches struggle to condense all of these disparate elements into one representation without sacrificing important details. This complexity calls for a more advanced approach to video understanding.
Marengo 2.7 uses a novel multi-vector strategy to handle this complexity. Rather than condensing everything into a single vector, it generates distinct vectors for each component of the video: some vectors capture movement, others record what was said, and others describe appearances. This approach improves the model’s comprehension of videos with a wide variety of content, resulting in more precise video analysis across motion, audio, and visuals.
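As a rough illustration of the idea, and not TwelveLabs’ actual implementation, the sketch below stores one embedding per modality for a clip and scores a query against the best-matching modality. The field names and the max-over-modalities rule are assumptions chosen for clarity.

```python
# Illustrative multi-vector clip representation: one embedding per modality.
from dataclasses import dataclass
import numpy as np

@dataclass
class ClipEmbeddings:
    visual: np.ndarray   # appearance: objects, scenes
    motion: np.ndarray   # temporal dynamics: movement, transitions
    audio: np.ndarray    # speech, music, background sound
    ocr: np.ndarray      # on-screen text: overlays, subtitles

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score(query_vec: np.ndarray, clip: ClipEmbeddings) -> float:
    # A motion-oriented query should match the motion vector even when the
    # appearance vector is a poor fit, so take the best per-modality score.
    return max(
        cosine(query_vec, clip.visual),
        cosine(query_vec, clip.motion),
        cosine(query_vec, clip.audio),
        cosine(query_vec, clip.ocr),
    )
```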
Evaluation across 60+ multimodal retrieval datasets
Current video understanding benchmarks frequently rely on narrative-style, in-depth descriptions that highlight a video’s key moments. This approach, however, does not reflect actual usage patterns, where users tend to ask shorter, vaguer questions, such as “find the red car” or “show me the celebration scene.” Users also often look for background details, specific objects that might be visible only for a brief moment, or peripheral features. Furthermore, queries frequently blend multiple modalities, such as text overlays with particular actions, or visual components with audio cues. Because of this discrepancy between benchmark evaluation and real-world use, Marengo 2.7 required a more thorough evaluation process.
Recognising the shortcomings of existing benchmarks in capturing real-world use cases, TwelveLabs created a comprehensive evaluation framework spanning more than 60 different datasets. This framework thoroughly evaluates the model’s performance in the following areas (a minimal sketch of how per-dataset scores could be aggregated follows this list):
- General visual comprehension
- Complex query understanding
- Small object detection
- OCR interpretation
- Logo recognition
- Audio processing (speech and non-speech)
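The aggregation sketch referenced above shows one plausible way per-dataset results could be rolled up into category-level figures like those reported later in this article. The grouping and the scores are purely illustrative, not Marengo 2.7’s reported numbers.

```python
# Roll per-dataset scores up into category averages (illustrative only).
def aggregate_by_category(per_dataset_scores: dict[str, float],
                          categories: dict[str, list[str]]) -> dict[str, float]:
    # Average the per-dataset scores within each evaluation category.
    return {
        category: sum(per_dataset_scores[name] for name in names) / len(names)
        for category, names in categories.items()
    }

# Dataset names are taken from the results quoted below; the scores are dummy
# values for illustration, not reported results.
categories = {
    "general_text_to_visual": ["MSRVTT", "COCO"],
    "small_object_search": ["obj365-medium", "bdd-medium", "mapillary-medium"],
}
per_dataset_scores = {
    "MSRVTT": 0.70, "COCO": 0.70,
    "obj365-medium": 0.50, "bdd-medium": 0.50, "mapillary-medium": 0.50,
}
print(aggregate_by_category(per_dataset_scores, categories))
```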
Cutting-Edge Performance with Unmatched Image-to-Visual Search Capabilities
Marengo 2.7 exhibits state-of-the-art performance across all major benchmarks, with its image-to-visual search capabilities showing especially noteworthy gains. While the model performs well on every measure, its image-based object and logo search capabilities mark a major advancement for the industry.
General text-to-visual search
Average performance of 74.9% on the MSRVTT and COCO datasets, 4.6% better than external SOTA models.
Motion search (text-to-visual)
Average recall of 78.1% on Something Something v2, 30.0% better than the external SOTA model.
OCR search (text-to-visual)
Average performance of 77.0% on the TextCaps and BLIP3-OCR datasets, 13.4% better than external SOTA models.
Small object search (text-to-visual)
Average performance of 52.7% on the obj365-medium, bdd-medium, and mapillary-medium datasets, 10.1% better than external SOTA models.
General image-to-visual search
An exceptional 90.6% average performance across the obj365-easy, obj365-medium, and LaSOT datasets, a 35.0% improvement over external SOTA models and the largest performance leap to date.
Logo search (image-to-visual)
Average performance of 56.0% across the OpenLogo, ads-logo, and basketball-logo datasets, a 19.2% improvement over external SOTA models.
General text-to-audio search
Average performance of 57.7% on the AudioCaps, Clotho, and GTZAN datasets, 7.7% better than Marengo 2.6.
Multi-Vector Architecture in a Unified Framework
At its core, Marengo 2.7 uses a Transformer-based design that interprets video data within a single, cohesive framework able to comprehend:
- Visual components: Temporal interactions, motion dynamics, fine-grained object identification, and appearance characteristics
- Audio components: Understanding native speech, identifying nonverbal cues, and interpreting music
Marengo 2.7’s distinctive multi-vector representation is one of its main features. In contrast to Marengo 2.6, which condenses all information into a single embedding, Marengo 2.7 breaks the raw inputs down into many specialised vectors. Each vector separately captures a distinct element of the video, from visual appearance and motion dynamics to OCR text and speech patterns. This granular representation enables more precise and nuanced multimodal search: the approach performs exceptionally well on typical text-based search tasks and shows a particular aptitude for recognising small objects (see the sketch below).
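Extending the earlier per-clip sketch to a whole video library, the snippet below indexes several vectors per clip and ranks clips by their best-matching vector. Under this assumed scheme, a clip can surface even when only one modality, say a briefly visible on-screen logo, matches the query. Again, this is a hedged illustration of the multi-vector idea, not the model’s internals.

```python
# Corpus-level ranking over multi-vector clip embeddings (illustrative).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_clips(query_vec: np.ndarray,
               corpus: dict[str, dict[str, np.ndarray]]) -> list[tuple[str, float]]:
    # corpus maps clip_id -> {"visual": vec, "motion": vec, "audio": vec, "ocr": vec}
    scored = [
        (clip_id, max(cosine(query_vec, vec) for vec in vectors.values()))
        for clip_id, vectors in corpus.items()
    ]
    # Highest-scoring clips first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```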
Quantitative Assessment
Marengo 2.7’s performance has been assessed on more than 60 benchmark datasets against top multimodal retrieval models and specialised solutions across a variety of disciplines. The evaluation approach provides a thorough examination of the model’s multimodal comprehension, covering text-to-visual, image-to-visual, and text-to-audio search capabilities.
Text-to-Visual Search Performance
General Visual Search
Marengo 2.7 achieves an average recall of 74.9% in general visual search across two benchmark datasets. These results show a 4.6% advantage over external SOTA models and a 4.7% improvement over Marengo 2.6.
Motion-Based Search
Marengo 2.7 achieves an average recall of 78.1% on Something Something v2 for motion search. These results show a 30.0% advantage over external SOTA models and a 22.5% improvement over Marengo 2.6.
OCR Search
Marengo 2.7 achieves a mean average precision of 77.0% in OCR search across two benchmark datasets, a 13.4% improvement over external SOTA models and a 10.1% improvement over Marengo 2.6.
Small Object Search
Marengo 2.7 achieves an average recall of 52.7% in small object search across three custom benchmark datasets. These results show an improvement of 10.08% over external SOTA models and 10.14% over Marengo 2.6.
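For reference, the two metrics quoted in this section, average recall and mean average precision, are conventionally computed as sketched below. The exact cut-off values and per-dataset protocols behind Marengo 2.7’s reported numbers are not specified here, so treat this as a generic illustration rather than the actual evaluation code.

```python
# Standard retrieval metrics: recall@k and mean average precision (mAP).
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    # Fraction of relevant items retrieved within the top-k results.
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / max(len(relevant_ids), 1)

def average_precision(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    # Mean of precision values at each rank where a relevant item appears.
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(len(relevant_ids), 1)

def mean_average_precision(queries: list[tuple[list[str], set[str]]]) -> float:
    # queries: one (ranked_ids, relevant_ids) pair per query.
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)
```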
Limitations and Future Work
Even though Marengo 2.7 shows notable advancements across several modalities, a number of obstacles remain before full video comprehension is achieved.
Understanding Complex Scenes
The model may overlook minor background activities or events that take place concurrently in the video, even though it is quite good at detecting the main actions and objects.
Visual Exact Match Difficulties
The model occasionally has trouble identifying precise visual matches, especially when looking for particular instances of objects or individuals that may appear more than once in slightly different settings.
Interpretation of Queries
Although Marengo 2.7 handles the majority of queries well, it may have trouble with:
- Highly compositional queries with multiple temporal relationships
- Negation patterns beyond basic cases
- Queries requiring world knowledge or abstract reasoning
Effectiveness in OCR, Conversation, and Logo Search
Marengo 2.7 also has limitations in text-to-logo search scenarios, particularly for logos that occupy less than 1% of the frame or are seen from difficult viewing angles.
In both conversation and OCR search, the model struggles with heavily accented speech, overlapping conversations, and text in unusual typefaces or orientations. These difficulties are most noticeable in real-world footage with complex backgrounds or poor lighting.
As TwelveLabs continues to expand the possibilities of multimodal video understanding, these constraints are natural areas for further research. Current efforts focus on resolving these issues while preserving the model’s strengths in temporal reasoning and cross-modal comprehension.
Conclusion
With notable gains across text, audio, and visual modalities, Marengo 2.7 marks a major step forward in multimodal video understanding. Through its novel multi-vector approach and extensive evaluation framework, it has demonstrated state-of-the-art performance on challenging video interpretation tasks while maintaining high accuracy across many use cases.
Alongside its thorough evaluation process, TwelveLabs will publish a complete technical report to promote openness and reproducibility in the industry. The evaluation framework, which covers more than 60 datasets, will be open-sourced and routinely maintained to help researchers and practitioners validate findings and advance multimodal video understanding.