Multimodal AI
Digital assistants can learn more about you and the environment around you by utilizing multimodal AI, which gains even more power when it can operate on your device and processes various inputs such as text, photos, and video.
Large Multimodal Models(LMM)
Even with its infinite intelligence, generative artificial intelligence (AI) can only do so much because of how well it perceives its environment. Large multimodal models (LMMs) are able to examine text, photos, videos, radio frequency data, and even voice searches in order to offer more precise and pertinent responses.
It’s an important step in the development of generative AI after the widely used Large Language Models (LLMs), which included the ChatGPT initial model, which could only process text. Your PC, smartphone, and productivity apps will all benefit greatly from this improved ability to comprehend what you see and hear. Digital assistants and productivity tools will also become much more helpful. And the procedure will be quicker, more private, and power-efficient if the device can manage these processes.
LLaVA: Large Language and Vision Assistant
Qualcomm Technologies is dedicated on making multimodal AI available on devices. Large Language and Vision Assistant (LLaVA), a community-driven LMM with over seven billion parameters, was initially demonstrated by us back in February on an Android phone powered by the Snapdragon 8 Gen 3 Mobile Platform. In this demonstration, the phone could “recognize” images, such as a dish of fruits and vegetables or a dog in an open environment, and carry on a conversation with them. One may ask to have a recipe made with the things on the platter, or they could ask for an estimate of how many calories the recipe will include overall. Take a look at it:
The AI of the future is multimodal
Multimodal AI 2024
Given the increased clamor surrounding multimodal, this work is crucial. Microsoft unveiled the Phi-3.5 family of devices last week, which offers visual and multilingual support. This came after Google touted LMMs during its Made by Google event, wherein the multimodal input model Gemini Nano was unveiled. GPT-4 Omni, an original multimodal model from OpenAI, was unveiled in May. This comes after comparable research from Meta and community-developed models like LLaVA.
When combined, these developments show the direction that artificial intelligence is taking. It goes beyond simply having you type questions at a prompt. Qualcomm’s goal is to make these AI experiences available on billions of phones worldwide.
Qualcomm Technologies is collaborating with Google to enable the next generation of Gemini on Snapdragon, and it is working with a wide range of firms that are producing LMMs and LLMs, such as Meta’s Llama series. With the help of their partners, these models operate seamlessly on Snapdragon, and they can’t wait to surprise customers with more on-device AI features this year and the next.
While an Android phone is a great place to start when utilizing multimodal inputs, other categories will soon reap the benefits as well. For example, smart glasses that can scan your food and provide nutritional information, or cars that can comprehend your voice commands and help you while driving, are just a few examples of how multimodal inputs will benefit you.
Numerous difficult jobs can be completed via multimodal AI
These are just the beginning for multimodal AI, which may use a mix of cameras, microphones, and vehicle sensors to identify disinterested passengers in the back of an automobile and provide entertaining activities to pass the time. Additionally, it might make it possible for smart glasses to identify exercise equipment at a health club and generate a personalized training schedule for you.
The precision facilitated by multimodal AI will be important in aiding a field technician to diagnose problems with your household appliances or in guiding a farmer to pinpoint the root cause of crop-related problems.
The concept is that by utilizing cameras, microphones, and other sensors, these devices which start with phones, PCs, automobiles, and smart glasses can enable the AI assistant to “see” and “hear” in order to provide more insightful contextual responses.
The significance of on a device
Your phone or car must have sufficient processing capacity to handle those requests in order for all those added capabilities to function optimally. Since the battery on your phone must last the entire day, trillions of operations must occur quickly and effectively when using it. By using the device, you can avoid waiting for servers to react when they are too busy to ping the cloud. They’re also more private because you keep your device and the answers you receive with you.
That has been Qualcomm Technologies’ top concern. Handsets can handle a lot of processing on the phone itself because to the Snapdragon 8 Gen 3 processor’s Hexagon NPU. Likewise, the Snapdragon X Elite and Snapdragon X Plus Platforms enable more than 20 Copilot+ PCs on the market today to manage complex AI functions on the device.