A multimodal search solution with Google Multimodal embeddings, BigQuery, and NLP
Google multimodal embeddings
Today's digital world produces an enormous amount of information, not only as text but also as images and videos. Conventional enterprise search engines struggle to analyze visual content because they were built primarily to handle text-based queries. However, a new era of search is emerging: thanks to a combination of natural language processing (NLP) and Google multimodal embeddings, your customers can search for an image or video, or for information within it, in the same way they would search text-based content.
Multimodal Embeddings
In this blog, Google Cloud demonstrates a robust multimodal embedding model tailored for cross-modal semantic search scenarios, such as searching images with a text query or finding text within images. The model can be used for text-based search over images, videos, or both. The key to these tasks is the multimodal embedding.
Let's see how this works!
GCP Multimodal embeddings
A method for combining text, video, and image search
In this architecture, BigQuery object tables reference media assets stored in Google Cloud Storage. A multimodal embedding model generates semantic embeddings for the images and videos, and those embeddings are indexed in BigQuery for efficient similarity search, enabling a seamless cross-modal search experience.
Follow the steps below to implement a similar approach.
Steps 1 and 2: Upload image and video files to Cloud Storage
Upload all image and video files to a Cloud Storage bucket. For this experiment, Google Cloud gathered a few images and videos from Google Search and hosts them on GitHub. If you use that sample set, remove the README.md file before uploading the contents to your Cloud Storage bucket.
Get your media files ready:
- Gather all the photos and videos you want to use from your own data.
- Make sure the files are labeled and organized for easy management and access.
Upload the data to Cloud Storage:
- If you haven’t already, create a bucket in Cloud Storage.
- Upload your media files to the bucket. You can use the Google Cloud console, the gsutil command-line tool, or the Cloud Storage API.
- Verify that the files uploaded correctly, and note the bucket name and file locations (for example, gs://your-bucket-name/your-files).
Step 3: In BigQuery, create an object table
Create an object table in BigQuery that points to your source image and video files in the Cloud Storage bucket. Object tables are read-only tables over unstructured data objects stored in Cloud Storage, and they have several other use cases in BigQuery as well.
Before creating the object table, create a connection as explained here. Verify that the Vertex AI API is enabled for your project and that the connection's principal has been granted the "Vertex AI User" role.
Create the remote model
CREATE OR REPLACE MODEL `dataset_name.model_name`
REMOTE WITH CONNECTION `us.connection_name`
OPTIONS (ENDPOINT = 'multimodalembedding@001');
Create the object table
CREATE OR REPLACE EXTERNAL TABLE `dataset_name.table_name`
WITH CONNECTION `us.connection_name`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://bucket_name/*']
);
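To confirm that the object table sees your media files, you can run a quick metadata query. This check is not part of the original walkthrough; it is a minimal sketch that assumes the placeholder table name above and relies on the standard object table metadata columns.
SELECT uri, content_type, size   -- standard object table metadata columns
FROM `dataset_name.table_name`
LIMIT 5;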
Step 4: Generate your Google multimodal embeddings
Google Cloud uses a pre-trained multimodal embedding model to generate embeddings, that is, numerical representations, of your media data. These embeddings capture the semantic content of the images and videos and enable efficient similarity searches.
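The exact query is not shown in this post, but a minimal sketch using BigQuery's ML.GENERATE_EMBEDDING function might look like the following. The table name dataset_name.embeddings_table is an assumed placeholder; the model and object table are the ones created in Step 3.
-- Sketch: generate an embedding for every object referenced by the object table
CREATE OR REPLACE TABLE `dataset_name.embeddings_table` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `dataset_name.model_name`,   -- remote multimodal embedding model
  TABLE `dataset_name.table_name`    -- object table pointing at the media files
);
The output contains an ml_generate_embedding_result column holding the embedding vector for each image or video alongside its uri; that is the column the vector index in the next step is built on.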
Step 5: In BigQuery, create a vector index
To store and query the embeddings created from your image and video data efficiently, create a vector index in BigQuery on the embedding column of the table that holds the generated embeddings. This index is essential for the similarity searches carried out later.
CREATE OR REPLACE VECTOR INDEX index_name
ON dataset_name.table_name(ml_generate_embedding_result)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');
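Not shown in the original walkthrough, but a quick way to confirm the index has been built is to query the INFORMATION_SCHEMA.VECTOR_INDEXES view, which reports the index status and coverage:
-- Check that the vector index is active and fully built
SELECT index_name, index_status, coverage_percentage
FROM dataset_name.INFORMATION_SCHEMA.VECTOR_INDEXES
WHERE table_name = 'table_name';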
Step 6: Send text input with the user’s query
A user submits a search request in plain natural language, such as "elephant eating grass." Just as it did with the media data, the system turns the user's textual query into an embedding.
Step 7: Generate a text embedding for the user query
Generate an embedding for the user query with the same multimodal embedding model, so that it can be compared directly against the stored media embeddings.
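A minimal sketch of this step, again assuming the remote model created earlier; ML.GENERATE_EMBEDDING is applied to an inline query, and the content column name is assumed here to be what the model expects for text input:
-- Sketch: embed the user's natural-language query with the same model
SELECT ml_generate_embedding_result
FROM ML.GENERATE_EMBEDDING(
  MODEL `dataset_name.model_name`,
  (SELECT 'elephant eating grass' AS content)   -- the user's text query
);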
Step 8: Perform a similarity search
Using VECTOR_SEARCH, a similarity search is carried out between the user's query and the source data, which includes images and videos. The search uses the vector index built in Step 5 to compare the embedding of the user query against the stored media embeddings and return the media items most similar to the query.
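A minimal sketch of the search query, assuming the stored embeddings live in dataset_name.embeddings_table (the placeholder from the Step 4 sketch) in the ml_generate_embedding_result column the vector index was built on; the top_k value of 5 is only an example:
-- Sketch: find the 5 media objects closest to the query embedding
SELECT base.uri, distance
FROM VECTOR_SEARCH(
  TABLE `dataset_name.embeddings_table`,   -- stored media embeddings
  'ml_generate_embedding_result',          -- indexed embedding column
  (
    SELECT ml_generate_embedding_result    -- embed the user's text query on the fly
    FROM ML.GENERATE_EMBEDDING(
      MODEL `dataset_name.model_name`,
      (SELECT 'elephant eating grass' AS content)
    )
  ),
  top_k => 5,
  distance_type => 'COSINE'
)
ORDER BY distance;
Each result row pairs a Cloud Storage URI with its cosine distance to the query, which is what Step 9 surfaces to the user.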
Step 9: Provide the user with the image and video search results
Finally, the user is shown the results of the similarity search. The results list the URIs and similarity scores (distances) of the most similar images and videos stored in the Cloud Storage bucket, so the user can view or download the media content that is relevant to their search query.
Google Multimodal embeddings enable a new search capability
Because Google multimodal embeddings support both image and video modalities, you are only a few steps away from building a great search experience across your visual content. Whether your use case involves image search, video search, or both together, get ready to unlock a new level of search that improves user experiences and speeds up content discovery.