BigQuery vector search
Google Cloud announced today that BigQuery vector search is now generally available (GA), allowing users to perform vector similarity search on BigQuery data. This capability, also known as approximate nearest-neighbor search, is key to enabling a wide range of new data and AI use cases, including retrieval-augmented generation (RAG) with large language models (LLMs), similarity detection, and semantic search.
BigQuery vector search was first announced in February and provides a serverless, integrated vector-analytics solution for use cases such as anomaly detection, multi-modal search, product recommendations, drug discovery, and more. It does this by bringing the creation, management, and search of embeddings together inside the data platform.
In addition, the BigQuery vector search inverted file (IVF) index is now generally available. IVF builds a two-part index that combines a k-means clustering model with an inverted row locator to efficiently search for similar embedding representations of your data. Since its preview, IVF has gained the following improvements:
Increased scalability: You can now index up to 10 billion embeddings, opening the door to large-scale applications.
Managed index with guaranteed correctness: When the underlying data changes, vector indexes are automatically updated using the current k-means model. Even before the system finishes re-indexing the changed data, vector search consistently returns correct results that reflect the latest mutations of the data.
Stored columns: To avoid costly joins when retrieving additional data for the search result, you can now store frequently used columns in the index. This optimization delivers the most visible performance gains in scenarios with high result-set cardinality, such as when you need a high top_k or when your query data comprises a large batch of embeddings. For example, on a table of 1 billion 96-dimensional embeddings, retrieving the 1,000 most similar candidates for an embedding is ~4x faster and uses ~200x fewer slots with stored columns than without them.
Pre-filters: In conjunction with stored columns, vector search results can be pre-filtered by turning the base table argument into a query with filters. In contrast to post-filtering, which adds WHERE clauses after the VECTOR_SEARCH() function, pre-filtering improves search quality, reduces the chance of missing results, and optimizes query performance; see the sketch after this list.
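To make the distinction concrete, here is a minimal sketch of both approaches; the names my_dataset.products, query_table, category, and embedding are hypothetical:
-- Pre-filter: the filter sits inside the base-table subquery, so
-- VECTOR_SEARCH() only considers rows where category = 'shoes'.
SELECT query.id, base.name, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM my_dataset.products WHERE category = 'shoes'),
  'embedding',
  TABLE my_dataset.query_table,
  top_k => 10
)
-- Post-filter: the WHERE clause runs after the search, so off-category
-- matches are discarded and fewer than top_k rows may come back.
SELECT query.id, base.name, distance
FROM VECTOR_SEARCH(
  TABLE my_dataset.products,
  'embedding',
  TABLE my_dataset.query_table,
  top_k => 10
)
WHERE base.category = 'shoes'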
Customers such as Palo Alto Networks have used BigQuery vector search to surface similar, frequently asked questions, speeding up their time to insight.
Furthermore, even for large-scale workloads like the drug-discovery work Vilya has been doing, prototyping with BigQuery vector search and moving to production is straightforward. With on-demand pricing and tools for budget evaluation, the transition to capacity-based billing models has also been easy.
Building with an example
Let’s say you want to ask a question in an internal Q&A forum, but first you want to see whether any already-answered questions are semantically related to yours. For this example, assume you have already created embeddings for the questions and stored them in a table. Once that is done, you can create a vector index and, for optimal efficiency, store frequently used columns such as title, content, and tags in the index.
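If the embeddings do not exist yet, one way to create them in place is ML.GENERATE_EMBEDDING over a remote embedding model; a minimal sketch, where <embedding_model> and <posts_questions> are placeholders for your model and source table:
-- Generate an embedding for each question and persist it alongside
-- the columns we will later store in the vector index.
CREATE OR REPLACE TABLE <my_posts_questions> AS
SELECT
  title, content, tags,
  -- ml_generate_embedding_result holds the generated vector
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL <embedding_model>,
  -- the input query must expose the text to embed in a column named 'content'
  (SELECT title, content, tags FROM <posts_questions>)
)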
CREATE OR REPLACE VECTOR INDEX <index_name>
ON <my_posts_questions>
(embedding)
STORING (title, content, tags)
OPTIONS(distance_type='COSINE', index_type='IVF')
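Index creation and refresh happen asynchronously in the background. To check how much of the table the index currently covers, you can query the INFORMATION_SCHEMA.VECTOR_INDEXES view; the region qualifier below is an assumption, so substitute your dataset's region:
-- coverage_percentage reaches 100 once all rows have been indexed
SELECT table_name, index_name, index_status, coverage_percentage
FROM `region-us`.INFORMATION_SCHEMA.VECTOR_INDEXES
WHERE table_name = '<my_posts_questions>'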
VECTOR_SEARCH() works even without a vector index, but creating one usually improves query performance. Once everything is ready, you can use ML.GENERATE_EMBEDDING in conjunction with VECTOR_SEARCH() to search for topics like “Android app using RSS crashing.” To refine the results more effectively, apply a pre-filter on the tags column to narrow the search space. In the query below, <my_posts_questions> and <embedding_model> are placeholders for your table and embedding model.
SELECT query.query, base.title, base.content, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM <my_posts_questions> WHERE SEARCH(tags, 'android')),
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding, content AS query
    FROM ML.GENERATE_EMBEDDING(
      MODEL <embedding_model>,
      (SELECT 'Android app using RSS crashing' AS content)
    )
  )
)
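By default, VECTOR_SEARCH() returns the 10 nearest neighbors per query row. To tune that, and to trade recall for speed on an IVF index, you can pass the top_k and options arguments; the fraction below is illustrative, not a recommendation:
SELECT query.query, base.title, base.content, distance
FROM VECTOR_SEARCH(
  (SELECT * FROM <my_posts_questions> WHERE SEARCH(tags, 'android')),
  'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding, content AS query
    FROM ML.GENERATE_EMBEDDING(
      MODEL <embedding_model>,
      (SELECT 'Android app using RSS crashing' AS content)
    )
  ),
  -- return only the 5 closest matches per query row
  top_k => 5,
  -- scan 1% of the IVF lists: faster, at some cost to recall
  options => '{"fraction_lists_to_search": 0.01}'
)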
Google Cloud also announced a new index type that can further enhance search performance. It is based on ScaNN, a technology developed by Google, and is available in preview. BigQuery vector search is becoming an essential part of a multi-modal retrieval-augmented generation (RAG) solution, built on top of a full-featured BigQuery knowledge base that spans multimodal, structured, and unstructured data, and powered by state-of-the-art Gemini models.
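For reference, the ScaNN-based index appears in preview as the TREE_AH index type; creating one looks like the following sketch (preview syntax, subject to change):
-- ScaNN-based index; placeholder names follow the earlier example
CREATE OR REPLACE VECTOR INDEX <scann_index_name>
ON <my_posts_questions>(embedding)
OPTIONS(distance_type='COSINE', index_type='TREE_AH')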
Start now
Vector embeddings and machine learning have the potential to transform what you can do with the data in your BigQuery business data warehouses, starting with fast, affordable search over those embeddings.