Announcing The TreeAH Vector Index Preview On Google Cloud

August 21, 2024

Google Cloud is pleased to present the TreeAH vector index preview. To make BigQuery the AI-ready data platform for the Gemini era, Google Cloud is constantly adding new features to it. Earlier this year it launched vector search, which lets users perform vector similarity searches on BigQuery data. Since then, it has added several features, such as pre-filtering and stored columns. Thanks to the platform's scale, performance, and ease of use, BigQuery's vector search and AI capabilities are already enabling users to build pipelines and applications for anything from semantic search to LLM-based retrieval-augmented generation (RAG).

TreeAH Index

TreeAH is a type of vector index based on the ScaNN algorithm developed by Google. It works as follows:

The base table is divided into smaller, more manageable shards.
A clustering model is trained, with the number of clusters derived from the leaf_node_embedding_count option in tree_ah_options.
The vectors are product quantized and stored in the index tables.
During VECTOR_SEARCH, a candidate list for each query vector is computed efficiently using asymmetric hashing, which is hardware-optimized for approximate distance computations. These candidates are then rescored and reranked using the exact embeddings.
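The last two steps can be illustrated with a minimal, standard-library Python sketch. It is not BigQuery's or ScaNN's implementation: the function names, the toy k-means, and the parameters m (number of subvectors), k (centroids per codebook), and num_candidates are all assumptions made for illustration, and the sharding and tree-clustering steps are left out. It assumes the vector dimension is divisible by m.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def split(vec, m):
    """Cut a vector into m contiguous subvectors of equal length."""
    step = len(vec) // m
    return [vec[j * step:(j + 1) * step] for j in range(m)]


def train_codebooks(vectors, m, k):
    """Product quantization: one small k-means codebook per subspace."""
    return [kmeans([split(v, m)[j] for v in vectors], k, seed=j) for j in range(m)]


def encode(vec, codebooks):
    """Replace each subvector by the index of its nearest codebook centroid."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(subs, codebooks)]


def search(query, vectors, codes, codebooks, top_k, num_candidates):
    """Approximate search over codes, then exact rerank of the candidates."""
    # Asymmetric: the query stays uncompressed, stored vectors are codes.
    subs = split(query, len(codebooks))
    table = [[sq_dist(sub, centroid) for centroid in cb]
             for sub, cb in zip(subs, codebooks)]
    # Approximate distance = one table lookup per subspace, summed.
    approx = sorted((sum(table[j][c] for j, c in enumerate(code)), i)
                    for i, code in enumerate(codes))
    candidates = [i for _, i in approx[:num_candidates]]
    # Rescore the short candidate list with the exact embeddings.
    exact = sorted((sq_dist(query, vectors[i]), i) for i in candidates)
    return [i for _, i in exact[:top_k]]


# Toy data: 200 random 8-dimensional vectors, 4 subvectors, 8 centroids each.
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(vectors, m=4, k=8)
codes = [encode(v, codebooks) for v in vectors]
query = [rng.random() for _ in range(8)]
print(search(query, vectors, codes, codebooks, top_k=5, num_candidates=20))
```

The per-query lookup table is what makes the approximate pass cheap: each stored vector costs m table lookups instead of a full distance computation, and only the num_candidates survivors are compared against their exact embeddings.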

This method performs best when a batch query processes hundreds or more query vectors. Compared to IVF, the use of product quantization can reduce latency and cost, potentially by orders of magnitude. However, because TreeAH carries higher overhead, the IVF algorithm might perform better when there are fewer query vectors.

Google Cloud advises you to try the TreeAH index type if your use case meets the following criteria:

Your table contains no more than 200 million rows.
You regularly run large batch searches with hundreds or even thousands of query vectors.
- With small batch searches, VECTOR_SEARCH with the TreeAH index type may fall back to brute-force search. When that happens, a vector index unused reason is provided to explain why.
Your workflow does not require pre-filtering or stored columns. When pre-filters are used with a TreeAH index, BigQuery treats them as post-filters.
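The practical difference between a pre-filter and a post-filter can be shown with a small brute-force Python sketch. This is an illustration only, not how BigQuery evaluates filters; the rows, the lang column, and the helper names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query, rows, k):
    """Brute-force top-k rows by distance to the query."""
    return sorted(rows, key=lambda r: sq_dist(query, r["embedding"]))[:k]


def pre_filter_search(query, rows, k, keep):
    # Filter first, then take the k nearest of the rows that remain.
    return nearest(query, [r for r in rows if keep(r)], k)


def post_filter_search(query, rows, k, keep):
    # Take the k nearest of all rows first, then drop rows that fail the filter.
    return [r for r in nearest(query, rows, k) if keep(r)]


# Hypothetical rows: the three nearest neighbours are mostly non-English.
rows = [
    {"id": 1, "embedding": [0.0, 0.0], "lang": "en"},
    {"id": 2, "embedding": [0.1, 0.0], "lang": "de"},
    {"id": 3, "embedding": [0.2, 0.0], "lang": "de"},
    {"id": 4, "embedding": [0.9, 0.0], "lang": "en"},
    {"id": 5, "embedding": [1.0, 0.0], "lang": "en"},
]
query = [0.0, 0.0]


def is_english(row):
    return row["lang"] == "en"


pre_ids = [r["id"] for r in pre_filter_search(query, rows, 3, is_english)]
post_ids = [r["id"] for r in post_filter_search(query, rows, 3, is_english)]
print(pre_ids)   # [1, 4, 5]
print(post_ids)  # [1]
```

With a post-filter, the top-k cut happens before the predicate is applied, so a selective filter can leave fewer than k results. That is worth keeping in mind if a query against a TreeAH index relies on a filter.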

The TreeAH vector index preview integrates key components of Google's cutting-edge approximate nearest neighbor techniques into BigQuery. Compared to the inverted file index (IVF), the first index type that Google Cloud built into BigQuery, this new index type offers considerable latency and cost reductions in some scenarios. It is powered by the same underlying technology as some of Google's most well-known services. Read on to learn how the two differ architecturally, how they performed in benchmarks, and when and how to use TreeAH instead of IVF.

An analysis comparing TreeAH indexes vs IVF

A vector index lets BigQuery optimize the lookups and distance calculations needed to find closely matching embeddings. With the IVF and TreeAH indexes, BigQuery can perform approximate nearest neighbor (ANN) search rather than exact nearest neighbor search, trading some accuracy for lower query latency and cost.

BigQuery's original vector index, IVF, uses a scalable k-means clustering algorithm to partition the vector data into clusters. When you search the vector data with the VECTOR_SEARCH function, it identifies the clusters closest to the query vector and ranks only the vectors in those clusters, which significantly reduces the number of distance calculations.
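The idea behind an inverted file index can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not BigQuery's implementation: the simple k-means, the function names, and the num_probes parameter (how many clusters to scan per query) are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivf(vectors, k):
    """Cluster the vectors and record which vector ids fall in each cluster."""
    centroids = kmeans(vectors, k)
    lists = [[] for _ in range(k)]
    for i, v in enumerate(vectors):
        nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, top_k, num_probes):
    """Rank only the vectors in the num_probes clusters closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:num_probes] for i in lists[c]]
    ranked = sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))
    return ranked[:top_k], len(candidates)


# Toy data: 500 random 4-dimensional vectors in 10 clusters.
rng = random.Random(7)
vectors = [[rng.random() for _ in range(4)] for _ in range(500)]
centroids, lists = build_ivf(vectors, k=10)
query = [rng.random() for _ in range(4)]
ids, scanned = ivf_search(query, vectors, centroids, lists, top_k=5, num_probes=2)
print(ids, f"scanned {scanned} of {len(vectors)} vectors")
```

Probing fewer clusters means fewer distance calculations but a higher chance of missing a true neighbor that sits just across a cluster boundary, which is the accuracy-for-cost trade-off of ANN search.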

The new TreeAH index is based on the ScaNN algorithm developed by Google, which powers similarity search across numerous Google services. The primary difference from the IVF index is the use of asymmetric hashing (the "AH" in TreeAH), which compresses embeddings through product quantization. Combined with a distance computation technique optimized for CPUs, vector search with TreeAH can outperform IVF in both speed and cost. Because only the compressed embeddings are stored, index generation can also be ten times faster and cheaper, and it needs less memory.
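To get a feel for why storing only compressed embeddings saves memory, here is a back-of-the-envelope calculation in Python. All the numbers are hypothetical (768-dimensional float32 embeddings, 8 dimensions per subvector, one 8-bit code per subvector) and are not BigQuery's actual settings.

```python
def raw_bytes(dims, bytes_per_value=4):
    """Size of one uncompressed embedding (float32 by default)."""
    return dims * bytes_per_value


def pq_bytes(dims, dims_per_subvector, bits_per_code):
    """Size of one product-quantized embedding: one code per subvector."""
    num_codes = dims // dims_per_subvector
    return num_codes * bits_per_code / 8


# Hypothetical parameters, chosen only for illustration.
raw = raw_bytes(768)                                               # 3072 bytes
compressed = pq_bytes(768, dims_per_subvector=8, bits_per_code=8)  # 96.0 bytes
print(raw, compressed, raw / compressed)
```

Under these assumed parameters each embedding shrinks by a factor of 32. The real ratio depends on the quantization settings the index uses.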

TreeAH performance

To compare TreeAH with IVF, Google Cloud's engineering team ran benchmarks across a range of table configurations and query batch sizes. These are the results:

Latency and cost for vector search queries — Image credit to Google Cloud

Vector index training latency and cost — Image credit to Google Cloud

Important outcomes:

Because the TreeAH index introduces more overhead, the IVF index sometimes performs better than TreeAH for small query batches.
For large query batches, TreeAH performs much better than IVF because of its improved distance-calculation method.
In most cases, TreeAH index training is also much faster and far less expensive than IVF index training.

(*): Block pruning optimization helped the query that used the IVF index.

When to use TreeAH and its existing features

As the benchmarks show, the TreeAH index already performs significantly better than the IVF index when the query batch size is large. Index training is also faster and less expensive.

This index is still under active development, and additional features and performance improvements will be added over time. Currently, the following limitations apply:

There can be no more than 200 million rows in the base table.
For the TreeAH index, pre-filtering and stored columns are not supported.

Getting started with TreeAH

This index type is in public preview and available to use now:

CREATE OR REPLACE VECTOR INDEX <index_name>
ON <my_table>(<embedding_column>)
OPTIONS(distance_type='COSINE', index_type='TREE_AH')
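If you generate this statement from a script, a small helper like the following hedged Python sketch can assemble it. The helper name and its arguments are hypothetical, and it assumes that tree_ah_options is passed as a JSON-formatted string containing leaf_node_embedding_count. Check the current BigQuery documentation before relying on that form. The helper only builds the SQL text; it does not connect to BigQuery, and it does not validate or escape its inputs.

```python
import json


def build_tree_ah_index_ddl(index_name, table, column,
                            distance_type="COSINE",
                            leaf_node_embedding_count=None):
    """Return a CREATE VECTOR INDEX statement for a TreeAH index as a string."""
    options = ["index_type = 'TREE_AH'", f"distance_type = '{distance_type}'"]
    if leaf_node_embedding_count is not None:
        # Assumption: tree_ah_options takes a JSON-formatted string.
        tree_ah = json.dumps({"leaf_node_embedding_count": leaf_node_embedding_count})
        options.append(f"tree_ah_options = '{tree_ah}'")
    return (f"CREATE OR REPLACE VECTOR INDEX {index_name}\n"
            f"ON {table}({column})\n"
            f"OPTIONS({', '.join(options)})")


# Hypothetical names, for illustration only.
ddl = build_tree_ah_index_ddl("my_index", "my_dataset.my_table", "embedding",
                              leaf_node_embedding_count=1000)
print(ddl)
```

Leaving leaf_node_embedding_count unset omits tree_ah_options entirely, which yields the same statement shape as the example above.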
