Thursday, September 19, 2024

Announcing The TreeAH Vector Index Preview On Google Cloud


Google Cloud is pleased to present the TreeAH vector index preview. To make BigQuery the AI-ready data platform for the Gemini era, Google Cloud is constantly adding new features to it. Google Cloud launched vector search earlier this year, allowing users to perform vector similarity searches on BigQuery data, and has since added a number of features, such as pre-filtering and stored columns. Thanks to the platform's scale, performance, and ease of use, BigQuery's vector search and AI capabilities already let users build pipelines and applications for anything from semantic search to LLM-based retrieval-augmented generation (RAG).

TreeAH Index

This is a kind of vector index based on the ScaNN algorithm developed by Google. It functions as follows:

  • The base table is divided into smaller, more manageable shards.
  • A clustering model is trained, with the number of clusters derived from the leaf_node_embedding_count option in tree_ah_options.
  • The vectors are product quantized and stored in the index tables.
  • During VECTOR_SEARCH, a candidate list for each query vector is computed efficiently using asymmetric hashing, which is hardware-optimized for approximate distance calculations. These candidates are then rescored and reranked using the exact embeddings.
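The quantize, shortlist, and rerank steps above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not BigQuery's or ScaNN's implementation: the function names are hypothetical, the codebooks are built by random sampling rather than by a trained clustering model, and sharding is left out entirely.

```python
import random

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vec, num_subspaces):
    """Cut a vector into equally sized sub-vectors, one per subspace."""
    size = len(vec) // num_subspaces
    return [vec[i * size:(i + 1) * size] for i in range(num_subspaces)]

def train_codebooks(vectors, num_subspaces, codes_per_subspace, rng):
    """Toy codebooks: sample sub-vectors as centroids (assumption: a real index trains them)."""
    codebooks = []
    for s in range(num_subspaces):
        subs = [split(v, num_subspaces)[s] for v in vectors]
        codebooks.append(rng.sample(subs, codes_per_subspace))
    return codebooks

def encode(vec, codebooks):
    """Product quantization: replace each sub-vector with the id of its nearest centroid."""
    parts = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: squared_l2(part, cb[c]))
            for part, cb in zip(parts, codebooks)]

def asymmetric_distances(query, codes, codebooks):
    """The query stays uncompressed; stored vectors are only ever touched as codes."""
    parts = split(query, len(codebooks))
    # One lookup table per subspace: distance from the query part to every centroid.
    tables = [[squared_l2(part, centroid) for centroid in cb]
              for part, cb in zip(parts, codebooks)]
    return [sum(tables[s][c] for s, c in enumerate(code)) for code in codes]

def search(query, vectors, codes, codebooks, top_k, num_candidates):
    approx = asymmetric_distances(query, codes, codebooks)
    # Shortlist by approximate distance, then rerank the shortlist with exact embeddings.
    candidates = sorted(range(len(vectors)), key=lambda i: approx[i])[:num_candidates]
    return sorted(candidates, key=lambda i: squared_l2(query, vectors[i]))[:top_k]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(vectors, num_subspaces=4, codes_per_subspace=16, rng=rng)
codes = [encode(v, codebooks) for v in vectors]
result = search(vectors[0], vectors, codes, codebooks, top_k=3, num_candidates=20)
```

Because the per-query lookup tables are built once and then reused for every stored code, the approximate pass needs only table lookups and additions per row, which is the intuition behind the method favoring large batches of work.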

The TreeAH algorithm is optimized for batch queries that process hundreds or more query vectors. Compared to IVF, its use of product quantization can reduce latency and cost, potentially by orders of magnitude. However, because TreeAH carries more overhead, the IVF algorithm might perform better when you have only a small number of query vectors.

Google Cloud advises you to try the TreeAH index type if your use case meets the following criteria:

  • There are no more than 200 million rows in your table.
  • You regularly run large batch queries involving hundreds or even thousands of query vectors.
    • For small batch queries, VECTOR_SEARCH with the TreeAH index type may fall back to brute-force search. When that happens, an index-unused reason is provided to explain why the vector index was not used.
  • Your workflow does not require pre-filtering or stored columns. When pre-filters are used with a TreeAH index, BigQuery treats them as post-filters.
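The difference between pre-filtering and post-filtering matters because a post-filter runs after the top-k neighbors have already been chosen, so it can return fewer rows than requested. The sketch below is a hypothetical illustration that uses a single number in place of an embedding; none of the function names are BigQuery APIs.

```python
def top_k_nearest(rows, query, top_k):
    """Rank rows by distance to the query (a 1-D stand-in for an embedding)."""
    return sorted(rows, key=lambda r: abs(r["value"] - query))[:top_k]

def pre_filter_search(rows, query, top_k, predicate):
    # Filter first, then search: up to top_k matching rows come back.
    return top_k_nearest([r for r in rows if predicate(r)], query, top_k)

def post_filter_search(rows, query, top_k, predicate):
    # Search first, then filter: non-matching neighbors are dropped from the result.
    return [r for r in top_k_nearest(rows, query, top_k) if predicate(r)]

rows = [
    {"id": 1, "value": 0.1, "type": "a"},
    {"id": 2, "value": 0.2, "type": "b"},
    {"id": 3, "value": 0.3, "type": "b"},
    {"id": 4, "value": 0.9, "type": "a"},
]

def is_type_a(row):
    return row["type"] == "a"

pre = pre_filter_search(rows, query=0.0, top_k=2, predicate=is_type_a)
post = post_filter_search(rows, query=0.0, top_k=2, predicate=is_type_a)
print([r["id"] for r in pre])   # [1, 4]
print([r["id"] for r in post])  # [1]
```

If a workload depends on always getting top-k rows that satisfy a filter, this behavior is a reason to stay with an index type that supports pre-filtering.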

The TreeAH vector index preview brings core components of Google's approximate nearest neighbor research into BigQuery. Compared to the inverted file index (IVF), the first index that Google Cloud built into BigQuery, this new index type offers significant latency and cost reductions in some scenarios. It is powered by the same underlying technology as some of Google's most popular services. How significant is the difference? Read on to learn how the two indexes differ architecturally, how they perform in benchmarks, and when and how to use TreeAH instead of IVF.

Comparing TreeAH and IVF indexes

A vector index lets BigQuery optimize the lookups and distance calculations needed to find closely matching embeddings. With the IVF and TreeAH indexes, BigQuery can perform approximate nearest neighbor (ANN) search rather than exact nearest neighbor search, trading some accuracy for lower query latency and cost.
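One common way to quantify the accuracy side of this trade-off is recall: the fraction of the true nearest neighbors that the approximate search also returns. The following is a minimal sketch with made-up vectors and a hand-written "approximate" result, purely to show the calculation; the names are illustrative.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_top_k(query, vectors, top_k):
    """Brute-force search: compare the query against every vector."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return order[:top_k]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [5.0, 1.0]]
query = [0.1, 0.1]

exact = exact_top_k(query, vectors, top_k=3)
# Pretend an approximate index missed one true neighbor and returned row 3 instead.
approx = [0, 1, 3]
recall = recall_at_k(exact, approx)
print(exact)             # [0, 1, 2]
print(round(recall, 2))  # 0.67
```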


BigQuery's first vector index, IVF, uses a scalable k-means clustering algorithm to divide the vector data into clusters. When you search the vector data with the VECTOR_SEARCH function, the index greatly reduces the number of distance calculations by identifying the clusters closest to the query vector and ranking only the vectors in those clusters.
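The IVF idea can be sketched as follows: cluster the vectors, keep one list of row ids per cluster, and at query time scan only the lists whose centroids are nearest to the query. This is a simplified, single-machine sketch with hypothetical names (for example num_probes), not BigQuery's implementation.

```python
import random

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(vec, centroids):
    return min(range(len(centroids)), key=lambda c: squared_l2(vec, centroids[c]))

def kmeans(vectors, k, iterations, rng):
    """Plain k-means; an empty cluster simply keeps its previous centroid."""
    centroids = rng.sample(vectors, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest_centroid(v, centroids)].append(v)
        centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: for each cluster, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, top_k, num_probes):
    # Probe only the clusters whose centroids are closest to the query.
    probed = sorted(range(len(centroids)),
                    key=lambda c: squared_l2(query, centroids[c]))[:num_probes]
    candidates = [idx for c in probed for idx in lists[c]]
    ranked = sorted(candidates, key=lambda i: squared_l2(query, vectors[i]))
    return ranked[:top_k], len(candidates)

rng = random.Random(1)
vectors = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(300)]
centroids = kmeans(vectors, k=10, iterations=5, rng=rng)
lists = build_ivf(vectors, centroids)
neighbors, scanned = ivf_search(vectors[5], vectors, centroids, lists,
                                top_k=3, num_probes=2)
print(neighbors, "found after scanning", scanned, "of", len(vectors), "vectors")
```

Probing more clusters raises the chance of finding the true nearest neighbors at the cost of more distance calculations, which is the accuracy versus cost dial of this kind of index.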

The new TreeAH index is based on the ScaNN algorithm developed by Google, which powers similarity search across numerous Google services. The primary difference from the IVF index is the use of asymmetric hashing (the "AH" in TreeAH), which compresses embeddings through product quantization. Combined with a CPU-optimized distance computation algorithm, vector search with TreeAH can outperform IVF in both speed and cost. Because only the compressed embeddings are stored, index creation can also be ten times cheaper and faster, with a smaller memory footprint.

TreeAH performance

To compare TreeAH with IVF, Google Cloud's engineering team ran benchmarks across a range of table configurations and query batch sizes. The results are as follows:

[Figure: Latency and cost for vector search queries. Image credit: Google Cloud]
[Figure: Vector index training latency and cost. Image credit: Google Cloud]

Important outcomes:

  • For small query batches, the IVF index sometimes performs better than TreeAH, because the TreeAH index introduces more overhead.
  • For large query batches, TreeAH performs much better than IVF because of its improved distance-calculation method.
  • In most cases, TreeAH index training is also much faster and far less expensive than IVF index training.

(*): Block pruning optimization helped the query that used the IVF index.

When to use TreeAH and its current limitations

As the benchmarks show, the TreeAH index already performs significantly better than the IVF index when the query batch size is large. Index training is also faster and less expensive.

This index is still under active development, and additional features and performance improvements will be added over time. The following limitations currently apply:

  • There can be no more than 200 million rows in the base table.
  • For the TreeAH index, pre-filtering and stored columns are not supported.

Getting started with TreeAH

This index type is currently available for use and is in public preview:

CREATE OR REPLACE VECTOR INDEX <index_name>
ON <my_table>(<embedding_column>)
OPTIONS(distance_type='COSINE', index_type='TREE_AH')
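If you generate this statement from code, a small helper can keep the options consistent. The sketch below only assembles the SQL text; the helper name, the example table and column names, and the value 1000 for leaf_node_embedding_count are all illustrative assumptions, and the resulting string would still need to be submitted through whichever BigQuery client you use.

```python
import json

def build_tree_ah_index_sql(index_name, table, column,
                            distance_type="COSINE",
                            leaf_node_embedding_count=None):
    """Assemble a CREATE VECTOR INDEX statement for a TreeAH index (text only)."""
    options = ["index_type = 'TREE_AH'", f"distance_type = '{distance_type}'"]
    if leaf_node_embedding_count is not None:
        # tree_ah_options is passed as a JSON-formatted string.
        tree_ah_options = json.dumps(
            {"leaf_node_embedding_count": leaf_node_embedding_count})
        options.append(f"tree_ah_options = '{tree_ah_options}'")
    return (
        f"CREATE OR REPLACE VECTOR INDEX {index_name}\n"
        f"ON {table}({column})\n"
        f"OPTIONS({', '.join(options)})"
    )

# Hypothetical names and an arbitrary option value, for illustration only.
sql = build_tree_ah_index_sql("my_index", "my_dataset.my_table", "embedding",
                              leaf_node_embedding_count=1000)
print(sql)
```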

Drakshi
Since June 2023, Drakshi has been writing articles on artificial intelligence for govindhtech. She is a postgraduate in business administration and an artificial intelligence enthusiast.