Friday, September 20, 2024

ANN Search: Approximate Nearest Neighbor Search Comes to Spanner


Spanner now supports approximate nearest neighbor search, also known as ANN search.

Vector search is a technique for finding items in a large dataset that are similar to a given query item. It is especially useful for unstructured data such as text, music, or photos, where conventional search techniques that rely on exact matches may be less effective. Vector search is also essential to generative AI applications because it can be used to augment large language model (LLM) prompts, improving their relevance and reducing hallucinations. When this capability is built into a general-purpose operational database, scalable vector search is available directly alongside your data, so you do not need to manage a separate database or ETL pipeline.


Approximate Nearest Neighbor Search

Spanner exact k-nearest neighbor (KNN) vector search, currently in preview, is a good fit for highly partitionable workloads, such as searching through personal photos, where each query involves relatively few entities. Spanner customers have been using KNN vector search increasingly since it was introduced earlier this year. For large-scale unpartitioned workloads, you can now also use Spanner’s approximate nearest neighbor (ANN) search, which offers:

  • Scale and speed: fast, high-recall search that scales to more than 10B vectors
  • Operational simplicity: no need to copy your data to a dedicated vector database
  • Consistency: results always reflect the most recent updates

With the addition of Approximate Nearest Neighbor Search, Spanner can now handle your vector search needs in a highly scalable and efficient way.

Making Use of ANN

Let’s walk through the Spanner technologies that enable vector search at scale.

Spanner uses ScaNN (Scalable Nearest Neighbors), Google Research’s highly efficient vector similarity search algorithm, which is a core component of many Google and Google Cloud applications. The main ScaNN-based optimizations currently available in Spanner include:

  • Clustering embeddings into a tree-like structure so that the search space can be pruned quickly at query time, trading some accuracy for a significant performance gain
  • Quantizing raw vector embeddings to reduce storage requirements and speed up scoring
  • Optimizing distance computation by focusing on the most relevant parts of the vectors, which improves partitioning and ranking for better recall
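
To make the quantization idea concrete, here is a minimal Python sketch of plain scalar quantization. This is an illustrative stand-in, not the scheme ScaNN or Spanner actually uses; the function names and the sample vector are made up for the example. It shows how floats can be stored as small integer codes at the cost of a bounded reconstruction error.

```python
import math

def quantize(vec, levels=255):
    """Map each float to an integer code in [0, levels] using the vector's own range."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Rebuild an approximation of the original floats from the integer codes."""
    return [lo + c * scale for c in codes]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 4-dimensional embedding (real embeddings have hundreds of dimensions).
original = [0.12, -0.53, 0.98, 0.07]
codes, lo, scale = quantize(original)
restored = dequantize(codes, lo, scale)

# Each component is off by at most half a quantization step.
error = euclidean(original, restored)
```

Each code fits in one byte instead of a 4- or 8-byte float, which is where the storage and scoring savings come from in this simplified picture.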

To run ANN search in Spanner, you create a vector index on the vector embeddings using standard SQL DDL, specifying the shape of the search tree and a distance type.

Tree shape

The tree can have two or three levels, depending on the size of the dataset. Three-level trees have an additional layer of branch nodes between the root and the leaves, which provides hierarchical partitioning that scales to datasets of 10B+ vectors. The root and branch nodes store centroids, which are themselves embeddings computed to represent the leaf partitions.

ANN search

At query time, the incoming query embedding is first compared with all of the centroids stored in the root. Only the closest centroids, along with their corresponding leaves (in a two-level tree) or branches (in a three-level tree), are selected for further evaluation. In the accompanying diagram, the grey blocks indicate the pruned portion of the search tree, which makes up the great majority of the tree, while the blue blocks represent the portion that is read for scoring.

The number of branches and leaves in the tree is set when the index is built, while the number of leaves to search can be changed at query time. By tuning these indexing and query parameters, you can reach the tradeoff between performance and accuracy that you want.
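
The query flow for a two-level tree can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Spanner's implementation: the `index` layout, the `ann_search` function, the `leaves_to_search` parameter name, and all vectors are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical two-level index: the root holds one centroid per leaf,
# and each leaf holds (id, embedding) pairs. All values are made up.
index = [
    {"centroid": [0.0, 0.0], "items": [("a", [0.1, 0.0]), ("b", [-0.2, 0.1])]},
    {"centroid": [5.0, 5.0], "items": [("c", [5.1, 4.9]), ("d", [4.8, 5.3])]},
    {"centroid": [9.0, 0.0], "items": [("e", [9.2, 0.1]), ("f", [8.7, -0.4])]},
]

def ann_search(index, query, k, leaves_to_search):
    # Step 1: compare the query with every centroid stored in the root.
    ranked = sorted(index, key=lambda leaf: euclidean(leaf["centroid"], query))
    # Step 2: keep only the closest leaves; the rest are pruned without being read.
    candidates = [item for leaf in ranked[:leaves_to_search] for item in leaf["items"]]
    # Step 3: score the surviving embeddings and return the ids of the top k.
    candidates.sort(key=lambda item: euclidean(item[1], query))
    return [item_id for item_id, _ in candidates[:k]]

result = ann_search(index, [5.0, 4.5], k=2, leaves_to_search=1)
print(result)  # ['c', 'd']
```

Searching more leaves reads more of the tree but lowers the chance that a true neighbor sits in a pruned partition, which is the accuracy and performance tradeoff in miniature.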

Distance functions

Distance functions are used to measure how similar two vector embeddings are. Spanner already offers exact distance functions for KNN search. For Approximate Nearest Neighbor Search, Google Cloud is introducing the following approximate distance functions for use with a vector index:

  • APPROX_COSINE_DISTANCE()
  • APPROX_EUCLIDEAN_DISTANCE()
  • APPROX_DOT_PRODUCT()

The desired distance type must be specified as an index option when the vector index is created. The three possible values are COSINE, EUCLIDEAN, and DOT_PRODUCT.

Queries against the vector index must use the matching distance function. For example, an index created with the COSINE distance type is queried with APPROX_COSINE_DISTANCE().
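
For reference, the three underlying measures can be written out exactly in plain Python. This sketch shows the standard mathematical definitions only; it says nothing about how Spanner's approximate functions are computed internally, and the function names are chosen for this example.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 minus cosine similarity; assumes neither vector is all zeros.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1.0 - dot_product(a, b) / (norm_a * norm_b)

# Two orthogonal example vectors.
u = [1.0, 0.0]
v = [0.0, 2.0]

print(dot_product(u, v))         # 0.0
print(euclidean_distance(u, v))  # square root of 5
print(cosine_distance(u, v))     # 1.0
```

Note that for the two distances a smaller value means a closer match, whereas for the dot product a larger value means the vectors are more similar.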

Example: searching for mutual funds

Consider the example from the ANN demo. A client is looking for the best mutual funds for their investments. Each row in the MutualFunds table represents one mutual fund:

  • FundId is the mutual fund’s unique identifier, which can be cross-referenced in other tables.
  • FundName is the mutual fund’s full name.
  • InvestmentStrategy is an unstructured text string that describes the fund’s strategy.
  • attributes is a protocol buffer that holds structured information about the fund, such as its return.
  • InvestmentStrategyEmbedding is the vector representation of the investment strategy, which can be generated with the ML.PREDICT function.

Advanced use cases

Post-filtering

Approximate Nearest Neighbor Search results can be filtered by adding conditions to the WHERE clause of the query. For example, a query can return only the funds that match the query strategy and have a 10-year return above 8%.
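
The effect of post-filtering can be illustrated with a short Python sketch. It is not a Spanner query: the `ann_results` tuples, the fund ids, and the return figures are invented for the example, and `post_filter` is a hypothetical helper.

```python
# Hypothetical ANN output: (fund_id, distance, ten_year_return), nearest first.
ann_results = [
    (101, 0.08, 0.095),
    (205, 0.11, 0.062),
    (317, 0.15, 0.081),
    (422, 0.19, 0.074),
]

def post_filter(results, min_return):
    # Keep the ANN ordering; drop rows that fail the structured predicate.
    return [fund_id for fund_id, _, ret in results if ret > min_return]

filtered = post_filter(ann_results, 0.08)
print(filtered)  # [101, 317]
```

Because the filter runs on the nearest-neighbor candidates, the final result can contain fewer rows than the number of neighbors requested.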

Multi-model queries

If the FundName column also has a full-text search index, you can run queries that find funds with “emerging markets” in their name that also match the query strategy, which could be the vector representation of a phrase such as “balanced low risk socially responsible fund”.

Alternatively, you can look for funds that either match the desired strategy or have “emerging markets” in their name. This finds both text matches and semantic matches that do not necessarily contain those words.

In this way, you can either filter regular text search results by semantic relevance or augment them with semantically relevant results.
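
The two ways of combining text and semantic matches described above amount to an intersection and a union. Here is a minimal Python sketch, assuming hypothetical fund ids and precomputed match lists; it does not model how Spanner executes such queries.

```python
# Hypothetical ids of funds whose name contains "emerging markets".
text_matches = {11, 12, 13}
# Hypothetical ANN result for the strategy embedding, nearest first.
semantic_matches = [12, 14, 11, 15]

# Filter: funds that satisfy both conditions, kept in semantic order.
both = [f for f in semantic_matches if f in text_matches]

# Augment: semantic matches first, then text-only matches not already listed.
either = semantic_matches + sorted(text_matches - set(semantic_matches))

print(both)    # [12, 11]
print(either)  # [12, 14, 11, 15, 13]
```

How the combined list is ordered is a design choice; this sketch simply ranks semantic matches ahead of text-only ones.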

Thota nithya
Thota Nithya has been writing Cloud Computing articles for govindhtech from APR 2023. She was a science graduate. She was an enthusiast of cloud computing.