BigQuery Text Analyzer Released

November 13, 2023

Real-time Text Analysis: BigQuery’s Cutting-Edge Tools

BigQuery stores enormous volumes of structured and unstructured text data, including information about customers and company activities. These text data sets can be mined for business insights using BigQuery’s analytical tools for search and machine learning (ML).

Text preprocessing, which converts unstructured or raw natural language into machine-readable formats, is an essential stage in text analysis and information retrieval pipelines. Many text-based operations, such as full-text search indexing and machine learning pipelines, require it as a prerequisite. In many cases, the effectiveness of a text search index depends heavily on the quality of the tokenization method used. Similarly, the performance of machine learning models depends heavily on the quality of the preprocessed inputs.

Enhancing text searches by using analyzers

A scenario for investigating fraud

Consider a hypothetical fraud investigation and prevention scenario. During the investigation, it helps to search the company’s log data for unusual activity connected to the reported transactions. This involves retrieving entries that contain relevant customer information from a logs table produced by routine business operations.

The following information may be useful:

Customer ID
IP address of access
Email address
Last four digits of the credit card

Building a search index configuration

Information matching a given RE2 regular expression may be extracted and indexed using a search index with the PATTERN_ANALYZER.

We can then post-process the extracted tokens to make the search index more useful and efficient:

Lowercase the text to enable case-insensitive searches.
Remove a few known email addresses, such as test or trusted system addresses.
Remove some fixed, known IP addresses, such as localhost.
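
The extraction and filtering steps above can be sketched in plain Python. This is only a conceptual illustration, not BigQuery’s implementation: Python’s `re` module stands in for RE2, and the pattern, the stop-word entries, and the sample log line are all hypothetical.

```python
import re

# Hypothetical pattern: match email addresses or IPv4 addresses.
# Python's `re` is used here as a stand-in for RE2.
PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\d{1,3}(?:\.\d{1,3}){3}")

# Hypothetical stop words: a trusted system mailbox and localhost.
STOP_WORDS = {"noreply@example.com", "127.0.0.1"}


def analyze(text, lowercase=True, stop_words=STOP_WORDS):
    """Extract pattern matches, optionally lowercase them, then drop stop words."""
    tokens = PATTERN.findall(text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in stop_words]


log_line = "Login by Jane.Doe@Example.com from 203.0.113.7 via 127.0.0.1; cc NoReply@example.com"
print(analyze(log_line))  # ['jane.doe@example.com', '203.0.113.7']
```

Lowercasing runs before the stop-word filter, so the stop words only need to be listed in lowercase.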

Experimenting with different configurations

Building a search index can be costly and time-consuming, so before doing so we can experiment with several configurations using the recently introduced TEXT_ANALYZE function to check that each one behaves as expected.

Using the index for search

As a reminder, the SEARCH function works by applying the analyzer (with the given configuration) to both the searched data and the input search query. It returns TRUE if the tokens from the search query are a subset of the tokens from the searched data.
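
The subset rule can be illustrated with a short Python sketch. The `tokenize` helper is a hypothetical stand-in for whichever analyzer is configured, and edge cases such as an empty query are not modeled.

```python
import re


def tokenize(text):
    # Hypothetical stand-in analyzer: lowercase runs of ASCII letters and digits.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(data, query):
    """Return True when every query token also appears among the data tokens."""
    return tokenize(query) <= tokenize(data)


row = "Refund issued to customer 4821"
print(search(row, "customer refund"))      # True
print(search(row, "customer chargeback"))  # False
```

Because the same analyzer is applied to both sides, any normalization it performs (lowercasing here) affects the query and the data in the same way.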

Similarly, the search index can help when searching for IP addresses, customer UUIDs, and other regular words that are not stop words.

We can see from this example that the new PATTERN_ANALYZER is a useful tool for building an efficient search index for our hypothetical fraud investigation. The text analyzers and their options are designed to be adaptable to a range of use cases.

Text analyzers with BigQuery ML

Two new text preprocessing functions, ML.TF_IDF and ML.BAG_OF_WORDS, were also announced. Like the existing preprocessing functions, they can be used in the TRANSFORM clause to create ML models with built-in data preprocessing. The example that follows shows how to use them with text analyzers.

Machine learning use cases for text analyzers are mostly concerned with extracting all text tokens, applying Unicode normalization, and then vectorizing them. This is done by combining the functions above with the recently added TEXT_ANALYZE function.

Although BigQuery also supports text vectorization with pre-trained ML models, the statistics-based methods described above are easier to use, more interpretable, and need less processing power. Furthermore, statistics-based approaches often outperform methods based on pre-trained models in novel domains where little domain-specific data is available for fine-tuning.

In this example, we explore building a machine learning model that classifies news articles into five categories: technology, business, politics, sports, and entertainment.
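
To make the statistics-based idea concrete, here is a minimal Python sketch of the two vectorizations. It uses the textbook definitions (raw counts for bag of words, term frequency times log(N / document frequency) for TF-IDF); the exact weighting used by ML.BAG_OF_WORDS and ML.TF_IDF may differ, and the sample tokens are made up.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return dict(Counter(tokens))


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical pre-tokenized documents.
docs = [["goal", "match", "goal"], ["stock", "match"]]
print(bag_of_words(docs[0]))  # {'goal': 2, 'match': 1}
print(tf_idf(docs)[0])
```

A term that appears in every document, such as "match" here, gets a weight of zero, while terms concentrated in one document are weighted up.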

Using TEXT_ANALYZE for preparing raw text

The first step in building the classifier is to tokenize the raw news text and preprocess the tokens. The default LOG_ANALYZER and its default delimiter list are often sufficient, so we can use them without any further configuration.
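
As a rough illustration of delimiter-based tokenization, the Python sketch below normalizes the text and splits it on a fixed set of delimiters. The delimiter set is a hypothetical example and not LOG_ANALYZER’s actual default list, and the NFKC normalization and lowercasing are chosen for illustration only.

```python
import unicodedata

# Hypothetical delimiter set for illustration; not LOG_ANALYZER's actual defaults.
DELIMITERS = " \t\n.,;:!?\"'()[]{}<>/\\|=&@-_"


def log_style_tokens(text):
    """Normalize the text, then split it on any of the delimiter characters."""
    normalized = unicodedata.normalize("NFKC", text).lower()
    for delimiter in DELIMITERS:
        normalized = normalized.replace(delimiter, " ")
    return normalized.split()


print(log_style_tokens("Stocks RALLY, tech-sector gains!"))
# ['stocks', 'rally', 'tech', 'sector', 'gains']
```

The resulting token list is the kind of input that a bag-of-words or TF-IDF step would then vectorize.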

Using the TRANSFORM clause in model training

The tokenized data can now be used to train our classifier.

Making inferences using the model

Finally, we can test the model on a sample from a sports article that is not part of the training data.

In summary

These new capabilities extend BigQuery’s text analysis toolkit and significantly improve its existing functionality. They give users more control, flexibility, and insight into their data. With the option to perform custom text analysis in several ways, the toolkit is more comprehensive and easier to use.
