Real-time Text Analysis: BigQuery’s Cutting-Edge Tools
BigQuery stores enormous volumes of structured and unstructured text data, including information about customers and business operations. This text data can be mined for valuable business insights using BigQuery’s advanced analytical features for search and machine learning (ML).
Text preprocessing, which converts raw or unstructured natural language into machine-readable formats, is an essential step in text analysis and information retrieval pipelines. It is a prerequisite for many text-based operations, such as full-text search indexing and machine learning pipelines. The effectiveness of a text search index often depends heavily on the quality of the tokenization method used. Similarly, the performance of machine learning models depends heavily on the quality of their preprocessed inputs.
Enhancing text searches by using analyzers
A scenario for investigating fraud
Consider a hypothetical fraud investigation and prevention scenario. During the investigation, it would help to search business log data for suspicious activity connected to the reported transactions. The process involves retrieving entries that contain relevant customer information from a logs table produced by day-to-day business operations.
The following pieces of information are of interest:
- Access IP address
- Email address
- Last four digits of the credit card
Building a search index configuration
Information matching a given RE2 regular expression may be extracted and indexed using a search index with the PATTERN_ANALYZER.
We can then apply further steps to the extracted tokens to make the search index more useful and efficient:
- Lowercase the text to enable case-insensitive searches.
- Remove some well-known email addresses, such as test accounts or trusted system emails.
- Remove some fixed or known IP addresses, such as localhost.
Experimenting with different configurations
Before building the search index, which can be a costly and time-consuming process, we can experiment with several configurations using the recently introduced TEXT_ANALYZE function to check that each one behaves as expected.
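The extraction and filtering steps described above can be sketched in plain Python to show what such an experiment checks. This is not BigQuery code and does not call TEXT_ANALYZE: it uses Python's `re` module rather than RE2, and the pattern, the stop-word list, the sample log line, and the `analyze` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical, simplified pattern: IPv4-like addresses, email addresses,
# or a standalone group of four digits. A real configuration would likely
# need a stricter pattern.
PATTERN = re.compile(
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b"      # IPv4-like address
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"    # email address
    r"|\b\d{4}\b"                       # four digits, e.g. a card suffix
)

# Assumed stop words: a trusted system mailbox and localhost.
STOP_WORDS = {"noreply@example.com", "127.0.0.1"}


def analyze(text, lowercase=True, stop_words=frozenset()):
    """Extract pattern matches, then apply the token filters in order."""
    tokens = [m.group(0) for m in PATTERN.finditer(text)]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stop_words]


# Invented sample log line.
log_line = (
    "Login by Jane.Doe@Example.com from 203.0.113.7 "
    "via 127.0.0.1, card ending 4242"
)
```

With the sample line, `analyze(log_line, stop_words=STOP_WORDS)` keeps the lowercased email address, the client IP address, and `4242`, and drops localhost. Calling `analyze(log_line, lowercase=False)` shows the unfiltered tokens, which makes it easy to compare configurations before committing to an index.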
Using the index for search
As a reminder, the SEARCH function works by applying the analyzer (with the given configuration) to both the searched data and the input search query. If the tokens from the search query are a subset of the tokens from the searched data, the function returns TRUE.
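The subset rule can be illustrated with a small Python sketch. It only mirrors the description above and is not BigQuery's implementation: `tokenize` is a stand-in analyzer invented for this example, and returning False for a query with no tokens is an arbitrary choice of the sketch.

```python
import re


def tokenize(text):
    """Stand-in analyzer: lowercase, then split on non-alphanumerics."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}


def search(data, query, analyzer=tokenize):
    """Return True when every query token also appears in the data.

    The same analyzer is applied to both sides, as described in the text.
    """
    query_tokens = analyzer(query)
    # An empty query matches nothing in this sketch (an assumption).
    return bool(query_tokens) and query_tokens <= analyzer(data)
```

For example, `search("Payment declined for user Jane", "jane payment")` is True because both query tokens appear in the data, regardless of order or case, while a query of `"jane refund"` is False because `refund` is missing.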
Similarly, the search index can make it easier to search for IP addresses, customer UUIDs, and other regular words that are not stop words.
This example shows that the new PATTERN_ANALYZER is a useful tool for building an efficient search index for our hypothetical fraud investigation. The text analyzers and their options are designed to be flexible enough to cover a range of use cases.
Text analyzers with BigQuery ML
Two new text preprocessing functions, ML.TF_IDF and ML.BAG_OF_WORDS, have also been announced. Like the existing preprocessing functions, they can be used in the TRANSFORM clause to create ML models with built-in data preprocessing. The use of these functions with text analyzers is shown in the example that follows.
Machine learning use cases for text analyzers mostly involve extracting all text tokens, applying Unicode normalization, and then vectorizing the tokens. This is done by combining the functions above with the recently added TEXT_ANALYZE function.
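To make the two vectorization ideas concrete, here is a minimal Python sketch of bag-of-words counts and TF-IDF weights over already-tokenized documents. It illustrates the general techniques only: the smoothed IDF used here is one common variant and is an assumption, so the values will not necessarily match what ML.TF_IDF returns. The function names and the tiny corpus are invented.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each distinct token to its count within one document."""
    return dict(Counter(tokens))


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Uses relative term frequency and a smoothed inverse document
    frequency (an assumed variant, not necessarily BigQuery's formula).
    """
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return scores


# Invented, pre-tokenized corpus.
docs = [
    ["team", "wins", "final", "match"],
    ["market", "shares", "fall"],
    ["team", "signs", "new", "striker"],
]
```

In this corpus, `team` appears in two of the three documents, so within the first document it receives a lower weight than `wins`, which appears in only one. That is the property that lets TF-IDF downweight terms that are common across the collection.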
Although BigQuery also supports text vectorization with pre-trained ML models, the statistics-based methods described above are easier to use, easier to interpret, and need less processing power. Statistics-based approaches can also outperform methods based on pre-trained models in new domains where little domain-specific data is available for fine-tuning.
In this example, we build a machine learning model that classifies news articles into five categories: technology, business, politics, sports, and entertainment.
Using TEXT_ANALYZE for preparing raw text
The first step in building the classifier is to tokenize the raw news text and preprocess the tokens. We can use the default LOG_ANALYZER with its default delimiter list, without any further configuration, since the defaults are often sufficient.
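The kind of delimiter-based tokenization described here can be approximated in a few lines of Python. This is a rough sketch, not the LOG_ANALYZER itself: the delimiter set below is an abbreviated assumption (the real default list is longer), lowercasing stands in for the analyzer's normalization, and `log_analyze` is a hypothetical helper name.

```python
import re

# Abbreviated, assumed delimiter set; the real default list is longer.
DELIMITERS = " \t\r\n[]<>(){}|!;,'\"*&?+/:=@.-$%\\_"

# Character class that matches one or more consecutive delimiters.
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]+")


def log_analyze(text):
    """Lowercase the text, split on delimiters, and drop empty tokens."""
    return [t for t in _SPLIT_RE.split(text.lower()) if t]
```

For instance, `log_analyze("Champions League: Real Madrid's late winner (2-1)!")` produces the tokens `champions`, `league`, `real`, `madrid`, `s`, `late`, `winner`, `2`, and `1`. The stray `s` shows why it is worth inspecting tokenizer output before training on it.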
Using the TRANSFORM clause in model training
The tokenized data can now be used to train the classifier.
Making inferences using the model
Lastly, we can evaluate the model on a sample from a sports article that is not included in the training data.
These new capabilities significantly extend the existing text analysis toolkit. They give users more control, flexibility, and insight into their data. With the option to perform custom text analysis in several ways, the toolkit is considerably more comprehensive and easier to use.