Generative AI in BigQuery DataFrames turns customer feedback into opportunities
To run a successful business, you must understand your customers’ needs and learn from their feedback. However, extracting actionable information from customer feedback is difficult. Examining and categorizing feedback can reveal your customers’ product pain points, but doing so manually becomes slow and error-prone as the volume of feedback grows.
Several new generative AI and ML capabilities in Google Cloud can help you build a scalable solution to this problem by allowing you to analyze unstructured customer feedback and identify top product issues.
This blog post shows how to build a solution to turn raw customer feedback into actionable intelligence.
Our solution segments and summarizes customer feedback narratives from a large dataset. To demonstrate it, we’ll use the CFPB Consumer Complaint Database, a BigQuery public dataset that contains diverse, unstructured consumer complaints about financial products and services.
The core Google Cloud capabilities we’ll use to build this solution are:
Text-bison foundation model: a large language model trained on massive text and code datasets. It can generate text, translate languages, write creative content, and answer questions. It’s available through Generative AI on Vertex AI.
Textembedding-gecko model: a text embedding model that converts text into numerical vectors that machine learning algorithms can consume. These vector representations capture the semantics and context of words. It’s also available through Generative AI on Vertex AI.
BigQuery ML k-means model: clusters data for segmentation. Because k-means is unsupervised, model training and evaluation require neither labels nor data splitting.
BigQuery DataFrames for ML and generative AI: BigQuery DataFrames is an open-source Python client that compiles popular Python APIs into scalable BigQuery SQL queries and API calls, simplifying interaction with BigQuery and Google Cloud.
With BigQuery DataFrames, data scientists can move from data exploration to production by deploying Python code as BigQuery programmable objects and integrating with data engineering pipelines, BigQuery ML, Vertex AI, LLMs, and other Google Cloud services. See the documentation for ML use cases and supported ML capabilities.
Build a feedback segmentation and summarization solution
To follow along, copy the notebook “Use BigQuery DataFrames to cluster and characterize complaints” and run this solution in Colab with your own Google Cloud project.
Data loading and preparation
To use BigQuery DataFrames, first import its pandas module and set the Google Cloud project and location for the BigQuery session.
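A minimal setup sketch, assuming the bigframes package is installed ("your-project-id" is a placeholder):

```python
import bigframes.pandas as bpd

# Placeholder -- replace with your own Google Cloud project ID.
bpd.options.bigquery.project = "your-project-id"
# The CFPB public dataset lives in the US multi-region.
bpd.options.bigquery.location = "us"
```

After this, reading a table creates a BigQuery session scoped to that project and location.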
You can manipulate and transform this DataFrame with bigframes.pandas as you would with pandas, but the computation happens in BigQuery rather than in your local environment. BigQuery DataFrames supports more than 400 pandas functions; see the documentation for the full list.
This solution isolates the DataFrame’s consumer_complaint_narrative column, which contains the original complaint as unstructured text, and drops rows where that field is NULL using the pandas dropna() method.
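As a sketch, the loading and cleanup step might look like the following. The table ID is the public CFPB dataset; the bigframes import sits inside the function so the snippet only touches BigQuery when called, and it requires a configured session and Google Cloud credentials:

```python
CFPB_TABLE = "bigquery-public-data.cfpb_complaints.complaint_database"

def load_complaint_narratives():
    # Requires `pip install bigframes` plus Google Cloud credentials.
    import bigframes.pandas as bpd

    df = bpd.read_gbq(CFPB_TABLE)
    # Keep only the free-text complaint column and drop NULL narratives.
    return df[["consumer_complaint_narrative"]].dropna()
```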
Before applying a clustering model to unstructured text, you must convert the text into embeddings — numerical vectors. Fortunately, BigQuery DataFrames can create these embeddings with PaLM2TextEmbeddingGenerator, which wraps the textembedding-gecko model.
Import the model and use it to generate an embedding for each row, producing a new DataFrame that contains both the embedding and the original unstructured text.
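A sketch of the embedding step, again wrapped in a function so it only runs against BigQuery when called with a DataFrame of narratives:

```python
def embed_narratives(narratives):
    # PaLM2TextEmbeddingGenerator wraps the textembedding-gecko model;
    # it requires a configured bigframes session and Vertex AI access.
    from bigframes.ml.llm import PaLM2TextEmbeddingGenerator

    model = PaLM2TextEmbeddingGenerator()
    # predict() returns a DataFrame containing an embedding vector
    # alongside the original text for each row.
    return model.predict(narratives)
```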
You can train the k-means model with the 10,000 complaint text embeddings.
K-means clustering is an unsupervised machine learning algorithm that divides data points into a predefined number of clusters by minimizing the distance between each point and its cluster center while maximizing the separation between clusters.
The bigframes.ml package provides the k-means model. The following code imports it, trains it on the embeddings with 10 clusters, and predicts a cluster for each complaint in the DataFrame.
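A sketch of that step — KMeans in bigframes.ml mirrors the scikit-learn estimator API but trains the model inside BigQuery ML (the function wrapper and default of 10 clusters follow the setup described above):

```python
def cluster_embeddings(embeddings, n_clusters=10):
    # Trains a BigQuery ML k-means model on the embedding vectors.
    from bigframes.ml.cluster import KMeans

    model = KMeans(n_clusters=n_clusters)
    model.fit(embeddings)
    # predict() assigns each complaint to its nearest cluster centroid.
    return model.predict(embeddings)
```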
LLM model prompt
You now have ten groups of complaints. How do the complaints in each cluster differ? A large language model (LLM) can help explain. As an example, we’ll use the LLM to compare the complaints in two clusters.
First, prepare the LLM prompt: take five complaints each from cluster #1 and cluster #2, and join them with text asking the LLM to identify the biggest difference between the two groups.
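A sketch of the prompt assembly — the sample complaints and the exact prompt wording here are illustrative, not the notebook’s literal text:

```python
def build_comparison_prompt(cluster1_texts, cluster2_texts):
    # Join the sampled complaints from each cluster and ask the LLM
    # to contrast the two groups.
    return (
        "comment list 1:\n" + "\n".join(cluster1_texts)
        + "\n\ncomment list 2:\n" + "\n".join(cluster2_texts)
        + "\n\nPlease highlight the most obvious difference "
          "between the two lists of comments:"
    )

# Hypothetical sampled complaints, for illustration only.
cluster1 = ["My credit report contains an account I never opened."] * 5
cluster2 = ["The debt collector keeps calling about a loan I already repaid."] * 5
prompt = build_comparison_prompt(cluster1, cluster2)
```

The resulting prompt string can then be passed to the text-bison model — exposed in bigframes as PaLM2TextGenerator in bigframes.ml.llm — via its predict method to get the comparison.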
The LLM provides a clear, insightful assessment of how the two clusters differ. You could extend this solution to generate insights and summaries for the complaints in every cluster.