Creating reusable and modular feature preparation in BigQuery ML
Feature engineering, a preprocessing phase in machine learning, is crucial for turning raw data into useful features. BigQuery ML has advanced significantly in this area, providing data scientists and ML developers with a flexible range of preprocessing functions for feature engineering. These transformations can also be embedded in models, which guarantees their portability from BigQuery to serving environments such as Vertex AI. BigQuery ML is now expanding on this with modularity, a distinctive approach to feature engineering that makes it simple to reuse feature pipelines inside BigQuery and to transfer them directly to Vertex AI.
Preprocessing features using the TRANSFORM clause
A TRANSFORM clause may be included in the CREATE MODEL statement when building a model in BigQuery ML. It lets you apply preprocessing functions that specify how columns from the SELECT query are transformed into model features. This is a major benefit because the statistics used for each transformation are computed from the data used to create the model and are stored with it.
This gives a consistency of preprocessing comparable to other frameworks, notably the Transform component of the TFX framework, and helps avoid training/serving skew. Even without a TRANSFORM clause, automatic transformations are applied depending on the model type and data type.
In the example from the accompanying tutorial, preprocessing functions impute missing values in the input query before the data reaches the model. The TRANSFORM clause then adds preprocessing that scales the columns. This scaling is applied to the already imputed input data and is embedded in the model. A benefit of the embedded scaling functions is that the model saves the computed scaling parameters and applies them again when it is used for inference.
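A minimal sketch of such a statement is shown here. It assumes the public `bigquery-public-data.ml_datasets.penguins` table; the dataset `my_dataset`, the model name, the model type, and the choice of imputation strategies and scalers are illustrative assumptions, not the tutorial's exact code.

```sql
-- Hypothetical names: my_dataset, penguin_classifier.
CREATE OR REPLACE MODEL `my_dataset.penguin_classifier`
TRANSFORM (
  species,
  island,
  sex,
  -- Scaling statistics are computed from the training data
  -- and saved inside the model for reuse at inference time.
  ML.STANDARD_SCALER(body_mass_g) OVER () AS body_mass_g,
  ML.STANDARD_SCALER(culmen_length_mm) OVER () AS culmen_length_mm,
  ML.STANDARD_SCALER(culmen_depth_mm) OVER () AS culmen_depth_mm,
  ML.STANDARD_SCALER(flipper_length_mm) OVER () AS flipper_length_mm
)
OPTIONS (
  model_type = 'LOGISTIC_REG',      -- assumed model type, for illustration only
  input_label_cols = ['species']
) AS
-- Imputation happens in the input query, before the TRANSFORM clause,
-- so it is NOT embedded in the model.
SELECT
  species,
  island,
  ML.IMPUTER(sex, 'most_frequent') OVER () AS sex,
  ML.IMPUTER(body_mass_g, 'mean') OVER () AS body_mass_g,
  ML.IMPUTER(culmen_length_mm, 'median') OVER () AS culmen_length_mm,
  ML.IMPUTER(culmen_depth_mm, 'median') OVER () AS culmen_depth_mm,
  ML.IMPUTER(flipper_length_mm, 'median') OVER () AS flipper_length_mm
FROM `bigquery-public-data.ml_datasets.penguins`;
```

Because only the scaling lives in the TRANSFORM clause, any data sent to this model for prediction would still need to be imputed separately.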
Reusable preprocessing with the ML.TRANSFORM function
The new ML.TRANSFORM table function gives direct access to the feature engineering portion of a model. This enables several useful workflows, such as:
- Using one model's transformations to transform the inputs of another model.
When the ML.TRANSFORM function is applied directly to the input data, the scaling parameters do not need to be recomputed from the original training data. This makes it efficient to reuse the transformations for subsequent models, for further data review, and for detecting skew and drift in model monitoring calculations.
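As a rough sketch of this usage, the query below applies the TRANSFORM clause of an existing model to a table. The model name `my_dataset.penguin_classifier` is hypothetical and is assumed to have been created with a TRANSFORM clause that scales the four numeric penguin columns.

```sql
-- Apply only the feature engineering part of a trained model.
-- The stored scaling parameters are reused; nothing is refit.
SELECT *
FROM ML.TRANSFORM(
  MODEL `my_dataset.penguin_classifier`,  -- hypothetical model name
  (
    SELECT
      species,
      island,
      sex,
      body_mass_g,
      culmen_length_mm,
      culmen_depth_mm,
      flipper_length_mm
    FROM `bigquery-public-data.ml_datasets.penguins`
  )
);
```

The result can be stored in a table and used as training input for another model, or compared against the original training features when checking for skew and drift.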
Preprocessing in modules using TRANSFORM_ONLY models
To take reusability to the next level of modularity, create transformation-only models. This works like any other model: use CREATE MODEL with a TRANSFORM clause and set the option model_type = 'TRANSFORM_ONLY'. In other words, it produces a model object that consists only of the feature engineering portion of the pipeline. The transform model can then be used to transform the inputs of any CREATE MODEL statement, and it can also be registered in the Vertex AI Model Registry for use in ML pipelines outside of BigQuery. For full portability, the model can also be exported to Cloud Storage (GCS).
The TRANSFORM clause is compiled into a model using a standard CREATE MODEL statement. In this case, all of the imputation steps are stored in a single model object, which remembers the mean and median values of the training data and uses them to impute later records, even at inference time.
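A transformation-only imputation model along these lines might look as follows. This is a sketch under stated assumptions: the dataset `my_dataset`, the model name `penguin_imputer`, and the strategy chosen for each column are illustrative.

```sql
-- Hypothetical names: my_dataset, penguin_imputer.
CREATE OR REPLACE MODEL `my_dataset.penguin_imputer`
TRANSFORM (
  species,
  island,
  -- The fitted values (most frequent, mean, median) are stored in the model.
  ML.IMPUTER(sex, 'most_frequent') OVER () AS sex,
  ML.IMPUTER(body_mass_g, 'mean') OVER () AS body_mass_g,
  ML.IMPUTER(culmen_length_mm, 'median') OVER () AS culmen_length_mm,
  ML.IMPUTER(culmen_depth_mm, 'median') OVER () AS culmen_depth_mm,
  ML.IMPUTER(flipper_length_mm, 'median') OVER () AS flipper_length_mm
)
OPTIONS (
  model_type = 'TRANSFORM_ONLY'  -- no estimator, only the preprocessing
) AS
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`;
```

New records can then be imputed with `ML.TRANSFORM(MODEL ...)` against this model, using the statistics learned from the training data rather than those of the new records.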
Pipelines for features
Because TRANSFORM_ONLY models are modular, several of them can be used together in a feature pipeline. The WITH clause (CTEs) of BigQuery SQL keeps the feature pipeline highly readable. This idea allows easy and flexible use of feature-level transformation models, much like a feature store.
As an illustration of this idea, create a TRANSFORM_ONLY model for each of the following features: body_mass_g, culmen_length_mm, culmen_depth_mm, and flipper_length_mm. In this case they are used to scale columns into features, just as the full model built at the start did.
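The sketch below shows one such per-feature model and a feature pipeline that chains four of them with CTEs. All model and dataset names are hypothetical, as are the registry IDs. It assumes the other three models are created the same way, that each model passes the remaining columns through in its TRANSFORM clause so the steps can be chained, and that ML.TRANSFORM returns the columns listed in that clause.

```sql
-- One transform-only model per feature. Repeat this pattern for
-- culmen_length_mm, culmen_depth_mm, and flipper_length_mm.
CREATE OR REPLACE MODEL `my_dataset.scale_body_mass_g`
TRANSFORM (
  -- Pass the other columns through unchanged so models can be chained.
  species,
  island,
  sex,
  culmen_length_mm,
  culmen_depth_mm,
  flipper_length_mm,
  ML.STANDARD_SCALER(body_mass_g) OVER () AS body_mass_g
)
OPTIONS (
  model_type = 'TRANSFORM_ONLY',
  model_registry = 'VERTEX_AI',              -- register in Vertex AI Model Registry
  vertex_ai_model_id = 'scale_body_mass_g'   -- hypothetical registry ID
) AS
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`;

-- Feature pipeline: each CTE applies one transform-only model
-- to the output of the previous step.
WITH
  raw AS (
    SELECT * FROM `bigquery-public-data.ml_datasets.penguins`
  ),
  with_body_mass AS (
    SELECT * FROM ML.TRANSFORM(
      MODEL `my_dataset.scale_body_mass_g`, (SELECT * FROM raw))
  ),
  with_culmen_length AS (
    SELECT * FROM ML.TRANSFORM(
      MODEL `my_dataset.scale_culmen_length_mm`, (SELECT * FROM with_body_mass))
  ),
  with_culmen_depth AS (
    SELECT * FROM ML.TRANSFORM(
      MODEL `my_dataset.scale_culmen_depth_mm`, (SELECT * FROM with_culmen_length))
  ),
  with_flipper_length AS (
    SELECT * FROM ML.TRANSFORM(
      MODEL `my_dataset.scale_flipper_length_mm`, (SELECT * FROM with_culmen_depth))
  )
SELECT *
FROM with_flipper_length;
```

Keeping one model per feature means a single scaler can be retrained or swapped without touching the rest of the pipeline.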
However, there are situations in which models must be used outside the data warehouse, such as online predictions or edge applications. Note that the models above were created with the VERTEX_AI_MODEL_ID option. This means they have already been registered automatically in the Vertex AI Model Registry and are only a step away from deployment to a Vertex AI Prediction Endpoint. Additionally, for full portability, these models, like other BigQuery ML models, can be exported to Cloud Storage using the EXPORT MODEL statement.
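An export could be sketched as follows; the model name and the Cloud Storage bucket and path are hypothetical placeholders.

```sql
-- Export a transform-only model to Cloud Storage.
-- Replace the model name and the gs:// URI with your own.
EXPORT MODEL `my_dataset.scale_body_mass_g`
OPTIONS (URI = 'gs://my-bucket/models/scale_body_mass_g');
```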
In summary
BigQuery ML's new reusable and modular feature engineering can make it simpler to build and maintain machine learning pipelines and to power MLOps. Modular preprocessing lets you build transformation-only models that can be used as building blocks for other models or exported to Vertex AI. This modularity even makes feature pipelines possible directly in SQL. This may simplify maintenance, help avoid training/serving skew, increase accuracy, and save you time.