Thursday, January 23, 2025

Optimize Distributed Data Preprocessing With GKE and Ray

Using GKE and Ray for distributed data preprocessing: Enterprise scalability

Growing datasets are a by-product of the exponential growth of machine learning models. Because standard data preparation techniques do not scale, this flood of data creates a major bottleneck in the Machine Learning Operations (MLOps) lifecycle. The data preprocessing stage, which converts raw data into a format suitable for model training, can seriously hamper productivity.

To overcome this difficulty, Google Cloud presents in this article a distributed data preprocessing pipeline that uses Google Kubernetes Engine (GKE), a managed Kubernetes service, and Ray, a distributed computing framework for scaling Python programs.
With this combination, Google Cloud can efficiently preprocess large datasets, handle complex transformations, and speed up the entire machine learning workflow.

The necessity of data preparation

The quality and performance of machine learning models are directly affected by data preparation, a foundational stage of MLOps. To ensure that models learn efficiently from the data, data preprocessing involves tasks such as data cleaning, feature engineering, scaling, and encoding.

Bottlenecks can arise when data preparation requires a large number of operations, slowing the overall processing speed. In the example that follows, Google Cloud demonstrates a data preprocessing use case that involves uploading many images to a Google Cloud Storage bucket. It involves up to 140,000 operations, which, when carried out sequentially, create a bottleneck and take more than eight hours to finish.

The dataset

Google Cloud uses a pre-crawled dataset of 20,000 products for this example.

Steps for data preprocessing

There are fifteen distinct columns in the dataset. Google Cloud is interested in the following columns: “uniq_id,” “product_name,” “description,” “brand,” “product_category_tree,” “image,” and “product_specifications.”

In addition to removing duplicates and null values, Google Cloud performs the following steps on the relevant columns:

description

Remove punctuation and stop words from the product description.

product_category_tree

Split into separate columns.

product_specifications

Split the product specifications into key-value pairs.

image

Parse the list of image URLs. Validate each URL and download the image.
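The column steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the short stop-word list, and the raw formats (a string-encoded list with ">>" separators for the category tree, "key"=>"value" pairs for the specifications) are assumptions made for the example. The URL check only looks at the scheme.

```python
import ast
import re
import string

# Illustrative stop-word list (assumption); a real pipeline would likely use
# a fuller list from an NLP library.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "for", "in", "on", "to", "with"}


def clean_description(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in no_punct.split() if word not in STOP_WORDS)


def split_category_tree(raw: str, levels: int = 3) -> list:
    """Split a '["A >> B >> C"]' style value (assumed format) into columns."""
    path = ast.literal_eval(raw)[0]
    parts = [part.strip() for part in path.split(">>")]
    # Pad or truncate so every row yields the same number of columns.
    return (parts + [None] * levels)[:levels]


def parse_specifications(raw: str) -> dict:
    """Extract pairs from '"key"=>"K", "value"=>"V"' text (assumed format)."""
    return dict(re.findall(r'"key"=>"([^"]*)",\s*"value"=>"([^"]*)"', raw))


def parse_image_urls(raw: str) -> list:
    """Parse a string-encoded list of URLs and keep only http(s) entries."""
    urls = ast.literal_eval(raw)
    return [url for url in urls if url.startswith(("http://", "https://"))]
```

Each helper takes one raw cell value and returns the cleaned result, so it can be applied row by row or to a whole column.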

Now imagine a data preprocessing job that retrieves several image URLs from every row of a large dataset and uploads the images to a Cloud Storage bucket. This may seem simple, but done serially in Python it becomes very time-consuming for a dataset of more than 20,000 rows, each of which may have up to seven URLs. Google Cloud found that this kind of job can take more than eight hours!
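A rough back-of-the-envelope calculation shows why the serial run is slow. It uses only the figures quoted in this article and treats them as exact, which they are not (the row count, URL count, and duration are all approximate or upper bounds):

```python
rows = 20_000            # dataset size quoted in the article
max_urls_per_row = 7     # upper bound quoted in the article
serial_hours = 8         # approximate serial run time quoted in the article

max_operations = rows * max_urls_per_row
seconds_per_operation = serial_hours * 3600 / max_operations

print(f"{max_operations} operations, about {seconds_per_operation:.2f} s each")
```

Under these assumptions each download-and-upload takes only a fraction of a second. The total is dominated by the number of operations, which is why splitting them across workers helps.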

Solution: Implement parallelism for scalability

Parallelism is Google Cloud’s solution to this scaling problem. Google Cloud can significantly cut down on the overall execution time by segmenting the dataset into smaller pieces and dividing the work across several threads. Ray was Google Cloud’s platform of choice for distributed computing.

Ray: Simplified distributed computing

Ray is a robust framework made for scaling Python applications and libraries. It is a good choice for implementing parallel data preprocessing pipelines because it offers a simple API for distributing computations across multiple workers.

In this particular application, Google Cloud uses Ray to assign multiple Ray workers to run the Python code that downloads images from URLs to Cloud Storage buckets. Ray’s abstraction layer handles the intricacies of worker management and communication, so Google Cloud can concentrate on the core data preprocessing logic.
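The per-image work that each worker runs might look like the sketch below. It is an illustration under stated assumptions rather than the code from the article: the function names and the "images/" object naming scheme are hypothetical, and `bucket` is assumed to be a `google.cloud.storage` Bucket object created by the caller.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse


def blob_name_for(uniq_id: str, index: int, url: str) -> str:
    """Build a deterministic object name such as 'images/<uniq_id>_0.jpeg'."""
    extension = posixpath.splitext(urlparse(url).path)[1] or ".jpg"
    return f"images/{uniq_id}_{index}{extension}"


def copy_image_to_bucket(bucket, uniq_id, index, url, timeout=10.0):
    """Download one image and upload it to the bucket.

    Returns the object name on success, or None if the URL is malformed
    or the download fails.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
            content_type = response.headers.get_content_type()
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; a malformed URL
        # raises ValueError.
        return None
    name = blob_name_for(uniq_id, index, url)
    # `bucket` is assumed to be a google.cloud.storage Bucket.
    bucket.blob(name).upload_from_string(data, content_type=content_type)
    return name
```

With the google-cloud-storage client library, the bucket could be obtained with `storage.Client().bucket("your-bucket-name")`, where the bucket name is a placeholder. Returning None on failure lets the caller record bad URLs without stopping the batch.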

Ray’s key capabilities include:

Task parallelism

Ray offers a simple way to parallelise the image download process by allowing arbitrary functions to be run asynchronously as tasks on separate Python workers.

Actor model

Ray’s “actors” provide a way to encapsulate stateful computations, which makes them suitable for complex data preprocessing scenarios where shared state may be required.

Simplified scaling

Ray is a versatile solution for varying data volumes and processing requirements because it scales easily from a single machine to a full-fledged cluster.
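The task and actor concepts above can be illustrated with a small sketch. The function and class names here are made up for the example and the computation is a toy one. Ray is imported inside the function so that the snippet can be loaded on a machine without Ray installed. Calling the function requires the `ray` package.

```python
def demo_tasks_and_actors(values):
    """Square each value in a Ray task and track a count in a Ray actor."""
    import ray  # imported lazily; requires the `ray` package when called

    # Start a local Ray runtime, or reuse one that is already running.
    ray.init(ignore_reinit_error=True)

    @ray.remote
    def square(x):
        # A stateless task: each call may run on a different worker.
        return x * x

    @ray.remote
    class Counter:
        # An actor: a worker process that keeps state between calls.
        def __init__(self):
            self.count = 0

        def add(self, n):
            self.count += n
            return self.count

    counter = Counter.remote()
    refs = [square.remote(value) for value in values]  # returns immediately
    results = ray.get(refs)                            # blocks until done
    total = ray.get(counter.add.remote(len(results)))
    return results, total
```

Calling `demo_tasks_and_actors([1, 2, 3])` submits three tasks and one actor call. The same code runs unchanged on a laptop or a cluster; only the `ray.init` arguments differ.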

Implementation details

To run the data preprocessing on GKE, Google Cloud used the accelerated platforms repository, which provides the code to build a GKE cluster and configure prerequisites such as running Ray on the cluster, so that data preprocessing can be executed on the cluster as a container. The job had three stages:

Partitioning the dataset

Google Cloud splits the large dataset into more manageable chunks.

The 20,000 rows of raw data were split into 101 smaller chunks of up to 199 rows each. Each chunk has a corresponding Ray task that runs on a Ray worker.

Distribution of Ray tasks

Google Cloud created Ray remote tasks. Ray creates and manages the workers and distributes the tasks among them.

Data processing in parallel

The Ray tasks prepare the data and download the images to Cloud Storage in parallel.
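The three stages above can be sketched as follows. This is a simplified illustration, not the code from the accelerated platforms repository: `partition`, `process_chunk`, and `run_on_ray` are hypothetical names, and `process_chunk` is a placeholder for the real cleaning and image upload work. Ray is imported inside `run_on_ray`, so the partitioning logic can be tried without Ray installed.

```python
def partition(rows, chunk_size):
    """Split a list of rows into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def process_chunk(chunk):
    """Placeholder for the per-chunk work (cleaning, download, upload)."""
    return len(chunk)


def run_on_ray(rows, chunk_size=199):
    """Submit one Ray task per chunk and wait for all of the results."""
    import ray  # imported lazily; requires the `ray` package when called

    ray.init(ignore_reinit_error=True)
    remote_process = ray.remote(process_chunk)
    futures = [remote_process.remote(chunk) for chunk in partition(rows, chunk_size)]
    return ray.get(futures)


# Partitioning 20,000 rows with a chunk size of 199 gives 101 chunks;
# the last chunk holds the remaining 100 rows.
chunks = partition(list(range(20_000)), 199)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Submitting one task per chunk, rather than one per row, keeps scheduling overhead small. The chunk size is the knob to adjust as the dataset grows.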

Results

Google Cloud was able to cut processing time drastically by using Ray and GKE. The data preprocessing time for 20,000 rows fell from more than 8 hours to only 17 minutes, a reported speedup of around 23 times. As the data size grows, you can use Ray autoscaling and adjust the batch size to achieve comparable results.

No more problems with data preparation

Data preprocessing is a challenge for modern machine learning teams, and distributed data preprocessing with GKE and Ray offers a reliable and scalable solution. By using parallelism and cloud infrastructure, Google Cloud can speed up data preparation, eliminate bottlenecks, and free data scientists and ML engineers to concentrate on model development and innovation. To learn more, run the setup that demonstrates this data preprocessing use case using Ray on a GKE cluster.

Thota nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.