Wednesday, April 2, 2025

Optimizing Image Generation Pipelines on Google Cloud

A practical guide to streamlining image generation pipelines on Google Cloud

Generative AI diffusion models such as Stable Diffusion and Flux produce stunning visuals, giving creators across many verticals remarkable image generation capabilities. Yet even with powerful hardware like GPUs and TPUs, generating high-quality images with complex pipelines can be computationally demanding, affecting both cost and turnaround time.

The main challenge is streamlining the entire workflow to reduce latency and cost without sacrificing image quality. Striking this balance is essential for image generation to reach its full potential in real-world applications. For example, optimize the underlying software and infrastructure so the model performs at its best before resorting to shrinking the model to cut image generation costs.

At Google Cloud Consulting, Google has been helping customers navigate these challenges. Optimized image generation pipelines are crucial, and this article presents three proven strategies to help you deliver great user experiences while remaining efficient and cost-effective.

A holistic approach to optimization

Google recommends a holistic optimization strategy that considers the pipeline’s hardware, code, and overall design. Google Cloud addresses this with AI Hypercomputer, a supercomputing architecture that combines hardware such as TPUs and GPUs with software and frameworks such as PyTorch. The main focus areas break down as follows:

Hardware optimization: Making the most of available resources

Image generation pipelines frequently require GPUs or TPUs for deployment, and maximizing hardware utilization can drastically cut costs. Because GPUs cannot be allocated fractionally, underutilization is common, particularly as workloads scale, which leads to inefficiency and higher operating costs. To improve resource efficiency, Google Kubernetes Engine (GKE) offers several GPU sharing options. Smaller A3 High VMs with NVIDIA H100 80GB GPUs are also available, which helps control costs and scale efficiently.

The key GPU sharing techniques in GKE are:

Multi-instance GPUs: This technique divides a single GPU into up to seven slices in GKE, providing hardware isolation between workloads. Each GPU slice has its own compute, memory, and bandwidth, and can be allocated independently to a single container. This approach suits inference workloads that need consistent, predictable performance. Review the documented limitations of this approach before implementing it; note that NVIDIA A100 GPUs (40GB and 80GB) and NVIDIA H100 GPUs (80GB) are currently the supported GPU types for multi-instance GPUs on GKE.

GPU time-sharing: This technique uses instruction-level preemption in NVIDIA GPUs to let multiple containers share the full GPU through rapid context switching between processes. It is better suited to interactive or bursty workloads, and to testing and prototyping where full isolation is not required. Time-sharing minimizes GPU idle time and improves utilization and cost, though context switching can add some latency overhead for certain applications (a minimal example Pod manifest follows this list).

NVIDIA Multi-Process Service (MPS): MPS, a variant of the CUDA API, lets multiple processes or containers run concurrently on the same physical GPU without interfering with one another. Use it to run several small-to-medium-scale batch processing workloads on a single GPU and maximize its throughput and utilization. When implementing MPS, make sure your workloads can tolerate its limitations around error containment and memory protection.
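As a concrete illustration of the time-sharing strategy, here is a minimal sketch of a Pod that requests one share of a time-shared GPU on GKE. It assumes a GPU node pool that was already created with time-sharing enabled; the node selector labels follow GKE's documented GPU-sharing labels, and the Pod name and container image are hypothetical placeholders.

```yaml
# Minimal sketch: a Pod requesting one share of a time-shared GPU on GKE.
# Assumes a node pool created with gpu-sharing-strategy=time-sharing and
# max-shared-clients-per-gpu=2; the image path is illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: image-gen-worker
spec:
  nodeSelector:
    cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
  - name: inference
    image: us-docker.pkg.dev/my-project/my-repo/image-gen:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1   # one share of the time-shared GPU
```

With two shared clients per GPU, two such Pods can be scheduled onto the same physical accelerator, which is what drives up utilization for bursty inference traffic.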

Example illustration of GPU sharing strategies

Inference code optimization: Tuning for efficiency

If your pipeline is already written in PyTorch, there are several ways to optimize it and shorten its execution time.

One approach is PyTorch’s compile method, which just-in-time (JIT) compiles PyTorch code into optimized kernels for faster execution, particularly for the forward pass of the decoder step. Depending on the underlying hardware, this can use a variety of compiler backends, such as NVIDIA TensorRT, OpenVINO, or IPEX. Some compiler backends are also available for use during training. Other frameworks, such as JAX, offer a comparable JIT compilation feature.
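As an illustration, here is a minimal sketch of applying torch.compile to a Hugging Face Diffusers pipeline. The model ID, the unet/vae attribute names, and the prompt are assumptions about a typical Stable Diffusion setup rather than details from the original article, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a diffusion pipeline onto the GPU; fp16 keeps memory pressure manageable.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# JIT-compile the heaviest components. The first call pays a one-time
# compilation cost; subsequent calls reuse the optimized kernels.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```

Because compilation happens lazily on the first forward pass, a warm-up request is often run at startup so end users never pay the compilation latency.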

Enabling Flash Attention is another way to reduce latency. Flash Attention can be used natively in PyTorch via torch.backends.cuda.enable_flash_sdp to speed up attention computations. If Flash Attention is not the best option for the given inputs, PyTorch automatically falls back to another attention implementation.
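For example, here is a minimal sketch of enabling the Flash Attention backend for PyTorch’s scaled dot-product attention; the tensor shapes are illustrative, and a CUDA GPU with fp16 support is assumed.

```python
import torch
import torch.nn.functional as F

# Ask PyTorch to prefer the Flash Attention kernel for scaled dot-product
# attention; it falls back to another implementation when inputs don't qualify.
torch.backends.cuda.enable_flash_sdp(True)

# Illustrative query/key/value tensors: (batch, heads, sequence, head_dim).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Any attention layer that routes through this op benefits automatically.
out = F.scaled_dot_product_attention(q, k, v)
```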

Also minimize data transfers between the GPU and CPU to lower latency. Operations such as tensor loading and comparing a tensor with a Python float incur significant data-movement cost: each time a tensor is compared with a floating-point number, it must be copied to the CPU, which adds latency. For image generation pipelines that chain many models, where latency compounds with each model run, it is ideal to move a tensor onto and off the GPU only once across the entire pipeline. Tools like PyTorch Profiler show how much time and memory each model consumes.
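Here is a minimal sketch of both ideas: keeping a threshold check on the device instead of pulling values back to the CPU, and profiling a step with PyTorch Profiler. The tensor shapes and operations are illustrative and assume a CUDA GPU.

```python
import torch
from torch.profiler import profile, ProfilerActivity

latents = torch.randn(4, 4, 64, 64, device="cuda")

# Anti-pattern: `latents.max().item() > 1.0` copies a value to the CPU and
# blocks on the GPU every time it runs. Keep the logic on-device instead:
latents = torch.where(latents.abs() > 1.0, latents.clamp(-1.0, 1.0), latents)

# PyTorch Profiler shows where time and memory go, including implicit
# host<->device copies (look for Memcpy entries in the table).
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    scaled = latents * 0.5
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```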

Pipeline optimization: Streamlining the process

While code optimization can speed up individual pipeline components, you also need to consider the bigger picture. Many multi-step image-generation pipelines cascade several models (such as samplers, decoders, and image and/or text embedding models) one after another to produce the final image, often in a single container with a single GPU attached.

Models such as the decoder in a diffusion-based pipeline can take longer to run because of their considerably higher computational complexity, especially compared to embedding models, which are typically fast. This means certain models can become a bottleneck in the generation pipeline. To minimize this bottleneck and maximize GPU utilization, consider a multi-threaded, queue-based strategy for scheduling and executing jobs. With this approach, multiple requests can be processed concurrently because different pipeline stages run in parallel on the same GPU. Distributing work effectively across worker threads maximizes resource utilization, minimizes GPU idle time, and ultimately yields higher throughput.

Tensors can also be kept on the same GPU throughout the process, reducing the overhead of CPU-to-GPU (and GPU-to-CPU) transfers, which improves efficiency and cuts costs.
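The sketch below illustrates the idea with queue.Queue and three worker threads, one per stage, all sharing one GPU. The stage callables are hypothetical stand-ins for real embedding, sampling, and decoding models, and a CUDA GPU is assumed.

```python
import queue
import threading
import torch

def run_stage(stage_fn, in_q, out_q):
    """Pull requests from in_q, run one pipeline stage on the GPU, push to out_q."""
    while True:
        item = in_q.get()
        if item is None:               # sentinel: propagate shutdown downstream
            out_q.put(None)
            break
        request_id, tensor = item
        with torch.no_grad():
            tensor = stage_fn(tensor)  # tensor stays resident on the GPU
        out_q.put((request_id, tensor))

# Hypothetical stage callables standing in for the embedder, sampler, and decoder.
embed, sample, decode = (lambda t: t + 1), (lambda t: t * 2), (lambda t: t - 3)

q_in, q_a, q_b, q_out = (queue.Queue() for _ in range(4))
stages = [(embed, q_in, q_a), (sample, q_a, q_b), (decode, q_b, q_out)]
workers = [threading.Thread(target=run_stage, args=s, daemon=True) for s in stages]
for w in workers:
    w.start()

# Two concurrent requests: while request 0 is being decoded, request 1 can
# already be sampled on the same GPU, so the stages overlap instead of stacking.
for i in range(2):
    q_in.put((i, torch.randn(1, 4, 64, 64, device="cuda")))
q_in.put(None)

while (result := q_out.get()) is not None:
    print("finished request", result[0])
```

In a real pipeline each stage function would wrap a model’s forward pass, and the queue depths bound how many in-flight requests share the GPU at once.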

Processing time comparison between a stacked pipeline and a multithreaded pipeline for 2 concurrent requests

Concluding observations

Optimizing image-generation pipelines has many facets, but the benefits are substantial. A comprehensive strategy that combines hardware optimization for better resource utilization, code optimization for faster execution, and pipeline optimization for higher throughput can deliver significant performance gains, cost savings, and outstanding user experiences.

Drakshi
Since June 2023, Drakshi has been writing articles on Artificial Intelligence for govindhtech. She holds a postgraduate degree in business administration and is an enthusiast of Artificial Intelligence.