Thursday, April 10, 2025

Google Cloud SRE’s Guide To MLOps For ML Optimization

As artificial intelligence (AI) becomes more accessible, Google Cloud SREs will take a growing interest in its capabilities, such as machine learning (ML). That is because ML is becoming both a crucial component of the software itself and part of the infrastructure used in production software systems.

At its core, machine learning depends on pipelines, and you are already skilled at handling those. You can therefore start with pipeline management and then turn to training, model freshness, and efficiency as additional levers for improving your ML services. This guide examines some of the ML-specific aspects of these pipelines that you should take into account in your operations, then draws on Google Cloud SRE expertise to show how your fundamental SRE skills can be applied to running and overseeing the machine-learning pipelines in your organisation.

Training ML models

Training machine learning models, which frequently relies on specialised hardware, applies the concept of pipelines to particular kinds of data. Important things to think about regarding the pipeline (a minimal configuration sketch capturing these follows the list):

  • How much data are you ingesting?
  • How recent must this data be?
  • How the system trains and deploys the models
  • How well the system manages the first three items
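
To make these questions concrete, here is a minimal, hypothetical sketch of how a team might record the answers as explicit, reviewable configuration; the class, field names, and numbers are illustrative assumptions, not a Google Cloud API.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical sketch: the pipeline questions above captured as configuration.
@dataclass
class TrainingPipelineProfile:
    daily_ingest_gb: float    # how much data the pipeline consumes per day
    max_data_age: timedelta   # how recent the training data must be
    retrain_cadence: timedelta  # how often models are retrained and deployed
    accelerator: str          # specialised hardware the training jobs expect

    def within_freshness(self, data_age: timedelta) -> bool:
        """True if the newest ingested data is still fresh enough to train on."""
        return data_age <= self.max_data_age

# Example: a weather-style model needs very fresh data; a spell-checker does not.
weather = TrainingPipelineProfile(
    daily_ingest_gb=500.0,
    max_data_age=timedelta(minutes=15),
    retrain_cadence=timedelta(hours=1),
    accelerator="gpu",
)
print(weather.within_freshness(timedelta(minutes=5)))  # True
print(weather.within_freshness(timedelta(hours=2)))    # False
```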

Google Cloud sheds light on the significance of ML systems for products and how Google Cloud SREs ought to think about them. ML systems pose challenges around capacity planning, resource management, and monitoring, as well as understanding the cost of ML systems relative to your overall operational environment.

ML freshness and data volume

As with any pipeline-based system, a key part of understanding it is knowing how much data the system normally ingests and processes. The SRE Workbook's chapter on data processing pipelines lays the foundation: automate the pipeline's operation so that it is robust and capable of running without human supervision.

Establishing Service Level Objectives (SLOs) can help you gauge the pipeline's health, particularly with regard to data freshness, that is, how recently the model received the data it uses to generate an inference for a client. Freshness is a crucial indicator of an ML system's health, since stale data can lead to lower-quality inferences and less-than-ideal results for the user. For some systems, such as spell-checkers, data freshness can lag on the order of days or longer, while for others, such as weather forecasts, the data may need to be extremely fresh (only minutes or seconds old). Each product has different freshness needs, so it is critical to understand what you are building and how your target audience plans to use it.
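
As a rough illustration of how such a freshness SLO might be checked in practice, here is a minimal Python sketch; the six-hour threshold and the timestamp source are assumptions chosen for the example, not values from the article.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Product-specific assumption: minutes for weather, days for a spell-checker.
FRESHNESS_SLO = timedelta(hours=6)

def freshness_slo_met(last_ingested_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the newest training data behind the model is within the SLO."""
    now = now or datetime.now(timezone.utc)
    return (now - last_ingested_at) <= FRESHNESS_SLO

# Example: data ingested two hours ago meets a six-hour freshness SLO.
two_hours_ago = datetime.now(timezone.utc) - timedelta(hours=2)
print(freshness_slo_met(two_hours_ago))  # True
```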

Freshness is therefore one facet of the customer experience covered by the SRE Workbook's key user journeys. The Google Cloud SRE article Reliable Data Processing with Minimal Toil has further information on data freshness as a component of pipeline systems.

Ensuring high-quality data involves more than just freshness; it also depends on how the model-training pipeline is defined. A Brief Guide To Running ML Systems in Production covers the essentials, including how to assess the quality of your input data and how to use contextual metrics to gauge freshness and throughput.
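
An input-data quality gate of the kind such a guide describes might look roughly like the sketch below, which rejects a training batch when required fields are missing too often; the field names, batch format, and threshold are hypothetical, and a real pipeline would use a schema-validation framework instead.

```python
from typing import Iterable, Mapping

REQUIRED_FIELDS = ("user_id", "timestamp", "label")  # illustrative schema

def validate_batch(rows: Iterable[Mapping], max_null_rate: float = 0.01) -> bool:
    """Reject a training batch if required fields are missing too often."""
    rows = list(rows)
    if not rows:
        return False
    bad = sum(1 for r in rows if any(r.get(f) is None for f in REQUIRED_FIELDS))
    return bad / len(rows) <= max_null_rate

batch = [
    {"user_id": 1, "timestamp": "2025-04-10T00:00:00Z", "label": 1},
    {"user_id": 2, "timestamp": None, "label": 0},
]
print(validate_batch(batch))  # False: half the rows are missing a required field
```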

Serving efficiency

An excellent resource for learning how to enhance your model’s performance in a production setting is the 2021 Google Cloud SRE blog article Efficient Machine Learning Inference. (And keep in mind that with ML services, training is never the same as production!)

Optimising machine-learning inference serving is essential for practical deployment. The authors investigate multi-model serving from a shared virtual machine, walking through practical use cases and how to balance trade-offs between model response latency, cost, and utilisation. Model serving can be made more cost-effective by changing how models are assigned to virtual machines (VMs) and by adjusting the size and shape of those VMs in terms of processing power, GPU, and RAM.
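
To illustrate the trade-off, here is a hedged sketch of one simple way to pack several models onto shared VMs by memory footprint; the greedy first-fit approach and the gigabyte figures are assumptions for the example, not the algorithm from the article.

```python
from typing import Dict, List

def pack_models(model_mem_gb: Dict[str, float], vm_mem_gb: float) -> List[List[str]]:
    """Greedily assign models to VMs so each VM stays under its memory budget."""
    vms: List[List[str]] = []
    free: List[float] = []
    for name, mem in sorted(model_mem_gb.items(), key=lambda kv: -kv[1]):
        for i, remaining in enumerate(free):
            if mem <= remaining:
                vms[i].append(name)
                free[i] -= mem
                break
        else:
            vms.append([name])
            free.append(vm_mem_gb - mem)
    return vms

models = {"ranker": 6.0, "spellcheck": 2.0, "translate": 10.0, "toxicity": 3.0}
print(pack_models(models, vm_mem_gb=16.0))
# Fewer VMs means higher utilisation and lower cost, but co-located models
# can contend for CPU/GPU and increase response latency.
```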

Cost efficiency

As Google Cloud previously indicated, these AI pipelines frequently require specialised hardware. How can you tell whether you are making effective use of it? Todd Underwood's SREcon EMEA 2023 presentation, Artificial Intelligence: What Will It Cost You?, gives you an idea of the operating expenses of this specialised gear and how to create incentives for efficient use.
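
As a back-of-the-envelope illustration of that cost question, the sketch below estimates how much accelerator spend goes to idle time; the hourly rate and utilisation figure are made-up example numbers, not pricing from the talk.

```python
def wasted_spend(hourly_rate_usd: float, hours: float, avg_utilisation: float) -> float:
    """Cost attributable to idle accelerator time over the period."""
    return hourly_rate_usd * hours * (1.0 - avg_utilisation)

# Example: a GPU VM at $2.50/hour running 720 hours/month at 40% utilisation.
print(f"${wasted_spend(2.50, 720, 0.40):,.2f} of monthly spend is idle time")  # $1,080.00
```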

Automation for scale

In this article, the Google SRE team describes methods for minimising manual labour, or toil, while guaranteeing dependable data processing. One of the main lessons learnt: use an existing, standard platform for as much of the pipeline as feasible. After all, your business objectives should centre on improvements to how the data and ML model are served, not on the pipeline itself. The article covers automation, monitoring, and incident response, with an emphasis on applying these ideas to build robust data pipelines. You will find recommended practices for building data systems that reduce a team's operational load and handle errors gracefully.
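
One small example of trading toil for automation is retrying transient pipeline failures and only paging a human when retries are exhausted, as in this illustrative sketch; run_step and page_oncall are hypothetical stand-ins for your own code, not part of any Google library.

```python
import logging
import time

def run_with_retries(run_step, page_oncall, max_attempts: int = 3, backoff_s: float = 30.0):
    """Run a pipeline step, retrying transient failures before paging a human."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_step()
        except Exception as exc:  # narrow this to transient errors in real code
            logging.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                page_oncall(f"pipeline step failed after {max_attempts} attempts: {exc}")
                raise
            time.sleep(backoff_s * attempt)  # simple linear backoff
```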

Next steps

Dependable, long-lasting ML deployments require thorough management and monitoring. That means taking a comprehensive approach that covers monitoring and accuracy metrics in addition to the implementation of data pipelines, training paths, model maintenance, and validation.

Thota Nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.