Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure
The data center infrastructure supporting machine learning (ML) applications is seeing unprecedented demand for power delivery as those applications grow. Large-scale batch-synchronized ML workloads exhibit very different power usage patterns than the server clusters in a typical data center, where tens of thousands of workloads coexist with largely uncorrelated power profiles. Under these new usage conditions, it is becoming harder to guarantee the availability and reliability of the ML infrastructure, and harder to improve data center goodput and energy efficiency.
Google has been leading the way in data center infrastructure design for decades, with a long list of achievements to show for it. One of the most recent developments is a technique that allowed us to control unprecedented power and thermal fluctuations in Google Cloud's ML infrastructure. This work demonstrates the value of full codesign across the hardware and software stack, from the ASIC chip to the data center. This post also discusses the implications of this approach and offers a call to action for the industry as a whole.
New ML workloads lead to new ML power challenges
ML workloads, which can occupy an entire data center cluster or even several clusters, require synchronized computation across tens of thousands of accelerator chips, together with their associated hosts, storage, and networking systems. The peak power consumption of these workloads can approach the rated power of all the underlying IT equipment, which makes power oversubscription considerably more challenging.
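To see why oversubscription becomes harder, consider a toy comparison; the machine counts and power numbers below are assumptions chosen for illustration, not Google data. When workloads fluctuate independently, the observed cluster peak sits well below the sum of rated powers, leaving room to oversubscribe; when every accelerator follows the same batch-synchronous schedule, that headroom largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n_machines, n_samples, rated_kw = 10_000, 1_000, 1.0   # assumed, illustrative values

# Traditional cluster: utilizations fluctuate independently, so the instantaneous
# cluster power stays well below the sum of the machines' rated powers.
uncorrelated = rng.uniform(0.2, 1.0, size=(n_samples, n_machines)) * rated_kw

# Batch-synchronous ML cluster: every accelerator tracks the same utilization
# signal, so the cluster peak approaches the full rated power.
shared_util = rng.uniform(0.2, 1.0, size=(n_samples, 1))
synchronized = np.broadcast_to(shared_util, (n_samples, n_machines)) * rated_kw

rated_total = n_machines * rated_kw
for name, cluster in [("uncorrelated", uncorrelated), ("synchronized", synchronized)]:
    peak = cluster.sum(axis=1).max()
    print(f"{name:>12}: peak {peak:,.0f} kW of {rated_total:,.0f} kW rated "
          f"({peak / rated_total:.0%})")
```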
Furthermore, because a small number of large ML workloads now account for the majority of a cluster's power consumption, power swings between idle and peak utilization levels much more sharply. You may see these swings when a workload starts or finishes, or when it is stopped and later restarted or rescheduled. You may also see a similar pattern while a workload is running normally, driven mostly by the alternation of compute-intensive and networking-intensive phases within each training step. Depending on the nature of the workload, the frequency of these intra- and inter-job power fluctuations can vary widely, with a number of unintended consequences for the reliability, performance, and usability of the data center infrastructure.
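As a concrete sketch of the intra-job pattern, the example below models one training step as a compute-intensive phase followed by a networking-intensive phase in which the accelerators sit mostly idle. The per-chip power levels and phase durations are assumed for illustration only, not TPU specifications; the point is that when tens of thousands of chips follow this pattern in lockstep, the cluster-level swing reaches tens of megawatts at a period of well under a second.

```python
import numpy as np

# Assumed, illustrative per-chip figures; not TPU specifications.
P_COMPUTE_W, P_NETWORK_W = 700.0, 300.0     # draw during compute vs. network phases
COMPUTE_MS, NETWORK_MS = 600, 200           # phase durations within one training step
N_CHIPS, N_STEPS = 50_000, 5

def step_profile_w() -> np.ndarray:
    """Per-chip power trace (1 ms resolution) for one batch-synchronous step."""
    return np.concatenate([
        np.full(COMPUTE_MS, P_COMPUTE_W),   # dense math: near-peak power draw
        np.full(NETWORK_MS, P_NETWORK_W),   # gradient exchange: compute mostly idle
    ])

cluster_mw = np.tile(step_profile_w(), N_STEPS) * N_CHIPS / 1e6  # chips move in lockstep
swing_mw = cluster_mw.max() - cluster_mw.min()
period_s = (COMPUTE_MS + NETWORK_MS) / 1000
print(f"cluster power swings by ~{swing_mw:.0f} MW every {period_s:.1f} s")
```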

Google Cloud has observed power fluctuations in the tens of megawatts (MW) in its most recent batch-synchronous ML workloads running on dedicated ML clusters. Moreover, in contrast to a conventional load-variation profile, the ramps can be nearly instantaneous, can repeat as often as every few seconds, and can persist for weeks or even months.
Such fluctuations carry the following risks:
- Problems with the functionality and long-term reliability of rack and data center equipment, such as rectifiers, transformers, generators, cables, and busways, which can lead to hardware-induced outages, reduced energy efficiency, and higher operating and maintenance costs.
- Damage to, outages of, or throttling by the upstream utility, including breaches of contractual obligations to the utility regarding power-use patterns, and the associated financial costs.
- Unintended and frequent activation of the uninterruptible power supply (UPS) system due to large power fluctuations, which shortens the UPS system's lifespan.
Large power fluctuations can also affect hardware reliability at a much smaller scale, such as per chip or per system. Even when the maximum temperature is kept well under control, power fluctuations can still produce large and frequent temperature swings, which in turn drive a variety of wear-out mechanisms such as warpage, changes in the material properties of thermal interfaces, and electromigration.
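To first order, a chip's hotspot temperature swing follows its power swing through the package's effective thermal resistance. The back-of-the-envelope sketch below uses an assumed thermal resistance value, not a published figure for any Google chip, to show how a few hundred watts of per-chip power swing translates into double-digit temperature cycling.

```python
# First-order steady-state estimate: delta_T ~= delta_P * R_theta.
# The thermal resistance below is an assumed, illustrative value,
# not a published specification for any Google chip.
R_THETA_C_PER_W = 0.05   # effective junction-to-coolant thermal resistance (deg C / W)

def temperature_swing_c(power_swing_w: float, r_theta: float = R_THETA_C_PER_W) -> float:
    """Approximate hotspot temperature swing for a given per-chip power swing."""
    return power_swing_w * r_theta

for delta_p_w in (100.0, 200.0, 400.0):
    print(f"{delta_p_w:5.0f} W swing -> ~{temperature_swing_c(delta_p_w):3.0f} deg C swing")
```

Even modest swings matter when they repeat every training step, since thermo-mechanical wear depends on the number of thermal cycles as well as their amplitude.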
A full-stack approach to proactive power shaping
Given the complexity and scale of its data center architecture, Google Cloud hypothesized that it would be more effective to proactively shape a workload's power profile than to merely react to it. With Google's full codesign across the stack, from chip to data center, from hardware to software, and from instruction set to realistic workload, Google Cloud has all the knobs it needs to build highly effective end-to-end power management features that control the power profiles of its workloads and dampen harmful fluctuations.
Specifically, Google Cloud instrumented the TPU compiler to watch for workload signatures associated with power swings, such as sync flags. Around these flags, it dynamically balances the activities of the TPU's major compute blocks to level out their utilization over time, achieving the goal of reducing power and temperature fluctuations with minimal performance overhead. In the future, a similar approach could be applied at the start and end of a job, so that power levels shift gradually rather than abruptly.
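The sketch below captures the idea at a very high level; it is a simplified illustration, not the actual TPU compiler implementation, and the utilization numbers are invented. Around a sync flag, where the compute blocks would otherwise drop to near-idle, a utilization floor is enforced so the step's power profile stays comparatively flat, trading a small increase in average power for a much smaller swing.

```python
UTIL_FLOOR = 0.6   # assumed minimum compute-block utilization enforced near sync flags

def shape_utilization(trace: list[float], floor: float = UTIL_FLOOR) -> list[float]:
    """Clamp a per-interval compute-utilization trace (0.0-1.0) to a floor value."""
    return [max(u, floor) for u in trace]

# One training step: a compute-bound phase at full tilt, then a network-bound
# sync phase where compute utilization would normally collapse to ~20%.
step = [1.0] * 6 + [0.2] * 2
shaped = shape_utilization(step)

swing = lambda t: max(t) - min(t)
mean = lambda t: sum(t) / len(t)
print(f"swing:   {swing(step):.0%} -> {swing(shaped):.0%}")    # 80% -> 40%
print(f"average: {mean(step):.0%} -> {mean(shaped):.0%}")      # 80% -> 90%
```

In practice, where the extra activity is injected and how high the floor is set determine the trade-off between swing reduction, average power, and performance, which is why careful tuning matters.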
This compiler-based approach to power-profile shaping has since been implemented and applied to realistic workloads. Plots of the system's overall power consumption and of a single chip's hotspot temperature, with and without the mitigation, show the effect: between the baseline case and the mitigated case, the magnitude of the power fluctuation in the test case decreased by over 50%.
The temperature swings also decreased, from around 20°C in the baseline case to about 10°C with the mitigation. The cost of the mitigation was measured as the change in training step time and the increase in average power consumption. With properly tuned mitigation settings, these benefits come with only a modest increase in average power and less than 1% impact on performance.
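As a sketch of how such a trade-off might be quantified, the helper below compares a baseline and a mitigated power trace. The metric definitions (peak-to-trough swing, mean power, step time) and the synthetic square-wave traces are assumptions for illustration, not Google's published methodology or measurements.

```python
import numpy as np

def mitigation_report(baseline_power, mitigated_power, baseline_step_s, mitigated_step_s):
    """Summarize a mitigation's benefit and cost from two power traces."""
    swing = lambda p: float(np.max(p) - np.min(p))
    return {
        "swing_reduction":    1 - swing(mitigated_power) / swing(baseline_power),
        "avg_power_increase": float(np.mean(mitigated_power) / np.mean(baseline_power) - 1),
        "perf_overhead":      mitigated_step_s / baseline_step_s - 1,
    }

# Synthetic traces in MW (arbitrary numbers, purely illustrative): the mitigation
# raises the trough of each step, shrinking the swing at a small cost in mean power.
baseline  = np.tile([80.0] * 9 + [40.0] * 1, 100)
mitigated = np.tile([80.0] * 9 + [62.0] * 1, 100)

print(mitigation_report(baseline, mitigated, baseline_step_s=0.800, mitigated_step_s=0.806))
```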


A call to action
Because of its rapid growth, ML infrastructure is expected to overtake traditional server infrastructure in overall power consumption in the coming years. However, the power and thermal fluctuations of ML infrastructure are distinctive and closely tied to the characteristics of the ML workloads themselves.
Mitigating these fluctuations is one of many improvements Google Cloud needs in order to ensure reliable, high-performing infrastructure. Beyond the approach described above, Google Cloud has been investing in a range of cutting-edge techniques to address the growing power and thermal challenges, including data center water cooling, vertical power distribution, power-aware workload allocation, and more.
Google is not alone in facing these challenges, however. Many hyperscalers, cloud providers, and infrastructure providers are beginning to see power and thermal fluctuations in their ML infrastructure. Google Cloud needs help from partners at every level of the system:
- Utility providers should establish a uniform definition of acceptable power-quality measures, particularly where several data centers with large power fluctuations coexist and interact within the same grid.
- Power and cooling equipment suppliers should improve the quality and reliability of electronic components, especially for use in environments with large and frequent power and temperature variations.
- Hardware providers and data center designers should develop a standardized suite of solutions, such as rack-level capacitor banks (RLCB) or on-chip features, to help build an effective supplier base and ecosystem.
- ML model developers should take the model's energy-consumption characteristics into account and consider low-level software mitigations to help smooth out energy fluctuations.
To help the data center infrastructure industry as a whole, Google has been leading and promoting industry-wide collaboration on these challenges through forums such as the Open Compute Project (OCP). Google Cloud looks forward to continuing to share knowledge and work together on creative new solutions.