Colossus Storage
Under the hood of Colossus: how Google provides SSD performance at HDD costs
Nearly every Google product, from BigQuery and Cloud Storage to YouTube and Gmail, relies on Colossus, Google Cloud’s core distributed storage system. Colossus, Google’s all-purpose storage platform, offers the manageability and scalability of an object store, an intuitive programming model that every Google team uses, and throughput on par with or better than the best parallel file systems. It also meets the demands of products with widely varying requirements for latency, scale, price, and throughput.
| Example application | I/O sizes | Expected performance |
|---|---|---|
| BigQuery scans | hundreds of KBs to tens of MBs | TB/s |
| Cloud Storage – standard | KBs to tens of MBs | 100s of milliseconds |
| Gmail messages | less than hundreds of KBs | 10s of milliseconds |
| Gmail attachments | KBs to MBs | seconds |
| Hyperdisk reads | KBs to hundreds of KBs | <1 ms |
| YouTube video storage | MBs | seconds |
The adaptability of Colossus is evident in several publicly available Google Cloud products. Hyperdisk ML uses Colossus solid-state drives (SSDs) to serve 2,500 nodes reading at 1.2 TB/s, an impressive level of scalability. Colossus is at the core of Spanner’s tiered storage feature, which combines inexpensive HDD storage with fast SSD storage in the same filesystem. Cloud Storage uses Colossus SSD caching to offer the least expensive storage while still supporting the demanding I/O of AI/ML applications. Finally, BigQuery’s Colossus-based storage delivers extremely fast I/O for even the largest queries.
We wanted to share how these features support Google Cloud’s evolving business and what new capabilities Google has added, particularly around SSD support.
Background on Colossus
First, some background on Colossus storage:
- Colossus is the evolution of the Google File System (GFS).
- A conventional Colossus filesystem lives within a single datacenter.
- Colossus simplified the GFS programming model to an append-only storage system that combines the scalability of object storage with the familiar programming interface of file systems.
- The Colossus metadata service is composed of “curators,” which handle interactive control operations such as file creation and deletion, and “custodians,” which maintain data durability, availability, and disk-space balance.
- Colossus clients interact with curators for metadata, then store data directly on “D servers,” which host the system’s HDDs and SSDs (a minimal sketch of this flow follows the list).
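The sketch below illustrates the read path described above: metadata comes from a curator, and the bytes come straight from D servers. The class and method names (Curator, DServer, ChunkLocation, read_file) are illustrative assumptions for this sketch, not actual Colossus APIs.

```python
# Minimal sketch of the Colossus client read path: metadata from a curator,
# data read directly from D servers. All names here are illustrative.
from dataclasses import dataclass


@dataclass
class ChunkLocation:
    d_server: str   # address of the D server holding the chunk
    chunk_id: str   # identifier of the chunk on that server


class Curator:
    """Metadata service: maps a file path to chunk locations on D servers."""
    def __init__(self, metadata: dict):
        self._metadata = metadata

    def lookup(self, path: str) -> list:
        return self._metadata[path]


class DServer:
    """Storage server: holds chunk payloads on its local HDDs/SSDs."""
    def __init__(self, chunks: dict):
        self._chunks = chunks

    def read(self, chunk_id: str) -> bytes:
        return self._chunks[chunk_id]


def read_file(path: str, curator: Curator, d_servers: dict) -> bytes:
    # 1. Ask a curator for the file's metadata (which chunks live where).
    locations = curator.lookup(path)
    # 2. Read the data directly from the D servers, bypassing the curator.
    return b"".join(d_servers[loc.d_server].read(loc.chunk_id) for loc in locations)
```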
It is also important to recognise that Colossus is a zonal product. A single Colossus filesystem, which Google builds for each cluster, is the core building block of a Google Cloud zone. Most datacenters have a single cluster and therefore a single Colossus filesystem, regardless of how many workloads run within that cluster. Several Colossus filesystems hold multiple exabytes of storage, including two distinct filesystems with more than 10 exabytes each. Thanks to this scalability, even the most demanding applications won’t run out of disk space near their cluster’s compute resources within a zone.
These same demanding applications also require high throughput and IOPS. Indeed, some of Google’s largest filesystems regularly exceed read throughputs of 50 TB/s and write throughputs of 25 TB/s. That is enough to deliver more than 100 full-length 8K movies every second!
Nor is Colossus storage used solely for large streaming reads and writes. Many applications perform small random reads or log appends. With reads and writes combined, Google’s busiest single cluster delivers over 600M IOPS.
It goes without saying that achieving this kind of performance requires getting the right data to the right place. It is hard to read at 50 TB/s if all of your data sits on slow disk drives. This brings us to two significant newer developments in Colossus: SSD data placement and caching, both driven by a system Google calls “L4”.
What’s new with Colossus SSD placement?
These days, no storage designer would specify a system made entirely of HDDs. At the same time, SSD-only storage remains significantly more expensive than a fleet that mixes SSD and HDD. The challenge is keeping the majority of the data on HDD while placing the right data, the data that is most latency-sensitive or receives the most I/Os, on SSD.
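To make the cost argument concrete, here is a purely illustrative back-of-the-envelope calculation. The per-GB prices and the 10% “hot data” fraction are made-up placeholders, not Google’s actual costs.

```python
# Illustrative arithmetic only: why a mixed SSD/HDD fleet costs far less than
# SSD-only storage. All prices and fractions below are invented placeholders.
ssd_cost_per_gb = 0.08   # hypothetical $/GB for SSD
hdd_cost_per_gb = 0.01   # hypothetical $/GB for HDD
total_gb = 1_000_000     # 1 PB of data
ssd_fraction = 0.10      # keep only the hottest 10% of bytes on SSD

ssd_only = total_gb * ssd_cost_per_gb
mixed = total_gb * (ssd_fraction * ssd_cost_per_gb
                    + (1 - ssd_fraction) * hdd_cost_per_gb)

print(f"SSD-only fleet: ${ssd_only:,.0f}")
print(f"Mixed fleet:    ${mixed:,.0f}  ({mixed / ssd_only:.0%} of SSD-only cost)")
```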
Now let’s see how Colossus storage decides which data deserves SSD.
Colossus offers multiple methods for choosing which data to put on SSDs:
- Specify SSD-only placement: An internal Colossus storage user can force the system to keep data entirely on SSD by writing it to a path such as /cns/ex/home/leg/partition=ssd/myfile. This is the simplest method and guarantees that the file is fully stored on SSD, but it is also the most expensive option (see the path sketch after this list).
- Use hybrid placement: More sophisticated users can instruct Colossus to store only one replica on SSD by using “hybrid placement”, e.g. /cns/ex/home/leg/partition=ssd.1/myfile. This method is less expensive, but reads fall back to HDD latency if the D server hosting the SSD replica is unavailable.
- Use L4: For the majority of Google’s data, most developers rely on the newer L4 distributed SSD caching technology, which dynamically selects the data best suited for SSD.
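The path convention in the first two options comes straight from the examples above; the small helper below simply shows how the three choices differ in the path a user writes to. The function itself is a hypothetical illustration, not part of any Colossus client library.

```python
# Sketch of the three placement choices, expressed as path construction.
# The /cns/... partition convention comes from the examples in the text;
# this helper function is hypothetical.
def placement_path(directory: str, filename: str, placement: str = "l4") -> str:
    """Build a file path that encodes the desired SSD placement."""
    if placement == "ssd":       # SSD-only placement: all replicas on SSD
        return f"{directory}/partition=ssd/{filename}"
    if placement == "hybrid":    # hybrid placement: one replica on SSD
        return f"{directory}/partition=ssd.1/{filename}"
    # default: no explicit placement; let the L4 cache decide dynamically
    return f"{directory}/{filename}"


print(placement_path("/cns/ex/home/leg", "myfile", "ssd"))
print(placement_path("/cns/ex/home/leg", "myfile", "hybrid"))
print(placement_path("/cns/ex/home/leg", "myfile"))
```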
L4 read caching
After analysing an application’s access patterns, the L4 distributed SSD cache automatically inserts the data that is best suited for SSD.
When L4 is used as a read cache, its index servers maintain the state of a distributed read cache:

This means that before attempting to read any data, an application first checks with an L4 index server. If the data is in the cache, the client reads it from one or more SSDs; if not, it notifies the cache of the miss and retrieves the data from the disks where Colossus storage has placed it.
In response to cache misses, L4 may choose to add the accessed data to the SSD cache. It does this by instructing an SSD storage server to copy the data from the HDD server. As the cache fills up, L4 evicts older entries to make room for new ones.
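The sketch below captures the read-cache behaviour just described: check the index, serve hits from SSD, fall back to HDD on a miss, optionally admit the block, and evict when full. It is a simplified assumption-laden model (admit-on-first-miss, LRU eviction, and the HDD-to-SSD copy step is elided), not the actual L4 implementation.

```python
# Simplified L4-like read cache: index lookup, HDD fallback on miss,
# admit-on-miss, and LRU-style eviction. Names and policy are illustrative.
from collections import OrderedDict


class L4LikeReadCache:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.index = OrderedDict()   # block id -> size, kept in LRU order

    def read(self, block_id: str, size: int, read_hdd, read_ssd) -> bytes:
        if block_id in self.index:            # hit: serve from SSD
            self.index.move_to_end(block_id)
            return read_ssd(block_id)
        data = read_hdd(block_id)             # miss: fall back to HDD
        self._maybe_admit(block_id, size)     # optionally add to the cache
        # In the real system, the index server would now instruct an SSD
        # server to copy the data from HDD; that step is omitted here.
        return data

    def _maybe_admit(self, block_id: str, size: int) -> None:
        while self.used + size > self.capacity and self.index:   # evict as needed
            _, evicted_size = self.index.popitem(last=False)
            self.used -= evicted_size
        if self.used + size <= self.capacity:
            self.index[block_id] = size
            self.used += size
```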
L4 can be quite selective about how much data should be stored on SSD. For every workload, Google uses a machine-learning (ML)-driven algorithm to choose between three possible policies: insert into the L4 cache when the data is written, after it is read for the first time, or only after it is read twice within a short period.
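To make the three policies concrete, here they are written as simple predicates. This is only a sketch; in practice the per-workload choice among them is made by the ML-driven algorithm described above, not hard-coded.

```python
# The three cache-admission policies, expressed as simple predicates.
# Illustrative only: the real per-workload choice is made by an ML-driven
# algorithm, and the event/recent_reads inputs are assumptions of this sketch.
from enum import Enum


class AdmissionPolicy(Enum):
    ON_WRITE = "insert when the data is written"
    ON_FIRST_READ = "insert after the first read"
    ON_SECOND_READ = "insert only after two reads within a short window"


def should_admit(policy: AdmissionPolicy, event: str, recent_reads: int) -> bool:
    """Decide whether an access event should pull the data into the SSD cache."""
    if policy is AdmissionPolicy.ON_WRITE:
        return event == "write"
    if policy is AdmissionPolicy.ON_FIRST_READ:
        return event == "read"
    # ON_SECOND_READ: admit only if this read follows another recent read.
    return event == "read" and recent_reads >= 1
```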
This approach works well for applications that read the same data frequently and has significantly increased their IOPS and throughput. However, it has a notable limitation: newly written data still goes to HDD first. It turns out that L4 read caching is less effective at saving resources for other important classes of data, such as database transaction logs and other files that receive many small appends, as well as data that is written, read, and deleted quickly (like intermediate results for a large batch-processing job). Both of these workloads are ill-suited to HDD, so it is better to write them straight to SSD and avoid HDD entirely.
L4 writeback for Colossus
Consider an internal Colossus storage user who wants to store some of their data on SSD. This user must carefully consider which files to keep on SSD and how much SSD quota to purchase for their workload. They may also want to move older, rarely accessed files from SSD to HDD.

When used as a writeback cache, the L4 service advises Colossus curators on whether, and for how long, to place new files on SSD. This is difficult! At file-creation time, Colossus sees only the file’s name and the application creating it; it has no idea how the file will be used.
To address this, Google employs the same strategy as the L4 read cache, described in the CacheSack paper. The application passes L4 features such as the file format or metadata about the database column containing the data. L4 uses these attributes to separate files into “categories”, then tracks the I/O patterns of each category over time. These I/O patterns power an online simulation of various placement policies, such as “place on SSD for one hour,” “place on SSD for two hours,” or “don’t place on SSD.” Based on this simulation, L4 selects the best policy for each category.
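The following sketch shows the shape of that online simulation: replay recent access traces for a category under a few candidate policies and keep the one that leaves the least I/O on HDD. The candidate TTLs, the trace format, and the cost accounting (which ignores SSD capacity cost) are simplifying assumptions, not the actual CacheSack algorithm.

```python
# Rough sketch of per-category policy simulation: for each candidate
# "keep on SSD for ttl seconds" policy, count how many reads would still
# land on HDD, and pick the policy that minimises them. Illustrative only.
from dataclasses import dataclass


@dataclass
class FileTrace:
    """Simplified access trace for one file: seconds from creation to each read."""
    read_times: list


# Candidate policies: keep a new file on SSD for ttl seconds (0 = don't place on SSD).
CANDIDATE_TTLS = [0, 3600, 7200]


def hdd_reads_under_policy(trace: FileTrace, ttl: int) -> int:
    """Count reads that would still hit HDD if the file spent `ttl` seconds on SSD."""
    return sum(1 for t in trace.read_times if t >= ttl)


def best_policy(traces: list) -> int:
    """Pick the TTL that minimises HDD reads for this category (ignoring SSD cost)."""
    return min(CANDIDATE_TTLS,
               key=lambda ttl: sum(hdd_reads_under_policy(tr, ttl) for tr in traces))
```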
Another important function of these live simulations is to predict which placement L4 would choose if more or less SSD capacity were available. As a result, it can forecast how much I/O could be offloaded from HDD at different SSD capacities. These signals inform the procurement of new SSD hardware and help planners decide how to distribute SSD capacity among applications for the best performance.
On L4’s recommendation, the curator can then direct new files to SSD instead of the default HDD. If a file is still there after a predetermined period of time, the curator moves its data from SSD to HDD.
When the L4 simulations correctly predict the file access patterns, Google Cloud places only a small fraction of its data on SSD, which absorbs the majority of the reads (reads often target freshly created files), and then moves the data to less expensive storage to reduce the overall cost. In the ideal case, the file is deleted before it would have moved to HDD, and all HDD I/O is avoided.
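A minimal sketch of that writeback flow is below: new files land on SSD when recommended, a background pass later moves anything that has outlived its SSD budget to HDD, and files deleted early never touch HDD. The class, its methods, and the single fixed TTL are assumptions of this sketch rather than the curator’s actual logic.

```python
# Sketch of the writeback placement flow: place new files on SSD when advised,
# age them out to HDD after a TTL, and skip HDD entirely for files deleted
# before then. Names and structure are illustrative only.
import time


class WritebackPlacer:
    def __init__(self, ssd_ttl_seconds: float):
        self.ssd_ttl = ssd_ttl_seconds
        self.on_ssd = {}   # path -> creation timestamp

    def create_file(self, path: str, place_on_ssd: bool) -> str:
        if place_on_ssd:
            self.on_ssd[path] = time.time()
            return "ssd"
        return "hdd"

    def delete_file(self, path: str) -> None:
        # Files deleted while still on SSD never generate any HDD I/O.
        self.on_ssd.pop(path, None)

    def background_move_pass(self) -> list:
        """Move files that have been on SSD longer than the TTL down to HDD."""
        now = time.time()
        expired = [p for p, created in self.on_ssd.items() if now - created > self.ssd_ttl]
        for path in expired:
            del self.on_ssd[path]   # in reality the curator copies the data to HDD
        return expired
```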
Google Cloud and Colossus SSD
Colossus storage serves as the foundation for Google and Google Cloud, providing dependable services to billions of users. Its advanced SSD placement features help keep costs low and performance high while automatically adapting to workload changes.