GCS Hierarchical Namespace
Use the hierarchical namespace of Cloud Storage to expedite AI/ML workloads.
As AI and machine learning (ML) workloads continue to grow, the infrastructure that supports them must evolve to meet their particular requirements. In this article, we’ll explain how Cloud Storage’s new hierarchical namespace (HNS) feature can help you optimise the efficiency and performance of your AI/ML workloads.
The function of storage in AI/ML tasks
The following processes are commonly included in AI/ML data pipelines, and they can put a lot of strain on the underlying storage system:
- Data preparation and preprocessing includes validating and preprocessing data, ingesting it into storage, and converting it into the format required for model training.
- Model training uses multiple GPU/TPU compute instances to iteratively build and improve an AI/ML model.
To conserve time and resources, training also includes checkpointing, which saves the model’s state at regular intervals so that training can resume from the most recent saved state rather than starting from scratch. This lets developers experiment with hyperparameters or modify training objectives without losing previous progress, and it provides fault tolerance against the failures that are typical in large-scale distributed training.
- Model serving usually entails loading the model, weights, and dataset into compute instances with GPUs or TPUs in order to perform inference.
AI/ML workloads can run on large compute clusters with thousands of nodes performing concurrent I/O on petabyte-scale datasets. As a result, the underlying storage system frequently becomes the bottleneck in AI/ML pipelines, which leaves costly GPU/TPU cycles underutilised.
The advantages of GCS Hierarchical Namespace for AI/ML workloads
Cloud Storage’s hierarchical namespace can be enabled when you create a bucket. It offers several advantages for AI/ML workloads, such as:
- A new “folder” resource type and APIs that are tailored for filesystem semantics.
- Fast, atomic folder renames, which make checkpointing quicker and more reliable.
- A storage layout that is optimised to support more reads and writes per second (QPS).
Let’s take a closer look at these advantages.
Data access and organisation tailored to filesystem semantics
In a GCS Hierarchical Namespace bucket, folders can hold objects and other folders, so Cloud Storage data, which is typically flat, can be arranged in a tree-like structure that resembles a conventional filesystem. This lets clients that work directly with folders, such as Cloud Storage FUSE, map filesystem calls to Cloud Storage APIs.
Flat namespace buckets frequently require expensive and inefficient object-level operations to mimic filesystem operations, whereas a hierarchical namespace provides filesystem semantics natively in the underlying storage system. For instance, filesystem libraries usually implement inode lookups with resource-intensive ListObject calls; with a hierarchical namespace, these can be replaced with more efficient GetFolderMetadata calls. This is particularly advantageous because AI/ML workloads frequently rely on frameworks such as TensorFlow and PyTorch that communicate with storage through a filesystem interface.
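As a rough illustration of the difference, the sketch below contrasts the two lookup styles using the Python client libraries. It assumes that the GetFolderMetadata call mentioned above corresponds to the get_folder method of StorageControlClient in the google-cloud-storage-control package. The helper names, bucket name, and paths are hypothetical, and both clients are passed in by the caller.

```python
def is_directory_flat(storage_client, bucket_name: str, path: str) -> bool:
    """Flat bucket: infer that a "directory" exists by listing under its prefix."""
    prefix = path.rstrip("/") + "/"
    # list_blobs issues an object listing request; a single result is enough
    # to show that at least one object lives under the prefix.
    blobs = storage_client.list_blobs(bucket_name, prefix=prefix, max_results=1)
    return any(True for _ in blobs)


def get_folder_metadata(control_client, bucket_name: str, path: str):
    """HNS bucket: fetch the folder resource directly by its name."""
    # "_" is the project wildcard used in Cloud Storage resource names.
    name = f"projects/_/buckets/{bucket_name}/folders/{path}"
    # The real client raises google.api_core.exceptions.NotFound when the
    # folder does not exist; this sketch lets that exception propagate.
    return control_client.get_folder(request={"name": name})


# Usage with real clients (assumes credentials and an HNS-enabled bucket):
#   from google.cloud import storage, storage_control_v2
#   is_directory_flat(storage.Client(), "my-bucket", "datasets/train")
#   get_folder_metadata(
#       storage_control_v2.StorageControlClient(), "my-bucket", "datasets/train"
#   )
```

The flat variant has to ask the service to search the object listing, while the folder variant addresses one named resource directly, which is the kind of substitution the paragraph above describes.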
Customers such as AssemblyAI report notable benefits from using GCS Hierarchical Namespace with Cloud Storage FUSE to power their AI/ML workloads.
Up to 20 times quicker checkpointing
Writing checkpoints and managing intermediate outputs often involves renaming folders and objects. Cloud Storage’s hierarchical namespace buckets introduce a new RenameFolder API that is fast and atomic. Simulating a folder rename in a flat namespace bucket could require thousands of individual object rewrites and deletes, depending on how many objects are in the folder. In a hierarchical namespace bucket, a rename is a folder-level, metadata-only operation that completes atomically in a fraction of the time. Atomicity also avoids the inconsistencies and complicated state management caused by partial failures, which are a frequent issue with simulated renames in flat buckets.
Checkpoint benchmarks that exercise folder renames show that GCS Hierarchical Namespace buckets can speed up checkpoint writes by up to 20 times compared with flat buckets.
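One common way to use an atomic rename for checkpointing is to write all checkpoint files into a staging folder and then rename that folder to its final name once every file is in place. The sketch below shows that pattern with the rename_folder method of StorageControlClient (google-cloud-storage-control package). The function name, folder layout, and timeout are assumptions for illustration and are not taken from the benchmark above.

```python
def commit_checkpoint(control_client, bucket_name: str, staging_folder: str,
                      final_folder: str, timeout_s: float = 60.0):
    """Publish a fully written checkpoint by renaming its staging folder."""
    request = {
        # Full resource name of the folder to rename ("_" is the project wildcard).
        "name": f"projects/_/buckets/{bucket_name}/folders/{staging_folder}",
        # New folder path inside the same bucket.
        "destination_folder_id": final_folder,
    }
    # rename_folder returns a long-running operation; result() blocks until
    # the rename has completed or the timeout expires.
    operation = control_client.rename_folder(request=request)
    return operation.result(timeout=timeout_s)


# Usage with a real client (assumes credentials and an HNS-enabled bucket):
#   from google.cloud import storage_control_v2
#   client = storage_control_v2.StorageControlClient()
#   commit_checkpoint(client, "my-bucket",
#                     "checkpoints/step-1000.tmp", "checkpoints/step-1000")
```

Because the rename either happens completely or not at all, a reader that only looks for folders without the hypothetical ".tmp" suffix never sees a half-written checkpoint.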
Up to eight times the QPS
AI/ML workloads running on massive clusters send millions of I/O requests to the underlying storage system. Checkpoint writes and restores during model training, and reads served for inference, are extremely bursty because many nodes are synchronised to access storage at the same time. High QPS capacity helps avoid storage bottlenecks that could starve costly GPUs and TPUs.
Thanks to their optimised storage layout, GCS Hierarchical Namespace buckets support up to eight times higher initial object read and write requests per second (QPS) than flat namespace buckets. QPS can still be doubled every 20 minutes in accordance with the Cloud Storage ramp-up guidelines. As a result, a cold hierarchical namespace bucket can reach 100,000 object write QPS in almost half the time a flat bucket needs.
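To see where "almost half the time" can come from, here is a back-of-the-envelope calculation. It assumes a flat bucket starts at roughly 1,000 object write QPS (an illustrative baseline that is not stated in this article), that a hierarchical namespace bucket starts at eight times that, and that capacity doubles in discrete 20-minute steps, which is a simplification of real ramp-up behaviour.

```python
def minutes_to_reach(target_qps: int, initial_qps: int,
                     doubling_minutes: int = 20) -> int:
    """Minutes needed to reach target_qps if capacity doubles every step."""
    qps, minutes = initial_qps, 0
    while qps < target_qps:
        qps *= 2
        minutes += doubling_minutes
    return minutes


FLAT_INITIAL_WRITE_QPS = 1_000                      # assumed baseline
HNS_INITIAL_WRITE_QPS = 8 * FLAT_INITIAL_WRITE_QPS  # "up to eight times"

flat_minutes = minutes_to_reach(100_000, FLAT_INITIAL_WRITE_QPS)
hns_minutes = minutes_to_reach(100_000, HNS_INITIAL_WRITE_QPS)
print(flat_minutes, hns_minutes)  # 140 80
```

Under these assumptions the flat bucket needs seven doublings (140 minutes) and the hierarchical namespace bucket needs four (80 minutes), a ratio of about 0.57.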
In conclusion
AI/ML workloads need fast checkpointing to optimise GPU/TPU utilisation, high QPS to facilitate rapid ramp-up, and effective data organisation with filesystem semantics for tight integration with frameworks. Hierarchical namespace buckets offer all of these advantages in addition to the scalability, dependability, ease of use, and affordability that Cloud Storage is known for. For AI/ML workloads, Google advises enabling hierarchical namespace on new buckets.
What is Hierarchical Namespace?
With Cloud Storage’s hierarchical namespace feature, you can store your data in a logical filesystem structure and arrange objects into folders. Storing data in a filesystem structure makes data-intensive and file-oriented workloads easier to manage, improves performance, and provides consistency.
Folder operations, which include creating, deleting, listing, and renaming folders, provide management and reliability capabilities. Arranging objects hierarchically makes data easier to organise and manage. In a bucket with GCS Hierarchical Namespace enabled, a folder can contain objects, other folders, or both.
To use folders, you must enable hierarchical namespace when you create the bucket. The hierarchical namespace setting of a bucket cannot be changed after the bucket has been created.
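A minimal sketch of enabling the feature at creation time with the google-cloud-storage Python library might look like the following. The function name, bucket name, and default location are hypothetical. Hierarchical namespace requires uniform bucket-level access, so the sketch turns that on as well.

```python
def create_hns_bucket(storage_client, bucket_name: str, location: str = "US"):
    """Create a new bucket with hierarchical namespace enabled."""
    bucket = storage_client.bucket(bucket_name)
    # Hierarchical namespace requires uniform bucket-level access.
    bucket.iam_configuration.uniform_bucket_level_access_enabled = True
    # This must be set before create(); it cannot be changed afterwards.
    bucket.hierarchical_namespace_enabled = True
    bucket.create(location=location)
    return bucket


# Usage with a real client (assumes credentials and a default project):
#   from google.cloud import storage
#   create_hns_bucket(storage.Client(), "my-hns-bucket")
```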
The following diagram illustrates a bucket with GCS Hierarchical Namespace enabled, where objects are arranged in a hierarchical folder structure.
Essential characteristics
Features of a hierarchical namespace include:
Higher initial queries per second (QPS): Buckets with hierarchical namespace enabled offer up to eight times higher initial QPS limits for reading and writing objects than buckets without it. The higher initial QPS increases throughput and makes it easier to scale data-intensive workloads.
Folders: Folders serve as containers for objects and other folders. You can create, delete, and retrieve folders.
Rename folders: The rename folders operation atomically changes the path of a folder and the folders beneath it without deleting any objects. This is efficient and saves time, particularly for large folders that contain many objects.
List folders: This operation lists every folder in the bucket or beneath a particular folder, which makes it easier to manage and understand the structure of the data stored in a bucket.
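For reference, the create and list operations above could be called from Python roughly as follows, using StorageControlClient from the google-cloud-storage-control package. This is a sketch under assumptions: the helper names and folder paths are hypothetical, requests are passed as plain dictionaries, and the prefix filter is expected to end with "/".

```python
def bucket_path(bucket_name: str) -> str:
    """Resource name of a bucket; "_" is the project wildcard."""
    return f"projects/_/buckets/{bucket_name}"


def create_folder(control_client, bucket_name: str, folder_id: str):
    """Create a folder such as "datasets/raw" in an HNS-enabled bucket."""
    request = {"parent": bucket_path(bucket_name), "folder_id": folder_id}
    return control_client.create_folder(request=request)


def list_folder_names(control_client, bucket_name: str, prefix: str = ""):
    """Return the resource names of folders in the bucket or under a prefix."""
    request = {"parent": bucket_path(bucket_name)}
    if prefix:
        # Restrict the listing to folders beneath this path, e.g. "datasets/".
        request["prefix"] = prefix
    # list_folders returns an iterable pager that fetches further pages on demand.
    return [folder.name for folder in control_client.list_folders(request=request)]


# Usage with a real client (assumes credentials and an HNS-enabled bucket):
#   from google.cloud import storage_control_v2
#   client = storage_control_v2.StorageControlClient()
#   create_folder(client, "my-bucket", "datasets/raw")
#   print(list_folder_names(client, "my-bucket", prefix="datasets/"))
```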