Monday, May 27, 2024

Dataflux: Efficient Data Loading for Machine Learning


Large datasets are ideal for machine learning (ML) models, but fast data loading is essential for affordable ML training. Google created the Dataflux Dataset, a PyTorch Dataset abstraction, to accelerate loading data from Google Cloud Storage. For small files, Dataflux delivers training times up to 3.5 times faster than fsspec.

With the release of this product, Google continues its support for open standards, built on more than 20 years of OSS contributions including TensorFlow, JAX, TFX, MLIR, Kubeflow, and Kubernetes, as well as sponsorship of important OSS data science projects such as Project Jupyter and NumFOCUS.

Google Cloud observed similar speed benefits when it validated the Dataflux Dataset on Deep Learning I/O (DLIO) benchmarks, even with larger files. Because of this significant performance gain, Google Cloud recommends using the Dataflux Dataset for training workflows instead of alternative libraries or direct calls to the Cloud Storage API.

Key features of the Dataflux Dataset include:

  • Direct Cloud Storage integration: eliminates the need to download data locally first.
  • Optimised performance: up to 3.5 times faster training, particularly for smaller files.
  • PyTorch Dataset primitive: integrates easily with familiar PyTorch concepts.
  • Checkpointing support: save and load model checkpoints directly to and from Cloud Storage.
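The checkpointing pattern in that last bullet can be sketched as follows. This is a minimal illustration of serialising training state to a stream, using pickle and an in-memory buffer as a stand-in for a Cloud Storage object; the actual Dataflux checkpoint API and its names are not shown here.

```python
import io
import pickle

# Toy "model state" standing in for a PyTorch state_dict.
state = {"epoch": 3, "weights": [0.1, 0.2, 0.3]}

# Save: serialise the checkpoint into a writable stream. With Dataflux,
# the destination would be a Cloud Storage object rather than a local buffer.
buffer = io.BytesIO()
pickle.dump(state, buffer)

# Load: rewind and deserialise, as a training job would on restart.
buffer.seek(0)
restored = pickle.load(buffer)

print(restored["epoch"])  # 3
```

The point of writing straight to object storage is that the training job never needs local scratch disk sized for the checkpoint.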

How to Use Dataflux Datasets

  • Requirements: Python 3.8 or later.
  • Installation: pip install gcs-torch-dataflux
  • Authentication: use Google Cloud Application Default Credentials.

Example: loading images for training

The Dataflux Dataset can be enabled with just a few small changes. If you use PyTorch with data stored in Cloud Storage, you have most likely written your own Dataset implementation already. The sample below illustrates the Dataflux Dataset creation process.
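As a sketch of the "before" state, here is the kind of hand-rolled map-style dataset many PyTorch users maintain. It implements only the `__len__`/`__getitem__` protocol that `torch.utils.data.Dataset` expects (plain Python is used so the sketch stays self-contained); with Dataflux, a class like this would be replaced by the Dataflux Dataset pointed at a Cloud Storage bucket. The class name, loader data, and transform are illustrative, not part of the Dataflux API.

```python
class ImageFolderDataset:
    """A minimal map-style dataset: the same __len__/__getitem__
    protocol that torch.utils.data.Dataset requires."""

    def __init__(self, samples, transform=None):
        # samples: list of (raw_bytes, label) pairs. In a real workflow
        # these would be image files fetched from disk or Cloud Storage.
        self.samples = samples
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        raw, label = self.samples[index]
        if self.transform is not None:
            raw = self.transform(raw)
        return raw, label


# Stand-in data: three "images" as raw bytes with integer labels.
data = [(b"\x00\x01", 0), (b"\x02\x03", 1), (b"\x04\x05", 0)]
dataset = ImageFolderDataset(data, transform=len)  # toy transform

print(len(dataset))   # 3
print(dataset[1])     # (2, 1) after the len() transform
```

A `DataLoader` can wrap any object with this shape, which is why swapping in a Cloud Storage-backed Dataset requires so few changes to the rest of the training loop.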

Under the hood

To achieve these large performance gains, Google tackled the data-loading bottlenecks in ML training workflows. During a training run, data is loaded in batches from storage, processed, and then sent from CPU to GPU for training computation. If reading and building a batch takes longer than the GPU's processing, the GPU sits blocked and underutilised, and training takes longer.

Retrieving data from a cloud object store such as Google Cloud Storage takes longer than reading from a local disk, particularly when the data is stored as small objects, because of time-to-first-byte latency. Once an object is "opened", however, the Cloud Storage infrastructure offers high throughput. Dataflux therefore uses a Cloud Storage feature called Compose objects, which dynamically merges multiple smaller objects into a larger one.

Instead of fetching each of the, say, 1,024 small objects in a batch (the batch size) individually, Dataflux downloads roughly 30 larger composed objects into memory, then splits them back into their component smaller objects to serve as dataset samples. All temporarily composed objects are cleaned up afterwards.
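The compose-then-decompose step can be illustrated with a small local simulation. Plain byte strings stand in for Cloud Storage objects, and simple concatenation stands in for the Compose objects call; the batch and group sizes here are illustrative, not Dataflux's actual parameters.

```python
def compose(objects, group_size):
    """Merge small byte objects into fewer large ones, recording the
    sizes needed to split them apart again (a local stand-in for the
    Cloud Storage Compose objects operation)."""
    composed = []
    for i in range(0, len(objects), group_size):
        group = objects[i:i + group_size]
        composed.append((b"".join(group), [len(o) for o in group]))
    return composed


def decompose(composed):
    """Split composed objects back into the original samples."""
    objects = []
    for blob, sizes in composed:
        offset = 0
        for size in sizes:
            objects.append(blob[offset:offset + size])
            offset += size
    return objects


# 1,024 small "objects" (one batch), composed into 32-object groups.
batch = [bytes([i % 256]) * 3 for i in range(1024)]
large = compose(batch, group_size=32)   # 32 larger objects to fetch
samples = decompose(large)              # back to individual samples

print(len(large))        # 32
print(samples == batch)  # True
```

The win is that the per-object time-to-first-byte cost is paid ~30 times per batch rather than ~1,000 times, while the high-throughput part of each read is unchanged.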

Another optimisation the Dataflux Dataset uses is high-throughput parallel listing, which speeds up fetching the basic metadata the dataset requires. To accelerate listing further, Dataflux uses a sophisticated mechanism called work-stealing. Even on datasets with tens of millions of objects, the first training run, or "epoch", is faster with this method than without parallel listing.
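The idea behind work-stealing listing can be sketched with a toy simulation: a sorted list of names stands in for a bucket namespace, index ranges are the work items, and a worker that lists one page of a large range splits the remainder and returns half to a shared queue so idle workers can take it. This is a local illustration of the general technique, not Dataflux's actual algorithm.

```python
import queue
import threading


def parallel_list(namespace, num_workers=4, page_size=100):
    """Toy work-stealing list over a sorted `namespace` list."""
    work = queue.Queue()
    work.put((0, len(namespace)))
    results, lock = [], threading.Lock()
    outstanding = [1]  # ranges not yet fully listed

    def worker():
        while True:
            with lock:
                if outstanding[0] == 0:
                    return
            try:
                lo, hi = work.get(timeout=0.01)
            except queue.Empty:
                continue
            page = namespace[lo:min(lo + page_size, hi)]  # one "list page"
            with lock:
                results.extend(page)
            lo += len(page)
            new_items = 0
            if hi - lo > page_size:        # big remainder: share half back
                mid = (lo + hi) // 2
                work.put((mid, hi))
                hi = mid
                new_items += 1
            if lo < hi:                    # keep listing our half
                work.put((lo, hi))
                new_items += 1
            with lock:
                outstanding[0] += new_items - 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


names = [f"obj-{i:05d}" for i in range(1000)]
listed = parallel_list(names)
print(listed == names)  # True
```

Because ranges are split dynamically rather than partitioned up front, no worker is left idle while another still holds a large unlisted region, which is what makes the approach effective on very large, unevenly distributed namespaces.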

Together, fast listing and dynamic composition keep GPU stalls during ML training with Dataflux to a minimum, resulting in much shorter training times and higher accelerator utilisation.

Fast listing and dynamic composition are implemented in the Dataflux client libraries, which are available on GitHub; under the hood, the Dataflux Dataset builds on these client libraries.

Unfortunately, the term "Dataflux Dataset futures" is ambiguous. Two readings are possible:

Dataflux as a Company

This may be a reference to a company named Dataflux. Without more information, it is hard to tell whether such a company exists and what it does with datasets. It is also possible that "Dataflux" refers to the flow of data in a broader sense.

Futures of Datasets

This could refer to datasets in general, which is the more interesting reading. Here are some ideas about the future of datasets:

Enhanced Volume and Variety

The quantity of data gathered will likely rise significantly, drawn from a greater variety of sources such as social media, sensors, and the Internet of Things (IoT).

Put Quality and Security First

As datasets grow larger, ensuring they are secure and of high quality will become increasingly important. Anonymisation, privacy protection, and data-cleansing techniques will be essential.

Advanced Analytics

To extract insights from large and complicated datasets, new analytical tools and methodologies will be developed. Machine learning and artificial intelligence will progress as a result.

Standardisation and Interoperability

To facilitate the sharing and integration of datasets across various platforms and applications, there will probably be a push for more standardised data formats and protocols.

Thota nithya
Thota Nithya has been writing cloud computing articles for Govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.

