IBM Research data loader improves open-source community’s access to AI models for training.
Training AI models more quickly than ever
IBM showcased new advances in high-throughput AI model training at the 2024 PyTorch Conference, along with a state-of-the-art data loader, all geared toward empowering the open-source AI community.
IBM Research experts are contributing to the open-source model training framework at this year’s PyTorch Conference. These contributions include major advances in large language model training throughput as well as a data loader that can handle enormous amounts of data with ease.
Delivering LLMs’ ever-increasing capabilities at a reasonable cost means constantly improving the effectiveness and resilience of the cloud infrastructure that supports their training, tuning, and inference. The open-source PyTorch framework and ecosystem have been central to the AI revolution that is reshaping how we live and work. IBM joined the PyTorch Foundation last year and continues to bring new tools and techniques to the AI community, recognizing that this work cannot happen alone.
Building on IBM’s earlier contributions, these new tools strengthen PyTorch’s ability to meet the community’s ever-expanding demands, whether for more cost-effective checkpointing, faster data loading, or more efficient use of GPUs.
An exceptional data loader for foundation model training and tuning
Using a high-throughput data loader, PyTorch users can now seamlessly distribute LLM training workloads across machines and even adjust their allocations mid-job. It also lets developers save checkpoints more efficiently, so that no work is duplicated during model training. And all of it came from a group of researchers who were simply building the tools they needed to get a job done.
The resulting tool is well suited to LLM training in research settings: for example, when you want to rerun a training run with a new blend of sub-datasets to alter the model weights, or when you have all of your raw text data and want to apply a different tokenizer or maximum sequence length. With the data loader, you can specify what you want on the fly rather than rebuilding the dataset each time you make changes of this kind.
You can even adjust the job halfway through, for example by increasing or decreasing the number of GPUs in response to a change in your resource quota. The data loader ensures that data the model has already seen won’t be shown to it again.
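To make the on-the-fly idea concrete, here is a minimal, hypothetical sketch of lazy tokenization in a plain PyTorch iterable dataset. It is not IBM’s data loader API; the class, the file handling, and the `tokenizer.encode` call are illustrative assumptions.

```python
# Illustrative sketch only: tokenization and sequence-length choices are applied
# lazily at iteration time, so the raw text never has to be rebuilt on disk.
from torch.utils.data import IterableDataset

class LazyTokenizedDataset(IterableDataset):
    def __init__(self, text_files, tokenizer, max_seq_len):
        self.text_files = text_files      # raw text stays untouched on disk
        self.tokenizer = tokenizer        # swap tokenizers without rebuilding
        self.max_seq_len = max_seq_len    # change sequence length per run

    def __iter__(self):
        for path in self.text_files:
            with open(path) as f:
                for line in f:
                    ids = self.tokenizer.encode(line.strip())
                    # truncate to the requested sequence length on the fly
                    yield ids[: self.max_seq_len]
```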
Increasing the throughput of training
Bottlenecks occur because, in model training at scale, everything moves at the speed of the slowest component. In AI workloads, that bottleneck is frequently how efficiently the GPUs are being used.
One component of this approach is fully sharded data parallel (FSDP), which evenly shards a model’s parameters, gradients, and optimizer states, along with the large training datasets, across numerous processors so that no single machine becomes overburdened. This distribution has been shown to greatly increase the speed and efficiency of model training and tuning, enabling faster AI training with fewer GPUs.
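As a rough illustration, wrapping a model in FSDP in stock PyTorch looks something like the sketch below; the toy model and hyperparameters are placeholders, not IBM’s training stack.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, typically launched with torchrun.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a real transformer; FSDP shards its parameters, gradients,
# and optimizer state across all ranks instead of replicating them.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```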
This development progressed in tandem with the data loader: as the team used FSDP and torch.compile to drive GPU utilization higher, the data loader, rather than the GPUs, became the bottleneck.
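As a hedged sketch of that combination, torch.compile can be layered on top of the FSDP-wrapped model from the example above; the batch, targets, and loss function here are placeholders.

```python
# torch.compile traces and optimizes the model's compute graph, which is part of
# how GPU utilization was pushed up; the data below is dummy data for illustration.
compiled_model = torch.compile(model)

batch = torch.randn(8, 1024, device="cuda")     # placeholder input batch
target = torch.randn(8, 1024, device="cuda")    # placeholder targets

loss = torch.nn.functional.mse_loss(compiled_model(batch), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```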
Next up
Although FP8 (8-bit floating point) precision isn’t yet generally available for developers to use, Ganti notes that the team is working on projects that will showcase its capabilities. In related work, they’re using torch.compile to optimize model tuning and training on IBM’s artificial intelligence unit (AIU).
Triton, OpenAI’s open-source language and compiler for writing GPU kernels, will also be a topic of discussion for Ganti, Wertheimer, and other colleagues. Triton lets programmers write Python code that is then translated into the native instructions of the underlying hardware, from Intel or Nvidia, for example, to accelerate computation. Although Triton is currently ten to fifteen percent slower than CUDA, the standard software framework for programming Nvidia GPUs, the researchers have just completed the first end-to-end CUDA-free inference run with Triton. They believe that as this effort picks up steam, Triton will close the gap and significantly optimize training.
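For a flavor of what that workflow looks like, here is the standard toy vector-addition kernel from Triton’s tutorials, not IBM’s inference kernels: Python code decorated with @triton.jit that Triton compiles for the target GPU.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements              # guard against out-of-bounds access
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)       # one program instance per block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```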
The starting point of the study
IBM Research’s Davis Wertheimer outlines a few difficulties that can arise in large-scale training. You can apply an 80/20 rule: algorithmic tradeoffs between GPU memory, compute, and communication make up roughly 80 percent of the published research. But because the pipeline moves at the pace of its narrowest bottleneck, when you actually try to build something, you can expect to spend 80 percent of your time on a very long tail of these other practical concerns.
The IBM team ran into exactly these problems as they built their training platform. “As we become more adept at using our GPUs,” Wertheimer notes, “the data loader is increasingly often the bottleneck.”
Important characteristics of the data loader
Stateful and checkpointable: The data loader’s state is saved whenever the model is saved, and both the model state and the data loader state are restored together whenever training resumes from a checkpoint (a minimal sketch of this idea follows the list).
Checkpoint auto-rescaling: The data loader automatically adapts to workload changes during prolonged training runs. There are many reasons you might have to rescale a workload mid-run, and training can easily take weeks or months.
Efficient data streaming: Because the system streams data, shuffling it incurs no build overhead.
Asynchronous distributed operation: The data loader is non-blocking, and its state can be saved and then redistributed in a way that requires no communication at all.
Dynamic data mixing: The data loader can adjust to different data mixing ratios, which is helpful as training requirements change.
Efficient global shuffling: Shuffling remains efficient as data accumulates, since the tool manages the memory bottlenecks that arise with huge datasets.
PyTorch-native, modular, and feature-rich: The data loader is built to be flexible and scalable, ready for future growth. What if we have to deal with thirty trillion, fifty trillion, or one hundred trillion tokens next year? The world is changing quickly, so the data loader needs to be built to survive not only today but also tomorrow.
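To illustrate the stateful and checkpointable behavior described above, here is a minimal, hypothetical sketch. The class, its state_dict format, and the toy dataset are assumptions for illustration; IBM’s actual data loader also handles distribution, rescaling, and shuffling.

```python
import torch

class CheckpointableLoader:
    """Toy sketch: tracks its position so already-seen data is skipped on resume."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.samples_seen = 0                 # position in the logical data stream

    def __iter__(self):
        for i, sample in enumerate(self.dataset):
            if i < self.samples_seen:         # skip data consumed before the checkpoint
                continue
            self.samples_seen += 1
            yield sample

    def state_dict(self):
        return {"samples_seen": self.samples_seen}

    def load_state_dict(self, state):
        self.samples_seen = state["samples_seen"]

loader = CheckpointableLoader(range(10))      # toy stand-in for a token stream
it = iter(loader)
consumed = [next(it) for _ in range(4)]       # consume part of the data

# Save loader state alongside the (placeholder) model state, as the list describes.
torch.save({"model": {}, "loader": loader.state_dict()}, "ckpt.pt")

# On resume, restore both together; previously seen data is not revisited.
restored = CheckpointableLoader(range(10))
restored.load_state_dict(torch.load("ckpt.pt")["loader"])
remaining = list(restored)                    # yields only samples 4 through 9
```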
Actual results
The IBM Research team rigorously tested their data loader over several months, running hundreds of small and large workloads. They observed steady, smooth performance numbers, and the data loader as a whole runs asynchronously and non-blocking.