What capacity advancements will AI bring about?
It is an exciting time to be working in storage. The IT sector is on the cusp of a seismic shift, centered on how artificial intelligence (AI) will change both what we expect computers to do for us and how we design and build servers. Generative AI has created enormous excitement among the public and across the industry.
When ChatGPT first appeared earlier this year, it sparked people's imagination: a machine that can understand questions asked in natural language, hold a conversation on nearly any topic, and even compose poetry and rhymes like a person. The same is true of the many image-generation AI models that can turn simple text prompts from the user into breathtaking visual artwork.
The rapid development of AI has created enormous demand for high-bandwidth memory (HBM); these days, HBM solutions are more sought after than gold. Large language models (LLMs) are also driving the need for a larger memory footprint on the CPU to handle progressively larger and more complex models. While the benefits of increased memory bandwidth and capacity are well acknowledged, storage's contribution to the development of artificial intelligence is sometimes overlooked.
What part does storage play, and how important is it in AI workloads?
Storage has a significant impact in two places. The first is the fast local storage that serves as a cache for the training data fed into the GPU's HBM; the performance requirements call for a high-performance SSD. The second essential function of storage is holding all of the training datasets in huge data lakes.
Local cache drive
LLMs are trained on human-generated content from books, dictionaries, and the internet. The input/output pattern to the training data on the local cache drive is structured: large data blocks are read to prefetch the next batch of data into memory. As a result, GPU computation in conventional LLM training is typically not held back by the SSD's speed. Other AI/ML models, such as computer vision (CV) and mixed-mode LLM+CV models, demand higher bandwidth and place more strain on the local cache drive.
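To illustrate the structured prefetch pattern, here is a minimal sketch of a background reader that streams a training shard sequentially in large blocks and stages the next batches ahead of the GPU. The file name, block size, and queue depth are hypothetical, and the GPU step is a placeholder.

```python
import queue
import threading

BLOCK_SIZE = 8 * 1024 * 1024   # large sequential reads (illustrative 8 MiB block size)
PREFETCH_DEPTH = 4             # batches staged ahead of the GPU

def prefetch_worker(path: str, batches: "queue.Queue[bytes]") -> None:
    """Stream a training shard sequentially from the local cache SSD."""
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            batches.put(chunk)     # blocks when the queue is already full
    batches.put(b"")               # sentinel: end of shard

def consume_on_gpu(chunk: bytes) -> None:
    """Stand-in for the host-to-device copy and training step."""
    pass

def train(path: str) -> None:
    batches: "queue.Queue[bytes]" = queue.Queue(maxsize=PREFETCH_DEPTH)
    threading.Thread(target=prefetch_worker, args=(path, batches), daemon=True).start()
    while chunk := batches.get():  # loop ends at the empty-bytes sentinel
        consume_on_gpu(chunk)

if __name__ == "__main__":
    train("training_shard.bin")    # hypothetical shard file on the local cache drive
```

Because the reads are large and sequential, even a modest SSD can usually keep the queue full, which is why the drive is rarely the bottleneck for conventional LLM training.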
Graph neural networks (GNNs) are widely employed in fraud detection, network intrusion detection, and deep learning recommendation models (DLRM) for product recommendations; DLRM is sometimes called the biggest revenue-generating algorithm on the internet. GNN training tends to access data in smaller block sizes and with far more randomness.
These access patterns can seriously degrade the local cache SSD's performance and leave costly GPUs idle. Relieving this performance constraint will require new SSD capabilities, and Micron is actively developing solutions with industry experts. At SC23 in Denver, we will showcase some of this work and demonstrate how the GPU and SSD can work together to accelerate some I/O-intensive processing by up to 100x.
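To make the contrast concrete, the rough microbenchmark below (a sketch for a Linux host; the shard file name is hypothetical, and a rigorous measurement would bypass the page cache, for example with O_DIRECT) compares the large sequential reads typical of LLM prefetch against the small random reads closer to GNN and DLRM training.

```python
import os
import random
import time

def throughput_mb_s(path: str, block_size: int, sequential: bool, total_bytes: int) -> float:
    """Read `total_bytes` from `path` in `block_size` chunks and return MB/s."""
    size = os.path.getsize(path)
    reads = total_bytes // block_size
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        start = time.perf_counter()
        for _ in range(reads):
            if not sequential:
                offset = random.randrange(0, size - block_size)  # GNN/DLRM-style random access
            os.pread(fd, block_size, offset)
            offset = (offset + block_size) % (size - block_size)  # advance for sequential mode
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return reads * block_size / elapsed / 1e6

if __name__ == "__main__":
    shard = "training_shard.bin"   # hypothetical file on the local cache SSD
    budget = 1 << 30               # read 1 GiB in each mode
    print(f"sequential 1 MiB reads: {throughput_mb_s(shard, 1 << 20, True, budget):8.1f} MB/s")
    print(f"random     4 KiB reads: {throughput_mb_s(shard, 4 << 10, False, budget):8.1f} MB/s")
```

On most drives the random 4 KiB case delivers a small fraction of the sequential throughput, which is exactly the gap that can leave GPUs waiting on I/O.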
AI data lakes
High-capacity SSDs will become the preferred storage medium for large data lakes. Higher-capacity HDDs cost less per terabyte ($/TB), but their bandwidth per terabyte (MB/s per TB) falls as capacity grows. Beyond 20TB per drive, HDDs will be seriously challenged to power-efficiently source the kind of aggregate bandwidth (TB/s) that large AI/ML GPU clusters require from massive data lakes.
In contrast, SSDs offer high performance and, when designed with specific uses in mind, can provide the necessary capacities at far lower power (8x lower watts/TB) and even lower electrical energy (10x lower kWh/TB) than HDDs. The power saved frees up data center budget so more GPUs can be added. Micron is currently integrating its 32TB high-capacity data center SSD into many object stores and AI data lakes, and capacities will grow to 250TB for 15-watt SSDs that can each deliver several GB/s of bandwidth.
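As a rough illustration of why bandwidth per terabyte matters at scale, the sketch below estimates how many drives, and roughly how much power, it would take to source 1 TB/s of aggregate read bandwidth for a GPU cluster. The per-drive HDD figures and the ~5 GB/s SSD throughput are assumptions for illustration only; the 32TB capacity and 15-watt SSD power are the figures cited above.

```python
# Back-of-envelope sketch: drives and power needed to source 1 TB/s of aggregate
# read bandwidth for a data lake feeding a GPU cluster. Per-drive HDD figures and
# the ~5 GB/s SSD throughput are illustrative assumptions; the 32TB capacity and
# 15 W SSD power come from the figures above.

TARGET_MBPS = 1_000_000  # 1 TB/s aggregate bandwidth target

drives = {
    # name: (capacity in TB, sustained MB/s per drive, active watts per drive)
    "20TB nearline HDD (assumed)": (20, 280, 10),
    "32TB data center SSD": (32, 5000, 15),
}

for name, (cap_tb, mbps, watts) in drives.items():
    count = TARGET_MBPS / mbps
    print(f"{name}: {mbps / cap_tb:6.1f} MB/s per TB, "
          f"~{count:5.0f} drives, ~{count * watts / 1000:4.1f} kW for 1 TB/s")
```

Under these assumed numbers, meeting the same bandwidth target takes thousands of HDDs but only a couple hundred SSDs, which is where the power and rack-space savings come from.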
How might AI impact the market for NAND flash storage?
First, every new AI/ML model needs data to “learn” from during training. According to IDC estimates, the amount of data created each year began to exceed the amount of storage purchased each year around 2005. This implies that some data must be transient. Only the user can establish the monetary value of the data, and thus the point at which the cost of buying more storage to preserve it outweighs its worth.
Modern machines, such as cameras, sensors, IoT devices, jet-engine diagnostics, packet-routing telemetry, swipes, and clicks, generate orders of magnitude more data per day than humans can handle. AI/ML algorithms can now extract important, valuable information from machine-generated data that people were previously unable, or did not have the time, to evaluate. As AI and ML become more prevalent, this data becomes more valuable to keep, and the need for storage should grow accordingly.
This training data lives in AI data lakes. These data lakes exhibit higher-than-normal access density: they must handle a heavy mix of ingest and preprocessing while concurrently feeding an ever-growing number of GPUs per cluster.
Additionally, the data is frequently retrained on, so there is often little “cold” data. Large-capacity, power-efficient SSDs are a far better fit for that workload profile than conventional HDD-based object storage. These data lakes can reach hundreds of petabytes for applications such as DLRM and computer vision for autonomous driving, and both their capacity and their number will keep growing, creating significant growth potential for NAND flash SSDs.
As AI models evolve and grow, NAND flash storage will be increasingly necessary to sustain their exponential performance development.