FMS ’23: Performance SSDs support for ATS/ATC
CPU computational power has increased dramatically over time, but the demand for memory has grown even faster. With the explosive growth in CPU and GPU compute capability, many workloads, such as AI training, are now limited by the amount of accessible system memory.
Although SSD-based swap space, or virtual memory, can prevent system crashes caused by memory shortages, it is not a suitable way to increase memory capacity for high-performance tasks. Compute Express Link, or CXL, is one technological solution to this problem.
It connects and shares memory pools among multiple compute nodes, scaling memory capacity by orders of magnitude beyond local DRAM. It also calls for fine-grained memory tiering by speed and locality, ranging from CPU caches and GPU local memory (high-bandwidth memory, or HBM) down to much slower remote memory with more arcane coherence structures.
This increase in memory capacity typically involves only DRAM + CXL; storage is essentially unchanged, so NVMe SSD operation shouldn’t be affected, right? Not quite.
For high-performance workloads, SSDs should be aware of tiered memory and can be optimized to improve latency, performance, or both. One SSD optimization for this use case needs the support of Address Translation Services (ATS) and the associated Address Translation Cache (ATC), which we’ll cover in this article.
ATS/ATC: What is it?
ATS/ATC is most relevant to virtualized systems. It can also be used in non-virtualized systems, but the simplest way to describe how ATS/ATC works is to follow the data path of a virtual machine (VM) in a direct-assignment virtualized system that uses industry-standard techniques such as SR-IOV.
Below is a reference diagram:
Memory addressing is one effect of this method. Because the guest operating system is intended to run on a dedicated system, it perceives that it is using system memory addresses (SVA = system virtual address; SPA = system physical address). However, in actuality, it is operating in virtual machine (VM) space under the supervision of the hypervisor, which provides guest addresses that are specific to the VM but completely distinct from the system address mapping.
The guest operating system (guest OS) therefore uses a guest virtual address (GVA) and a guest physical address (GPA) instead of an SVA and an SPA, and care is needed when translating from the local (guest) to the global (system) addressing scheme. The CPU has two translation mechanisms. The memory management unit (MMU) translates memory accesses made directly by programs and is not relevant to SSDs. The IOMMU translates all DMA transfers, which makes it the crucial one here.
Each time the guest OS in the VM wants to read data from the SSD, it must supply a DMA address, as shown in the diagram. It provides what it believes to be an SPA, but which is really a GPA. What is sent to the SSD as the DMA address is therefore just a GPA. The SSD sends the requested data as transaction layer packets (TLPs), which are PCIe packets typically up to 512 bytes in size, along with the GPA it knows about.
In this case, the IOMMU recognizes the incoming address as a GPA, looks up the corresponding SPA in its translation table, and replaces the GPA with the SPA so that the correct memory location is used.
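The lookup described above can be illustrated with a minimal sketch. It assumes a flat, single-level table and 4KB pages; the table contents and the `translate` helper are hypothetical, and a real IOMMU walks multi-level page tables in hardware.

```python
PAGE_SIZE = 4 * 1024  # 4KB OS page

# Hypothetical per-VM table: guest page number -> system page number.
iommu_table = {0x10: 0x9A3, 0x11: 0x17F}


def translate(gpa: int, table: dict) -> int:
    """Swap the guest page number for the system page number; keep the offset."""
    page, offset = divmod(gpa, PAGE_SIZE)
    if page not in table:
        raise LookupError(f"no mapping for GPA {gpa:#x}")
    return table[page] * PAGE_SIZE + offset


# The SSD only ever sees the GPA; the IOMMU substitutes the SPA on the way in.
print(hex(translate(0x102A0, iommu_table)))
```

Every TLP that reaches the IOMMU with an untranslated address triggers a lookup of this kind, which is why the number of TLPs matters in the next section.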
For NVMe, why is ATS/ATC important?
An excessive number of SSDs or CXL devices in a system may result in an address translation storm that clogs the IOMMU and creates a bottleneck in the system.
A modern SSD, for instance, can deliver up to 6 million 4KB IOPS. With 512B TLPs, each IO is split into 8 TLPs, so there are 48 million TLPs to translate every second. Because one IOMMU is instantiated for every four devices, it must translate 192 million TLPs per second. That is a large number, and it could be worse: TLPs are “up to 512B,” so they can be smaller, and smaller TLPs mean proportionally more translations. We need a way to reduce the number of translations.
That is what ATS/ATC is all about: a mechanism to request translations ahead of time and reuse them for as long as they remain valid. Since an OS page is 4KB, each translation serves 8 TLPs, which already reduces the number of translations. But pages can be contiguous, and on most virtualized systems the next valid granularity is 2MB, so each translation can serve 8*2M/4K = 4096 consecutive TLPs (or more if the TLPs are smaller than 512B).
As a result, the IOMMU no longer has to provide anywhere near the roughly 200 million translations per second it did before, which greatly reduces the risk of clogging.
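The arithmetic in this section can be checked with a short sketch. The constants come from the text; the function names are illustrative. The "best case" figure assumes every run of TLPs covers a full, contiguous page, which real traffic will not always do.

```python
IOPS_PER_SSD = 6_000_000   # 4KB IOs per second, from the text
IO_BYTES = 4 * 1024
SSDS_PER_IOMMU = 4         # one IOMMU per four devices, from the text


def tlps_per_second(tlp_bytes=512):
    """TLPs reaching one IOMMU per second; without ATS each needs a translation."""
    return IOPS_PER_SSD * (IO_BYTES // tlp_bytes) * SSDS_PER_IOMMU


def tlps_per_translation(page_bytes, tlp_bytes=512):
    """Consecutive TLPs that a single cached translation can serve."""
    return page_bytes // tlp_bytes


baseline = tlps_per_second()
print(f"without ATS: {baseline:,} translations/s")
for label, page_bytes in (("4KB", 4 * 1024), ("2MB", 2 * 1024 * 1024)):
    served = tlps_per_translation(page_bytes)
    print(f"{label} page: {served} TLPs per translation, "
          f"{baseline // served:,} translations/s in the best case")
```

Passing a smaller `tlp_bytes` shows the effect mentioned in the text: halving the TLP size doubles the untranslated load.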
Building ATS/ATC Models for NVMe
In NVMe, the addresses of the submission and completion queues (SQ and CQ) are each used once for every command. Shouldn’t such (static) translations be kept around? Absolutely. And that is exactly what the ATC does: it keeps a cached copy of the most frequently used translations.
The key question at this point is: what DMA address patterns will the SSD receive, so that ATS/ATC can be designed around them? Unfortunately, there is no published data or literature on this.
To address this, we built a tool that records every address the SSD receives and stores it for later analysis. To be meaningful, the data must come from a reasonable representation of real-world applications and contain enough data points to form a valid dataset for our cache implementation.
We selected common workload benchmarks for a range of data center applications, ran them, and recorded IO traces in 20-minute segments. Each trace yielded hundreds of millions of data points to feed the modeling effort.
Data ATC assessment for storage
Procedure:
- Assume a virtual machine running typical workloads.
- Trace the unique buffer addresses for each workload and map each one down to its STU (2 MB) page.
- Create an ATC model in Python.
- To verify hit rate, replay traces to the model.
- Repeat for a fair number of workloads and configurations.
We created a Python model of the cache, replayed the entire trace (with hundreds of millions of entries), and examined cache behavior to determine how much cache is needed. This let us model changes to STU size, eviction policy, cache size (number of lines), and any other parameter relevant to the model.
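A cache model of this kind can be sketched in a few dozen lines. The following is a minimal illustration, not the model used for the results in this article: the class name, parameters, and the tiny synthetic trace are all assumptions, and round robin is modeled as first-in, first-out replacement within each set.

```python
import random
from collections import OrderedDict

MB = 1024 * 1024


class ATCModel:
    """Toy set-associative translation cache keyed by STU page number."""

    def __init__(self, lines=64, ways=4, stu_bytes=2 * MB, policy="lru", seed=0):
        if policy not in ("lru", "rr", "rand"):
            raise ValueError(f"unknown policy: {policy}")
        if lines % ways:
            raise ValueError("lines must be a multiple of ways")
        self.ways = ways
        self.num_sets = lines // ways  # ways == lines gives full associativity
        self.stu_bytes = stu_bytes
        self.policy = policy
        # One ordered dict per set; key order is recency (LRU) or fill order (RR).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.rng = random.Random(seed)
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Look up one DMA address; return True on a hit."""
        page = address // self.stu_bytes
        entries = self.sets[page % self.num_sets]
        self.accesses += 1
        if page in entries:
            self.hits += 1
            if self.policy == "lru":
                entries.move_to_end(page)  # mark as most recently used
            return True
        if len(entries) >= self.ways:
            if self.policy == "rand":
                del entries[self.rng.choice(list(entries))]
            else:
                entries.popitem(last=False)  # oldest entry in this set
        entries[page] = True
        return False

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0


def replay(trace, **cache_params):
    """Replay an iterable of addresses through a fresh model; return the hit rate."""
    model = ATCModel(**cache_params)
    for address in trace:
        model.access(address)
    return model.hit_rate()


# Synthetic four-entry trace, for illustration only.
trace = [0, 4096, 2 * MB, 0]
print(replay(trace, lines=4, ways=4))                  # STU = 2MB
print(replay(trace, lines=4, ways=4, stu_bytes=4096))  # STU = 4KB
```

Because `replay` accepts any iterable, a recorded trace can be streamed from disk rather than loaded into memory, which matters at hundreds of millions of entries.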
After analyzing between 150 million and 370 million data points for each benchmark, we found that the number of unique addresses used was typically in the tens of thousands, which is an excellent result for cache sizing. If we further map them onto the most commonly used 2MB page (the smallest translation unit, or STU), the number of pages drops to a few hundred or the low thousands.
This indicates a very high rate of buffer reuse: even if system memory is in the TB range, the data buffers used for IOs are in the GB range and the number of addresses used is in the thousands, which makes this an excellent candidate for caching.
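The counting behind these figures can be sketched as follows. The `summarize` helper and its output keys are hypothetical, and the trace is a tiny synthetic example rather than recorded data.

```python
STU_BYTES = 2 * 1024 * 1024  # 2MB smallest translation unit


def summarize(trace, stu_bytes=STU_BYTES):
    """Count unique DMA addresses and the unique STU pages they fall into."""
    unique_addresses = set(trace)
    unique_pages = {address // stu_bytes for address in unique_addresses}
    return {
        "accesses": len(trace),
        "unique_addresses": len(unique_addresses),
        "unique_stu_pages": len(unique_pages),
        # Upper bound on the memory actually touched by IO buffers.
        "stu_footprint_bytes": len(unique_pages) * stu_bytes,
    }


# Five accesses, three distinct addresses, two distinct 2MB pages.
trace = [0, 4096, 0, STU_BYTES + 512, 4096]
print(summarize(trace))
```

The ratio of accesses to unique STU pages gives a quick indication of how much reuse a cache could exploit before any cache model is run.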
Concerned that such high address reuse might be caused by locality in our particular system setup, we ran additional tests against several other data application benchmarks.
We compared one of the YCSB/RocksDB/XFS tests mentioned above with a radically different benchmark: TPC-H on Microsoft SQL Server using XFS.
Correlation with TPC-H:
- The IO distribution is significantly different.
- There are 3.2 times as many unique addresses, but they are spread across 70% of the STUs, which means higher locality.
- The cache hit rate converges with that of RocksDB at around 64 entries.
Although the data traces are very different, they converge to the same hit rate once the cache is large enough, say, more than a mere 64 lines. Similar results were confirmed with several other benchmarks, which are omitted here for brevity.
1. Size dependency:
- Benchmark used: YCSB WL B with Cassandra
- Cache: 4-way set associative
- Algorithm: Round Robin
2. Observations:
- As expected, hit rate depends strongly on STU size.
- The larger the STU size, the higher the hit rate.
- Not all data are created equal: every NVMe command accesses the SQ and CQ, so those addresses have a significant impact on hit rate.
Modelling Data Pinning and Eviction Algorithms
We can also model the impact of different algorithms for special data pinning (Submission Queue and Completion Queue) and data replacement.
The first set of graphs verifies how the cache depends on cache size (number of lines), STU size, and whether pinning the SQ and CQ in the ATC makes any difference. For relatively small caches the answer is an obvious “yes”: the two sets of curves are very similar in shape, but the ones with SQ/CQ pinning start from a much higher hit rate at small cache sizes.
For example, at STU = 2MB and only 1 cache line (very unlikely in practice, but it helps make the point), the hit rate without any SQ/CQ pinning is somewhere below 10%, while with SQ/CQ pinning it is close to 70%. As such, this is a good practice to follow.
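The effect of pinning can be illustrated with a small sketch. It assumes a fully associative LRU cache, pinned translations that always hit and live outside the normal cache lines, and a synthetic trace in which every command touches the SQ page, one new data page, and the CQ page. None of this reproduces the measured curves; it only shows why pinning helps most when the cache is tiny.

```python
from collections import OrderedDict


def hit_rate(pages, lines, pinned=frozenset()):
    """Hit rate of an LRU cache of `lines` entries over a sequence of STU pages."""
    cache = OrderedDict()
    hits = 0
    for page in pages:
        if page in pinned:
            hits += 1                    # pinned translation is always present
        elif page in cache:
            hits += 1
            cache.move_to_end(page)      # mark as most recently used
        else:
            if len(cache) >= lines:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = True
    return hits / len(pages)


SQ_PAGE, CQ_PAGE = 1000, 1001  # hypothetical STU pages holding the queues
# Worst case for data reuse: each command uses a data page never seen before.
trace = [page for i in range(100) for page in (SQ_PAGE, i, CQ_PAGE)]

print(hit_rate(trace, lines=1))                             # no pinning
print(hit_rate(trace, lines=1, pinned={SQ_PAGE, CQ_PAGE}))  # SQ/CQ pinned
```

With a single line and no pinning, the queue and data pages keep evicting one another; pinning removes that interference, and the benefit shrinks as the cache grows large enough to hold the queue pages anyway.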
As for cache sensitivity to the choice of eviction algorithm, we tested Least Recently Used (LRU), Round Robin (RR), and plain Random (Rand). The impact is negligible, so the simplest and most efficient algorithm should be chosen.
Algorithm dependency:
- Benchmark used: YCSB WL B with Cassandra (part of a much larger set)
- Associativity: full and 4-way
- Eviction algorithms: LRU, Rand, and Round Robin
- Outcome: replacement algorithms do not make any visible difference, so selecting the simplest implementation may be the most effective approach.