Wednesday, July 10, 2024

Isima’s Z3 VM Success: Get 10X Throughput, Half the Cost

Isima’s experiment with Z3 virtual machines for e-commerce workloads yielded 2X better price-performance and 10X higher throughput.

Storage-intensive workloads, such as log analytics and horizontal, scale-out databases, need high SSD density and consistent performance. They also need maintenance that keeps data protected if an interruption occurs. Google Cloud officially launched the Z3 virtual machine series, its first storage-optimized VM family, at Google Cloud Next ’24. Z3 offers extremely dense storage configurations of up to 409 GiB of SSD per vCPU on next-generation local SSD hardware, with an industry-leading 6M 100% random-read IOPS and 6M write IOPS.

One of the first companies to test it was Isima, a Silicon Valley company that runs an e-commerce analytics cloud. Its bi(OS) platform offers serverless infrastructure for AI applications and real-time retail and e-commerce data. It provides a scale-out, SQL-friendly database and zero-code capabilities to onboard, process, and operationalize data for real-time data integration, feature stores, data science, cataloguing, observability, DataOps, and business intelligence.

In this blog post, Google Cloud summarises Isima’s experiments and findings and compares Z3 to general-purpose N2 VMs. Spoiler alert: Google Cloud reports 2X better price-performance, 10X higher throughput, and much more.

The examination

As a Google Cloud partner, Isima was granted early access to Z3. It tested Z3 on a range of demanding, real-world e-commerce workloads, including microservice calls, ad hoc analytics, visualisation queries, and more, all firing simultaneously.

Image credit: Google Cloud

To get the most out of Z3 and emulate real-world high-availability deployments, Isima split each of the three z3-highmem-88 instances into five Docker containers running bi(OS), with the deployment spread across several zones. Each Docker container was allotted 16 vCPUs, 128GB RAM, and two 3TB SSDs. This configuration let Isima compare Z3 more directly with earlier tests run on n2-highmem-16 instances.
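The allocation described above can be checked with a little arithmetic. The following is a minimal sketch, not Isima's tooling: the names are hypothetical, and the 88-vCPU figure is an assumption read off the machine-type name z3-highmem-88 rather than a number stated in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSpec:
    """Resources allotted to one Docker container, as stated in the article."""
    vcpus: int = 16
    ram_gb: int = 128
    ssd_count: int = 2
    ssd_tb: int = 3


INSTANCES = 3                 # z3-highmem-88 instances used in the test
CONTAINERS_PER_INSTANCE = 5   # Docker containers per instance
INSTANCE_VCPUS = 88           # assumption: taken from the machine-type name


def per_instance_totals(spec: ContainerSpec, containers: int) -> dict:
    """Sum the resources that the containers claim on a single instance."""
    return {
        "vcpus": spec.vcpus * containers,
        "ram_gb": spec.ram_gb * containers,
        "ssd_tb": spec.ssd_count * spec.ssd_tb * containers,
    }


spec = ContainerSpec()
totals = per_instance_totals(spec, CONTAINERS_PER_INSTANCE)
total_containers = INSTANCES * CONTAINERS_PER_INSTANCE
spare_vcpus = INSTANCE_VCPUS - totals["vcpus"]

print(f"containers in total: {total_containers}")
print(f"per instance: {totals}")
print(f"vCPUs left unallocated per instance: {spare_vcpus}")
```

Under these assumptions each container matches the 16 vCPUs of an n2-highmem-16, which is what makes the per-container comparison with last year's tests reasonable.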

Isima evaluated the following to simulate extreme stress and several worst-case scenarios:

Demand spikes

They hit the system with a brief peak load that saturated system resources to 70%, then sustained 75% of that peak for the full 72-hour test, while demanding (and attaining) 99.999% reliability.
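To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It rests on two assumptions that the article does not state: that resource saturation scales linearly with load, and that 99.999% reliability can be read as a time-based availability target over the 72-hour run.

```python
PEAK_SATURATION = 0.70        # peak load saturated system resources to 70%
SUSTAINED_FRACTION = 0.75     # sustained load as a fraction of that peak
TEST_HOURS = 72               # duration of the sustained run
RELIABILITY_TARGET = 0.99999  # "five nines"

# Assumption: saturation scales linearly with load.
sustained_saturation = PEAK_SATURATION * SUSTAINED_FRACTION

# Assumption: reliability is interpreted as availability over the test window.
test_seconds = TEST_HOURS * 3600
error_budget_seconds = test_seconds * (1 - RELIABILITY_TARGET)

print(f"sustained saturation: {sustained_saturation:.1%}")
print(f"allowed disruption over {TEST_HOURS} h: {error_budget_seconds:.3f} s")
```

Read this way, the sustained phase runs at roughly 52.5% saturation, and five nines leaves a budget of about 2.6 seconds of disruption across the whole run.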

Select queries

To rule out inadvertent caching effects from the operating system or bi(OS), they tested select queries against older data. To make sure reads came from the Local SSD and not from RAM, which is vital when evaluating Z3, Isima deliberately went back in time when reading data, for example by querying data that had been inserted 30 minutes earlier. This gave them confidence that the performance results would hold up under everyday workloads.
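The idea of reading "back in time" can be sketched as choosing a query window that ends a fixed lag before the present. This is only an illustration: `lagged_window` is a hypothetical helper, the one-minute window width is an assumption, and the sketch does not call any bi(OS) API.

```python
from datetime import datetime, timedelta, timezone


def lagged_window(now, lag=timedelta(minutes=30), width=timedelta(minutes=1)):
    """Return (start, end) of a read window that ends `lag` before `now`.

    Querying rows written in this window targets data old enough that it is
    less likely to still sit in an in-memory cache.
    """
    end = now - lag
    start = end - width
    return start, end


# Fixed timestamp so the example is reproducible.
now = datetime(2024, 7, 10, 12, 0, tzinfo=timezone.utc)
start, end = lagged_window(now)
print(f"select rows inserted between {start.isoformat()} and {end.isoformat()}")
```

Whether a 30-minute lag is enough to bypass caches depends on memory size and write volume, so the lag is a parameter here and not a fixed rule.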

Various deployment scenarios

Testing both single-tenant and multi-tenant setups validated Z3’s capacity to handle a range of real-world deployments.

Simulated maintenance events

Isima even factored in scheduled maintenance, simulated with Docker restarts, to demonstrate Z3’s resilience to disruptions.

The verdict

Throughput: bi(OS) on Z3 handled roughly 2X or more throughput than in last year’s tests on n2-highmem-16, with 2X better price-performance.

DB request | n2-highmem-16 (~ops/sec) | One 16-vCPU Docker container on z3-highmem-88 (~ops/sec) | Improvement | Use cases
inserts (single row/query) | 3342 | 6010 | 1.8X | Onboarding of click-stream data
selects (single row/query) | 300 | 613 | 2X | Feature store reads for personalization, ATP, etc.
upserts (single row/query) | 612 | 1500 | 2.45X | Updating ML scores
selects (multiple rows/query) | 1600 | 4110 | 2.56X | Bulk reads (as part of ETL)
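The improvement column can be recomputed from the two ops/sec columns. The sketch below takes the figures from the table above (reading the single-row selects row as 300 and 613 ops/sec) and divides the Z3 value by the N2 value; the variable and function names are illustrative only.

```python
# (n2-highmem-16 ops/sec, one 16-vCPU container on z3-highmem-88 ops/sec)
results = {
    "inserts (single row/query)": (3342, 6010),
    "selects (single row/query)": (300, 613),
    "upserts (single row/query)": (612, 1500),
    "selects (multiple rows/query)": (1600, 4110),
}


def improvement(baseline_ops: float, new_ops: float) -> float:
    """Throughput ratio of the new setup relative to the baseline."""
    return new_ops / baseline_ops


ratios = {name: improvement(n2, z3) for name, (n2, z3) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}X")
```

The recomputed ratios agree with the published column to within rounding; the multi-row selects figure comes out at about 2.57X, so the table's 2.56X appears to be truncated, not rounded.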

NVMe disk latencies: Read latencies were similar, but write latencies improved almost sixfold.

NVMe drive latencies
Image credit: Google Cloud

Drive variation: Across every drive on every z3-highmem-88 virtual machine, read and write latencies varied by only +/- 0.02 milliseconds over the 72-hour run.

Drive variation
Image credit: Google Cloud

Google Cloud is excited by these results for the new Z3 instances and expects them to unlock the potential of many more workloads.

Improved maintenance experience

Z3 virtual machines come with a number of new infrastructure lifecycle technologies that offer more precise and stringent control over maintenance. Z3 VMs receive a notification several days ahead of a scheduled maintenance event. You can then trigger the maintenance at a time of your choosing or let it occur automatically at the scheduled time. This enables Google Cloud to provide more secure and performant infrastructure while letting you plan more accurately ahead of a disruptive event. You also get in-place upgrades that preserve your data through scheduled maintenance events.

Driven by Titanium

Z3 virtual machines are built on Titanium, Google’s system of custom silicon, security microcontrollers, and tiered scale-out offloads. The end result is better performance, lifecycle management, reliability, and security for your workloads. With Titanium, Z3 can offer up to 200 Gbps of fully secured networking, three times faster packet processing than previous-generation VMs, near-bare-metal performance, integrated maintenance updates for most workloads, and advanced controls for more sensitive applications.

“Building on our successful collaboration with Google Cloud since 2016, we are pleased to work together on Google Cloud’s first storage-optimized virtual machine family. This partnership delivers Intel’s 4th generation Intel Xeon CPU and Google’s custom Intel IPU, opening up new performance and efficiency possibilities.” – Suzi Jewett, General Manager of Intel Xeon Products, Intel Corporation

Hyperdisk storage

Hyperdisk is Google Cloud’s next-generation block storage. Because it is built on Titanium, Hyperdisk decouples storage processing from the virtual machine host, which gives it much better performance, flexibility, and efficiency. With Hyperdisk, you can scale storage performance and capacity independently and dynamically to meet the storage I/O requirements of data-intensive workloads such as databases and data analytics. You no longer need to choose large, expensive compute instances just to get better storage performance.

Thota Nithya
Thota Nithya has been writing cloud computing articles for govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.

