Saturday, July 27, 2024

Microsoft's Open Phi-3 Mini Language Model Accelerated by NVIDIA

NVIDIA revealed that NVIDIA TensorRT-LLM, an open-source framework for optimising large language model inference while running on NVIDIA GPUs from PC to cloud, has accelerated Microsoft’s new Phi-3 Mini open language model.

Phi-3 Mini takes Phi-2 beyond its research-only origins, bringing the power of models 10x larger to the masses, and it is licensed for both commercial and research use. Workstations equipped with NVIDIA RTX GPUs or PCs with GeForce RTX GPUs have the performance needed to run the model locally through TensorRT-LLM or Windows DirectML.

The 3.8-billion-parameter model was trained on 3.3 trillion tokens in just seven days using 512 NVIDIA H100 Tensor Core GPUs.

Phi-3 Mini comes in two variants: one supporting a 4K-token context window and another supporting up to 128K tokens, the first model in its class for extremely long contexts. This lets developers query the model with up to 128,000 tokens, the atomic units of language that the model processes, and get more pertinent answers.

At ai.nvidia.com, developers can test Phi-3 Mini with the 128K context window. It is packaged as an NVIDIA NIM, a microservice with a standard API that can be deployed anywhere.
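As a minimal sketch of what calling that NIM endpoint can look like from Python (the base URL follows NVIDIA's OpenAI-compatible API catalog convention, and the model identifier shown here is an assumption; check ai.nvidia.com for the exact values):

```python
from openai import OpenAI

# The API catalog exposes an OpenAI-compatible endpoint; generate a key at ai.nvidia.com.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

completion = client.chat.completions.create(
    model="microsoft/phi-3-mini-128k-instruct",  # assumed catalog model id
    messages=[{"role": "user", "content": "Summarise what an NVIDIA NIM microservice is."}],
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].message.content)
```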

Developing Efficiently for the Edge

Through community-driven tutorials, such as those on Jetson AI Lab, developers working on autonomous robots and embedded devices can learn how to build and deploy generative AI. Phi-3 can be deployed on NVIDIA Jetson.

The Phi-3 Mini variant has 3.8 billion parameters, which makes it compact enough to run well on edge devices. Parameters are like knobs in memory that are fine-tuned during training so the model can respond to input prompts with high accuracy.

Phi-3 can help in use cases where resources and costs are constrained, particularly with simpler tasks. On key language benchmarks, the model can outperform some larger models while still meeting latency requirements.

To improve inference speed and reduce latency, TensorRT-LLM employs a variety of optimisations and kernels, including LongRoPE, FP8, and in-flight batching, and it will also support the extended context window of Phi-3 Mini. The TensorRT-LLM implementations will soon be available on GitHub in the examples folder. Developers can then convert to the TensorRT-LLM checkpoint format, which is optimised for inference and readily deployable with NVIDIA Triton Inference Server.
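The examples folder mentioned above is the documented path; purely as a hedged sketch, running Phi-3 Mini through TensorRT-LLM's high-level Python LLM API can look like this (the model identifier is assumed, and the attribute names follow the vLLM-style API in recent TensorRT-LLM releases, so they may differ from the published examples):

```python
from tensorrt_llm import LLM, SamplingParams

# Building the engine (checkpoint conversion and optimisation) happens
# under the hood when the LLM object is constructed from a Hugging Face id.
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct")  # assumed model id

sampling = SamplingParams(max_tokens=128, temperature=0.2)
outputs = llm.generate(["Explain in-flight batching in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```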

Creating Open Systems

Having released more than 500 projects under open-source licences, NVIDIA is a prominent participant in the open-source ecosystem.

NVIDIA supports a wide range of open-source foundations and standards bodies in addition to contributing to numerous external projects like JAX, Kubernetes, OpenUSD, PyTorch, and the Linux kernel.

The announcement today builds on long-standing NVIDIA partnerships with Microsoft, which have facilitated advancements in DirectML, Azure cloud, generative AI research, healthcare, and life sciences.

What is ONNX Runtime?

ONNX Runtime is a cross-platform engine for running and accelerating machine learning models. Its key capabilities are:

Accelerated inference: hardware-specific execution providers speed up model inference and prediction.

Cross-platform: it works on Windows, Linux, macOS, mobile devices, and web browsers.

Framework interoperability: it supports models exported from PyTorch, TensorFlow, scikit-learn, and others, so the original training framework is not needed at inference time.

ONNX Runtime simplifies machine learning model deployment and performance optimisation in various contexts.
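For illustration, a minimal ONNX Runtime inference call looks like this (the model path and input shape are placeholders for any model exported to ONNX):

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" stands in for any model exported from PyTorch, TensorFlow,
# scikit-learn, etc.; the CPU provider keeps the example hardware-agnostic.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
dummy_input = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder image-shaped input
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```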

ONNX Runtime Tutorial

Thanks to ONNX Runtime and DirectML, Microsoft's newest in-house Phi-3 models can now run on a wide variety of hardware and operating systems. The ONNX Runtime team is pleased to announce that both the phi3-mini-4k-instruct and phi3-mini-128k-instruct flavours of Phi-3 are supported from day one. The phi3-mini-4k-instruct-onnx and phi3-mini-128k-instruct-onnx repositories host the optimised ONNX models.
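For reference, one way to pull those optimised models locally is via huggingface_hub (the repository id matches the links above; the subfolder chosen here is an assumption, since the repositories ship several hardware-specific variants):

```python
from huggingface_hub import snapshot_download

# Download only the CPU/mobile INT4 variant of the 4K-context model;
# other subfolders in the repository target CUDA and DirectML.
local_dir = snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32/*"],
)
print(local_dir)
```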

While most language models are too big to operate locally on most systems, Phi-3 is a notable exception to this rule, since this compact yet powerful suite of models performs on par with models ten times larger! The Phi-3 Mini variant is unique among its weight class in that it can accommodate lengthy contexts with up to 128K tokens.

Scaling Phi-3 Mini on Windows with DirectML and ONNX Runtime

Phi-3 is small enough to run on a wide range of Windows machines as is, so why stop there? Making Phi-3 even smaller through quantization would greatly increase the model's reach on Windows, but not all quantization methods are created equal. The team's goal was to ensure both scalability and model accuracy.

Phi-3 Mini can be quantized using Activation-Aware Quantization (AWQ), which captures quantization's memory savings with negligible accuracy loss. To do this, AWQ finds the top 1% of salient weights that are essential to preserving model accuracy and quantizes the remaining 99% of weights. Quantizing with AWQ therefore loses significantly less accuracy than many other methods.
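A toy numpy sketch of the idea just described (rank weight columns by the activations they see, keep the ~1% most salient ones in full precision, and round the rest to a 4-bit grid); this only illustrates the concept and is not the production AWQ algorithm or kernel:

```python
import numpy as np

def toy_salient_int4(weights, activations, salient_frac=0.01):
    """Concept demo: keep the most salient weight columns in full precision
    and quantize the remaining ~99% to a symmetric 4-bit grid."""
    # weights: (out_features, in_features); activations: (batch, in_features)
    importance = np.abs(activations).mean(axis=0)        # score per input channel
    n_salient = max(1, int(salient_frac * weights.shape[1]))
    salient_cols = set(np.argsort(importance)[-n_salient:].tolist())

    dequantized = np.empty_like(weights)
    for col in range(weights.shape[1]):
        column = weights[:, col]
        if col in salient_cols:
            dequantized[:, col] = column                  # salient: leave unquantized
            continue
        scale = max(float(np.abs(column).max()), 1e-8) / 7.0
        q = np.clip(np.round(column / scale), -8, 7)      # round-to-nearest onto the INT4 range
        dequantized[:, col] = q * scale                   # store dequantized values for the demo
    return dequantized

# The quantized matrix stays close to the original, especially on salient columns.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 1024)).astype(np.float32)
X = rng.normal(size=(32, 1024)).astype(np.float32)
print("mean abs error:", float(np.abs(W - toy_salient_int4(W, X)).mean()))
```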

DirectML runs on any AMD, Intel, or NVIDIA GPU that supports DirectX 12 on Windows. Because DirectML and ONNX Runtime now support INT4 AWQ, developers can run and distribute this quantized version of Phi-3 across hundreds of millions of Windows machines!
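As a small illustration, selecting the DirectML execution provider from Python looks like this (requires the onnxruntime-directml package; the model path is a placeholder, and in practice the Phi-3 flow usually goes through the Generate() API described below):

```python
import onnxruntime as ort

# Prefer DirectML (any DirectX 12-capable AMD, Intel, or NVIDIA GPU),
# falling back to the CPU if it is unavailable.
session = ort.InferenceSession(
    "phi3-mini-4k-instruct-awq-int4.onnx",  # placeholder path
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually enabled
```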

In the coming weeks, driver updates developed in collaboration with hardware manufacturer partners will further enhance performance. To find out more, attend their Build talk in late May!

ONNX Runtime Mobile

The ONNX Runtime is a genuinely cross-platform framework because it can run the Phi-3 Mini models on mobile and Mac CPUs in addition to supporting them on Windows. The ONNX Runtime facilitates the operation of these models on a wide range of hardware types by supporting quantization techniques such as RTN.

ONNX Runtime Mobile lets developers run AI models for on-device inference on mobile and edge devices. By eliminating client-server round trips, ORT Mobile provides privacy protection at no extra cost. Using RTN INT4 quantization, the team greatly reduced the size of the state-of-the-art Phi-3 Mini models and ran both variants at a reasonable speed on a Samsung Galaxy S21. RTN INT4 quantization exposes a tuning parameter for the int4 accuracy level.

This option balances the trade-off between accuracy and performance by defining the minimum accuracy level for the activations of the int4-quantized MatMul. Two versions of the RTN-quantized models have been made available: int4_accuracy_level=1, optimised for accuracy, and int4_accuracy_level=4, optimised for performance. If you prefer higher performance with a minor accuracy trade-off, the int4_accuracy_level=4 model is advised.
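As a hedged sketch, producing an RTN INT4 build with the performance-oriented accuracy level might look like this via the onnxruntime-genai model builder (flag names follow the builder's documented options; verify them against your installed version):

```python
import subprocess
import sys

# Invoke the onnxruntime-genai model builder to export an INT4 RTN-quantized
# Phi-3 Mini with the performance-optimised accuracy level discussed above.
subprocess.run(
    [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", "microsoft/Phi-3-mini-4k-instruct",    # source Hugging Face model
        "-o", "phi3-mini-4k-int4",                   # output folder for the ONNX model
        "-p", "int4",                                # RTN INT4 precision
        "-e", "cpu",                                 # execution provider target
        "--extra_options", "int4_accuracy_level=4",  # performance-oriented setting
    ],
    check=True,
)
```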

ONNX Runtime Server

For Linux developers and beyond, ONNX Runtime with CUDA is a great solution that supports a wide range of NVIDIA GPUs, including both consumer and data-centre GPUs. With ONNX Runtime and CUDA, Phi-3 Mini-128K-Instruct outperforms PyTorch across all batch size and prompt length combinations.

ONNX Runtime vs. PyTorch

Phi-3 Mini-128K-Instruct with ORT outperforms PyTorch by up to 5X and up to 9X for FP16 CUDA and INT4 CUDA, respectively. Llama.cpp does not yet support Phi-3 Mini-128K-Instruct.

Phi-3 Mini-4K-Instruct with ORT outperforms PyTorch by up to 5X and up to 10X on FP16 and INT4 CUDA, respectively. For large sequence lengths, Phi-3 Mini-4K-Instruct performs up to three times quicker than Llama.cpp.

With ONNX Runtime, there is an efficient way to run model inference on Windows, Linux, Android, and Mac!

Use ONNX Runtime Generate()

The ONNX Runtime team is excited to unveil its new Generate() API, which wraps several generative AI inferencing features to make Phi-3 models easier to run across devices, platforms, and execution provider (EP) backends. With the Generate() API, dropping LLMs into your app is almost as simple as drag and drop. Try the early version by running these models with ONNX Runtime, as in the sketch below.
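A minimal sketch, following the 2024-era onnxruntime-genai examples (the exact method names, the chat template, and the local model path are assumptions and may differ in newer releases):

```python
import onnxruntime_genai as og

# Load a quantized Phi-3 Mini ONNX model folder downloaded earlier
# (placeholder path; it must contain the model and its genai_config.json).
model = og.Model("phi3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

# Phi-3 chat template (assumed here); encode the prompt to token ids.
prompt = "<|user|>\nWhat does the Generate() API do?<|end|>\n<|assistant|>\n"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)

# Token-by-token generation loop, printing tokens as they are produced.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```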

DirectML Performance Metrics

ONNX Runtime + DirectML performance of Phi-3 Mini (4k sequence length) quantized with AWQ and 128 block size on Windows was measured. The test computer had an Intel Core i9-13900K CPU and NVIDIA GeForce RTX 4090 GPU. As shown in the table, DirectML has good token throughput with longer prompts and generation lengths.

DirectML is supported by AMD, Intel, and NVIDIA, letting developers deploy models across the Windows ecosystem with exceptional performance. Best of all, AWQ gives developers both scale and model accuracy.

Their hardware partners’ optimised drivers and ONNX Generate() API upgrades will boost performance in the coming weeks.

Prompt Length    Generation Length    Wall-clock tokens/s
16               256                  266.65
16               512                  251.63
16               1024                 238.87
16               2048                 217.5
32               256                  278.53
32               512                  259.73
32               1024                 241.72
32               2048                 219.3
64               256                  308.26
64               512                  272.47
64               1024                 245.67
64               2048                 220.55

ONNX Runtime FP16

The Phi-3 Mini 128K Instruct ONNX model shows improved average throughput (tokens per second) over the first 256 generated tokens, comparing CUDA FP16 and INT4 precisions on a single A100 80GB GPU (chart credit: ONNX Runtime).

PyTorch Compile and Llama.cpp do not currently support the Phi-3 Mini 128K Instruct model.

Try Phi-3 with ONNX Runtime

This blog post describes how ONNX Runtime and DirectML optimise the Phi-3 models. It provides instructions for running Phi-3 on Windows and other platforms, along with early benchmarking data. Stay tuned for ONNX Runtime 1.18 in early May for more changes and performance optimisations!
