Wednesday, July 3, 2024

Launch an LLM Chatbot and Boost Gen AI Inference with Intel AMX


Hi there, developers! We are back and ready to “turn the volume up” by using Intel Optimized Cloud Modules to demonstrate how our 4th Gen Intel Xeon Scalable CPUs can power GenAI inferencing.

Boost Your Generative AI Inferencing Speed

Did you know that our newest Intel Xeon Scalable CPU, the 4th Generation model, includes a built-in AI accelerator? That’s right: it enables high-throughput generative AI inference and training without the need for specialized GPUs, so you can use the same CPUs for both conventional workloads and AI, lowering your overall TCO.

For applications including natural language processing (NLP), image generation, recommendation systems, and image recognition, Intel Advanced Matrix Extensions (Intel AMX), a new built-in accelerator, delivers improved deep learning training and inference performance on the CPU. It targets the int8 and bfloat16 data types.
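Before deploying anything, it is worth a quick sanity check that your Linux VM actually exposes AMX; on 4th Gen Xeon, the kernel reports the amx_tile, amx_bf16, and amx_int8 CPU flags. A minimal check:

# List the AMX-related CPU flags; no output means AMX is unavailable
grep -o 'amx[a-z0-9_]*' /proc/cpuinfo | sort -u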

Setting Up the LLM Chatbot

In case you weren’t aware, the 4th Gen Intel Xeon CPU is now generally available on GCP (C3, H3 instances) and AWS (m7i, m7i-flex, c7i, and r7iz instances).
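If you ever want to provision one of these instances by hand rather than through the modules, a single gcloud command is enough. Here is a sketch for a C3 VM on GCP; the instance name, machine type, zone, and image are illustrative choices, not requirements:

# Create a 4th Gen Xeon (C3) VM; adjust name, size, zone, and image to taste
gcloud compute instances create amx-demo-vm \
  --machine-type=c3-standard-8 \
  --zone=us-central1-a \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud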

Rather than merely talking about it, let’s get ready to deploy your FastChat GenAI LLM chatbot on the 4th Gen Intel Xeon processor. Let’s go!


Intel Optimized Cloud Modules and Intel Optimized Cloud Recipes

Here are a few updates before we get into the code. At Intel, we invest a lot of effort in making our products simple for DevOps teams and developers to use. The Intel Optimized Cloud Modules were a step in that direction. Today I’d like to introduce the modules’ companions: the Intel Optimized Cloud Recipes, or OCRs.

Intel Optimized Cloud Recipes: What are they?

The Intel Optimized Cloud Recipes (OCRs) integrate with our cloud modules and use Red Hat Ansible and Microsoft PowerShell to optimize operating systems and software.
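To give a feel for what that means in practice, here is a rough sketch of running an OCR by hand on a Linux VM. The repository URL is real, but the recipe directory and playbook name below are placeholders; check the OCR repository’s README for the actual paths:

# Clone the OCR repository and run a recipe locally with Ansible
git clone https://github.com/intel/optimized-cloud-recipes.git
cd optimized-cloud-recipes/recipes/<recipe-name>   # placeholder path
sudo ansible-playbook -i "localhost," -c local recipe.yml   # placeholder playbook name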


Here’s How We Go About It

Enough reading; let’s turn our attention to using the FastChat OCR and the GCP Virtual Machine Module. Using the module and the OCR, you will deploy your own generative AI LLM chatbot on the 4th Gen Intel Xeon processor. Then we’ll demonstrate the power of the integrated Intel AMX accelerator for inferencing without a discrete GPU.

To provision VMs on GCP or AWS, you need a cloud account with the appropriate access and permissions.

Implementation: GCP Steps

The steps below are outlined in the module’s README.md; see it for more detail.

Usage

  1. Log in to the GCP Console
  2. Open the GCP Cloud Shell (click the terminal button at the top right of the Console page)
  3. Run the following commands in order:

git clone https://github.com/intel/terraform-intel-gcp-vm.git
cd terraform-intel-gcp-vm/examples/gcp-linux-fastchat-simple
terraform init
terraform apply

# Enter your GCP Project ID and "yes" to confirm
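A note for later: when you are finished with the demo, you can tear everything down from this same directory so you are not billed for an idle VM:

terraform destroy

# Enter your GCP Project ID and "yes" to confirm, just as with apply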

Running the Demo

  1. Wait approximately 10 minutes for the recipe to download and install FastChat and the LLM model before continuing.
  2. SSH into the newly created GCP VM
  3. Run: source /usr/local/bin/run_demo.sh
  4. On your local computer, open a browser and navigate to http://<VM_PUBLIC_IP>:7860.
    Get your public IP from the “Compute Engine” section of the VM in the GCP console (or from Cloud Shell, as shown after this list).
  5. Or use the https://xxxxxxx.gradio.live URL that is generated during the demo startup (see the on-screen logs)
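If you would rather fetch the public IP from Cloud Shell than click through the console, gcloud can print it directly; substitute the VM name and zone that the module created:

gcloud compute instances describe <VM_NAME> --zone=<ZONE> \
  --format='get(networkInterfaces[0].accessConfigs[0].natIP)'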

After launching the demo (Step 3) and navigating to the app (Step 4 or 5), chat away and observe Intel AMX in operation.
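To confirm that AMX kernels are actually being exercised, one option is oneDNN’s verbose logging, assuming the chatbot’s serving stack runs on PyTorch with oneDNN (as FastChat on CPU typically does):

# Enable oneDNN verbose logging before starting the demo, then watch
# the logs for kernel names containing "amx" (e.g. avx512_core_amx)
export ONEDNN_VERBOSE=1
source /usr/local/bin/run_demo.sh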


Using the Intel Developer Cloud Instead of GCP or AWS for Deployment

You can also create a virtual machine powered by a 4th Gen Intel Xeon Scalable processor on the Intel Developer Cloud.

For details on how to provision the virtual machine, see the Intel Developer Cloud. After the virtual machine has been set up:

  1. SSH into the virtual machine as directed by the Intel Developer Cloud.
  2. Follow the AI Intel Optimized Cloud Recipe instructions to run the automated recipe and launch the LLM chatbot.

GenAI Inferencing: Intel AMX and 4th Gen Xeon Scalable Processors

I hope you get a chance to try generative AI inferencing for yourself! With 4th Gen Intel Xeon Scalable Processors and Intel AMX, you can speed up your AI workloads and create the next wave of AI apps. By using our modules and recipes, you can quickly enable generative AI inferencing and start enjoying its advantages. Data scientists, researchers, and developers can all advance generative AI.
