Monday, December 23, 2024

Develop An AI Avatar ChatBot with PyTorch And OPEA

Use PyTorch to Build an AI Avatar Chatbot.

Build and deploy on Intel Gaudi AI Accelerators and Intel Xeon Scalable Processors.

Summary

AI avatar chatbots are changing how businesses across several sectors operate in the current AI era. Digital human AIs are used everywhere, from the front desks of financial and educational institutions to public spaces such as hospitals and airports, to help with customer service and personalized advice.

An AI Avatar Audio Chatbot was created to meet this demand from businesses adopting digital human AIs. This article explains how to build an AI avatar chatbot on Intel Xeon Scalable Processors and Intel Gaudi AI Accelerators using the Open Platform for Enterprise AI (OPEA), a robust, open, multi-provider framework of decomposable building blocks for state-of-the-art (SOTA) Generative AI (GenAI) systems. It also demonstrates how Intel-optimized software such as PyTorch and the Intel Gaudi Software Suite can speed up training and inference for these AI systems.

OPEA (Open Platform for Enterprise AI)

The OPEA platform includes:

  • A collection of microservice building blocks, such as prompt engines, data stores, and LLMs, for SOTA GenAI systems
  • Architectural blueprints for Retrieval-Augmented Generation (RAG) GenAI component stacks
  • Micro- and megaservices for deploying GenAI solutions into production
  • Four assessment criteria for grading GenAI systems: performance, features, reliability, and enterprise-grade readiness
Retrieval-Augmented Generation (RAG) enhanced GenAI reference solution on OPEA (image credit: Intel)

There are three main parts of any OPEA Enterprise AI solution:

Microservices: Offer flexibility and scalability

All available microservices are hosted in the GenAIComps repository. Each microservice is designed to perform a specific task or function within the application architecture.

Megaservices: Offer comprehensive solutions

The GenAIExamples repository contains a set of use-case-based applications. Unlike individual microservices, which focus on specific tasks, a megaservice orchestrates several microservices to deliver a complete solution.

Portals: Facilitate communication

Users access a megaservice and its underlying microservices through the Gateway. Gateways support API definition, versioning, rate limiting, request transformation, and data retrieval from microservices.
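
As a purely hypothetical illustration of a client calling a megaservice through its gateway (the host, port, route, and payload below are placeholders, not the actual AvatarChatbot gateway interface):

# Hypothetical client request to an OPEA megaservice gateway.
# Host, port, route, and payload fields are illustrative placeholders.
import requests

response = requests.post(
    "http://localhost:8888/v1/example-megaservice",   # placeholder endpoint
    json={"audio": "<base64-encoded user audio>"},    # placeholder payload
    timeout=300,
)
response.raise_for_status()
print(response.json())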

In addition, most AI solutions come with a matching user interface (UI) that lets users interact with OPEA megaservices in a more direct, interactive, and visual way.

Building an OPEA-based AI Avatar Chatbot

Read this blog to learn how an AI avatar chatbot can be created using the OPEA framework.

How to set up Intel Gaudi AI Accelerators and Intel Xeon Scalable Processors

The Docker images required for the same FastAPI service, microservice, or megaservice may differ depending on the target deployment environment. The “wav2lip” service container, for instance, uses the image “opea/wav2lip:latest” when deployed on Intel Xeon CPUs and “opea/wav2lip-gaudi:latest” when deployed on Intel Gaudi AI Accelerators. These images are built from different Dockerfile versions to pull in their respective dependencies.

Because of this distinction, each service container must be explicitly configured in a YAML file. OPEA users can modify the following essential components through the “compose.yaml” file (a simplified sketch follows the list below):

  • Service image: the specific Docker image each service should use
  • Ports: mapping external ports to the container’s internal port
  • Environment variables: ensuring every service has the required configuration and context (e.g., device, inference mode, LLM model name, additional inputs, etc.)
  • Volumes: data shared between the host and containers
  • Additional settings such as runtime, network, IPC, cap_add, etc.
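
As an illustration only, a single service entry in “compose.yaml” might look roughly like the sketch below; the image tag follows the wav2lip example above, while the port, environment variables, and volume path are hypothetical placeholders rather than the actual GenAIExamples configuration.

# Illustrative compose.yaml sketch; values are placeholders, not the real config
services:
  wav2lip-service:
    image: opea/wav2lip-gaudi:latest   # opea/wav2lip:latest when targeting Xeon
    ports:
      - "7860:7860"                    # external:internal port mapping (placeholder)
    environment:
      - DEVICE=hpu                     # target device (placeholder name/value)
      - INFERENCE_MODE=wav2lip+gfpgan  # placeholder inference-mode setting
    volumes:
      - ./data:/home/user/data         # host/container shared data (placeholder)
    runtime: habana                    # Gaudi container runtime
    cap_add:
      - SYS_NICE
    ipc: host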

When running on Intel Gaudi AI accelerators, the OPEA-based AI Avatar Audio Chatbot example automatically divides its workload across four Intel Gaudi cards on a single Intel Gaudi 2 AI accelerator node, with each microservice assigned its own card. By setting the “HABANA_VISIBLE_MODULES” environment variable to the values 0, 1, 2, and 3, the Docker containers for the “asr,” “llm,” “tts,” and “animation” microservices are each bound to a specific Intel Gaudi card, as can be confirmed in the “compose.yaml” configuration file. This deployment pattern is referred to as “Multiple Dockers Each With a Single Workload.”
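
In the compose file, that card assignment corresponds roughly to the excerpt below (only the relevant environment lines are shown; the surrounding fields are omitted):

# Illustrative excerpt: binding each microservice container to one Gaudi card
services:
  asr:
    environment:
      - HABANA_VISIBLE_MODULES=0
  llm:
    environment:
      - HABANA_VISIBLE_MODULES=1
  tts:
    environment:
      - HABANA_VISIBLE_MODULES=2
  animation:
    environment:
      - HABANA_VISIBLE_MODULES=3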

Delving deeper into the Intel Gaudi 2 node gives a better understanding of the Intel Gaudi card layout. The “hl-smi” System Management Interface tool can first be used to determine the mapping between the index and module ID of the Intel Gaudi processors.

Likewise, NUMA affinity can be determined with another “hl-smi” command. NUMA (Non-Uniform Memory Access) affinity for devices means aligning each device with a particular region of memory to maximize performance. In multi-card systems such as Intel Gaudi, NUMA affinity gives each Intel Gaudi card fast access to the memory managed by the CPU closest to it. In this case, the Intel Gaudi cards with module IDs 0–3 map to memory managed by CPUs 0 and 1.

Optimizations for Deep Learning

Operation Mode: Lazy mode versus eager mode

The environment variable “PT_HPU_LAZY_MODE,” which can have values 0 or 1 (0 for Eager mode and 1 for Lazy mode), controls the mode of operation.

To use Eager mode, this variable is set to 0 in “GenAIComps/comps/animation/entrypoint.sh”. Eager mode is then extended with torch.compile, which wraps the face detector, Wav2Lip, and GFPGAN models into corresponding graphs. In contrast to Lazy mode, Eager mode with torch.compile minimizes host compute overhead by eliminating the need to construct a graph for every iteration. Additionally, for both training and inference, the torch.compile backend argument needs to be set to hpu_backend.

# Load the Wav2Lip, BG sampler, and GFPGAN models
model = load_model(args)
model = torch.compile(model, backend="hpu_backend")
print("Wav2Lip Model loaded")

Lazy mode, however, allows users to keep the advantages and flexibility of PyTorch’s define-by-run approach. Operations accumulated in the graph begin to execute only when a tensor value is actually needed. This allows the Intel Gaudi graph to be built from many operations, giving the graph compiler the opportunity to optimize device execution across them.
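
As a minimal sketch of the Lazy-mode flow (assuming the Intel Gaudi PyTorch bridge, habana_frameworks.torch, is installed; the tiny model below is a stand-in, not the Wav2Lip code):

# Minimal Lazy-mode sketch: ops accumulate into a graph and execute only
# when a value is needed or mark_step() is called.
import os
os.environ.setdefault("PT_HPU_LAZY_MODE", "1")   # 1 = Lazy mode, 0 = Eager mode

import torch
import habana_frameworks.torch.core as htcore    # Intel Gaudi PyTorch bridge

model = torch.nn.Linear(16, 4).to("hpu").eval()  # toy stand-in model
x = torch.randn(8, 16).to("hpu")

with torch.no_grad():
    y = model(x)         # queued on the device, not yet executed
    htcore.mark_step()   # flush the accumulated graph to the Gaudi device
print(y.cpu())           # copying to the host also forces execution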

Use PyTorch Autocast and Intel Neural Compressor for BF16 and FP8 Inference

Mixed-precision quantization speeds up deep-learning neural network operations while reducing the size of the model weights in memory. In this example, two techniques are used to quantize the Wav2Lip model and the face detector.

First, native PyTorch autocast can automatically run a default set of registered operators (ops) in the lower-precision bfloat16 data type. This allows the facial animation module to run inference in BF16.

with torch.no_grad():
    with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
        pred = model(mel_batch, img_batch)

Second, the Intel Neural Compressor package enables FP8 inference on the Intel Gaudi accelerator. Using the FP8 data type for inference cuts the model’s required memory bandwidth in half, and FP8 compute is twice as fast as BF16. To quantize the model, first create a JSON configuration file and then use the “convert” API, as indicated in this link.
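
As a rough sketch of that flow, assuming Intel Neural Compressor’s PyTorch FP8 API for Gaudi (the config file name and the toy model below are placeholders, not the actual Wav2Lip quantization code):

# Sketch: FP8 conversion with Intel Neural Compressor on Gaudi.
# "quant_config.json" is an assumed name for the JSON configuration file
# described above; the Linear layer stands in for the real model.
import torch
from neural_compressor.torch.quantization import FP8Config, convert

config = FP8Config.from_json_file("quant_config.json")  # quantization recipe
model = torch.nn.Linear(16, 4)                          # toy stand-in model
model = convert(model, config)                          # swap supported modules to FP8
model = model.to("hpu").eval()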

Feature Enhancements

Text-to-Speech (TTS) Service

The SOTA model “microsoft/SpeechT5” is used as the default TTS model in OPEA. When splitting lengthy text into batches, the code in the “speecht_model.py” file automatically detects the final punctuation in the last token chunk rather than converting the entire length of text tokens to audio in one pass. This change enables the TTS service to produce complete, continuous speech without stopping abruptly.
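
A simplified illustration of that splitting idea (not the actual “speecht_model.py” code; the chunk size is arbitrary) might look like this:

# Simplified sketch: split long text into chunks that end on punctuation,
# so each TTS batch produces complete, continuous speech.
def split_at_punctuation(text, max_chars=200, stops=".!?;:,"):
    chunks, start = [], 0
    while start < len(text):
        window = text[start:start + max_chars]
        cut = max((window.rfind(p) for p in stops), default=-1)  # last punctuation
        end = start + cut + 1 if cut != -1 else start + len(window)
        chunks.append(text[start:end].strip())
        start = end
    return [c for c in chunks if c]

print(split_at_punctuation("Hello there! This is a long reply. It is split cleanly.", max_chars=30))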

Animation Service

The code is available at this site. The Wav2Lip animation allows a configurable frames-per-second (fps) rate for video frame creation, controlled by the user-specified “fps” parameter. When the visual input to Wav2Lip is an image of the avatar’s face, the frame rate of the finished video can be chosen freely. Depending on the frame rate, a variable number of audio mel-spectrogram chunks are fused with each single frame. Although fps=30 offers slightly smoother video rendering, setting fps=10 requires only one third of the video frames, and hence one third of the neural-network iterations and computation. Making the frame rate adjustable is therefore useful for high-throughput, low-latency animation scenarios.
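
The effect of the fps setting on the workload can be sketched as follows; this assumes the common Wav2Lip convention of roughly 80 mel-spectrogram steps per second of audio, and the numbers are illustrative rather than measured:

# Sketch: how fps changes the number of generated video frames (and hence
# Wav2Lip forward passes) for a fixed length of audio.
def num_frames(audio_seconds, fps, mel_steps_per_sec=80):
    total_mel_steps = audio_seconds * mel_steps_per_sec
    mel_steps_per_frame = mel_steps_per_sec / fps  # mel steps fused into one frame
    return int(total_mel_steps / mel_steps_per_frame)

for fps in (30, 10):
    print(f"fps={fps}: ~{num_frames(10, fps)} frames for 10 s of audio")
# fps=30 -> ~300 frames; fps=10 -> ~100 frames, about one third of the compute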
