Monday, December 23, 2024

AWS Bedrock Prompt Caching And Intelligent Prompt Routing


Use Amazon Bedrock Intelligent Prompt Routing and Amazon Bedrock Prompt Caching to reduce cost and latency (preview)

Amazon Bedrock has announced a preview of two features that help reduce cost and latency for generative AI applications:


Intelligent Prompt Routing on Amazon Bedrock

When invoking a model, you can now route requests across foundation models (FMs) from the same model family to optimize for both cost and quality. For instance, depending on the complexity of the prompt, Amazon Bedrock can dynamically route requests between Claude 3.5 Sonnet and Claude 3 Haiku in the Anthropic Claude model family. Likewise, it can route requests between Meta Llama 3.1 70B and 8B.

The prompt router predicts which model will perform best for each request, optimizing both response quality and cost. This is especially helpful for applications such as customer service assistants, where more capable models answer complex queries while smaller, faster, and more economical models handle simpler ones. Intelligent Prompt Routing can reduce costs by up to 30% without compromising accuracy.

Amazon Bedrock Prompt Caching

Amazon Bedrock now supports prompt caching, allowing you to store frequently used context in prompts and reuse it across multiple model invocations. This is particularly useful for applications that repeatedly use the same context, such as coding assistants that maintain context about code files, or document Q&A systems where users ask many questions about the same document. Cached context remains available for up to five minutes after each access. For supported models, prompt caching can reduce latency by up to 85% and costs by up to 90%.

These capabilities make it easier to reduce latency and balance cost against performance. Let's look at how you can use them in your applications.


Using Amazon Bedrock Intelligent Prompt Routing in the console

Amazon Bedrock Intelligent Prompt Routing uses advanced prompt matching and model understanding techniques to predict each model's performance for every request, optimizing cost and response quality. During the preview, default prompt routers are available for the Anthropic Claude and Meta Llama model families.

Intelligent Prompt Routing is accessible through the AWS Management Console, the AWS Command Line Interface (AWS CLI), and the AWS SDKs. In the Amazon Bedrock console, choose Prompt routers in the Foundation models section of the navigation pane.

To learn more, you can select the default Anthropic prompt router and view its details.

The prompt router's configuration shows that it routes requests between Claude 3.5 Sonnet and Claude 3 Haiku using cross-Region inference profiles. The routing criteria define the quality difference between the responses of the largest and smallest models for each prompt, as predicted by the router's internal model at runtime. Anthropic's Claude 3.5 Sonnet serves as the fallback model when none of the selected models meets the desired performance criteria.

Prompt routers can also be used when running evaluations or when integrating with other Amazon Bedrock features such as Amazon Bedrock Knowledge Bases and Amazon Bedrock Agents.
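As a sketch, an SDK call through a prompt router targets the router's ARN in place of a single model ID in the Converse API. The ARN, Region, and account ID below are placeholders, and the trace field name is an assumption based on the preview; check the Amazon Bedrock console (Foundation models > Prompt routers) for the actual ARN.

```python
# Placeholder ARN for a default Anthropic prompt router -- look up the real
# value in the Amazon Bedrock console; the account ID here is fictitious.
ROUTER_ARN = (
    "arn:aws:bedrock:us-east-1:111122223333:"
    "default-prompt-router/anthropic.claude:1"
)

def build_router_request(prompt: str) -> dict:
    """Build keyword arguments for the bedrock-runtime converse() call,
    using the prompt router ARN in place of a model ID."""
    return {
        "modelId": ROUTER_ARN,
        "messages": [
            {"role": "user", "content": [{"text": prompt}]},
        ],
    }

# With boto3 (not executed here), the invocation would look like:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   response = client.converse(**build_router_request("Summarize my order status."))
#   # The response trace reports which model actually served the request
#   # (exact field name assumed from the preview documentation).
```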

Using Bedrock Prompt Caching with an AWS SDK

The Amazon Bedrock Converse API supports prompt caching. When you tag content for caching and send it to the model for the first time, the model processes the input and stores the intermediate results in a cache. For subsequent requests containing the same content, the model loads the preprocessed results from the cache, significantly reducing both latency and cost.

Here are the steps to integrate Bedrock Prompt Caching into your applications:

  • Identify the portions of your prompts that are reused most often.
  • Mark those portions of the message list for caching using the new cachePoint block.
  • Monitor cache usage and latency improvements in the usage section of the response metadata.
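The steps above can be sketched as follows for a document Q&A use case. The model ID and the usage field names in the trailing comment are assumptions based on the preview; the cachePoint block is placed immediately after the reusable context so that everything before it is cached.

```python
def build_cached_request(document_text: str, question: str) -> dict:
    """Build converse() keyword arguments that mark a large document as
    cacheable, so repeat questions about it reuse the cached prefix."""
    return {
        # Model ID is illustrative; use a model that supports prompt caching.
        "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
        "system": [
            # Frequently reused context goes first ...
            {"text": "Answer questions about this document:\n" + document_text},
            # ... and the cachePoint marks everything before it for caching.
            {"cachePoint": {"type": "default"}},
        ],
        "messages": [
            {"role": "user", "content": [{"text": question}]},
        ],
    }

# After client.converse(**build_cached_request(doc, q)) with boto3, the
# usage section of the response metadata reports cache read/write token
# counts (field names assumed), which show whether the cache was hit.
```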

Things to know

Amazon Bedrock Intelligent Prompt Routing is available today in preview in the US East (N. Virginia) and US West (Oregon) AWS Regions. During the preview you can use the default prompt routers, and there is no additional cost for using a prompt router; you pay the price of the selected model. Prompt routers can be used with other Amazon Bedrock features, including configuring agents, using knowledge bases, and running evaluations.

Intelligent Prompt Routing currently supports only English-language prompts, because the prompt routers' internal model must understand the complexity of the prompt.

Amazon Bedrock support for prompt caching is available in preview in US West (Oregon) for Anthropic's Claude 3.5 Sonnet V2 and Claude 3.5 Haiku. Bedrock Prompt Caching is also available in US East (N. Virginia) for Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro. Access to the Amazon Bedrock prompt caching preview is available upon request.

With Bedrock Prompt Caching, cache reads cost 90 percent less than non-cached input tokens, and cache storage incurs no additional infrastructure fees. When using Anthropic models, writing tokens to the cache carries an extra cost; with Amazon Nova models, cache writes incur no additional charge.
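To make the 90 percent read discount concrete, here is a back-of-the-envelope estimate of input-token cost when part of a prompt is served from cache. The per-1,000-token price is a placeholder, not a published rate, and the calculation ignores any cache-write surcharge (which applies to Anthropic models).

```python
def cached_input_cost(tokens: int, price_per_1k: float,
                      cache_hit_ratio: float,
                      read_discount: float = 0.90) -> float:
    """Estimated input-token cost when cache_hit_ratio of the tokens are
    cache reads, each discounted by read_discount (90% per the preview)."""
    cached = tokens * cache_hit_ratio
    uncached = tokens - cached
    return (uncached * price_per_1k / 1000
            + cached * price_per_1k * (1 - read_discount) / 1000)

# Example: 100,000 input tokens at a placeholder $0.003 per 1K tokens,
# with 80% of the tokens served from cache:
full_price = 100_000 * 0.003 / 1000                  # $0.30 without caching
with_cache = cached_input_cost(100_000, 0.003, 0.80)
# 20,000 uncached: $0.06; 80,000 cached at 10% of price: $0.024
# Total $0.084 -- a 72% reduction on input-token cost in this scenario.
```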

With prompt caching, content is cached for a maximum of five minutes, and this countdown resets with each cache hit. Bedrock Prompt Caching transparently supports cross-Region inference, so your applications can combine the flexibility of cross-Region inference with the cost and latency benefits of prompt caching.

These new capabilities make it easier to build generative AI applications that are both high-performing and cost-effective. By routing requests intelligently and caching frequently used content, you can significantly cut costs while maintaining, or even improving, application performance.

Thota Nithya has been writing cloud computing articles for Govindhtech since April 2023. She is a science graduate and a cloud computing enthusiast.