Implicit Caching Is Now Supported In Gemini 2.5 Models

In May 2024, Google introduced context caching, which let developers use explicit caching to save 75% on repeated context passed to the models. Today, Google is releasing implicit caching, a much-requested feature of the Gemini API.

Using the Gemini API for implicit caching

Implicit caching lets developers benefit from cache cost savings immediately, with no need to create a cache explicitly. When you send a request to one of the Gemini 2.5 models, it can now register a cache hit if it shares a prefix with one of your earlier requests. The API dynamically passes the savings back to you, giving you the same 75% token discount.

To increase the likelihood of a cache hit, keep the content at the start of the request the same and put the context that varies from request to request, such as a user's question, at the end of the prompt, as in the sketch below. The Gemini API documentation contains more best practices for implicit caching.
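As a minimal sketch using the google-genai Python SDK (the model name, file path, and questions here are illustrative, not from the announcement), the large document stays fixed at the front of every request and only the question at the end changes:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Large, stable content goes first so consecutive requests share a prefix.
document = open("contract.txt").read()  # hypothetical shared context

def ask(question: str) -> str:
    # Only the tail of the prompt varies between requests.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[document, question],
    )
    return response.text

print(ask("Who are the parties to this agreement?"))
print(ask("What is the termination clause?"))  # same prefix; likely cache hit
```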

To increase the number of requests that can be cache hits, Google lowered the minimum request size to 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for 2.5 Pro.

Understanding token discounts in Gemini 2.5

If you want guaranteed cost savings, you can still use the explicit caching API, which supports both the Gemini 2.5 and 2.0 models. When you use the Gemini 2.5 models, the usage metadata includes cached_content_token_count, which shows how many tokens in the request were cached and therefore charged at the reduced price.

Preview models may change before they become stable, and they have more restrictive rate limits.


|  | Free Tier | Paid Tier, per 1M tokens in USD |
| --- | --- | --- |
| Input price | Free of charge, use "gemini-2.5-pro-exp-03-25" | $1.25, prompts <= 200k tokens; $2.50, prompts > 200k tokens |
| Output price (including thinking tokens) | Free of charge, use "gemini-2.5-pro-exp-03-25" | $10.00, prompts <= 200k tokens; $15.00, prompts > 200k tokens |
| Context caching price | Not available | $0.31, prompts <= 200k tokens; $0.625, prompts > 200k tokens; $4.50 / 1,000,000 tokens per hour (storage) |
| Grounding with Google Search | Free of charge, up to 500 RPD | 1,500 RPD (free), then $35 / 1,000 requests |
| Used to improve our products | Yes | No |

Context caching

In a typical AI workflow, you might pass the same input tokens to a model over and over. The Gemini API offers two distinct caching techniques:

  • Implicit caching (automatic, no assurance of cost savings)
  • Explicit caching (manual, guaranteed to save money)

Gemini 2.5 models come with implicit caching enabled by default. If a request contains content that is a cache hit, Google automatically passes the cost savings back to you.

Explicit caching is useful when you want to guarantee cost savings, but it requires more developer work.

Implicit caching

Implicit caching is enabled by default for all Gemini 2.5 models. If your request hits a cache, the cost savings are passed on automatically; you don't have to do anything to enable this. It took effect on May 8, 2025. The minimum input token count for context caching is 1,024 for 2.5 Flash and 2,048 for 2.5 Pro.

To make an implicit cache hit more likely:

  • Put large, common content at the beginning of your prompt.
  • Send requests with similar prefixes in quick succession.

The usage_metadata field of the response object shows how many tokens were cache hits.
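As a quick illustration with the google-genai Python SDK (the model name and prompt contents are made up for the example), you can read the cache-hit count from usage_metadata after a request:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# A large shared prefix followed by a short varying question (illustrative).
shared_prefix = "You are a support agent for ExampleCo. Product manual: ..."
question = "How do I reset my password?"

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=shared_prefix + "\n\n" + question,
)

usage = response.usage_metadata
print("prompt tokens:", usage.prompt_token_count)
# Tokens counted here were served from cache and billed at the discounted rate.
print("cached tokens:", usage.cached_content_token_count)
```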

Explicit caching

The Gemini API's explicit caching capability lets you send some content to the model once, cache the input tokens, and then reuse the cached tokens for subsequent requests. At certain volumes, using cached tokens is cheaper than repeatedly passing in the same corpus of tokens.

You can specify how long a cache should hold a collection of tokens before they are automatically removed; this caching period is called the time to live (TTL). If it is not set, the TTL defaults to one hour. The caching cost depends on the input token size and how long you want the tokens to persist.
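As a rough back-of-the-envelope illustration using the Gemini 2.5 Pro rates from the table above (the token count and duration are made up, and the interpretation of $0.31/1M as the cached-read price is an assumption from that table):

```python
# Estimated cost of keeping 150,000 tokens cached for 2 hours and reading
# them once per request, at the Gemini 2.5 Pro rates listed above
# (prompts <= 200k tokens). Figures are illustrative.
cached_tokens = 150_000
storage_hours = 2
storage_rate = 4.50 / 1_000_000  # USD per token per hour of cache storage
read_rate = 0.31 / 1_000_000     # USD per cached token read

storage_cost = cached_tokens * storage_rate * storage_hours
cost_per_read = cached_tokens * read_rate
print(f"storage: ${storage_cost:.2f}")           # $1.35
print(f"per cached read: ${cost_per_read:.4f}")  # ~$0.0465
```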

As in the quickstart, this section assumes that you have installed a Gemini SDK (or have curl installed) and that you have set up an API key.
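Here is a minimal sketch of explicit caching with the google-genai Python SDK under those assumptions (the model name, display name, file path, and prompts are illustrative):

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Cache a large body of content once; the "3600s" TTL matches the
# one-hour default described above.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="shared-corpus",          # hypothetical name
        system_instruction="Answer using only the cached document.",
        contents=[open("corpus.txt").read()],  # hypothetical large document
        ttl="3600s",
    ),
)

# Subsequent requests reference the cache instead of resending the tokens.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the key findings.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```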
