Anthropic’s new Claude prompt caching will save developers a fortune

Anthropic introduced prompt caching on its API, which remembers the context between API calls and allows developers to avoid repeating prompts.

The prompt caching feature is available in public beta on Claude 3.5 Sonnet and Claude 3 Haiku, but support for the largest Claude model, Opus, is still coming soon.

Prompt caching, described in this 2023 paper, lets users keep frequently used contexts in their sessions. As the models remember these prompts, users can add background information without paying the full input price for it on every call. This is helpful in instances where someone wants to send a large amount of context in a prompt and then refer back to it in different conversations with the model. It also lets developers and other users better fine-tune model responses.

Anthropic said early users “have seen substantial speed and cost improvements with prompt caching for a variety of use cases — from including a full knowledge base to 100-shot examples to including each turn of a conversation in their prompt.”

The company said potential use cases include reducing costs and latency for long instructions and uploaded documents for conversational agents, faster code autocompletion, providing multiple instructions to agentic search tools and embedding entire documents in a prompt.

Pricing cached prompts

One advantage of caching prompts is lower prices per token, and Anthropic said using cached prompts “is significantly cheaper” than the base input token price.

For Claude 3.5 Sonnet, writing a prompt to be cached will cost $3.75 per 1 million tokens (MTok), but using a cached prompt will cost $0.30 per MTok. The base price of an input to the Claude 3.5 Sonnet model is $3/MTok, so by paying a little more upfront, you pay one-tenth of the base input price each time you reuse the cached prompt.
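
To make the arithmetic concrete, here is a rough Python sketch that compares the cost of resending a shared prompt prefix on every call with the cost of caching it once and reading it back. The three prices are the Claude 3.5 Sonnet figures quoted above. The function names are hypothetical, and the sketch assumes every call after the first hits the cache and ignores output tokens and any uncached input.

```python
# Prices in USD per million tokens (MTok) for Claude 3.5 Sonnet, as quoted in the article.
BASE_INPUT = 3.00    # normal input price
CACHE_WRITE = 3.75   # price to write a prompt to the cache
CACHE_READ = 0.30    # price to read a cached prompt


def cost_without_cache(prefix_tokens: int, calls: int) -> float:
    """Cost of sending the same prefix at the base input price on every call."""
    return calls * prefix_tokens / 1_000_000 * BASE_INPUT


def cost_with_cache(prefix_tokens: int, calls: int) -> float:
    """Cost of one cache write followed by cache reads.

    Assumption: every call after the first is a cache hit.
    """
    if calls == 0:
        return 0.0
    mtok = prefix_tokens / 1_000_000
    return mtok * CACHE_WRITE + (calls - 1) * mtok * CACHE_READ


if __name__ == "__main__":
    prefix = 100_000  # hypothetical shared prefix size in tokens
    for calls in (1, 2, 10):
        plain = cost_without_cache(prefix, calls)
        cached = cost_with_cache(prefix, calls)
        print(f"{calls:>2} calls: ${plain:.3f} uncached vs ${cached:.3f} cached")
```

Under these assumptions a single call is slightly more expensive with caching, because of the write surcharge, and the cached path becomes cheaper from the second call onward.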

Claude 3 Haiku users will pay $0.30/MTok to cache and $0.03/MTok when using stored prompts.

While prompt caching is not yet available for Claude 3 Opus, Anthropic has already published its prices. Writing to cache will cost $18.75/MTok, but accessing the cached prompt will cost $1.50/MTok.

However, as AI influencer Simon Willison noted on X, Anthropic’s cache only has a 5-minute lifetime and is refreshed upon each use.
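
The practical effect of a 5-minute lifetime that resets on each use can be illustrated with a toy model. The Python sketch below is not Anthropic's implementation; the class name, the method and the exact expiry rule are assumptions made purely for illustration. It tracks an expiry time per cached prompt and pushes it forward whenever the prompt is used.

```python
TTL_SECONDS = 5 * 60  # the 5-minute lifetime mentioned in the article


class EphemeralCache:
    """Toy model of a cache whose entries expire unless they are reused in time."""

    def __init__(self, ttl: float = TTL_SECONDS) -> None:
        self.ttl = ttl
        self._expires_at: dict[str, float] = {}

    def lookup(self, key: str, now: float) -> bool:
        """Return True on a cache hit, False on a miss.

        A miss is treated as a fresh cache write. Either way, the entry
        is live for another `ttl` seconds counted from `now`.
        """
        hit = key in self._expires_at and now < self._expires_at[key]
        self._expires_at[key] = now + self.ttl
        return hit


if __name__ == "__main__":
    cache = EphemeralCache()
    # Timestamps are seconds since the first request (hypothetical traffic).
    for t in (0, 240, 500, 900):
        result = "hit" if cache.lookup("system-prompt", now=t) else "miss"
        print(f"t={t:>3}s: {result}")
```

In this model, steady traffic keeps a prompt cached indefinitely, while a gap of more than five minutes between uses forces a new cache write at the higher write price.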

Of course, this is not the first time Anthropic has tried to compete against other AI platforms through pricing. Before the release of the Claude 3 family of models, Anthropic slashed the prices of its tokens.

It’s now in something of a “race to the bottom” against rivals including Google and OpenAI when it comes to offering low-priced options for third-party developers building atop its platform.

Highly requested feature

Other platforms offer a version of prompt caching. Lamina, an LLM inference system, utilizes KV caching to lower the cost of GPUs. A cursory look through OpenAI’s developer forums or GitHub will bring up questions about how to cache prompts.

Prompt caching is not the same as large language model memory. OpenAI’s GPT-4o, for example, offers a memory where the model remembers preferences or details. However, it does not store the actual prompts and responses the way prompt caching does.