
Conversation

tpaulshippy
Contributor

@tpaulshippy tpaulshippy commented Jun 9, 2025

What this does

Automatically opts into prompt caching in both the Anthropic and Bedrock providers for Claude models that support it, and reports prompt-caching token counts for OpenAI and Gemini, which cache automatically.

Disabling prompt caching:

RubyLLM.configure do |config|
  config.cache_prompts = false # Disable prompt caching with Anthropic models
end

Caching just system prompts:

chat = RubyLLM.chat
chat.with_instructions("You are a helpful assistant.")
chat.ask("What is the capital of France?", cache: :system)

Caching just user prompts:

chat = RubyLLM.chat
chat.ask("What is the capital of France?", cache: :user)

Caching just tool definitions:

chat = RubyLLM.chat
chat.with_instructions("You are a helpful assistant.")
chat.with_tool(MyTool)
chat.ask("What is the capital of France?", cache: :tools)

Caching system prompts and tool definitions:

chat = RubyLLM.chat
chat.with_instructions("You are a helpful assistant.")
chat.with_tool(MyTool)
chat.ask("What is the capital of France?", cache: [:system, :tools])

Type of change

  • New feature

Scope check

  • I read the Contributing Guide
  • This aligns with RubyLLM's focus on LLM communication
  • This isn't application-specific logic that belongs in user code
  • This benefits most users, not just my specific use case

Quality check

  • I ran overcommit --install and all hooks pass
  • I tested my changes thoroughly
  • I updated documentation if needed
  • I didn't modify auto-generated files manually (models.json, aliases.json)

API changes

  • New public methods/classes

Related issues

Resolves #13

@tpaulshippy tpaulshippy changed the title from "Prompt caching" to "Prompt caching for Claude" on Jun 9, 2025
@tpaulshippy tpaulshippy marked this pull request as ready for review June 9, 2025 21:44
@tpaulshippy
Contributor Author

@crmne As I don't have an Anthropic key, I'll need you to generate the VCR cassettes for that provider. Hoping everything just works, but let me know if not.

@crmne
Owner

crmne commented Jun 11, 2025

@tpaulshippy this would be great to have! Would you be willing to enable it on all providers?

I'll do a proper review when I can.

@tpaulshippy
Contributor Author

My five minutes of research indicates that at least OpenAI and Gemini take the approach of automatically caching for you based on the size and structure of your request. So the only support I think we'd really need for those two is to populate the cached token counts on the response messages, unless we want to try to support explicit caching on the Gemini API, which looks complex and not as commonly needed.

Do you know of other providers that require payload changes for prompt caching?

def with_cache_control(hash, cache: false)
  return hash unless cache

  hash.merge(cache_control: { type: 'ephemeral' })
end
Contributor Author

Realizing this might cause errors on older models that do not support caching. If it does, we could raise here, or just let the API validation handle it. I'm torn on whether the complexity of a capabilities check is worth it, as these models are probably so rarely used.
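
If we did add a guard, it could be as small as the sketch below; note that the supports_caching? capability helper is an assumption for illustration, not something that exists in the codebase today:

# Sketch only: skip the cache_control block when the model can't use it.
# model.supports_caching? is a hypothetical capability check, not part of this PR.
def with_cache_control(hash, cache: false, model: nil)
  return hash unless cache
  return hash if model && !model.supports_caching?

  hash.merge(cache_control: { type: 'ephemeral' })
end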

@tpaulshippy
Contributor Author

@crmne As I don't have an Anthropic key, I'll need you to generate the VCR cassettes for that provider. Hoping everything just works, but let me know if not.

Scratch that. I decided to stop being a cheapskate and just pay Anthropic their $5.

@tpaulshippy
Contributor Author

I'm looking to implement this in our project, and now I'm wondering if it should be an opt-out rather than an opt-in. If you are using unique prompts every time, I guess caching adds some cost, but my guess is that in most applications prompts will get repeated, especially system prompts.

Owner

@crmne crmne left a comment

Thank you for this feature @tpaulshippy; however, there are several improvements I'd like you to make before we merge this.

On top of the ones mentioned in the comments, the most important is that I'd like to have prompt caching implemented in all providers.

Plus, I have not fully checked the logic in providers/anthropic, but at first glance the patch seems a bit heavy-handed in the amount of changes needed. Were all the changes necessary, or could it be done in a simpler manner?

@crmne crmne added the enhancement label on Jul 16, 2025
@tpaulshippy
Contributor Author

tpaulshippy commented Jul 16, 2025

I'd like to have prompt caching implemented in all providers.

Did you see this? Is the request to populate the cached token counts on the response messages for OpenAI and Gemini?

@crmne
Owner

crmne commented Jul 16, 2025

Did you see this? Is the request to populate the cached token counts on the response messages for OpenAI and Gemini?

Thank you for pointing that out, I had missed it. I think it would certainly be a nice addition to RubyLLM to have all providers offer almost the same level of caching support.

@tpaulshippy
Contributor Author

Did you see this? Is the request to populate the cached token counts on the response messages for OpenAI and Gemini?

Thank you for pointing that out, I had missed it. I think it would certainly be a nice addition to RubyLLM to have all providers offer almost the same level of caching support.

OK, we have a bit of a naming issue. Here are the property names we get from each provider:

  • Anthropic: cache_creation_input_tokens, cache_read_input_tokens
  • OpenAI: cached_tokens
  • Gemini: cached_content_token_count

My reading of the docs indicates that the OpenAI and Gemini values correspond pretty closely to Anthropic's cache_read_input_tokens.

What should we call these properties in the Message?

@crmne
Owner

crmne commented Jul 16, 2025

For the naming, let's go with:

  • cached_tokens - maps to the cache read values from all providers (the main property developers will use)
  • cache_creation_tokens - Anthropic-specific cache creation cost (nil for other providers)

This keeps it consistent with our existing input_tokens/output_tokens pattern while handling the provider differences cleanly.
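
For illustration, the mapping I have in mind looks roughly like the sketch below; the method and the exact usage keys/nesting per provider are assumptions, not prescriptive code:

# Sketch only: normalize each provider's usage fields onto the two Message
# properties. Key names follow the list above; the exact nesting inside each
# provider's usage block may differ from what is shown here.
def cache_token_fields(provider, usage)
  case provider
  when :anthropic, :bedrock
    { cached_tokens: usage['cache_read_input_tokens'],
      cache_creation_tokens: usage['cache_creation_input_tokens'] }
  when :openai
    { cached_tokens: usage['cached_tokens'], cache_creation_tokens: nil }
  when :gemini
    { cached_tokens: usage['cached_content_token_count'], cache_creation_tokens: nil }
  end
end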

Can you update the Message properties to use these names? Thanks Paul!

@tpaulshippy
Contributor Author

That could work for system and user messages. How about tools?

I still think there will be some complexity in the anthropic provider that may be unavoidable.

@tpaulshippy
Contributor Author

tpaulshippy commented Aug 30, 2025

One complication is that these params have to be nested under the content array for Anthropic.

So it would actually need to be something like this:

chat.add_message(
  role: :user,
  content: "huge doc",
  params: { content: [{ cache_control: { type: 'ephemeral' } }] }
)

Right?

@tpaulshippy
Contributor Author

tpaulshippy commented Aug 30, 2025

I don't think adding the params to all messages with chat.with_message_params will generally work as you suggest.

A couple of reasons:

  1. Anthropic limits you to 4 cache breakpoints per request.
  2. You generally want your cache breakpoints on the last system message or the last user message to get the maximum use of the cache (see the sketch below).
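
To make the placement concrete, here is a hand-written sketch of the kind of Anthropic payload that strategy implies (illustrative only, not output from this PR's code):

# Breakpoints on the system prompt, the tools, and only the *last* user
# message, which keeps the request well under the 4-breakpoint limit.
{
  system: [
    { type: 'text', text: 'You are a helpful assistant.',
      cache_control: { type: 'ephemeral' } }
  ],
  tools: [
    { name: 'my_tool', input_schema: { type: 'object', properties: {} },
      cache_control: { type: 'ephemeral' } }
  ],
  messages: [
    { role: 'user', content: [{ type: 'text', text: 'earlier question' }] },
    { role: 'assistant', content: [{ type: 'text', text: 'earlier answer' }] },
    { role: 'user',
      content: [{ type: 'text', text: 'latest question',
                  cache_control: { type: 'ephemeral' } }] }
  ]
}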

@tpaulshippy
Contributor Author

tpaulshippy commented Aug 30, 2025

I do like the suggestion of adding params to messages, because that could give more fine-grained control over exactly where you want the breakpoints. It would also enable passing optional keys on other providers, like this one from OpenAI:

[screenshot of an OpenAI request option omitted]

EDIT: Not seeing this option on the OpenAI Responses API...

Contributor Author

@tpaulshippy tpaulshippy left a comment

I'd like to explore making this a per-message option, and maybe something like a with_tools_params option, to make this more provider-agnostic.

Thanks for the feedback @crmne

chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')

# Enable caching for different types of content
chat.cache_prompts(
Contributor Author

The more I think about it, the odder it seems to have a provider-specific feature enabled like this. I wish Anthropic were just more like the others in this area.

@tpaulshippy
Contributor Author

One complication is that these params have to be nested under the content array for Anthropic.

So it would actually need to be something like this:

chat.add_message(
  role: :user,
  content: "huge doc",
  params: { content: [{ cache_control: { type: 'ephemeral' } }] }
)

Looks like to make this work, deep_merge will have to be enhanced to support arrays. Currently it just overrides the whole array rather than merging the hashes within. It will need something like this:

[screenshot of the proposed deep_merge change omitted]
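
For reference, in case the screenshot doesn't come through, here is the idea in plain text (a sketch of the approach, not the exact code):

# Sketch: when both sides are arrays, merge element-wise instead of letting
# the override array replace the original one wholesale.
def deep_merge(original, override)
  if original.is_a?(Hash) && override.is_a?(Hash)
    original.merge(override) { |_key, orig, over| deep_merge(orig, over) }
  elsif original.is_a?(Array) && override.is_a?(Array)
    length = [original.length, override.length].max
    Array.new(length) { |i| deep_merge(original[i], override[i]) }
  else
    override.nil? ? original : override
  end
end

# With that, { content: [{ cache_control: { type: 'ephemeral' } }] } augments the
# existing content block instead of replacing the whole content array.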
What do you think, @crmne?

@tpaulshippy
Contributor Author

I'm honestly a bit torn on the params approach. My concern is that it will require consumers to deal with a lot of messy hashes, and avoiding that is one of the main reasons I switched to this library. On the other hand, this particular feature (opting in to caching with per-message breakpoints) is very Anthropic-specific, so following the pattern set by other provider-specific features, it should be enabled via a params hash.

@tpaulshippy
Contributor Author

tpaulshippy commented Sep 8, 2025

I decided to try a different approach. Since opting into caching is a provider-specific feature, I decided to introduce with_provider_options, where you can pass either named parameters or an instance of an options class.

I don't love the params: { cache_control: { type: 'ephemeral' } } approach for a couple of reasons:

  1. The hash looks messy and feels like I'm going back to langchainrb
  2. It requires people to understand the intricacies of the provider payload. Avoiding this is one of the reasons people use this gem.

I think we want a beautiful way to turn on caching for Anthropic.
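
Roughly what the call site looks like with this approach; the exact option names and the options class shown below are illustrative, not necessarily the final surface:

chat = RubyLLM.chat(model: 'claude-3-5-haiku-20241022')

# Named parameters...
chat.with_provider_options(cache: [:system, :tools])

# ...or an options object (class name is illustrative)
chat.with_provider_options(
  RubyLLM::Providers::Anthropic::ChatOptions.new(cache: :system)
)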

What do you think, @crmne?

@AlexVPopov
Contributor

Hey @tpaulshippy, thank you for this! This feature is a dealbreaker for us and was the last blocker to fully migrating to RubyLLM. We've forked your fork and are currently running it in production - works like a charm! Thank you! ❤️

@crmne we really hope you would consider merging this. 😊🤞🏻

@crmne
Owner

crmne commented Sep 21, 2025

I decided to try a different approach. Since opting into caching is a provider-specific feature, I decided to introduce with_provider_options, where you can pass either named parameters or an instance of an options class.

Honestly, this feels like a weird in-between.

The reason why I proposed .with_message_params is that it is generic and we can reuse it in many different contexts, not just prompt caching for Anthropic, and it's completely in line with the rest of the .with_params and .with_headers APIs.

The other way I'm willing to accept is: RubyLLM.chat(cache: true) and chat.ask("message", cache: true)

Simple. Maybe we should even enable that by default (and add a configuration toggle).

@tpaulshippy
Contributor Author

The other way I'm willing to accept is: RubyLLM.chat(cache: true) and chat.ask("message", cache: true)

And that would do what? Cache the last message, the last system prompt, and the tools? What if someone wants to cache just tools or just their system message? Setting the cache breakpoint on the user message is useful in multi-turn conversations, but in one-shot prompt scenarios you only want to cache your system prompt, since it is unlikely you will see the exact same user message again within a five-minute window.

@crmne
Owner

crmne commented Sep 22, 2025

And that would do what? Cache the last message, the last system prompt, and the tools? What if someone wants to cache just tools or just their system message? Setting the cache breakpoint on the user message is useful in multi-turn conversations, but in one-shot prompt scenarios you only want to cache your system prompt, since it is unlikely you will see the exact same user message again within a five-minute window.

I get your point. Let's simply expand on that then: cache can accept a boolean, a symbol that's either :system, :user, or :tools, or an array of said symbols.

@tpaulshippy tpaulshippy requested a review from crmne September 22, 2025 15:37
@sosso

sosso commented Sep 24, 2025

One-shot prompt scenarios are our main use case, so the above would work great. Caching support is also a blocker on us making the jump to RubyLLM. Thanks, all!
