Prompt Caching
What Is Prompt Caching
Prompt caching allows you to reduce overall request latency and cost for longer prompts that share identical content at the beginning.
"Prompt" in this context refers to the input you send to the model as part of your chat completions request. Rather than reprocessing the same input tokens over and over again, the service retains a temporary cache of processed input-token computations to improve overall performance. Prompt caching has no impact on the output content returned in the model response; it only reduces latency and cost.
Info
Cache read fees are typically about 10%-25% of the original input price, which can cut input costs by up to 90%.
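As a rough illustration of where that figure comes from, consider a workload that reuses a large static prefix across many requests. All rates below are hypothetical placeholders, not this service's actual pricing; the 1.25x write and 0.1x read multipliers follow the Anthropic-style pricing described later in this page:

```python
# Hypothetical cost comparison for a prompt with a large cacheable prefix.
# All rates are illustrative placeholders, not actual pricing.
BASE = 3.00 / 1_000_000   # $ per input token (example rate)
WRITE = 1.25 * BASE       # cache write billed at 1.25x the input price
READ = 0.10 * BASE        # cache read billed at 0.1x the input price

prefix, suffix, requests = 100_000, 500, 20  # token counts and request count

without_cache = requests * (prefix + suffix) * BASE
# The first request writes the cache; the remaining requests read it.
with_cache = (WRITE * prefix + BASE * suffix) \
    + (requests - 1) * (READ * prefix + BASE * suffix)

print(f"saved {1 - with_cache / without_cache:.0%} on input costs")
```

With these example numbers the cached workload costs well under a fifth of the uncached one; the savings grow with the prefix size and the number of cache hits.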
Best Practices for Prompt Caching
Maximizing Cache Hit Rate
Optimization Recommendations
- Maintain Prefix Consistency: Place static content at the beginning of prompts, variable content at the end
- Use Breakpoints Wisely: Set different cache breakpoints based on content update frequency
- Avoid Minor Changes: Ensure cached content remains completely consistent across multiple requests
- Control Cache Time Window: Initiate subsequent requests within 5 minutes to hit cache
Extending Cache Time (1-hour TTL)
If your request intervals may exceed 5 minutes, consider using 1-hour cache:
{
  "type": "text",
  "text": "Long document content...",
  "cache_control": {
    "type": "ephemeral",
    "ttl": "1h"
  }
}
The write cost for the 1-hour cache is 2x the base input price (compared to 1.25x for the 5-minute cache), so it is only worthwhile for low-frequency but regular call patterns.
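To see when the 2x write premium pays off, here is a hypothetical break-even check. The multipliers follow the text above (1.25x and 2x writes, 0.1x reads); the absolute rate and prefix size are placeholders. It compares calls spaced more than 5 minutes apart, where the 5-minute cache expires and must be rewritten on every call, against a 1-hour cache written once per hour:

```python
# Hypothetical break-even check: calls spaced >5 minutes apart within one hour.
# Multipliers follow the text above (1.25x/2x writes, 0.1x reads); the
# absolute rate is a placeholder, not actual pricing.
RATE, PREFIX = 3.00 / 1_000_000, 50_000  # $/token (example), cached prefix tokens

def five_minute_cost(calls: int) -> float:
    """Every call misses the expired 5-minute cache and rewrites it at 1.25x."""
    return calls * 1.25 * PREFIX * RATE

def one_hour_cost(calls: int) -> float:
    """One 2x write per hour, then 0.1x reads for the remaining calls."""
    return (2.0 + (calls - 1) * 0.1) * PREFIX * RATE

for calls in (1, 2, 6):
    print(calls, round(five_minute_cost(calls), 4), round(one_hour_cost(calls), 4))
```

Under these assumptions a single call per hour still favors the 5-minute cache (1.25x vs 2x), but from two widely spaced calls per hour upward the 1-hour cache is already cheaper.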
Avoiding Common Pitfalls
Common Issues
- Cached Content Too Short: Ensure cached content meets minimum token requirements
- Content Inconsistency: Changes in JSON object key order will invalidate the cache (a risk in languages with unordered map serialization, such as Go and Swift)
- Mixed Format Usage: Using different formatting approaches for the same content
- Ignoring Cache Validity Period: Cache becomes invalid after 5 minutes
Caching Types
Models supported by Siraya AI offer two types of prompt caching mechanisms:
| Caching Type | Usage Method |
|---|---|
| Implicit Caching | No configuration needed, automatically managed by model provider |
| Explicit Caching | Requires cache_control parameter |
Implicit Caching
The following model providers offer implicit automatic prompt caching. No special parameters are required in requests; the model automatically detects and caches reusable content.
| Model Provider | Official Documentation | Quick Start |
|---|---|---|
| OpenAI | Prompt Caching | #openai |
| DeepSeek | Prompt Caching | |
| xAI | Prompt Caching | #grok |
| Google | Prompt Caching | #google-gemini |
| Alibaba | Prompt Caching | |
| MoonshotAI | Prompt Caching | |
| Z.AI | Prompt Caching | |
💡 Optimization Recommendations
To maximize cache hit rate, follow these best practices:
- Static-to-Dynamic Ordering: Place stable, reusable content (such as system instructions, few-shot examples, document context) at the beginning of the messages array
- Variable Content at End: Place variable, request-specific content (such as current user question, dynamic data) at the end of the array
- Maintain Prefix Consistency: Ensure cached content remains completely consistent across multiple requests (including spaces and punctuation)
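The recommendations above can be sketched as follows. The helper and its contents are hypothetical; the point is that the static prefix is byte-for-byte identical across requests, with only the final user message varying:

```python
# Sketch: keep the static prefix identical across requests so implicit
# caching can match it; only the final user message varies.
STATIC_SYSTEM = (
    "You are a support assistant for ExampleCo.\n"  # stable instructions
    "<product documentation, FAQ entries, few-shot examples...>"
)

def build_messages(question: str) -> list[dict]:
    # Identical prefix on every request; the variable content comes last.
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": question},
    ]

first = build_messages("How do I reset my password?")
second = build_messages("What plans do you offer?")
# The shared prefix (everything before the user turn) is unchanged, so
# providers with implicit caching can reuse it automatically.
assert first[0] == second[0]
```

If the system content were regenerated per request (for example with a timestamp or reordered JSON keys), every request would be a cache miss.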
Explicit Caching
Anthropic Claude and Qwen series models support explicit caching through the cache_control parameter. This approach offers the finest-grained control, but requires developers to manage the caching strategy themselves.
| Model Provider | Official Documentation | Quick Start |
|---|---|---|
| Anthropic Claude | Prompt Caching | #anthropic-claude |
How Caching Works
When you send a request with cache_control markers:
- The system checks if a reusable cache prefix exists
- If a matching cache is found, cached content is used (reducing cost)
- If no match is found, the complete prompt is processed and a new cache entry is created
Cached content includes the complete prefix in the request: tools → system → messages (in this order), up to where cache_control is marked.
Automatic Prefix Check
You only need to add a cache breakpoint at the end of static content, and the system will automatically check approximately the preceding 20 content blocks for reusable cache boundaries. If the prompt contains more than 20 content blocks, consider adding additional cache_control breakpoints to ensure all content can be cached.
Getting Started
Anthropic Claude
Minimum Cache Length
Minimum cacheable token count for different models:
| Model Series | Minimum Cache Tokens |
|---|---|
| Claude Opus 4.1/4 | 1024 tokens |
| Claude Haiku 3.5 | 2048 tokens |
| Claude Sonnet 4.5/4/3.7 | 1024 tokens |
Caching Price
- Cache writes: charged at 1.25x the price of the original input pricing
- Cache reads: charged at 0.1x the price of the original input pricing
Cache Breakpoint Count
Prompt caching with Anthropic requires the use of cache_control breakpoints. Each request allows at most 4 breakpoints, and the cache expires after 5 minutes, so it is recommended to reserve breakpoints for large bodies of text such as character cards, CSV data, RAG data, or book chapters. There is also a minimum prompt size of 1024 tokens.
Click here to read more about Anthropic prompt caching and its limitations.
The cache_control breakpoint can only be inserted into the text part of a multipart message. Prompts shorter than the minimum token count will not be cached even if marked with cache_control; such requests are processed normally, but no cache entry is created.
Cache Validity Period
- Default TTL: 5 minutes
- Extended TTL: 1 hour (requires additional fee)
Cache automatically refreshes with each use at no additional cost.
System message caching example:
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. You know the following book very well:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}
User message caching example:
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Given the book below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY",
          "cache_control": {
            "type": "ephemeral"
          }
        },
        {
          "type": "text",
          "text": "Name all the characters in the above book"
        }
      ]
    }
  ]
}
Basic Usage: Caching System Prompts
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.siraya.pro/v1",
    api_key="<API_KEY>",
)

# First request - creates the cache
response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant specializing in literary analysis. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
                },
                {
                    "type": "text",
                    "text": "<Complete content of Pride and Prejudice>",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {
            "role": "user",
            "content": "Analyze the main themes of Pride and Prejudice."
        }
    ]
)
print(response.choices[0].message.content)

# Second request - cache hit
response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant specializing in literary analysis. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
                },
                {
                    "type": "text",
                    "text": "<Complete content of Pride and Prejudice>",
                    "cache_control": {"type": "ephemeral"}  # identical content hits the cache
                }
            ]
        },
        {
            "role": "user",
            "content": "Who are the main characters in this book?"  # only the question differs
        }
    ]
)
print(response.choices[0].message.content)
Advanced Usage: Caching Tool Definitions
When your application uses many tools, caching tool definitions can significantly reduce costs:
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.siraya.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a specified location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and province, e.g. Beijing, Beijing"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "Temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        },
        # More tools can be defined here...
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get current time for a specified timezone",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "timezone": {
                            "type": "string",
                            "description": "IANA timezone name, e.g. Asia/Shanghai"
                        }
                    },
                    "required": ["timezone"]
                }
            },
            "cache_control": {"type": "ephemeral"}  # mark the cache on the last tool
        }
    ],
    messages=[
        {
            "role": "user",
            "content": "What's the current weather and time in Beijing?"
        }
    ]
)
print(response.choices[0].message)
By adding a cache_control marker on the last tool definition, the system will automatically cache all tool definitions as a complete prefix.
Advanced Usage: Caching Conversation History
In long conversation scenarios, you can cache the entire conversation history:
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.siraya.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "...long system prompt",
                    "cache_control": {"type": "ephemeral"}  # cache the system prompt
                }
            ]
        },
        # Previous conversation history
        {
            "role": "user",
            "content": "Hello, can you tell me more about the solar system?"
        },
        {
            "role": "assistant",
            "content": "Of course! The solar system is a collection of celestial bodies orbiting the sun. It consists of eight planets, numerous satellites, asteroids, comets and other celestial objects..."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Great."
                },
                {
                    "type": "text",
                    "text": "Tell me more about Mars.",
                    "cache_control": {"type": "ephemeral"}  # cache the conversation up to here
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
By adding cache_control to the last message of each conversation round, the system will automatically find and use the longest matching prefix from previously cached content. Even if content was previously marked with cache_control, as long as it's used within 5 minutes, it will automatically hit the cache and refresh the validity period.
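The round-by-round pattern can be sketched as follows. `build_turn` and `next_request` are hypothetical helpers, not part of any SDK; the request shape follows the examples above, and the previous round's marker is cleared so the conversation stays within Anthropic's 4-breakpoint limit:

```python
# Sketch: move the cache breakpoint to the newest user turn each round.
# build_turn and next_request are illustrative helpers, not part of any SDK.
def build_turn(text: str, cache: bool = False) -> dict:
    block = {"type": "text", "text": text}
    if cache:
        block["cache_control"] = {"type": "ephemeral"}
    return {"role": "user", "content": [block]}

history: list[dict] = [
    {"role": "system", "content": [
        {"type": "text", "text": "...long system prompt",
         "cache_control": {"type": "ephemeral"}}  # system breakpoint stays put
    ]},
]

def next_request(question: str) -> list[dict]:
    # Clear the marker from earlier user turns (4-breakpoint limit)...
    for msg in history:
        if msg["role"] == "user":
            for block in msg["content"]:
                block.pop("cache_control", None)
    # ...and mark the newest turn, so everything before it is cached.
    history.append(build_turn(question, cache=True))
    return history

msgs = next_request("Tell me about the solar system.")
msgs = next_request("Tell me more about Mars.")
```

On the second request, the prefix up to the first user turn matches the cache written by the first request, so only the new turn is processed at full price.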
Advanced Usage: Multi-Breakpoint Combination
When you have multiple content segments with different update frequencies, you can use multiple cache breakpoints:
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.siraya.pro/v1",
    api_key="<API_KEY>",
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5@20250929",
    tools=[
        # Tool definitions (rarely change)
        {
            "type": "function",
            "function": {
                "name": "search_documents",
                "description": "Search knowledge base",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string", "description": "Search query"}
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "get_document",
                "description": "Retrieve document by ID",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "doc_id": {"type": "string", "description": "Document ID"}
                    },
                    "required": ["doc_id"]
                }
            },
            "cache_control": {"type": "ephemeral"}  # Breakpoint 1: tool definitions
        }
    ],
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a research assistant with access to a document knowledge base.\n\n# Instructions\n- Always search for relevant documents first\n- Provide citations...",
                    "cache_control": {"type": "ephemeral"}  # Breakpoint 2: system instructions
                },
                {
                    "type": "text",
                    "text": "# Knowledge Base Context\n\nHere are the relevant documents for this conversation:\n\n## Document 1: Solar System Overview\nThe solar system consists of the sun and all celestial bodies orbiting it...\n\n## Document 2: Planetary Characteristics\nEach planet has unique characteristics...",
                    "cache_control": {"type": "ephemeral"}  # Breakpoint 3: RAG documents
                }
            ]
        },
        {
            "role": "user",
            "content": "Can you search for information about Mars rovers?"
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "tool_use",
                    "id": "tool_1",
                    "name": "search_documents",
                    "input": {"query": "Mars rovers"}
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": "tool_1",
                    "content": "Found 3 relevant documents..."
                }
            ]
        },
        {
            "role": "assistant",
            "content": "I found 3 relevant documents. Let me get more details from the Mars exploration document."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Okay, please tell me specific information about the Perseverance rover.",
                    "cache_control": {"type": "ephemeral"}  # Breakpoint 4: conversation history
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)
Using multiple cache breakpoints allows content with different update frequencies to be cached independently:
- Breakpoint 1: Tool definitions (almost never change)
- Breakpoint 2: System instructions (rarely change)
- Breakpoint 3: RAG documents (may update daily)
- Breakpoint 4: Conversation history (changes every round)
When only the conversation history is updated, the cache for the first three breakpoints remains valid, maximizing cost savings.
What Invalidates Cache
The following operations invalidate part or all of the cache (✓ = that cache level remains valid, ✘ = it is invalidated):
| Changed Content | Tool Cache | System Cache | Message Cache | Impact Description |
|---|---|---|---|---|
| Tool Definitions | ✘ | ✘ | ✘ | Modifying tool definitions invalidates entire cache |
| System Prompt | ✓ | ✘ | ✘ | Modifying system prompt invalidates system and message cache |
| tool_choice Parameter | ✓ | ✓ | ✘ | Only affects message cache |
| Add/Remove Images | ✓ | ✓ | ✘ | Only affects message cache |
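The prefix hierarchy behind the table can be summarized as: a change at one level invalidates that level and every level after it, in the order tools → system → messages. A minimal sketch:

```python
# The cached prefix is ordered tools -> system -> messages; editing one
# level invalidates it and everything after it in the prefix.
LEVELS = ["tools", "system", "messages"]

def surviving_caches(changed: str) -> list[str]:
    """Return the cache levels that remain valid after `changed` is modified."""
    return LEVELS[: LEVELS.index(changed)]

print(surviving_caches("tools"))     # nothing survives
print(surviving_caches("system"))    # the tool cache survives
print(surviving_caches("messages"))  # the tool and system caches survive
```

Parameter changes such as tool_choice behave like an edit at the messages level: the tool and system caches survive, but the message cache does not.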
OpenAI
Caching pricing:
- Cache writes: no cost
- Cache reads: charged at 0.1x ~ 0.5x the price of the original input pricing
Click here to view OpenAI's cache pricing per model.
Prompt caching with OpenAI is automated and does not require any additional configuration. There is a minimum prompt size of 1024 tokens.
Click here to read more about OpenAI prompt caching and its limitations.
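Because OpenAI-style caching is automatic, the way to confirm it is working is to inspect the usage details returned with each response. The helper below is a hypothetical sketch; the `prompt_tokens_details.cached_tokens` field name follows OpenAI's chat completions usage schema:

```python
# Sketch: compute what fraction of the prompt was served from cache, given
# a usage payload shaped like OpenAI's chat completions usage object
# (e.g. response.usage.model_dump() from the official SDK).
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens read from cache (0.0 means a full miss)."""
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0) / max(usage["prompt_tokens"], 1)

# Example payload: 1024 of 2048 prompt tokens came from the cache.
usage = {"prompt_tokens": 2048, "prompt_tokens_details": {"cached_tokens": 1024}}
print(f"{cached_fraction(usage):.0%} of the prompt was served from cache")
```

If `cached_tokens` stays at zero across repeated requests, check that the shared prefix exceeds the 1024-token minimum and is byte-for-byte identical.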
Grok
Caching pricing:
- Cache writes: no cost
- Cache reads: charged at 0.25x the price of the original input pricing
Click here to view Grok's cache pricing per model.
Prompt caching with Grok is automated and does not require any additional configuration.
Google Gemini
Implicit Caching
Gemini 2.5 Pro and 2.5 Flash models now support implicit caching, providing automatic caching functionality similar to OpenAI’s automatic caching. Implicit caching works seamlessly — no manual setup or additional cache_control breakpoints required.
Pricing:
- No cache write or storage costs.
- Cached tokens are charged at 0.1x the price of the original input token cost.
Note that the TTL is on average 3-5 minutes, but will vary. There is a minimum of 1028 tokens for Gemini 2.5 Flash, and 2048 tokens for Gemini 2.5 Pro for requests to be eligible for caching.
Official announcement from Google
Info
To maximize implicit cache hits, keep the initial portion of your message arrays consistent between requests. Push variations (such as user questions or dynamic context elements) toward the end of your prompt/requests.
Explicit Caching
Gemini caching in Siraya AI requires you to insert cache_control breakpoints explicitly within message content, similar to Anthropic and Qwen. We recommend using caching primarily for large content pieces (such as CSV files, lengthy character cards, retrieval augmented generation (RAG) data, or extensive textual sources).
Info
There is no limit on the number of cache_control breakpoints you can include in your request. Siraya AI will use only the last breakpoint for Gemini caching. Including multiple breakpoints is safe and can help maintain compatibility with Anthropic, but only the final one is used for Gemini.
Cache Validity Period
- Default TTL: 5 minutes
- Extended TTL: 1 hour (requires additional fee)
How Caching Works
When you send a request with cache_control markers:
- The system checks if a reusable cache prefix exists
- If a matching cache is found, cached content is used (reducing cost)
- If no match is found, the complete prompt is processed and a new cache entry is created
Cached content includes the complete prefix in the request: tools → system → messages (in this order), up to where cache_control is marked.
Examples
System Message Caching Example
import requests
import json

response = requests.post(
    url="https://llm.siraya.pro/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR-API-KEY",
        "Content-Type": "application/json"
    },
    data=json.dumps({
        "model": "google/gemini-2.5-flash",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a historian studying the fall of the Roman Empire. Below is an extensive reference book:"
                    },
                    {
                        "type": "text",
                        "text": "HUGE TEXT BODY HERE",
                        "cache_control": {
                            "type": "ephemeral",
                            "ttl": 300
                        }
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What triggered the collapse?"
                    }
                ]
            }
        ],
        "provider": {
            "order": ["google-vertex"]
        },
        "max_tokens": 1024
    })
)
print(response.json())
User Message Caching Example
import requests
import json

response = requests.post(
    url="https://llm.siraya.pro/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR-API-KEY",
        "Content-Type": "application/json"
    },
    data=json.dumps({
        "model": "google/gemini-2.5-flash",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Based on the book text below:"
                    },
                    {
                        "type": "text",
                        "text": "HUGE TEXT BODY HERE",
                        "cache_control": {
                            "type": "ephemeral",
                            "ttl": 300
                        }
                    },
                    {
                        "type": "text",
                        "text": "List all main characters mentioned in the text above."
                    }
                ]
            }
        ],
        "provider": {
            "order": ["google-vertex"]
        },
        "max_tokens": 1024
    })
)
print(response.json())
Best Practices
Optimization Recommendations
- Maintain Prefix Consistency: Place static content at the beginning of prompts, variable content at the end
- Avoid Minor Changes: Ensure cached content remains completely consistent across multiple requests
- Control Cache Time Window: Initiate subsequent requests within 5 minutes to hit cache
- Extend Cache Time (1-hour TTL): If your request intervals may exceed 5 minutes, consider using the 1-hour cache described above