1. Prompt Engineering Optimization: Pruning System Instructions
The most common cause of token bloat is "lazy prompt engineering." Developers often copy massive, multi-paragraph system prompts containing conversational rules, examples, and markdown schemas, and resend them with every single message in a session. Because LLM APIs are stateless, the provider must process the entire system prompt again on every turn.
To optimize prompt size:
- Consolidate Rules: Remove conversational boilerplate (e.g., "You are a helpful assistant. Try to be polite..."). Models are already trained for safe, polite behavior; prioritize deterministic, functional instructions instead.
- Limit Few-Shot Examples: Providing 5 examples in a prompt might improve accuracy slightly, but it can add hundreds or thousands of input tokens to every request. Test whether 1 or 2 high-quality examples give the same accuracy, or offload examples to a fine-tuned model.
- Use Short Variables: Replace verbose tags with concise ones, for example <ticket> instead of <customer_support_ticket_description>.
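To get a rough sense of what pruning buys, the arithmetic can be sketched as follows. The function name, token counts, traffic figures, and the $3.00 / MTok price are all illustrative assumptions, not numbers from any provider, and the sketch assumes no prompt caching.

```python
def monthly_system_prompt_cost(
    system_prompt_tokens: int,
    turns_per_session: int,
    sessions_per_month: int,
    price_per_mtok: float,
) -> float:
    """Dollar cost of resending the system prompt on every turn (no caching)."""
    total_tokens = system_prompt_tokens * turns_per_session * sessions_per_month
    return total_tokens / 1_000_000 * price_per_mtok


# Hypothetical workload: 10 turns per session, 50,000 sessions per month,
# at an assumed input price of $3.00 per million tokens.
verbose = monthly_system_prompt_cost(2_000, 10, 50_000, 3.00)
pruned = monthly_system_prompt_cost(600, 10, 50_000, 3.00)

print(f"verbose prompt: ${verbose:,.2f} / month")
print(f"pruned prompt:  ${pruned:,.2f} / month")
print(f"savings:        ${verbose - pruned:,.2f} / month")
```

Because the system prompt is multiplied by every turn of every session, even a modest reduction in its size scales linearly with traffic.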
2. Structuring Payloads: XML vs. JSON
Serialization formats can have a noticeable impact on token counts. When passing structured data to an LLM, developers frequently default to JSON because it maps directly to programming objects. However, JSON is verbose:
A JSON block like {"username": "johndoe", "email": "john@example.com", "role": "admin"} requires colons, commas, double quotes, and braces, much of which the tokenizer encodes as additional tokens.
XML with attributes can be a more compact alternative: <user username="johndoe" email="john@example.com" role="admin" />.
Furthermore, Anthropic's prompting guidance for Claude recommends XML tags for delimiting prompt sections, and short tags (e.g., <doc> and </doc>) typically encode in only a few tokens. The exact savings depend on the tokenizer and on the shape of the data, so measure both formats on your own payloads before switching.
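A quick way to compare the two formats on your own data is sketched below. The rough_token_count helper is a hypothetical stand-in that only counts word runs and punctuation marks; it is not a real tokenizer, so swap in your provider's token counter before drawing conclusions.

```python
import json
import re
import xml.etree.ElementTree as ET

record = {"username": "johndoe", "email": "john@example.com", "role": "admin"}

# Compact JSON: no spaces after separators.
as_json = json.dumps(record, separators=(",", ":"))

# Equivalent XML element with the fields stored as attributes.
as_xml = ET.tostring(ET.Element("user", record), encoding="unicode")


def rough_token_count(text: str) -> int:
    """Crude proxy: count word runs and individual punctuation marks.

    This is NOT a real tokenizer; it only gives a relative feel for
    how much punctuation overhead each format carries.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


for label, payload in (("json", as_json), ("xml", as_xml)):
    print(f"{label}: {len(payload)} chars, ~{rough_token_count(payload)} units  {payload}")
```

Running the same comparison on a representative sample of production payloads, with a real tokenizer, is the only reliable way to decide whether a format switch is worth it.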
3. Implementing Prompt Caching Strategies
For RAG pipelines or complex conversational agents, prompt caching (also called context caching) is often the highest-leverage cost-saving tool available.
To maximize cache hits, developers must understand how cached content is matched. Providers match cached context as a prefix, starting from the beginning of the prompt. If anything changes in the middle of a cached segment, the cache is invalidated for everything from that point onward.
Therefore, you must structure prompts with static content first:
// CORRECT STRUCTURE (CACHE FRIENDLY):
1. [STATIC] System Prompt & Instructions
2. [STATIC] RAG Documents / Reference Context
3. [DYNAMIC] User Conversation History
4. [DYNAMIC] New User Query
If you place conversation history (which changes on every turn) before the reference documents, the documents miss the cache on every message, which makes prompt caching largely useless.
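The effect of ordering can be illustrated with a small sketch that measures how much of the prompt two consecutive turns share as a prefix. The build_prompt helper, its static_first flag, and the sample strings are hypothetical; real providers match on tokens and apply their own minimum lengths, so treat this as an approximation of the idea rather than any provider's cache logic.

```python
from os.path import commonprefix


def build_prompt(system, documents, history, query, static_first=True):
    """Assemble a prompt; static_first=True puts documents before the chat."""
    docs_block = "\n\n".join(documents)
    chat_block = "\n\n".join([*history, f"User: {query}"])
    if static_first:
        blocks = [system, docs_block, chat_block]
    else:
        blocks = [system, chat_block, docs_block]
    return "\n\n".join(blocks)


system = "Answer using only the reference documents."
docs = ["[doc 1] Refund policy text", "[doc 2] Shipping policy text"]
first_query = "How long do refunds take?"
history = [f"User: {first_query}", "Assistant: About five days."]

shared = {}
for static_first in (True, False):
    turn_1 = build_prompt(system, docs, [], first_query, static_first)
    turn_2 = build_prompt(system, docs, history, "And shipping?", static_first)
    # The shared prefix is what a prefix-based cache could reuse on turn 2.
    shared[static_first] = commonprefix([turn_1, turn_2])
    print(f"static_first={static_first}: {len(shared[static_first])} reusable chars")
```

With static content first, the whole first-turn prompt is a prefix of the second turn. With history first, the shared prefix ends before the documents begin, so they would be reprocessed on every message.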
4. Model Routing: The Router-Agent Architecture
Not every query requires a $15 / MTok reasoning engine. A robust cost reduction architecture implements a **routing layer**:
When a user query arrives, a cheap classifier model (such as GPT-4o mini or Llama 3.1 8B, costing on the order of $0.15 / MTok of input) evaluates the complexity:
- If the query is a simple greeting or factual question: the classifier responds immediately or routes to the cheap model.
- If the query requires multi-step math or programming logic: the classifier routes the prompt to the premium model (e.g., Claude 3.5 Sonnet).
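The routing decision can be sketched as below. Note the loud assumption: a keyword pattern stands in for the cheap classifier model purely so the example runs without an API call, and the model identifiers are hypothetical placeholders. A real router would replace classify with a call to a small LLM.

```python
import re

CHEAP_MODEL = "cheap-model"      # hypothetical identifier
PREMIUM_MODEL = "premium-model"  # hypothetical identifier

# Stand-in for the cheap classifier model: a real router would call a small LLM here.
COMPLEX_HINTS = re.compile(
    r"\b(prove|derive|calculate|debug|refactor|implement|algorithm|code|function)\b",
    re.IGNORECASE,
)


def classify(query: str) -> str:
    """Return 'complex' for math or programming requests, otherwise 'simple'."""
    return "complex" if COMPLEX_HINTS.search(query) else "simple"


def route(query: str) -> str:
    """Pick the model that should answer the query."""
    return PREMIUM_MODEL if classify(query) == "complex" else CHEAP_MODEL


for q in ("Hi there!", "What is the capital of France?", "Debug this recursive function for me."):
    print(f"{route(q):>14}  <-  {q}")
```

Keeping classify and route separate makes it easy to swap the heuristic for a model call later without touching the callers.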
If roughly 70% of requests are low-complexity and can be offloaded to lightweight models, the blended cost of operation drops dramatically with little loss in perceived capability.
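The blended figure can be worked out with a short calculation. This sketch reuses the $0.15 and $15 per MTok prices and the 70% share mentioned above, and it assumes that simple and complex requests use the same number of tokens and that the classifier's own cost is negligible; both assumptions are simplifications.

```python
def blended_price(cheap_share: float, cheap_price: float, premium_price: float) -> float:
    """Average price per million tokens when cheap_share of traffic uses the cheap model."""
    return cheap_share * cheap_price + (1 - cheap_share) * premium_price


blended = blended_price(0.70, 0.15, 15.00)
savings = 1 - blended / 15.00

print(f"blended: ${blended:.3f} / MTok ({savings:.1%} below premium-only)")
```

Under those assumptions the blended price comes to about $4.61 per MTok, roughly 69% below sending everything to the premium model. If complex requests are much longer than simple ones, weight the shares by tokens rather than by request count.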
5. Summary Checklist for Developers
Keep these items in mind during AI app development:
- Verify token sizes of prompts using our Token Calculator.
- Enable prompt caching in your API requests, using whatever markers or parameters your provider requires.
- Filter RAG contexts using semantic rerankers (e.g., Cohere Rerank) to restrict retrieval to fewer than 5 high-relevance chunks.
- Limit output length with both system-prompt instructions and the API's maximum output token setting, since completion tokens are typically priced higher than input tokens.
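The reranker item in the checklist can be sketched as a simple post-filter. The select_chunks function, its thresholds, and the sample scores are hypothetical; the sketch assumes relevance scores have already been produced by whichever reranker you use and does not call any reranking API itself.

```python
def select_chunks(
    scored_chunks: list[tuple[str, float]],
    max_chunks: int = 4,
    min_score: float = 0.5,
) -> list[str]:
    """Keep at most max_chunks chunks whose reranker score is at least min_score."""
    relevant = [(text, score) for text, score in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in relevant[:max_chunks]]


# Hypothetical (chunk, score) pairs as they might come back from a reranker.
candidates = [
    ("chunk a", 0.91),
    ("chunk b", 0.12),
    ("chunk c", 0.77),
    ("chunk d", 0.64),
    ("chunk e", 0.58),
    ("chunk f", 0.55),
    ("chunk g", 0.30),
]

context = select_chunks(candidates)
print(context)
```

Combining a score floor with a hard cap keeps the context small even when many chunks score well, and keeps it empty when nothing is relevant.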
Frequently Asked Questions
What is the most effective way to cut API costs immediately?
Implement prompt caching for static instructions and large database schemas. For dynamic systems, add a routing layer that sends simple queries to small models (like GPT-4o mini) and escalates only complex queries to larger frontier models.
How does payload structure (XML vs JSON) affect token sizes?
JSON requires braces, quotes, and punctuation, much of which tokenizers encode as separate tokens. Short XML tags (e.g. <input>) often encode in only a few tokens, and models like Claude handle XML-delimited prompts well. The difference depends on the tokenizer and the data, so measure both formats on a representative payload.
Should I compress user inputs before sending them to the LLM?
Yes. In RAG pipelines, filtering retrieved passages with a semantic reranker removes redundant context. Stripping excess spaces, carriage returns, and duplicate text from input payloads can save on the order of 10% to 20% of input tokens, depending on how noisy the source text is.
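A minimal sketch of that kind of cleanup is shown below. The compact_text function is a hypothetical helper, and it assumes the input is ordinary prose: collapsing whitespace would damage code, tables, or other layout-sensitive text, so exclude those before applying it.

```python
import re


def compact_text(text: str) -> str:
    """Normalize line endings, collapse whitespace runs, and drop repeated paragraphs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    seen: set[str] = set()
    kept: list[str] = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        cleaned = re.sub(r"\s+", " ", paragraph).strip()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return "\n\n".join(kept)


raw = "Refund   policy:\r\n  30 days.\r\n\r\n\r\nRefund policy: 30 days.\n\n   Shipping takes\t5 days.  "
compact = compact_text(raw)

print(f"{len(raw)} chars -> {len(compact)} chars")
print(compact)
```

Duplicates are detected after whitespace normalization, so two paragraphs that differ only in spacing or line breaks count as the same text.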
Can I use LLMs to compress prompts for other LLMs?
Yes. Techniques like LLMLingua use a smaller, faster model to score token probabilities and strip low-information tokens from a prompt. Removing around 30% of tokens this way often has little effect on the target model's output, but validate compressed prompts against your own evaluation set.