73% of agent failures trace back to poor context engineering--not model limitations, not prompt wording, not insufficient training data. Your agent isn't failing because of the model. It's failing because of context.
Andrej Karpathy described LLMs as "a new kind of operating system," with context functioning as RAM. Poorly managed context causes hallucinations and forgotten instructions--just like a computer thrashing with insufficient memory.
This guide covers four battle-tested patterns, real cost analysis, working code, and metrics frameworks.
Understanding Context as a Resource
Context isn't a single thing--it's a complex allocation problem. Every agent juggles four distinct types, each competing for the same finite token budget:

Conversation context
- History of user interactions
- Questions, clarifications, responses
- Grows linearly with conversation length
- The main reason long conversations degrade

System prompt
- Instructions, persona, guidelines
- Relatively stable but often underestimated
- Typically 2,000-4,000 tokens
- Consumed before the conversation even starts

Tool context
- Function definitions and schemas
- Tool descriptions and results
- 20 tools can mean 3,000+ tokens
- Each call adds more

Retrieved knowledge
- RAG systems, memory stores
- External knowledge bases
- Most variable (500-5,000 tokens)
- A single retrieval can dominate the budget
The Context Budget Mental Model
Think of context as a budget, not a container. Every token has a cost--both literal (API pricing) and functional (attention dilution). The question isn't "does this fit?" but "is this worth its cost?"
| Model | Input (per 1M tokens) | Output (per 1M tokens) | 100K Context Cost |
|---|---|---|---|
| Claude Opus 4.5 | $5.00 | $25.00 | $0.50 input |
| Claude Sonnet 4 | $3.00 | $15.00 | $0.30 input |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 input |
| GPT-4o | $5.00 | $20.00 | $0.50 input |
| GPT-4o mini | $0.60 | $2.40 | $0.06 input |
| DeepSeek-V3 | $0.28 | $1.10 | $0.03 input |
Pricing as of January 2026. Cache hits typically reduce input costs by 90%.
A single conversation that accumulates 100,000 tokens of context costs between $0.03 and $0.50 per turn on input alone. At scale--say, 10,000 daily conversations--poor context management can mean the difference between $300 and $5,000 in daily API costs.
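A quick sanity check of that math, as a sketch (the model keys and token counts below are illustrative, taken from the pricing table above):

```python
# Rough per-turn input cost for an accumulated context, using the table above.
# Prices are USD per 1M input tokens; model keys are illustrative labels.
PRICE_PER_M_INPUT = {
    "claude-sonnet-4": 3.00,
    "gpt-4o-mini": 0.60,
    "deepseek-v3": 0.28,
}

def input_cost_per_turn(context_tokens: int, model: str) -> float:
    return context_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# 100K-token context, 10,000 conversations per day, one turn each
daily = 10_000 * input_cost_per_turn(100_000, "claude-sonnet-4")
print(f"${daily:,.0f} per day")  # ~$3,000/day on input alone, before caching
```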
Context Rot: The Hidden Performance Killer
Contrary to what tutorials suggest, larger context windows don't mean better performance. Chroma's research on "Context Rot" shows LLM accuracy degrades significantly as input length increases.
"Lost in the Middle" Effect
Information in the middle of long contexts is retrieved less reliably than at the beginning or end.
Sudden Accuracy Cliffs
Performance doesn't decrease smoothly--models often hit sharp drop-offs at certain context lengths.
Bottom line: Stuffing more context into prompts isn't a strategy--it's a liability.
Context Engineering Patterns
Four essential patterns have emerged from production deployments:
Pattern 1: Context Compression
Compression reduces context size while preserving essential information. The goal is maintaining signal while reducing tokens.
When to use: Long-running conversations, extensive tool outputs, accumulated history.
When to avoid: When exact wording matters (legal documents, code review, precise quotes).
The most practical implementation combines a buffer for recent messages with summarization for older history:
```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain_anthropic import ChatAnthropic

def create_compressed_memory(max_recent_tokens: int = 4000):
    # Keep recent messages verbatim, summarize older ones
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        max_tokens=500
    )
    memory = ConversationSummaryBufferMemory(
        llm=llm,
        max_token_limit=max_recent_tokens,
        return_messages=True,
        human_prefix="User",
        ai_prefix="Assistant"
    )
    return memory

# Usage
memory = create_compressed_memory(max_recent_tokens=4000)
```
Production tip: Use a cheaper, faster model for summarization. Summarizing with Haiku at $1/million tokens instead of Opus at $5/million tokens reduces compression overhead by 80% with minimal quality loss.
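Because ConversationSummaryBufferMemory uses its `llm` only to write summaries, applying that tip is a small change. A minimal sketch, assuming a Haiku-class model is available (the exact model ID below is an assumption):

```python
# The memory's llm only writes summaries, so it can be a cheaper model than
# the one answering the user. The model ID is an assumption; use whatever
# small model your provider offers.
summarizer = ChatAnthropic(model="claude-haiku-4-5", max_tokens=500)

memory = ConversationSummaryBufferMemory(
    llm=summarizer,        # cheap model compresses older turns
    max_token_limit=4000,  # recent turns kept verbatim
    return_messages=True,
)
```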
Pattern 2: Context Prioritization
Not all context is equally valuable. Context value is dynamic--what's relevant changes based on the current query.
When to use: Multi-turn conversations, multiple retrieved documents, competing context sources.
A robust priority system scores items on multiple dimensions:
```python
from dataclasses import dataclass, field
import time

@dataclass
class ContextItem:
    content: str
    source: str  # "conversation", "tool", "retrieval", "system"
    token_count: int
    created_at: float = field(default_factory=time.time)
    relevance_score: float = 0.5
    importance: float = 0.5

    @property
    def recency_score(self) -> float:
        # Linear decay over an hour, floored at 0.1
        age_minutes = (time.time() - self.created_at) / 60
        return max(0.1, 1.0 - (age_minutes / 60))

    @property
    def priority(self) -> float:
        # Weighted blend: relevance dominates, then recency, then importance
        return (self.relevance_score * 0.5 +
                self.recency_score * 0.3 +
                self.importance * 0.2)
```
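One way to put these scores to work is a greedy assembly pass: sort by priority, keep what fits the budget, and log what gets dropped. This is a sketch, not a drop-in library call; `assemble_context` is a name introduced here for illustration.

```python
def assemble_context(items: list[ContextItem],
                     budget_tokens: int) -> tuple[list[ContextItem], list[ContextItem]]:
    """Greedily keep the highest-priority items that fit the token budget."""
    selected, dropped, used = [], [], 0
    for item in sorted(items, key=lambda i: i.priority, reverse=True):
        if used + item.token_count <= budget_tokens:
            selected.append(item)
            used += item.token_count
        else:
            dropped.append(item)  # worth logging: exclusion patterns are diagnostic
    return selected, dropped
```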
Production tip: Track which context items are excluded and why. This data reveals patterns--if conversation history is consistently dropped, your system prompt might be too long. If retrieved documents are excluded, your retrieval is returning too much.
Pattern 3: Context Externalization
Move information outside the context window while keeping it accessible. This is the foundation of RAG systems, but the pattern extends well beyond document retrieval.
When to use: Information occasionally needed but not required every turn, historical data, reference material.
Keep in context:
- Current task instructions
- Active conversation (last 5-10 turns)
- Tool definitions for likely actions
- Critical user preferences

Externalize:
- Historical task results
- Older conversation history
- Tool definitions for rare actions
- General user profile data
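A minimal sketch of externalization without a full vector database: archive older material in an external store keyed by tags, and pull it back only when the current query calls for it. The class, method names, and keyword matching below are illustrative assumptions; production systems typically use embeddings or a vector store instead.

```python
# Illustrative external store: older context lives outside the window
# and is recalled on demand. Keyword matching stands in for real retrieval.
class ExternalMemory:
    def __init__(self):
        self._records: list[dict] = []

    def archive(self, content: str, tags: list[str]) -> None:
        self._records.append({"content": content, "tags": set(tags)})

    def recall(self, query_terms: list[str], limit: int = 3) -> list[str]:
        terms = set(query_terms)
        scored = [(len(r["tags"] & terms), r["content"]) for r in self._records]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [content for score, content in scored[:limit] if score > 0]

store = ExternalMemory()
store.archive("Q3 pricing analysis results ...", tags=["pricing", "q3"])
relevant = store.recall(["pricing"])  # injected into context only when needed
```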
Pattern 4: Context Segmentation
Break complex tasks into context-bounded subtasks, each handled by a specialized agent with a clean context window.
When to use: Multi-step workflows, tasks requiring different expertise, long-running agents.
Key principle: Each segment should complete its task with minimal context from previous segments. If extensive prior context is needed, you've drawn boundaries wrong.
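To make that concrete, here is a sketch of a segmented run where each subtask gets a fresh context and only a compact handoff summary travels forward. `run_agent` is a stand-in for whatever agent or LLM call your stack uses, not a specific API.

```python
from typing import Callable

def run_segmented_task(subtasks: list[str], run_agent: Callable[[str], str]) -> list[str]:
    """Run each subtask with a clean context, passing only a short handoff forward."""
    handoff = ""   # compact summary carried between segments
    results = []
    for subtask in subtasks:
        prompt = f"{subtask}\n\nRelevant prior results:\n{handoff or 'None'}"
        output = run_agent(prompt)  # fresh context every segment
        results.append(output)
        # Summarize the output so the next segment inherits tokens, not transcripts
        handoff = run_agent(f"Summarize the following in under 100 tokens:\n{output}")
    return results
```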
Production Challenges and Solutions
Theory is clean; production is messy. Here are common challenges:
- Problem: Early context gets summarized/dropped in long conversations
- Fix: Periodic context anchoring - refresh critical info every N turns
- Problem: A single tool call returns 10K+ tokens and crowds out everything else
- Fix: Tool result policies with a max token budget per tool (see the sketch after this list)
- Problem: Critical info buried in middle of context gets missed
- Fix: Position-aware assembly - important info at start and end
- Problem: 50-turn conversation costs 10x more than 5-turn
- Fix: Tiered context strategies based on query complexity
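For the oversized tool result problem, a per-tool budget policy can be as simple as the sketch below. The budget numbers and the whitespace-based token count are assumptions; swap in your tokenizer and tune limits per tool.

```python
# Illustrative per-tool result budget: truncate oversized outputs before they
# enter context. Whitespace splitting approximates token counting.
TOOL_BUDGETS = {"web_search": 1500, "sql_query": 2000, "default": 1000}

def enforce_tool_budget(tool_name: str, result: str) -> str:
    budget = TOOL_BUDGETS.get(tool_name, TOOL_BUDGETS["default"])
    words = result.split()
    if len(words) <= budget:
        return result
    kept = " ".join(words[:budget])
    return (f"{kept}\n[... truncated {len(words) - budget} tokens; "
            f"re-run the tool with a narrower query for the rest]")
```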
Measuring Context Effectiveness
What gets measured gets managed. Here are the core metrics (a computation sketch follows the list):
- Context utilization: % of the context budget used. Too low = over-conservative. Too high = no headroom.
- Context relevance: % of context actually referenced in the response. Below 40% = wasted tokens.
- Tokens per successful task: the true efficiency metric. Total tokens for successful completions only.
- Context-related failure rate: % of failures caused by truncation, missing information, or outdated data.
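A sketch of how these four metrics might be computed from per-turn logs; the TurnLog fields are assumptions about what your logging captures.

```python
from dataclasses import dataclass

@dataclass
class TurnLog:
    context_tokens: int            # tokens sent as context this turn
    budget_tokens: int             # budget available this turn
    referenced_tokens: int         # context tokens actually used by the response
    succeeded: bool
    context_related_failure: bool  # truncation, missing info, stale data

def context_metrics(logs: list[TurnLog]) -> dict:
    successes = [t for t in logs if t.succeeded]
    failures = [t for t in logs if not t.succeeded]
    return {
        "utilization": sum(t.context_tokens for t in logs) / max(1, sum(t.budget_tokens for t in logs)),
        "relevance": sum(t.referenced_tokens for t in logs) / max(1, sum(t.context_tokens for t in logs)),
        "tokens_per_success": sum(t.context_tokens for t in successes) / max(1, len(successes)),
        "context_failure_rate": sum(t.context_related_failure for t in failures) / max(1, len(failures)),
    }
```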
Struggling with context management in your AI agents? Fenlo AI can help you implement these patterns in production. Book a free discovery call to discuss your specific challenges.
Implementation Guide
A practical 5-step path to implementing context engineering:
Audit Current Context Usage
Log complete context payloads for 100+ conversations. Calculate average context size per turn. Identify largest consumers (usually tool results or retrieved documents). Track failure correlations with context state.
Implement Context Budgeting
Set initial budget based on model limits (leave 20% headroom). Reserve tokens for system prompts. Allocate remaining budget across context types. Log utilization metrics from day one.
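As a starting point, the split might look like the sketch below; the 20% headroom matches the guidance above, while the per-type ratios are assumptions to tune against your own traffic.

```python
# Illustrative budget split: 20% headroom, a fixed system-prompt reserve,
# and the remainder divided across context types. Ratios are starting guesses.
def allocate_budget(model_context_limit: int, system_prompt_tokens: int) -> dict[str, int]:
    usable = int(model_context_limit * 0.8) - system_prompt_tokens
    return {
        "system": system_prompt_tokens,
        "conversation": int(usable * 0.45),
        "tools": int(usable * 0.25),
        "retrieval": int(usable * 0.30),
    }

budget = allocate_budget(200_000, system_prompt_tokens=3_000)
```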
Add Compression Layer
Implement ConversationSummaryBufferMemory or equivalent. Set buffer size based on average useful history length. Monitor for compression that loses critical information. Tune aggressiveness based on observed issues.
Build Context Observability
Implement the ContextObserver pattern. Create dashboards for utilization, exclusion rates, and costs. Set alerts for anomalies. Review weekly to identify optimization opportunities.
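The ContextObserver pattern isn't spelled out in this section, so here is a minimal version of what it might look like, reusing the ContextItem dataclass from Pattern 2. The class shape and log format are assumptions, not a fixed API.

```python
import json
import time

class ContextObserver:
    """Record what went into context each turn, what was excluded, and why."""

    def __init__(self, log_path: str = "context_log.jsonl"):
        self.log_path = log_path

    def record_turn(self, included: list[ContextItem], excluded: list[ContextItem],
                    budget_tokens: int) -> None:
        entry = {
            "ts": time.time(),
            "budget": budget_tokens,
            "used": sum(i.token_count for i in included),
            "excluded": [{"source": i.source, "tokens": i.token_count} for i in excluded],
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
```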
Iterate Based on Metrics
Run A/B tests on strategy changes. Analyze failure cases for context-related root causes. Adjust budgets and priorities based on production data. Document what works for your specific use case.
Common Implementation Mistakes
Too Conservative Limits
Using only 40% of available context = leaving capability on the table.
Equal Treatment
Tool results and documents need different budgets than conversation history.
Over-Compression
If users say "I already told you that," compression is losing critical info.
Ignoring Position
Burying critical instructions in the middle of long contexts reduces recall.
No Measurement
Without metrics, context engineering becomes pure guesswork.
Conclusion
Context engineering is the primary determinant of whether your agent succeeds or fails in production. The four patterns--compression, prioritization, externalization, and segmentation--transform context from a source of mysterious failures into a well-understood, optimizable system.
1. Measure: Add logging to capture your current context sizes and utilization rates
2. Analyze: Review 10 failed conversations and check for context-related causes
3. Budget: Implement basic context budgeting with your model's limits
4. Compress: Add conversation summarization for histories over 10 turns
5. Track: Set up a dashboard for context metrics
Context engineering is an ongoing discipline with substantial payoffs: agents that maintain coherence, costs that scale predictably, and failures that trace to understandable causes.
Need Help with Context Engineering?
FenloAI specializes in building production AI agents with robust context engineering. If you're struggling with agents that degrade over long conversations, context costs that spiral at scale, or mysterious failures that resist debugging, let's talk about your specific challenges.
References
- Anthropic Engineering. "Effective Context Engineering for AI Agents." anthropic.com
- Chroma Research. "Context Rot - How Increasing Input Tokens Impacts LLM Performance." research.trychroma.com
- LangChain. "Context Engineering for Agents." blog.langchain.com
- LangGraph Documentation. "Context Engineering." docs.langchain.com
- MongoDB. "Powering Long-Term Memory for Agents with LangGraph." mongodb.com