What Is Context Engineering?
If you've been working with AI models in 2026, you've probably noticed something: the quality of your prompts matters less than the quality of the context you feed your models. This shift has a name — context engineering — and a new peer-reviewed paper built on 9,649 experiments shows why it's replacing prompt engineering as the critical skill for AI practitioners.
Context engineering is the systematic practice of structuring, formatting, and delivering information to large language models (LLMs) through their context windows. Unlike prompt engineering, which focuses on how you ask, context engineering focuses on what information surrounds your request — the schemas, files, data formats, and retrieval architecture that determine whether a model succeeds or fails at complex tasks.
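To make that concrete, here is a minimal Python sketch of the idea. The function name and the YAML schema layout are illustrative assumptions, not from any particular framework: the prompt is one short line at the end, while the engineered context (retrieved schema files) makes up most of what the model actually sees.

```python
from pathlib import Path

def build_request(question: str, schema_dir: str) -> str:
    """Assemble an engineered context: the prompt is one short line,
    while retrieved schema files fill most of the token budget."""
    # A real system would rank and filter files; this sketch reads them all.
    schemas = "\n\n".join(
        p.read_text() for p in sorted(Path(schema_dir).glob("*.yaml"))
    )
    return (
        "You are a SQL-generation agent.\n\n"
        f"## Database schemas\n{schemas}\n\n"  # the context: thousands of tokens
        f"## Task\n{question}\n"               # the prompt: a single line
    )
```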
The paper "Structured Context Engineering for File-Native Agentic Systems" by Damon McMillan, published February 2026, provides the first large-scale empirical study of how context structure affects LLM agent performance. The results challenge several common assumptions — and have major implications for anyone building with AI.
The Study: 9,649 Experiments Across 11 Models
McMillan's research is the most comprehensive study of context engineering to date. Using SQL generation as a proxy for programmatic agent operations, the study tested:
- 11 models spanning frontier and open-source tiers
- 4 data formats: YAML, Markdown, JSON, and TOON (Token-Oriented Object Notation)
- Schema scales ranging from 10 to 10,000 database tables
- Two architectures: single-context vs. file-based context retrieval
The results, covered by Simon Willison, reveal five key findings that every AI developer needs to understand.
Finding #1: Model Choice Dwarfs Everything Else
The single biggest factor in task accuracy wasn't the format, the architecture, or the prompt — it was the model itself. Frontier models (Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro) outperformed open-source models (DeepSeek V3.2, Kimi K2, Llama 4) by a massive 21 percentage points in accuracy.
That 21-point gap dwarfs any effect from format choice or retrieval architecture. As Willison noted, this reinforces what the Terminal Bench 2.0 leaderboard already shows: Anthropic, OpenAI, and Google still dominate agentic coding tasks.
| Tier | Models Tested | Relative Accuracy |
|---|---|---|
| Frontier | Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro | +21 percentage points vs. open-source |
| Open Source | DeepSeek V3.2, Kimi K2, Llama 4 | Baseline |
The takeaway? If you're building serious AI agents, model selection is your highest-leverage decision — not prompt tweaking.
Finding #2: File-Based Context Helps Frontier Models, Hurts Open Source
This is perhaps the most surprising and actionable finding. The study tested two approaches to delivering schema context to models (both sketched in Python after the list below):
- Single-context: Dumping all schema information into one prompt
- File-based retrieval: Spreading schemas across a filesystem that the agent navigates
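A rough sketch of both setups follows. Everything here (the tool names, the YAML schema files, the directory layout) is an assumption for illustration; the paper's harness is not published in this form.

```python
from pathlib import Path

def single_context(question: str, schema_dir: str) -> str:
    """Architecture 1: concatenate every schema file into one prompt."""
    all_schemas = "\n\n".join(
        p.read_text() for p in sorted(Path(schema_dir).rglob("*.yaml"))
    )
    return f"Schemas:\n{all_schemas}\n\nTask: {question}"

def file_based_tools(schema_dir: str) -> dict:
    """Architecture 2: expose the filesystem as tools and let the agent
    decide which directories to list and which schema files to read."""
    root = Path(schema_dir)
    return {
        "list_files": lambda subdir=".": sorted(
            str(p.relative_to(root)) for p in (root / subdir).iterdir()
        ),
        "read_file": lambda relpath: (root / relpath).read_text(),
    }
```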
For frontier models, file-based context retrieval improved accuracy by 2.7% (p=0.029), a statistically significant gain. These models could effectively navigate filesystem structures, grep for relevant files, and assemble the context they needed.
But for open-source models, file-based retrieval cut aggregate accuracy by 7.7% (p<0.001). The models struggled with the multi-step navigation required to find and read relevant files.
As Simon Willison observed: "This reinforces my feeling that the filesystem coding agent loops aren't handled as well by open weight models just yet." If you're using tools like Claude Code or similar AI coding agents, the filesystem-native workflow matters — but only if the model behind it can handle it.
Finding #3: Format Doesn't Matter (Much)
One of the most liberating findings: format choice had no statistically significant effect on aggregate accuracy (chi-squared=2.45, p=0.484). Whether you use YAML, Markdown, JSON, or the new TOON format, the models performed roughly the same overall.
However, individual models — particularly open-source ones — showed format-specific sensitivities. This means there's no universal "best format," but there may be a best format for your specific model.
| Format | Token Efficiency | Accuracy Impact | Best For |
|---|---|---|---|
| YAML | Good | No significant difference | Familiar, widely used in config files |
| Markdown | Moderate | No significant difference | Human-readable documentation |
| JSON | Verbose | No significant difference | Programmatic interop, round-tripping |
| TOON | Most compact | No significant difference (with caveats) | Token-constrained scenarios |
Finding #4: The "Grep Tax" — Compact Doesn't Mean Faster
One of the most counterintuitive results involved TOON (Token-Oriented Object Notation), a new format designed to represent structured data in as few tokens as possible. TOON combines YAML's indentation with CSV-style tabular layouts to minimize token usage.
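For illustration, here is the same two-row table in JSON and in TOON's tabular style. The TOON syntax below follows the format's public description; treat the exact details as approximate.

```
# JSON: more tokens, but every model has seen millions of examples of it
[{"id": 1, "name": "orders",    "rows": 120000},
 {"id": 2, "name": "customers", "rows": 45000}]

# TOON: a declared header row, then CSV-style data rows with YAML-style indentation
tables[2]{id,name,rows}:
  1,orders,120000
  2,customers,45000
```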
In theory, fewer tokens should mean lower cost and faster processing. In practice, the study found a substantial "grep tax": models unfamiliar with TOON spent significantly more tokens across multiple iterations trying to parse and understand the format. The token savings from the compact representation were more than offset by the extra reasoning tokens the models needed to figure it out.
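That offset is easy to see with toy numbers. The figures below are invented purely to show the shape of the effect; the paper reports the pattern, not these values.

```python
# Hypothetical token accounting: illustrative numbers only, not from the study.
json_context, json_reasoning = 12_000, 1_000  # familiar format, parsed in one pass
toon_context, toon_reasoning = 7_000, 7_500   # ~40% smaller, but repeated parse attempts

print("JSON total:", json_context + json_reasoning)  # 13000
print("TOON total:", toon_context + toon_reasoning)  # 14500: the "grep tax" dominates
```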
This is a critical lesson for context engineering: file size does not predict runtime efficiency. A format the model understands instantly (like JSON or YAML) can be cheaper to use than a hyper-optimized format the model has to puzzle over. Familiarity beats compression.
Finding #5: File-Native Agents Scale to 10,000 Tables
The study demonstrated that file-native agents can successfully navigate schemas with up to 10,000 database tables — far beyond what fits in any single context window. The key technique: domain-partitioned schemas, where related tables are grouped into logical directories that agents can navigate hierarchically.
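A hypothetical domain-partitioned layout might look like the tree below. The directory names, file names, and the `_domain.md` summary convention are all invented for illustration.

```
schemas/
├── sales/
│   ├── _domain.md        # one-paragraph summary the agent can read first
│   ├── orders.yaml
│   └── customers.yaml
├── inventory/
│   ├── _domain.md
│   ├── products.yaml
│   └── warehouses.yaml
└── finance/
    ├── _domain.md
    └── invoices.yaml
```

An agent grepping for "invoice" lands in `finance/` after one directory listing, instead of scanning thousands of table definitions in a flat dump.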
This finding validates the architecture behind modern AI coding agents that work within filesystem structures rather than trying to cram everything into a single prompt. For enterprise applications with massive codebases or database schemas, context engineering through file organization is a proven strategy.
Prompt Engineering vs. Context Engineering: What Changed?
Prompt engineering dominated the AI conversation from 2023 to 2025. The idea was simple: craft the perfect instruction and the model would deliver. Techniques like chain-of-thought, few-shot examples, and system prompts became standard practice.
But as models became more capable and AI agents became mainstream, the bottleneck shifted. Today's AI agents don't just answer questions — they navigate codebases, query databases, read documentation, and execute multi-step workflows. The prompt is a small fraction of what the model sees. The context — the schemas, files, retrieved documents, and system architecture — is what determines success or failure.
| Dimension | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | How you ask the question | What information surrounds the question |
| Scope | Single instruction | Entire information architecture |
| Scale | Hundreds of tokens | Thousands to millions of tokens |
| Primary lever | Wording, structure, examples | Data format, retrieval, file organization |
| Impact (per study) | Marginal at frontier tier | Architecture choice: +2.7% to -7.7% depending on model |
Practical Takeaways for AI Developers
Based on McMillan's research, here's what you should actually do differently:
1. Invest in Model Selection First
The 21-point accuracy gap between frontier and open-source models is the single largest effect in the study. Before optimizing your context format, make sure you're using the best model you can afford for your use case.
2. Match Architecture to Model Capability
If you're using Claude, GPT, or Gemini — lean into file-based context retrieval. Structure your data across files and let the agent navigate. If you're using open-source models, consider providing more context upfront in a single pass to avoid the navigation penalty.
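As a sketch, strategy selection can be a one-line branch on model tier. The model identifiers below are assumptions for illustration, not official API names; substitute your provider's real ones.

```python
# Assumed identifiers for illustration only.
FRONTIER_MODELS = {"claude-opus-4.5", "gpt-5.2", "gemini-2.5-pro"}

def context_strategy(model: str) -> str:
    """Per the study: file-based retrieval helps frontier models (+2.7%)
    but hurts open-weight models (-7.7%), which do better single-pass."""
    return "file_based_retrieval" if model in FRONTIER_MODELS else "single_context"
```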
3. Don't Obsess Over Format
The data shows no significant aggregate difference between YAML, Markdown, JSON, or TOON. Use whatever format your team is comfortable with and your tooling supports. The exception: if you're locked into a specific open-source model, test formats individually — some models have format-specific quirks.
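If you do need to test, a minimal harness renders the same schema in every candidate format and scores each on your own eval suite. `run_eval` below is a hypothetical callback you would supply; the schema and renderers are illustrative.

```python
import json
from typing import Callable

import yaml  # PyYAML, assumed installed

SCHEMA = {"tables": [{"name": "orders", "columns": ["id", "customer_id", "total"]}]}

RENDERERS: dict[str, Callable[[dict], str]] = {
    "json": lambda s: json.dumps(s, indent=2),
    "yaml": lambda s: yaml.safe_dump(s),
}

def pick_format(model: str, run_eval: Callable[[str, str], float]) -> str:
    """Render the schema in each format, score each rendering with
    your eval suite, and return the best-performing format name."""
    scores = {name: run_eval(model, render(SCHEMA)) for name, render in RENDERERS.items()}
    return max(scores, key=scores.get)
```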
4. Beware the Grep Tax
Don't assume a more compact format will save you money. If the model doesn't "know" the format natively, it will burn tokens trying to understand it. Stick with familiar formats (JSON, YAML, Markdown) unless you've tested and confirmed that an exotic format actually performs better for your specific model.
5. Organize for Scale
If you're working with large datasets or codebases, invest in domain-partitioned file structures. The study shows this approach scales to 10,000 tables while maintaining high navigation accuracy — but only with frontier models. For a deeper dive into how AI agents handle large codebases, check out our coverage of AI agent capabilities and limitations.
What This Means for the AI Context Window
Context windows have been growing rapidly — Claude Opus 4.5 supports 200K tokens (expandable to 1M), GPT-5.2 pushed to similar territory, and Gemini 2.5 Pro offers 2M tokens. But McMillan's research suggests that how you fill the context window matters more than how big it is.
A well-structured 50K-token context with domain-partitioned schemas and familiar formats will outperform a chaotic 500K-token dump. Context engineering is the discipline of making every token in that window count.
The Future of Context Engineering
McMillan's paper is a starting point, not the final word. As AI agents become more sophisticated, expect context engineering to evolve in several directions:
- Dynamic context assembly: Agents that automatically determine what context they need and retrieve it on-the-fly
- Context-aware fine-tuning: Models trained to be more efficient at parsing specific context structures
- Standardized context protocols: Emerging standards (like Anthropic's Model Context Protocol) for how agents consume structured information
- Context compression: Better approaches to fitting more information into fewer tokens without the "grep tax"
The key insight from this research is clear: architectural decisions should be tailored to model capability rather than assuming universal best practices. What works for Claude Opus 4.5 may actively hurt DeepSeek V3.2. Context engineering requires understanding your specific model, your specific data, and the interaction between the two.
FAQ
What is context engineering in AI?
Context engineering is the practice of structuring, formatting, and delivering information to large language models through their context windows. It focuses on the data architecture surrounding your request — file formats, retrieval methods, schema organization — rather than the prompt itself. A 2026 study of 9,649 experiments showed that these architectural decisions significantly impact model accuracy.
How is context engineering different from prompt engineering?
Prompt engineering focuses on crafting the instruction you give to an AI model — the wording, examples, and structure of your request. Context engineering focuses on everything else the model sees: the schemas, documents, files, and data formats that make up the majority of the context window. As AI agents handle more complex multi-step tasks, context engineering has become the higher-leverage skill.
What is the best format for AI context — YAML, JSON, or Markdown?
According to McMillan's study, no single format is statistically better than others in aggregate (chi-squared=2.45, p=0.484). YAML, JSON, Markdown, and TOON all produced similar accuracy across 11 models. However, individual models may have format-specific sensitivities, so testing with your specific model is recommended. The study found that familiar formats often outperform exotic compact formats due to the "grep tax."
Do open-source AI models handle context engineering differently than frontier models?
Yes, significantly. Frontier models (Claude Opus 4.5, GPT-5.2, Gemini 2.5 Pro) benefit from file-based context retrieval (+2.7% accuracy), while open-source models (DeepSeek V3.2, Kimi K2, Llama 4) lose 7.7% accuracy with the same approach. The 21-percentage-point accuracy gap between tiers is the largest factor in the study, larger than any format or architecture effect.
What is TOON format and should I use it?
TOON (Token-Oriented Object Notation) is a compact data format that combines YAML's indentation structure with CSV-style tabular layouts to minimize token usage. While TOON uses fewer tokens than JSON, McMillan's study found a "grep tax" — models unfamiliar with TOON spent more reasoning tokens parsing it, offsetting the savings. Unless you've tested TOON with your specific model and confirmed performance gains, stick with JSON, YAML, or Markdown.