I have both source pages. Writing the lesson now.
Context engineering
The whole lesson in one sentence: context engineering is the discipline of filling the context window with precisely the right information for the next step, and treating that window as a budget you curate rather than a place you dump everything you have.
The decision you make every step
The previous lesson left you with the loop: model calls a tool, gets a result, calls the model again. Now look at what happens between those calls. Every time the model is invoked, something assembles the text it sees. That assembly is the real work.
Prompt engineering assumed one static prompt you perfect once. Context engineering rejects that. A production agent needs a different set of information at each step: the tool outputs from the last few turns, the right slice of documentation, a summary of history instead of the full transcript, the procedural memory that happens to be relevant right now. Karpathy's definition, coined in June 2025, is the one to keep:
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step."
Cognition put it more bluntly in Don't Build Multi-Agents: "Context Engineering is the #1 job of engineers building AI agents." Anthropic codified it three months later in Effective context engineering for AI agents. Three independent sources landed on the same term inside three months. That is a signal: the field hit the same wall at the same time.
The wall is that prompt engineering is a one-time design choice. Context engineering is a continuous runtime decision. Every call in the loop is a fresh composition.
Why the budget is real: context rot
Nominal context windows are enormous now. Claude runs 200K to 1M, Gemini reaches 2M, Llama 4 Scout advertises 10M. None of that is usable headroom. Chroma's context rot research found effective recall stalls at 30 to 50% of the advertised maximum. The middle of the haystack gets forgotten first. Attention is O(n²), so the cost of relating every token to every other token grows fast.
Karpathy's framing is the one to hold onto:
"Context windows are really kind of like working memory."
Treat the window as a finite attention budget, not unlimited storage. Anything you stuff in but do not use for the next step is noise, and noise lowers the signal-to-noise ratio for everything that matters.
Smart zone vs dumb zone
Matt Pocock, building on a model from Dex Hardy at Human Layer, gives you the working number. Capability degrades quadratically with context size, and around 100K tokens every model tips from a smart zone into a dumb zone, regardless of whether the advertised window is 200K or 1M.
"By around 100K is kind of my new marker. It doesn't matter whether you're using 1 million context window or 200K. It always starts to just get dumber."
His metaphor: adding a token is like adding a team to a football league. The number of matches grows quadratically. Too many relations, not enough sharpness to tell them apart.
The practical correction: window size is not smart-zone size. A 1M window is good for retrieval, finding one fact in War and Peace, one-shot lookups. It is bad for coding, iterative work, and complex reasoning. As Pocock puts it, large windows "shipped a lot more dumb zone to you." Useful for retrieval, less good for coding.
Curate, do not dump
If the window is a budget, the system prompt is a fixed cost. It is always present, so every token in it is a permanent subtraction from your smart zone.
"If you have 250K tokens like I have seen people put in there, then you're just going to go straight into the dumb zone without even being able to do anything."
The same logic exposes MCP bloat. Installing too many MCP servers can consume 76K of 200K tokens on tool descriptions alone. Effective Sonnet context drops to around 100K before you have done anything. Fewer, higher-signal tools is a context decision, not only a tooling one.
Here are the four strategies Anthropic and the practitioners converge on for keeping the budget under control:
- Compaction. When the window fills, summarize and restart with system prompt plus summary plus the last N messages. Claude Code does this automatically. Devin trained a dedicated compression model for it. Pocock prefers clearing over compacting, because compaction leaves "sediment," a summarized previous session that lingers and pollutes later reasoning. Clearing gives a clean slate.
- Just-in-time retrieval. Keep light identifiers in context, file paths, query strings, IDs, and load the data only when you use it. This contrasts with classic RAG that pulls all chunks upfront. In Claude Code,
CLAUDE.mdis preloaded as procedural memory, butgrep,glob, andread_filefetch specific files only when needed. The folder structure of an agent becomes a form of context engineering. - Structured external notes. The agent writes to a filesystem that persists across runs. Anthropic's memory tool operates on a
/memoriessandbox with view, create, str_replace, and insert primitives. - Context isolation via subagents. Parallel subagents each get their own window, do their work, and return a 1 to 2K token summary. Pocock's grill-me skill burns 93K tokens in a subagent and returns only the summary, keeping the main context light.
Push vs pull
The last distinction you need is how information reaches the agent. Pocock splits it in two.
| Mode | What | Example |
|---|---|---|
| Push | Always in context, forced | CLAUDE.md, system prompts, injected reviewer instructions |
| Pull | On demand, with a description header | Skills with description: frontmatter the agent calls itself |
His rule for an agentic pipeline: "In the reviewer I would push the coding standards. In the implementer I would allow it to pull." The implementer needs variable context per issue, so pull is efficient. The reviewer has to compare code against standards, so push guarantees the standard is present. Anthropic's progressive disclosure, skills and scoped MCP tools, is fundamentally a pull discipline instead of push-everything-upfront.
The framing that ties it together
Karpathy, in his Sequoia talk on Software 3.0, reframes the whole thing:
"What's in the context window is your lever over the interpreter that is the LLM."
Designing a good context window is no different from designing a good program. Composition, scoping, naming, comment discipline all apply. And as Pocock adds: that lever degrades quadratically. Keep your tasks small enough to fit the smart zone. The same advice Martin Fowler and The Pragmatic Programmer gave human developers, now for agents too.
Takeaways
- Context is a budget, not a dump. Every token you do not use for the next step is noise.
- Effective recall stalls at 30 to 50% of the advertised window. Capability tips into the dumb zone around 100K tokens, whatever the nominal size.
- Keep the fixed cost low: small system prompt, few high-signal tools.
- Curate with four moves: compact, retrieve just in time, write external notes, isolate in subagents.
- Push what must always be compared against, pull what varies per step.
Where this goes next
You now have the runtime discipline for a single window. But agents need information that survives across windows, across sessions, across days. That is the difference between working memory and the rest of the system. The next lesson, Memory, covers the taxonomy, where state lives outside the window, and how procedural, episodic, and semantic memory map onto the strategies you just learned.