How Leading AI Teams Are Engineering Context at Scale

Teams at Anthropic, Manus, and Cognition are grappling with agents that break down after dozens of tool calls. Lance Martin reveals the battle-tested techniques keeping production AI systems coherent through hundreds of interactions, from smart offloading to multi-agent coordination.

From Theory to Practice

Introduction

While Drew Breunig's presentation established the philosophical and linguistic foundations of Context Engineering, Lance Martin's follow-up talk at LangChain HQ dove into the methods.

Where Drew asked "what is Context Engineering and why does it matter?", Lance answered "how are the world's leading AI teams actually doing it?"

Lance's presentation revealed the battle-tested techniques that companies like Anthropic, Manus, and Cognition are using to build production AI systems that can handle hundreds of tool calls without degrading into chaos. It's the difference between understanding why context matters and knowing exactly how to manage it when your agent is 50 tool calls deep and still needs to maintain coherent reasoning.

The Context Explosion: Why Agents Break Everything

Lance opened with a reality check that validated what Drew had theorized: agents fundamentally change the context game. While Drew introduced us to the concept of context collapse, Lance showed us the brutal math:

  • Traditional chatbots: Linear growth in context from user messages
  • Modern agents: Compounding growth as every tool observation is appended back into the context with each interaction
  • The Manus benchmark: 50 tool calls for a typical request
  • The Anthropic finding: Production agents routinely hit hundreds of conversation turns

This isn't just about longer conversations—it's about a fundamentally different information architecture. As Lance put it, "agents not only receive prompts from the user, they also receive observation from tool calls." Every tool response adds to the context burden, creating what he perfectly described as context that "blows up considerably."
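To make that math concrete, here's a back-of-the-envelope sketch in Python. The per-call token counts are illustrative assumptions, not figures from the talk; the point is how quickly observations dominate the context.

```python
# Back-of-the-envelope sketch of context growth over an agent loop.
# The token counts below are illustrative assumptions, not measurements.

SYSTEM_PROMPT_TOKENS = 2_000      # system prompt + tool descriptions
USER_MESSAGE_TOKENS = 300         # the original request
TOKENS_PER_TOOL_CALL = 150        # the model's tool-call message
TOKENS_PER_OBSERVATION = 1_500    # the tool's response fed back as context

def context_after(tool_calls: int) -> int:
    """Total tokens the model must re-read on the next turn."""
    return (
        SYSTEM_PROMPT_TOKENS
        + USER_MESSAGE_TOKENS
        + tool_calls * (TOKENS_PER_TOOL_CALL + TOKENS_PER_OBSERVATION)
    )

for n in (1, 10, 50, 100):
    print(f"{n:>3} tool calls -> ~{context_after(n):,} tokens in context")
# 50 tool calls (the Manus benchmark) already lands around ~85,000 tokens
```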

When Context Fails: Production Failure Modes

Building on Drew's theoretical framework, Lance presented specific failure modes that teams have documented in production:

Context Poisoning

The Gemini Pokemon example wasn't just amusing—it represented a systemic failure where hallucinated information contaminates future reasoning. Once an agent invents a fact, it treats that invention as truth in all subsequent operations.

Context Distraction

When Gemini hit 100k tokens, it didn't just slow down—it fundamentally changed behavior, favoring repetition over innovation. The agent got "stuck" in behavioral loops, unable to break free from established patterns.

Context Confusion

More tools don't always mean better performance. When agents have access to similar tools, they struggle with selection, leading to decreased performance even when theoretically more capable.

Context Clash

Perhaps most insidiously, when sequential tool calls return contradictory information, agents don't gracefully handle the conflict—they degrade unpredictably.

The Practitioner's Toolkit: Five Battle-Tested Techniques

Lance presented a framework of five practical techniques he has observed teams using in production:

1. Offloading: The Universal Solution

Why it's the favorite: Nearly every team uses offloading because it's simple, effective, and doesn't risk information loss.

Lance highlighted several patterns:

  • Manus's todo.md approach: Continuously updated files that track agent state
  • Research brief pattern: Generate planning documents early, store them externally, reload them when needed
  • Long-term memory files: User preferences and historical data live outside the active context

The key insight: "If you've done brief generation and [it's in] the message history and you have 100,000 tokens of research, the agent may or may not remember that original plan if it's buried at the top of your context window."
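
As a rough illustration of the pattern, here is a minimal Python sketch of offloading: the plan lives in an external todo.md-style file, and only a short reminder is re-injected near the end of the context. The file name and helper functions are hypothetical, not Manus's or LangChain's actual implementation.

```python
# Minimal offloading sketch: keep the plan out of the message history and
# re-inject only a short reminder when building the next prompt.
from pathlib import Path

SCRATCH = Path("todo.md")  # hypothetical external scratch file

def save_plan(plan: str) -> None:
    """Write the agent's current plan/state to an external file."""
    SCRATCH.write_text(plan, encoding="utf-8")

def load_plan() -> str:
    """Reload the plan so it can be appended near the *end* of the context,
    where the model is least likely to lose track of it."""
    return SCRATCH.read_text(encoding="utf-8") if SCRATCH.exists() else ""

def build_messages(history: list[dict], max_history: int = 20) -> list[dict]:
    """Keep only recent turns in context; the full plan lives on disk."""
    recent = history[-max_history:]
    reminder = {"role": "user", "content": f"Current plan (from todo.md):\n{load_plan()}"}
    return recent + [reminder]
```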

2. Reducing: Handle with Extreme Care

The community is split on reduction techniques:

Proponents like:

  • Anthropic's auto-compaction at 95% capacity
  • Tool call summarization at boundaries
  • Agent-to-agent handoff compression

Skeptics warn:

  • Manus explicitly discourages reduction due to information loss
  • Cognition uses specialized fine-tuned models just for summarization
  • The risk of losing critical details often outweighs benefits

Lance's take: "Be very careful about information loss when you're doing this."
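
For readers who want the shape of the idea, here is a hedged Python sketch of threshold-based compaction. The 95% figure mirrors the Anthropic number above; `count_tokens` and `summarize` are placeholders for whatever tokenizer and summarization model you use, not any vendor's API.

```python
# Hedged sketch of compaction at a capacity threshold. Recent turns are kept
# verbatim and only the older span is compressed, because information loss is
# the real risk with any reduction technique.
from typing import Callable

def maybe_compact(
    messages: list[dict],
    count_tokens: Callable[[list[dict]], int],   # placeholder tokenizer
    summarize: Callable[[list[dict]], str],      # placeholder summarization model
    context_limit: int,
    threshold: float = 0.95,
    keep_recent: int = 10,
) -> list[dict]:
    """Replace older turns with a summary once context nears the limit."""
    if count_tokens(messages) < threshold * context_limit:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [{"role": "user", "content": f"Summary of earlier work:\n{summary}"}] + recent
```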

3. Retrieving: Beyond Basic RAG

The presentation revealed sophisticated retrieval systems in production:

  • Windsurf: Mixed retrieval methods with re-ranking
  • Cursor's Preempt: An entire system dedicated to assembling retrievals into prompts
  • Tool retrieval: Dynamically selecting relevant tools based on the current task

This isn't your grandmother's RAG—it's sophisticated, multi-layered retrieval orchestration.
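
As one small illustration of tool retrieval, here is a Python sketch that embeds tool descriptions and selects only the top matches for the current task before each model call. The `embed` function is a placeholder for your embedding model, and none of this reflects Windsurf's or Cursor's actual internals.

```python
# Illustrative tool-retrieval sketch: score each tool description against the
# task and expose only the most relevant tools to the model.
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_tools(
    task: str,
    tools: dict[str, str],                   # tool name -> description
    embed: Callable[[str], list[float]],     # placeholder embedding model
    top_k: int = 5,
) -> list[str]:
    """Return the top_k tool names whose descriptions best match the task."""
    task_vec = embed(task)
    scored = [(cosine(task_vec, embed(desc)), name) for name, desc in tools.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top_k]]
```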

4. Isolating: The Multi-Agent Minefield

Here's where the rubber meets the road on multi-agent systems:

The promise: Parallel processing, specialized contexts, and no cross-contamination between tasks

The peril: Conflicting decisions, coordination nightmares, inconsistent outputs

Lance's critical insight from their open-deep-research system: "We only do context gathering in the sub-agents. We don't actually write sections of the report." This prevents the common failure mode where different agents write conflicting content.
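
Here is a minimal Python sketch of that division of labor: sub-agents gather notes in parallel with isolated contexts, and a single writer makes every synthesis decision. The `research` and `write_report` callables are stand-ins for your own model calls.

```python
# Sketch of "gather in sub-agents, decide in one place". Sub-agents only
# return notes; a single synthesis step writes the report.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_research(
    subtopics: list[str],
    research: Callable[[str], str],             # sub-agent: returns notes only
    write_report: Callable[[list[str]], str],   # single writer: makes all decisions
) -> str:
    # Sub-agents explore in parallel, each with its own isolated context.
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(research, subtopics))
    # Only one agent synthesizes, so sections can't contradict each other.
    return write_report(notes)
```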

5. Caching: The Cost Optimizer

A technique Manus has championed that others haven't widely adopted yet:

  • Cache immutable content (system prompts, tool descriptions)
  • 10x cost reduction for cached tokens on Claude
  • Doesn't solve length problems but dramatically improves economics
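
For concreteness, here is a hedged sketch using the Anthropic Python SDK's prompt-caching support, marking the immutable prefix as cacheable. The model name and prompt are examples; check pricing and cache behavior against Anthropic's current documentation.

```python
# Hedged prompt-caching sketch: mark the large, immutable prefix (system prompt
# plus tool instructions) as cacheable so repeated agent turns re-read it at
# the cached-token rate instead of full price.
import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "..."  # system prompt and tool instructions, unchanged per turn

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # example model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": "Summarize the findings so far."}],
)
```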

The Production Reality Check

Lance presented a comparative analysis that revealed fascinating disagreements in the field:

| Technique | Consensus Level | Key Insight |
|---|---|---|
| Offloading | Universal adoption | The safest, most reliable technique |
| Reducing | Highly controversial | Manus says never, Cognition says maybe with fine-tuning |
| Retrieving | Standard practice | Everyone does it, but implementation varies wildly |
| Isolating | Philosophical divide | Cognition warns against it, Anthropic embraces it |
| Caching | Emerging practice | Only Manus discussed it publicly |

Case Study: Open-Deep-Research in Action

Lance's team's research system provided a masterclass in combining techniques:

  1. Offloading: Research briefs stored in LangGraph state
  2. Reduction: Careful summarization of tool outputs
  3. Isolation: Sub-agents for parallel research, but centralized writing

The critical design decision: "Sub-agents lower risk if [they] avoid decisions." They can gather information in parallel, but only one agent makes synthesis decisions.
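
Loosely mirroring that pattern, here is a hedged LangGraph sketch in which the brief lives in graph state rather than the message history, and a single synthesis node writes the report. The node bodies are placeholders, not code from open-deep-research.

```python
# Hedged sketch: offload the research brief into graph state, gather notes,
# then synthesize in one place. Node functions stand in for LLM calls.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ResearchState(TypedDict):
    topic: str
    brief: str        # the plan, offloaded into state instead of messages
    notes: list[str]  # gathered by sub-agents
    report: str       # written by a single synthesis node

def write_brief(state: ResearchState) -> dict:
    return {"brief": f"Plan for researching: {state['topic']}"}   # stand-in for an LLM call

def gather(state: ResearchState) -> dict:
    return {"notes": [f"Notes on {state['topic']}"]}              # stand-in for sub-agents

def synthesize(state: ResearchState) -> dict:
    return {"report": "\n".join(state["notes"])}                  # one writer, no conflicts

builder = StateGraph(ResearchState)
builder.add_node("write_brief", write_brief)
builder.add_node("gather", gather)
builder.add_node("synthesize", synthesize)
builder.add_edge(START, "write_brief")
builder.add_edge("write_brief", "gather")
builder.add_edge("gather", "synthesize")
builder.add_edge("synthesize", END)
graph = builder.compile()

result = graph.invoke({"topic": "context engineering", "brief": "", "notes": [], "report": ""})
```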

The Uncomfortable Truth About Multi-Agent Systems

One of the most valuable parts of Lance's presentation was his honest assessment of multi-agent architectures:

The dream: Specialized agents working in harmony

The reality: "Multi-agents can't coordinate very well and so they can make conflicting decisions"

His pragmatic solution: Use multi-agent systems for information gathering only. Let them explore in parallel, but centralize all decision-making and synthesis.

Beyond the Hype: What Actually Works

Lance concluded with a simple summary:

  • Most popular: Offloading (because it just works)
  • Most controversial: Reduction (information loss is real)
  • Most sophisticated: Retrieval systems (but implementation quality varies)
  • Most dangerous: Uncoordinated multi-agent systems
  • Most underutilized: Caching (huge cost savings waiting to be captured)

The Road Ahead

Lance's parting thought reinforced Drew's vision while adding practical urgency: "Context engineering is just one small piece of an emerging thick layer of non-trivial software that coordinates LLM calls into full LLM apps."

The term "ChatGPT wrapper" isn't just wrong—it's dangerously misleading. What teams are building requires sophisticated orchestration, careful information architecture, and yes, context engineering.

Key Takeaways for Practitioners

  1. Start with offloading: It's the safest, most universally applicable technique
  2. Be extremely careful with reduction: Information loss is real and often catastrophic
  3. Invest in retrieval infrastructure: This is where competitive advantage lives
  4. Design multi-agent systems for gathering, not deciding: Centralize synthesis
  5. Don't ignore caching: The cost savings are too significant to pass up

Conclusion: From Philosophy to Production

Where Drew gave us the language and framework to think about Context Engineering, Lance showed us what it looks like in the trenches. The techniques presented here aren't theoretical—they're extracted from systems handling millions of tokens and hundreds of tool calls in production today.

The message is clear: Context Engineering isn't just a useful concept—it's an essential discipline for anyone building serious AI applications. The teams that master these techniques will build the agents that actually work. The ones that don't will keep wondering why their demos break in production.

As we move from the "ChatGPT wrapper" era to the age of sophisticated AI applications, Context Engineering will separate the demos from the deployments. Lance's presentation didn't just validate Drew's vision—it gave us the blueprints to build it.


The future isn't just about better prompts—it's about better context. And now we know how to build it.


This article covers Lance Martin's practical presentation on Context Engineering, delivered at LangChain HQ in San Francisco on July 23rd. As one of the early engineers at LangChain and a core contributor to their Python open source library, Lance brings firsthand experience from building the tools and systems that thousands of developers use to implement context engineering in production.