Shrinking Always-On Agent Config + Tiered Lazy Context Loading: A Deep Research Report

  • Deep Research request. Be thorough, cite sources, prioritize reliable information from the last ~24 months.

    OBJECTIVE: Identify the most effective and most NON-OBVIOUS techniques for shrinking a large instruction/configuration file that an AI agent re-reads at the start of every single turn (so its size is paid repeatedly), WITHOUT losing any behavioral fidelity.

    Research and answer ALL of the following up front (do not pause to ask me; state any assumptions you make):

    1. What categories of content can be moved out and loaded on demand vs. what must stay inline to preserve behavior?
    2. Which concrete techniques cut size the most, across structure, references/pointers, externalization, summarization, and encoding/compression?
    3. How do practitioners PROVE nothing was lost after compression (regression/parity/eval methods)?
    4. Which "compressions" look safe but silently change model behavior (the traps)?
    5. What current tools, papers, or established patterns address this (e.g. prompt/context compression, context distillation, instruction caching)? Name them with sources.

    SOURCES & RECENCY: Favor credible, recent (≈last 24 months) sources; cite every non-obvious claim with a link; distinguish established practice from experimental; flag thin/contested evidence.

    OUTPUT: For each distinct technique, use: (1) Idea one-liner (2) How it works (3) Why non-obvious (4) Concrete worked example (5) Failure modes (6) Adoption cost (low/med/high + why). End with a Source map and a "State of the art as of today" paragraph.

    CONSTRAINT: Answer generically (universal AI-agent design). Do not ask me for proprietary details.

Shrinking Always-On Agent Config + Tiered Lazy Context Loading: A Deep Research Report

Scope & assumptions: This report answers two related questions about AI-agent context engineering, generically (no proprietary internals). Topic 1 is shrinking a large instruction/config file re-read every turn without losing behavior. Topic 2 is "load only what's needed now" startup design. I assume a modern LLM agent (200K–1M token window) with filesystem/tool access, paying input-token cost per turn. Evidence is weighted toward 2024–2026 sources; I flag emerging vs. established and contested claims.


TL;DR

  • Topic 1: The biggest non-obvious win is not compressing prose but moving rarely-used content out of the always-on file behind pointers/progressive disclosure (load on demand) while keeping the small set of behavior-anchoring tokens — persona, hard safety rules, output contract, and a few canonical examples — inline; pair this with prompt caching so the stable prefix is nearly free (cache reads cost 0.1× base input price). Prove fidelity with a frozen regression/eval suite + behavioral diffing, because plausible-looking compressions (dropping examples, summarizing edge cases, reordering) silently change behavior.
  • Topic 2: The cheapest correct boot reads a small, fixed minimal set — core instructions + a rolling/hierarchical summary + the recent "tail" of raw history + lightweight pointers/indices to everything else — and defers the bulk, pulling deferred context just-in-time when triggers fire (retrieval miss, explicit reference, tool error, user follow-up). Recover the recent tail cheaply via append-only logs + checkpoints rather than re-reading whole history files.
  • State of the art (both): Progressive disclosure (Agent Skills), compaction, structured note-taking/file-as-memory, sub-agent context isolation, and tiered memory stores (MemGPT/Letta, mem0) are the converged patterns; token-level prompt compression (LLMLingua family) and learned compression (gisting, context distillation) are powerful but more situational and carry real failure modes.

TOPIC 1 (W1·Q1): Shrinking a large always-on instruction/config file

Framing: why every token in an always-on file is expensive twice over

A file re-read at the start of every turn pays its token cost repeatedly, and those tokens also consume the model's finite "attention budget." Anthropic frames context as "a finite resource with diminishing marginal returns," citing "context rot": as token count grows, recall degrades (Anthropic, "Effective context engineering for AI agents," Sep 29 2025). The Chroma Research report "Context Rot: How Increasing Input Tokens Impacts LLM Performance" by Kelly Hong, Anton Troynikov & Jeff Huber (July 2025) evaluated 18 LLMs — including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 — and concluded "their performance grows increasingly unreliable as input length grows," even on trivial tasks and non-uniformly across models. So shrinking an always-on file improves both cost and quality — the goal, per Anthropic, is "the smallest set of high-signal tokens that maximize the likelihood of your desired outcome."

Two empirical anchors govern what you can safely cut:

  • Instruction-density limits: "How Many Instructions Can LLMs Follow at Once?" (IFScale, Jaroslawicz et al., arXiv:2507.11538) evaluated 20 state-of-the-art models across seven providers on a 500-keyword business-report task and found "even the best frontier models only achieve 68% accuracy at the max density of 500 instructions" (e.g., o3 62.8%, grok-3 61.9%, and claude-3.7-sonnet 52.7%, notably ahead of claude-opus-4's 44.6%). Top "threshold-decay" reasoning models (gemini-2.5-pro, o3) "maintain near-perfect performance through moderate densities (100–250 instructions) before degradation," whereas GPT-4.1 and claude-3.7-sonnet show steadier linear decay. Practitioner lore converges on keeping CLAUDE.md-style files short (HumanLayer reports their root file is under 60 lines; the general consensus cited is under 300 lines).
  • Position/salience effects: "Lost in the Middle" (Liu et al., TACL 2024) and "Found in the Middle" (Hsieh et al., arXiv:2406.16008) show a U-shaped attention bias — content at the start and end of context gets more attention than the middle, regardless of relevance. This means reordering a config file changes behavior even if content is identical.

Q1.1 — What can move out vs. what must stay inline

Must stay inline (behavior-anchoring, high-salience):

  • Core behavioral rules / persona / role definition.
  • Hard safety & policy constraints (these are terminal-risk; see traps).
  • The output contract (format/schema) the agent must always honor.
  • A small set of canonical few-shot examples that anchor tone and edge behavior.
  • Trigger/index metadata telling the agent what on-demand resources exist and when to load them.

Safe to externalize (load on demand):

  • Reference material, long API/docs, style guides a linter can enforce.
  • Rarely-used procedures and domain playbooks.
  • Verbose explanations, background, rationale.
  • Large example libraries beyond the canonical few.

Anthropic's guidance: system prompts should hit the "right altitude," and you should "curate a set of diverse, canonical examples" rather than "stuff a laundry list of edge cases." HumanLayer: "don't tell Claude all the information... tell it how to find important information," using file:line references — and "never send an LLM to do a linter's job."


Techniques (6-part format)

Technique 1 — Progressive disclosure / index-plus-references (ESTABLISHED)

  1. One-liner: Keep only a name+description "table of contents" inline; load full instructions/resources only when a task matches.
  2. How it works: Anthropic Agent Skills (released as an open standard Dec 18 2025) load in three stages — Discovery (name+description in the system prompt), Activation (read full SKILL.md when relevant), Execution (load bundled scripts/reference files as needed). The SKILL.md body is recommended to stay under ~500 lines / ~5,000 tokens, with overflow pushed to referenced files.
  3. Why non-obvious: Counter to the intuition that the model "needs everything," the metadata index (~80 tokens median per skill, per a swirlai measurement of Anthropic's 17 official skills) is enough for the model to decide to load more. The result is effectively unbounded knowledge at a small, fixed context cost.
  4. Worked example: A 4,000-line "agent manual" becomes a 40-line SKILL.md (name, description, when-to-use, pointers) plus references/ files. At boot the agent carries ~80–250 tokens of metadata instead of thousands; a PDF-forms procedure only loads when the user asks about forms.
  5. Failure modes: Bad descriptions → the skill never triggers (silent capability loss); over-splitting raises coordination cost; security risk from loading untrusted skills.
  6. Adoption cost: Low–med. Adopted across Claude Code, OpenAI Codex CLI, Gemini CLI, GitHub Copilot, Cursor; marketplaces index hundreds of thousands of skills. Requires restructuring your file and writing good trigger descriptions.
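
The three-stage loading described above can be sketched in a few lines. This is a minimal illustration, not the Agent Skills implementation: `SkillMeta`, `build_index`, and `activate` are hypothetical names, and in a real agent the model itself decides when to activate a skill based on the inline descriptions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SkillMeta:
    """Always-on metadata: the only part kept in the system prompt."""
    name: str
    description: str
    path: Path  # full instructions live here and are read only on activation


def build_index(skills):
    """Discovery stage: render the inline table of contents."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def activate(skills, name):
    """Activation stage: read the full body for one skill on demand."""
    by_name = {s.name: s for s in skills}
    return by_name[name].path.read_text(encoding="utf-8")


# Demo with a throwaway file standing in for a long procedure (hypothetical content).
workdir = Path(tempfile.mkdtemp())
body_file = workdir / "pdf-forms.md"
body_file.write_text("Step: inspect each form field before filling it.\n" * 200, encoding="utf-8")

skills = [SkillMeta("pdf-forms", "Fill PDF forms. Use when the user asks about forms.", body_file)]
index = build_index(skills)                # carried on every turn
full_body = activate(skills, "pdf-forms")  # loaded only when the task matches
```

The design point is that `index` is the only per-turn cost; the size of `full_body` no longer matters until a task actually needs it.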

Technique 2 — Prompt/instruction caching (ESTABLISHED; highest ROI for repeated reads)

  1. One-liner: Don't shrink the tokens — stop paying full price for them by caching the stable prefix.
  2. How it works: Anthropic/OpenAI cache a prompt prefix; identical prefixes hit the cache at a fraction of input cost. Per Anthropic's pricing docs, "cache read tokens are 0.1 times the base input tokens price," while "5-minute cache write tokens are 1.25 times the base input tokens price; 1-hour cache write tokens are 2 times." Anthropic uses explicit cache_control breakpoints (up to 4), 5-min default TTL (1-hr optional); OpenAI does it automatically. Caching covers tools, system, and messages in that order up to the breakpoint.
  3. Why non-obvious: It directly targets the "paid every turn" problem without touching behavioral fidelity at all — the tokens are byte-identical. It reframes the whole problem: a large always-on file may be fine if it's cached and stable.
  4. Worked example: ProjectDiscovery reports cutting LLM costs 59% (later 66–70%) via prefix caching, using three breakpoints and a 1-hr TTL on static system content to keep the cache warm across users. Anthropic's launch announcement (anthropic.com/news/prompt-caching, first published Aug 14 2024, with general availability noted Dec 17 2024) reports caching lets customers reduce "costs by up to 90% and latency by up to 85% for long prompts"; their own test of chatting with a cached 100K-token book ran in 2.4s vs. 11.5s uncached (a 79% latency cut).
  5. Failure modes: Any change to the prefix (or dynamic content placed too early) busts the cache — "consistent cache misses on exactly the content that benefits most." Caching is orthogonal to attention/context-rot: a cached huge file is cheap but still pollutes attention.
  6. Adoption cost: Low. Mostly a structural change: put stable content first, volatile content last, and mark breakpoints. Note the write penalty (1.25×–2× base) means caching only pays off when the prefix is reused.
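
The write-penalty trade-off can be checked with simple arithmetic. The sketch below uses only the multipliers quoted above (0.1× reads, 1.25× and 2× writes) and assumes, optimistically, that every turn after the first hits the cache within its TTL; `base_price` is a placeholder unit price, not a real rate.

```python
READ_MULT = 0.1                        # cache read, as a multiple of base input price
WRITE_MULT = {"5m": 1.25, "1h": 2.0}   # cache write multiplier by TTL


def uncached_cost(prefix_tokens, turns, base_price=1.0):
    """Cost of re-sending the same prefix at full price every turn."""
    return prefix_tokens * turns * base_price


def cached_cost(prefix_tokens, turns, base_price=1.0, ttl="5m"):
    """One cache write on the first turn, then cache reads on every later turn."""
    if turns <= 0:
        return 0.0
    write = prefix_tokens * WRITE_MULT[ttl] * base_price
    reads = prefix_tokens * READ_MULT * base_price * (turns - 1)
    return write + reads


def break_even_turns(ttl="5m"):
    """Smallest number of turns at which caching is strictly cheaper."""
    turns = 1
    while cached_cost(1, turns, ttl=ttl) >= uncached_cost(1, turns):
        turns += 1
    return turns


savings_50_turns = 1 - cached_cost(10_000, 50) / uncached_cost(10_000, 50)
```

Under these assumptions a 5-minute cache pays off from the second turn and a 1-hour cache from the third, while a 50-turn session saves a little under 88%, which is why the ~90% figure is best read as an upper bound.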

Technique 3 — Token-level prompt compression: LLMLingua family (ESTABLISHED tooling; SITUATIONAL for always-on instructions)

  1. One-liner: Use a small model to delete low-information tokens from a prompt, keeping what the big model needs.
  2. How it works: LLMLingua (EMNLP 2023, Microsoft) uses a small LM (GPT-2/LLaMA-7B) to score token perplexity and prune, with a budget controller and iterative token-level compression; LLMLingua-2 (ACL 2024 Findings) reframes it as BERT-scale token classification distilled from GPT-4, task-agnostic and 3×–6× faster. Claims up to 20× compression with minimal loss; LongLLMLingua reports +17.1% at 4× on some tasks.
  3. Why non-obvious: Compressed prompts look like gibberish to humans but the LLM still "recovers the essential information." Removing low-signal tokens can improve accuracy by reducing middle-of-context noise.
  4. Worked example: A verbose retrieved-context block compressed 50% before sending to GPT-4/Claude, cutting cost ~50–80% on high-volume pipelines.
  5. Failure modes: This is the trap-rich technique. Independent 2026 evaluation ("Prompt Compression in the Wild," arXiv:2604.02985) found LLMLingua-2 unsuitable for structured data: passage-retrieval accuracy dropped below 50% because paragraph-numbering cues are destroyed, and few-shot classification (TREC, LSHT) dropped up to 52% because class-indicative patterns get pruned. Token-level pruning can break code syntax. Adds a compression-model call (latency).
  6. Adoption cost: Med. Open-source, integrates with LangChain/LlamaIndex, but you must eval per task and it's risky on instruction/format-critical content. Best on verbose reference/RAG context, not on hard rules or schemas.

Technique 4 — Learned compression: gisting & soft prompts (EMERGING/EXPERIMENTAL)

  1. One-liner: Train the model to compress a fixed prompt into a few reusable "gist" tokens.
  2. How it works: Gisting (Mu, Li, Goodman, Stanford, arXiv:2304.08467) trains an LM via a modified attention mask to compress prompts into "gist" tokens cached and reused — up to 26× compression, ~40% FLOPs reduction, "minimal loss in output quality." Related: AutoCompressors (summary vectors, ~30:1), ICAE.
  3. Why non-obvious: The compression is into soft embeddings, not text — far denser than any natural-language rewrite, and cacheable.
  4. Worked example: A fixed instruction prefix used across millions of calls compressed to a handful of gist tokens, saving context space on every call.
  5. Failure modes: Soft prompts are model-specific and require retraining per model change — "falls short in maintaining transferability," and unusable on black-box APIs (Claude/GPT) where you can't inject custom embeddings. Risk of overfitting.
  6. Adoption cost: High. Needs model access + fine-tuning. Mostly relevant to teams running open-weight models they control.

Technique 5 — Context distillation (EMERGING; the "throw away the prompt" approach)

  1. One-liner: Fine-tune the model to behave as if the long prompt were present, then delete the prompt.
  2. How it works: Introduced by Anthropic in Askell et al. 2021, "A General Language Assistant as a Laboratory for Alignment" (arXiv:2112.00861), whose Section 2.1 is literally titled "Context Distillation." You "finetune a model p_θ(X) with a loss given by L(θ) = D_KL(p_0(X|C) || p_θ(X))" — i.e., train the model to match the output distribution of the prompted model so the context C can be removed. Snell, Klein & Zhong 2022, "Learning by Distilling Context" (arXiv:2209.15189, UC Berkeley) generalized it to internalize three signals: abstract instructions, step-by-step reasoning, and concrete examples.
  3. Why non-obvious: It converts prompt tokens into weights — zero ongoing context cost — and Askell et al. note it can "potentially allow for the use of prompts that exceed the size of the context window."
  4. Worked example: An always-on "house style + safety preamble" distilled into a fine-tune so new sessions start with the behavior baked in and a near-empty system prompt. Snell et al. report distilling concrete examples "outperforms directly learning with gradient descent by 9% on the SPIDER Text-to-SQL dataset."
  5. Failure modes: Requires training; behavior is frozen at distill time (any rule change = re-distill). Askell et al. caution that for "many but not all of our evaluations context distillation performs about as well as prompting," and report a small "alignment tax" (e.g., on zero-shot LAMBADA) — so it is not guaranteed lossless. Not possible on black-box APIs.
  6. Adoption cost: High. Fine-tuning pipeline + re-distillation discipline. Best for very stable, very high-volume behaviors.
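
To make the distillation objective concrete, the sketch below evaluates D_KL(p_0(X|C) || p_θ(X)) for a single next-token distribution. The three-token vocabulary and all probabilities are invented toy numbers; a real pipeline would average this loss over many sampled sequences during fine-tuning.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token distributions over a 3-token vocabulary (illustrative only).
teacher = [0.70, 0.20, 0.10]          # p_0(X | C): base model WITH the long context
student_before = [0.40, 0.30, 0.30]   # p_theta(X): no context, before distillation
student_after = [0.68, 0.21, 0.11]    # p_theta(X): no context, after distillation

loss_before = kl_divergence(teacher, student_before)
loss_after = kl_divergence(teacher, student_after)
```

A loss near zero means the context-free student reproduces the prompted model's distribution on that input; it says nothing about inputs outside the distillation data, which is where the "not guaranteed lossless" caveat applies.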

Technique 6 — Structural rewriting: dedup, "right altitude," linters-not-prose (ESTABLISHED, low-risk)

  1. One-liner: Rewrite for density — deduplicate, delete what the model already knows or can infer, and offload deterministic rules to tools.
  2. How it works: Remove restated rules, generic boilerplate ("you are a helpful assistant"), and anything inferable from the codebase/README; replace style rules with a linter/hook; phrase at the "right altitude" (specific enough to guide, general enough not to be brittle). Token-efficient phrasing and shorthand can cut size further.
  3. Why non-obvious: "Be concise" instructions can raise quality: the OPSDC reasoning-compression result cited by Rephrase found ~57–59% token reduction with +9–16 points accuracy on math; one production ticket-classifier hit 55% token reduction "with no loss of instruction fidelity."
  4. Worked example: Replacing 200 lines of code-style guidance with "run npm run lint" via a hook (hooks enforce at ~100% vs. ~70% for prose instructions, per DataCamp's summary of Claude Code internals).
  5. Failure modes: Shorthand (pos=approve) breaks in zero-shot/user-facing contexts; over-aggressive dedup can remove reinforcing repetition that actually anchors behavior.
  6. Adoption cost: Low. Manual editing + moving rules to deterministic tooling.
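
A low-risk way to start the dedup pass is to report repeated rules instead of deleting them, so a human can decide which repetitions are deliberate reinforcement. The following is a rough sketch; the normalization rules are an assumption and will miss paraphrased duplicates.

```python
import re
from collections import defaultdict

LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def normalize(line):
    """Strip list markers and punctuation so near-identical rules compare equal."""
    text = LIST_MARKER.sub("", line.lower())
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def find_duplicates(config_text):
    """Map each repeated rule to the 1-based line numbers where it appears."""
    seen = defaultdict(list)
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        key = normalize(line)
        if key:
            seen[key].append(lineno)
    return {key: lines for key, lines in seen.items() if len(lines) > 1}


sample_config = "- Always run tests.\nUse tabs for indentation.\n* always run tests\n"
duplicates = find_duplicates(sample_config)
```

Because the function only reports line numbers, removing a duplicate stays an explicit editorial choice that can be checked against the regression suite.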

Q1.2 — Which techniques cut size the most, by dimension

  • Structure (reorg/dedup): Technique 6 — moderate, low-risk; the safest first pass.
  • References/pointers: Technique 1 (progressive disclosure) — largest practical reduction of always-on tokens; turns thousands of inline tokens into ~80–250 metadata tokens.
  • Externalization (on-demand files/tools): Technique 1 + file-as-memory (Technique 8); plus offloading rules to linters/hooks (Technique 6).
  • Summarization: situational — strong for verbose reference text, dangerous for edge-case/format content (see traps).
  • Encoding/compression: Techniques 3–5. Token-level (LLMLingua) for verbose RAG; learned (gisting/distillation) only if you control weights. Caching (Technique 2) is the orthogonal multiplier: it doesn't shrink tokens, but on cache hits it removes up to ~90% of their repeated cost (reads bill at 0.1× base), offset slightly by the 1.25×–2× write premium.

Q1.3 — How practitioners PROVE nothing was lost

The consensus method is evaluation-driven iteration against a frozen baseline, not eyeballing:

  • Frozen regression suite + differential evaluation: Maintain a compact "golden set" of high-value cases (core workflows + previously observed failures) and re-run on every change to prompt, model, or corpus, comparing to prior baselines with the same oracles (arXiv:2601.17292). Version golden sets with code; refresh from production traffic (Braintrust).
  • Behavioral diffing / data slicing: Tools like RETAIN (EMNLP 2024 demo) use data-slicing to surface where outputs diverge between prompt versions, then drill into failure cases — purpose-built for prompt/model migration.
  • A/B output comparison + LLM-as-judge: Side-by-side scoring of old vs. new prompt outputs; set temperature 0 for reproducibility (flaky evals otherwise).
  • Stop-rule probing (practitioner test): Pick a distinctive rule the file should enforce, start a fresh session, and give a task that should trigger it; if the rule is violated, the file is too long, the rule is buried, or the file isn't loading (Blink).
  • Behavioral-consistency testing: Because identical inputs can yield divergent outputs, test semantic behavior across paraphrases (e.g., 100 equivalent task descriptions, expect ≥95% functional correctness) rather than exact output equivalence (arXiv:2508.20737).

Contested/important: "Better" prompts can silently trade off behaviors. A reproducible 2026 study (arXiv:2601.22025) found replacing task-specific prompts with generic rules dropped extraction pass-rate 100%→90% and RAG compliance 93.3%→80% on Llama-3 while improving instruction-following — proof that aggregate "no regression" can hide per-capability regressions. Always slice by capability.
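
The "slice by capability" advice can be expressed as a small gate over eval results. This sketch assumes each eval case has already been scored as pass/fail by some oracle; the capability names and counts are invented, loosely mirroring the shape of the result described above.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (capability, passed) pairs -> {capability: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        passes[capability] += int(passed)
    return {cap: passes[cap] / totals[cap] for cap in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Capabilities whose pass rate dropped by more than `tolerance`."""
    base, cand = pass_rates(baseline), pass_rates(candidate)
    return {
        cap: (base[cap], cand.get(cap, 0.0))
        for cap in base
        if cand.get(cap, 0.0) < base[cap] - tolerance
    }


# Invented results: the candidate prompt improves the aggregate but hurts extraction.
baseline = [("extraction", True)] * 10 + [("instructions", True)] * 7 + [("instructions", False)] * 3
candidate = [("extraction", True)] * 9 + [("extraction", False)] + [("instructions", True)] * 9 + [("instructions", False)]

aggregate_before = sum(p for _, p in baseline) / len(baseline)
aggregate_after = sum(p for _, p in candidate) / len(candidate)
blocked = regressions(baseline, candidate)
```

Here the aggregate pass rate rises from 0.85 to 0.90 while `blocked` still flags extraction, which is exactly the case an aggregate-only check would wave through.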

Q1.4 — Compressions that look safe but silently change behavior (the traps)

  • Removing "redundant" examples that anchor behavior. Few-shot examples are "pictures worth a thousand words" (Anthropic); pruning them can collapse edge-case handling. In LLMLingua-2 evals, compressing few-shot examples dropped classification up to 52% by removing class-indicative patterns.
  • Summarizing away edge cases. Summarization rewrites content and "risks hallucinated details or lost specifics"; verbatim/extractive compaction is safer because surviving tokens are identical (Morph; Factory's benchmark on 36K coding messages found detail-accuracy — file paths, line numbers — the key differentiator, not compression ratio).
  • Reordering that changes salience. Because of the U-shaped bias, moving a rule from the start/end into the middle measurably reduces compliance even with identical text.
  • Deleting reinforcing repetition. Repetition that looks redundant can be what holds compliance "under load"; terminal/late constraints are most vulnerable to being dropped (arXiv:2603.23530 found formatting compliance drops 2–21% under added load, worst for terminal constraints).
  • Destroying structural cues. Token-level compression can strip numbering/delimiters that the task depends on (passage retrieval fell below 50%).
  • Over-compressing safety. Safety constraints are high-risk to externalize because if the trigger fails to fire, the rule is simply absent that turn.

TOPIC 2 (W1·Q2): Tiered / lazy context loading at session start

Framing

The anti-pattern is booting by reading large history/log files in full just to recover recent context — paying for stale tokens and inviting context rot. The converged answer is a tiered memory hierarchy (MemGPT/Letta's "LLM-as-OS": in-context core memory ≈ RAM, external recall/archival ≈ disk) plus just-in-time retrieval of references.

Q2.1 — Minimal read-set at boot that preserves correctness

A correct-but-cheap boot loads only:

  1. Core instructions / persona / safety (the always-on inline set from Topic 1).
  2. A compact rolling/hierarchical summary of prior sessions (architectural decisions, open threads, key facts).
  3. The recent raw "tail" (last N turns/messages at full fidelity).
  4. Lightweight pointers/indices (file paths, stored queries, memory keys, a NOTES.md/TODO) — not their contents.

Anthropic's compaction in Claude Code is the canonical reference: it summarizes the conversation and reinitializes "with this compressed context plus the five most recently accessed files." Mem0 layers conversation/session/user/org memory and ranks user memories first, then session, then raw history.
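
The four-part read-set can be assembled with very little machinery. This is a sketch under stated assumptions: `BootContext` and `build_boot_context` are hypothetical names, the section labels are arbitrary, and the history is an in-memory list standing in for whatever store holds past turns.

```python
from dataclasses import dataclass, field


@dataclass
class BootContext:
    core: str        # always-on rules, persona, safety, output contract
    summary: str     # rolling summary of prior sessions
    tail: list       # last N raw turns at full fidelity
    pointers: dict = field(default_factory=dict)  # name -> location, never contents

    def render(self):
        pointer_lines = [f"- {name}: {path}" for name, path in self.pointers.items()]
        return "\n\n".join([
            self.core,
            "SUMMARY:\n" + self.summary,
            "RECENT TURNS:\n" + "\n".join(self.tail),
            "AVAILABLE ON DEMAND (not loaded):\n" + "\n".join(pointer_lines),
        ])


def build_boot_context(core, summary, history, pointers, tail_turns=5):
    """Keep only the last `tail_turns` turns; everything older stays behind pointers."""
    tail = history[-tail_turns:] if tail_turns > 0 else []
    return BootContext(core, summary, tail, dict(pointers))


history = [f"turn {i}" for i in range(100)]
ctx = build_boot_context(
    core="Follow the output contract.",
    summary="Migrating billing module; auth module done.",
    history=history,
    pointers={"notes": "NOTES.md", "archive": "logs/history.jsonl"},
)
prompt = ctx.render()
```

The boot cost is now fixed by `tail_turns` and the summary length, not by how long the history has grown.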

Q2.2 — Deciding what to defer vs. load eagerly

  • Eager: anything needed for correctness on the likely next action — core rules, the summary, the recent tail, the current task/goal.
  • Defer: anything whose need is conditional, such as reference docs, older history, domain playbooks, and large tool outputs. Anthropic's heuristic is a hybrid strategy: "retrieve some data up front for speed, and pursue further autonomous exploration at its discretion"; the right autonomy level "depends on the task," with less-dynamic domains (legal/finance) favoring more upfront retrieval. Claude Code loads CLAUDE.md into context up front but uses glob/grep to pull files just-in-time, "bypassing stale indexing."

Q2.3 — Detecting when deferred context is needed mid-task (triggers/signals)

  • Explicit reference: the user or task names a file/entity/ticket → load it.
  • Retrieval/self-assessment miss: the agent judges current context insufficient and emits a "trigger token" to search deeper tiers (hierarchical-memory designs do exactly this: satisfy from Tier 1, else recurse to Tier 2/3).
  • Tool error / dead-end: a failed action signals missing context.
  • Metadata cues: file names, folder hierarchy, timestamps, file sizes act as relevance signals guiding what to pull (Anthropic).
  • Heartbeats / continuation: MemGPT's request_heartbeat chains tool calls to fetch external memory before responding.
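
The deterministic triggers in this list (explicit references and tool errors) can be detected outside the model. The sketch below covers only those two; the regexes for file and ticket references are assumptions about naming conventions, and model-side signals such as self-assessed retrieval misses are out of its scope.

```python
import re

# Hypothetical reference formats: relative file paths and JIRA-style ticket IDs.
FILE_REF = re.compile(r"\b[\w./-]+\.(?:md|py|json|txt|yaml)\b")
TICKET_REF = re.compile(r"\b[A-Z]{2,}-\d+\b")


def detect_triggers(user_message, last_tool_error=None, loaded=frozenset()):
    """Return (trigger_kind, detail) pairs that should cause a deferred load."""
    triggers = []
    for ref in FILE_REF.findall(user_message) + TICKET_REF.findall(user_message):
        if ref not in loaded:  # skip anything already in context
            triggers.append(("explicit_reference", ref))
    if last_tool_error:
        triggers.append(("tool_error", last_tool_error))
    return triggers


found = detect_triggers(
    "Why does docs/deploy.md contradict OPS-142?",
    last_tool_error="FileNotFoundError: config.yaml",
    loaded={"docs/deploy.md"},
)
```

Passing the set of already-loaded items keeps the same reference from being fetched twice within a session.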

Q2.4 — Graceful degradation for a non-expert user (cheap boot)

  • Always answer from the tier you have, and explicitly fall back: subagent/memory designs should "gracefully return 'no information found'" rather than hallucinate (Microsoft Copilot Studio guidance; test with out-of-domain queries).
  • Self-refresh on demand: a /catchup-style routine rebuilds context by reading changed files only when needed.
  • Progressive summaries keep a usable answer available even when raw history is evicted; preserve operational state (variable names, file paths, IDs) in the summary so workflows resume (arXiv:2602.21351).
  • For non-experts, the boot should not require them to specify what to load — defaults + automatic triggers do the work.

Q2.5 — Cheapest reliable way to recover the recent tail

  • Append-only event log + checkpointing: persist state to external storage so the agent resumes "after interruption or failure without replaying full history" (Airbyte; Redis "Long-Horizon AI Agents"). Read only the slice after the last checkpoint.
  • Sliding window: retain the last N turns at full resolution, drop/summarize the rest (cheapest; established).
  • Provider session state: OpenAI's previous_response_id / Agents SDK sessions and Responses API chaining recover continuity without resending history.
  • Caveat (contested/real): naive summarization-middleware can cause "unbounded checkpoint growth" if it never trims state.messages (LangChain deepagents issue #2876) — recovery design must actually evict, not just summarize.
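
The log-plus-checkpoint approach can be sketched with a JSON Lines file and a byte offset. `EventLog` is a hypothetical helper, not any framework's API; it assumes a single writer and that the checkpoint offset is persisted somewhere alongside the summary it corresponds to.

```python
import json
import tempfile
from pathlib import Path


class EventLog:
    """Append-only JSONL log whose tail can be read from a saved byte offset."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.touch()

    def append(self, event):
        # Binary mode keeps byte offsets exact regardless of platform newlines.
        with self.path.open("ab") as f:
            f.write(json.dumps(event).encode("utf-8") + b"\n")

    def checkpoint(self):
        """Current end-of-log offset; store it with the summary made at this point."""
        return self.path.stat().st_size

    def tail_since(self, offset):
        """Read only the events appended after the checkpoint."""
        with self.path.open("rb") as f:
            f.seek(offset)
            return [json.loads(line) for line in f if line.strip()]


log = EventLog(Path(tempfile.mkdtemp()) / "events.jsonl")
for turn in range(3):
    log.append({"turn": turn, "text": f"message {turn}"})
saved_offset = log.checkpoint()
for turn in range(3, 5):
    log.append({"turn": turn, "text": f"message {turn}"})
recent = log.tail_since(saved_offset)
```

Recovery cost is proportional to what was written after the checkpoint, so a session that has run for months boots as cheaply as one that started an hour ago.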

Techniques (6-part format)

Technique 7 — Compaction (ESTABLISHED)

  1. One-liner: Summarize the conversation near the window limit and restart from the summary.
  2. How it works: Pass message history to the model to distill architectural decisions, unresolved bugs, key facts while discarding redundant tool outputs; continue with summary + few most-recent files (Anthropic).
  3. Why non-obvious: It's the first lever and handles all context growth, not just tool results; "tune the prompt by maximizing recall first, then precision."
  4. Worked example: Claude Code compaction = summary + five most recently accessed files.
  5. Failure modes: "Overly aggressive compaction can lose subtle but critical context whose importance only becomes apparent later"; summarization can hallucinate.
  6. Adoption cost: Low–med; built into several agent frameworks.
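
A minimal compaction loop looks like the sketch below. It is not Claude Code's implementation: the four-characters-per-token estimate is a crude assumption, and `summarize` is an injected callable standing in for the model call that would produce the real summary.

```python
def estimate_tokens(text):
    """Very rough heuristic (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def compact(messages, budget_tokens, keep_tail, summarize):
    """Summarize everything except the last `keep_tail` messages once over budget."""
    if keep_tail < 1:
        raise ValueError("keep_tail must be at least 1")
    total = sum(estimate_tokens(m) for m in messages)
    if total <= budget_tokens or len(messages) <= keep_tail:
        return list(messages)  # nothing to do yet
    head, tail = messages[:-keep_tail], messages[-keep_tail:]
    return ["[summary] " + summarize(head)] + tail


def stub_summarizer(head):
    """Placeholder for an LLM summarization call."""
    return f"{len(head)} earlier messages condensed"


messages = ["x" * 400 for _ in range(10)]  # roughly 100 tokens each
compacted = compact(messages, budget_tokens=500, keep_tail=3, summarize=stub_summarizer)
```

Keeping the tail verbatim while summarizing only the head limits the blast radius of a bad summary to older material.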

Technique 8 — Structured note-taking / file-system-as-memory (ESTABLISHED)

  1. One-liner: The agent writes notes to disk (NOTES.md/TODO) and reads them back later.
  2. How it works: Persist progress outside the window; pull back in as needed. Anthropic shipped a file-based memory tool (public beta, Sonnet 4.5 launch).
  3. Why non-obvious: Minimal-overhead persistent memory; "Claude playing Pokémon" maintained tallies across thousands of steps and resumed after context resets.
  4. Worked example: A migration agent records "done/next/blocked" after each module; boot reads only that file.
  5. Failure modes: Notes can drift or go stale; the agent must be disciplined about updating them.
  6. Adoption cost: Low.

Technique 9 — Tiered memory stores: MemGPT/Letta & mem0 (ESTABLISHED tooling, EMERGING field)

  1. One-liner: OS-style memory hierarchy with self-editing and retrieval across tiers.
  2. How it works: MemGPT/Letta: in-context core memory + external recall/archival, with memory-edit tools and recursive summary on overflow. mem0: extract→consolidate (ADD/UPDATE/DELETE/NOOP)→retrieve via vector/graph; layered conversation/session/user/org.
  3. Why non-obvious: The agent manages its own memory paging. The Mem0 paper (arXiv:2504.19413, ECAI 2025) reports that on the LoCoMo benchmark "Mem0 attains a 91% lower p95 latency and saves more than 90% token cost" (p95 1.44s vs. 17.12s for full-context, ~6,956 vs. ~26,000 tokens per retrieval), while scoring 26% higher in relative terms on LLM-as-Judge accuracy than OpenAI Memory (66.9% vs. 52.9%).
  4. Worked example: "User prefers aisle seats" stored once, retrieved on later travel queries instead of re-reading all history.
  5. Failure modes: Retrieval quality is the bottleneck ("what happened last Monday" retrieves poorly); multi-hop accuracy trade-offs; memory bloat without consolidation; PII risk (mem0 is "retrievable by design").
  6. Adoption cost: Med; SDK integration in an afternoon (mem0 self-host ~20 min), but tuning retrieval and governance is ongoing.
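
The ADD/UPDATE/DELETE/NOOP consolidation step can be illustrated with a toy key-value store. This is not mem0's API: in the systems described above an LLM decides which operation applies to a candidate fact, whereas this sketch substitutes exact key matching so the control flow is visible.

```python
def consolidate(store, key, value):
    """Apply one candidate fact to the store and return the operation taken."""
    if value is None:  # the fact was retracted
        if key in store:
            del store[key]
            return "DELETE"
        return "NOOP"
    if key not in store:
        store[key] = value
        return "ADD"
    if store[key] == value:  # already known, avoid memory bloat
        return "NOOP"
    store[key] = value       # newer information replaces the old fact
    return "UPDATE"


memory = {}
operations = [
    consolidate(memory, "seat_preference", "aisle"),
    consolidate(memory, "seat_preference", "aisle"),
    consolidate(memory, "seat_preference", "window"),
    consolidate(memory, "home_airport", None),
    consolidate(memory, "seat_preference", None),
]
```

The NOOP branch is what prevents the bloat failure mode: repeated mentions of the same preference do not create repeated memories.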

Technique 10 — Rolling / hierarchical summaries (ESTABLISHED pattern, contested reliability)

  1. One-liner: Condense older history into multi-granularity summaries (turn→session→topic).
  2. How it works: Recent messages at full resolution; older compressed by a fast model; recursive summarization folds new evictions into existing summaries; hierarchical trees keep coarse-to-fine summaries with child pointers.
  3. Why non-obvious: Hierarchy lets boot load a coarse summary and recurse only into the relevant subtree (MemTree, HMO).
  4. Worked example: Boot loads a one-paragraph session summary; only on a specific question does it expand the relevant sub-summary into raw detail (HiAgent restores detailed action-observation pairs for a retrieved subgoal).
  5. Failure modes: "Rolling summaries are seductive because they feel clean — they're not"; lossy summaries silently drop constraints; hard truncation "discards content by recency rather than relevance."
  6. Adoption cost: Low–med.
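
The coarse-to-fine idea can be sketched as a small tree in which boot reads only the root summary and expansion follows child pointers. `SummaryNode` and `expand` are hypothetical names, and lookup by exact topic string stands in for whatever relevance matching a real system such as MemTree would use.

```python
from dataclasses import dataclass, field


@dataclass
class SummaryNode:
    topic: str
    summary: str
    children: list = field(default_factory=list)
    raw: list = field(default_factory=list)  # full-fidelity turns, kept on leaves only


def expand(root, topic):
    """Find a topic; return its raw turns if it is a leaf, else its child summaries."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.topic == topic:
            return node.raw or [child.summary for child in node.children]
        stack.extend(node.children)
    return []


auth = SummaryNode("auth", "Switched to token refresh.", raw=["user: login fails", "agent: added refresh"])
billing = SummaryNode("billing", "Invoice rounding bug is still open.", raw=["user: totals off by 1 cent"])
session = SummaryNode("session", "Worked on auth and billing.", children=[auth, billing])

boot_view = session.summary          # all that is loaded at boot
detail = expand(session, "billing")  # pulled only when billing comes up
```

Only the subtree that matches the question is ever expanded, so untouched topics cost nothing beyond their share of the root summary.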

Technique 11 — Sub-agent context isolation (ESTABLISHED)

  1. One-liner: Spawn fresh-context subagents for focused work; return only a distilled summary.
  2. How it works: Orchestrator keeps a high-level plan; each subagent explores in its own clean window (tens of thousands of tokens) and returns ~1,000–2,000 tokens (Anthropic multi-agent research system).
  3. Why non-obvious: Keeps the lead agent's window clean; "work should only be split when context can be truly isolated" (else a lossy "telephone game").
  4. Worked example: A research lead spawns per-source subagents; only condensed findings re-enter the main context.
  5. Failure modes: Coordination overhead; parent must manage dependency graph; splitting by problem type causes information loss across handoffs.
  6. Adoption cost: Med; emerging research (e.g., DACS, arXiv:2604.07911) refines per-agent isolation but it's not yet standardized.

Technique 12 — RAG-at-boot / just-in-time retrieval (ESTABLISHED, with caveats)

  1. One-liner: At session start, retrieve only context relevant to the first message; fetch more on demand.
  2. How it works: Index history/docs in a vector store; use the user's first message as the query to pull top-k; maintain a structured user profile that persists across sessions (Propelius).
  3. Why non-obvious: Replaces "read everything" with "read what's relevant now"; mirrors human use of external indices (Anthropic).
  4. Worked example: Support agent boots by retrieving the user's profile + last unresolved ticket, not the full transcript archive.
  5. Failure modes: Retrieval misses; "lost in the middle" if too many chunks; Chroma shows even retrieved long context rots; embeddings may surface stale memories.
  6. Adoption cost: Med; standard RAG infra plus eval.
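
Retrieval at boot reduces to "score stored items against the first message and keep the top k." The sketch below swaps in plain keyword overlap for the vector search described above, purely so it runs without dependencies; the memories are invented examples.

```python
import re


def tokens(text):
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, memories, k=2):
    """Return up to k memories ranked by keyword overlap with the query."""
    query_tokens = tokens(query)
    scored = []
    for memory in memories:
        overlap = len(query_tokens & tokens(memory))
        if overlap:
            scored.append((overlap, memory))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep input order
    return [memory for _, memory in scored[:k]]


memories = [
    "User prefers aisle seats on flights",
    "Ticket 4521: refund for flight still unresolved",
    "User's favourite editor is vim",
]
boot_hits = retrieve("Any update on my flight refund?", memories, k=1)
```

Even in this toy version the first memory earns a spurious point from the shared word "on", a small reminder that retrieval quality, not storage, is the bottleneck noted in the failure modes.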

Recommendations (staged, with thresholds that change them)

Topic 1 — shrinking the always-on file:

  1. First (this week, low risk): Turn on prompt caching and reorder the file so stable content is the cached prefix and volatile content is last. On cache hits this alone removes up to ~90% of the repeated input cost (cache reads bill at 0.1× base) with zero behavior change. Then do a structural rewrite (Technique 6): delete boilerplate, dedup, move style/lint rules to hooks. Threshold to go further: if /context-style inspection still shows the file eating >10–15% of the window, or instruction count exceeds ~100–150, continue.
  2. Second (1–2 weeks): Apply progressive disclosure (Technique 1). Split into a minimal inline core (rules, persona, safety, output contract, 2–5 canonical examples, trigger index) plus on-demand reference/skill files. This is the highest-leverage size cut. Benchmark: inline core under ~300 lines / well under the model's instruction-following threshold.
  3. Third (only if justified by volume): Consider token-level compression (Technique 3) on verbose reference blocks — never on hard rules, schemas, or few-shot classifiers. Consider gisting/context distillation (Techniques 4–5) only if you self-host weights, behavior is very stable, and call volume is huge.
  4. Always: Build a frozen, capability-sliced regression suite before any change and re-run on every edit. Treat any per-capability regression (not just aggregate) as a blocker. Use verbatim/extractive compaction over abstractive summarization when fidelity of specifics (paths, IDs, numbers) matters.
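
The "per-capability, not just aggregate" rule in item 4 can be sketched as a small gate. This is an illustrative sketch under assumptions: each eval result is reduced to a hypothetical `(capability, passed)` pair, and the capability names and counts below are invented sample data, not measurements.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (capability, passed) pairs -> {capability: pass rate}."""
    totals, passes = defaultdict(int), defaultdict(int)
    for capability, passed in results:
        totals[capability] += 1
        passes[capability] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(baseline, candidate, tolerance=0.0):
    """Capabilities whose pass rate dropped by more than `tolerance`.

    A capability missing from the candidate run counts as a pass rate of 0.0.
    """
    base, cand = pass_rates(baseline), pass_rates(candidate)
    return {c: (base[c], cand.get(c, 0.0))
            for c in base if cand.get(c, 0.0) < base[c] - tolerance}


def runs(capability, passed, total):
    """Helper that builds sample results: `passed` passes out of `total` cases."""
    return [(capability, i < passed) for i in range(total)]


# Hypothetical frozen-suite results before and after shrinking the config.
BASELINE = runs("formatting", 4, 4) + runs("refusals", 4, 4) + runs("tool_use", 2, 4)
CANDIDATE = runs("formatting", 4, 4) + runs("refusals", 2, 4) + runs("tool_use", 4, 4)


def aggregate(results):
    """Overall pass rate across all capabilities."""
    return sum(passed for _, passed in results) / len(results)


blocked = regressions(BASELINE, CANDIDATE)
print("aggregate unchanged:", aggregate(BASELINE) == aggregate(CANDIDATE))
print("blocking regressions:", blocked)
```

In this sample the aggregate pass rate is identical before and after, yet the "refusals" slice has dropped, which is exactly the case an aggregate-only check would wave through.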

Topic 2 — lazy boot:

  1. First: Replace "read history files in full" with a fixed minimal read-set — core instructions + rolling summary + recent tail (last N turns) + pointers. Threshold: if boot reads more than a few thousand tokens of history, you're over-reading.
  2. Second: Add append-only logging + checkpointing so the tail is recovered by reading only the post-checkpoint slice; add explicit triggers (named references, retrieval misses, tool errors) to pull deferred context just-in-time.
  3. Third: If sessions span users/long horizons, adopt a tiered memory store (Letta or mem0) for cross-session memory; add sub-agent isolation for high-volume sub-tasks. Benchmark to adopt mem0-style memory: multi-session workloads where full-context retrieval exceeds ~20K tokens/call.
  4. Guardrails: Prefer logs+checkpoints over rolling summaries for correctness-critical tail recovery (summaries are lossy); always test out-of-domain/empty-memory paths so a cheap boot degrades to "I don't have that yet — fetching" rather than confident hallucination.
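
The log-plus-checkpoint boot in items 1, 2, and 4 can be sketched as below. It is a minimal sketch under assumptions: events are stored as JSON lines in a local file, the checkpoint records a byte offset plus a summary string, and the function names, file names, and sample events are hypothetical.

```python
import json
import os
import tempfile


def append_event(log_path, event):
    """Append one event as a JSON line; the log is never rewritten."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def write_checkpoint(log_path, ckpt_path, summary):
    """Record the current end of the log and a summary of everything before it."""
    offset = os.path.getsize(log_path)
    with open(ckpt_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset, "summary": summary}, f)


def boot(log_path, ckpt_path):
    """Load the checkpoint summary plus only the events written after it."""
    ckpt = {"offset": 0, "summary": ""}
    if os.path.exists(ckpt_path):
        with open(ckpt_path, encoding="utf-8") as f:
            ckpt = json.load(f)
    tail = []
    if os.path.exists(log_path):
        with open(log_path, "rb") as f:
            f.seek(ckpt["offset"])  # skip everything already summarized
            data = f.read().decode("utf-8")
        tail = [json.loads(line) for line in data.splitlines() if line]
    return {"summary": ckpt["summary"], "tail": tail}


# Hypothetical session: two turns, a checkpoint, then one more turn.
workdir = tempfile.mkdtemp()
log = os.path.join(workdir, "events.jsonl")
ckpt = os.path.join(workdir, "checkpoint.json")
append_event(log, {"turn": 1, "text": "use staging DB only"})
append_event(log, {"turn": 2, "text": "schema migrated"})
write_checkpoint(log, ckpt, "Constraint: staging DB only. Schema migrated.")
append_event(log, {"turn": 3, "text": "run report for Q2"})

state = boot(log, ckpt)
print(state["summary"])
print(state["tail"])
```

Because the tail is read verbatim from the log, anything after the checkpoint is recovered exactly; only the pre-checkpoint history depends on the (lossy) summary, and the full log remains available if a deferred detail is needed later.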

Caveats

  • Vendor-reported numbers are marketing-adjacent. The 90% cost / 85% latency (caching) and 90% token / 91% latency (mem0) figures come from the vendors' own announcements/papers and specific benchmarks (a 100K-token book prompt for caching; LoCoMo for mem0); your workload will differ. Treat them as upper bounds to validate, not guarantees.
  • Context-rot thresholds are not hard cliffs. Chroma/observers note degradation is continuous and model-dependent (often noticeable around 300–400K tokens for 1M-window models, but earlier for harder semantic tasks). "Your context window is big enough" is the wrong mental model; signal-to-noise is.
  • Compression fidelity is task-specific and contested. The same technique that's lossless on summarization can collapse on structured/few-shot tasks. There is no universal safe compression ratio.
  • Learned compression and context distillation are not black-box-API options and freeze behavior at training time; Askell et al. explicitly note an "alignment tax" and that distillation matches prompting only on "many but not all" evals.
  • Some sources are practitioner blogs, not peer-reviewed; I have anchored the load-bearing claims to primary sources (Anthropic engineering posts, arXiv papers, vendor docs) and flagged secondary ones inline.
  • The field is moving fast (Agent Skills became an open standard only in Dec 2025; sub-agent isolation refinements like DACS are 2026 preprints). Re-validate tooling specifics before committing.
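
The first caveat can be checked against your own workload with simple arithmetic. The sketch below is a rough model under stated assumptions: the 0.1× cache-read multiplier is the figure cited in this report, the 1.25× cache-write multiplier is an assumed default that should be verified against current vendor pricing, every turn after the first is assumed to hit the cache (no expiry), and output-token cost is ignored. The function name and token counts are hypothetical.

```python
def cache_savings(prefix_tokens, suffix_tokens, turns,
                  write_mult=1.25, read_mult=0.10):
    """Fraction of input-token cost saved by caching a stable prefix.

    Costs are in units of "base input price per token", so the result
    is price-independent. Assumes one cache write on the first turn and
    a cache hit on every later turn.
    """
    uncached = turns * (prefix_tokens + suffix_tokens)
    cached = (prefix_tokens * write_mult                  # first turn writes the cache
              + (turns - 1) * prefix_tokens * read_mult   # later turns read it
              + turns * suffix_tokens)                    # volatile suffix is never cached
    return 1 - cached / uncached


# Hypothetical workload: 10K-token stable config, 1K-token volatile suffix.
for turns in (1, 5, 20, 100):
    print(turns, round(cache_savings(10_000, 1_000, turns), 3))
```

Under these assumptions a 20-turn session saves about 77% of input cost rather than 90%, and a single-turn session costs more with caching than without, because the write premium is never amortized. The 90% figure is approached only when the cached prefix dominates the prompt and the session is long.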

Consolidated Source Map

Anthropic primary (context engineering, skills, memory):

  • Effective context engineering for AI agents — anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • Equipping agents for the real world with Agent Skills — anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
  • Agent Skills docs — platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
  • When to use multi-agent systems — claude.com/blog/building-multi-agent-systems-when-and-how-to-use-them
  • Context engineering cookbook — platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools

Prompt caching:

  • Anthropic prompt caching docs (pricing multipliers) — platform.claude.com/docs/en/build-with-claude/prompt-caching
  • Anthropic prompt caching launch (90%/85%, 100K book 2.4s vs 11.5s) — anthropic.com/news/prompt-caching
  • ProjectDiscovery 59% cost cut — projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching

Prompt/context compression (research):

  • LLMLingua / LLMLingua-2 / LongLLMLingua — microsoft.com/en-us/research/project/llmlingua; github.com/microsoft/LLMLingua
  • Gisting (Mu, Li, Goodman) — arxiv.org/abs/2304.08467
  • Context distillation: Askell et al. 2021 (arXiv:2112.00861, §2.1 "Context Distillation"); Snell, Klein & Zhong 2022 (arXiv:2209.15189, "Learning by Distilling Context")
  • Prompt Compression in the Wild — arxiv.org/abs/2604.02985
  • Prompt Compression survey — arxiv.org/abs/2410.12388

Position / degradation effects:

  • Lost in the Middle (Liu et al., TACL 2024); Found in the Middle — arxiv.org/abs/2406.16008
  • Context Rot (Chroma; Hong, Troynikov, Huber, July 2025) — trychroma.com/research/context-rot
  • IFScale (Jaroslawicz et al., instruction density) — arxiv.org/abs/2507.11538
  • Prospective Memory Failures — arxiv.org/abs/2603.23530

Evaluation / regression:

  • Risk-based test framework — arxiv.org/abs/2601.17292
  • RETAIN (EMNLP 2024 demo) — aclanthology.org/2024.emnlp-demo.31
  • When "Better" Prompts Hurt — arxiv.org/abs/2601.22025
  • Braintrust LLM eval guide — braintrust.dev/articles/llm-evaluation-guide

Memory systems / boot / lazy loading:

  • MemGPT/Letta — letta.com/blog/agent-memory; arXiv MemGPT (Packer et al.)
  • mem0 (LoCoMo: 91% lower p95 latency, >90% token savings) — arxiv.org/abs/2504.19413; docs.mem0.ai
  • Hierarchical memory (HMO, MemTree, HiAgent arXiv:2408.09559)
  • OpenAI Agents SDK session memory — cookbook.openai.com/examples/agents_sdk/session_memory
  • Redis long-horizon agents — redis.io/blog/long-horizon-ai-agents-memory-state-infrastructure
  • Sub-agent isolation (DACS) — arxiv.org/abs/2604.07911
  • CLAUDE.md practices — humanlayer.dev/blog/writing-a-good-claude-md

State of the art as of today (Topic 1)

As of June 2026, the established, low-risk path to shrinking an always-on file is structural, not lexical: move rarely-used content behind progressive-disclosure pointers (Agent Skills, now a cross-vendor open standard adopted by Claude Code, OpenAI Codex CLI, Gemini CLI, GitHub Copilot, and Cursor), keep a minimal high-salience inline core, and use prompt caching so the stable prefix costs 0.1× the base input price. Token-level compression (LLMLingua family) is mature tooling but situational — strong on verbose RAG/reference text, dangerous on structured, format-critical, or few-shot-dependent instructions, where independent 2026 evals show 50%+ accuracy collapses. Learned compression (gisting) and context distillation (Askell 2021; Snell 2022) are powerful for teams that control model weights but inapplicable to black-box APIs and frozen at training time with a measurable "alignment tax." The field consensus, echoed by Anthropic, is "find the smallest set of high-signal tokens," validated by frozen regression suites and capability-sliced eval rather than by aggregate scores — because the best-documented danger is silent per-capability regression from changes that look safe.

State of the art as of today (Topic 2)

Lazy/tiered boot has converged on an OS-style memory hierarchy + just-in-time retrieval. The cheap-correct boot loads core instructions, a rolling/hierarchical summary, the recent raw tail, and pointers — deferring everything else and pulling it on explicit reference, retrieval miss, or tool error. Compaction, structured note-taking/file-as-memory, and sub-agent context isolation are established and shipping in production agents; tiered memory stores (MemGPT/Letta, mem0) productize the pattern with large reported token/latency savings (mem0 cites a 91% lower p95 latency and >90% token cost savings on LoCoMo), though retrieval quality, multi-hop accuracy, and memory governance remain open problems. The strongest cross-cutting evidence (Chroma's context rot, lost-in-the-middle, IFScale) says the payoff isn't just cost: a lean, well-ordered boot measurably improves reliability. The main contested area is summarization reliability — rolling summaries are lossy and can silently drop still-active constraints — so append-only logs plus checkpoints (read only the post-checkpoint slice) are the most defensible way to recover the recent tail.
