ChatGPT :: Week10 :: Special Series :: AI Task Delegation Research :: Shrinking Re-read Context in AI Agents Without Losing Fidelity
-
Deep Research request. Be thorough, cite sources, prioritize reliable information from the last ~24 months.
OBJECTIVE: Identify the most effective and most NON-OBVIOUS techniques for shrinking a large
instruction/configuration file that an AI agent re-reads at the start of every single turn (so its size is
paid repeatedly), WITHOUT losing any behavioral fidelity.
Research and answer ALL of the following up front (do not pause to ask me; state any assumptions you make):
1. What categories of content can be moved out and loaded on demand vs. what must stay inline to preserve behavior?
2. Which concrete techniques cut size the most — across structure, references/pointers, externalization, summarization, and encoding/compression?
3. How do practitioners PROVE nothing was lost after compression (regression/parity/eval methods)?
4. Which "compressions" look safe but silently change model behavior (the traps)?
5. What current tools, papers, or established patterns address this (e.g. prompt/context compression, context distillation, instruction caching)? Name them with sources.
SOURCES & RECENCY: Favor credible, recent (≈last 24 months) sources; cite every non-obvious claim with a link;
distinguish established practice from experimental; flag thin/contested evidence.
OUTPUT: For each distinct technique, use: (1) Idea one-liner (2) How it works (3) Why non-obvious (4) Concrete
worked example (5) Failure modes (6) Adoption cost (low/med/high + why). End with a Source map and a
"State of the art as of today" paragraph.
CONSTRAINT: Answer generically (universal AI-agent design). Do not ask me for proprietary details.
Shrinking Re-read Context in AI Agents Without Losing Fidelity
Assumptions and headline answers
Assumptions: the agent has a stable “control plane” of instructions or configuration, can persist some state outside the prompt, and may use tools or retrieval. “Behavioral fidelity” means preserving the agent’s decisions, refusals, tool choices, schema conformance, and task outcomes—not reproducing identical wording token-for-token. That is how current agent-eval practice treats correctness: the important check is usually the end state, required tool use, and trace constraints, not verbatim transcript similarity.
The highest-confidence answer today is not “find one magical compressor.” It is to split context into a tiny invariant inline core plus progressively disclosed modules, make repeated prefixes cache-friendly, checkpoint or persist session state instead of replaying logs, use rolling summaries plus a recent tail for continuity, and retrieve older or bulky material only when it is actually needed. Recent vendor guidance increasingly frames this as context engineering: treat context as a finite resource, keep only the smallest high-signal token set inline, and retrieve or summarize the rest just in time.
For pure prompt compression, the strongest recent empirical finding is that extractive, query-aware compression is usually safer than abstractive rewriting when you care about preserving behavior. A 2024 characterization found extractive methods often outperformed abstractive summarization and token-pruning baselines, sometimes enabling up to 10× compression with minimal degradation; a 2025 multi-dataset study found moderate compression can even improve long-context performance by removing noise; and a 2026 production trial found that moderate compression saved money, while aggressive compression sometimes backfired because output length grew.
The most non-obvious trap is that some things that feel like compression are really silent behavior changes. Common failure modes include: forgetting to resend top-level instructions when using response-chaining APIs, abstractive summaries that weaken modal force or drop negative constraints, “stateful” APIs that simplify engineering but still bill prior tokens, aggressive compression that lowers input tokens but increases output tokens, and progressive-disclosure systems whose routing metadata becomes too vague or too truncated to trigger the right module.
What must stay inline and what should defer
What must stay inline is the control plane: hard safety or business invariants, the role/authority structure, required output schema or strict tool schema, the current user task, unresolved work, and any immediate negative constraints or corrections that would change the next action if omitted. OpenAI’s docs explicitly distinguish higher-priority developer instructions from user input, recommend reserving absolute rules like “ALWAYS” and “NEVER” for true invariants, and provide Structured Outputs so formatting requirements can be enforced by schema instead of long prose. Anthropic’s compaction guidance likewise treats user corrections, exact identifiers, errors, and in-progress state as “always preserve” material.
What should be loaded eagerly at boot—but not necessarily inline forever—is the minimal read-set for correctness: the inline control plane, the current task or latest user turn, a compact summary of prior work, the recent tail of interaction, unresolved commitments, and a thin layer of durable memory such as stable preferences or open tasks. Anthropic’s long-horizon guidance describes compaction as preserving architectural decisions, unresolved bugs, implementation details, and recent working context; OpenAI’s long-term-memory cookbook recommends injecting only the relevant subset of durable memory after trimming, with stable facts promoted into structured fields and volatile facts kept as notes with recency weighting or TTLs.
What should defer to on-demand loading is everything bulky, low-probability, or reconstructible: long procedures, long examples, full skill bodies, large tool schemas, old logs, raw tool outputs, archival documents, and broad knowledge bases. Current skills systems in both OpenAI and Anthropic use progressive disclosure: the agent starts with skill metadata and only reads the full SKILL.md body when it decides to use that skill; both platforms also now support loading tools on demand instead of front-loading their full definitions. Anthropic’s context guidance also calls clearing raw tool results one of the safest “light-touch” compactions.
What should usually never be replayed raw is filler and already-consumed execution exhaust: pleasantries, acknowledgments, duplicate examples, stale or superseded turns, and full raw tool results that have already been distilled into state. Anthropic’s compaction and memory guidance repeatedly recommends omitting filler, weighting recency, clearing old tool results, and re-injecting system context separately instead of redundantly copying it into summaries.
Techniques that shrink recurring instruction files
Progressive-disclosure modules and scoped instruction files — established.
Idea. Keep only a short routing layer inline—module names, descriptions, and scope—then load full instructions only when the task matches.
How it works. Current skills systems on both OpenAI and Anthropic expose metadata at startup and load the full skill body later; AGENTS.md adds another layer by letting nearby, narrower-scope files override broad defaults.
Why non-obvious. The “compression” is not lexical—it is architectural. The file body does not have to fight for startup tokens if the routing metadata is strong enough.
Worked example. Instead of a 3,000-token monolithic config that includes billing policy, code-style rules, PDF workflows, and escalation steps every turn, keep a 250-token core plus short module descriptors such as “billing-escalation,” “repo-style,” and “pdf-extract”; when the task is document extraction, only the PDF module body loads.
Failure modes. Vague descriptions lead to missed triggers; if you create too many skills, descriptions may be shortened to fit a budget and routing reliability falls; hidden dependencies between modules can also cause silent misses.
Adoption cost. Medium: you must rewrite monoliths into clean modules and treat descriptions as part of the routing contract.
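A minimal sketch of the routing layer described above, assuming a naive keyword-overlap router. The module names come from the worked example; the `Module` class, the `route` and `build_context` helpers, the core text, and the placeholder bodies are all hypothetical and stand in for whatever skill loader a real platform provides.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Module:
    name: str
    description: str               # routing metadata: always inline
    load_body: Callable[[], str]   # full instructions: fetched only on a match


# Hypothetical inline core and module catalog (illustrative only).
CORE = "Follow the safety invariants. Answer concisely."
MODULES = [
    Module("billing-escalation", "refund dispute invoice billing escalation",
           lambda: "<billing-escalation body>"),
    Module("repo-style", "code style lint formatting repository",
           lambda: "<repo-style body>"),
    Module("pdf-extract", "pdf document extraction tables text",
           lambda: "<pdf-extract body>"),
]


def route(task: str, modules: list[Module] = MODULES) -> list[Module]:
    """Pick modules whose description shares at least one word with the task."""
    words = set(task.lower().split())
    return [m for m in modules if words & set(m.description.split())]


def build_context(task: str) -> str:
    """Inline core + all descriptors + only the bodies of matched modules."""
    descriptors = "\n".join(f"- {m.name}: {m.description}" for m in MODULES)
    bodies = "\n".join(m.load_body() for m in route(task))
    return f"{CORE}\nModules:\n{descriptors}\n{bodies}\nTask: {task}"
```

The descriptors are the only thing every turn pays for, which is why vague descriptions show up directly as missed triggers in a router like this.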
Schema-first output and tool contracts — established.
Idea. Replace large prose blocks about formatting and parameter correctness with machine-enforced schemas.
How it works. Structured Outputs can guarantee response shape against JSON Schema; strict tool use can guarantee tool-call schema conformance; detailed tool descriptions carry semantic constraints more compactly and more reliably than loose prose.
Why non-obvious. Many “instruction files” are bloated not by behavior rules but by repetitive formatting language like “always return valid JSON with these fields.” That prose can often move into the protocol layer.
Worked example. Instead of a 400-token section that says the model must output {"decision","reasons","citations","next_action"} and never omit keys, define the schema in the API call and reduce the inline prompt to the decision policy itself.
Failure modes. Schemas constrain shape, not judgment; if the schema is over-rigid it can force awkward workarounds; and provider support varies.
Adoption cost. Low to medium: the implementation is straightforward, but you must separate “what the model should decide” from “what shape the answer should take.”
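As an illustration of the worked example, the four keys can be expressed as a JSON Schema object instead of prose. How the schema is attached to a request differs by provider and is not shown here; the field types and the `conforms` helper are assumptions made for this sketch, and the helper is a local check for this one flat schema, not a general JSON Schema validator.

```python
import json

# Assumed field types; the source only names the four keys.
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"type": "string"},
        "reasons": {"type": "array", "items": {"type": "string"}},
        "citations": {"type": "array", "items": {"type": "string"}},
        "next_action": {"type": "string"},
    },
    "required": ["decision", "reasons", "citations", "next_action"],
    "additionalProperties": False,
}

_PY_TYPES = {"string": str, "array": list}


def conforms(raw: str, schema: dict = DECISION_SCHEMA) -> bool:
    """Check a model response against the flat schema above."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Exactly the required keys: nothing missing, nothing extra.
    if not isinstance(obj, dict) or set(obj) != set(schema["required"]):
        return False
    for key, spec in schema["properties"].items():
        if not isinstance(obj[key], _PY_TYPES[spec["type"]]):
            return False
        if spec["type"] == "array" and not all(isinstance(x, str) for x in obj[key]):
            return False
    return True
```

A check like this is also useful as a deterministic grader when comparing a baseline prompt against a shortened one.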
Prefix-stable prompt caching — established, and often the biggest immediate win.
Idea. Do not rewrite the prompt first; make the repeated prefix byte-for-byte stable so the provider can cache it.
How it works. OpenAI and Anthropic both document prefix-based prompt caching. The common pattern is to place all static content—tools, system instructions, stable context, examples—at the beginning, put variable user content at the end, and avoid unnecessary edits to the cached prefix.
Why non-obvious. This does not reduce token count, but it can drastically reduce the repeated cost of a large prompt, which is exactly the pain point in “paid every turn” architectures.
Worked example. Freeze the first 6,000 tokens of instructions/examples/tools as a stable prefix; move timestamps, user turns, and dynamic context to the end; keep tool order and section order fixed.
Failure modes. Exact-prefix requirements are unforgiving; moving one block or toggling tool-choice settings can invalidate the cache; and on OpenAI, chaining with previous_response_id does not carry prior top-level instructions automatically, so those still need to be resent if they matter. Also, provider-managed state may simplify orchestration without eliminating token billing for prior context.
Adoption cost. Low: mostly prompt hygiene and request-shape discipline.
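One way to enforce the "static first, dynamic last" discipline is to assemble the prompt in a single function and fingerprint the static part, so an accidental edit to the prefix shows up as a changed hash in logs or tests. This is a sketch under assumptions: `build_request` and its parameters are hypothetical, and the actual cache behavior is decided by the provider, not by this code.

```python
import hashlib
import json


def build_request(static_blocks: list[str],
                  tools: list[dict],
                  dynamic_blocks: list[str]) -> tuple[str, str]:
    """Return (prompt, prefix_fingerprint) with static content placed first."""
    # Serialize each tool deterministically so dict key order cannot change
    # the bytes. Tool order itself is kept exactly as the caller supplied it.
    tool_text = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    prefix = "\n\n".join([tool_text, *static_blocks])
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    # Timestamps, user turns, and other variable content go after the prefix.
    prompt = prefix + "\n\n" + "\n\n".join(dynamic_blocks)
    return prompt, fingerprint
```

Logging the fingerprint per request gives a cheap signal: if it changes between turns that were supposed to share a prefix, the cache was probably invalidated.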
Deferred tool loading and tool search — established and highly non-obvious.
Idea. Treat tool definitions as a searchable catalog, not startup baggage.
How it works. Both Anthropic and OpenAI now support loading deferred tools at runtime. Anthropic’s tool-search docs describe initially showing only the search tool and non-deferred tools, then loading the 3–5 relevant tools when needed; OpenAI’s tool-search docs describe deferring functions, namespaces, or MCP servers and injecting loaded tools at the end of the context to preserve cache.
Why non-obvious. In many real agents, the worst context bloat comes from tool schemas, not English instructions. Anthropic reports a typical multi-server setup can spend roughly 55k tokens on tool definitions before work begins and that tool search usually reduces that by more than 85%; their embeddings cookbook reports 90%+ context reduction for large tool libraries.
Worked example. Keep crm, billing, and docs as high-level namespaces visible at boot; when the task mentions a refund and order history, only the refund and order functions load.
Failure modes. Poor namespacing or weak descriptions make discovery brittle; deferring too aggressively can hurt first-turn latency or miss tools; and some provider-specific features have compatibility constraints.
Adoption cost. Medium: you need a clean tool catalog, namespacing, and routing descriptions.
Query-aware extractive compression of static background text — strongest current general-purpose compressor.
Idea. If a long document must sometimes be present, delete low-value spans while preserving original wording for the spans that matter.
How it works. Recent prompt-compression work increasingly favors extractive, often query-aware approaches. LLMLingua-2 frames compression as token classification to preserve faithfulness and reports 2×–5× compression with 1.6×–2.9× end-to-end latency gains; LongLLMLingua adds question-aware compression and document reordering for long-context settings; a broader 2024 characterization found extractive compression often beat abstractive compression and token pruning; and a 2025 empirical study found moderate compression can improve long-context tasks by cutting noise.
Why non-obvious. People often assume “summarize” is safest. The recent evidence says the opposite for fidelity-sensitive tasks: keeping original surface form for selected spans is often safer than paraphrasing.
Worked example. A 20-page policy manual is indexed, then compressed only for the current query “Can I refund after shipment if the item is digital?” The compressor preserves the exact clauses on digital goods, post-shipment exceptions, and escalation thresholds, rather than rewriting them.
Failure modes. Open-ended, synthesis-heavy tasks may still need broader context; aggressive compression can increase output length and total cost; some query-aware methods must recompress for each question, which reduces cache reuse.
Adoption cost. Medium: the machinery is heavier than summarization, but the fidelity/cost trade-off is currently the best-supported.
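To make "extractive and query-aware" concrete, here is a deliberately crude sketch that scores sentences by term overlap with the query and keeps the top few verbatim, in original order. It is not LLMLingua-2 or LongLLMLingua, which use learned models; the stopword list, the `compress` function, and the sample policy text in the usage note are all assumptions for illustration.

```python
import re

# Small, assumed stopword list; a real system would use a proper one.
STOPWORDS = {"the", "a", "an", "is", "if", "i", "can", "after", "of",
             "to", "and", "or", "in", "for", "are", "be", "on"}


def terms(text: str) -> set[str]:
    """Lowercased content words of a text."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def compress(document: str, query: str, keep: int = 2) -> str:
    """Keep the `keep` sentences that share the most terms with the query.

    Selected sentences are returned unmodified and in document order, so the
    surface form of each retained clause is preserved exactly.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    q = terms(query)
    # Sort by overlap (descending), breaking ties by original position.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: (-len(q & terms(sentences[i])), i))
    chosen = sorted(ranked[:keep])
    return " ".join(sentences[i] for i in chosen)
```

For example, given a hypothetical four-sentence policy where only two sentences mention digital items, shipment, or refunds, `compress(policy, "Can I refund after shipment if the item is digital?")` returns those two sentences word for word and drops the rest. Because the output depends on the query, it has to be recomputed per question, which is the cache-reuse cost noted under failure modes.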
Fine-tuning and prompt baking for stable behavior; latent context distillation as experimental.
Idea. If instructions encode stable behavior rather than changing facts, move that behavior out of the prompt and into model weights or adapters.
How it works. OpenAI explicitly recommends fine-tuning to reduce prompt-engineering tokens when the real problem is behavioral consistency rather than missing context, and describes “prompt baking” as logging prompt traces from a pilot and pruning them into a training set. Newer research goes further: Generative Context Distillation and later latent-memory work compress prompts into lightweight parameter additions or modular adapters.
Why non-obvious. The shortest prompt is the one you no longer have to send, but that only works for stable behavior.
Worked example. A support agent’s 1,500-token style-and-policy block is replaced with a fine-tuned model trained on representative support traces, while changing refund policies still come from retrieval.
Failure modes. Non-representative training data can hard-code the wrong behavior; retrieval can still be necessary for changing facts; and recent latent-memory methods are promising but remain model-specific, harder to inspect, and much less established than fine-tuning or extractive compression.
Adoption cost. High: dataset work, evaluation maintenance, retraining, and vendor/model coupling.
Techniques that shrink boot-time history loading
For the “boot by rereading long logs” problem, the best generic answer is: boot from state, not from transcript. The minimal read-set is the inline control plane, a compact handoff summary, the unresolved task list, durable memory notes that are relevant now, and a short recent tail. Everything older should be reachable through retrieval or checkpoints, not replayed by default. Anthropic’s own long-horizon guidance describes compaction, note-taking, and sub-agents as the main strategies; OpenAI’s memory cookbook similarly recommends injecting only curated memory relevant to the new session.
Checkpoints and persistent conversation state — established.
Idea. Resume from a saved state object or checkpoint instead of replaying raw logs.
How it works. LangGraph persists graph state as checkpoints at every step, organized into threads; OpenAI’s Conversations API stores items inside a durable conversation object and previous_response_id lets later turns reference prior state; in both cases, the system can recover continuity from saved state rather than scanning files linearly.
Why non-obvious. Teams often inherit transcript replay from stateless chat patterns even when their framework already has a better persistence primitive.
Worked example. After a long support workflow, save the agent’s thread checkpoint containing open ticket IDs, last successful tool results, unresolved escalations, and a short handoff summary. On restart, load that checkpoint and only fetch older transcript slices if the next task explicitly references them.
Failure modes. On some APIs, statefulness does not remove model-side billing for prior context; top-level instructions may still need explicit resend; and checkpoint formats can become migration liabilities.
Adoption cost. Medium: easier if your framework already supports checkpoints, harder if your app is built around flat files.
Rolling summary plus recent tail — established and usually the cheapest reliable history strategy.
Idea. Keep a dense rolling summary of older history and append only the recent tail at boot.
How it works. Anthropic’s compaction recipes preserve user intent, completed work, errors, corrections, active work, pending tasks, and key references, while weighting recent messages more heavily and omitting filler. Their engineering blog says Claude Code continues after compaction with compressed context plus the most recently accessed files.
Why non-obvious. The key is not “write a short summary.” It is “force-preserve the fields whose loss causes behavioral drift”: identifiers, negative constraints, error messages, and in-progress state.
Worked example. Instead of replaying a 30,000-line build log and chat transcript, boot with a 900-token handoff containing exact file paths, failing test names, the user’s latest correction, the active branch, and the last five relevant messages or files.
Failure modes. If you paraphrase user corrections, the model reverts; if you summarize only for topical relevance, you lose state; if compaction is reactive, users feel the pause.
Adoption cost. Low to medium: the mechanism is simple, but summary-prompt design matters a lot.
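The "force-preserve" idea can be sketched as a handoff record with dedicated fields for the drift-prone items, rendered next to a short recent tail. The `Handoff` class, its field names, and `boot_context` are hypothetical; how the free-text summary is produced (by a model or by hand) is outside this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    summary: str                                                # may be paraphrased
    user_corrections: list[str] = field(default_factory=list)   # kept verbatim
    identifiers: list[str] = field(default_factory=list)        # paths, test names, IDs
    errors: list[str] = field(default_factory=list)             # exact error messages
    pending: list[str] = field(default_factory=list)            # in-progress work


def render_handoff(h: Handoff) -> str:
    """Render preserved fields as labeled lists so none is folded into prose."""
    lines = [f"Summary: {h.summary}"]
    sections = [
        ("User corrections (verbatim)", h.user_corrections),
        ("Identifiers", h.identifiers),
        ("Errors", h.errors),
        ("Pending", h.pending),
    ]
    for label, items in sections:
        if items:
            lines.append(f"{label}:")
            lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)


def boot_context(core: str, handoff: Handoff, messages: list[str], tail: int = 5) -> str:
    """Inline core + handoff + only the last `tail` messages."""
    recent = messages[-tail:] if tail > 0 else []
    return "\n\n".join([core, render_handoff(handoff),
                        "Recent messages:\n" + "\n".join(recent)])
```

Keeping corrections and identifiers in their own fields means a later re-summarization can rewrite `summary` freely without touching the strings whose exact wording matters.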
Memory distillation and curated memory injection — established pattern, still operationally tricky.
Idea. Pull durable facts, preferences, and long-lived commitments out of history and store them separately as notes or structured memory.
How it works. OpenAI’s long-term-memory cookbook describes a lifecycle of memory distillation, consolidation, and injection: stable facts can be promoted into structured fields, volatile ones remain notes with recency or TTL, and consolidation must do deduplication, conflict resolution, and forgetting before only the relevant subset is injected into the next session. Anthropic’s guidance describes structured note-taking and file-based memory outside the context window for the same purpose.
Why non-obvious. Durable memory is not “summary of everything.” It is a curated store of what should survive many sessions.
Worked example. From a multi-turn onboarding workflow, extract just the canonical customer ID, seat preference, active risk flags, unresolved task list, and last confirmed constraints; inject only those at next boot.
Failure modes. Context poisoning, duplicate facts, stale notes, and over-aggressive pruning.
Adoption cost. Medium: the write path is easy; the consolidation and forgetting policy is where most teams stumble.
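A minimal sketch of the consolidation and injection steps, assuming a "newest write wins" conflict rule and a per-note TTL for volatile facts. The `Note` class, its fields, and both helper functions are hypothetical; real memory systems will need richer conflict handling than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Note:
    key: str
    value: str
    written_at: float             # seconds on any monotonic clock
    ttl: Optional[float] = None   # None marks a stable fact that never expires
    tags: frozenset = frozenset()


def consolidate(notes: list[Note], now: float) -> list[Note]:
    """Deduplicate by key (newest write wins), then forget expired notes."""
    latest: dict[str, Note] = {}
    for note in sorted(notes, key=lambda n: n.written_at):
        latest[note.key] = note   # a later write replaces an earlier one
    return [n for n in latest.values()
            if n.ttl is None or now - n.written_at <= n.ttl]


def inject(notes: list[Note], task_tags: set[str]) -> list[Note]:
    """Select only the notes whose tags intersect the current task's tags."""
    return [n for n in notes if n.tags & task_tags]
```

Note the ordering choice: deduplication runs before expiry, so an expired newer note does not resurrect the older value it replaced.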
RAG at boot with contextual retrieval and reranking — established, but only as good as the retrieval stack.
Idea. Boot cheap, then retrieve only the prior context that is relevant to the new task.
How it works. Retrieval APIs, contextual embeddings, contextual BM25, and reranking all improve the odds that the few chunks you load are the right ones. Anthropic reports Contextual Retrieval reduces retrieval failures by 49%, and by 67% when combined with reranking; their contextual-embeddings cookbook reports roughly 35% lower top-20 retrieval failure and a jump from about 87% to about 95% Pass@10 on a codebase benchmark. LangChain’s ContextualCompressionRetriever patterns show the same idea operationally: retrieve broadly, then rerank and return only compressed high-signal docs.
Why non-obvious. The smart move is often not to compress logs first, but to index them well enough that you do not need to look at most of them.
Worked example. At boot, read only the current ticket and a handoff summary. If the user says “continue the shipping-refund case from yesterday,” retrieve the chunks tied to that ticket ID and rerank them against the current request before injecting anything.
Failure modes. Retrieval miss equals silent amnesia; bad chunking or ranking drowns the model in noise; and if the real problem is stable behavior rather than missing facts, RAG can make results worse.
Adoption cost. Medium: retrieval stacks are mature, but retrieval quality becomes a second optimization problem.
Subagents and side-context isolation — established for complex agents, and especially useful when the product serves non-expert users.
Idea. Let worker agents read the noisy stuff in their own context and return only a distilled result to the main agent.
How it works. Anthropic’s subagent docs explicitly recommend this when a side task would flood the main conversation with logs or search results; their context-engineering blog says subagents can consume tens of thousands of tokens and return a 1,000–2,000-token summary, keeping the lead agent’s window clean.
Why non-obvious. Sometimes the best “compression” is architectural isolation: don’t let exploratory junk enter the core context in the first place.
Worked example. The main agent starts with a short summary and current user request. If boot confidence is low because the user references an old bug, a log-analysis subagent scans the archived logs and returns “root cause, exact failing test, last attempted fix, remaining uncertainty.”
Failure modes. Important details may die in the worker summary; orchestration bugs can hide the need for raw evidence.
Adoption cost. Medium to high: very effective, but you must design coordination and evals for the summaries.
A practical answer to “how should the system decide what to defer?” is a simple utility score: eager-load items that are high-probability, high-consequence if missing, and cheap to keep; defer items that are bulky, lower-probability, or cheaply retrievable. Recent memory guidance suggests using stability, drift, contextual variance, confidence, recency, and TTL as the main signals. For “how do you detect deferred context is needed mid-task?”, the best triggers are references that cannot be grounded in current context—unknown IDs, “continue from before” language, contradictions, missing evidence for tool decisions, or the model’s need to use a tool because the answer is not already in context. That trigger set is partly an inference, but it is strongly aligned with current vendor patterns for just-in-time retrieval and tool use.
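The utility score can be written down in one line, although the exact form is a judgment call. The sketch below is an assumed formula, not one taken from any vendor guidance: it compares the expected cost of deferring an item (probability it is needed times the cost of a miss plus the cost of fetching it) against the cost of carrying its tokens every turn. All weights and the sample numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    tokens: int         # size if kept inline
    p_needed: float     # estimated chance a given turn needs it (0..1)
    miss_cost: float    # harm if it is needed but absent (arbitrary units)
    fetch_cost: float   # latency/effort to load it on demand (same units)


def should_eager_load(item: Item, token_weight: float = 0.01) -> bool:
    """Eager-load when expected deferral cost exceeds per-turn carrying cost.

    Assumed model:
        expected_deferral_cost = p_needed * (miss_cost + fetch_cost)
        carrying_cost          = tokens * token_weight
    """
    expected_deferral_cost = item.p_needed * (item.miss_cost + item.fetch_cost)
    carrying_cost = item.tokens * token_weight
    return expected_deferral_cost > carrying_cost
```

With these assumed units, a 200-token safety block that is needed every turn and costly to miss scores 100 against a carrying cost of 2 and stays inline, while a 1,500-token PDF workflow needed on 5% of turns scores 0.55 against 15 and is deferred.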
To make cheap boot degrade gracefully for non-expert users, do the expensive work in the background when possible: build summaries proactively, use automatic context compaction before the window is full, and silently retrieve older context when signals fire. Anthropic’s session-memory cookbook recommends background summary updates specifically to avoid user-visible pauses, and its automatic compaction flow turns large tool-heavy histories into a fresh summary window when thresholds are crossed.
How to prove parity and avoid silent regressions
The current best practice is to evaluate outcomes, traces, and costs together. Anthropic’s eval guidance defines a task, multiple trials, graders, transcripts, and the final environment outcome; it recommends state checks, tool-call assertions, transcript constraints, unit tests, and LLM rubrics depending on the agent. LangSmith describes the same stack in operational terms: curated datasets, evaluators that may be human, code-based, LLM-as-judge, or pairwise, then offline benchmarking and regression tests before online monitoring. OpenAI’s Evals framework and recent Codex guidance make the same point: treat skills and prompts like something you can test against traces and artifacts, not like vibes.
A high-confidence parity protocol looks like this. First, build a “must-not-break” suite from real production traces and edge cases. For each case, define success in layers: final state, required tool calls, hard formatting/schema checks, and any style or groundedness rubric. Run the baseline and compressed system on the same suite with multiple trials because agent behavior is stochastic. Compare pass rate, failure types, tool traces, latency, input tokens, and output tokens. Anthropic explicitly recommends regression suites with nearly 100% pass rates for already-solved cases, and its evals post treats output length and trace metrics as first-class tracked metrics. A recent OpenAI post on skill evals similarly recommends small but focused prompt sets, deterministic graders, JSON schemas for rubric outputs, and artifact inspection to catch regressions quickly.
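The layered comparison can be sketched with deterministic graders only. Everything here is hypothetical scaffolding: a "run" is assumed to be a dict holding the final state, the tool calls made, the parsed output, and token counts, and `grade`, `summarize`, and `parity` are illustrative names rather than functions from any eval framework. Rubric-based or LLM-judged layers are left out.

```python
def grade(run: dict, expected: dict) -> bool:
    """Layered deterministic check: final state, required tools, output keys."""
    state_ok = run["final_state"] == expected["final_state"]
    tools_ok = set(expected["required_tools"]) <= set(run["tool_calls"])
    schema_ok = set(expected["required_keys"]) <= set(run["output"])
    return state_ok and tools_ok and schema_ok


def summarize(runs: list[dict], expected: dict) -> dict:
    """Aggregate several trials of one case, since agent runs are stochastic."""
    n = len(runs)
    return {
        "pass_rate": sum(grade(r, expected) for r in runs) / n,
        "input_tokens": sum(r["input_tokens"] for r in runs) / n,
        "output_tokens": sum(r["output_tokens"] for r in runs) / n,
    }


def parity(baseline_runs: list[dict], candidate_runs: list[dict],
           expected: dict, tolerance: float = 0.0) -> dict:
    """Compare a compressed candidate against the baseline on one case."""
    base = summarize(baseline_runs, expected)
    cand = summarize(candidate_runs, expected)
    return {
        "baseline": base,
        "candidate": cand,
        "regressed": cand["pass_rate"] < base["pass_rate"] - tolerance,
        "output_tokens_grew": cand["output_tokens"] > base["output_tokens"],
    }
```

Tracking `output_tokens_grew` alongside `regressed` keeps the comparison from rewarding a candidate that passes every case but quietly costs more.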
The traps worth taking seriously are the ones that look safe. First, abstractive summaries of instructions can silently change normative force—“must” becomes “usually,” exceptions disappear, and user corrections get paraphrased. Anthropic’s compaction guidance is blunt: preserve user corrections verbatim, preserve exact identifiers, preserve in-progress state, and weight recent messages heavily.
Second, stateful APIs are not a complete substitute for instruction management. On OpenAI, previous_response_id does not carry previous top-level instructions, and previous input tokens in the chain are still billed as input tokens. That means response chaining can simplify app state without guaranteeing fidelity or cost savings unless your stable instructions are explicitly resent and your repeated prefix is cache-friendly.
Third, “compress more” is not a safe production heuristic. The 2026 production task-orchestration trial found moderate compression on the Pareto frontier, but aggressive compression was dominated because output length expanded enough to erase input savings. If you judge only by input-token reduction, you can think a compressor is winning while total cost or task quality is getting worse.
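The arithmetic behind this trap is easy to reproduce. The sketch below uses entirely hypothetical token counts and prices (they are not figures from the trial) and assumes output tokens cost five times as much as input tokens, which is the kind of asymmetry that makes output expansion expensive.

```python
def total_cost(input_tokens: int, output_tokens: int,
               in_price: float = 1.0, out_price: float = 5.0) -> float:
    """Cost per request in USD, given assumed prices per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical per-request token counts for three variants of the same task.
VARIANTS = {
    "baseline":   {"input_tokens": 10_000, "output_tokens": 500},
    "moderate":   {"input_tokens": 6_000,  "output_tokens": 520},
    "aggressive": {"input_tokens": 3_000,  "output_tokens": 2_500},
}

COSTS = {name: total_cost(**tokens) for name, tokens in VARIANTS.items()}
CHEAPEST = min(COSTS, key=COSTS.get)
```

In this made-up case the aggressive variant cuts input tokens by 70% yet costs more than the uncompressed baseline, because 2,000 extra output tokens at the higher price outweigh 7,000 saved input tokens. Ranking variants by input tokens alone would pick the worst one.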
Fourth, RAG is not a universal fix. OpenAI’s optimization guide explicitly shows a case where adding RAG to an already fine-tuned behavior problem reduced accuracy by adding noise. If the stable, repeated prompt is mostly behavioral policy, fine-tuning or modularization may be safer than retrieval. If the problem is missing facts, retrieval is exactly the right tool. The distinction matters.
Fifth, progressive disclosure has its own control-plane bottleneck. If your skill or tool descriptions are too vague, routing breaks; if there are too many skills, descriptions may be truncated; if tool descriptions are too thin, the model will choose poorly. In other words, once you compress the body, the metadata becomes the new behavior-critical surface.
Source map and state of the art
The most reliable established patterns today come from vendor docs and operational frameworks: prompt caching with prefix stability, structured outputs and strict tool schemas, persistent conversation state and checkpoints, compaction and rolling summaries, long-term memory notes with consolidation/injection, retrieval with contextualized chunks and reranking, progressive-disclosure skills, AGENTS.md-style scoped guidance, and deferred tool loading via tool search. These are no longer niche ideas; they are increasingly first-class platform features because they reduce cost, latency, and context bloat without asking the model to decode novel compressed representations.
The most important recent research for true prompt compression is: the 2024 characterization showing extractive methods often outperform abstractive and pruning baselines; LLMLingua-2’s “faithful” token-classification approach; the 2025 broad empirical study showing moderate compression can help in long contexts; the 2025 information-preservation paper arguing downstream task scores alone are insufficient and proposing more holistic evaluation; the 2026 production randomized trial showing output-token expansion can erase savings; and recent experimental work on structured symbolic rewriting and latent/context distillation. The first five are immediately useful even if you never deploy the exact models; the last group is promising but not yet something I would call universal best practice.
State of the art as of today: the best generic design is not a giant compressed super-prompt. It is a layered system: a small inline core for invariants and the current task; progressive-disclosure modules or skills for procedures; structured schemas instead of verbose formatting prose; prefix-stable caching to cut repeated cost; checkpoints or conversation objects instead of log replay; rolling summaries plus a recent tail for continuity; retrieval and reranking for deferred context; and regression suites that verify end state, tool trace, and cost. For true lexical compression, favor query-aware extractive methods over abstractive rewriting; for very stable repeated behavior, consider fine-tuning or prompt baking; and treat latent adapter-style context distillation as experimental until it has broader cross-model validation and stronger operational tooling.
Open questions and limitations. The vendor docs are strong on patterns but naturally biased toward their own stacks; several 2026 compression papers are promising but not yet widely replicated; and cross-provider cost behavior remains unstable because caching, state persistence, and billing semantics differ materially. The safest universal advice is therefore architectural rather than model-specific: keep the control plane tiny, keep bulky knowledge out of startup context, and require every compression step to earn its place with regression data.