Claude :: Week11 :: Special Series :: AI Task Delegation Research :: The Model-Agnostic Hand-Off Packet: Designing Lossless Single-Task Delegation Between Different AIs

Jun 29

Written By Richard Ketelsen

Deep Research request. Be thorough, cite sources, prioritize reliable information from the last ~24 months. OBJECTIVE: Design the ideal MODEL-AGNOSTIC "hand-off packet" format for delegating a single self-contained task from one AI to a DIFFERENT AI and folding the result straight back into a workflow — trivially and losslessly. Research and answer ALL of the following up front (do not pause to ask me; state any assumptions you make): 1. What should go INTO the packet, and how should the answer come BACK, to make re-ingestion trivial? 2. What commonly makes a delegated AI result hard to re-ingest? 3. How do you keep the packet model-agnostic (works across ChatGPT/Gemini/Claude/etc.)? 4. How do you bound scope so the receiving AI doesn't wander off-task? 5. Do any current standards/formats exist for inter-AI or agent-to-agent task exchange? Cite them. SOURCES & RECENCY: Favor credible recent sources; cite non-obvious claims; established vs. emerging; flag uncertainty. OUTPUT: Per format/technique use the 6-part format. End with a Source map + "State of the art as of today." CONSTRAINT: Generic framing only; proceed without asking for private details.

TL;DR

The best 2026 design is a self-describing "envelope" packet: a human-readable, delimited task brief (task statement, scope boundaries with explicit non-goals, inputs, success criteria, an explicit output contract, and acceptance tests) that instructs the receiving model to emit its answer inside a single unambiguous delimiter (an XML-style <result> tag wrapping a JSON payload). This pattern survives the widest range of model behavior and makes re-ingestion a one-regex-plus-validate operation.
No existing standard covers the human copy-paste case. Protocols such as Anthropic's MCP and Google's A2A (which has absorbed IBM's ACP) are real, fast-maturing, and well-adopted, but they standardize programmatic agent/tool wiring (JSON-RPC/REST envelopes, capability discovery), not the prompt-level contract you paste between ChatGPT, Gemini, and Claude. For programmatic hand-off, use the provider-native structured-output/JSON-Schema features (all three major vendors now guarantee schema conformance via constrained decoding); for cross-UI hand-off, use the envelope-and-contract prompt pattern.
Model-agnosticism comes from degrading gracefully: write the contract in prose + JSON Schema (universally understood), put format rules at both the top and bottom, use one delimiter convention, never depend on a single vendor's "JSON mode" switch, and always validate + repair on re-ingestion because no provider is deterministic even at temperature 0.

Key Findings

The single most reliable cross-model output container is an XML-style envelope tag wrapping a JSON payload. JSON describes the data; the tag marks the territory. The tag absorbs the model's urge to add preambles ("Sure, here's the data…") and postambles, and lets you watch a stream for </result> to know the payload is complete. Extraction becomes a single regex plus json.loads().
Re-ingestion fails for predictable, catalogued reasons — chatty preambles/postambles, markdown code fences, schema drift (extra/renamed/nested fields), type drift (string instead of number), refusals that don't match your schema, truncation at token limits, and mixing chain-of-thought reasoning with the deliverable. Every one has a known mitigation.
Structured-output features now exist natively in all three major providers and genuinely guarantee schema conformance via constrained decoding/grammars — OpenAI Structured Outputs (GA August 2024), Google Gemini responseSchema, and Anthropic Structured Outputs (public beta November 13, 2025, now GA). But they guarantee format, not correctness, and break on refusals and truncation. These only apply to the API path, not the chat-UI copy-paste path.
The agent-interoperability protocol landscape consolidated dramatically in 2025–2026. MCP (tools/context), A2A (agent-to-agent, now absorbing IBM's ACP), and AGNTCY all moved under the Linux Foundation, with MCP donated to the Agentic AI Foundation on December 9, 2025. These give us battle-tested schemas to imitate (A2A's Task/Message/Part/Artifact model is the clearest reference) even when we can't use the wire protocol itself.
Scope-bounding is a prompt-engineering discipline, not a protocol feature. Explicit non-goals, "do only X" framing, output length caps, single-responsibility decomposition, and an explicit escalation path for ambiguity (return a needs_clarification status rather than guessing) are what stop a receiving model from wandering off-task or over-delivering.

Details

Part 1 — Packet Contents & Return Format

What goes INTO the packet. The packet draws on "context engineering" practice, the 2025 discipline of deliberately assembling the full information package an LLM needs. (The term was first posted by Shopify CEO Tobi Lütke on X on June 18, 2025: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." It was amplified by Andrej Karpathy on June 25, 2025 ["the delicate art and science of filling the context window with just the right information for the next step"] and endorsed by Simon Willison on June 27, 2025.) A self-contained single-task packet should contain, in order:

Role/persona framing — one line ("You are a senior copy editor"). Anthropic recommends setting role via the system prompt to anchor consistent tone/format.
Task statement — a single imperative sentence. Single-responsibility: one task per packet.
Scope boundaries — explicit in-scope and, critically, non-goals ("Do NOT rewrite headings; do NOT add new sections").
Input data/context — wrapped in its own delimiter so it can't be confused with instructions.
Constraints — length caps, tone, allowed vocabulary, libraries/versions, "use only the data provided."
Success criteria / acceptance tests — a checklist the output must satisfy ("valid JSON; ≤120 words; every id echoed back").
Output contract — the exact schema and the exact envelope, stated last (recency matters — models weight the end of the prompt).
Examples — one or two few-shot input→output pairs as a "grammar anchor," kept format-identical to the contract. (Note: examples can hurt reasoning models, per OpenAI's function-calling guidance, so use judiciously.)
Escalation path — what to do under ambiguity (emit status: "needs_clarification" with questions, rather than guessing).
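The list above can be sketched as a small renderer. This is a minimal illustration, not a standard: the `HandoffPacket` class, its field names, and the `TASK_ID`/`IF AMBIGUOUS` labels are all hypothetical. It places the output contract at the very end of the rendered text so that "stated last" holds even when examples and an escalation rule are present.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffPacket:
    """Fields of a single-task packet (names are illustrative, not a standard)."""
    task_id: str
    role: str
    goal: str
    in_scope: list
    non_goals: list
    input_text: str
    constraints: list
    acceptance_tests: list
    schema_hint: str  # prose or JSON description of the expected payload shape
    examples: list = field(default_factory=list)


def _bullets(items):
    """Render a list of strings as plain-text bullet lines."""
    return "\n".join(f"- {item}" for item in items)


def render_packet(p: HandoffPacket) -> str:
    """Render the packet as plain text, with the output contract stated last."""
    parts = [
        "<task>",
        f"TASK_ID: {p.task_id}",
        f"ROLE: {p.role}",
        f"GOAL: {p.goal}",
        "IN SCOPE:\n" + _bullets(p.in_scope),
        "NON-GOALS:\n" + _bullets(p.non_goals),
        # The input gets its own delimiter so it cannot be read as instructions.
        f"<input>\n{p.input_text}\n</input>",
        "CONSTRAINTS:\n" + _bullets(p.constraints),
        "ACCEPTANCE TESTS:\n" + _bullets(p.acceptance_tests),
    ]
    if p.examples:
        parts.append("EXAMPLES:\n" + "\n".join(p.examples))
    parts += [
        'IF AMBIGUOUS: set "status" to "needs_clarification" and put your '
        'questions in "notes". Do not guess.',
        "OUTPUT CONTRACT: Reply with ONLY a <result> block wrapping a JSON "
        f"object matching this schema: {p.schema_hint}",
        "Output nothing before <result> or after </result>.",
        "</task>",
    ]
    return "\n".join(parts)
```

Because the packet is plain text with one delimiter convention, the same string can be pasted into a chat UI or sent as a user message through an API.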

How the answer comes BACK. The return contract should specify:

A single self-describing envelope: <result>…</result> wrapping a JSON object, OR pure JSON if you control decoding via an API.
A status field (ok / needs_clarification / refused / partial) so the orchestrator can branch without parsing prose.
Reasoning quarantined from the deliverable — either omitted, or placed in a separate <scratchpad>/commentary field that re-ingestion ignores. The OpenAI Agents SDK community documents the exact failure this prevents: models "suddenly insert Markdown fencing," and the fix is a second field where "any extraneous stuff the LLM decides to spout" can go while the clean payload stays isolated.
Echoed correlation keys (e.g., a task_id and any input ids) so the result folds back into the workflow without manual matching. This mirrors how OpenAI's tool-calling APIs require an identifier linking a tool result to its call (tool_call_id in Chat Completions, call_id in the Responses API).
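A minimal sketch of the receiving side of this return contract, assuming the reply contains a well-formed envelope. The function name `parse_result` and the exact error messages are illustrative; the status values are the four listed above.

```python
import json
import re

# Non-greedy match so only the first <result> block is taken.
RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)
KNOWN_STATUSES = {"ok", "needs_clarification", "refused", "partial"}


def parse_result(reply: str, expected_task_id: str) -> dict:
    """Extract the JSON object from a <result> envelope and check correlation."""
    match = RESULT_RE.search(reply)
    if match is None:
        raise ValueError("no <result>...</result> envelope found")
    data = json.loads(match.group(1))
    if not isinstance(data, dict):
        raise ValueError("envelope payload is not a JSON object")
    if data.get("status") not in KNOWN_STATUSES:
        raise ValueError(f"unknown status: {data.get('status')!r}")
    # The echoed task_id is what lets the result fold back in without manual matching.
    if data.get("task_id") != expected_task_id:
        raise ValueError("task_id does not match the packet that was sent")
    return data
```

Anything outside the tags, including a preamble or sign-off, is ignored, and the orchestrator can branch on `status` without reading prose.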

Human copy-paste case vs. programmatic case. For the chat-UI case (Claude→ChatGPT→Gemini by hand), the envelope-tag-around-JSON pattern is essential because you have no API-level schema enforcement; the tag is what lets you reliably scrape the answer out of a chat window. For the programmatic case, prefer the provider-native structured-output APIs (below), which constrain generation at the token level and eliminate the parsing problem entirely — falling back to the envelope pattern only when crossing a provider that lacks the feature.

Part 2 — Re-Ingestion Failure Modes and Mitigations

Failure mode	What it looks like	Mitigation
Preamble/postamble	"Sure! Here's your JSON:" … "Hope that helps!"	Envelope tag + regex extraction; prefill (older models); "output only the `<result>` block, nothing before or after."
Markdown fences	```json … ``` wrapping the payload	Strip fences in a cleaning step; extraction cascade (try raw parse → fenced → first `{…}`).
Schema drift	extra/renamed/nested fields	Native structured outputs with `additionalProperties:false`; validate with Pydantic/Zod post-parse.
Type drift	`"25"` vs `25`, array vs scalar	JSON Schema types + strict mode; business-logic validation layer.
Refusals	safety refusal that doesn't match schema	Detect the provider's `refusal` field (OpenAI) or `stop_reason:"refusal"` (Anthropic); branch on a `status` field.
Truncation	output cut at token limit; unbalanced braces	Set generous `max_tokens`; brace-balancing repair heuristic; detect `finish_reason`/`stop_reason:"max_tokens"`.
Reasoning bleed	chain-of-thought mixed into deliverable	Quarantine reasoning to a separate field/tag or disable it.
Non-determinism	different output for identical input	Accept it (no provider guarantees determinism at temp 0); design re-ingestion to be robust to variation; validate-and-repair.
Verbosity / over-delivery	model adds unrequested sections	Output length caps + explicit non-goals; "do only X."

The practitioner consensus (DEV Community, n1n.ai, Tetrate, JSON-extraction guides) is a multi-layered defense: (1) constrain at generation time where possible (native structured outputs); (2) extract with a fence-and-tag-tolerant cascade; (3) repair (balance braces); (4) validate against a schema (Pydantic/Zod); (5) branch on refusal/status. "Never trust the LLM output directly, even with structured output. Always validate."
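Layers (2) and (3) of that defense can be sketched with the standard library alone. This is a heuristic illustration under stated assumptions, not a complete repair tool: `extract_json` and `balance_braces` are hypothetical names, and the repair only handles the common case of output cut off inside a string or before closing brackets. Schema validation and status branching would follow as separate steps.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this listing fence-safe
TAG_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def balance_braces(text: str) -> str:
    """Heuristic repair for truncated JSON: close an open string and open brackets."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    repaired = text + ('"' if in_string else "")
    return repaired + "".join(reversed(stack))


def extract_json(reply: str) -> dict:
    """Try raw parse, then fenced block, then first {...}, then brace repair."""
    tagged = TAG_RE.search(reply)
    scope = tagged.group(1) if tagged else reply
    attempts = [scope.strip()]
    fenced = FENCE_RE.search(scope)
    if fenced:
        attempts.append(fenced.group(1).strip())
    start = scope.find("{")
    if start != -1:
        end = scope.rfind("}")
        if end > start:
            attempts.append(scope[start:end + 1])
        attempts.append(balance_braces(scope[start:]))
    for text in attempts:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(value, dict):
            return value
    raise ValueError("no parseable JSON object in reply")
```

A repaired payload is by construction incomplete, so a caller would reasonably treat it as `partial` rather than `ok`.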

Part 3 — Model-Agnosticism

Provider idiosyncrasies that matter for a portable packet:

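System-prompt handling differs. The APIs expose a dedicated system slot in different ways (a system or developer message role at OpenAI, a top-level system parameter at Anthropic, a system instruction at Gemini), but the contract should not depend on a system slot existing, because in a copy-paste chat UI there is only the user turn. Put role framing inline.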
Structured-output mechanisms differ. OpenAI uses response_format: {type:"json_schema", json_schema:{name, strict:true, schema}} and strict:true on function tools. Gemini uses responseMimeType:"application/json" + responseSchema. Anthropic has no generic "JSON mode" switch: historically you forced a schema via tool use with input_schema, and since November 2025 you can use the dedicated Structured Outputs feature. A portable packet therefore cannot rely on any single switch; it states the schema in the prompt as the lowest common denominator and lets each runtime additionally enforce it natively.
Formatting preferences differ. Anthropic explicitly recommends XML tags to structure prompts and outputs ("When your prompts involve multiple components like context, instructions, and examples, XML tags can be a game-changer… Having Claude use XML tags in its output makes it easier to extract specific parts of its response by post-processing"); Gemini docs recommend XML-style tags or Markdown headings as delimiters and warn that few-shot property ordering must match the schema's ordering; GPT is more flexible with Markdown. XML tags are the safest cross-model choice because they're the most explicitly endorsed and the easiest to regex.
Context windows differ, so a portable packet keeps context lean (context engineering: "relevance first," "compression over completeness"). Brevity also improves quality: Chroma's 2025 "Context Rot" report (testing 18 models including GPT-4.1, Claude Opus 4, Gemini 2.5, and Qwen3) found that "model performance consistently degrades with increasing input length… their performance grows increasingly unreliable as input length grows."
Instruction-following idiosyncrasies: Claude "tends to over-explain unless boundaries are clearly defined"; Gemini 3 is "direct and efficient" by default and needs explicit instruction for more detail. The packet should set explicit verbosity and format expectations rather than assume a default.

Graceful degradation rules: (1) prose + JSON Schema, never a vendor-only construct; (2) one delimiter convention throughout; (3) restate the output format at top and bottom; (4) always include an extraction-cascade + validation step on the receiving side; (5) keep examples format-identical to the contract.

Part 4 — Scope Bounding

Techniques, ranked by leverage:

Single-responsibility decomposition — one task per packet. Every multi-agent framework (OpenAI Agents SDK, CrewAI, LangGraph) converges on "give each specialist a narrow job."
Explicit non-goals — the highest-signal, most-skipped element. Underspecification research (arXiv 2505.13360, "What Prompts Don't Say") shows prompts routinely omit boundary conditions, causing silent failures.
"Do only X" framing + acceptance criteria — a "Definition of Done" checklist, mirroring the AGENTS.md pattern of binary, verifiable completion conditions.
Output length caps — directly curb over-delivery and verbosity.
Escalation/refusal path for ambiguity — instruct the model to return a needs_clarification status instead of guessing. This addresses both over-refusal (models refuse ambiguous-but-benign requests) and over-confident guessing; a structured clarification path is better than either.
Guardrails — in programmatic settings, input/output guardrails (OpenAI Agents SDK) or validation layers reject off-contract outputs before they reach the workflow.
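The acceptance-test and guardrail items above can be expressed as a plain validation function that runs before a result reaches the workflow. This is a sketch under stated assumptions: the payload fields `text` and `items` are hypothetical, chosen only to show a length cap, an "every id echoed back" check, and a check for unrequested fields.

```python
def check_scope(payload: dict, *, allowed_keys: set, expected_ids: list, max_words: int) -> list:
    """Return a list of acceptance-test failures; an empty list means the result passes."""
    failures = []

    # Over-delivery: any field the contract did not ask for.
    extra = sorted(set(payload) - allowed_keys)
    if extra:
        failures.append(f"unrequested fields: {extra}")

    # Output length cap, counted in whitespace-separated words.
    words = len(str(payload.get("text", "")).split())
    if words > max_words:
        failures.append(f"text has {words} words, cap is {max_words}")

    # Every input id must be echoed back exactly once or more.
    items = payload.get("items", [])
    returned = [item.get("id") for item in items if isinstance(item, dict)]
    missing = [i for i in expected_ids if i not in returned]
    if missing:
        failures.append(f"ids not echoed back: {missing}")

    return failures
```

Returning a list of failures instead of raising on the first one makes it easy to send all problems back to the receiving model in a single repair request.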

Part 5 — Existing Standards & Formats (6-part structure each)

A. Model Context Protocol (MCP)

What it is: Open standard from Anthropic (announced Nov 2024) for connecting AI applications to external tools/data via a client-server protocol.
How it works: JSON-RPC 2.0 messages over STDIO/HTTP; servers expose tools/resources/prompts the model can discover and invoke.
Strengths: Massive adoption; solves the N×M integration problem; model-agnostic; now vendor-neutral under the Linux Foundation.
Weaknesses: It's about tool/context plumbing, not single-task prompt hand-off between chat models; can be token-heavy; security surface (prompt injection, tool-chaining attacks; CVE-2025-49596 in MCP Inspector).
Maturity: Established and dominant. Adopted by OpenAI (Mar 2025), Google DeepMind (Apr 2025), Microsoft. Per Anthropic's December 9, 2025 announcement, MCP has "over 97 million monthly SDK downloads, 10,000 active servers and first-class client support across major AI platforms like ChatGPT, Claude, Cursor, Gemini, Microsoft Copilot, Visual Studio Code." It was donated that day to the Agentic AI Foundation (co-founded by Anthropic, Block, and OpenAI, with support from Google, Microsoft, AWS, Cloudflare, and Bloomberg).
Relevance: Indirect. Not a hand-off-packet format, but the ecosystem your packet may travel through; imitate its self-describing, JSON-structured discipline.
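To make the "JSON-RPC 2.0 messages" description concrete, here is a minimal sketch of the message shape for invoking one tool and reading the reply. It builds and parses the JSON only; transport, initialization, and capability negotiation are omitted, and the tool name `word_count` used in the usage note is hypothetical.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request that invokes one MCP tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


def read_response(raw: str, request_id: int) -> dict:
    """Return the `result` member of the response that answers `request_id`."""
    message = json.loads(raw)
    # JSON-RPC correlates a response to its request through the shared id.
    if message.get("id") != request_id:
        raise ValueError("response id does not match the request")
    if "error" in message:
        err = message["error"]
        raise RuntimeError(f"JSON-RPC error {err.get('code')}: {err.get('message')}")
    return message["result"]
```

For example, `tools_call_request(7, "word_count", {"text": "hello world"})` produces a request whose `id` of 7 must come back on the response. The same correlation idea is what the packet's echoed task_id provides at the prompt level.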

B. Agent2Agent (A2A)

What it is: Open protocol (Google, announced Apr 9 2025) for autonomous agents from different vendors/frameworks to discover, delegate, and coordinate.
How it works: AgentCard (capability discovery) + a Task (stateful unit of work) advanced by Messages (turns with a role), each composed of Parts (TextPart/FilePart/DataPart), with results returned as immutable Artifacts. JSON-RPC/gRPC/HTTP+REST bindings; SSE streaming.
Strengths: The cleanest existing schema for task delegation; explicit separation of communication (Messages) from output (Artifacts); multi-modal; Linux Foundation governance.
Weaknesses: Heavyweight for a single copy-paste task; enterprise/programmatic focus; doesn't address semantic alignment of business terms.
Maturity: Emerging but rapidly adopted; spec v0.3.0; IBM's ACP merged into it under the Linux Foundation; AGNTCY interoperable with it.
Relevance: High as a design template. The Task/Message/Part/Artifact decomposition and the rule that "Messages SHOULD NOT be used to deliver task outputs; Results SHOULD BE returned using Artifacts" directly informs separating your instructions from your <result> payload.
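The decomposition can be imitated in a few dataclasses. This is a deliberately simplified sketch, not the A2A schema itself: the class and field names are Python-style approximations, most spec fields are omitted, and only two task states are shown. What it illustrates is the separation of conversation (Messages) from deliverables (Artifacts).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Part:
    kind: str       # "text" or "data" in this sketch
    content: object


@dataclass(frozen=True)
class Message:
    role: str       # "user" (delegator) or "agent" (receiver)
    parts: tuple


@dataclass(frozen=True)
class Artifact:
    """An output of the task; frozen so it cannot be edited after creation."""
    artifact_id: str
    parts: tuple


@dataclass
class Task:
    task_id: str
    state: str = "submitted"
    history: list = field(default_factory=list)    # Messages: the conversation
    artifacts: list = field(default_factory=list)  # Artifacts: the deliverables

    def say(self, role: str, text: str) -> None:
        """Record a conversational turn; turns never carry the deliverable."""
        self.history.append(Message(role, (Part("text", text),)))

    def complete(self, artifact: Artifact) -> None:
        """Attach the deliverable and mark the task done."""
        self.artifacts.append(artifact)
        self.state = "completed"
```

In the hand-off packet, the instructions play the role of the Message and the `<result>` payload plays the role of the Artifact.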

C. Agent Communication Protocol (ACP)

What it is: REST-based open standard from IBM Research (BeeAI), for agent-to-agent/agent-to-human messaging.
How it works: REST+JSON surface, MIME-typed multipart messages, async-by-default with sync supported, SSE streaming, offline discovery.
Strengths: Lightweight, HTTP-native, usable with cURL; local-first/edge friendly.
Weaknesses: Now wound down — the ACP team merged it into A2A under the Linux Foundation; users are directed to A2A migration paths.
Maturity: Effectively deprecated/absorbed (2025).
Relevance: Historical/cautionary; its REST+JSON task-and-artifact framing lives on in A2A.

D. AGNTCY (Internet of Agents)

What it is: Open-source collective (Cisco/Outshift, LangChain, Galileo; + LlamaIndex, Glean), launched Mar 2025, for agent discovery/identity/messaging/observability.
How it works: OASF (Open Agent Schema Framework) for agent description; Agent Connect Protocol; SLIM secure messaging.
Strengths: Broad backing (75+ companies by Jul 2025); interoperable with A2A and MCP; Linux Foundation.
Weaknesses: Infrastructure-layer, not a single-task prompt format; young.
Maturity: Emerging.
Relevance: Low-to-indirect; reinforces JSON-Schema-based self-description (OASF).

E. FIPA-ACL and KQML (historical)

What they are: 1990s–2000s agent communication languages. KQML (DARPA Knowledge Sharing Effort, ~1990); FIPA-ACL (FIPA founded 1996, IEEE since 2005).
How they work: Speech-act "performatives" (inform, request, achieve, propose) wrapping content, with mandatory fields (sender, receiver, content, ontology) and formal semantics.
Strengths: Rigorous formal semantics; standardized interaction protocols (contract-net, auctions); conceptual ancestors of today's protocols.
Weaknesses: Never widely adopted; semantics underspecified (KQML) or too formal (FIPA); "the sad truth is that programmers do not care about semantics."
Maturity: Legacy/academic.
Relevance: Conceptual — the performative idea maps to your packet's explicit intent/status fields; a cautionary tale that over-formalization kills adoption.

F. OpenAI Structured Outputs / Function Calling

What it is: API feature guaranteeing outputs match a developer-supplied JSON Schema.
How it works: response_format:{type:"json_schema", json_schema:{name, strict:true, schema}} or strict:true on a function tool; constrains decoding so required keys/types/enums are honored; SDKs accept Pydantic/Zod; a refusal field signals safety refusals.
Strengths: Eliminates schema-violation parsing failures on the API path; the evolution of older "JSON mode" (which guaranteed valid JSON but not your schema).
Weaknesses: OpenAI-only; requires additionalProperties:false and all fields required (optionals via null union); doesn't cover the chat-UI path.
Maturity: Established (GA Aug 2024).
Relevance: High for the programmatic path; the gold standard for lossless re-ingestion when you control the API.
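The two strict-mode requirements named under Weaknesses can be checked before a schema is ever sent. The sketch below is a hypothetical pre-flight lint covering only those two rules (additionalProperties:false and every property listed in required); the provider enforces further restrictions that it does not attempt to model.

```python
def strict_mode_problems(schema: dict, path: str = "$") -> list:
    """List places where a JSON Schema breaks the two strict-mode rules above."""
    problems = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: properties missing from required: {missing}")
        # Recurse so nested objects are held to the same rules.
        for name, sub in props.items():
            problems += strict_mode_problems(sub, f"{path}.{name}")
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_mode_problems(schema["items"], f"{path}[]")
    return problems
```

An optional field would then be modelled as required with a null union, for example `{"type": ["string", "null"]}`, which this lint accepts.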

G. Google Gemini Structured Output

What it is: Gemini's native schema-constrained JSON output.
How it works: responseMimeType:"application/json" + responseSchema (JSON Schema / Pydantic / Zod); Gemini 3 can combine structured outputs with tools and Search grounding.
Strengths: Native and reliable; type-safe; multi-provider parity.
Weaknesses: Property ordering in prompt/examples must match the schema or output degrades; Gemini-only.
Maturity: Established (since Gemini 1.5 Pro).
Relevance: High for the programmatic path; same gold-standard role on Google.

H. Anthropic Structured Outputs / Tool Use

What it is: Schema-guaranteed output for Claude.
How it works: Historically via tool use with input_schema + forced tool_choice; since Nov 13 2025 a dedicated Structured Outputs feature (output_config.format; strict:true tool use) compiling JSON schemas into a grammar that constrains output; now GA.
Strengths: "Always valid — no more JSON.parse() errors. Type safe — guaranteed field types and required fields. Reliable — no retries needed for schema violations." Strict tool use prevents wrong types/enums in tool calls.
Weaknesses: Claude-only; guarantees format not correctness; broken by refusals (stop_reason:"refusal") and truncation (max_tokens); incompatible with response prefilling and citations.
Maturity: Emerging→established (public beta Nov 13 2025, now GA).
Relevance: High for the programmatic path; completes "all three majors now guarantee schema conformance."

I. JSON Schema / Pydantic / Zod

What they are: The shared schema vocabulary (JSON Schema) and its code-level bindings (Pydantic in Python, Zod in TypeScript).
How they work: Define types/required fields/enums/constraints; all three providers accept JSON Schema; SDKs auto-convert Pydantic/Zod objects and deserialize/validate responses.
Strengths: The universal lingua franca of structured output; portable across every provider; doubles as runtime validator.
Weaknesses: Schema must be designed to each provider's strict-mode constraints; doesn't catch semantic errors.
Maturity: Established, foundational.
Relevance: Central. Your packet's output contract should be a JSON Schema (embedded in prose for the UI case, supplied natively for the API case).
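To show what "doubles as runtime validator" means without pulling in a dependency, here is a hand-rolled stand-in that checks a small subset of JSON Schema (type, enum, required, properties, additionalProperties, items). It is an illustration only; in practice Pydantic, Zod, or a full JSON Schema validator would do this job. The `RESULT_SCHEMA` shown is an assumed rendering of the packet's return contract, with a hypothetical `count` field in the payload.

```python
TYPE_MAP = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}


def _is_type(value, name: str) -> bool:
    # bool is a subclass of int in Python, so exclude it from numeric types.
    if name in ("number", "integer") and isinstance(value, bool):
        return False
    return isinstance(value, TYPE_MAP[name])


def validate(value, schema: dict, path: str = "$") -> list:
    """Return a list of error strings for a subset of JSON Schema keywords."""
    expected = schema.get("type")
    if expected is not None:
        names = expected if isinstance(expected, list) else [expected]
        if not any(_is_type(value, n) for n in names):
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in {schema['enum']}")
    if isinstance(value, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field {key!r}")
        if schema.get("additionalProperties") is False:
            for key in value:
                if key not in props:
                    errors.append(f"{path}: unexpected field {key!r}")
        for key, sub in props.items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
    if isinstance(value, list) and isinstance(schema.get("items"), dict):
        for i, item in enumerate(value):
            errors += validate(item, schema["items"], f"{path}[{i}]")
    return errors


RESULT_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["status", "task_id", "payload", "notes"],
    "properties": {
        "status": {"type": "string", "enum": ["ok", "needs_clarification", "refused", "partial"]},
        "task_id": {"type": "string"},
        "payload": {"type": "object", "properties": {"count": {"type": "integer"}}},
        "notes": {"type": ["string", "null"]},
    },
}
```

Type drift such as `"25"` arriving where `25` was specified surfaces as a single error pointing at `$.payload.count`, which is precise enough to feed back into a repair prompt.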

J. AGENTS.md

What it is: An open Markdown convention — a "README for agents" — giving coding agents project context.
How it works: Plain Markdown at repo root (no schema/YAML); build/test commands, conventions, boundaries, a "Definition of Done."
Strengths: Released by OpenAI in August 2025 and adopted by more than 60,000 open-source projects; cross-tool (Cursor, Copilot, Codex, Windsurf, Claude Code); governed under the Agentic AI Foundation with Anthropic/Google/Microsoft/OpenAI as platinum members.
Weaknesses: Persistent project context, not a single-task hand-off; freeform (no enforced schema); LLM-generated versions can reduce task success and raise cost ~20% (Gloaguen et al. 2026).
Maturity: Established convention.
Relevance: Medium — a proven model for human-readable, agent-targeted instruction files, and its binary "Definition of Done"/explicit-boundary patterns map directly onto your acceptance-tests and non-goals sections.

Additional concepts

Context engineering — the 2025 reframing from "prompt" to "the full information package." Principles: relevance first, provenance/trust, compression over completeness. More context can hurt (Chroma's 2025 "Context Rot" report). Your packet should be minimal-but-complete.
Envelope pattern — XML tag boundary + JSON payload; the production-tested shape for surviving preambles, sign-offs, fences, and streaming.
Delimiter design — use one consistent, descriptive, regex-friendly delimiter; XML-style tags are the cross-model safe choice.
Idempotent/deterministic prompting — a goal you cannot fully reach: OpenAI, Anthropic, and Google all document that temperature 0 is only "mostly deterministic" (batching, hardware routing, model-snapshot drift). Use seed where available, pin model snapshots, and design re-ingestion to tolerate variation.
Stripping preamble/postamble: the envelope extraction cascade; response prefilling (older Claude models only); or native structured outputs that prevent preamble at generation time.
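The streaming benefit of the envelope pattern can be sketched as a small accumulator that reports the payload as soon as the closing tag has arrived, even when the tag is split across chunks. `EnvelopeWatcher` is a hypothetical helper; how chunks are obtained from a given provider's streaming API is outside this sketch.

```python
from typing import Optional


class EnvelopeWatcher:
    """Accumulate streamed text and return the payload once </result> arrives."""

    OPEN = "<result>"
    CLOSE = "</result>"

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> Optional[str]:
        """Add one chunk; return the enclosed text when complete, else None."""
        # Searching the whole buffer (not just the chunk) handles tags that
        # are split across chunk boundaries. Fine for short replies.
        self._buffer += chunk
        start = self._buffer.find(self.OPEN)
        if start == -1:
            return None
        body_start = start + len(self.OPEN)
        end = self._buffer.find(self.CLOSE, body_start)
        if end == -1:
            return None
        return self._buffer[body_start:end].strip()
```

A stream that ends while `feed` is still returning None is a signal of truncation or a missing envelope, which the caller can route to its repair or retry path.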

How multi-agent frameworks pass tasks/results (reference schemas)

OpenAI Agents SDK — explicit handoffs implemented as auto-generated transfer_to_<agent> tools; input_type (a Pydantic schema) carries structured handoff metadata; input_filter controls what history transfers; output_type forces structured (Pydantic) results. Recommended practice: strict_json_schema=True, single-responsibility agents, Pydantic validation.
LangGraph — state is a first-class typed object passed through a directed graph; checkpointing and reducers merge updates; the most production-adopted in 2026.
CrewAI — role-based; task outputs passed sequentially between agents.
AutoGen/AG2 — conversation transcript as shared state (now in maintenance mode as Microsoft shifts to its Agent Framework).
Common pattern across all: decompose, give each agent a narrow contract, transform one agent's structured output into the next's input — exactly the single-task packet model, validated by code between steps for determinism.

Recommendations

Stage 1 — Adopt the envelope-and-contract packet for the copy-paste case now. Use this template:

<task>
ROLE: <one line>
GOAL: <one imperative sentence>
IN SCOPE: <bullets>
NON-GOALS: <bullets — be explicit>
<input> … </input>
CONSTRAINTS: <length cap, tone, "use only data above">
ACCEPTANCE TESTS: <checklist the output must pass>
OUTPUT CONTRACT: Reply with ONLY a <result> block wrapping a JSON object
matching this schema: { "status": "ok|needs_clarification|refused|partial",
"task_id": "<echo>", "payload": { … }, "notes": "<optional, ignored on ingest>" }
Output nothing before <result> or after </result>.
</task>

Re-ingest by regex-extracting <result>…</result>, parsing JSON, and validating against the schema.

Stage 2 — For programmatic hand-off, switch to provider-native structured outputs. Define the contract once as a JSON Schema (or Pydantic/Zod) and enforce it natively: OpenAI strict:true, Gemini responseSchema, Anthropic Structured Outputs. Keep the envelope pattern only as the fallback when crossing a provider/model that lacks the feature.
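"Define the contract once" can be sketched as three small adapters that wrap one shared schema in each provider's request fragment. These build plain dictionaries only and make no network calls. The shapes follow the parameter names mentioned in this report and should be treated as assumptions to verify against current provider docs: the Anthropic fragment (output_config.format) is the least certain, and Gemini's responseSchema may not accept every JSON Schema keyword. The function names and the schema name `handoff_result` are hypothetical.

```python
# One shared contract, defined once.
RESULT_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["status", "task_id", "payload"],
    "properties": {
        "status": {"type": "string", "enum": ["ok", "needs_clarification", "refused", "partial"]},
        "task_id": {"type": "string"},
        "payload": {"type": "object", "additionalProperties": False, "required": [], "properties": {}},
    },
}


def openai_response_format(schema: dict, name: str = "handoff_result") -> dict:
    """Fragment for the `response_format` request parameter (assumed shape)."""
    return {"type": "json_schema", "json_schema": {"name": name, "strict": True, "schema": schema}}


def gemini_generation_config(schema: dict) -> dict:
    """Fragment for the generation config (assumed shape)."""
    return {"responseMimeType": "application/json", "responseSchema": schema}


def anthropic_output_config(schema: dict) -> dict:
    """Fragment for `output_config` (assumed shape; verify before use)."""
    return {"format": {"type": "json_schema", "schema": schema}}


def fragments_for_all(schema: dict) -> dict:
    """Build every provider fragment from the same schema object."""
    return {
        "openai": openai_response_format(schema),
        "gemini": gemini_generation_config(schema),
        "anthropic": anthropic_output_config(schema),
    }
```

Keeping the schema in one place means the prose version embedded in a pasted packet and the natively enforced version cannot drift apart.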

Stage 3 — Always run the receiving-side defense stack: extraction cascade → brace-repair → schema validation (Pydantic/Zod) → branch on status/refusal. Never json.loads() raw output.

Stage 4 — Bound scope and quarantine reasoning in every packet: single task, explicit non-goals, length cap, a needs_clarification escape hatch, and reasoning kept out of the deliverable field.

Stage 5 — Pin for reproducibility where it matters: fixed model snapshot, seed if available, low temperature — but treat outputs as non-deterministic and make ingestion idempotent.
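One way to read "make ingestion idempotent" is to key every result by its task_id plus a hash of its payload, so a re-pasted or retried answer is recognized rather than applied twice. The sketch below is an in-memory illustration with a hypothetical `ResultStore` class; a real workflow would persist the index and decide its own policy for conflicts.

```python
import hashlib
import json


class ResultStore:
    """Remember ingested results so repeats are harmless and conflicts are visible."""

    def __init__(self) -> None:
        self._by_task = {}  # task_id -> (payload digest, payload)

    @staticmethod
    def _digest(payload) -> str:
        # sort_keys makes the hash independent of key order in the JSON object.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def ingest(self, result: dict) -> str:
        """Return "stored", "duplicate", or "conflict" for a parsed result."""
        task_id = result["task_id"]
        digest = self._digest(result["payload"])
        previous = self._by_task.get(task_id)
        if previous is None:
            self._by_task[task_id] = (digest, result["payload"])
            return "stored"
        if previous[0] == digest:
            return "duplicate"  # same answer again: safe to ignore
        return "conflict"       # same task, different answer: needs a decision

    def payload_for(self, task_id: str):
        """Return the stored payload for a task, or None if nothing was ingested."""
        entry = self._by_task.get(task_id)
        return entry[1] if entry else None
```

Because two runs of the same packet can legitimately produce different payloads, the "conflict" outcome is expected behavior to plan for, not an error condition.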

Conditions that change the recommendation: If you control both endpoints' APIs and both support native structured outputs → skip the envelope, use schemas directly (lossless). If either endpoint is a chat UI → the envelope pattern is mandatory. If tasks become multi-turn/stateful or cross organizational boundaries → graduate to A2A's Task/Artifact model rather than hand-rolling. If you need persistent project context rather than a one-shot task → use AGENTS.md instead.

Caveats

No standard exists for the exact thing asked (a model-agnostic single-task hand-off packet for the human copy-paste case). The recommended design synthesizes prompt-engineering best practice with schema standards; it is not a ratified spec.
Structured outputs guarantee format, not truth. They prevent parsing failures, not hallucinations or wrong answers; always keep semantic validation.
Determinism is unattainable. All three majors explicitly state temperature 0 is not fully deterministic.
Anthropic response prefilling is deprecated on Claude 4.6+ models and is incompatible with structured outputs — don't build a portable packet that depends on it.
Adoption/figures move fast. MCP/A2A/ACP statuses and download counts are from 2025–2026 sources and reflect a consolidating, still-shifting landscape (notably ACP's merger into A2A). Some adoption numbers come from vendor blogs and secondary trackers and should be treated as directional.
Source mix. Primary sources (provider docs, protocol specs, arXiv) are weighted most heavily; practitioner blogs (DEV, Medium) corroborate patterns but are individual experience, not authority.

Source map

Protocols (high reliability — official specs/orgs): Anthropic MCP announcement & AAIF donation (Dec 9, 2025); modelcontextprotocol.io; a2a-protocol.org spec + GitHub; IBM Research ACP; Linux Foundation AGNTCY; agntcy.org/docs. Security comparative analysis: arXiv 2511.03841.
Structured outputs (high — official docs): OpenAI Structured Outputs & function-calling guides; Google Gemini structured-output docs; Anthropic Structured Outputs / strict tool use docs; Microsoft Learn (Azure OpenAI).
Prompt/format technique (high-medium): Anthropic "Use XML tags," "Prefill Claude's response," prompting best practices; Gemini prompt-design docs; CodeSignal model-specific formatting.
Envelope & re-ingestion (medium — practitioner, corroborated): DEV Community "JSON or XML Tags for LLM Output"; n1n.ai; Tetrate; arunabh.me; Pockit Blog structured-output guide.
Context engineering (medium-high): arXiv 2510.26493, 2601.21557, 2604.04258; Thoughtworks; Comet; Chroma "Context Rot" report (trychroma.com/research/context-rot); Lütke/Karpathy/Willison June 2025 posts.
Determinism (high — arXiv + vendor docs): arXiv 2408.04667; keywordsai.co; unstract.com.
Frameworks (medium): OpenAI Agents SDK docs (handoffs/orchestration); multi-framework comparisons (Galileo, DataCamp, gurusup).
Historical ACLs (high — academic): OBJS ACL comparison; arXiv 2505.00749 (Coral); FIPA/KQML papers.
AGENTS.md (medium): agents.md; ASDLC.io; Augment Code; Blake Crosley.

State of the art as of today (June 29, 2026)

The best current design for a model-agnostic single-task hand-off packet is a two-mode contract built on one shared JSON Schema. Author the task as a lean, context-engineered brief (role, one-sentence goal, explicit in-scope/non-goals, delimited input, constraints, acceptance tests, and an escalation path) and specify a single self-describing return envelope: a <result> tag wrapping a JSON object that carries a status, an echoed task_id, the payload, and a quarantined notes/reasoning field. In the human copy-paste mode, the XML-envelope-around-JSON shape is what makes the answer reliably scrapable across ChatGPT, Gemini, and Claude despite their differing verbosity and formatting defaults; re-ingestion is a regex plus schema validation. In the programmatic mode, the same JSON Schema is enforced natively by each provider's now-mature structured-output feature (OpenAI strict, Gemini responseSchema, Anthropic Structured Outputs), making hand-off genuinely lossless. Portability comes from never depending on any one vendor's switch, restating the format contract at top and bottom, using one delimiter convention, and always running an extraction→repair→validate→branch stack on receipt because no model is deterministic. The heavyweight interoperability protocols (MCP for tools, A2A for stateful cross-org agent delegation) are the right graduation path when a single pasted task grows into a persistent, multi-turn, or multi-organization workflow, and A2A's Task/Message/Part/Artifact schema is the best existing template to imitate. There is, as yet, no ratified standard for the simple "paste one task between two chatbots" case; the envelope-and-schema pattern is the de facto state of the art.
