An MCP server post-mortem: context vs. protocol

If you are exposing an MCP server in front of a REST API, two things from our own experience are worth passing on:

Code change and tool-description change have to land together. Reshape a list response without updating the description, and you trade a loud overflow failure for a silent one — the agent stops at the thin record, never realises get_bug exists, answers from incomplete data.
Log result_size_bytes per tool call, and smoke-test against production-shaped data, not synthetic fixtures. A response can be type-correct and still wrong about its size, and dev fixtures hide the 99th-percentile costs that broke us in production.

Both are aspects of the now-familiar idea that an MCP server is a context translator, not a protocol translator — the BFF / overfetching pattern with the consumer swapped from a UI client to an LLM. The principle itself is not new; what may be new is the specific pairing of the code change with the tool-description change, and the habit of logging output byte size on every call.

The story below — list_bugs(limit=3) returning 61,621 bytes for three records, the agent harness saving the overflow to disk and then grepping field values from the file, one question producing four tool calls — is one incident on our own backend. Illustrative, not a benchmark.

The incident

The JSONL log line our server wrote for that first call (pretty-printed here for readability):

{
  "tool": "list_bugs",
  "args": { "limit": 3 },
  "duration_ms": 505,
  "result_status": "ok",
  "result_count": 3,
  "result_size_bytes": 61621,
  "upstream_url": "GET /api/v1/reports"
}

Three records. ~20.5 KB per record, returned to the agent verbatim. From the server's point of view, a clean success.

From the agent's point of view, less so:

[list_bugs]
  OUT: Error: result (61,621 characters across 236 lines)
       exceeds maximum allowed tokens. Output saved to disk.
       For targeted searches: use grep on the file directly.

Thought: Output is huge (61 KB for 3 records). Delegating to a sub-agent.
Agent: Extract 3 latest records summary
[Read]   …/tool-results/list_bugs-….txt
[Grep]   "^ \"(id|title|status|priority|created_at)\":" …

To produce a three-row, five-column table, the agent had to hit the overflow ceiling, spawn a sub-agent, read the saved file, grep for fields, and reassemble the result. That is four tool calls, counting the original, plus orchestration overhead. And this was for three records; limit: 20 would have blown past any reasonable budget.

One observation worth making before the diagnosis: the recovery sequence above is harness-specific, and the difference can be sharp even within a single vendor. After the incident, we ran a controlled overflow (a tool returning ~80 KB, comparable to the original case) through four MCP clients: Claude Code, Claude Desktop, Goose (Block), and Cursor. We saw two distinct behaviours. Only Claude Code gated the tool result by character count and went into the save-to-disk recovery dance described above. The other three, including Claude Desktop from the same vendor, had no such gate: each one fed the whole payload straight to its underlying model, which read it and answered correctly. The "silent degradation" we were worried about, where raw injection leads to a confidently wrong answer instead of a clear failure, did not appear at 80 KB in any of the three non-gating clients. We expect it would appear at sizes large enough to stress the model's own context budget, but we did not test that here.

A reader could fairly observe that this finding partly undercuts the drama of the opening incident: three of the four clients we tested would have absorbed our 80 KB without overflow, so the failure was somewhat client-specific. That is true, and it doesn't weaken the case for the projection fix. Even when a harness absorbs the full payload without complaint, the agent still pays for every token of it on every turn the result remains in context. The fix is about keeping the context budget lean; preventing overflow is a downstream consequence, not the primary motivation.

Why it happened

GET /api/v1/reports?limit=3 doesn't return summaries. It returns full records — every field on every row. For real production records that includes the full description, captured browser/network telemetry, stack trace, and replay metadata. ~20 KB each; three of them came to 60+ KB.

The MCP server's handler did the obvious thing:

const data = await client.request('GET', path, { params });
return { data, resultCount: data?.data?.length ?? 0 };

A perfectly correct REST proxy. That's why it broke.

The mismatch is structural. REST list endpoints are designed for UIs that want to minimise round-trips and join client-side — a thin-record response would be considered underpowered by such a consumer. Agent-facing tools operate under the opposite pressure: the agent reads every byte you return into its context window, and pays the cost twice — in dollars and in attention. Agents also cannot skim the way a human can: a developer in Postman scans the JSON visually and ignores irrelevant fields, while the agent must consume the entire payload before it can decide what is relevant.

So list_bugs the REST endpoint and list_bugs the MCP tool look like the same operation. They are not.

The fix

Three changes, ordered by impact.

1. Project list-mode results to a thin record

The largest contributor — remove heavy fields from list-mode tools before returning. An agent performing triage does not need network logs to decide which record warrants closer attention. It needs an ID, a title, status, priority, the timestamps, and a project handle. If an entry merits further investigation, the agent follows up with get_bug and pays the full payload cost on that one record only.

const LIST_FIELDS = [
  'id', 'title', 'status', 'priority',
  'created_at', 'updated_at', 'project_id',
] as const;

function thinRecord(r: unknown): Record<string, unknown> {
  if (!r || typeof r !== 'object') return {};
  const src = r as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const k of LIST_FIELDS) if (k in src) out[k] = src[k];
  return out;
}

For calibration: on our production-shaped records, this projection takes each record from roughly 20 KB (full payload including description, console array, network logs, replay metadata) to roughly 280 bytes (just the allowlisted fields). The per-record reduction is governed entirely by how heavy the dropped fields are on your particular records; the principle transfers, the size delta does not.
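A minimal sketch of how the projection might be wired into the list handler, and how the size delta can be measured on your own records. Only LIST_FIELDS and thinRecord come from the code above; the ListResponse envelope (a list under a data key, inferred from the handler's data?.data access), the projectList and byteSize helpers, and the sample record are assumptions made for illustration.

```typescript
// Allowlist and projection repeated from above so the sketch is self-contained.
const LIST_FIELDS = [
  'id', 'title', 'status', 'priority',
  'created_at', 'updated_at', 'project_id',
] as const;

function thinRecord(r: unknown): Record<string, unknown> {
  if (!r || typeof r !== 'object') return {};
  const src = r as Record<string, unknown>;
  const out: Record<string, unknown> = {};
  for (const k of LIST_FIELDS) if (k in src) out[k] = src[k];
  return out;
}

// Hypothetical upstream envelope: we assume the rows live under `data`.
interface ListResponse { data?: unknown[] }

// Hypothetical helper: project every row before it reaches the agent.
function projectList(
  upstream: ListResponse,
): { data: Record<string, unknown>[]; resultCount: number } {
  const rows = Array.isArray(upstream.data) ? upstream.data : [];
  const data = rows.map(thinRecord);
  return { data, resultCount: data.length };
}

// UTF-8 byte length of the compact JSON serialisation.
function byteSize(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Made-up record; one heavy field stands in for captured telemetry.
const fat = {
  id: 'bug-1', title: 'Checkout button unresponsive', status: 'open',
  priority: 'high', created_at: '2024-01-01T00:00:00Z',
  updated_at: '2024-01-02T00:00:00Z', project_id: 'proj-1',
  network_logs: 'x'.repeat(20_000),
};

const before = byteSize({ data: [fat] });
const after = byteSize(projectList({ data: [fat] }));
console.log({ before, after });
```

Because the projection is an allowlist rather than a blocklist, a heavy field added upstream later is dropped by default instead of leaking into the agent's context.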

The tool description was updated alongside the code change to make the contract explicit: "Returns thin records — for full content, follow up with get_bug." That string is not documentation. It is the primary natural-language signal the agent uses when choosing which tool to dispatch — the tool name and the input schema also contribute, but the description is what disambiguates list_bugs from get_bug when both are plausible candidates. A description that says "returns full details" trains the agent to treat list-mode and detail-mode as interchangeable. The response-shape change alone, without the description update, would have traded the overflow failure for a worse failure mode: the agent stops at the thin record, does not recognise that get_bug is the appropriate next call, answers from incomplete information, and does not request the remaining detail. The code change and the description change must land together.
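One way to keep the two changes from drifting apart is to assert the contract in a test. The sketch below assumes a hypothetical TOOLS registry of name and description pairs; it is not the MCP SDK's API. The list_bugs wording is adapted from the description quoted above, and the get_bug wording and the describesFollowUp helper are invented for illustration.

```typescript
// Hypothetical registry shape; a real server registers tools through its MCP SDK.
interface ToolSpec { name: string; description: string }

const TOOLS: ToolSpec[] = [
  {
    name: 'list_bugs',
    // Wording adapted from the description quoted in the text.
    description: 'Returns thin records. For full content, follow up with get_bug.',
  },
  {
    name: 'get_bug',
    // Assumed wording, for illustration only.
    description: 'Returns the full record for a single bug by id.',
  },
];

// True when `from` is registered, `to` is registered, and the description
// of `from` names `to` as the follow-up call.
function describesFollowUp(tools: ToolSpec[], from: string, to: string): boolean {
  const source = tools.find((t) => t.name === from);
  const target = tools.find((t) => t.name === to);
  return !!source && !!target && source.description.includes(to);
}

console.log(describesFollowUp(TOOLS, 'list_bugs', 'get_bug'));
```

A check like this fails the build if someone renames get_bug or rewrites the list_bugs description without keeping the pointer, which is the drift the paragraph above warns about.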

The trade-off this hides: thin projection imposes extra round-trips. If the agent triages 20 candidates and decides to inspect 5 of them in detail, the cost is 5 additional get_bug calls — precisely the N+1 problem the REST list endpoint was designed to avoid. The trade-off reverses between the two consumers: a UI is latency-bound (round-trips dominate, byte size is nearly free), an agent is budget-bound (token cost dominates, while sequential 2–3 KB calls are absorbable).

2. Bounded excerpts for search results

Search hits had the same problem in a different form. The intelligence service returned full descriptions inside each hit. We projected the hits to the same thin set, and then synthesised a bounded excerpt:

const EXCERPT_MAX = 240;

function makeExcerpt(s: string): string {
  const flat = s.replace(/\s+/g, ' ').trim();
  if (flat.length <= EXCERPT_MAX) return flat;
  const cut = flat.slice(0, EXCERPT_MAX);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > EXCERPT_MAX - 40 ? cut.slice(0, lastSpace) : cut) + '…';
}

The whitespace collapse matters more than it appears — production descriptions are multi-paragraph with indentation, and verbatim slicing produces an excerpt full of \n\n and wasted bytes. The right order is: flatten the text first, then cut at a word boundary.

A limitation specifically for search: this is head-of-string truncation. If the query matches at character 1500 of a 3000-character description, the excerpt is the first 240 characters and does not contain the match — the agent receives a hit whose excerpt does not show why it is a hit. A query-aware snippet (locate the match position, centre the excerpt around it) is the correct answer for search hits. We have not yet implemented it.
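For reference, a rough sketch of what a query-aware snippet could look like. This is not code from the project. It assumes plain case-insensitive substring matching, which may not hold if the search backend matches on stems or embeddings, in which case there is no literal position to centre on. makeSnippet and SNIPPET_MAX are hypothetical names.

```typescript
const SNIPPET_MAX = 240; // same budget as EXCERPT_MAX above

// Hypothetical helper: centre a bounded window on the first literal match.
// Falls back to the head of the string when the query does not appear verbatim.
function makeSnippet(text: string, query: string, max: number = SNIPPET_MAX): string {
  // Flatten first, as in makeExcerpt, so the window is not spent on whitespace.
  const flat = text.replace(/\s+/g, ' ').trim();
  if (flat.length <= max) return flat;

  // Case-insensitive search. Lowercasing can shift indices by a character
  // for a few Unicode letters; acceptable for a sketch.
  const q = query.trim().toLowerCase();
  const hit = q ? flat.toLowerCase().indexOf(q) : -1;

  // Window start: centre on the match, clamped to the string bounds.
  let start = 0;
  if (hit >= 0) {
    const centre = hit + Math.floor(q.length / 2);
    start = Math.max(0, Math.min(centre - Math.floor(max / 2), flat.length - max));
  }
  const end = start + max;

  // Mark each side that was cut.
  const prefix = start > 0 ? '…' : '';
  const suffix = end < flat.length ? '…' : '';
  return prefix + flat.slice(start, end).trim() + suffix;
}

const description =
  'lorem '.repeat(250) + 'NullPointerException in checkout ' + 'ipsum '.repeat(250);
console.log(makeSnippet(description, 'nullpointerexception'));
```

Unlike makeExcerpt, this version cuts at character positions rather than word boundaries; combining the two is straightforward but omitted to keep the sketch short.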

3. Compact JSON

Cosmetic but inexpensive. The dispatch layer was calling JSON.stringify(data, null, 2), which pretty-prints with two-space indentation. Indentation is not free for an LLM consumer the way it is for a human reader: whitespace is tokenised too, and those tokens consume budget without carrying any meaning. Drop the replacer and indent arguments (the second and third), and add a regression test that asserts the serialised output contains no newlines. A newline byte is an unambiguous signature of pretty-printing, because JSON.stringify escapes newlines inside string values.
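A sketch of that regression check, written as plain assertions rather than against a particular test runner. serialiseResult is a hypothetical name for the dispatch-layer function that turns a tool result into text, assumed here to be a thin wrapper over JSON.stringify; the sample payload is made up.

```typescript
// Hypothetical dispatch-layer serialiser: compact output, no indent argument.
function serialiseResult(data: unknown): string {
  return JSON.stringify(data);
}

// Regression check: compact JSON never contains a raw newline, because
// JSON.stringify escapes newlines inside string values as backslash + n.
function assertCompact(serialised: string): void {
  if (serialised.includes('\n')) {
    throw new Error('tool output is pretty-printed; expected compact JSON');
  }
}

// The title deliberately contains a newline to show it does not trip the check.
const sample = { data: [{ id: 'bug-1', title: 'line one\nline two' }] };

assertCompact(serialiseResult(sample)); // passes

// The old behaviour, for comparison: characters spent purely on layout.
const pretty = JSON.stringify(sample, null, 2);
console.log(pretty.length - serialiseResult(sample).length);
```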

Why this surfaced only in production

The reason this manifested on the first production call, and not at any point during local development, is that our dev fixtures contained thin records. Synthetic descriptions of two sentences. No captured console. No network logs. A "record" in the fixture weighed perhaps 1 KB.

If we had been our own first user — running the MCP server against the dev stack — list_bugs(limit=3) would have returned approximately 3 KB. Manageable. Nothing to flag. We would have shipped the same code unchanged, considered the task complete, and waited for a customer to discover the failure mode in production.

The general form: dev data hides 99th-percentile costs. Anything that is bounded in production but unbounded in principle (free-text descriptions, uploaded files, captured ambient telemetry) tends to weigh almost nothing in fixtures and almost everything in real use. Type checking, unit tests, even integration tests against mocked upstreams — none of these surface the problem. The shape is correct. The size is wrong.

Two process changes that follow from this:

Run smoke tests against production-shaped data, not against synthetic fixtures. Either replay anonymised real payloads through the test suite, or point the local build at a small staging instance with realistic data before declaring the work complete.
Log byte sizes, not only success and failure. The only reason we detected this so quickly is that the behavioural logger records result_size_bytes on every tool call. The 61,621 was not an error as far as the server was concerned: the call had succeeded. It was the size that broke things, and the size is visible only if you instrument for it.

The instrumentation pattern itself is small — { tool, args_size_bytes, result_size_bytes, duration_ms, error_class } per call, JSONL append-only, daily-rotated — and it is worth copying into any MCP server you build. Type signatures will catch the wrong-shape bug. Behavioural logs are the only place where the wrong-size bug surfaces early enough to be fixed cheaply.
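A minimal sketch of that pattern, under stated assumptions. withSizeLog, ToolLogEntry, jsonBytes, and the sink callback are hypothetical names, not the project's code; file appending and daily rotation are left to whatever sink you pass in. Sizes are measured as UTF-8 bytes of the compact JSON form, which is one reasonable definition of "size" but an assumption nonetheless.

```typescript
// One JSONL entry per tool call; field names follow the list above.
interface ToolLogEntry {
  tool: string;
  args_size_bytes: number;
  result_size_bytes: number;
  duration_ms: number;
  error_class: string | null;
}

// UTF-8 byte length of the compact JSON form (0 when there is no value).
function jsonBytes(value: unknown): number {
  const text = JSON.stringify(value);
  return text === undefined ? 0 : new TextEncoder().encode(text).length;
}

// Hypothetical wrapper: runs a handler and emits exactly one log line,
// whether the handler resolves or throws.
async function withSizeLog<A, R>(
  tool: string,
  args: A,
  handler: (args: A) => Promise<R>,
  sink: (line: string) => void, // e.g. append to a daily-rotated .jsonl file
): Promise<R> {
  const started = Date.now();
  let result: R | undefined;
  let errorClass: string | null = null;
  try {
    result = await handler(args);
    return result;
  } catch (err) {
    errorClass = err instanceof Error ? err.name : 'UnknownError';
    throw err; // logging must not swallow the failure
  } finally {
    const entry: ToolLogEntry = {
      tool,
      args_size_bytes: jsonBytes(args),
      result_size_bytes: jsonBytes(result),
      duration_ms: Date.now() - started,
      error_class: errorClass,
    };
    sink(JSON.stringify(entry));
  }
}

// Usage: collect lines in memory instead of a file.
const lines: string[] = [];
withSizeLog('list_bugs', { limit: 3 }, async () => ({ data: [] }), (l) => { lines.push(l); })
  .then(() => console.log(lines[0]));
```

Logging in the finally block means a failed call still produces an entry, with result_size_bytes at 0 and error_class set, so the log stays one line per call.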

Code at apex-bridge/bugspotter-mcp, MIT.