Short answer: Chunk the schema at the type boundary, embed each type with its description and its field signatures, and filter by source before you search. Everything else is detail.
This is the post I wish I'd had when I first tried to make an AI agent write a GraphQL query against a schema it had never seen. A year and 300+ production graphs later, the recipe below is what survived contact with reality.
## The problem, stated precisely
You have a GraphQL schema. Your AI agent needs to write a query against it. The schema is too large to paste into the prompt, and even if it fit, stuffing 40k tokens of SDL into context degrades the agent's ability to answer the actual question.
You need a retrieval layer: given a natural-language intent, return the 5-10 most relevant types and fields. Then the agent writes the query from that narrow slice.
Most existing GraphQL MCP servers skip this layer entirely. They wrap an endpoint so the agent can execute operations, which is useful once you know what to send, but it assumes the hard part (understanding the schema) has already happened. The recipe below is about building that missing layer.
Easy to state. Three things make it surprisingly hard:
- Chunk boundaries. A type definition isn't a paragraph. Split it wrong and you destroy meaning.
- Field-level vs type-level retrieval. Both matter, for different questions.
- Multi-tenant filtering. If you run this for more than one team, a query against team A must never return team B's types.
## Chunking: type is the unit
After trying sentence-level, field-level, and paragraph-level chunking, type-level won on every benchmark.
The chunk for each type looks like:
```
type Invoice @key(fields: "id")
"""
Represents a customer invoice. Immutable once issued. Line items are
aggregated at issue time.
"""
fields:
  id: ID!                  # unique invoice identifier
  number: String!          # human-readable invoice number
  customer: Customer!      # customer who received the invoice
  lineItems: [LineItem!]!  # frozen at issue time
  totalAmount: Money!      # sum of line items
  issuedAt: DateTime!      # when the invoice was finalized
  status: InvoiceStatus!   # DRAFT | ISSUED | PAID | VOID
```
Three things matter:
- Type description goes first. The description is the most information-dense part. Putting it at the top matters because many embedding models weight earlier tokens more heavily.
- Fields are included, not just listed. Each field's type and description are part of the chunk. Otherwise a query like "show me line items on an invoice" wouldn't match, because `lineItems` as a bare field name isn't semantically close to "line item."
- Enum values are inlined. When the type references an enum, include the enum's values in the chunk text. This is the single biggest change we made that improved retrieval on filter-related questions.
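The rendering step can be sketched in plain Python. The `TypeDef` and `Field` structures below are hand-rolled stand-ins for whatever your SDL parser produces (e.g. graphql-core's type map); the enum-inlining rule is the one described above:

```python
from dataclasses import dataclass, field

@dataclass
class Field:
    name: str
    type_sig: str     # e.g. "[LineItem!]!"
    description: str

@dataclass
class TypeDef:
    name: str
    description: str
    fields: list
    enums: dict = field(default_factory=dict)  # enum name -> values, for inlining

def render_chunk(t: TypeDef) -> str:
    """Render one type-level chunk: description near the top, then field
    signatures. Enum values referenced by a field are inlined into its comment."""
    lines = [f"type {t.name}", f'"""{t.description}"""', "fields:"]
    for f in t.fields:
        comment = f.description
        base = f.type_sig.strip("[]!")  # unwrap list/non-null to the named type
        if base in t.enums:
            # Inline enum values so filter-style queries match the chunk text.
            comment += " (" + " | ".join(t.enums[base]) + ")"
        lines.append(f"  {f.name}: {f.type_sig}  # {comment}")
    return "\n".join(lines)

invoice = TypeDef(
    name="Invoice",
    description="Represents a customer invoice. Immutable once issued.",
    fields=[
        Field("id", "ID!", "unique invoice identifier"),
        Field("status", "InvoiceStatus!", "current lifecycle state"),
    ],
    enums={"InvoiceStatus": ["DRAFT", "ISSUED", "PAID", "VOID"]},
)
chunk = render_chunk(invoice)
```

The output of `render_chunk(invoice)` is the chunk text that gets embedded, one chunk per type.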
## Embedding choice
We benchmarked eight embedding models on a 400-question eval set against three real graphs. Results:
| Model | Dimensions | Recall@5 | Cost per million tokens |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 0.82 | $0.02 |
| OpenAI text-embedding-3-large | 3072 | 0.88 | $0.13 |
| Voyage voyage-3-large | 1024 | 0.89 | $0.18 |
| Cohere embed-english-v3.0 | 1024 | 0.84 | $0.10 |
We ship text-embedding-3-small by default. The 7-point recall gap to voyage-3-large isn't worth 9x the cost for most teams. When someone hits a retrieval ceiling, we offer text-embedding-3-large as an upgrade.
## Multi-tenant filtering
If you serve more than one workspace, do not use per-tenant collections unless you have under ten tenants. Use one collection with a metadata filter.
In Qdrant, that means a payload field `workspace_id: "ws_abc"` on every point, and every query carries a filter:

```yaml
filter:
  must:
    - key: workspace_id
      match:
        value: "ws_abc"
```
This sounds obvious. The reason it matters: if you do it per-collection, collection creation becomes a bottleneck when teams sign up in bursts, and HNSW index rebuilding on small collections is wasteful. One collection with a filter is faster, cheaper, and scales to thousands of tenants.
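The query shape is the same in any vector store. As a brute-force illustration of one-collection-plus-filter (the payload layout mirrors the `workspace_id` convention above; the "collection" is just a Python list here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# One collection: every point carries its tenant in the payload.
points = [
    {"vector": [1.0, 0.0], "payload": {"workspace_id": "ws_abc", "type": "Invoice"}},
    {"vector": [0.9, 0.1], "payload": {"workspace_id": "ws_xyz", "type": "Order"}},
    {"vector": [0.0, 1.0], "payload": {"workspace_id": "ws_abc", "type": "Customer"}},
]

def search(query_vec, workspace_id, top_k=8):
    # The tenant filter is applied before ranking, so another
    # workspace's points can never appear in the results.
    candidates = [p for p in points
                  if p["payload"]["workspace_id"] == workspace_id]
    candidates.sort(key=lambda p: cosine(query_vec, p["vector"]), reverse=True)
    return [p["payload"]["type"] for p in candidates[:top_k]]
```

Here `search([1.0, 0.0], "ws_abc")` returns `["Invoice", "Customer"]` and can never surface `Order`, because `ws_xyz` points are excluded before scoring.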
## The retrieval pipeline, end to end
```
┌────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Agent asks:    │    │ Embed query with │    │ Vector search    │
│ "billing plan  │───▶│ text-embedding-  │───▶│ in Qdrant with   │
│ with pricing"  │    │ 3-small          │    │ workspace filter │
└────────────────┘    └──────────────────┘    └──────────────────┘
                                                       │
                                                       ▼
┌────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ Agent writes   │◀───│ Render types as  │◀───│ Top 8 types,     │
│ valid query    │    │ compact SDL      │    │ rerank by field  │
│                │    │                  │    │ overlap          │
└────────────────┘    └──────────────────┘    └──────────────────┘
```
The rerank step is a tiny heuristic, not a second model. We boost types whose fields overlap with terms in the query ("pricing" boosts BillingPlan because it has a pricing field). This adds 3-5 points of recall at near-zero cost.
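A sketch of that heuristic in Python. The 0.1 boost weight and the camelCase splitting are illustrative choices, not tuned values:

```python
import re

def rerank(query: str, results: list, boost: float = 0.1) -> list:
    """results: list of (score, type_name, field_names) from the vector search.
    Adds `boost` to a result's score for each query term found among its
    field names, then re-sorts."""
    terms = set(re.findall(r"[a-z]+", query.lower()))

    def boosted(item):
        score, _name, field_names = item
        # Split camelCase field names into lowercase words for matching.
        words = set()
        for f in field_names:
            words.update(w.lower() for w in re.findall(r"[A-Z]?[a-z]+", f))
        return score + boost * len(terms & words)

    return sorted(results, key=boosted, reverse=True)

results = [
    (0.71, "Account", ["id", "owner", "createdAt"]),
    (0.69, "BillingPlan", ["id", "pricing", "billingPeriod"]),
]
reranked = rerank("billing plan with pricing", results)
```

"pricing" and "billing" both match `BillingPlan` fields, so it overtakes `Account` despite the lower raw vector score.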
## What we measured in production
On GQLens workspaces over the last 90 days, before and after adding rerank-by-field-overlap:
- First-try query compile rate: 71% → 84%
- Median agent tool calls per query: 4 → 3
- P95 retrieval latency: 140 ms → 155 ms (small cost, big win)
The 13-point lift in first-try compile rate is the number that matters. That's the difference between an agent your team trusts and one they fight with.
## Pitfalls we hit
Re-embedding on every schema change is too expensive. We now diff schemas and only re-embed changed types. A typical schema update touches 2-5 types, not all 400.
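The diff can be as simple as hashing each rendered chunk and comparing against the previous index. A sketch (deleted types also need their points removed from the index, which this doesn't show):

```python
import hashlib

def changed_types(old_chunks: dict, new_chunks: dict) -> set:
    """Both args map type name -> rendered chunk text.
    Returns the type names that need re-embedding: new types,
    plus existing types whose chunk text changed."""
    def digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        name for name, text in new_chunks.items()
        if name not in old_chunks or digest(old_chunks[name]) != digest(text)
    }

old = {"Invoice": "type Invoice ...", "Customer": "type Customer ..."}
new = {"Invoice": "type Invoice ... totalAmount: Money!",
       "Customer": "type Customer ..."}
to_embed = changed_types(old, new)
```

Only `Invoice` lands in `to_embed`, so a typical schema update re-embeds a handful of types instead of all 400.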
Interfaces confuse the agent if embedded alone. Always embed an interface alongside its implementers and mark the relationship in the chunk text. Otherwise the agent picks the interface, can't select concrete fields, and loops.
Directives matter more than we thought. Types marked @deprecated or @internal should be down-ranked in retrieval, not filtered out. The agent sometimes needs to know a field exists but is deprecated.
## Where this goes next
The piece we're actively working on is hybrid search: vector recall combined with BM25 over field names. Early numbers suggest another 4-6 points of recall on exact-name queries ("give me the Invoice.status enum values"). We'll write that up once the rollout is complete.
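A common way to merge a vector ranking with a BM25 ranking is reciprocal rank fusion. This sketch uses the conventional k=60 constant; it is an illustration of the technique, not a number from our rollout:

```python
def rrf(rankings: list, k: int = 60) -> list:
    """rankings: ranked lists of type names, e.g. [vector_hits, bm25_hits].
    Returns names ordered by summed reciprocal rank: 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["BillingPlan", "Invoice", "Account"]
bm25_hits = ["Invoice", "InvoiceStatus"]   # exact-name match on "Invoice"
fused = rrf([vector_hits, bm25_hits])
```

`Invoice` appears in both lists, so it wins the fused ranking even though neither list put it first by itself.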
If you're building something similar, the sequence I'd recommend:
- Start with type-level chunks and `text-embedding-3-small`.
- Measure first-try compile rate on a real eval set before tuning anything.
- Add rerank-by-field-overlap before you reach for a bigger model.
- Add hybrid search only when pure vector hits a ceiling on exact-name questions.
Every step should be driven by the compile-rate number, not by whether the retrieval "looks good." Retrieval that looks good while the agent still writes broken queries is the most common failure mode in this whole space.