7 min read · Krister Johansson

How we rebuilt the GQLens AI agent: fewer tokens, fewer hallucinations, fewer steps

The architectural changes behind the GQLens v2 agent. Why we deleted the subagent layer, why build+validate is now a single tool, and what the new system prompt teaches the model about your schema.

ai-agents · graphql · prompt-engineering · ai-sdk · developer-experience

Short answer: We deleted the subagent layer, fused query building and validation into a single tool call, added a lighter alternative to introspection, and rewrote the system prompt to teach the model how GraphQL questions actually map to schema operations. The agent now picks the right path on the first try far more often, on roughly half the tool calls, with noticeably less prompt overhead per turn.

This post is the engineering side of the release we shipped last week. If you only want the headline, the LinkedIn version is two paragraphs. This one is for people who want to see the trade-offs.

What the old agent looked like

The previous agent had a layered architecture. The main agent could delegate to two subagents, Scout (schema exploration) and Researcher (deep type analysis), each of which carried its own system prompt and re-derived its own context window before doing work. A typical "build me a query for X" question would walk this path:

  1. Main agent calls graphql_search to find candidate types.
  2. Main agent calls graphql_introspect on one or more types to get field details.
  3. Main agent constructs a query string.
  4. Main agent calls graphql_validate to confirm the query parses against the schema.
  5. If validation fails, main agent re-introspects and retries.

Four to six tool calls in the happy path. Eight or more when the schema had deprecated fields, ambiguous paths, or types large enough to blow past graphql_introspect's token budget.

Two structural problems sat under that workflow:

  • Subagents multiplied prompt overhead. Every Scout invocation rehydrated its own ~1.5k-token system prompt before doing anything useful. The model paid for the same instructions twice (once at the top level, once inside the subagent), even when the subagent only made one tool call.
  • Build and validate were always separate. The model had to write the query string from memory of what it had just introspected, then submit it for validation. That gap is exactly where hallucinated fields slip in.

What changed

Four things. None of them is clever; all of them removed a layer.

1. The subagent layer is gone

We deleted src/lib/graphql-subagents.ts outright. The 233 lines that defined Scout and Researcher are no longer in the codebase. The main agent now calls tools directly.

The reasoning: subagents make sense when you have genuinely independent sub-problems that benefit from their own narrow context. We didn't. Both Scout and Researcher were doing things the main agent could do equally well with the same tools, and the only thing the layer was adding was prompt tokens. So we pulled it.

2. graphql_build_query does build + validate + auto-fix in one call

The new tool takes a verified path from search results (for example query → inventoryLevels → quantities) and:

  1. Generates a complete query with named variables.
  2. Validates it against the live schema in the same call.
  3. Auto-replaces any deprecated fields with their replacements and reports the swap as autoFixes.
  4. Returns validation: { valid, errors, fixedQuery } so the agent can present the result directly when valid, or fix once and resubmit if not.
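To make the fused shape concrete, here is a minimal TypeScript sketch of a build + validate + auto-fix step over a toy two-type schema. Everything here, the `FieldInfo` shape, the `buildQuery` name, the toy schema, is illustrative; it is not the GQLens implementation.

```typescript
// Toy field metadata: enough to validate a path and detect deprecations.
type FieldInfo = { name: string; deprecated?: boolean; replacement?: string };

// Hypothetical two-type schema keyed by parent type/path segment.
const schema: Record<string, FieldInfo[]> = {
  query: [{ name: "inventoryLevels" }],
  inventoryLevels: [
    { name: "quantities" },
    { name: "stockCount", deprecated: true, replacement: "quantities" },
  ],
};

type BuildResult = {
  query: string;
  validation: { valid: boolean; errors: string[]; fixedQuery?: string };
  autoFixes: { from: string; to: string }[];
};

// Build, validate, and auto-fix in one pass over a verified path
// (path excludes the "query" root, e.g. ["inventoryLevels", "quantities"]).
function buildQuery(path: string[]): BuildResult {
  const autoFixes: { from: string; to: string }[] = [];
  const errors: string[] = [];
  const fixedPath = path.map((segment, i) => {
    const parent = i === 0 ? "query" : path[i - 1];
    const field = (schema[parent] ?? []).find((f) => f.name === segment);
    if (!field) {
      errors.push(`Unknown field "${segment}" on "${parent}"`);
      return segment;
    }
    if (field.deprecated && field.replacement) {
      // Swap the deprecated field and record the fix for the agent to surface.
      autoFixes.push({ from: segment, to: field.replacement });
      return field.replacement;
    }
    return segment;
  });
  // Naive nested-selection printer: a { b { c } }
  const body = fixedPath.reduceRight(
    (acc, seg) => (acc ? `${seg} { ${acc} }` : seg),
    ""
  );
  const query = `query { ${body} }`;
  return {
    query,
    validation: {
      valid: errors.length === 0,
      errors,
      fixedQuery: autoFixes.length > 0 ? query : undefined,
    },
    autoFixes,
  };
}
```

The point of the fusion is visible in the return shape: the query string and its validation verdict come from the same pass, so the model never holds an unvalidated query in its head between tool calls.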

This collapsed the most common four-call pattern (search → introspect → build → validate) into two calls (search → build). For multi-path comparisons (think "build me three options so the user can pick"), we encourage the model to call graphql_build_query in parallel for each candidate path, which the AI SDK runtime supports natively.

3. A lighter "just give me field names" tool

graphql_introspect is powerful but expensive. On a schema with a Query root that has 500+ fields, a single deep introspect call returns a payload large enough to dominate the token budget for that turn.

We added graphql_fields(typeName), which returns just the field list for one type, with deprecation flags and replacement hints. The system prompt now points the model at graphql_fields first and reserves graphql_introspect for cases where deep nested type expansion is genuinely needed (or validation failed and the model needs to dig deeper).
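The cheap tool is essentially a projection of the full introspection payload down to names and deprecation hints. A hedged sketch, with illustrative shapes and a `graphqlFields` name of our own choosing:

```typescript
// A slice of what full introspection returns per field. The real payload
// also carries args, nested type refs, descriptions, etc. -- which is
// where the token cost comes from, and exactly what we drop here.
type IntrospectedField = {
  name: string;
  isDeprecated: boolean;
  deprecationReason: string | null;
};

type FieldSummary = { name: string; deprecated?: true; replacement?: string };

function graphqlFields(fields: IntrospectedField[]): FieldSummary[] {
  return fields.map((f) => {
    if (!f.isDeprecated) return { name: f.name };
    // Assumed convention: deprecation reasons often read "Use `newField` instead."
    const match = f.deprecationReason?.match(/`([A-Za-z_][A-Za-z0-9_]*)`/);
    return match
      ? { name: f.name, deprecated: true, replacement: match[1] }
      : { name: f.name, deprecated: true };
  });
}
```

On a 500-field Query root, a list of this shape is a small fraction of the size of a deep introspection result, which is the whole trade.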

4. A rewritten system prompt

The prompt rewrite was the part that took the longest to get right. The version that shipped teaches the model four things explicitly:

  • Operation type inference. "get/list/fetch/show/find/check" → query. "create/add/update/delete/set/change" → mutation. "listen/subscribe/watch" → subscription. The model now sets operationType on its first search call almost every time, which cuts the search result set in half before reranking.
  • Search with 2-3 keywords, not full sentences. Vector search degrades when you embed a full natural-language question instead of the salient nouns. The prompt has explicit examples.
  • Always present ambiguous paths to the user via ask_user. The old prompt let the model decide when to ask. The new one is firm: if 2-5 distinct paths could answer the question, describe them all and ask. If 5+ paths exist, ask the user to narrow the domain first. Don't guess.
  • Think out loud before every tool call. One or two sentences of "you're asking about X, let me try Y" before calling a tool. This isn't decorative; it gives the user a cancel point if the model has misunderstood, and it gives the model a self-consistency check.
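The first rule above is mechanical enough to express as code. This is a minimal sketch of the inference rule as the prompt states it; the function name and the null fallback are ours, not GQLens internals:

```typescript
type OperationType = "query" | "mutation" | "subscription";

// Keyword lists mirror the prompt's rule verbatim.
const OPERATION_KEYWORDS: Record<OperationType, string[]> = {
  query: ["get", "list", "fetch", "show", "find", "check"],
  mutation: ["create", "add", "update", "delete", "set", "change"],
  subscription: ["listen", "subscribe", "watch"],
};

function inferOperationType(question: string): OperationType | null {
  const words = question.toLowerCase().split(/\W+/);
  for (const [op, keywords] of Object.entries(OPERATION_KEYWORDS)) {
    if (keywords.some((k) => words.includes(k))) return op as OperationType;
  }
  return null; // no signal: let search run without an operationType filter
}
```

Setting `operationType` before the first search call is what halves the candidate set, since mutation and subscription fields never compete with query fields for rerank slots.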

We also tightened the deprecation handling. Whenever graphql_build_query swaps a deprecated field, the prompt requires the model to surface the swap to the user in plain language: "Note: oldField is deprecated. This query uses newField instead." Silent swaps confuse users who copy the query into their own code and then can't find the field they originally asked about.
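Rendering that note is a one-liner over the `autoFixes` the tool already returns. The exact message template below is an assumption; only the "surface every swap" rule comes from the prompt:

```typescript
// Turn recorded deprecation swaps into the plain-language notes the
// prompt requires the model to show the user.
function describeAutoFixes(
  autoFixes: { from: string; to: string }[]
): string[] {
  return autoFixes.map(
    (fix) => `Note: ${fix.from} is deprecated. This query uses ${fix.to} instead.`
  );
}
```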

What we measured (informally)

We did not run a formal benchmark for this release. What we have is hands-on testing across a dozen schemas during the rewrite. Treat the numbers below as estimates, not as a study:

  • Tool calls per typical "build me a query" turn: down from ~4-5 to ~2-3.
  • Prompt tokens per turn: down meaningfully, mostly from removing the subagent layer and from the model preferring graphql_fields over deep graphql_introspect.
  • First-try valid query rate: up, mostly because the model can no longer write a query that hasn't been validated against the schema in the same step.

We'll publish hard numbers once we've run the new agent against our standard eval set for long enough to have something defensible. The point of this post is the architecture, not the leaderboard.

What we deliberately didn't do

  • We didn't fine-tune a model. Every change here is at the tool and prompt layer. The base model is still o4-mini via the AI SDK. Fine-tuning is a much bigger commitment, and we want to exhaust prompt and tool changes first.
  • We didn't add memory across turns. Each turn still starts from the messages in the thread. Persistent agent memory is a separate problem with its own privacy and consistency trade-offs, and we haven't decided how we want to handle it yet.
  • We didn't try to make the agent "creative." It still refuses to invent field names and still asks the user when paths are ambiguous. The win we wanted was reliability, not novelty.

Trade-offs we accepted

  • More user prompts. With the new "always use ask_user when paths are ambiguous" rule, the agent will sometimes interrupt a user who would have been fine with the agent picking. We think this is the right default (wrong queries cost more time than one extra click), but it does change the feel of the product. We'll watch usage and revisit.
  • Less context isolation. Without subagents, every tool result flows back into the main context. On schemas with very large search responses this could pressure the context window. We mitigate it by capping result sizes inside the tools and preferring graphql_fields over graphql_introspect.
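The capping half of that mitigation can be sketched as a simple budget cut applied before a tool result re-enters the context. The 4-chars-per-token heuristic and the function shape are assumptions for illustration, not GQLens code:

```typescript
// Rough heuristic: ~4 characters per token for English-ish text.
const APPROX_CHARS_PER_TOKEN = 4;

// Keep whole items until the budget is spent; report how many were dropped
// so the agent can tell the user the result was truncated.
function capToolResult(
  items: string[],
  maxTokens: number
): { items: string[]; truncated: number } {
  const budget = maxTokens * APPROX_CHARS_PER_TOKEN;
  const kept: string[] = [];
  let used = 0;
  for (const item of items) {
    if (used + item.length > budget) break;
    kept.push(item);
    used += item.length;
  }
  return { items: kept, truncated: items.length - kept.length };
}
```

Truncating at item boundaries (rather than mid-string) keeps each surviving search hit usable on its own, which matters when the model is about to pick a path from the list.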

Where this goes next

Two things are queued up:

  1. Hybrid search. Vector recall combined with BM25 over field names, especially for exact-name questions like "show me the fields on Invoice." Early prototypes look promising; we'll write that up when it ships.
  2. A formal eval harness. A fixed set of question/schema pairs with labelled "correct path" outcomes, run on every prompt or tool change. Right now we're regression-testing by hand, which is fine for a small team but won't scale.

If you've been wondering whether the AI agent in your dev tool is worth the tokens it burns, the lesson from this rewrite is the unglamorous one: most of the wins come from removing layers, not adding cleverness. Subagents that don't justify their overhead, validators that aren't co-located with builders, full introspections when a field list would do. Every one of those is a tax the user pays in latency, dollars, and wrong answers.

Try the new agent at gqlens.com.


Written by

Krister Johansson

Founder & Engineer, GQLens

Builder of GQLens. Four years shipping GraphQL APIs and AI-assisted developer tooling. Writes about schema design, MCP, and what actually moves the needle for teams using AI agents on real codebases.