There's a popular pattern in the AI industry right now: give a language model access to tools and let it figure out what to call. It works beautifully in demos. It falls apart in production.
At Atlas Verified, our AI agent has access to over 100 specialized tools — government compliance databases, vessel tracking APIs, sanctions screening services, trade data providers, container tracking systems, and more. Each tool exists because a compliance professional needs that specific data point to make a verification decision. The engineering challenge isn't building the tools. It's making sure the right tools get called, in the right order, with the right inputs, every single time.
This post breaks down the real-world problems of tool orchestration at scale and the engineering patterns we've developed to solve them.
The Tool Explosion Problem
When your agent has 5 tools, tool selection is trivial. The model reads descriptions for each tool, picks the right one, and calls it. But something breaks as you scale past 20, 50, 100 tools.
First, there's the context window problem. Every tool needs a description, parameter schema, and usage guidance in the system prompt. At 100+ tools, that's tens of thousands of tokens consumed before the user's question even enters the context. The model spends more time reading tool documentation than thinking about the problem.
Second, there's the selection accuracy problem. Language models are good at choosing between a handful of clearly differentiated options. They're bad at choosing between dozens of similar-sounding tools. When several tools have overlapping names and purposes, say three different sanctions-screening services, the model often picks the wrong one, or calls multiple tools that return redundant data.
Third, there's the cost and latency problem. Each tool description in the prompt costs tokens. Each unnecessary tool call costs API calls to external services, many of which are paid per request. An agent that calls 8 tools when 3 would suffice is expensive and slow.
Why Static Routing Doesn't Work
The naive solution is static routing: pattern-match on the user's question and route to predetermined tool sets. "Check this company" triggers sanctions screening. "Track this container" triggers vessel tracking. "Verify this certificate" triggers USDA lookup.
This breaks immediately in practice. Consider the question: "Can you verify this organic certificate from India?" A static router might trigger USDA organic database lookup — reasonable. But a compliance professional also wants to know: Is the certifying agent sanctioned? Are there active FDA import alerts for the product from that country? Does the exporter appear in any denied party lists? Are there trade anomalies in the declared route?
A single user question can legitimately require five or more tools across completely different domains. Static routing either misses critical checks or over-triggers everything, which brings us back to the cost and accuracy problems.
Context matters enormously. "Tell me about this company" means completely different things depending on whether the user is looking at a bill of lading, an organic certificate, or having a general research conversation. The same four words require different tools in each context.
Two-Layer Architecture: Selection and Execution
Our solution separates tool selection from tool execution using two distinct model layers.
The selection layer uses a fast, inexpensive model. Its only job is to look at the user's question, the conversation context, and a compact manifest of available tools, then choose which tools should be available for this specific interaction. It doesn't call the tools — it just decides which ones the execution model should have access to.
The selection model sees a streamlined representation of each tool: a one-line description, what data it returns, and its limitations. This is far more compact than full parameter schemas, so the selection model can evaluate all 100+ tools without running into context limits. It typically selects 8-12 tools that are relevant to the current question.
The execution layer uses a more capable reasoning model. It receives only the tools that the selection layer chose, with full parameter schemas and usage guidance. Because it's working with a focused set of tools instead of the entire catalog, it makes better decisions about which to call, in what order, and with what parameters.
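A minimal sketch of the two-layer split. The `Tool` shape, the compact manifest format, and the `select_fn` callback are all illustrative assumptions, not our actual schema; the point is that the cheap selection model sees only one-line summaries, while the execution model later receives full schemas for the chosen subset:

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str                      # one-line description for the selection layer
    full_schema: dict = field(default_factory=dict)  # full parameters, execution layer only


def compact_manifest(tools: list[Tool]) -> str:
    """Build the lightweight catalog the selection model reads.

    One line per tool keeps 100+ tools within a small token budget.
    """
    return "\n".join(f"- {t.name}: {t.summary}" for t in tools)


def route(question: str, tools: list[Tool], select_fn) -> list[Tool]:
    """Layer 1: a fast, inexpensive model picks ~8-12 relevant tool names.

    select_fn is a stand-in for the cheap model call; it returns tool names.
    Only the returned subset (with full schemas) is handed to the execution model.
    """
    chosen = select_fn(question, compact_manifest(tools))
    by_name = {t.name: t for t in tools}
    return [by_name[n] for n in chosen if n in by_name]
```

In practice `select_fn` would wrap an LLM call; filtering unknown names guards against the selection model hallucinating tools that don't exist in the catalog.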
This separation has measurable benefits. Selection takes a fraction of a second and costs very little. The execution model sees a clean, focused context and makes fewer errors. Total tool calls per interaction dropped significantly after we introduced this architecture.
Capability-Based Filtering
Beyond the two-layer selection, every tool in our system declares its capabilities — the types of tasks it can perform. A tool might declare that it's useful for compliance checks, trade intelligence, document analysis, or shipment tracking.
When the system determines the user's intent, it filters the tool catalog to only include tools with matching capabilities. The selection model never even sees tools that are irrelevant to the current task type.
This is important for a subtle reason: language models are suggestible. If a sanctions screening tool is in the context, the model might decide to run a sanctions check even when the user is asking a simple question about shipping routes. By removing irrelevant tools before selection, we eliminate an entire class of unnecessary tool calls.
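Capability filtering can be as simple as a set intersection. The capability labels and dict layout below are illustrative, but they show the key property: irrelevant tools are removed before the selection model's context is ever built:

```python
def filter_by_capability(tools: list[dict], intent_capabilities: set[str]) -> list[dict]:
    """Keep only tools whose declared capabilities overlap the task's intent.

    Tools filtered out here never appear in the selection model's context,
    so the model can't be tempted into calling them.
    """
    return [t for t in tools if t["capabilities"] & intent_capabilities]
```

Because this runs before the selection layer, the suggestibility problem is solved structurally rather than by prompting the model to ignore irrelevant tools.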
Quality Feedback Loops
Tool selection is only half the battle. The other half is ensuring the tools return useful results — and knowing what to do when they don't.
After every tool execution, our system evaluates the result quality. Common failure modes include:
Empty results. The tool ran successfully but returned no data. This could mean the entity doesn't exist in that database (a valid and informative finding) or that the query was poorly formed (a problem that should be fixed and retried).
Low-confidence matches. A fuzzy name search returns results, but the match score is below threshold. The system needs to decide whether to present these as potential matches or discard them and try a different search strategy.
Insufficient sample size. A trade data query returns results from only one data source, when multiple sources should be cross-referenced for reliability. The system recognizes this gap and can prompt additional tool calls to fill it.
When quality checks fail, the system injects guidance back into the agent's reasoning loop. This isn't a simple retry — it's a directed retry that explains what went wrong and suggests how to reformulate the query.
Progressive Result Delivery
Trade compliance verification is inherently multi-step. A user uploads a bill of lading and asks the agent to verify it. The agent needs to check the shipper against sanctions lists, verify the organic certification, look up the vessel's current position, check FDA import alerts for the commodity, and validate the tariff classification. These checks take different amounts of time.
Rather than making the user wait for all checks to complete before showing anything, we stream results progressively. As each tool returns, its results are immediately formatted and delivered to the user's interface. The user sees sanctions results appear, then vessel data, then organic certification status, each rendered as a structured card the moment it's ready.
The final synthesis step, which produces a narrative summary tying all results together, runs only after all tools have completed.
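The progressive-delivery pattern maps naturally onto concurrent tasks with completion-order iteration. This sketch assumes each check is an independent async callable and `emit` pushes a structured card to the user's interface; `synthesize` is a placeholder for the narrative summary step:

```python
import asyncio


def synthesize(results: list[dict]) -> str:
    """Placeholder for the narrative summary; runs only after all checks finish."""
    return f"{len(results)} checks complete"


async def run_checks(checks, emit) -> str:
    """Run verification tools concurrently; deliver each result as it finishes.

    Results stream to the UI in completion order, not submission order,
    so fast checks (e.g. sanctions) appear before slow ones (e.g. vessel data).
    """
    tasks = [asyncio.create_task(c()) for c in checks]
    results = []
    for finished in asyncio.as_completed(tasks):
        result = await finished
        emit(result)              # render a structured card immediately
        results.append(result)
    return synthesize(results)    # summary only after everything has returned
```

`asyncio.as_completed` is what turns batch execution into a stream: the loop body runs as each individual check resolves, rather than after the slowest one.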
Auto-Verification: Closing the Loop
When documents are processed through our OCR pipeline, the extraction system identifies key data points — NOP IDs, company names, vessel identifiers, container numbers, port codes, HS codes. Each of these data points maps to specific verification tools.
Rather than waiting for a user to ask "verify this document," the system pre-computes verification hints. When the user opens a conversation about a document, the agent already knows which tools to call and with what parameters. This dramatically reduces time-to-verification: the agent doesn't need to read the document, identify entities, and then figure out what to check. It already has a verification plan ready to execute.
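Pre-computing a verification plan is essentially a mapping from extracted entity types to tool invocations. The tool names and parameter names below are hypothetical stand-ins, not our actual tool catalog:

```python
# Illustrative mapping: extracted field -> (verification tool, parameter name).
HINTS = {
    "nop_id": ("usda_organic_lookup", "certificate_id"),
    "company_name": ("sanctions_screen", "entity_name"),
    "container_number": ("container_track", "container_id"),
    "hs_code": ("tariff_classify", "hs_code"),
}


def build_verification_plan(extracted: dict) -> list[dict]:
    """Turn OCR-extracted data points into ready-to-execute tool calls.

    Runs at document-processing time, so the plan already exists
    when the user opens a conversation about the document.
    """
    plan = []
    for field_name, value in extracted.items():
        if field_name in HINTS:
            tool, param = HINTS[field_name]
            plan.append({"tool": tool, "args": {param: value}})
    return plan
```

Fields without a mapped tool (a port code, say) are simply skipped; the plan covers only what the system knows how to verify.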
The Principles
Building this system has crystallized several principles that we think apply broadly to AI agent engineering:
Separate selection from execution. Don't make your reasoning model also be your routing model. Use cheap, fast models for classification and expensive, capable models for reasoning.
Reduce choice, increase accuracy. A model choosing between 10 well-matched tools outperforms one choosing between 100 tools every time. Invest in filtering before selection.
Validate after execution, not just before. Tool calls fail in ways you can't predict. Build quality feedback loops that detect bad results and guide retries.
Stream, don't batch. Users should see results as they arrive. Progressive delivery isn't just a UX improvement — it fundamentally changes how users interact with verification data.
Make the intent explicit. Don't let the model infer what "good" looks like. Tell it. Pre-computed verification plans turn open-ended exploration into directed verification.
Atlas Verified's AI agent orchestrates 100+ specialized verification tools to automate trade compliance checks across the global supply chain.