Dynamic Tool Discovery: A Systematic Evaluation of 8 Search Strategies for AI Agents

profile picture
Karol Piekarski
PhD, Agentic AI & Data Engineer
March 5, 2026
Group
Table of content

The proliferation of tools in AI agent systems creates a context bloat problem: providing an agent with 51 tools consumes 18,492 tokens per query. This study presents a comparison of 8 tool discovery strategies for dynamic tool discovery. We evaluated keyword-based methods (regex, BM25), semantic approaches (RAG with Pinecone, Milvus, cross-encoder), and hybrid fusion (BM25+RAG with RRF) across 100 queries in both one-shot and multi-turn conversation modes using GPT-4o-mini. For tools following recommended naming conventions, keyword methods (94% accuracy) matched or outperformed semantic search (93%) and hybrid approaches (90.5%). The best-performing dynamic strategy (regex) achieved 95% accuracy with 73% token reduction compared to the all-tools baseline (99% accuracy). Our results align with Anthropic’s guidance on tool design: when tools follow recommended naming conventions (service-prefixed, action-suffixed), the naming itself provides sufficient signal for discovery, reducing the need for semantic understanding.


The Context Bloat Problem

Modern AI agent systems increasingly rely on external tools to extend their capabilities beyond pure language modeling. Production systems now commonly integrate dozens to hundreds of tools spanning messaging (Slack), productivity (Calendar, Email), code management (GitHub, Jira), entertainment (Spotify), and more. However, this capability expansion creates a fundamental scalability challenge: context bloat.

In our experimental setup, providing an agent with 51 tool definitions consumes 18,492 tokens per query on average. At scale, this represents a three-fold problem:

  1. Cost: At GPT-4o-mini pricing ($0.15/1M input tokens), processing 10,000 queries with all tools costs $28.35 versus $8.03 with optimized retrieval-a 3.5x cost multiplier.
  2. Context competition: Tool definitions compete with actual conversation content, user instructions, and retrieved documents for limited context window space.
  3. Cognitive load: LangChain’s benchmarking found “a significant decrease in performance with increased context-size, even if that context was irrelevant to the target task.”

This problem is not theoretical. Anthropic’s production systems have encountered token budgets exceeding 134,000 tokens when managing extensive MCP (Model Context Protocol) tool collections.

Research Questions

This study addresses four research questions:

  • RQ1: How do different retrieval strategies compare for tool discovery in terms of accuracy and token efficiency?
  • RQ2: Does semantic search outperform keyword-based methods for tool retrieval?
  • RQ3: Do hybrid approaches that combine lexical and semantic signals achieve better results than either method alone?
  • RQ4: How does conversation context (multi-turn dialogues) affect retrieval performance?

Contributions

This work makes the following contributions:

  1. Systematic comparison: The first rigorous evaluation of 8 retrieval strategies specifically for tool discovery, spanning keyword, semantic, and hybrid approaches.
  2. Counter-intuitive evidence: Empirical demonstration that keyword-based methods outperform semantic search for tool retrieval, contradicting common assumptions in the RAG literature.
  3. Hybrid paradox analysis: Investigation of why hybrid fusion underperforms its component methods, contrary to expectations from document retrieval research.
  4. Practical recommendations: Decision framework mapping use cases to optimize strategies, grounded in quantitative trade-off analysis.

Tool Search in Production

Anthropic has introduced the Tool Search Tool pattern for managing large tool collections, reporting:

“This represents an 85% reduction in token usage while maintaining access to your full tool library. […] Opus 4 improved from 49% to 74%.” – Anthropic

Their approach uses an LLM-in-the-loop architecture where the model itself generates search queries to discover relevant tools. This pattern inspired our investigation but differs from our focus on comparing retrieval algorithms.

Warnings Against RAG for Tools

Phil Schmid explicitly cautions against semantic retrieval for tool definitions:

“Fetching tool definitions dynamically per step based on semantic similarity often fails. It creates a shifting context that breaks the KV cache and confuses the model with ‘hallucinated’ tools that were present in turn 1 but disappeared in turn 2.” – Phil Schmid

This warning informed our hypothesis that semantic approaches might underperform keyword methods for the structured, nomenclature-dense domain of tool definitions.

Multi-Agent Comparisons

LangChain’s benchmarking of multi-agent architectures found that single-agent performance degrades significantly with increasing tool count, particularly when “distractor domains” introduce unrelated tools. Their findings suggest tool discovery is a critical bottleneck, though they did not systematically compare retrieval strategies.


Vstorm’s Methodology

Experimental Setup

Experiment Configuration

Parameter

Value

Total tools

51 across 16 categories

Test queries

100 (ranging from single to multi-tool)

Evaluation modes

One-shot, Conversation (5-turn batches)

LLM

GPT-4o-mini (OpenAI)

Embedding model

text-embedding-3-large (3,072 dimensions)

Top-k retrieval

5 tools per search

Vector databases

Pinecone, Milvus

All tools were mock implementations with realistic signatures and docstrings, simulating a production assistant with capabilities across Slack, Spotify, Calendar, Weather, GitHub, Jira, Email, Shopping, Contacts, Files, Documents, Notes, Reminders, Timezone, Translation, and Calculator domains.

Query dataset construction: 100 test queries were manually authored based on the available tool definitions. For each tool, we crafted example queries that a user might realistically give to an assistant, ranging from single-tool requests (e.g., “Check the weather in New York“) to multi-tool compositions requiring 2–3 tools in sequence (e.g., “Find the email about Q4 Planning, mark it as read, and reply“). Each query was annotated with expected tool(s) and a reference answer.

Search Strategies

We evaluated 8 strategies spanning three categories:

Strategy Implementations

Strategy

Category

Implementation Details

all_tools

Baseline

All 51 tools loaded into context

regex

Keyword

Weighted scoring: +10 phrase match, +5 name match, +2 description match

bm25

Keyword

BM25 ranking via bm25s library with PyStemmer

pinecone

Semantic

Pinecone vector search with OpenAI embeddings

milvus

Semantic

Milvus vector search with OpenAI embeddings

cross_encoder

Semantic

ms-marco-MiniLM-L-6-v2 reranker scoring all pairs

hybrid_pinecone

Hybrid

BM25 + Pinecone with RRF fusion (k=60)

hybrid_milvus

Hybrid

BM25 + Milvus with RRF fusion (k=60)

Search Strategy Characteristics

Strategy

Search Latency

External API

Infrastructure

Initialization

regex

~0ms

None

None

Instant

bm25

~10ms

None

None

~100ms (tokenization)

cross_encoder

~50–100ms

None

None

~2s (model load)

pinecone

~200–500ms

OpenAI Embeddings

Pinecone (cloud)

~500ms

milvus

~200–500ms

OpenAI Embeddings

Milvus (Docker)

~500ms

hybrid_pinecone

~300–600ms

OpenAI Embeddings

Pinecone (cloud)

~600ms

hybrid_milvus

~300–600ms

OpenAI Embeddings

Milvus (Docker)

~600ms

Metrics

Primary metrics:

  • Accuracy: Binary classification-did the agent call at least one expected tool?
  • Total tokens: Sum of input and output tokens per query

Secondary metrics:

  • Duration (seconds)
  • API requests count
  • Tool calls count

Statistical tests:

  • Two-sample t-test for mean comparisons
  • Mann-Whitney U test for non-parametric validation
  • Cohen’s d for effect size quantification

Tool Categories

The 51 tools were distributed across 16 categories:

Category

Tool Count

Example Tools

Slack

5

slack_send_message_to_channel, slack_send_direct_message, slack_get_channels

Spotify

7

spotify_play_track, spotify_get_playlist, spotify_set_volume, spotify_search_tracks

Calendar

3

calendar_get_events, calendar_schedule_meeting, calendar_find_free_slots

Email

5

email_get_inbox, email_send, email_reply, email_search

GitHub

4

github_get_issues, github_get_pull_requests, github_get_repos

Jira

5

jira_get_issues, jira_create_issue, jira_update_issue_status

Shopping

4

shopping_search_products, shopping_add_to_cart, shopping_get_order_status

Files

3

files_search, files_move, files_get_info

Documents

3

documents_share, documents_list, documents_get_content

Notes

2

notes_list, notes_create

Weather

3

weather_get_current, weather_should_bring_umbrella, weather_get_forecast

Contacts

2

contacts_get_by_company, contacts_search

Reminders

2

reminders_create, reminders_list

Timezone

1

timezone_convert

Translate

1

translate_text

Calculator

1

calculator_evaluate


Results

Accuracy Analysis

One-Shot Accuracy Rankings

Rank

Strategy

Accuracy

Correct/Total

1

all_tools

99.0%

99/100

2

regex

95.0%

95/100

2

cross_encoder

95.0%

95/100

4

pinecone

94.0%

94/100

5

bm25

93.0%

93/100

6

hybrid_milvus

92.0%

92/100

7

milvus

90.0%

90/100

8

hybrid_pinecone

89.0%

89/100

Key observations:

  1. The baseline (all_tools) achieved near-perfect accuracy (99%), establishing an upper bound.
  2. Dynamic strategies ranged from 89% to 95% accuracy-a 4-10 percentage point gap from baseline.
  3. Hybrid approaches ranked last among dynamic strategies, contradicting expectations that combining signals would improve results. Our hybrid implementation uses Reciprocal Rank Fusion (RRF): both BM25 and semantic search independently retrieve top_k × 3 candidates, then their rankings are merged using weighted 1/(k + rank) scoring with equal 50/50 weights. When the two rankers disagree—e.g., keyword correctly ranks documents_share first while semantic surfaces files_share—RRF averages the conflicting ranks, potentially demoting the correct tool below the top-k cutoff. We analyze this effect in detail in Section 5.2.
  4. The simplest keyword method (regex) tied for best dynamic performance despite minimal computational overhead.
image

Grouped bar chart comparing accuracy across one-shot and conversation modes for all strategies.

Token Efficiency

Token Usage Comparison

Strategy

Avg Tokens

Std Dev

Min

Max

Reduction vs Baseline

regex

4,986

2,702

1,976

18,311

73.0%

cross_encoder

5,643

3,374

2,319

20,049

69.5%

hybrid_pinecone

5,886

4,492

0

28,477

68.2%

hybrid_milvus

5,961

4,445

0

30,801

67.8%

milvus

5,967

4,824

0

29,435

67.7%

bm25

6,728

8,132

2,290

75,406

63.6%

pinecone

7,459

6,786

2,298

41,054

59.7%

all_tools

18,492

9,224

10,711

72,004

All dynamic strategies achieved 60-73% token reduction compared to baseline. The regex strategy was most efficient, using 3.7x fewer tokens than all_tools.

image

Stacked bar chart showing input/output token breakdown per strategy.

Statistical significance: The difference between all_tools and all dynamic strategies was highly significant (p < 0.001, Cohen’s d > 1.3 “Large” effect size for all comparisons).

Statistical Significance (Token Comparison)

Comparison

Mean Diff

T-statistic

p-value

Cohen’s d

Effect Size

all_tools vs regex

13,506

14.05

< 0.001

1.99

Large

all_tools vs cross_encoder

12,849

13.08

< 0.001

1.85

Large

all_tools vs hybrid_pinecone

12,606

12.29

< 0.001

1.74

Large

all_tools vs hybrid_milvus

12,530

12.24

< 0.001

1.73

Large

all_tools vs milvus

12,525

12.03

< 0.001

1.70

Large

all_tools vs bm25

11,764

9.57

< 0.001

1.35

Large

all_tools vs pinecone

11,033

9.64

< 0.001

1.36

Large

regex vs pinecone

-2,473

-3.39

0.001

-0.48

Small

regex vs bm25

-1,742

-2.03

0.044

-0.29

Small

Strategy Type Comparison

Aggregating by strategy type reveals clear patterns:

Performance by Strategy Type (One-Shot)

Type

Strategies

Avg Accuracy

Avg Tokens

Avg Duration

Baseline

all_tools

99.0%

18,492

5.30s

Keyword

regex, bm25

94.0%

5,857

5.60s

Semantic

pinecone, milvus, cross_encoder

93.0%

6,356

6.52s

Hybrid

hybrid_pinecone, hybrid_milvus

90.5%

5,924

6.45s

Finding: Keyword (94.0%) > Semantic (93.0%) > Hybrid (90.5%)

This contradicts the assumption that semantic understanding of query intent would improve tool matching. Keyword methods’ advantage likely stems from tool naming conventions that embed functional keywords directly in identifiers.

image

Side-by-side comparison of strategy types showing accuracy and token usage for one-shot vs conversation modes.

image

Scatter plot showing accuracy vs token cost trade-off, identifying the Pareto frontier.

Conversation Mode Impact

We evaluated multi-turn conversations where context accumulates across 5 messages.

Context Growth Rate (Message 1 → Message 5)

Strategy

Tokens Msg 1

Tokens Msg 5

Total Growth (%)

Avg Growth Rate

all_tools

18,302

20,150

+10.1%

3.67% per msg

bm25

7,412

12,343

+66.5%

17.20% per msg

hybrid_milvus

6,348

11,383

+79.3%

15.88% per msg

hybrid_pinecone

6,771

13,111

+93.6%

19.07% per msg

cross_encoder

5,344

11,266

+110.8%

21.16% per msg

milvus

6,737

14,384

+113.5%

21.33% per msg

regex

4,893

10,834

+121.4%

23.04% per msg

pinecone

5,518

14,150

+156.5%

36.00% per msg

Key insight: While dynamic strategies have higher relative growth rates, the absolute token counts remain substantially lower than baseline throughout the conversation. Growth is linear, not exponential.

image

Line chart showing cumulative token usage across conversation turns.

Accuracy by Message Position (Conversation Mode)

Message Index

all_tools

regex

pinecone

milvus

cross_encoder

bm25

hybrid_pinecone

hybrid_milvus

1

90.0%

90.0%

95.0%

100.0%

90.0%

85.0%

85.0%

85.0%

2

100.0%

100.0%

90.0%

90.0%

100.0%

95.0%

90.0%

90.0%

3

100.0%

95.0%

80.0%

85.0%

85.0%

90.0%

85.0%

85.0%

4

100.0%

95.0%

95.0%

95.0%

95.0%

90.0%

90.0%

90.0%

5

100.0%

100.0%

95.0%

95.0%

100.0%

95.0%

95.0%

95.0%

image

Line chart showing accuracy changes across conversation turns.

Accuracy Delta (Conversation – One-Shot)

Strategy

One-Shot

Conversation

Delta

Direction

milvus

90.0%

93.0%

+3pp

↑ Better

regex

95.0%

96.0%

+1pp

↑ Better

hybrid_pinecone

89.0%

89.0%

0pp

= Same

all_tools

99.0%

98.0%

-1pp

↓ Worse

cross_encoder

95.0%

94.0%

-1pp

↓ Worse

bm25

93.0%

91.0%

-2pp

↓ Worse

pinecone

94.0%

91.0%

-3pp

↓ Worse

hybrid_milvus

92.0%

89.0%

-3pp

↓ Worse

Conversation mode had mixed effects: some strategies improved (milvus, regex) while others degraded (pinecone, hybrid_milvus). This suggests conversation context can either help (by providing disambiguation cues) or hurt (by introducing noise).

image
image

Waterfall chart showing accuracy changes between one-shot and conversation modes.

image

Horizontal bar chart comparing token overhead between modes.

Failure Analysis

Accuracy by Tool Category

Hardest Categories (Aggregated Across All Strategies)

Category

Avg Accuracy

Sample Size

Best Strategy

Worst Strategy

Challenge

documents

81.2%

4

pinecone, milvus (100%)

all others (75%)

Semantic methods excel on naming-ambiguous queries

notes

77.1%

12

all_tools (100%)

hybrid_rag, milvus (66.7%)

Ambiguous with documents/files

spotify

77.7%

14

all_tools (100%)

hybrid_pinecone, milvus (64.3%)

Multi-tool queries complexity

github

90.9%

11

all_tools, regex, bm25 (100%)

hybrid_pinecone (82%)

Technical terminology

weather

95.8%

9

all_tools, regex, pinecone, milvus, cross_encoder (100%)

bm25 (88%)

timezone

95.8%

3

most (100%)

milvus (50%)

Single outlier

translate

93.8%

2

most (100%)

regex (50%)

Single outlier

slack

93.8%

24

most (100%)

hybrid_pinecone (89%)

calculator

97.5%

10

all (100%)

Unambiguous domain

calendar

100.0%

9

all (100%)

Unambiguous domain

contacts

100.0%

4

all (100%)

Unambiguous domain

email

96.5%

18

all (100%)

files

97.9%

12

all (100%)

jira

98.9%

11

all (100%)

reminders

100.0%

9

all (100%)

Unambiguous domain

shopping

100.0%

13

all (100%)

Unambiguous domain

Key insight on documents category: While overall accuracy is 81.2%, semantic methods (pinecone, milvus) achieve 100% accuracy compared to 75% for all other strategies. The critical difference appears on queries with naming ambiguity-for example, “Find the ‘Design Mockups.fig’ file” requires documents_list but non-semantic methods match files_search due to the word “file” in the query. This demonstrates that for categories with naming overlap (documents vs files), semantic understanding provides measurable benefit.

Implication: Tool naming conventions matter more than search algorithm choice. When tool names are unambiguous (e.g., slack_*, calendar_*), all strategies achieve near-perfect accuracy. The performance gap emerges only when semantic boundaries do not align with lexical signals.

Accuracy by Query Complexity

Performance vs Expected Tools Count

Expected Tools

Query Count

Best Strategy

Worst Strategy

Accuracy Range

1 tool

11

bm25, hybrid_milvus, hybrid_pinecone, milvus, pinecone (100%)

regex, cross_encoder (90.9%)

90.9–100%

2 tools

75

all_tools (98.7%)

hybrid_pinecone (89.3%)

89.3–98.7%

3 tools

14

all_tools, regex, cross_encoder (100%)

hybrid_milvus, hybrid_pinecone, milvus (78.6%)

78.6–100%

Multi-tool queries showed higher variance. The hybrid strategies degraded most severely on 3-tool queries (78.6%), while regex and cross_encoder maintained 100% accuracy.

image

Grouped bar chart showing accuracy breakdown by query complexity (1, 2, 3 expected tools).

Tool Discovery Rate

Full Discovery Rate by Strategy

Strategy

Full Discovery (%)

Partial (%)

No Discovery (%)

all_tools

78.0%

21.0%

1.0%

cross_encoder

58.0%

37.0%

5.0%

regex

56.0%

39.0%

5.0%

pinecone

53.0%

41.0%

6.0%

bm25

51.0%

42.0%

7.0%

hybrid_pinecone

51.0%

38.0%

11.0%

milvus

51.0%

39.0%

10.0%

hybrid_milvus

48.0%

44.0%

8.0%

False Positive Analysis

False Positive Rates by Strategy

Strategy

Avg False Positives

Max FP

% Queries with FP

Total FP

all_tools

0.33

2

29.0%

33

regex

1.43

5

100.0%

143

cross_encoder

1.50

5

100.0%

150

hybrid_pinecone

1.67

4

99.0%

167

bm25

1.69

7

100.0%

169

hybrid_milvus

1.70

4

99.0%

170

milvus

1.70

5

97.0%

170

pinecone

1.80

5

100.0%

180

Dynamic strategies consistently retrieved ~1.5 irrelevant tools per query (from top-k=5), but this did not significantly impact accuracy – the LLM correctly ignored irrelevant tools.

Most Frequently Retrieved False Positives:

Tool

False Positive Count

Why Retrieved

search_tools

695

Agent’s search function (expected overhead)

spotify_search_artists

52

Triggered by any music-related query

spotify_add_to_playlist

43

Triggered by “add” + music context

slack_search_users

40

Triggered by any “send to person” query

jira_get_projects

37

Triggered by project/issue terminology

files_search

32

Triggered by “find”/”search” + document context

The search_tools meta-tool dominates false positives (695 calls across all strategies), confirming that dynamic agents actively utilize the search capability ~7 searches per query across all strategies on average.

Worst-Performing Queries

Analysis of queries that failed across multiple strategies reveals systematic failure patterns:

Hardest Queries (Sorted by Failure Count)

Query (truncated)

Expected Tools

Failures

Passed

Root Cause

“Find the note about ‘Project Ideas’…”

notes_search, notes_create

7/8

all_tools

notes confused with documents/files

“Find ‘Design Mockups.fig’ file…”

documents_list, slack_send_dm

6/8

pinecone, milvus

Word “file” triggers files_* over documents_*

“Find ‘casual’ playlist, Shuffle it…”

spotify_get_playlist, notes_create

5/8

all_tools, bm25, cross_encoder

Mock pattern (shuffle → notes) confuses keyword

“Check if ‘Deep Focus’ in playlist…”

spotify_get_playlist, spotify_add_to_playlist, spotify_search

5/8

all_tools, regex, cross_encoder

Multi-tool Spotify queries

“Check if ‘Study Session’ track exists…”

spotify_search, spotify_play, notes_create

4/8

all_tools, regex, pinecone, cross_encoder

Complex 3-tool chain

Key failure patterns:

  1. Naming collision: Queries containing “file” when documents_* tools are expected
  2. Multi-tool complexity: 3+ tool queries have higher variance
  3. Mock patterns: Non-standard workarounds (e.g. using notes_create to simulate shuffle) confuse retrieval

Strategic implications:

  • all_tools baseline passes most edge cases (7/8 strategies failed → all_tools passed)
  • Semantic methods (pinecone, milvus) uniquely pass the documents queries where keyword fails
  • Cross-encoder shows robustness on complex Spotify queries despite being semantic

Discussion

Why Keyword Beats Semantic

Three factors explain the counter-intuitive superiority of keyword methods:

1. Tools are designed artifacts with explicit nomenclature.

Unlike natural documents where concepts may be expressed in varied ways, tool names and descriptions are deliberately crafted with discoverable keywords: slack_send_message, calendar_schedule_meeting, spotify_play_track. These explicit signals favor exact matching.

2. Semantic similarity ≠ functional equivalence.

A query about “documents” is semantically similar to “files” and “notes,” but these map to distinct tools with different capabilities. Semantic embeddings capture topical relatedness rather than functional boundaries, leading to cross-category confusion.

3. The “shifting context” problem applies to all dynamic strategies.

Phil Schmid warns about “shifting context” where “tools present in turn 1 disappear in turn 2,” confusing the model. While he attributes this to semantic retrieval specifically, our data suggests this is inherent to any dynamic tool selection—keyword methods included. In conversation mode, bm25 degraded by -2pp and pinecone by -3pp, showing that both sparse and dense retrieval suffer when the available toolset shifts between turns. The fundamental issue is not the retrieval algorithm but the dynamic nature of the toolset itself.

The Hybrid Paradox

Hybrid approaches were expected to combine the precision of keyword matching with the recall of semantic search. Reality proved otherwise:

Observed: Hybrid (90.5%) < Semantic (93.0%) < Keyword (94.0%)

However, the paradox is nuanced. Per-query analysis reveals asymmetric behavior:

Hybrid Strategy

Better than Component

Worse than Component

Net Effect

hybrid_pinecone vs pinecone

3 queries

3 queries

Neutral

hybrid_milvus vs milvus

7 queries

2 queries

+5 net wins

hybrid_milvus slightly outperforms milvus overall (92% vs 90%), likely because it wins more of the disagreement cases.

Hypothesis: Error amplification via RRF (applies primarily to hybrid_pinecone)

RRF assumes both rankers contribute complementary signals. However, when both rankers err in different directions, fusion produces a compromised ranking that inherits errors from both. In tool retrieval, where correct matches require high precision, this error blending is particularly harmful. Notably, RRF is rank-based and discards the original retrieval scores entirely—a different fusion method that incorporates score normalization or adaptive weighting could yield different results (see alternative architectures below).

Consider a query “share the budget document with Alice”:

  • Keyword search correctly ranks documents_share high
  • Semantic search surfaces files_share and email_send (semantically related to “sharing”)
  • RRF fusion averages the ranks, potentially demoting the correct tool

Revised recommendation: For Milvus users, hybrid_milvus may be preferable to pure milvus despite lower aggregate accuracy, as it wins on more individual queries.

Alternative architectures to explore:

  1. Learned weighting: Replace fixed 50/50 weighting with query-dependent weights learned from training data.
  2. Cascade architecture: Use keyword search first, falling back to semantic only when keyword retrieval returns low-confidence results.
  3. Query routing: Classify queries upstream and route to appropriate retriever based on query type (action-oriented → keyword, conceptual → semantic).

Practical Recommendations

Strategy Selection Guide

Use Case

Recommended

Accuracy

Tokens

Cost/100q

Justification

Maximum accuracy

all_tools

99%

18,492

$0.28

Worth cost if tools < 15

Production scale

regex

95%

4,986

$0.08

Best accuracy/token trade-off

Semantic queries

cross_encoder

95%

5,643

$0.09

No API dependency

Zero external cost

bm25

93%

6,728

$0.11

Fully local, fast init

Self-hosted vector

milvus

90%

5,967

$0.10

Vector search without cloud

Cost Analysis at Scale:

Queries

all_tools Cost

regex Cost

Savings

100

$0.28

$0.08

$0.20 (71%)

10,000

$28.35

$8.03

$20.32 (72%)

1,000,000

$2,835

$803

$2,032 (72%)

Note on Embedding Costs: Pinecone/Milvus strategies incur additional OpenAI embedding API costs (~$0.13/1M tokens for text-embedding-3-large). At ~50 tokens per search query, this adds ~$0.0065 per 100 queries-negligible compared to LLM inference costs. However, for fully offline deployments, consider bm25 or cross_encoder which require no external APIs.

Cross-Model Cost Comparison:

The cost savings from dynamic tool discovery become even more significant with larger, more expensive models. Below we compare costs per 100 queries across four frontier models:

Model Pricing

Model

Input ($/1M tokens)

Output ($/1M tokens)

GPT-4o-mini

$0.15

$0.60

GPT-4o

$2.50

$10.00

Claude 3.5 Sonnet

$3.00

$15.00

Gemini 1.5 Pro

$1.25

$5.00

Cost per 100 Queries by Model and Strategy

Strategy

Avg Input

Avg Output

GPT-4o-mini

GPT-4o

Claude 3.5 Sonnet

Gemini 1.5 Pro

regex

4,865

122

$0.08

$1.34

$1.64

$0.67

cross_encoder

5,511

132

$0.09

$1.51

$1.85

$0.76

hybrid_pinecone

5,754

132

$0.09

$1.58

$1.93

$0.79

hybrid_milvus

5,825

136

$0.10

$1.59

$1.95

$0.80

milvus

5,841

126

$0.10

$1.59

$1.94

$0.79

bm25

6,587

141

$0.11

$1.79

$2.19

$0.89

pinecone

7,308

151

$0.12

$1.98

$2.42

$0.99

all_tools

18,356

136

$0.28

$4.73

$5.71

$2.36

Savings from regex vs all_tools by Model

Model

all_tools

regex

Savings

Savings %

GPT-4o-mini

$0.28

$0.08

$0.20

71%

GPT-4o

$4.73

$1.34

$3.39

72%

Claude 3.5 Sonnet

$5.71

$1.64

$4.07

71%

Gemini 1.5 Pro

$2.36

$0.67

$1.69

72%

Cost Projection at Scale (1M Queries)

Model

all_tools

regex

Savings

GPT-4o-mini

$2,835

$803

$2,032

GPT-4o

$47,300

$13,400

$33,900

Claude 3.5 Sonnet

$57,100

$16,400

$40,700

Gemini 1.5 Pro

$23,600

$6,700

$16,900

The savings scale linearly with model cost. For production deployments using Claude 3.5 Sonnet at 1 million queries, dynamic tool discovery saves approximately $40,700-making the optimization not just beneficial but essential for cost management.

image

Bar chart showing cost projection at scale (100 to 1M queries).

image

Normalized heatmap showing multi-metric performance profile.

Tool Design Guidelines

Our category accuracy analysis reveals that tool naming conventions impact retrieval accuracy more than algorithm choice. Categories with unambiguous names achieve 100% accuracy across all strategies, while ambiguous names (documents, notes, files) cause systematic failures.

Recommended naming practices:

  1. Use distinctive prefixes: Avoid semantic overlap between categories
    • Bad: documents_list, files_list, notes_list (confusable)
    • Good: gdocs_list, dropbox_files_list, notion_notes_list (distinctive)
  2. Embed action verbs in function names:
    • Bad: get_user_info (vague)
    • Good: slack_lookup_user_profile (specific)
  3. Include expected query keywords in descriptions:
def spotify_play_track(track_name: str):
    """
    Play a song, music, or audio track on Spotify.
    Use this for: play music, start song, listen to track, queue audio           """
    ...
  1. Avoid synonyms across tool boundaries:
    • If files_share exists, don’t create documents_share
    • Consolidate or use explicit disambiguation: files_share_link vs docs_grant_access

When NOT to use dynamic tool search:

Scenario

Recommendation

< 15 tools

Use all_tools (simplicity > optimization)

Mission-critical accuracy

Accept 99% baseline cost

Ambiguous tool naming

Fix naming before adding search

High-stakes decisions

Human-in-the-loop for tool selection


Limitations & Future Work

Limitations

  1. Mock tools: Our 51 tools were simulations with realistic signatures but no production API integrations. Real-world tool descriptions may vary in quality and consistency.
  2. Single LLM: All experiments used GPT-4o-mini. Results may differ for Claude, Llama, or other models with different tool-calling behaviors.
  3. Fixed top-k=5: We did not optimize the retrieval depth. Larger top-k might benefit semantic methods at the cost of more tokens.
  4. English only: Tool names, descriptions, and queries were exclusively in English. Performance in multilingual settings is unknown.
  5. Binary evaluation metric: Accuracy (any expected tool called) may not capture partial success in multi-tool scenarios where 2/3 required tools were correctly invoked.

Future Work

  1. Adaptive retrieval: Dynamically adjust top-k based on query complexity or retrieval confidence scores.
  2. Query expansion: Augment user queries with synonyms or LLM-generated paraphrases before retrieval.
  3. Tool design study: Quantify the impact of naming conventions on retrieval accuracy. Does send_slack_dm outperform dm_user? What is the optimal description length?
  4. Multi-model evaluation: Replicate experiments across Claude 3.5/4, Llama 3, Mistral, and other capable models.
  5. Hybrid architecture variants: Evaluate cascade, routing, and learned-weighting alternatives to RRF fusion.
  6. Confidence-based fallback: Detect low-confidence retrieval and automatically fall back to all_tools for specific queries.

Conclusion

This study presents the first systematic comparison of retrieval strategies for tool discovery in AI agent systems. Our key findings:

  1. Keyword methods outperform semantic search: regex/BM25 (94% avg) > RAG/Milvus/cross-encoder (93% avg) for tool retrieval—a result that may seem counter-intuitive given the general trend toward semantic approaches in information retrieval.
  2. Hybrid fusion underperforms components: RRF-based hybrid (90.5%) ranked last among dynamic strategies, suggesting error amplification rather than complementary gains.
  3. regex achieves optimal trade-off: 95% accuracy with 73% token reduction ($0.08 vs $0.28 per 100 queries).
  4. Tool design matters more than search sophistication: Categories with naming ambiguity (e.g., documents vs files) showed the largest performance gaps between strategies, highlighting that clear tool naming conventions matter more than retrieval algorithm choice.

Practical advice: Start with keyword matching. Invest engineering effort in tool naming conventions rather than sophisticated retrieval pipelines. For most production systems, regex provides the best accuracy-cost trade-off while remaining trivially simple to implement and debug.

Last updated: March 5, 2026

The LLM Book

The LLM Book explores the world of Artificial Intelligence and Large Language Models, examining their capabilities, technology, and adaptation.

Read it now