Choosing the right LLM for the job

We get it; it’s hard to decide which large language model to use for a specific business application. New ones come out every month, and without being technically versed, it’s hard to see the forest for the trees. That is why every business-oriented decision-maker can count on our help in answering three questions: should you rely on Claude Sonnet, Gemini Flash, GPT-4o, or Llama; what are their strengths and weaknesses in real-life business applications; and how do such applications differ from what you see in existing benchmarks?
What are we excited about when thinking of models?
From a pragmatic perspective, how well a model supports a business use case, and at what cost, matters more than how high its evaluation metrics rank in model comparisons. It is a different vantage point from what benchmarks show. But let’s start with how to read the benchmarks we are exposed to.
LLM Benchmarks: How are models typically compared?
Recently, xAI and Anthropic released new models: Grok 3 and Claude 3.7 Sonnet. Let’s take a closer look at their release notes, focusing on the benchmarks used to measure model performance. Both companies evaluated their models using eight different benchmarks, but only three — MMMU, AIME’24, and GPQA — were common across both reports.
One of the biggest challenges in benchmarking large language models is the sheer variety of available tests, many of which are frequently updated. Different models excel in different areas, so companies strategically select benchmarks that highlight their model’s strengths and demonstrate competitiveness. This is why the set of reported benchmarks often varies from release to release.
Another notable aspect is the choice of what to compare models with. For example, Anthropic included o1 and o3-mini in its evaluation, while xAI opted not to benchmark against them (at least not for the non-thinking versions). Or take the fact that although Grok 3 in Think mode scores higher than OpenAI’s o1, it is typically not listed directly alongside the OpenAI model for some reason. So the picture the benchmarks paint is often incomplete or even skewed.
The Challenge of Data Contamination in Benchmarks
Another key issue in LLM benchmarking is data contamination: when a model has been trained on the very data later used for evaluation. Since LLMs learn from massive web-scraped datasets, they may encounter benchmark questions during training, leading to artificially high scores driven by memorization rather than reasoning.
This raises concerns about the reliability of benchmarks for natural language processing models. If a model has seen parts of a test before, its performance doesn’t necessarily reflect real-world problem-solving ability. Older, widely used benchmarks are especially vulnerable, as newer models are more likely to have been exposed to them.
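To make the idea concrete, here is a minimal, illustrative sketch of one common contamination heuristic: measuring word n-gram overlap between a benchmark item and crawled training text. The function names and example strings are our own, and real contamination audits work at corpus scale with more robust matching; treat this as a toy demonstration of the principle, not a production check.

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """Lower-cased word n-grams, a rough but common unit for contamination checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in a training document."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(training_doc, n)) / len(bench)

# A high ratio suggests the model may have seen this test item during training.
question = "Which planet in the solar system has the largest number of confirmed moons?"
crawled_page = ("Quiz: which planet in the solar system has the largest number "
                "of confirmed moons? Answer: Saturn.")
print(f"overlap: {overlap_ratio(question, crawled_page):.2f}")  # 1.00 -> likely contaminated
```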
Ranking Models Through Head-to-Head Comparisons
Given the limitations of traditional benchmarks, LMArena (formerly LMSYS Arena) takes a different approach by ranking LLMs through direct output comparisons. Instead of relying on static test sets, it presents two anonymized model responses to human evaluators, who select the better one. This blind evaluation removes biases tied to model names and allows for a more practical assessment of real-world LLM performance. LMArena uses these comparisons to calculate an ELO score.
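For intuition, here is a minimal sketch of how blind pairwise votes can be turned into Elo-style ratings. The update rule below is the classic Elo formula; LMArena’s actual leaderboard methodology differs in detail, and the starting rating, K-factor, and example votes are purely illustrative assumptions.

```python
from collections import defaultdict

K = 32          # update step size (illustrative choice)
BASE = 1500.0   # starting rating assigned to every model

def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_ratings(votes, ratings=None):
    """votes: iterable of (model_a, model_b, winner) tuples from blind comparisons."""
    ratings = ratings if ratings is not None else defaultdict(lambda: BASE)
    for a, b, winner in votes:
        e_a = expected(ratings[a], ratings[b])
        s_a = 1.0 if winner == a else 0.5 if winner == "tie" else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return ratings

# Three hypothetical blind votes between anonymized responses
votes = [("gpt-4.5", "claude-3.7-sonnet", "gpt-4.5"),
         ("gemini-2.0-flash", "gpt-4.5", "gpt-4.5"),
         ("claude-3.7-sonnet", "gemini-2.0-flash", "tie")]
print(dict(update_ratings(votes)))
```

The key property is that a win against a strongly rated opponent moves your score more than a win against a weak one, which is what lets thousands of noisy human votes converge into a stable ranking.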
Recently, OpenAI released GPT-4.5, a highly anticipated but controversial model. Early benchmark results, especially on reasoning tasks, were underwhelming despite its large scale. However, prominent researchers, including Andrej Karpathy, argued that GPT-4.5 feels superior in real-world use, introducing the idea of “taste” in model outputs.
This highlights a key limitation of benchmarks: they don’t always capture the subjective quality of a model’s responses, reinforcing the need for human evaluation alongside numerical scores.
Later, LMArena rankings validated these subjective impressions, with GPT-4.5 holding the top spot.
LMArena captures a broader context by relying on human judgment and subjective preferences. However, it has limitations—for advanced tasks like complex coding problems and mathematical reasoning, traditional benchmarks remain more reliable for objective evaluation.
Benchmarking models with ARC-AGI
ARC-AGI is an important benchmark for evaluating a model’s advanced reasoning capabilities. It consists of unique training data and evaluation tasks, each containing input-output examples. The puzzle-like inputs and outputs are grids in which each square can be one of ten colors, and a grid can be any height or width between 1×1 and 30×30.
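For readers who want to see what these tasks look like under the hood, here is a small sketch that parses a toy task written in the public ARC-AGI JSON layout: demonstration ("train") pairs plus held-out ("test") pairs, with each grid stored as a 2-D list of integers 0-9. The tiny grids below are made up for illustration; real tasks are larger and considerably harder.

```python
import json

# A hand-written toy task in the public ARC-AGI JSON layout. Each task holds
# demonstration ("train") pairs and held-out ("test") pairs; every grid is a
# 2-D list of integers 0-9, one integer per colored square.
task_json = """
{
  "train": [
    {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
    {"input": [[2, 2], [0, 0]], "output": [[0, 0], [2, 2]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""

task = json.loads(task_json)

for i, pair in enumerate(task["train"]):
    grid = pair["input"]
    print(f"train pair {i}: {len(grid)}x{len(grid[0])} input grid -> "
          f"{len(pair['output'])}x{len(pair['output'][0])} output grid")
```

A model (or a human) must infer the transformation rule from the few train pairs and apply it to the test input, which is exactly the kind of abstraction that text memorization does not help with.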
Most models struggle with this benchmark. While these visual puzzles are intuitive for humans, they pose significant challenges for LLMs. However, last year, OpenAI achieved a breakthrough with its o3 model—the first general model to attain a respectable performance, surpassing 75%.
Ranking Systems vs Real-life Performance
Ranking systems are a good indication of what to expect from a model initially, but they do not offer a definitive answer on how models perform in specific business scenarios. When the rubber meets the road, choosing a model for a task is often hit or miss. That is why AI consultancies, such as Vstorm, collect project-based experience to better determine which model is more suitable for a task and which might struggle.
As a rule of thumb, OpenAI’s GPT-4o is still the baseline for general tasks, with GPT-4o mini as a cheaper (and faster) alternative. If your project involves technical tasks or coding, Claude 3.7 usually performs better; it’s also an excellent option when tackling complex decisions. For tasks requiring deeper reasoning, Claude 3.7 Thinking and o3-mini are good starting points. For resource-hungry tasks, tools like Artificial Analysis make it easy to compare AI models, especially if you’re concerned about costs and response times.
If speed is critical, however, Gemini 2.0 Flash is a reliable pick; it’s even used as a default option by platforms like ElevenLabs [5]. For low-latency use cases, it’s also worth testing Groq, which utilizes specialized hardware to run LLMs from various providers. Our experience shows that these models are indeed fast, although sometimes this speed comes with a trade-off in quality.
Chinese alternatives, such as DeepSeek, rank highly in benchmarks but don’t offer significant advantages over OpenAI models when it comes to the ease of deploying them in commercial applications. Additionally, these alternatives come with various considerations that often make US-based customers hesitant to integrate them into their solutions. Therefore, we typically prefer established US providers like OpenAI, Anthropic, and Google. These companies generally offer more reliable APIs, experience less downtime, and support advanced features such as prompt caching.
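To show how the rules of thumb above can be operationalized, here is a hedged sketch of a simple model-routing table. The task categories, model identifiers, and fallback logic are our own assumptions meant as a starting point, not a definitive mapping; in practice the choice should be validated against your own evaluations, cost limits, and latency requirements.

```python
# Illustrative routing table reflecting the rules of thumb above; model names
# and categories are assumptions to adapt to your own stack and pricing data.
MODEL_PICKS = {
    "general":        {"default": "gpt-4o", "budget": "gpt-4o-mini"},
    "coding":         {"default": "claude-3-7-sonnet"},
    "deep_reasoning": {"default": "claude-3-7-sonnet-thinking", "budget": "o3-mini"},
    "low_latency":    {"default": "gemini-2.0-flash"},
}

def pick_model(task_type: str, budget_sensitive: bool = False) -> str:
    """Return a candidate model for a task type, falling back to the general baseline."""
    picks = MODEL_PICKS.get(task_type, MODEL_PICKS["general"])
    if budget_sensitive and "budget" in picks:
        return picks["budget"]
    return picks["default"]

print(pick_model("coding"))                          # claude-3-7-sonnet
print(pick_model("general", budget_sensitive=True))  # gpt-4o-mini
```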

Talk to a knowledgeable expert, not a chatbot
Bart, PhD economist and our co-founder, is ready to leverage his hands-on experience in:
- Entrepreneurial and C-level roles
- Exited and supported startups
- Executive Consulting background
to discuss your project on a 20-minute introductory call.
Dr. Bart Gonczarek
Vice President
References:
[2] https://www.anthropic.com/news/claude-3-7-sonnet
[3] https://x.com/lmarena_ai/status/1896590146465579105
