An AI Engineer’s Guide to LLM Selection
As an AI engineer working closely with AI agents and agentic frameworks, I can confidently say that demand for AI agents is rising rapidly across industries. But despite all this progress, one very simple challenge keeps coming up again and again.
In the talks I’ve given and the teams I’ve worked with, the most common question I get isn’t about complex agent architectures or exotic use cases. It’s about how to decide which LLM to use.
People often get lost in endless catalogues, confusing benchmarks, and marketing noise. The result is bloated, inefficient, and unnecessarily expensive systems. That’s why I put together a simple guide to help developers, product managers, and even curious newcomers choose the right type of model for the job. Here’s my three-tier framework for approaching LLM selection.
Tier 1: The Titans (For Complex Reasoning & Deep Attention)
These are the largest, most powerful, and most expensive models available. Think of flagships like Gemini 2.5 Pro, GPT-5, or Llama 4.
Their primary strength is not just information recall; it's their profound capacity for multi-step reasoning, understanding deep ambiguity, and maintaining a coherent "thought" process over long and complex contexts. You should only use these most advanced models when the task’s success is critically dependent on reasoning and attention.
Here are two example use cases to make the concept concrete:
- Use Case 1: Strategic Financial Analyst. An agent designed to read three 100-page quarterly earnings reports, cross-reference them with current market news, and produce a novel, multi-paragraph analysis of the company's future risks and opportunities. This requires synthesizing vast, disparate data and generating a unique insight, not just a summary.
- Use Case 2: Autonomous Codebase Refactor. A developer agent given an entire legacy software repository and the goal: "Refactor this module to be more efficient and adhere to our new coding standards." The model must understand the logical flow of thousands of lines of code, identify dependencies, and rewrite components without breaking the system.
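Before handing a Titan a task like the earnings-report analysis above, it is worth checking that the material actually fits in the model’s context window. The sketch below is a rough budgeting heuristic, not a real tokenizer: the 4-characters-per-token ratio and the 1M-token window are illustrative assumptions (use your provider’s tokenizer and documented limits in practice).

```python
# Rough token-budget check before sending a long-context task to a Tier 1
# model. The 4-chars-per-token ratio and the 1M-token window are assumed
# round numbers, not measured values for any specific model.

def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English prose."""
    return len(text) // 4

def fits_in_context(documents: list[str], context_window: int = 1_000_000,
                    reserve_for_output: int = 8_000) -> bool:
    """Check whether the combined documents leave room for the model's reply."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve_for_output <= context_window

# Three ~60k-word quarterly reports comfortably fit a 1M-token window.
reports = ["word " * 60_000] * 3
print(fits_in_context(reports))  # True
```

If the check fails, that is usually a signal to summarize or chunk upstream rather than to reach for an even bigger model.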
Tier 2: The Specialists (For Conditional Logic & Applied Knowledge)
These are the high-performing "workhorse" models. They are fast, highly capable, and represent the perfect balance of intelligence and cost. This category includes models like Gemini 2.5 Flash, GPT-4o, Llama 3.2, and Mistral Large.
These models still possess excellent reasoning, but they shine in tasks that are bounded, follow clear conditions, and rely on a vast well of common knowledge. They are ideal for most applications.
Here are two example use cases to make the concept concrete:
- Use Case 1: Advanced Customer Support Agent. A support bot that goes far beyond a simple FAQ. It must understand a customer's frustrated email, perform a multi-step action (e.g., "Check order status," "Verify warranty," "Check inventory"), and then, based on company policy (conditional logic), decide whether to issue a refund, send a replacement, or escalate to a human.
- Use Case 2: Dynamic Content Generation. A marketing agent that generates 10 different versions of a social media ad campaign. It is given a product description and a target audience (e.g., "Gen Z, interested in sustainability"). The model must creatively apply its common knowledge of that demographic while strictly adhering to the brand's tone of voice.
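The support-agent use case hinges on the model following explicit conditional logic rather than improvising. One way to make that reliable is to keep the policy itself in code and let the Tier 2 model only extract the structured fields. The sketch below stubs out the LLM extraction step so the policy logic is runnable; the field names and thresholds (e.g. the $100 refund limit) are illustrative assumptions, not any company's real policy.

```python
# A sketch of the conditional-logic layer behind a Tier 2 support agent.
# In production, `Ticket` would be filled by an LLM call that turns the
# customer's email into structured fields; here it is constructed directly
# so the policy logic itself can run. Thresholds are made-up examples.

from dataclasses import dataclass

@dataclass
class Ticket:
    order_delivered: bool
    under_warranty: bool
    item_in_stock: bool
    order_value: float

def decide_action(t: Ticket, refund_limit: float = 100.0) -> str:
    """Company policy as explicit conditionals the agent must follow."""
    if not t.order_delivered:
        return "escalate"                    # shipping issue -> human review
    if t.under_warranty and t.item_in_stock:
        return "send_replacement"
    if t.under_warranty and t.order_value <= refund_limit:
        return "issue_refund"
    return "escalate"

print(decide_action(Ticket(True, True, False, 49.99)))  # issue_refund
```

Keeping the policy in code means the model's job shrinks to reading the email, which is exactly the bounded, well-conditioned task Tier 2 models excel at.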
Tier 3: The Sprinters (For High-Speed, Specific Tasks)
This is the most overlooked and, in many ways, the most important category. These are the small, fast, and highly efficient models. This category includes models like GPT-4o mini, Google’s smallest Gemma 3 models (e.g., the 270M variant), or any other SLM (Small Language Model) with a smaller context window.
You use a Sprinter when reasoning is not required. These models are built for speed, low cost, and a small footprint (many can even run on-device). They are perfect for simple, high-volume generation and classification tasks.
Here are two example use cases to make the concept concrete:
- Use Case 1: Email and Data Tagger. A simple function that reads an incoming email and applies a category tag: "Sales Lead," "Support Request," or "Spam." The model doesn't need to understand the nuance of the email; it just needs to perform basic pattern recognition. Using a Titan here would be like using a supercomputer to power a calculator.
- Use Case 2: On-Device Text-to-JSON. A mobile app feature that takes a user's unstructured note (e.g., "meeting with marketing tmrw at 10 to discuss new project") and converts it into a structured JSON object, `{"event": "Meeting", "team": "Marketing", "date": "tomorrow", "time": "10:00"}`, to populate a calendar. This needs to be instantaneous and work without an internet connection.
Parallels with Traditional Machine Learning
If you have a background in classic machine learning, this framework might sound familiar. The logic is the same.
Tier 1 (Titans) are like Deep Neural Networks. You use them for immensely complex, high-stakes problems where accuracy is the only priority, and you're willing to pay the high computational cost.
Tier 2 (Specialists) are like a Random Forest or Gradient Boosting Machine. They are powerful, versatile, and the go-to default for a huge range of structured problems. They provide a fantastic balance of performance and efficiency.
Tier 3 (Sprinters) are like Linear or Logistic Regression. They are incredibly fast, cheap, and easy to interpret. For a simple, well-defined problem, they are not just a "good" choice; they are the correct choice.
Putting It All Together: The Multi-Agent Team
In this section, I will share an example project where all three tiers work together to solve a problem. You don't have to design a multi-agent team yourself to get the most out of these tiers, but a concrete example always makes the concepts easier to grasp.
Imagine a system that runs an online store.
- The Front Desk (Tier 3): A Phi-3.5 Mini model acts as the initial "router." It reads all 10,000 incoming customer emails per day. 90% are simple requests like "Where is my order?" The Sprinter answers these instantly by calling a shipping API. When it detects a complex, angry, or unusual complaint, it forwards the ticket to a specialist.
- The Service Manager (Tier 2): A Gemini 2.5 Flash model receives the escalated ticket. Its job is to solve the problem. It uses Retrieval-Augmented Generation (RAG) to pull the customer's order history and company policy. It understands the customer's nuance and follows the conditional logic to authorize a full refund and offer a 10% discount coupon.
- The Strategist (Tier 1): A Gemini 2.5 Pro model activates only once per week. It is not customer-facing. Its job is to read all the logs from the Tier 2 Service Manager, along with sales data and supplier reports. It performs deep, complex reasoning to identify systemic problems, concluding: "Products from Supplier_X in the Northeast region have a 45% higher complaint rate for 'broken on arrival.' This correlates with our new shipping partner in that zone. I recommend we renegotiate the packing standards or switch partners."
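The escalation flow above can be sketched in a few lines. The model calls are stubbed out (`sprinter_classify` would be an on-device SLM call, and the Tier 2 handoff would invoke a mid-tier API); the function names and trigger phrases are my own illustrative choices, not any provider's SDK. The point is the routing structure, not the stubs.

```python
# A minimal sketch of the three-tier routing described above. Both model
# calls are stubbed: `sprinter_classify` stands in for a Tier 3 SLM, and
# the escalation return value stands in for a Tier 2 API call.

def sprinter_classify(email: str) -> str:
    """Tier 3 stub: cheap triage. Real version: one SLM call per email."""
    simple = ("where is my order", "order status", "tracking")
    return "simple" if any(p in email.lower() for p in simple) else "complex"

def handle_email(email: str) -> str:
    """Route each incoming email to the cheapest tier that can handle it."""
    if sprinter_classify(email) == "simple":
        return "answered_by_tier3"   # e.g. shipping-API lookup + template
    return "escalated_to_tier2"      # the Service Manager picks up the ticket

# Tier 1 never sees individual tickets; it runs weekly over Tier 2's logs.
print(handle_email("Where is my order #1234?"))           # answered_by_tier3
print(handle_email("Third broken unit in a row. Angry."))  # escalated_to_tier2
```

The economics follow directly from the structure: the expensive models only ever see the small fraction of traffic the cheap one cannot handle.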
Conclusion
Bigger doesn’t always mean better. The most capable systems of the future won’t rely on one giant model trying to do everything. Instead, they’ll work more like well-organized teams, with different specialists handling different jobs, each doing what they do best.
Berke Pağnıklı - AI Engineer at Mamentis
