Every product roadmap in 2026 has an AI column. Most of those entries should not exist. The reflexive answer to a hard problem has become “add a model,” and the reflex is producing systems that are slower, more expensive, less explainable, and less reliable than the deterministic alternatives they replaced. This piece is the framework for saying no, and for proposing the boring solution that actually ships.
The argument is not that AI is overhyped. It is that AI is a specific tool with a specific cost structure and a specific failure mode, and applying it to problems that do not need it is the same category of error as using a database for a configuration file. The discipline is in matching tool to problem, and the framework below is six tests for whether a feature should use AI at all.
Test One: Is Deterministic Logic Available?
If the problem can be expressed as a finite set of rules, use a rules engine. Tax calculation, eligibility scoring against a published policy, fraud detection against a regulator-approved rule set, content moderation against an explicit category list: all of these are better served by deterministic systems than by language models. The rules are auditable, the failure modes are enumerable, and the cost per evaluation is measured in microseconds.
The temptation is to use a model because the rules are tedious to write. That tedium is the work. A rule that takes a week to specify and review will run for a decade with predictable behavior. A model that takes a week to prompt-engineer will drift, require evaluation infrastructure, and surface novel failure modes every time the underlying weights change. Choose tedium over surprise.
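As a minimal sketch of the hand-rolled end of this spectrum, the Python below evaluates a small decision table for a made-up eligibility policy. The `Applicant` fields, rule names, and thresholds are all illustrative assumptions, not taken from any real policy. The point is only that every outcome traces back to a named rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Applicant:
    # Hypothetical fields, for illustration only.
    age: int
    annual_income: int
    resident: bool


# Each rule is (rule_id, predicate). A rule fires when its predicate is True,
# and a fired rule makes the applicant ineligible for that stated reason.
Rule = Tuple[str, Callable[[Applicant], bool]]

RULES: List[Rule] = [
    ("R1_under_minimum_age", lambda a: a.age < 18),
    ("R2_non_resident", lambda a: not a.resident),
    ("R3_income_above_cap", lambda a: a.annual_income > 60_000),
]


def evaluate(applicant: Applicant) -> Tuple[bool, List[str]]:
    """Return (eligible, fired_rule_ids); the same input always gives the same output."""
    fired = [rule_id for rule_id, predicate in RULES if predicate(applicant)]
    return (len(fired) == 0, fired)
```

Calling `evaluate(Applicant(age=17, annual_income=70_000, resident=True))` returns `(False, ["R1_under_minimum_age", "R3_income_above_cap"])`. The list of fired rule ids doubles as the audit trail.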
Test Two: Is the Tolerance for Ambiguity Low?
Language models trade precision for flexibility. They produce plausible answers across a wide input distribution, at the cost of occasional confident errors on inputs that look ordinary. In domains where a wrong answer carries asymmetric cost, that tradeoff inverts. A search system that returns the wrong document is recoverable. A pricing engine that returns the wrong price is a P&L event.
The test is whether the system can tolerate an answer that is wrong in a way the user cannot detect. Calculators cannot. Compliance systems cannot. Medical dosing cannot. For these, the deterministic implementation is not just safer; it is the only defensible choice in an incident review.
Test Three: Is There a Regulatory Constraint?
Several domains have explicit or de facto bans on AI as the system of record. Software that produces a medical diagnosis without physician sign-off is regulated as a medical device by the FDA in the US and under the MDR in the EU. Legal advice without attorney review crosses unauthorized-practice-of-law lines in most jurisdictions. Credit decisions in the US are subject to ECOA’s adverse action notice requirements, which demand a specific, individualized reason that a black-box model cannot reliably produce. The EU AI Act assigns high-risk classification to AI systems used in employment, credit scoring, education, law enforcement, and migration, with conformity assessment requirements that most teams have not budgeted for.
If your feature falls into a regulated category, the question is not whether to use AI but whether you can afford the compliance overhead of using AI. Often the answer is no, and the right move is a human-in-the-loop workflow with the AI assisting rather than deciding, or a deterministic system with no AI at all.
Test Four: Are the Stakes High and Is Explainability Low?
High-stakes decisions need explanations. Explanations from neural networks are post-hoc, approximate, and often wrong about the actual computation. SHAP values, attention visualizations, and chain-of-thought traces all produce something that looks like an explanation, but none of them give you the kind of explanation a regulator, a court, or a customer support team actually needs.
The test is whether the organization can defend a decision in a written complaint response. If the answer requires saying “the model decided this based on patterns in the training data,” the system is not deployable in a high-stakes context. Use a simpler model whose decision boundary you can describe in a paragraph, or a rules engine whose logic is the explanation.
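One way to picture a model whose logic is the explanation is a points-based scorecard. The sketch below is hypothetical throughout: the features, point values, and the `APPROVE_AT` cutoff are invented for illustration. It makes no claim about what any regulator accepts. It only shows that the reasons for a decision can fall directly out of the scoring arithmetic.

```python
from typing import Dict, List, Tuple

# Hypothetical points per feature value. In practice these might come from a
# fitted logistic regression, rounded and reviewed by a human.
POINTS: Dict[str, Dict[str, int]] = {
    "payment_history": {"clean": 40, "one_late": 15, "multiple_late": 0},
    "utilization": {"low": 30, "medium": 15, "high": 0},
    "account_age": {"over_5y": 30, "1_to_5y": 15, "under_1y": 0},
}
APPROVE_AT = 70  # Assumed cutoff, for illustration only.


def decide(applicant: Dict[str, str]) -> Tuple[str, int, List[str]]:
    """Return (decision, score, reasons).

    Reasons are the features that cost points relative to the best possible
    value for that feature, ordered by points lost, largest first.
    """
    score = 0
    shortfalls: List[Tuple[int, str]] = []
    for feature, table in POINTS.items():
        earned = table[applicant[feature]]
        score += earned
        lost = max(table.values()) - earned
        if lost > 0:
            shortfalls.append((lost, f"{feature}={applicant[feature]}"))
    # Sort by points lost (descending), then by name for a stable order.
    shortfalls.sort(key=lambda item: (-item[0], item[1]))
    decision = "approve" if score >= APPROVE_AT else "decline"
    return decision, score, [reason for _, reason in shortfalls]
```

An applicant with multiple late payments, medium utilization, and an account older than five years scores 45 and is declined, with `payment_history=multiple_late` listed ahead of `utilization=medium`. A written complaint response can quote the table itself.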
Test Five: Is There a Real-Time Latency Budget?
Frontier model inference takes hundreds of milliseconds at minimum, often seconds. Even with smaller models served on dedicated infrastructure, you are looking at tens of milliseconds for a single inference, plus tail latency that is materially worse than that of a database query. For interactive UIs with a 100-millisecond budget, for high-frequency trading, for ad bidding, for game server tick loops, the latency budget excludes language models from the request path entirely.
The pattern that works is precomputation: run the model offline, store the results, and serve from a key-value store at request time. This is how production search and recommendation systems use embeddings. It is also how most successful AI features in latency-sensitive products are structured. If precomputation is not possible because the input space is unbounded, the latency constraint is telling you to use a different approach entirely.
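The precomputation pattern can be sketched in a few lines. In this illustration a plain dictionary stands in for the key-value store and `fake_model` stands in for an expensive model call. Both are assumptions made to keep the example self-contained, and a real system would swap in its own store and model client.

```python
from typing import Callable, Dict, Iterable, List


def precompute(
    item_ids: Iterable[str],
    slow_model: Callable[[str], str],
) -> Dict[str, str]:
    """Offline batch job: call the slow model once per known input."""
    return {item_id: slow_model(item_id) for item_id in item_ids}


def serve(store: Dict[str, str], item_id: str, fallback: str = "") -> str:
    """Request path: a lookup only. The model is never called here."""
    return store.get(item_id, fallback)


calls: List[str] = []


def fake_model(item_id: str) -> str:
    # Stand-in for an expensive model call; records each invocation.
    calls.append(item_id)
    return f"summary-of-{item_id}"


# The batch job runs ahead of time; requests then read from the store.
store = precompute(["doc-1", "doc-2"], fake_model)
```

After the batch job, `serve(store, "doc-1")` returns the stored result without touching the model, and an unknown key such as `"doc-9"` returns the fallback. That fallback path is where an unbounded input space shows up: if most requests miss the store, precomputation is not the right fit.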
Test Six: Is the Data Quality Sufficient?
Garbage in, garbage out: models trained or grounded on bad data produce bad outputs. The aphorism is correct, and the corollary is the test: if your data is inconsistent, incomplete, or untrusted, fix the data first. A retrieval-augmented system on a corpus full of contradictions, outdated documents, and misclassified records will produce contradictory, outdated, and misclassified answers, and the AI layer will obscure the root cause.
The pragmatic move is often to invest the quarter in data cleanup, governance, and search infrastructure before the AI feature. The cleaned data improves every downstream system, AI or not. The AI feature, deferred, then has a chance to succeed instead of becoming the visible failure mode of an underlying data problem.
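A first pass at that cleanup can be a plain audit script rather than anything model-based. The sketch below assumes each document has already been reduced to a hypothetical `Doc` record with a topic, a normalized claim, and a last-updated date. Producing those fields is the hard part in practice and is not shown here.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, NamedTuple, Set


class Doc(NamedTuple):
    doc_id: str
    topic: str     # what the document is about, e.g. a policy name
    claim: str     # the normalized fact the document asserts
    updated: date  # when the document was last revised


def audit(docs: List[Doc], stale_before: date) -> Dict[str, List[str]]:
    """Flag stale documents and topics on which documents disagree."""
    claims_by_topic: Dict[str, Set[str]] = defaultdict(set)
    for doc in docs:
        claims_by_topic[doc.topic].add(doc.claim)
    return {
        "stale": sorted(d.doc_id for d in docs if d.updated < stale_before),
        "conflicting_topics": sorted(
            topic for topic, claims in claims_by_topic.items() if len(claims) > 1
        ),
    }
```

Given two documents on the same topic that state different refund windows, one of them last updated before the cutoff, the audit reports that document as stale and the topic as conflicting. Both findings are worth resolving before any retrieval layer is built on the corpus.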
The Cheaper Tools You Should Use Instead
When the framework rules out AI, the alternatives are usually older, smaller, and faster. They are also better understood, easier to staff, and more durable.
- Rules engines. Drools, OpenL Tablets, or a hand-rolled decision table. Auditable, fast, deterministic. The right answer for compliance, eligibility, pricing, and policy enforcement.
- Search. Elasticsearch, OpenSearch, Meilisearch, or Typesense with proper analyzers and tuning. The right answer for “find the document” problems before considering RAG.
- Classical machine learning. Gradient-boosted trees with XGBoost or LightGBM, logistic regression with interpretable coefficients. Train in minutes, explain to a regulator, deploy in a serving framework that costs cents per million predictions.
- Heuristics with a feedback loop. A scoring formula reviewed quarterly with engineering and product. Beats the model on most ranking and prioritization problems for the first eighteen months of a product’s life.
- Human-in-the-loop workflows. A queue, a UI, and trained reviewers. Slower per item, but produces clean labels that can train a future model when the volume justifies it.
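To make the heuristics option in the list above concrete, here is a minimal sketch of a scoring formula for prioritizing support tickets. The signal names and weights are hypothetical. What matters is that the weights live in one reviewable place, so the quarterly review amounts to a small diff.

```python
from typing import Dict, List

# Hypothetical weights, kept together so a quarterly review can change them.
WEIGHTS: Dict[str, float] = {
    "customer_tier": 3.0,
    "hours_open": 0.5,
    "is_outage": 10.0,
}


def score(ticket: Dict[str, float]) -> float:
    """Weighted sum of the ticket's signals; missing signals count as zero."""
    return sum(weight * ticket.get(name, 0.0) for name, weight in WEIGHTS.items())


def prioritize(tickets: Dict[str, Dict[str, float]]) -> List[str]:
    """Return ticket ids ordered from highest to lowest score."""
    return sorted(tickets, key=lambda tid: score(tickets[tid]), reverse=True)
```

A tier-2 customer reporting an outage scores 16.0 and sorts ahead of a tier-1 ticket that has been open for four hours, which scores 5.0. Anyone on the team can reproduce either number by hand.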
Where AI Is Actually the Right Answer
The tests above are exclusionary, not dismissive. AI earns its place in problems with high ambiguity tolerance, unbounded input distributions, soft failure modes, and explainability requirements that the system can satisfy with citation and human review. Summarization of internal documents, drafting assistance for human writers, code completion, semantic search across heterogeneous corpora, conversational interfaces over structured APIs: these play to the actual strengths of language models. The framework is meant to surface those cases by eliminating the ones where the cheaper tool is correct.
A useful internal exercise is to take the current AI roadmap and run each item through the six tests as a written review. A feature passes a test when that test gives no reason to rule AI out, and fails it otherwise. The features that pass every test are the ones to fund first; they are the cases where AI provides leverage that no other tool can match. The features that fail one test are candidates for redesign, often by narrowing the scope so that the AI handles only the genuinely ambiguous portion while a deterministic system handles the rest. The features that fail two or more tests are candidates for cancellation, with the saved budget redirected to the deterministic alternatives that will actually ship.
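The triage rule in that exercise is simple enough to write down. The sketch below assumes the written review has been reduced to one boolean per test, where `True` means the feature failed that test. The test identifiers are shorthand invented here for the six section headings.

```python
from typing import Dict, List

# Shorthand names for the six tests; True in a review means "failed this test".
TESTS: List[str] = [
    "deterministic_logic_available",
    "low_ambiguity_tolerance",
    "regulatory_constraint",
    "high_stakes_low_explainability",
    "real_time_latency_budget",
    "insufficient_data_quality",
]


def triage(failed: Dict[str, bool]) -> str:
    """Map a feature's review to 'fund', 'redesign', or 'cancel'."""
    if set(failed) != set(TESTS):
        raise ValueError("review must answer all six tests, and only those")
    failures = sum(1 for name in TESTS if failed[name])
    if failures == 0:
        return "fund"
    if failures == 1:
        return "redesign"
    return "cancel"
```

Requiring an answer for every test is a deliberate choice in this sketch: an unanswered test raises an error instead of silently counting as a pass.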
Recommendation
Run every proposed AI feature through the six tests before approving the budget. If the feature fails any test, propose the deterministic alternative and quantify the difference in cost, latency, and reliability. Reserve AI for problems where the alternatives genuinely cannot work, and require an explicit justification that names which alternatives were rejected and why. The discipline pays off in lower run costs, faster systems, and an engineering culture that ships solutions instead of demos.
When This Framework Applies, and When It Does Not
This framework applies to production engineering decisions in regulated, latency-sensitive, or high-stakes contexts. It is the wrong frame for research, for early-stage product exploration where the goal is to learn what is possible, or for internal tools where the worst case is a developer ignoring a bad suggestion. In those contexts, the cost of trying an AI approach is low and the upside is real. The framework is for the moment when the prototype is about to become the system of record, and the question shifts from “can we” to “should we.”