Analytic Hierarchy Process (AHP): Structured alternative to AI decisions • Michel Libera Blog

The quiet problem: we are outsourcing our decisions to language models

Large language models are no longer just tools for drafting text. They have quietly become decision-making partners. Microsoft’s Work Trend Index reported that nearly half of Copilot usage involves decision-making or activities surrounding it (weighing options, summarising trade-offs, suggesting next steps). The model doesn’t just write the email, it picks which message to send.

This shift matters because of something psychologists call automation bias. LLMs speak with fluency and apparent confidence. They produce coherent, well-structured prose that feels like the output of a thoughtful expert. That fluency hijacks our skepticism. Formally, the human still decides. In practice, the decision often collapses into whatever the model said first.

The research is starting to catch up with the intuition.
A 2025 study by Gerlich found that heavy reliance on AI assistants leads to cognitive offloading. People think less critically, get overwhelmed, and just go along with whatever the AI says.
A 2025 Nature paper found that delegating decisions to AI can shift moral norms. People act differently, and judge others differently, when an algorithm is involved.
According to Reuters, an Ipsos BVA survey from 2026 found that nearly half of young Europeans aged 11–25 turn to chatbots with personal and emotional problems.
We aren’t just offloading spreadsheet decisions. We’re offloading the human ones.

What makes this insidious is that LLMs are stepping into the parts of the decision process that precede the formal choice: selecting which information matters, interpreting it, assessing risk, generating options, and often recommending the final answer. By the time the human “decides,” the decision has effectively already been made. From the outside, the process looks unchanged: a person thought it through and made a call. But the substance has shifted.

A few patterns show how this plays out. Rubber-stamping is the most visible: the human formally approves a decision but doesn’t reconstruct the reasoning behind it, so the approval becomes a formality. Decision homogenisation is subtler: millions of people consulting similar models receive similar interpretive frames, and diversity of thought quietly collapses toward whatever the median training corpus suggested. The third, and probably the most dangerous, is the illusion of understanding: LLMs are exceptionally good at producing the feeling of having thought something through. A user can walk away convinced they reasoned carefully when in reality they imported an off-the-shelf frame.

So the interesting question isn’t should we use LLMs for decisions? That ship has sailed. The question is: how do we use them in a way that preserves human judgment instead of quietly replacing it?

What decision-making actually is

Before talking about how to use AI well, it’s worth being precise about what we’re using it for. A decision is an act of choice. It draws on information from the past and present, but its consequences always live in the future. The decision-making process typically unfolds in four phases: identifying the problem, working out possible solutions, selecting one, and putting it into practice. A decision isn’t real until it’s carried out.

Real decisions are made under constraints. Herbert Simon’s foundational work made the point that decision-makers almost never have complete information about their alternatives, and even when they do, the human mind has limited capacity to analyse, understand, and remember it all. Simon called this bounded rationality. We don’t optimise. We search until we find an option that’s good enough, then stop.

This is where decision theory enters. The discipline is built on two ideas: preferences and prospects (or options). When we say someone prefers option A over option B, we mean they judge A to be more desirable or choice-worthy.

Preference is inherently comparative. It’s a relation between options, not a property of one option in isolation.

This is easy to miss, but it’s the foundation of everything that follows. You don’t “have a preference for coffee” in the strict sense; you have a preference for coffee over tea, or over nothing. Every preference is a comparison, even when one side of the comparison is implicit. This is why decision theory builds everything from pairs of options rather than absolute scores. It’s also why, later on, pairwise comparison turns out to be such a natural building block.

A rational preference ordering is typically required to satisfy two axioms:

Completeness - for any two options, the agent can say which is at least as good, or that they’re equally good.
Transitivity - if B is at least as good as A, and C is at least as good as B, then C is at least as good as A.

Transitivity is the more interesting of the two, and it’s worth seeing why. Imagine picking a phone. You prefer B over A because of the camera. You prefer C over B because of the battery life. Transitivity says you should therefore prefer C over A. But if you end up preferring A over C, say because the screen looks nicer in direct comparison, your preferences cycle: A < B < C < A. Each choice feels reasonable on its own, but together they don’t add up to a coherent ranking.

Why is that a problem? The classical answer is the money pump argument. Suppose you own A. I offer to swap your A for B if you pay a small fee, since you prefer B. You agree. Then I offer to swap B for C for another small fee, since you prefer C. You agree. Then I offer to swap C for A for a third small fee, since you prefer A. You agree, and you’re back where you started, three fees poorer. I can repeat this indefinitely. Intransitive preferences are exploitable preferences. A coherent ranking, however small the differences, protects you from this.
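To make the argument concrete, here is a minimal sketch that detects the cycle in the phone example and simulates the pump. The function names and the fee are illustrative assumptions, not part of any library.

```python
# Strict preferences from the phone example, as (better, worse) pairs.
PREFERS = {("B", "A"), ("C", "B"), ("A", "C")}


def find_cycle(prefers):
    """Return a list of options forming a preference cycle, or None."""
    better_than = {}
    for better, worse in prefers:
        better_than.setdefault(worse, []).append(better)

    def walk(option, path):
        if option in path:
            # We came back to an option already visited: that is a cycle.
            return path[path.index(option):] + [option]
        for nxt in better_than.get(option, []):
            found = walk(nxt, path + [option])
            if found:
                return found
        return None

    for start in sorted(better_than):
        cycle = walk(start, [])
        if cycle:
            return cycle
    return None


def money_pump(prefers, start, fee, rounds):
    """Each round, swap the held option for one preferred to it and pay `fee`."""
    holding, paid = start, 0.0
    for _ in range(rounds):
        upgrades = sorted(b for b, w in prefers if w == holding)
        if not upgrades:
            break  # nothing is preferred to the current option, so the pump stops
        holding = upgrades[0]
        paid += fee
    return holding, paid


print(find_cycle(PREFERS))                            # ['A', 'B', 'C', 'A']
print(money_pump(PREFERS, "A", fee=1.0, rounds=3))    # ('A', 3.0)
```

With a transitive set of preferences the pump runs dry: once you hold the top option, there is no swap left that you would pay for.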

To work with preferences mathematically, we need a way to turn this kind of ranking into numbers. That’s what a utility function is: an assignment of numbers to options such that more preferred options get higher numbers. The function doesn’t tell you what to want, it’s a mathematical record of what you’ve already said you want.

There are two flavours of utility function, and the difference matters.

An ordinal utility function records only the order. If you prefer Paris over Berlin, and Berlin over Rome, an ordinal utility might assign Paris = 3, Berlin = 2, Rome = 1. But it could equally assign Paris = 100, Berlin = 99, Rome = 1. The ordering is preserved, so it’s the same ordinal utility for these purposes. The numbers carry no information about how much more you prefer one option to another. This is fine for picking the top of a list, but useless if you need to reason about trade-offs, risk, or expected value. You can’t meaningfully average ordinal utilities.
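A small sketch shows why averaging breaks down. Both assignments below encode the same ordering, yet they disagree about a coin flip between Paris and Rome versus Berlin for sure. The helper names are illustrative.

```python
u1 = {"Paris": 3, "Berlin": 2, "Rome": 1}
u2 = {"Paris": 100, "Berlin": 99, "Rome": 1}


def ranking(utility):
    """Options ordered from most to least preferred."""
    return sorted(utility, key=utility.get, reverse=True)


def coin_flip_value(utility, heads, tails):
    """Naive average of two outcomes, which is only meaningful for cardinal utilities."""
    return 0.5 * utility[heads] + 0.5 * utility[tails]


print(ranking(u1) == ranking(u2))  # True: identical ordinal information

# The same "average" gives contradictory advice depending on arbitrary numbers:
print(coin_flip_value(u1, "Paris", "Rome"), u1["Berlin"])  # 2.0 vs 2: indifferent
print(coin_flip_value(u2, "Paris", "Rome"), u2["Berlin"])  # 50.5 vs 99: take Berlin
```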

A cardinal utility function also records the distances. The gaps between numbers are meaningful: if Paris = 10, Berlin = 8, Rome = 2, that tells us your jump from Rome to Berlin is much larger than your jump from Berlin to Paris. The classical way to build cardinal utilities is through indifference points under uncertainty. Imagine you’re offered a coin flip: heads you get Paris, tails you get Rome. Would you take this gamble, or would you accept Berlin for sure? If you’re roughly indifferent at a coin flip, Berlin sits halfway between Rome and Paris on your scale. If you’d only take the gamble at 75% Paris / 25% Rome, Berlin sits three-quarters of the way up. By varying the probabilities until you reach indifference, we pin down exactly where Berlin lives between Rome and Paris in terms of how much you value it. Do this for every option, and you have a cardinal utility function.
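The indifference procedure reduces to one line of arithmetic: at the point of indifference, the sure option is worth exactly the expected utility of the gamble. A minimal sketch follows; the function name and the default 0 to 1 scale are assumptions made for illustration.

```python
def utility_from_indifference(p_best, u_worst=0.0, u_best=1.0):
    """Utility of a sure option, given the probability of the best outcome
    at which the agent is indifferent between the sure option and the gamble."""
    if not 0.0 <= p_best <= 1.0:
        raise ValueError("p_best must be a probability")
    # Expected utility of the gamble: p * u_best + (1 - p) * u_worst
    return u_worst + p_best * (u_best - u_worst)


print(utility_from_indifference(0.5))    # 0.5: Berlin sits halfway
print(utility_from_indifference(0.75))   # 0.75: three-quarters of the way up

# On a scale where Rome = 2 and Paris = 10, indifference at 75% puts Berlin at 8.
print(utility_from_indifference(0.75, u_worst=2.0, u_best=10.0))  # 8.0
```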

The point of this detour is simple: humans have spent decades building rigorous, mathematically grounded tools for making good decisions. These tools have known properties, known limitations, and known failure modes. When we let a language model frame our choices, we trade a method whose biases we understand for one whose biases we don’t.

Using LLMs well: appropriate reliance, not overreliance

The literature on AI decision support draws a sharp line between overreliance and appropriate reliance. Overreliance is the rubber-stamp pattern: the model decides, the human signs. Appropriate reliance is something else entirely. The model helps the human see options, surface assumptions, and stress-test reasoning, but the human stays in the driver’s seat.

Reviews of AI decision-support research converge on a few design principles for getting this right:

The system should reinforce user autonomy, not undermine it.
It should help users know when to trust the output and when to question it.
It should keep the user in the role of active decision-maker, not passive approver.

The natural conclusion: where possible, use mathematically validated methods to structure the decision itself, and use the LLM in a supporting role. The math is deterministic and inspectable. The LLM is fluent and useful, but its biases are unknown and its confidence is unjustified. Combining them well means letting each do what it’s actually good at.

AHP: a structured method for deciding in line with your own preferences

The Analytic Hierarchy Process (AHP), developed by Thomas Saaty in the 1970s, is one of the cleanest examples of this kind of structured method. Its core mechanic is pairwise comparison: instead of trying to score every option directly on every criterion, you compare two things at a time and say which you prefer, and by how much.

Pairwise comparison is, at its heart, the construction of a utility function.

That’s worth pausing on. Everything we said earlier about preferences being comparative, and about utility functions being numerical records of preference, comes back here. AHP doesn’t ask you to invent scores out of thin air, it asks you to do the one thing decision theory says preference actually is: compare two options at a time. The numerical utility comes out of those comparisons, not before them.

Concretely, AHP works in four steps, and it’s worth walking through them because the structure is doing real work.

1. Decompose the problem into a hierarchy. At the top is your goal (say: “choose a laptop”). Below it sit the criteria you care about (price, performance, battery, weight). At the bottom sit the alternatives (Laptop X, Y, Z). This step alone forces a clarity most decisions never get.

2. Pairwise-compare the criteria. For each pair, you say how much more important one is than the other, on Saaty’s 1–9 scale (1 = equally important, 3 = moderately more important, 5 = strongly more important, up to 9 = extremely more important). Is price more important than battery? Three times more? Five times? You answer this for every pair, filling in a matrix.

3. Pairwise-compare the alternatives under each criterion. Now, for each criterion separately, you compare the alternatives the same way. Under “battery,” how much better is Laptop X than Laptop Y? Under “price,” how much better is Y than Z? Each criterion gets its own comparison matrix.

4. Aggregate. AHP runs an eigenvector calculation on each matrix to extract a weight vector: the relative importance implied by your pairwise judgments. Each option’s final score is then the sum, over all criteria, of the criterion’s weight multiplied by the option’s weight under that criterion. The highest score is the recommended choice.
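The four steps can be sketched in a few dozen lines of plain Python. This is an illustration under stated assumptions, not the contents of any particular solver: the function names are made up, the judgments are invented, and the principal eigenvector is approximated by power iteration rather than a linear-algebra library call.

```python
def build_matrix(items, judgments):
    """Build a reciprocal comparison matrix.

    judgments maps (a, b) to how many times a is preferred over b (1 to 9).
    """
    index = {name: i for i, name in enumerate(items)}
    n = len(items)
    if len(judgments) != n * (n - 1) // 2:
        raise ValueError("need exactly one judgment per pair")
    matrix = [[1.0] * n for _ in range(n)]
    for (a, b), value in judgments.items():
        if not 1 <= value <= 9:
            raise ValueError("judgments must lie on the 1-9 scale")
        matrix[index[a]][index[b]] = float(value)
        matrix[index[b]][index[a]] = 1.0 / value  # the inverse is filled in automatically
    return matrix


def priority_vector(matrix, iterations=100):
    """Approximate the principal eigenvector by power iteration, normalised to sum to 1."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [x / total for x in product]
    return weights


def aggregate(criterion_weights, weights_by_criterion):
    """Score of alternative k = sum over criteria c of weight[c] * alt_weight[c][k]."""
    n_alt = len(weights_by_criterion[0])
    return [
        sum(cw * alt[k] for cw, alt in zip(criterion_weights, weights_by_criterion))
        for k in range(n_alt)
    ]


# Step 2: compare the criteria (hypothetical judgments).
criteria = ["price", "battery", "weight"]
criteria_matrix = build_matrix(
    criteria,
    {("price", "battery"): 2, ("battery", "weight"): 3, ("price", "weight"): 6},
)
criterion_weights = priority_vector(criteria_matrix)

# Step 3: compare the alternatives under each criterion.
laptops = ["X", "Y"]
per_criterion = [
    priority_vector(build_matrix(laptops, {("X", "Y"): 3})),  # price: X moderately better
    priority_vector(build_matrix(laptops, {("Y", "X"): 3})),  # battery: Y moderately better
    priority_vector(build_matrix(laptops, {("X", "Y"): 1})),  # weight: no difference
]

# Step 4: aggregate.
scores = aggregate(criterion_weights, per_criterion)
print(dict(zip(criteria, criterion_weights)))
print(dict(zip(laptops, scores)))
```

With these inputs the criterion weights come out at roughly 0.6, 0.3 and 0.1, and Laptop X scores about 0.575 against 0.425 for Y. Every number can be traced back to a judgment the user made.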

The consistency check is the most underrated step, running alongside steps 2 and 3. If your pairwise judgments are perfectly transitive, say price is 2× battery, battery is 3× weight, so price should be 6× weight, the matrix is mathematically consistent. But humans don’t reason that cleanly. You might have said price is only 4× weight, which contradicts the 6× implied by the chain. AHP computes a consistency ratio that quantifies how much your judgments deviate from perfect transitivity. Saaty’s rule of thumb is that a ratio below 0.1 is acceptable; above that, you should revisit your comparisons. The check doesn’t tell you which judgment is wrong, that’s your call, but it tells you that something in your stated preferences doesn’t add up, and gives you a chance to fix it before the final ranking comes out. This is the money pump check from earlier, made operational.
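A sketch of the check, using the price, battery and weight example above. It follows the standard definitions: the consistency index is CI = (λmax − n) / (n − 1), and the consistency ratio divides CI by Saaty’s random index for matrices of size n. The function names are illustrative, and λmax is estimated by power iteration.

```python
# Saaty's random index values for matrix sizes 1 to 10.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def principal_eigen(matrix, iterations=100):
    """Return (weights, lambda_max) for a positive matrix via power iteration."""
    n = len(matrix)
    weights = [1.0 / n] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
        total = sum(product)
        weights = [x / total for x in product]
    # Estimate lambda_max as the average of (A w)_i / w_i.
    product = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(product[i] / weights[i] for i in range(n)) / n
    return weights, lambda_max


def consistency_ratio(matrix):
    n = len(matrix)
    if n < 3:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    _, lambda_max = principal_eigen(matrix)
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Rows and columns: price, battery, weight.
consistent = [[1, 2, 6], [1 / 2, 1, 3], [1 / 6, 1 / 3, 1]]   # price is 6x weight
slightly_off = [[1, 2, 4], [1 / 2, 1, 3], [1 / 4, 1 / 3, 1]]  # price is only 4x weight
reversed_pair = [[1, 2, 1 / 5], [1 / 2, 1, 3], [5, 1 / 3, 1]]  # weight is 5x price

for name, matrix in [("consistent", consistent),
                     ("slightly off", slightly_off),
                     ("reversed pair", reversed_pair)]:
    print(name, round(consistency_ratio(matrix), 3))
```

The 4× versus 6× mismatch stays well under the 0.1 threshold, which matches the intuition that it is a small wobble. The third matrix, where weight is suddenly rated five times more important than price, lands far above it.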

The crucial property is this: AHP forces the user to articulate their own preferences, criterion by criterion, comparison by comparison. The output reflects your values, weighted by your judgments. The method doesn’t tell you what to want. It tells you what your preferences imply, and warns you when they don’t quite add up.

Compare that to asking an LLM “which option should I choose?” The LLM will produce a confident answer drawn from a frame you didn’t construct, weighted by criteria you didn’t specify, reflecting biases you can’t audit. AHP makes the same problem solvable in a way you can actually verify.

It’s also worth distinguishing AHP from heuristic approaches like brainstorming, lateral thinking, or the Delphi method. Heuristics are useful, sometimes essential, but they’re explicitly approximate. An algorithm, by contrast, is a precise recipe. AHP is an algorithm. It will give you the same answer for the same inputs, and you can inspect every step.

A Claude Skill for AHP

The natural next step is operational: package AHP as a tool the LLM can invoke, rather than something the LLM tries to do in its head. A Skill (in the Claude sense: a folder of instructions and scripts the model loads when relevant) is a clean vehicle for this. The structure is small but deliberate:

ahp-decision/
├── SKILL.md                          ← main workflow (7 steps)
├── scripts/
│   └── ahp_solver.py                 ← eigenvector, CR, aggregation, sensitivity
└── references/
    ├── saaty_scale.md                ← mapping natural language → 1–9 scale
    ├── converting_data.md            ← handling hard data (price, battery)
    └── example_walkthrough.md        ← full conversation example (laptop choice)

The full Skill is public at: github.com/michellibera/LLM-Skills

Each piece has a specific job, and the separation matters more than it might look at first.

SKILL.md is the orchestration layer: the document Claude reads when the Skill activates. It defines the seven-step workflow the model walks the user through:

1. Frame the decision. Confirm what the user is actually choosing between, and why. Push back on vague framings (“I want a better job” → “compared to what specifically?”).
2. Identify alternatives. Get a concrete, finite set of options on the table. Help the user surface candidates they may have dismissed too early, but never invent alternatives for them.
3. Identify criteria. Help the user articulate what they care about. This is where the LLM’s breadth genuinely helps, surfacing dimensions the user may have forgotten, but the final list is the user’s call.
4. Pairwise-compare the criteria. Walk through every pair, in natural language, mapping the answers onto Saaty’s scale via saaty_scale.md.
5. Pairwise-compare the alternatives under each criterion. Same mechanic, one criterion at a time. Where hard data exists (price in złoty, battery in hours), defer to converting_data.md rather than asking for subjective comparisons.
6. Compute and report. Hand the matrices to ahp_solver.py. Surface the ranking, the criterion weights, and the consistency ratio. If CR > 0.1, flag the most inconsistent judgments and offer to revisit them.
7. Sensitivity check. Show how robust the ranking is: what happens if a criterion weight shifts by 10%? If the top choice flips easily, the decision is less settled than the number suggests.
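The sensitivity check in the last step can be sketched as follows. This assumes “shifts by 10%” means a relative change of ±10% in one criterion weight, with the remaining weights rescaled so the total stays 1. That reading, and all function names, are assumptions made for illustration and may differ from what ahp_solver.py actually does.

```python
def scores(criterion_weights, weights_by_criterion):
    """Weighted sum of each alternative's per-criterion weights."""
    n_alt = len(weights_by_criterion[0])
    return [
        sum(cw * alt[k] for cw, alt in zip(criterion_weights, weights_by_criterion))
        for k in range(n_alt)
    ]


def shift_weight(weights, index, factor):
    """Scale one weight by `factor` and rescale the others so the total stays 1."""
    shifted = weights[index] * factor
    if not 0.0 < shifted < 1.0:
        raise ValueError("shifted weight must stay strictly between 0 and 1")
    scale = (1.0 - shifted) / (1.0 - weights[index])
    return [shifted if i == index else w * scale for i, w in enumerate(weights)]


def top_choice(alternatives, criterion_weights, weights_by_criterion):
    totals = scores(criterion_weights, weights_by_criterion)
    return alternatives[totals.index(max(totals))]


def find_flips(criteria, alternatives, criterion_weights, weights_by_criterion, change=0.10):
    """List every single-criterion shift that changes the top-ranked alternative."""
    baseline = top_choice(alternatives, criterion_weights, weights_by_criterion)
    flips = []
    for i, name in enumerate(criteria):
        for direction, factor in (("down", 1.0 - change), ("up", 1.0 + change)):
            shifted = shift_weight(criterion_weights, i, factor)
            winner = top_choice(alternatives, shifted, weights_by_criterion)
            if winner != baseline:
                flips.append((name, direction, winner))
    return flips


# Hypothetical result of an earlier AHP run: X leads with 0.575 vs 0.425.
criteria = ["price", "battery", "weight"]
laptops = ["X", "Y"]
criterion_weights = [0.6, 0.3, 0.1]
per_criterion = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5]]

print(find_flips(criteria, laptops, criterion_weights, per_criterion))  # []
```

An empty list means no single ±10% shift dethrones the winner. A close race looks different: with two equally weighted criteria and per-criterion weights of [0.6, 0.4] and [0.41, 0.59], the same function reports that nudging either weight in the wrong direction hands the win to Y.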

The seven steps are written as instructions to Claude, not as a script the user sees. The user just talks. The structure happens around them.

scripts/ahp_solver.py is the deterministic core, handling the math the LLM should never do in its head. It builds comparison matrices, computes eigenvectors, calculates consistency ratios, aggregates across criteria, and runs sensitivity analysis. Pulling this into a script is about auditability: a reviewer can read 80 lines of Python and verify what produced the recommendation. They cannot do the same with an LLM’s internal computation.

references/saaty_scale.md solves a translation problem. Users don’t think in Saaty numbers. They say things like “battery matters a lot more than weight, but not insanely more.” This file maps natural-language intensities onto the 1–9 scale consistently. Without it, the same phrasing could produce different numbers across sessions.
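In code terms, the idea is a fixed lookup rather than a fresh interpretation each time. The sketch below is hypothetical: the phrases are the standard verbal anchors of Saaty’s scale, not the actual contents of saaty_scale.md, and the function name is made up.

```python
# Standard verbal anchors of Saaty's scale (2, 4, 6, 8 are intermediate values).
SAATY_PHRASES = {
    "equally important": 1,
    "moderately more important": 3,
    "strongly more important": 5,
    "very strongly more important": 7,
    "extremely more important": 9,
}


def to_saaty(phrase):
    """Map a verbal intensity onto the 1-9 scale, the same way every session."""
    key = " ".join(phrase.lower().split())  # normalise case and whitespace
    if key not in SAATY_PHRASES:
        # Refuse to guess: an unmapped phrase should go back to the user.
        raise KeyError(f"no mapping for {phrase!r}; ask the user to rephrase")
    return SAATY_PHRASES[key]


print(to_saaty("Strongly more important"))  # 5
```

The point of the explicit failure is that ambiguous phrasing triggers a clarifying question instead of a silently invented number.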

references/converting_data.md handles criteria with objective measurements. If you’re comparing laptops on price, you have actual numbers and no need for subjective comparisons. This file converts hard data into pairwise ratios directly, falling back to elicited comparisons only for genuinely subjective criteria.
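One simple way to do this conversion is to take ratios of the measurements directly, inverting them when lower is better. The sketch below assumes that approach; the function names and prices are invented, and the real converting_data.md may well use a different rule (for instance, capping ratios at 9 to stay on Saaty’s scale).

```python
def desirability(values, lower_is_better=False):
    """Turn raw measurements into 'more is better' quantities."""
    if any(v <= 0 for v in values):
        raise ValueError("measurements must be positive")
    return [1.0 / v if lower_is_better else float(v) for v in values]


def ratio_matrix(values, lower_is_better=False):
    """Pairwise matrix whose entry [i][j] says how many times better i is than j."""
    d = desirability(values, lower_is_better)
    return [[a / b for b in d] for a in d]


def direct_weights(values, lower_is_better=False):
    """A ratio matrix is perfectly consistent, so its priority vector
    is just the normalised desirabilities."""
    d = desirability(values, lower_is_better)
    total = sum(d)
    return [x / total for x in d]


prices = [4000, 5000, 8000]   # hypothetical prices for laptops X, Y, Z
battery_hours = [10, 5, 8]    # hypothetical battery life

print(ratio_matrix(prices, lower_is_better=True)[0][2])  # X is 2x better than Z on price
print(direct_weights(battery_hours))
```

Because every entry comes from the same underlying numbers, such a matrix needs no consistency check. The user is only asked for judgments where no measurement exists.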

references/example_walkthrough.md is a full worked conversation, a user choosing between three laptops from framing through final ranking. It shows how the conversation should feel: where to slow down, where to push back, how to phrase questions without sounding like an interrogation.

The division of labour across these files mirrors the division of labour across the whole approach: the LLM handles language, exploration, and clarification; the math handles the decision; the references handle translation between the two; the user supplies the preferences. No part of the actual choice gets quietly absorbed into the model’s latent priors.

The full Skill is available on GitHub: github.com/michellibera/LLM-Skills.

A web app

A chat interface is fine for exploring the idea, but pairwise comparison is fundamentally a UI problem. Asking “how much do you prefer A to B?” twenty times in a conversation is tedious. Doing it with sliders, a visible matrix, and a live-updating consistency indicator is much better.

A companion web app offers:

Visual matrix entry - sliders or 1–9 scale buttons for each pairwise comparison, with the inverse populated automatically.
Live consistency feedback - a visible indicator that updates as the user enters comparisons, highlighting which judgments are creating the most inconsistency.
Sensitivity analysis - show how the final ranking shifts if a single criterion’s weight changes. This is where users often discover that their decision is more robust (or more fragile) than they thought.

Web application available at: BestDecision

Where this approach falls short

AHP is not a silver bullet, and pretending otherwise would repeat the exact error this article is arguing against.

Criterion selection is still subjective. AHP weights the criteria you give it. If you forget a relevant dimension, no amount of mathematical rigour will surface it. This is precisely where an LLM’s range can help, but also precisely where its blind spots can hide.
The 1–9 scale is a modelling assumption. Saaty’s scale is plausible but not the only choice, and small changes in the scale can shift outcomes.
Rank reversal. Adding or removing an alternative can sometimes change the ranking of the remaining options. There are AHP variants that address this, but it remains a real critique.
Many decisions aren’t decomposable. Emotional, creative, or deeply contextual choices may resist the kind of clean hierarchical structuring AHP requires. Forcing them into the framework can give a false sense of rigour.
Garbage in, garbage out. If the user’s pairwise judgments are themselves shaped by an LLM’s framing, AHP just launders that influence through a mathematical filter.

Alternatives worth knowing about include MAUT (multi-attribute utility theory), TOPSIS (ranking by distance from an ideal solution), ELECTRE and PROMETHEE (outranking methods), and Bayesian decision analysis for problems where uncertainty dominates. Each has its niche. AHP’s main appeal is that it is simple enough to actually use and rigorous enough to actually trust.

Where this could go

A few directions feel genuinely promising:

Group AHP. Aggregating preferences across multiple stakeholders, with the method exposing where disagreement is concentrated rather than hiding it under an average.
Decision archives. Storing past AHP analyses so users can see how their preferences and frames have shifted over time. This is the kind of metacognition LLMs alone don’t encourage.
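For the Group AHP direction, a common aggregation rule is the element-wise geometric mean of the individual matrices, which keeps the combined matrix reciprocal. The sketch below assumes that rule and adds a simple spread measure to show where stakeholders disagree most. The names and judgments are illustrative only.

```python
import math


def aggregate_judgments(matrices):
    """Element-wise geometric mean of several reciprocal comparison matrices."""
    count = len(matrices)
    n = len(matrices[0])
    return [
        [math.prod(m[i][j] for m in matrices) ** (1.0 / count) for j in range(n)]
        for i in range(n)
    ]


def disagreement(matrices, items):
    """Pairs sorted by spread: largest judgment divided by smallest judgment."""
    n = len(items)
    spreads = []
    for i in range(n):
        for j in range(i + 1, n):
            values = [m[i][j] for m in matrices]
            spreads.append(((items[i], items[j]), max(values) / min(values)))
    return sorted(spreads, key=lambda pair: pair[1], reverse=True)


items = ["price", "battery", "weight"]
# Two hypothetical stakeholders who agree on everything except price vs battery.
ana = [[1, 4, 5], [1 / 4, 1, 2], [1 / 5, 1 / 2, 1]]
ben = [[1, 1 / 4, 5], [4, 1, 2], [1 / 5, 1 / 2, 1]]

group = aggregate_judgments([ana, ben])
print(group[0][1])                          # the two opposite views average out to 1
print(disagreement([ana, ben], items)[0])   # (('price', 'battery'), 16.0)
```

The averaged matrix alone would suggest the group sees price and battery as equally important. The spread shows that nobody in the group actually holds that view, which is exactly the kind of disagreement worth surfacing.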

Closing thoughts

The genuine risk of LLMs in decision-making isn’t that they’ll give us bad answers. It’s that they’ll give us plausible answers, fluently, and we’ll forget that an answer plausibly framed is not the same as a decision properly made.

The way out isn’t to refuse the tool. It’s to use it in the place it actually belongs, as a fluent collaborator in framing, exploring, and stress-testing, while keeping the act of choosing inside a structure we can inspect. AHP isn’t the only such structure, but it’s a good one: simple enough to use, rigorous enough to trust, and transparent enough that the decision still belongs to the person making it.

A good decision-support system should leave you understanding your own preferences better than you did before you started. If you walk away from a tool feeling that it decided, the tool failed. No matter how good the recommendation was.