Reference-Class Forecasting in Plain English

In 2008 a small group of academics and practitioners began running a quiet experiment on infrastructure projects in the United Kingdom. They took the planners' bottoms-up estimate for each project — the timeline, the budget, the projected benefits — and laid it next to the historical distribution of how comparable projects had actually performed.

The bottoms-up estimates were, almost without exception, optimistic. Sometimes by a factor of two. The historical distribution was a better predictor of the new project's outcome than any of the project-specific analyses the team had done. The method now sits in the UK Treasury's Green Book as official guidance for large public projects. Bent Flyvbjerg, the academic most associated with it, has spent the last twenty years documenting the pattern across hundreds of megaprojects in dozens of countries.

The method is called reference-class forecasting. It is, in our judgment, one of the three or four most useful ideas in applied decision-making, and it is one of the least practiced in private-sector business. This essay is an attempt to explain it the way a CFO would actually need to use it on a Monday morning, without the academic vocabulary and without the megaproject framing that makes most executives assume it doesn't apply to them.

The inside view and the outside view

Daniel Kahneman and Dan Lovallo's 1993 paper "Timid Choices and Bold Forecasts" draws the distinction that does most of the work here. Every forecast can be made in one of two ways.

The inside view is what teams do by default. They look at the specific project — the scope, the resources, the team, the obstacles they can name — and they build a forecast from the inside out. The product manager estimates the engineering tasks. The engineering lead estimates the integrations. The PM rolls them up. The estimate is built from the details of this project. It feels rigorous, because it is grounded in specifics.

The outside view is the opposite. Instead of looking at the project's interior, you look at the project's exterior — the historical class of projects that resemble this one — and you ask what that distribution says about the likely outcome. You ignore, deliberately, what is special about this project. You treat it as a sample drawn from a population, and you ask what the population says.

The inside view feels like the correct way to forecast. It is the way every team naturally wants to do it. It is also, on average, badly wrong, because every team underestimates the things they cannot yet see — the dependencies that haven't surfaced, the people who haven't quit yet, the integrations that haven't been scoped — and the inside view contains no mechanism to correct for what it doesn't yet know.

The outside view contains exactly such a mechanism. The historical distribution already contains, in its dispersion, all the things that went wrong in past projects of this type. You do not need to enumerate them. You only need to acknowledge that this project is, until proven otherwise, a member of the same class.

A concrete example

Let us make this concrete. We worked with a software organization rolling out a new internal platform. The team — competent, senior, well-resourced — estimated six months from kickoff to full rollout. The estimate was bottoms-up. Each workstream had a Gantt. Each Gantt rolled up to the program plan. The number was six months.

We asked them to do the outside view. The company had run, in the previous five years, eleven internal platform rollouts of comparable scope. The actual durations were: 9, 11, 14, 8, 19, 12, 16, 10, 22, 13, and 14 months. The median was thirteen. The mean was about thirteen and a half. Not a single one of the eleven had come in under eight months. None had come in at the new project's estimate of six.
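The arithmetic is small enough to do by hand, but a minimal Python sketch makes it repeatable for any reference class. The durations are the eleven quoted above; the variable names are our own, chosen for illustration.

```python
from statistics import mean, median

# Actual durations, in months, of the eleven past rollouts quoted above.
durations = [9, 11, 14, 8, 19, 12, 16, 10, 22, 13, 14]
estimate = 6  # the team's bottoms-up estimate, in months

class_median = median(durations)
class_mean = mean(durations)
fastest = min(durations)

# Share of past rollouts that finished at or under the new estimate.
share_within_estimate = sum(d <= estimate for d in durations) / len(durations)

print(f"median: {class_median} months")
print(f"mean: {class_mean:.1f} months")
print(f"fastest: {fastest} months")
print(f"share at or under {estimate} months: {share_within_estimate:.0%}")
```

The last figure is the one worth showing the room: it is the fraction of the company's own history that supports the estimate on the table.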

The team's reaction was the universal one. But this one is different. The platform was newer. The team was stronger. The architecture was cleaner. The dependencies were better understood. Each of these claims was, in isolation, true. Each of them was almost certainly also true of at least one of the eleven previous projects.

The rollout took fourteen months. It came in within one month of the reference-class median. The bottoms-up estimate, derived from the careful Gantts, missed by eight months, about 133% of the original number. The outside view, derived from the dumb arithmetic of looking at the history, missed by one month.
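For readers who want to score the two forecasts the same way, here is a minimal sketch. It assumes the miss is measured against the forecast itself (actual minus forecast, divided by forecast); the function name forecast_miss is hypothetical, and the outside view is taken to be the reference-class median of thirteen months.

```python
def forecast_miss(forecast, actual):
    """Return (months missed, miss as a fraction of the forecast)."""
    months = actual - forecast
    return months, months / forecast

actual = 14        # months the rollout actually took
inside_view = 6    # bottoms-up estimate, in months
outside_view = 13  # reference-class median, in months

inside_months, inside_pct = forecast_miss(inside_view, actual)
outside_months, outside_pct = forecast_miss(outside_view, actual)

print(f"inside view missed by {inside_months} months ({inside_pct:.0%})")
print(f"outside view missed by {outside_months} months ({outside_pct:.0%})")
```

Scoring every forecast with the same function, after the fact, is what lets a company compare the two views across many projects rather than one.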

This is not an unusual story. It is the modal story. It is what reference-class forecasting almost always produces, in almost every domain we have ever seen it applied: a number meaningfully better than the bottoms-up, derived from less work, requiring only the willingness to use it.

The "but this one is different" objection

The objection, when it surfaces, is always the same. This one is different. Sometimes the objection is articulated. More often it is felt — a quiet certainty in the room that the reference class, however statistically interesting, does not apply here.

The point is not that the project isn't different. The point is that every over-running project's team thought theirs was different too. The teams behind the 19-month and 22-month rollouts in the example above each had, at kickoff, a coherent story for why their project would be the one that broke the pattern. They were wrong. Not because they were unintelligent — they were quite intelligent — but because the project's specialness, however real, was not specialness that overcame the structural factors that drive these projects long.

The reference class does not claim your project will go like the average. It claims that, in the absence of decisive evidence to the contrary, your project should be priced as a draw from the same population. The burden of proof is on the team claiming the exemption, not on the team applying the base rate.

This is where the discipline lives. Reference-class forecasting does not require the team to abandon the inside view. It requires the team to anchor the inside view against the outside view, and to require that any deviation from the reference class be defended with evidence specific enough that the next team, looking at this project ten years later, will recognize it as evidence rather than as wishful thinking.

A useful question, in our engagements, is: what would the team behind the 22-month project have said, at kickoff, about why they would be the one to come in under twelve? When you can answer that question — and recognize the answer in your own current story — you have done the work the outside view requires.

How to actually do it

The mechanics are simpler than the vocabulary suggests. There are four steps.

Define the reference class. What other projects, run by this company or by comparable companies, are of similar scope, similar complexity, similar risk profile? The class should be large enough to give you a distribution — usually at least five, preferably more than ten — and similar enough that the comparison is honest. Defining the class is harder than it sounds; the temptation is always to draw the boundary tightly enough that the current project's specialness is preserved. Resist.

Get the historical distribution. For each project in the class, what was the bottoms-up estimate at kickoff, and what was the actual outcome? You are looking for the ratio of actual to estimate, sometimes called the uplift. In our experience, software rollouts cluster around 1.5x to 2.5x; M&A integrations cluster around 1.8x to 3x; large product launches are more variable but rarely come in below the bottoms-up.

Adjust the new project's estimate against the distribution. The simplest version is to take the bottoms-up and multiply by the median uplift from the reference class. The more honest version is to present the new project's likely outcome as a range derived from the historical distribution — say, the 20th to 80th percentile of the reference class — and to let the room reason about where in the range this particular project belongs.

Document the case for any deviation. If the team believes this project belongs at the optimistic end of the reference class, write down why. Specifically. Falsifiably. In a way that the next team, looking at the document in three years, can score against the actual outcome. This step is the one that converts reference-class forecasting from a one-time exercise into a learning system.
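The second and third steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed implementation: every (estimate, actual) pair in the history below is invented for the example, the names are our own, and the 20th and 80th percentiles are computed with the standard library's statistics.quantiles using its inclusive method, which is one of several reasonable percentile conventions.

```python
from statistics import median, quantiles

# Hypothetical reference class: (kickoff estimate, actual outcome) in months.
# These pairs are invented for illustration only.
history = [
    (6, 9), (8, 14), (5, 11), (10, 19), (7, 12), (6, 13),
    (9, 16), (4, 8), (12, 22), (6, 10), (8, 14),
]

# Step 2: the uplift is the ratio of actual to estimate for each past project.
uplifts = [actual / estimate for estimate, actual in history]

# Step 3, simple version: scale the new estimate by the median uplift.
new_estimate = 6  # bottoms-up estimate for the new project, in months
median_uplift = median(uplifts)
point_forecast = new_estimate * median_uplift

# Step 3, more honest version: a range from the 20th to 80th percentile.
# quantiles(n=5) returns the 20th, 40th, 60th, and 80th percentile cut points.
cuts = quantiles(uplifts, n=5, method="inclusive")
low = new_estimate * cuts[0]
high = new_estimate * cuts[-1]

print(f"median uplift: {median_uplift:.2f}x")
print(f"point forecast: {point_forecast:.1f} months")
print(f"20th to 80th percentile range: {low:.1f} to {high:.1f} months")
```

The fourth step then has something concrete to attach to: the written case for deviation names where in that printed range the team claims the project belongs, and the eventual actual either lands there or does not.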

What it changes

A leadership team that adopts this discipline will, in our experience, see two changes within twelve months.

The first is that bottoms-up estimates become more honest. The team producing them knows they will be compared against a reference class, and the anchoring pulls the bottoms-up estimate toward realism. The estimates do not become pessimistic; they become less optimistic, which is closer to the truth.

The second is that the conversation about each project shifts. Instead of arguing about whether the bottoms-up Gantt is right, the room argues about whether this project is, in fact, an exception to the reference class. That conversation is much more productive. It surfaces the specific reasons for optimism that are credible, and it dissolves the ones that aren't. It produces a forecast the team can defend to itself.

A CFO who installs this on top of the planning rhythm we have described elsewhere will find that the company's overall forecast accuracy improves in a way that, in retrospect, looks like it should have been obvious. It was. It is the kind of obviousness that takes the discipline of doing it to see.

We help leadership teams build the reference-class libraries that make this practical — turning the company's own history into the most useful forecasting asset it owns. If your next major commitment is being estimated entirely from the inside, that is the conversation worth having before the number is locked in.

The Bayeseon Team

Writes about decision quality at Bayeseon. Reach the team at hello@bayeseon.com.
