You have no idea how wrong you are
In 2005, Philip Tetlock published a study that should have ended every confident pundit's career. Over twenty years, Tetlock tracked 28,000 predictions from 284 experts — political scientists, economists, intelligence analysts, journalists — people whose professional credibility depended on their ability to forecast the future. The result: the average expert performed barely better than a dart-throwing chimpanzee. Worse, the most confident experts — the ones who appeared on television, who spoke with the most certainty, who had the strongest reputations — were the least accurate (Tetlock, 2005).
This was not a failure of intelligence. These were brilliant, well-informed people operating in their domains of expertise. The failure was structural: they never tracked their predictions. They made forecasts on television, in columns, in briefing rooms — and then the predictions evaporated. When an expert predicted regime change in six months and nothing happened, no one went back to score the prediction. When another predicted economic stability and a recession hit, the expert quietly reframed their earlier statements. Without records, hindsight bias — the universal tendency to believe you "knew it all along" — rewrote their memories. They genuinely believed they had been more accurate than they were.
You are doing the same thing. Not on television, not about geopolitics — but about everything. You predict how long tasks will take, whether relationships will work out, whether projects will succeed, whether market conditions will shift, whether your decisions will produce the outcomes you expect. You make dozens of implicit predictions every day. And because you never write them down, you never discover how wrong you are.
L-0143 established that overconfidence is the default calibration error. This lesson provides the mechanism for correcting it. You cannot fix a miscalibration you cannot see. Tracking your predictions makes the miscalibration visible. Everything else follows from that.
Why your memory is an unreliable scorekeeper
The reason untracked predictions teach you nothing has a name: hindsight bias. And it is one of the most robust findings in the history of cognitive psychology.
Baruch Fischhoff demonstrated the effect in 1975 with an elegantly simple experiment. He gave subjects a description of an 1814 conflict between British forces and Nepalese Gurkhas, listing four possible outcomes: British victory, Gurkha victory, a military stalemate with no peace settlement, or a stalemate ending in a peace settlement. One group received no information about the actual outcome. Four other groups were each told a different outcome had occurred. Then all subjects estimated the probability of each outcome "before knowing the result."
The results were unambiguous. Subjects who were told a particular outcome had occurred rated that outcome as significantly more probable — even though they had been explicitly instructed to ignore what they knew about the result. Fischhoff called this "creeping determinism": the mere knowledge that something happened makes it feel like it was always going to happen (Fischhoff, 1975). The effect has been replicated hundreds of times across domains — elections, verdicts, medical diagnoses, sporting events, business outcomes. It is not a weakness of careless thinkers. It is a structural feature of human memory.
This has a devastating consequence for learning from experience. When you make a prediction and do not record it, your memory of that prediction shifts to align with whatever actually happened. You predicted the project would take four weeks, it took eight, and three months later you remember being "worried it might take longer." You predicted a candidate would be a strong hire, they were terminated in six months, and you remember having "reservations about their fit." You predicted the market would be stable, it crashed, and you remember "sensing something was off."
None of these revised memories are lies. They are the natural operation of a brain that constructs coherent narratives from fragmentary evidence. But they are catastrophically wrong as a feedback mechanism. A feedback mechanism that tells you "you were approximately right" when you were consistently wrong does not just fail to improve your judgment — it actively prevents improvement by removing the signal you need to calibrate.
Annie Duke, a former professional poker player turned decision strategist, frames this as the fundamental problem of decision quality: without a contemporaneous record of your beliefs, you cannot distinguish between good decisions with bad outcomes and bad decisions with good outcomes (Duke, 2018). The decision journal — a written record of what you believed and why, made before the outcome is known — is the only tool that defeats hindsight bias. Not discipline. Not self-awareness. Not intelligence. A written record with a timestamp.
The superforecasters proved tracking works
The theoretical case for prediction tracking is strong. The empirical case is extraordinary.
In 2011, the Intelligence Advanced Research Projects Activity (IARPA) — the research arm of the US intelligence community — launched a forecasting tournament. The goal was to find out whether any method could improve geopolitical prediction beyond the dismal baseline Tetlock had documented. Over four years, 25,000 forecasters made over one million predictions on 500 geopolitical questions: Will North Korea test a nuclear device in the next year? Will the Euro-area GDP growth rate exceed 1%? Will there be a lethal confrontation between state military forces in the South China Sea?
The Good Judgment Project, led by Tetlock and Barbara Mellers at the University of Pennsylvania, entered the competition with a simple hypothesis: ordinary people who track their predictions, receive feedback on their accuracy, and systematically update their beliefs will outperform experts who do not.
They were right — by a staggering margin. The top forecasters in the Good Judgment Project, whom Tetlock called "superforecasters," were performing 30% better than professional intelligence analysts with access to classified information. They were beating prediction markets. They were crushing the performance of famous pundits (Tetlock & Gardner, 2015).
What distinguished superforecasters was not raw intelligence, though they were bright. It was not domain expertise, though many were knowledgeable. It was a cluster of practices centered on prediction tracking and iterative calibration. They assigned precise probability estimates rather than vague verbal hedges. They updated those estimates as new information arrived — frequently, often dozens of times per question. They tracked their accuracy using proper scoring rules. And they used the feedback to adjust not just individual predictions but their prediction process itself.
The research identified four distinct drivers of accuracy improvement. First, recruiting and retaining the best natural forecasters accounted for roughly 10% of the advantage over baselines. Second, cognitive debiasing training — teaching forecasters about overconfidence, anchoring, and base rate neglect — contributed another 10%. Third, team-based forecasting with structured discussion protocols added approximately 10%. And fourth, better statistical methods for aggregating individual forecasts contributed an additional 35% improvement over simple averaging (Mellers et al., 2014).
But underneath all four drivers was a single prerequisite: tracking. You cannot recruit good forecasters if you do not score their predictions. You cannot train people to debias if you cannot show them where their bias lives. You cannot improve team processes if you cannot measure team accuracy. You cannot aggregate forecasts intelligently if you do not know which forecasters to weight more heavily. Every intervention that improved forecasting accuracy depended on the existence of a written record, scored against reality.
The strongest predictor of who rose to superforecaster status was not IQ or education. It was what Tetlock called "perpetual beta" — a dispositional commitment to continuous self-improvement, manifested as a willingness to track, score, and revise one's predictions relentlessly. The superforecasters who maintained top performance across years were the ones who never stopped treating their prediction accuracy as a metric to be improved rather than a talent to be deployed.
The Brier score: making accuracy measurable
You cannot improve what you cannot measure. The superforecasters succeeded because they had a measurement tool: the Brier score.
Developed by meteorologist Glenn Brier in 1950, the Brier score is a proper scoring rule that quantifies how close your probabilistic predictions are to reality. The formula is simple: take the squared difference between your predicted probability and the actual outcome (coded as 1 if the event occurred, 0 if it did not), then average across all your predictions.
A perfect score is 0 — you assigned 100% to things that happened and 0% to things that didn't. The worst possible score is 1. Random guessing on binary outcomes produces a Brier score of 0.25. The typical well-calibrated human forecaster scores between 0.15 and 0.20. Superforecasters averaged around 0.10 to 0.15 across domains (Brier, 1950; Tetlock & Gardner, 2015).
The Brier score reveals something crucial that simple hit-rate tracking misses: the value of calibrated confidence. Suppose you predict ten events and assign 90% confidence to each. Seven occur. Your hit rate is 70% — which looks decent. But your Brier score exposes the problem: you were systematically overconfident. You said 90% when reality was 70%. The score penalizes you for the gap between your stated confidence and the actual frequency.
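To make the arithmetic concrete, here is a minimal Python sketch of the binary Brier score applied to the worked example above (the function and variable names are mine, not from Brier's paper):

```python
def brier_score(forecasts):
    """Binary Brier score: mean squared gap between stated probability
    and outcome (coded 1 if the event occurred, 0 if it did not)."""
    return sum((p - (1.0 if occurred else 0.0)) ** 2
               for p, occurred in forecasts) / len(forecasts)

# Ten predictions at 90% confidence, seven of which came true ...
overconfident = [(0.90, True)] * 7 + [(0.90, False)] * 3
# ... versus ten predictions at 70% confidence, seven of which came true.
calibrated = [(0.70, True)] * 7 + [(0.70, False)] * 3

print(round(brier_score(overconfident), 3))  # 0.25 -- as bad as guessing 50%
print(round(brier_score(calibrated), 3))     # 0.21 -- calibration pays
```

The overconfident forecaster's 70% hit rate looks respectable, but the score shows the 90% claims bought nothing: 0.25 is exactly what always answering "50%" would produce.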
This is why the Brier score has two components that matter for self-calibration. The first is calibration — how well your stated probabilities match the actual frequencies of outcomes. A perfectly calibrated forecaster's 70% predictions come true 70% of the time, their 30% predictions come true 30% of the time. The second is resolution — how much your probability estimates vary across predictions, indicating that you are distinguishing between more and less likely events rather than assigning the same vague probability to everything.
You do not need to calculate formal Brier scores to benefit from prediction tracking. But you need to understand the principle: accuracy is not about being right or wrong on individual predictions. It is about the long-run alignment between your confidence levels and reality. A forecaster who says "70% likely" and is right 70% of the time has useful, calibrated judgment — even though they are "wrong" 30% of the time. A forecaster who says "95% certain" and is right 70% of the time has dangerous, miscalibrated judgment — even though they are right more often than not.
The prediction journal: your calibration instrument
The practical implementation of prediction tracking is a prediction journal. The format is less important than the discipline. Here is what the record must contain.
The prediction. State the expected outcome precisely enough that resolution is unambiguous. "The market will go up" is useless — when, by how much, measured by which index? "The S&P 500 will close above 5,000 on March 31, 2026" is scoreable. Vagueness is where hindsight bias hides. The more specific your prediction, the harder it is for your memory to retroactively claim accuracy.
The confidence level. Assign a probability as a percentage. This feels unnatural at first — how can you put a number on something uncertain? That discomfort is precisely the point. Vague verbal expressions like "probably" or "likely" are unscoreable because they mean different things to different people and to the same person at different times. Research shows that when people say "probably," they mean anything from 55% to 95% (Wallsten et al., 1986). A number forces precision. Precision enables scoring. Scoring enables calibration.
The reasoning. Three to five sentences explaining why you hold this belief at this confidence level. What evidence supports the prediction? What evidence argues against it? What would change your mind? The reasoning serves two functions: it forces you to examine the basis of your belief at the time you hold it, and it creates a record you can audit after the outcome is known to identify which parts of your reasoning process are reliable and which are systematically flawed.
The resolution date. When will you know if the prediction was correct? Open-ended predictions are unscoreable. Every prediction needs a deadline.
The resolution and score. When the deadline arrives, record what actually happened and score the prediction. Did the event occur? How does your confidence level compare to the outcome? Where was your reasoning accurate, and where did it fail?
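The five fields above translate directly into a data structure. Here is one possible sketch in Python; the class and field names are illustrative, not a standard format:

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional

@dataclass
class Prediction:
    """One entry in a prediction journal (field names are illustrative)."""
    statement: str           # specific, unambiguously resolvable outcome
    confidence: float        # probability as a number, e.g. 0.65
    reasoning: str           # why you believe this, recorded now
    resolution_date: date    # the deadline for scoring
    recorded_at: datetime = field(default_factory=datetime.now)  # timestamp
    occurred: Optional[bool] = None   # filled in only at resolution

    def score(self) -> float:
        """Brier contribution of this single prediction."""
        if self.occurred is None:
            raise ValueError("prediction has not resolved yet")
        return (self.confidence - (1.0 if self.occurred else 0.0)) ** 2

entry = Prediction(
    statement="The S&P 500 closes above 5,000 on March 31, 2026",
    confidence=0.60,
    reasoning="Momentum is positive; the main downside risk is rate policy.",
    resolution_date=date(2026, 3, 31),
)
entry.occurred = True      # filled in on the resolution date
entry.score()              # Brier contribution, about 0.16
```

The timestamp is the point: `recorded_at` is set when the entry is created, so the record of what you believed exists before the outcome does.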
Annie Duke recommends a complementary practice she calls the "knowledge tracker" — recording not just predictions but the state of your knowledge at the time of each decision. What did you know? What were you uncertain about? What assumptions were you making? This additional layer of recording defeats the second major bias that corrupts untracked judgment: outcome bias, the tendency to evaluate decisions based on their results rather than the quality of the reasoning process (Duke, 2020).
The prediction journal is not a productivity tool. It is a calibration instrument — the cognitive equivalent of a bathroom scale. The scale does not make you lighter. But without it, you will systematically overestimate your fitness because your self-image is biased toward flattering narratives. The prediction journal does not make you more accurate. But without it, you will systematically overestimate your accuracy because your memory is biased toward remembering your hits and forgetting your misses.
The collective evidence: forecasting platforms at scale
Individual prediction journals are powerful. Large-scale forecasting platforms confirm the principle at population level.
Metaculus, a community forecasting platform with hundreds of thousands of predictions across science, technology, geopolitics, and public health, has demonstrated that tracked and aggregated forecasts achieve remarkably strong calibration. Metaculus community predictions have exhibited Brier scores averaging 0.10 to 0.20 across domains, with approximately 50% of questions resolving within the forecasters' stated 50% confidence intervals — the mathematical signature of proper calibration (Metaculus, 2023).
Prediction markets — platforms where participants bet on outcomes — provide another line of evidence. The Iowa Electronic Markets outperformed polls in predicting presidential elections from 1988 to 2000, with prediction errors within 1.5 percentage points of actual vote outcomes compared to 1.9 percentage points for polls (Berg et al., 2008). Analysis of 3,587 prediction markets found overall accuracy of 92.4%, with calibration curves indicating that market-priced probabilities match actual outcome frequencies without systematic bias.
The mechanism in both cases is identical to the individual prediction journal, scaled up. Participants make explicit forecasts. The forecasts are recorded with timestamps. Outcomes are scored. Accuracy feedback is provided. Participants who track their records and update their methods improve. Participants who ignore their records stay at baseline.
The key finding across all these platforms is that improvement happens. Forecasting accuracy is not a fixed trait. It is a skill that improves with practice — but only with scored practice. Unscored practice is just repetition of errors with increasing confidence.
AI as your calibration partner
Artificial intelligence transforms prediction tracking from a solo discipline into a partnership with a system that never forgets, never succumbs to hindsight bias, and can identify patterns in your prediction data that you cannot see.
An AI calibration partner can serve three functions that a paper journal cannot.
Structured recording. AI can prompt you for predictions at consistent intervals, enforce the specificity standards that make predictions scoreable, and flag predictions that are too vague to resolve. When you write "I think the project will probably be late," an AI assistant can respond: "What is your specific prediction? By what date? At what confidence level?" This enforcement of prediction hygiene is where most solo tracking systems fail — the discipline required to maintain rigor is exactly the kind of slow, unglamorous work that human attention drifts away from.
Pattern detection. After you have accumulated fifty or a hundred tracked predictions, an AI system can analyze your calibration data in ways that would take you hours. Which domains are you most overconfident in? Which types of events do you systematically underpredict? Is your overconfidence worse on Mondays than Fridays? Does your confidence correlate with your emotional state at the time of prediction? These patterns exist in your data. You will not find them by scanning journal entries. AI will find them in seconds.
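The simplest version of this analysis fits in a few lines: group resolved predictions by domain and compare average stated confidence with the actual hit rate. A sketch, assuming each prediction is stored as a (domain, confidence, occurred) triple (this layout is illustrative):

```python
from collections import defaultdict

def calibration_by_domain(predictions):
    """Per-domain gap between stated confidence and actual hit rate.

    predictions: iterable of (domain, confidence, occurred) triples for
    resolved predictions only. A positive gap means overconfidence in
    that domain; a negative gap means underconfidence.
    """
    buckets = defaultdict(list)
    for domain, confidence, occurred in predictions:
        buckets[domain].append((confidence, occurred))

    report = {}
    for domain, items in buckets.items():
        avg_confidence = sum(c for c, _ in items) / len(items)
        hit_rate = sum(1 for _, o in items if o) / len(items)
        report[domain] = {
            "avg_confidence": avg_confidence,
            "hit_rate": hit_rate,
            "gap": avg_confidence - hit_rate,  # > 0: overconfident
        }
    return report
```

An AI partner runs richer versions of this (by weekday, by mood, by prediction type), but the underlying operation is the same grouping and comparison.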
Base rate retrieval. One of the most consistent findings in forecasting research is that people underuse base rates — the historical frequency of similar events. When you predict that your startup will succeed, you are using the "inside view," constructing a narrative from the specific details of your situation. The "outside view" asks: what percentage of startups with similar characteristics actually succeed? AI can retrieve relevant base rates instantly, anchoring your predictions to historical data rather than to the optimistic narratives that overconfidence generates.
Metaculus's AI forecasting benchmarks demonstrate both the power and the limits of AI prediction. In their 2024-2025 tournaments, the best AI forecasting bots — systems that conduct internet searches, generate predictions, and aggregate their own outputs — achieved performance that approached but did not match the best human forecasters (Metaculus, 2025). Human forecasters maintained an edge, particularly on questions requiring contextual judgment, stakeholder analysis, and domain expertise that cannot be scraped from the internet.
The implication is clear: neither AI nor human forecasting alone is optimal. The combination — human judgment providing contextual reasoning and value weighting, AI providing base rate retrieval, pattern detection, and calibration feedback — produces the strongest results. Your prediction journal is the interface where this partnership operates.
The protocol: building your tracking practice
Start today. Not after you finish this lesson. Not after you design the perfect system. The perfect system does not exist. The best prediction journal is the one you actually use.
Week 1: Establish the habit. Write three predictions per day for seven days. They can be small: "My 2pm meeting will run over 30 minutes (65%)." "The package will arrive by Friday (80%)." "My team will hit the sprint goal (55%)." The predictions do not need to be important. They need to be specific, timestamped, and recorded outside your head.
Week 2: Score and adjust. Review your Week 1 predictions. For every prediction that has resolved, score it: did the event occur? Was your confidence level appropriate? Calculate your hit rate for predictions where you were above 70% confident, and separately for predictions where you were below 50% confident. If your above-70% hit rate is significantly below 70%, you are overconfident. If your below-50% hit rate is significantly above 50%, you are underconfident. Both are data.
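The Week 2 scoring step can be sketched in a few lines of Python. The thresholds come from the text; the function name and data layout are illustrative:

```python
def week_two_review(predictions):
    """Bucket resolved predictions by stated confidence and report hit rates.

    predictions: list of (confidence, occurred) pairs, where confidence is
    a probability in [0, 1] and occurred is True/False.
    """
    def hit_rate(items):
        return sum(o for _, o in items) / len(items) if items else None

    high = [p for p in predictions if p[0] > 0.70]   # above 70% confident
    low  = [p for p in predictions if p[0] < 0.50]   # below 50% confident
    return {
        # well-calibrated: this should land at or above roughly 0.70
        "high_confidence_hit_rate": hit_rate(high),
        # well-calibrated: this should land at or below roughly 0.50
        "low_confidence_hit_rate": hit_rate(low),
    }
```

A high-confidence hit rate well below 0.70 is the overconfidence signal; a low-confidence hit rate well above 0.50 is the underconfidence signal.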
Week 3: Increase stakes. Start tracking predictions that matter: project timelines, hiring decisions, strategic bets, relationship outcomes. Apply the same discipline: specific outcome, confidence percentage, reasoning, resolution date. These higher-stakes predictions are where your calibration errors cost you the most — and where correction has the highest value.
Week 4: Analyze patterns. After three weeks of tracked predictions, look for systematic patterns. Are you consistently overconfident about timelines? Underconfident about people? Poorly calibrated on anything involving money? These patterns are your perceptual distortions made visible — exactly the material that the rest of Phase 8 is designed to address.
The prediction journal is not a phase-specific exercise. It is a permanent instrument in your epistemic infrastructure. Superforecasters do not stop tracking after they become accurate. They track because tracking is what makes them accurate and what keeps them accurate. The moment you stop tracking is the moment hindsight bias resumes its silent corruption of your judgment.
From tracking to feeling: the emotional dimension
You now have a tool for making your prediction accuracy visible. The data it produces will reveal something uncomfortable: your errors are not random. They cluster. They follow patterns. And those patterns, as you will discover in L-0145, are driven by something deeper than cognitive miscalibration.
Your prediction journal will show you that you are more overconfident when you are excited about a project. That you underestimate risks when you feel socially pressured to be optimistic. That your confidence level tracks your mood more closely than it tracks the available evidence. That anxiety makes you predict negative outcomes at higher rates, while euphoria makes you dismiss warning signs.
These are not noise in your prediction data. They are signal — evidence that emotional states distort perception systematically, bending your forecasts away from reality in predictable directions. Tracking your predictions makes these distortions measurable. L-0145 will make them understandable. Together, the two lessons build the case that calibrating your judgment requires calibrating not just your reasoning but the emotional field in which your reasoning operates.
Start the journal. Record the predictions. Score them honestly. The data will teach you things about your own mind that no amount of introspection could reveal.
Sources:
- Tetlock, P. E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton, NJ: Princeton University Press.
- Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. New York: Crown.
- Fischhoff, B. (1975). "Hindsight is not equal to foresight: The effect of outcome knowledge on judgment under uncertainty." Journal of Experimental Psychology: Human Perception and Performance, 1(3), 288-299.
- Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., ... & Tetlock, P. E. (2014). "Psychological strategies for winning a geopolitical forecasting tournament." Psychological Science, 25(5), 1106-1115.
- Duke, A. (2018). Thinking in Bets: Making Smarter Decisions When You Don't Have All the Facts. New York: Portfolio/Penguin.
- Duke, A. (2020). How to Decide: Simple Tools for Making Better Choices. New York: Portfolio/Penguin.
- Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1-3.
- Berg, J. E., Nelson, F. D., & Rietz, T. A. (2008). "Prediction market accuracy in the long run." International Journal of Forecasting, 24(2), 285-300.
- Wallsten, T. S., Budescu, D. V., Rapoport, A., Zwick, R., & Forsyth, B. (1986). "Measuring the vague meanings of probability terms." Journal of Experimental Psychology: General, 115(4), 348-365.
- Metaculus. (2023). "Community Prediction Calibration and Accuracy Report." metaculus.com.