You remember your predictions wrong
In a study published in 1975, psychologists Baruch Fischhoff and Ruth Beyth ran a deceptively simple experiment. Before Richard Nixon's historic 1972 visits to Beijing and Moscow, they asked participants to estimate the probability of various outcomes — Would Mao Zedong agree to meet with Nixon? Would the US and Soviet Union reach a nuclear arms agreement? After the trips concluded and the outcomes were known, they asked participants to recall the probabilities they had originally assigned.
The results were systematic and damning. Participants remembered having given higher probabilities to events that actually happened and lower probabilities to events that did not. They did not lie. They genuinely believed their recalled predictions matched their original ones. Their memories had quietly edited the record to align with reality, creating the confident sensation of "I knew it would happen" — when the written evidence proved they had not known at all (Fischhoff & Beyth, 1975).
This is hindsight bias, and it is not a minor cognitive quirk. It is a perceptual distortion that operates on every prediction you make, every judgment you render, every decision you evaluate after the fact. Without a written record, your memory does not store your predictions — it reconstructs them. And that reconstruction is systematically biased toward making you look more prescient than you were.
You have spent fifteen lessons in Phase 8 building the capacity to calibrate your perception. You have learned that perception is construction (L-0141), that your confidence often exceeds your accuracy (L-0145), and that other people serve as calibration instruments by exposing your blind spots (L-0155). This lesson introduces the tool that makes all of that calibration work durable: the decision journal. A written record of your predictions, your confidence levels, and your reasoning — created before outcomes are known — that gives you the one thing hindsight bias destroys: honest feedback on the quality of your own judgment.
Why written records change everything
The decision journal is not a diary. It is not a reflective practice in the loose sense. It is a measurement instrument — a tool that converts your subjective judgments into scoreable data.
Shane Parrish, who popularized the practice through Farnam Street, structures the decision journal around a simple but rigorous protocol: before any significant decision, you record the date, the decision, the mental and physical state you are in, the problem you are solving, the variables that matter, the range of outcomes you foresee, and your expected outcome with a probability attached. Then you close the journal and do not revisit the entry until enough time has passed — six months is the default — for the outcome to be known (Parrish, 2014).
The power of this practice lies in what it prevents. Without the journal, your memory performs three distortions that make learning from experience nearly impossible.
Distortion one: outcome editing. You remember your predictions as closer to what actually happened than they were. Fischhoff and Beyth demonstrated this in 1975, and fifty years of subsequent research has confirmed it across every domain studied — medical diagnosis, legal judgment, financial forecasting, sports predictions, political analysis (Roese & Vohs, 2012). The distortion is not a failure of intelligence. It is a feature of how memory works: outcomes become anchors that pull your recollection of prior beliefs toward them. A written record made at the time of prediction is immune to this pull.
Distortion two: reasoning reconstruction. You remember your reasoning as more coherent than it was. When a decision works out, you recall clear thinking. When it fails, you recall circumstances that forced your hand. Neither memory is accurate. The journal captures the actual reasoning — including the confusion, the competing factors, the emotional state — so that when you review it later, you confront what you actually thought, not the narrative your ego constructed after the fact.
Distortion three: pattern erasure. Without a record, you cannot see patterns across decisions. You might be systematically overconfident in one domain and well-calibrated in another, but without data, you experience each decision as an isolated event. The journal turns isolated judgments into a dataset — and datasets reveal patterns that individual data points cannot.
Donald Schon's foundational work on reflective practice draws the distinction between reflection-in-action — adjusting as you go — and reflection-on-action — reviewing what happened after the fact (Schon, 1983). The decision journal is the infrastructure that makes reflection-on-action honest. Without the written record, reflection-on-action is just memory reviewing itself, and memory is the unreliable narrator you are trying to correct.
The scoring system that makes calibration visible
Recording predictions is necessary. Scoring them is where calibration becomes actionable.
Philip Tetlock's Good Judgment Project — the largest study of human forecasting accuracy ever conducted — demonstrated that ordinary people who tracked and scored their predictions improved dramatically. The project recruited thousands of volunteer forecasters and tasked them with predicting geopolitical events: elections, conflicts, economic shifts, diplomatic outcomes. What separated the best forecasters from the rest was not superior intelligence or domain expertise. It was the practice of making specific predictions with probability estimates, tracking outcomes, and learning from the gap between confidence and accuracy (Tetlock & Gardner, 2015).
The superforecasters — the top 2% — outperformed professional intelligence analysts with access to classified information by approximately 30%. A one-hour training module on probabilistic reasoning improved accuracy by 6 to 11% over control groups, and the effect persisted across all four years of the study. The mechanism was straightforward: when you assign a number to your confidence and then track whether reality matches that number, you generate feedback that your brain cannot distort after the fact (Mellers et al., 2014).
The Brier score, developed by meteorologist Glenn Brier in 1950, provides the formal scoring mechanism. It measures the mean squared difference between your predicted probabilities and the actual outcomes, coded as 1 if the event happened and 0 if it did not. If you say something is 90% likely and it happens, your Brier score for that prediction is low (0.01): you were right and confident. If you say something is 90% likely and it does not happen, your score is high (0.81): you were confident and wrong. Scores run from 0 (perfect) to 1 (maximally wrong); a forecaster who is both well calibrated and sharp, assigning probabilities near 0 or 1 and being right, drives the score toward zero over a large number of predictions.
But the decomposition of the Brier score matters more than the raw number. Murphy's decomposition splits it into calibration (often called reliability), resolution, and an irreducible uncertainty term set by the base rate of the events themselves. Calibration measures whether your probabilities match reality: do 70% of the things you rate at 70% actually happen? Resolution measures whether your predictions differentiate: can you distinguish likely events from unlikely ones, or do you assign 50% to everything? A person with good calibration but poor resolution is honest about uncertainty but cannot tell what is probable. A person with good resolution but poor calibration can distinguish probable from improbable but assigns the wrong numbers. Your journal will reveal which problem you have.
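The score and its decomposition are simple enough to compute from a list of predictions and outcomes. A minimal Python sketch (function names are illustrative, not taken from any cited tool):

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities
    and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def murphy_decomposition(forecasts, outcomes):
    """Split the Brier score into reliability (calibration), resolution,
    and uncertainty, grouping predictions that share a stated probability.
    Identity: brier = reliability - resolution + uncertainty."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    reliability = sum(
        len(os) * (f - sum(os) / len(os)) ** 2 for f, os in groups.items()
    ) / n
    resolution = sum(
        len(os) * (sum(os) / len(os) - base_rate) ** 2 for os in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

The uncertainty term depends only on the events, not on you; reliability and resolution are the two levers your practice can move.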
Here is the practical application. After thirty days of recording predictions with confidence levels, sort them into buckets: all predictions where you said 90%, all where you said 70%, all where you said 50%. Compute the hit rate for each bucket. If your 90% predictions came true 72% of the time, you are overconfident at the high end. If your 50% predictions came true 80% of the time, you are underconfident when hedging. If your 70% predictions came true 69% of the time, you are well-calibrated in that range. These numbers are your calibration profile — the precise map of where your perception of your own judgment is accurate and where it is systematically distorted.
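The bucket sort is mechanical and worth automating. A sketch, assuming each journal entry is reduced to a (confidence, came_true) pair:

```python
from collections import defaultdict

def calibration_profile(entries):
    """Hit rate per stated-confidence bucket.
    entries: iterable of (confidence, came_true) pairs."""
    buckets = defaultdict(list)
    for confidence, came_true in entries:
        buckets[confidence].append(1 if came_true else 0)
    return {c: sum(hits) / len(hits) for c, hits in sorted(buckets.items())}
```

A result like `{0.9: 0.72}` is the overconfidence signature: you said 90% and reality delivered 72%.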
What the journal reveals that introspection cannot
Metacognition — thinking about your own thinking — is the foundation of all calibration work. But metacognition without data is unreliable. Research on metacognitive accuracy draws a critical distinction between metacognitive sensitivity — how well you can distinguish your correct judgments from your incorrect ones — and metacognitive bias — the systematic direction in which your self-assessment is off (Fleming & Lau, 2014). Most people have moderate metacognitive sensitivity — they can sort of tell when they know something versus when they are guessing — but significant metacognitive bias, typically in the direction of overconfidence.
The calibration log exposes four specific patterns that introspection alone cannot access.
Domain-specific calibration differences. You are not uniformly calibrated. You are likely well-calibrated in domains where you have deep experience and systematic feedback, and poorly calibrated in domains where feedback is delayed, ambiguous, or absent. A surgeon who gets immediate feedback on every operation may be well-calibrated on surgical outcomes but poorly calibrated on financial predictions. Your journal will reveal which domains deserve your confidence and which do not.
Emotional state correlations. Decisions made under stress, excitement, anger, or fatigue carry systematic biases. But you do not notice these biases in real time because the emotional state that distorts your judgment also distorts your perception of its quality. The journal's requirement that you record your mental and emotional state alongside each prediction creates a dataset that reveals correlations invisible to introspection. After sixty entries, you may discover that predictions made after 8 PM are 20% less accurate, or that predictions made during periods of high stress carry twice the overconfidence of predictions made when calm.
Confidence inversions. Some people are most accurate at moderate confidence levels and least accurate at their extremes. They are well-calibrated when they say 60% but overconfident when they say 95%. Others show the opposite pattern — careful at the extremes, sloppy in the middle. Without the journal, you cannot know which pattern is yours.
Temporal decay of calibration. Calibration is not static. It degrades when you stop paying attention to it and improves when you actively track it. The journal provides the longitudinal data that shows whether your calibration is improving, degrading, or oscillating — information you need to decide whether your current practice is working or needs adjustment.
Your AI partner as calibration analyst
Here is where the calibration log becomes dramatically more powerful than any pen-and-paper practice of the past.
A decision journal with fifty entries is useful. The same journal fed to an AI system that can perform statistical analysis, identify correlations across variables, and surface patterns invisible to manual review is transformative. The human-AI calibration partnership works at three levels.
Level one: pattern extraction. After you have accumulated thirty or more journal entries, an AI can analyze them for correlations you would never find manually. It can identify that your predictions about people are 15% more accurate than your predictions about systems, that your morning predictions outperform your afternoon predictions, that your confidence is inversely correlated with accuracy when the prediction involves a timeline beyond two weeks. These patterns exist in the data. You will not find them by reading through your journal. An AI will.
Level two: calibration scoring. An AI can compute your Brier scores, decompose them into calibration and resolution, track them over time, and visualize the trajectory. It can tell you whether your calibration is improving on a monthly basis, whether specific training exercises (like the ones in this phase) are producing measurable improvement, and where your remaining miscalibration is concentrated. This turns calibration from a subjective impression into a quantified metric.
Level three: real-time calibration adjustment. Once your calibration profile is established — "overconfident on timelines by approximately 20%, well-calibrated on people assessments, underconfident on technical estimates outside core domain" — an AI can prompt you to adjust in real time. When you record a new prediction about a timeline, the system can surface your historical accuracy on timeline predictions and ask: "Your base rate for 90% confidence on timelines is 62%. Do you want to adjust?" This is not the AI making the judgment for you. It is using your own data to correct for biases your brain cannot correct on its own.
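That real-time prompt needs nothing more sophisticated than a lookup against your own history. A sketch, with the record layout and function name purely hypothetical:

```python
def adjustment_prompt(history, domain, stated_confidence):
    """Surface the historical hit rate for this domain and confidence
    level before a new prediction is accepted.
    history: iterable of (domain, confidence, came_true) records.
    Returns None when there is no history to cite."""
    matches = [hit for d, c, hit in history
               if d == domain and c == stated_confidence]
    if not matches:
        return None
    base_rate = sum(matches) / len(matches)
    return (f"Your base rate for {stated_confidence:.0%} confidence on "
            f"{domain} is {base_rate:.0%}. Do you want to adjust?")
```

The prompt fires only when your data has something to say; with no matching history, it stays silent rather than invent a base rate.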
The critical principle: the AI works with your data, not its own. If you feed it your predictions and outcomes, you get your calibration profile. If you ask it to make predictions for you, you get its calibration profile and learn nothing about your own. The journal must be yours. The analysis can be augmented.
The calibration journal protocol
This protocol is designed to be sustainable. The number one killer of calibration practices is excessive overhead — journals that require twenty minutes per entry get abandoned in a week. This one requires two minutes to record and one hour per month to review.
Daily recording (2 minutes).
For each significant prediction — aim for one to three per day — record:
- Date and time.
- The prediction. State it specifically enough to score. "The deployment will succeed" is too vague. "The deployment will complete without rollback by 5 PM Friday" is scoreable.
- Confidence. Use a percentage. Start with round numbers: 50%, 60%, 70%, 80%, 90%. Avoid 100% and 0% — nothing is certain, and admitting that is itself a calibration exercise.
- Reasoning. Two to three sentences on why you hold this confidence level. This is the material that hindsight bias would destroy.
- State. One word for your mental-emotional condition: rested, anxious, rushed, focused, angry, uncertain.
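If you keep the journal digitally, the five daily fields and four outcome fields above map naturally onto a single record type. A sketch (field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JournalEntry:
    # Recorded at prediction time, before the outcome is known
    timestamp: datetime
    prediction: str           # specific enough to score
    confidence: float         # 0.0-1.0; avoid the certain extremes
    reasoning: str            # two to three sentences
    state: str                # one word: rested, anxious, rushed...
    # Filled in only when the outcome resolves
    outcome: Optional[str] = None
    came_true: Optional[bool] = None
    resolved_on: Optional[datetime] = None
    calibration_note: Optional[str] = None
```

Keeping the outcome fields optional enforces the protocol's core discipline: the entry is complete and frozen at prediction time, and resolution is a later, separate act.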
Outcome recording (30 seconds per entry).
When the outcome is known, return to the entry and add:
- Outcome. What happened. One sentence.
- Resolution date.
- Calibration note. One sentence: what does the gap between prediction and outcome tell you?
Monthly review (1 hour).
At the end of each month:
- Sort predictions by confidence level. Compute accuracy rate per bucket.
- Identify your three best-calibrated domains and your three worst.
- Check for emotional state correlations.
- Write a one-paragraph calibration adjustment: "This month I learned that I am [X] in [domain]. Next month I will adjust by [Y]."
- If using AI: feed the month's entries into your analysis tool and compare its pattern detection against your manual observations.
Quarterly calibration profile (30 minutes).
Every three months, synthesize your monthly reviews into an updated calibration profile. This document should answer: What am I good at predicting? What am I bad at predicting? In what direction am I typically wrong? Under what conditions is my judgment least reliable? How has my calibration changed since last quarter?
This profile is not an abstract self-assessment. It is a correction algorithm — a set of specific adjustments you apply to your future predictions based on empirical evidence about your past accuracy.
From recording to updating
You now have the measurement instrument. The next step is using what it reveals.
Recording your calibration shows you where your perceptual model of reality diverges from reality itself. You discover that you are overconfident here, underconfident there, accurate in some domains and systematically distorted in others. This is enormously valuable data — most people live their entire lives without knowing the shape of their own miscalibration.
But knowing the shape is not the same as correcting it. If your calibration log shows that you assign 90% confidence to timeline predictions that resolve correctly only 60% of the time, you have identified the error. The question becomes: how do you update? How much should you adjust when new evidence arrives? How do you proportion your belief change to the strength of the evidence rather than to the strength of your emotional reaction?
That is Bayesian updating — the subject of L-0157. The calibration log gives you the baseline. Bayesian reasoning gives you the update rule. Together, they form the core mechanism of a self-correcting perceptual system: you observe, you predict, you record, you compare, and you adjust — not by intuition, but by a procedure that learns from every prediction you make.
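As a one-line preview of that update rule, Bayes' theorem proportions the belief change to how diagnostic the evidence is, not to how it feels. A minimal sketch:

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a hypothesis after observing evidence.
    The size of the shift depends on how much likelier the evidence is
    under the hypothesis than under its negation."""
    numerator = prior * p_evidence_if_true
    return numerator / (numerator + (1 - prior) * p_evidence_if_false)
```

For example, `bayes_update(0.5, 0.8, 0.2)` returns 0.8: evidence four times likelier under the hypothesis moves a coin-flip prior to 80%, while evidence equally likely either way leaves the prior untouched.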
The calibration log is not a journal. It is the memory your brain should have but does not — an honest record of what you believed before you knew what happened. Start it today. Review it in thirty days. What it shows you about yourself will be more accurate than anything you currently believe about the quality of your own judgment.
Sources:
- Fischhoff, B., & Beyth, R. (1975). "I knew it would happen — Remembered probabilities of once-future things." Organizational Behavior and Human Performance, 13(1), 1-16.
- Roese, N. J., & Vohs, K. D. (2012). "Hindsight Bias." Perspectives on Psychological Science, 7(5), 411-426.
- Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. New York: Crown Publishers.
- Mellers, B., Ungar, L., Baron, J., et al. (2014). "Psychological Strategies for Winning a Geopolitical Forecasting Tournament." Psychological Science, 25(5), 1106-1115.
- Schon, D. A. (1983). The Reflective Practitioner: How Professionals Think in Action. New York: Basic Books.
- Fleming, S. M., & Lau, H. C. (2014). "How to measure metacognition." Frontiers in Human Neuroscience, 8, 443.
- Parrish, S. (2014). "How a Decision Journal Changed the Way I Make Decisions." Farnam Street Media.
- Brier, G. W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review, 78(1), 1-3.