Every category you use is a compression algorithm
You walk into a bookstore and see 10,000 distinct objects. Each one differs from every other in dimensions, weight, page count, binding method, font size, cover art, paper quality, publication date, author background, and subject matter. Your brain does not process 10,000 unique items. It processes a handful of categories — fiction, nonfiction, reference, children's — and navigates from there. You just compressed a 10,000-item space into four or five buckets, losing enormous amounts of detail while preserving the one distinction you need: what kind of reading experience am I looking for?
That is what every classification system does. It takes a space of high-dimensional, unique particulars and compresses it into a smaller set of categories that are useful for a specific purpose. Classification is not a preliminary step before "real" thinking begins. Classification is a form of data compression — and understanding it this way transforms how you design, evaluate, and repair your category systems.
The mathematics behind the intuition
Claude Shannon formalized this relationship in 1948 when he published "A Mathematical Theory of Communication." Shannon proved that every source of information has a measurable quantity called entropy — the minimum number of bits required to represent its output without losing any information. His source coding theorem established that you cannot compress below this entropy limit without accepting some loss of data.
Shannon later extended this idea in 1959 with rate-distortion theory, which asks the inverse question: if you are willing to accept some loss, how much can you compress? The theory provides a mathematical function — the rate-distortion function — that quantifies the exact tradeoff between compression rate and fidelity. The less you care about perfect reconstruction, the fewer bits you need.
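The entropy floor is easy to see in code. A minimal sketch, assuming a memoryless source so that entropy is just the frequency-weighted log-probabilities of the symbols:

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Shannon entropy H = -sum(p * log2(p)): the lossless compression
    floor, in bits per symbol, for a memoryless source."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A skewed source compresses much further than a uniform one.
skewed = "aaaaaaab"    # 'a' seven times out of eight
uniform = "abcdefgh"   # eight distinct symbols, equally likely

print(entropy_bits(skewed))   # ~0.544 bits/symbol
print(entropy_bits(uniform))  # 3.0 bits/symbol
```

No lossless code can beat these numbers on average; to go lower, you must accept distortion, which is exactly where rate-distortion theory picks up.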
This is precisely what classification does in everyday cognition. When you label a colleague as "reliable," you are compressing hundreds of data points — every interaction, every deadline met or missed, every promise kept or broken — into a single descriptor. The compression ratio is enormous. The distortion is real: "reliable" erases the time they missed a deadline because their child was sick, erases the context in which their reliability varies, erases the trajectory of whether they are becoming more or less reliable over time. But for the purpose of deciding whether to assign them a critical deliverable, the compressed label may be sufficient.
Jorma Rissanen formalized a related principle in 1978 that makes the connection to classification even more explicit. His Minimum Description Length (MDL) principle states that the best model for a dataset is the one that compresses the data most. A good classification system, in MDL terms, is one where describing the categories plus the exceptions within each category takes fewer bits than describing every item individually. When your categories capture real structure in the world, classification achieves genuine compression — you say less and convey more.
Your brain already works this way
George Miller's landmark 1956 paper "The Magical Number Seven, Plus or Minus Two" introduced a concept that looks different from compression but is actually the same mechanism operating in biological hardware. Miller demonstrated that short-term memory holds roughly 5 to 9 chunks of information — not 5 to 9 individual items, but 5 to 9 grouped units. A phone number like 8005551234 exceeds your chunk limit if you process each digit individually (10 items). But group it as 800-555-1234 and you have three chunks. Same information, fewer units. That is compression.
Nelson Cowan's later research (2001, 2010) tightened the estimate to approximately 3 to 5 items in the central executive workspace when chunking strategies are stripped away. This makes cognitive compression not merely useful but mandatory. You cannot think about the world in its full, uncompressed form. Your working memory has a fixed bandwidth. Categories are how you fit a complex world through that narrow channel.
Eleanor Rosch's prototype theory (1978) reveals how the compression algorithm itself works in cognition. Rosch proposed two principles governing natural category formation: cognitive economy and perceived world structure. Cognitive economy is the compression drive — the brain's preference for gaining maximum information about the environment with minimum cognitive resources. Perceived world structure is the signal the brain compresses — the fact that real-world attributes are not randomly distributed but cluster in predictable ways. Dogs tend to have four legs, fur, tails, and teeth. These attributes co-occur, making "dog" a natural compression point where one label predicts many features.
This is why categories feel effortless. They are your brain's native compression format. You do not experience the act of categorizing a new person as "friendly" as a lossy compression operation — but that is exactly what it is. You are discarding most of what you observed (their specific facial expressions, vocal cadences, word choices, body language sequences) and retaining a single-word summary that predicts future behavior well enough for social navigation.
Lossy versus lossless: what you can and cannot recover
In data compression, the distinction between lossy and lossless is absolute. Lossless compression (like ZIP files) preserves every bit — you can reconstruct the original perfectly. Lossy compression (like JPEG images or MP3 audio) discards information permanently to achieve much higher compression ratios. You can shrink a photo by 90% with lossy compression, but you can never get the discarded detail back.
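The distinction is easy to demonstrate in miniature with two toy codecs — run-length encoding (lossless) and rounding to a fixed step (lossy):

```python
def rle_encode(s):
    """Lossless: run-length encoding. Decoding recovers the input exactly."""
    runs, i = [], 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs.append((s[i], j - i))
        i = j
    return runs

def rle_decode(runs):
    return "".join(ch * n for ch, n in runs)

def quantize(xs, step):
    """Lossy: round each value to the nearest multiple of `step`.
    The rounding error is discarded permanently."""
    return [round(x / step) * step for x in xs]

print(rle_decode(rle_encode("aaabbbbcc")))   # "aaabbbbcc" — perfect round trip
print(quantize([0.12, 0.49, 0.51], 0.5))     # [0.0, 0.5, 0.5] — 0.49 and 0.51 now collide
```

Notice that after quantizing, 0.49 and 0.51 map to different buckets while 0.12 and 0.49 do not — the codec has drawn category boundaries, and no decoder can recover which side of a boundary the original value sat on.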
Almost all classification is lossy. When you sort your notes into "work" and "personal," you lose the fact that some notes sit at the intersection. When a hospital assigns a patient to "semi-urgent," it loses the specific clinical reasoning that led to that determination. When a hiring committee labels a candidate as "strong yes," it collapses a multi-dimensional evaluation into a single point.
The critical question is not whether the compression is lossy — it almost always is — but whether it is lossy in the right places. Rate-distortion theory tells us that optimal compression minimizes the distortion that matters while freely discarding what does not. A music compression algorithm can discard frequencies outside human hearing range without any perceived loss. A medical triage system can discard a patient's dietary preferences without affecting treatment priority. These are irrelevant dimensions for the given purpose.
The failures happen when compression discards dimensions that turn out to be relevant. This is exactly the mechanism behind stereotyping.
When compression goes wrong: the case of stereotyping
Cognitive psychologists have long recognized that stereotyping is a direct consequence of the brain's compression machinery. As the research literature consistently frames it, stereotypes are a byproduct of the brain's limited cognitive capacity and its natural processes for organizing large amounts of information to navigate social life efficiently. A stereotype compresses an entire population of unique individuals into a single prototype. It is, computationally, a very high compression ratio applied to human beings.
The problem is not that the compression exists — some social categorization is inevitable and even useful. The problem is that stereotyping applies lossy compression to dimensions that matter enormously for the compressed individuals: their competence, their character, their potential. Research on the "need for cognitive closure" shows that people with a higher need for closure lean harder on stereotypical compression precisely because "stereotypes represent pre-existing knowledge structures, ready to be used momentarily, whereas individuating information may require extensive further processing."
This is the compression tradeoff made visible in its most consequential form. The person applying the stereotype gains cognitive savings. The person being stereotyped loses their individuality — the very dimensions that make them who they are get discarded in the compression step. And unlike JPEG artifacts, which are merely aesthetic, the artifacts of social over-compression shape hiring decisions, medical diagnoses, legal outcomes, and life trajectories.
The lesson is not "stop compressing." You cannot. The lesson is: audit what your compression discards, and check whether any of it matters for the decision you are making.
Engineering compression: how software systems apply the same principle
Software engineering is, in significant part, the discipline of managing compression through abstraction. A well-designed API is a compression interface — it hides the complexity of the underlying system behind a small set of operations that expose only what the consumer needs.
Consider a payment processing API. Behind the endpoint charge(amount, currency, card_token) sits an enormous amount of complexity: fraud detection algorithms, bank network routing, currency conversion, retry logic, compliance checks, logging, error recovery. The API compresses all of that into three parameters. The consumer does not need to understand Visa's authorization protocol to charge a credit card. That detail has been compressed away.
This is the same operation as classification. The API designer decides which dimensions of the underlying system are relevant to the consumer (the amount, the currency, the payment method) and which dimensions can be discarded at this layer of abstraction (the fraud model, the routing algorithm, the retry strategy). Good API design, like good categorization, achieves high compression while preserving the distinctions the user actually needs.
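A hypothetical sketch of that compression boundary — the class name, helper methods, and routing table below are all invented for illustration, not any real payment provider's API:

```python
class PaymentGateway:
    """Hypothetical sketch: a three-parameter surface over hidden machinery."""

    def charge(self, amount_cents: int, currency: str, card_token: str) -> str:
        # The compressed interface: everything the consumer needs, nothing more.
        self._check_fraud(card_token, amount_cents)
        route = self._pick_route(currency)
        return self._submit(route, amount_cents, currency, card_token)

    # --- compressed away: consumers never see or configure these ---
    def _check_fraud(self, token, amount_cents):
        if amount_cents > 1_000_000:        # stand-in for a real fraud model
            raise ValueError("flagged for review")

    def _pick_route(self, currency):
        return {"USD": "bank-a", "EUR": "bank-b"}.get(currency, "bank-default")

    def _submit(self, route, amount_cents, currency, token):
        # Stand-in for the bank network round trip; returns a charge id.
        return f"ch_{route}_{abs(hash((amount_cents, currency, token))) % 10**8}"

charge_id = PaymentGateway().charge(2500, "USD", "tok_visa_123")
print(charge_id)   # e.g. "ch_bank-a_58201934"
```

Everything below the dividing comment is the discarded dimension set: the consumer's mental model never has to include it, which is precisely the cognitive saving the abstraction buys.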
The failure mode is also identical to classification failure. An API that is too compressed — too few parameters, too little configurability — forces users into workarounds when their needs do not fit the compressed interface. An API that is under-compressed — too many parameters, too much exposed internality — provides no cognitive savings and is just as complex as building the system yourself. The design literature calls this the balance between "outside-in" design (compress based on consumer needs) and "inside-out" design (expose internal structure), which maps directly to the distinction between purpose-driven categories and implementation-driven categories.
AI and the Third Brain: learned compression at scale
Modern AI systems have turned compression into a learnable operation, and the results reveal something deep about how categories work.
Word embeddings like Word2Vec (Mikolov et al., 2013) compress the full distributional behavior of words across billions of words of training text into dense vectors of 100 to 300 dimensions. The word "king" in raw text is a string of characters with no inherent meaning. In an embedding space, "king" becomes a point in 300-dimensional space positioned near "queen," "monarch," and "ruler" — and the vector arithmetic king - man + woman = queen emerges from the compression itself. The categories are not programmed. They are discovered by the compression algorithm.
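A toy sketch of that vector arithmetic, using two hand-made dimensions in place of the 100 to 300 learned ones (real embeddings are fit from corpus statistics, never written by hand like this):

```python
import math

# Invented 2-D vectors; read the axes roughly as (royalty, gender).
vecs = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman: subtract the "male" direction, add the "female" one.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vecs[w], target))
print(best)   # "queen"
```

The analogy works here only because the hand-made axes encode the relevant structure; the remarkable fact about real embeddings is that the compression discovers such axes on its own.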
Autoencoders make the compression operation explicit. An autoencoder is a neural network with a bottleneck — it is forced to squeeze high-dimensional input through a narrow middle layer and then reconstruct the original from that compressed representation. The bottleneck layer is the category system. Whatever features the network preserves through the bottleneck are the ones it has learned matter for reconstruction. Whatever it discards is its learned version of "irrelevant detail." Unlike PCA (Principal Component Analysis), which can only find linear compression axes, autoencoders can learn nonlinear compression — capturing the kind of complex, overlapping structure that characterizes real-world categories.
ImageNet, the dataset that catalyzed the deep learning revolution, is itself an exercise in compression hierarchy. Organized with the WordNet taxonomy, ImageNet arranges 22,000 visual categories across 9 levels of abstraction — from level 1 ("mammal") to level 9 ("German shepherd"). Each level represents a different compression ratio applied to the visual world. Level 1 compresses enormously — millions of distinct animals into a handful of classes. Level 9 preserves fine-grained distinctions that only matter if you care about dog breeds. The hierarchy is a multi-resolution compression scheme where you choose the level of detail appropriate to your task.
When you use a large language model to summarize a document, you are asking for compression. When you prompt it to classify customer feedback into categories, you are asking it to learn your compression scheme. When you build a retrieval-augmented generation system, the embedding index is a compression of your knowledge base into a searchable, category-structured space. Understanding that classification is compression gives you a vocabulary for evaluating these tools: What is the compression ratio? What is being discarded? Is the distortion acceptable for your purpose?
The compression audit
Here is the protocol for evaluating any classification system as a compression scheme:
1. Identify the input space. What is the full diversity of items being classified? How many dimensions of variation exist before compression? A customer feedback system might have input dimensions including sentiment, topic, urgency, customer tier, product area, and feature specificity.
2. Identify the output space. How many categories exist? How many dimensions does the compressed representation preserve? If you are compressing that customer feedback into "positive / negative / neutral," your output space is one dimension with three values.
3. Calculate the compression ratio. Roughly how many distinct items map to each category? If one category holds most of your items, it is failing to differentiate where you probably need differentiation. If you have categories with nearly zero items, you are preserving distinctions you do not need.
4. Audit the distortion. For each category, what information is lost? Write it down explicitly. Then ask: does any of that lost information matter for the decisions this classification supports? If you are making product roadmap decisions based on customer feedback, "positive / negative / neutral" loses exactly the information you need most — what is positive or negative, and how urgently it matters.
5. Adjust the compression level. Add categories where the distortion is damaging your decisions. Remove categories where the preserved distinctions are not actionable. The right compression level is not a property of the data — it is a property of your purpose.
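Steps 2 and 3 of the protocol can be partly automated. A sketch over a toy feedback stream — the labels and their proportions are invented, and a real audit would also examine what each category discards:

```python
import math
from collections import Counter

def audit(assignments):
    """Output-space size, average compression ratio, and how evenly items
    spread across categories (entropy of use vs. the maximum possible)."""
    counts = Counter(assignments)
    n, k = len(assignments), len(counts)
    spread = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "categories": k,
        "items_per_category": n / k,                          # avg compression ratio
        "balance": spread / math.log2(k) if k > 1 else 1.0,   # 1.0 = even spread
        "dominant": counts.most_common(1)[0],
    }

# Toy feedback stream compressed into three sentiment buckets.
labels = ["negative"] * 70 + ["positive"] * 25 + ["neutral"] * 5
report = audit(labels)
print(report["items_per_category"])   # ≈ 33.3: each label stands in for ~33 items
print(report["dominant"])             # ("negative", 70): one bucket does most of the work
```

A dominant bucket holding 70% of the items is the signature of under-differentiation: the category where you most need distinctions is exactly the one compressing hardest.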
This is the deep insight that connects information theory to practical epistemology: there is no "correct" compression. There is only compression that is appropriate for a given purpose, measured by whether the preserved information supports the decisions you need to make. Shannon proved that lossless compression below the entropy limit is impossible. But once you accept loss, the achievable limit is set by the rate-distortion function — and that function depends on which differences you count as distortion. Change the purpose, and the optimal compression scheme changes with it.
Every category system you build — for your tasks, your contacts, your knowledge, your beliefs — is a compression algorithm you are running on your experience of the world. The question is not whether to compress. You must compress; your cognitive bandwidth demands it. The question is whether you are compressing well — preserving what matters, discarding what does not, and regularly auditing the boundary between the two.