Furnaces and Flight Simulators

Apr 09, 2026

In 1958, Mao Zedong launched one of the most catastrophic industrial experiments in human history. The Great Leap Forward’s backyard furnace campaign mobilized roughly 100 million people — peasants, teachers, students — to smelt steel in village kilns. Party cadres reported spectacular tonnage. The numbers looked like industrialization. They were not. What accumulated was a brittle, phosphorus-laden pig iron that could not bear structural load. What rotted in the fields, neglected while everyone tended the furnaces, was the harvest.

The famine that followed killed tens of millions of people.

I want to be clear up front: I am not equating software bugs with mass starvation. The analogy I am drawing is structural, not moral. But the structural parallel is striking enough to be worth taking seriously, because the same failure mode is assembling itself at the intersection of AI-generated code and the economics of agentic labor. A recent paper from Christian Catalini, Xiang Hui, and Jane Wu gives us the formal vocabulary to understand both the failure and the way out of it.

The Inspection Problem

Backyard pig iron and properly forged steel look similar to an untrained eye. Both are dark, heavy, metallic. The difference is invisible until the material is stressed — when you try to build a bridge with it, or a beam, or a load-bearing joint.

Vibe-coded software has the same inspection problem. A demo runs. The UI responds. Features appear to work. An executive, a product manager, a founder looking at a prototype is in roughly the same epistemic position as a party official being shown a pile of pig iron. It looks like progress. It might even be progress, for a season.

The failure reveals itself when load arrives: production traffic, edge cases, concurrent users, adversarial inputs, time. At that point, unhandled exceptions surface, the O(n²) bottlenecks that were invisible under test conditions begin compounding, and the authentication flow with the timing vulnerability gets found by someone other than your team. The pig iron buckles.
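To make the inspection problem concrete, here is a minimal sketch of a bottleneck that passes every demo. Both functions are hypothetical illustrations, not code from any real system. They answer the same question and return the same result; the only difference is how much work they do, which is invisible at demo scale.

```python
def has_duplicates_quadratic(items):
    """Compare every pair of items. Returns (found, comparisons_made)."""
    comparisons = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            comparisons += 1
            if items[i] == items[j]:
                return True, comparisons
    return False, comparisons


def has_duplicates_linear(items):
    """Remember seen values in a set. Returns (found, membership_checks_made)."""
    seen = set()
    checks = 0
    for item in items:
        checks += 1
        if item in seen:
            return True, checks
        seen.add(item)
    return False, checks


# Same answers at every size; very different amounts of work as n grows.
for n in (10, 2000):
    data = list(range(n))  # worst case: no duplicates, so nothing exits early
    _, quadratic_work = has_duplicates_quadratic(data)
    _, linear_work = has_duplicates_linear(data)
    print(n, quadratic_work, linear_work)
# n = 10   -> 45 comparisons vs 10 checks
# n = 2000 -> 1999000 comparisons vs 2000 checks
```

At ten items the two versions are indistinguishable to anyone watching the demo. At two thousand, the pairwise version is already doing about a thousand times more work, and the ratio keeps growing with the input.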

What makes this tractable as a framework — rather than just an analogy — is a concept Catalini and colleagues call the Measurability Gap: the structural asymmetry between the Cost to Automate, which is falling exponentially, and the Cost to Verify, which is biologically bottlenecked by human time and embodied experience. In a world where agentic systems can generate code faster than any human team could write it, the binding constraint on realizing value from that code is not execution. It is verification.

Vibe coding widens the Measurability Gap deliberately. That is the whole pitch. The agent generates; the human accepts. What accumulates in the gap between what the agent produced and what a senior engineer would have produced — the error handling it skipped, the security model it approximated, the architectural decisions it optimized for plausibility rather than durability — is what Catalini et al. call the Trojan Horse externality: hidden debt that enters production systems while measured outputs (lines shipped, features delivered, velocity) look fine.

The Missing Smelters

The second failure mode of the backyard furnace campaign was less visible than the useless pig iron, and more durable in its damage. By mobilizing peasants to smelt steel, the campaign destroyed the agricultural labor capacity that maintained the food supply. The people who knew how to farm were melting woks. The famine arrived not just because the steel was bad, but because producing it consumed the human capital that kept everything else working.

Catalini and colleagues have a name for the software equivalent: the Missing Junior Loop.

Expert-level software engineering is not a skill that can be installed from a training program. It is built through friction — through years of debugging code you wrote badly, through reading production postmortems, through the slow accumulation of pattern recognition that tells you, on inspection, that this function will fail under concurrency, that this schema will not survive a volume change. It is tacit knowledge in the technical sense: knowledge that cannot be fully codified, only transmitted through apprenticeship and practice.

The vibe coding ecosystem, at its most aggressive, proposes to eliminate the junior layer entirely. Why hire a junior engineer to spend three years building up that intuition when the model can produce working code faster, today? The answer — which the furnace campaign illustrates with terrible clarity — is that the junior layer is not where you get production code. It is where you grow the future senior engineers who can verify, debug, and steer the systems that agentic labor produces. When you melt down the apprenticeship pathway, you do not immediately feel the loss. The pig iron looks fine. The code ships. But you have destroyed the mechanism by which the next generation of expert verifiers is produced, precisely when expert verification is becoming more valuable, not less.

Catalini et al. call the compounding dynamic the Codifier’s Curse: experts codify their own knowledge into training data, automating the entry-level work through which their successors would have developed the capacity to oversee and correct expert-level systems. The people who could certify the steel are training the machine to replace themselves, and no one is training their replacement.

The Harvest Rotting in the Fields

There is a third failure mode, more insidious than either. During the furnace campaign, the fixation on steel production metrics meant that agricultural neglect compounded silently over seasons. Each individual decision to divert labor to the furnaces was locally rational — cadres were rewarded for steel output, not crop yield. The aggregate of those locally rational decisions was catastrophic.

Technical debt from vibe-coded systems operates on the same logic. Each shortcut is individually defensible. The deadline was real. The feature was needed. The AI produced something that worked in the demo. But these shortcuts compound. Security vulnerabilities accumulate. Architectural decisions made for speed become load-bearing walls that cannot be safely moved. The codebase grows unmaintainable not because any single decision was catastrophically wrong, but because no one fully understands the system. The people who might have caught the problems earlier were never hired, or left, or never developed the expertise to catch them because the junior pipeline was already gone.

By the time the famine arrives — the data breach, the cascading production failure, the system that cannot be extended to meet new requirements — the debt is deeply embedded. Like the actual famine, by the time it is visible, the damage is already done.

Flight Simulators and the Certification Problem

This is where the backyard furnace analogy reaches its limit — and where the most useful thinking begins.

Mao’s planners had no way to build metallurgical expertise outside of actual steel production. Village kilns could not approximate the sustained temperatures, the controlled alloy composition, or the accumulated craft knowledge that quality steelmaking required. The ceiling was fixed by physics. You could not practice your way to a working blast furnace in a peasant courtyard, no matter how many people you mobilized or how long you ran the campaign. The skills required to produce good steel could only be built in environments capable of producing good steel, which was precisely what the backyard furnaces were not.

Software expertise is different in a way that changes everything. The conditions required to build expert engineering judgment — exposure to failure modes, edge cases, architectural stress, adversarial inputs — could be created artificially. The training environment does not have to be a production system. And AI could be a remarkably good tool for building exactly these environments.

Aviation solved its own version of the expertise problem through the flight simulator. A pilot can accumulate thousands of hours of experience with equipment failures, adverse weather, and emergency procedures without ever endangering a plane or its passengers. The simulator is not a compromise — it is in many respects a superior training environment, because it can expose pilots to out-of-distribution events that safe commercial flying almost never produces. You can practice the engine fire at 400 feet on final approach as many times as it takes.

Catalini and colleagues formalize a direct analogue: synthetic practice (T_sim in their model), meaning AI-generated training environments that expose engineers to the edge cases, failure modes, and architectural stress tests that production experience would eventually surface, but compressed and made deliberate. Rather than waiting years for a junior engineer to encounter a race condition in production, you build the race condition into the training environment. Rather than hoping the apprentice happens to work on a system that gets breached, you simulate the breach.
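As a rough illustration of what "building the race condition into the training environment" could look like, here is a small sketch of a seeded exercise. Everything in it (the Account class, the function names, the use of a barrier) is an assumption made for illustration and is not taken from the paper. The barrier forces both threads to read the balance before either writes, so the bug appears on every run instead of once a year in production.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance


def unsafe_withdraw(account, amount, barrier, paid_out):
    current = account.balance      # read the balance
    barrier.wait()                 # exercise seed: both threads have read before either writes
    if current >= amount:          # check against a value that is about to go stale
        account.balance = current - amount
        paid_out.append(amount)


def safe_withdraw(account, amount, lock, paid_out):
    with lock:                     # read, check, and write happen as one step
        if account.balance >= amount:
            account.balance -= amount
            paid_out.append(amount)


def run(worker, sync):
    """Run two concurrent withdrawals of 100 against an account holding 100."""
    account, paid_out = Account(100), []
    threads = [
        threading.Thread(target=worker, args=(account, 100, sync, paid_out))
        for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance, sum(paid_out)


print(run(unsafe_withdraw, threading.Barrier(2)))  # (0, 200): 200 paid out of a 100 balance
print(run(safe_withdraw, threading.Lock()))        # (0, 100): second withdrawal refused
```

The trainee's task in an exercise like this would be to explain why the first run pays out twice and then arrive at something like the locked version on their own. The point of the sketch is only that the failure can be made cheap, repeatable, and safe to cause.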

This matters enormously for the Missing Junior Loop. If the path from novice to expert verifier traditionally ran through years of production friction, and agentic systems are eliminating that friction, the question is whether synthetic practice can substitute. The answer, the paper argues, is substantially yes — provided it is treated as a genuine investment and not an afterthought.

The parallel opportunity is what Catalini et al. call accelerated talent discovery: as execution costs collapse, individuals can cycle through domains and problems at a pace that was previously impossible. The traditional decade-long apprenticeship assumed that expertise could only accumulate through sustained exposure to a single domain. When AI handles the execution, humans are free to range more widely, surfacing genuine aptitude faster and arriving at the verification layer through a compressed timeline rather than an eliminated one.

Certification as the Competitive Moat

Catalini and colleagues distinguish two economic endpoints: the Hollow Economy, where explosive nominal output masks decaying human agency and accumulating hidden debt, and the Augmented Economy, where verification capacity scales alongside agentic power.

The key variable separating them is what they call the verifiable share of deployment: the fraction of agentic output actually underwritten by human expertise capable of certifying that the output is what it appears to be. As the Measurability Gap widens, the verifiable share falls — not because engineers become less capable, but because the volume of agentic output outpaces the bandwidth of the human expert layer that could verify it.
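The arithmetic behind that claim can be sketched in a few lines. This is a deliberately crude simplification of my own, not the paper's model: it assumes verification capacity is a fixed number of units per period and that the verifiable share is simply capacity divided by output, capped at one. The function name and the numbers are hypothetical.

```python
def verifiable_share(agentic_output, verification_capacity):
    """Fraction of output that a fixed pool of expert verifiers can certify.

    Simplifying assumption: each unit of output needs one unit of expert review.
    """
    if agentic_output <= 0:
        return 1.0
    return min(1.0, verification_capacity / agentic_output)


capacity = 100  # units the expert layer can certify per period (held constant)
for output in (50, 100, 400, 1600):
    print(output, verifiable_share(output, capacity))
# 50   -> 1.0
# 100  -> 1.0
# 400  -> 0.25
# 1600 -> 0.0625
```

Nothing about the engineers changes between the first row and the last. The share falls only because the numerator stays fixed while the denominator grows.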

The strategic implication is precise and, once stated, obvious: in the agentic economy, the competitive moat is not who can generate the most code. It is who can certify the most code.

As execution commoditizes toward the marginal cost of compute, rents migrate to what Catalini et al. call verification-grade ground truth: the audit trails, incident registries, outcome archives, and provenance logs that make agentic output insurable rather than merely plausible. The organizations that will extract durable value from AI-assisted development are not those that eliminate the human verification layer, but those that invest in it, in what the paper calls the “sandwich topology” of human intent, machine execution, and human verification and underwriting. The engineers they retain are not there despite the AI. They are there to certify what the AI produces.

This reframes the calculus on engineering headcount entirely. The question is not whether to replace senior engineers with AI. It is whether you can afford to be in the business of deploying AI output you cannot certify — and what the liability looks like when the pig iron is embedded in your production systems, and the engineers who might have caught it left two years ago.

Mao’s cadres reported record steel production while the harvest rotted. Measured activity rose. Actual value collapsed.

The danger with vibe coding is not that it produces nothing. It is that it produces something that looks right to everyone except the people who would know why it is wrong. And the solution is not to stop running the furnaces. It is to invest, deliberately and ahead of time, in the people and infrastructure capable of telling the steel from the slag — and in the simulators that build those people faster than the furnaces burn through them.

The furnaces are lit. The flight simulators need to be, too.

Mason Reeves
