
We are living through a massive experiment in information dynamics, and I’m starting to worry about the results.

Initially, Large Language Models (LLMs) were trained on the “wild” internet—a chaotic, messy, and deeply human repository of text. It was a library written by people. But today, the internet is fundamentally changing. More and more of the content we consume (and that future models will train on) is heavily influenced, if not completely generated, by AI.

This creates a closed loop—a snake eating its own tail. In computer science, this phenomenon has a name: Model Collapse. And if we view this through the lens of system dynamics, it looks less like a “glitch” and more like a classical system failure waiting to happen.

Here is why I’m worried, and what the research says we have to do to stop it.

The Curse of Recursion

The core of the problem is something researchers call “The Curse of Recursion”.

Generative models learn to approximate a probability distribution, and what they reproduce best is its middle. They are great at the center of the bell curve: the likely, the standard, the safe. They are terrible at the “tails”: the rare, the weird, the creative, and the idiosyncratic bits of human expression.

When a new model trains on data generated by an old model, it isn’t seeing the original, rich human variance. It’s seeing a smoothed-out “shadow” of that data. Over generations, the “tails” are chopped off entirely. The model becomes confident, fluent, and completely vacuous. It stops hallucinating wild ideas and starts hallucinating plausible-sounding nonsense, converging on a “beige” mean where every sentence looks perfect but says nothing.
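
To make the mechanism concrete, here is a toy simulation (my own sketch, loosely in the spirit of the recursion papers, not code from any of them): fit a simple distribution to some data, sample from the fit, fit the samples, and repeat. The tails quietly vanish.

```python
# Toy illustration of recursive training: each "generation" fits a Gaussian to the
# previous generation's output, then publishes only fresh samples from that fit.
# Rare, tail-ish values get resampled less and less often, so the estimated spread
# tends to collapse over many generations.
import numpy as np

rng = np.random.default_rng(0)
N = 50                                     # documents per generation (tiny on purpose)

data = rng.normal(0.0, 1.0, size=N)        # generation 0: real, human-made data
print(f"gen   0: std = {data.std():.3f}")
for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()    # "train" on whatever we were handed
    data = rng.normal(mu, sigma, size=N)   # the next generation only sees this output
    if gen % 40 == 0:
        print(f"gen {gen:3d}: std = {data.std():.3f}")
```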

A System Without Brakes

My biggest concern stems from a simple principle of systems thinking: stability requires Balancing Loops.

A “Reinforcing Loop” amplifies change (think compound interest or a viral outbreak). Right now, AI generation is a massive reinforcing loop. It is cheaper and faster to generate synthetic text than to write it by hand, so synthetic text floods the web; that flood then becomes training data for the next generation of models, which makes “web-like” content even cheaper to churn out.

In the pre-AI world, we had natural balancing loops:

  1. Cognitive Cost: Writing nonsense took time, so humans mostly didn’t do it.

  2. Reality Check: Humans wrote about the physical world, constantly grounding language in reality.

Those loops are broken. We now have a system dominated by positive feedback loops with no natural restoring force. In classical system dynamics, a system with only reinforcing loops eventually overshoots and collapses. 
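
To see why the missing brakes matter, a back-of-the-envelope simulation is enough (the numbers are purely illustrative, not a model of the real web):

```python
# Toy system-dynamics sketch: a pure reinforcing loop compounds without bound,
# while adding a balancing loop (a restoring force that grows as the system
# overshoots) lets it settle instead of exploding.
def simulate(steps=30, growth=0.3, brake=0.0, capacity=100.0):
    x = 1.0
    for _ in range(steps):
        reinforcing = growth * x                  # more x begets more x
        balancing = brake * x * (x / capacity)    # pushes back harder near capacity
        x += reinforcing - balancing
    return x

print("reinforcing only:     ", round(simulate(), 1))            # compounds to ~2,600
print("with a balancing loop:", round(simulate(brake=0.3), 1))   # settles near 100
```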

The Thermodynamics of “Surprise”

So, is the internet doomed to become a wasteland of synthetic sludge? Not necessarily. But saving it requires us to engineer artificial balancing loops.

The most fascinating concept I’ve come across in the research is Surplexity—a combination of “Surprise” and “Perplexity”.

For an AI to learn, it needs to be surprised. If a model reads a sentence and can perfectly predict every word, it learns nothing. It has zero “information gain.” Synthetic data is low-surprise; it confirms what the model already knows. To prevent collapse, we need to filter for High Surplexity—content that zags when the model expects it to zig.

This is where the value of humanity returns. The weird, the erratic, the “long tail” of human thought is high-entropy. It is surprising.
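
Here is roughly what a “surprise filter” could look like in practice. To be clear, this isn’t the Surplexity metric itself, just plain perplexity under a small open model (I’m assuming GPT-2 via Hugging Face as the scorer, and the threshold is made up): text the model finds too predictable gets dropped from the training pool.

```python
# Sketch of a perplexity-based "surprise" filter. High perplexity means the scorer
# model did not see the text coming, i.e. there is something left to learn from it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Mean cross-entropy over the tokens; exponentiating gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

docs = [
    "The capital of France is Paris.",
    "My grandmother tuned her zither with a butter knife and swore by it.",
]
SURPRISE_THRESHOLD = 40.0   # illustrative; in practice you would calibrate this
keep = [d for d in docs if perplexity(d) > SURPRISE_THRESHOLD]
```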

The “Pre-2023” Gold Standard

The research suggests a few concrete ways we can mitigate this collapse, and they all point to one conclusion: Human data is becoming a non-renewable resource.

  1. The Anchor Slice: Developers are now realizing they must preserve “Gold Standard” datasets (like the Pile or Common Crawl snapshots from before 2023) to ground future models. We have to constantly remind the AI what “real” looked like before the synthetic flood.

  2. Accumulate, Don’t Replace: A key finding in recent papers is that if you replace real data with synthetic data, the model collapses fast. But if you accumulate data—mixing the new synthetic stuff with the old real stuff—the system stabilizes. We can’t throw away the past. (There’s a small code sketch of this after the list.)

  3. System 2 Thinking: We are moving toward models that don’t just predict the next word (System 1) but actually “think” and critique their own output before showing it to us (System 2). This internal “Governor” acts as a new balancing loop, catching hallucinations before they pollute the training pool.
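
And here is what “accumulate, don’t replace” looks like stripped down to a data-mixing sketch (the function name, ratio, and toy documents are my own illustrative choices):

```python
# Sketch of "accumulate, don't replace": the synthetic pool only ever grows, and a
# fixed human "anchor slice" is mixed back into every training round.
import random

def next_training_set(anchor_human, pool, new_synthetic, anchor_weight=0.3):
    pool = pool + new_synthetic                        # accumulate; never overwrite old data
    n_anchor = max(1, int(anchor_weight * len(pool)))  # keep a steady share of real text
    batch = random.choices(anchor_human, k=n_anchor) + pool
    random.shuffle(batch)
    return batch, pool

human_anchor = ["human doc A", "human doc B", "human doc C"]   # the pre-2023 slice
pool = []
for gen in range(3):
    synthetic = [f"synthetic doc, generation {gen}"]
    batch, pool = next_training_set(human_anchor, pool, synthetic)
    print(f"gen {gen}: training on {len(batch)} docs, synthetic pool of {len(pool)}")
```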

The Bottom Line

I’m worried about LLM collapse not because I think AI will stop working, but because I worry about the “Beige-ing” of the web. If we aren’t careful, the digital world will become a hall of mirrors, reflecting our own averages back at us until we forget what original thought looks like.

But there is a silver lining. As synthetic content becomes free and infinite, verified human authenticity becomes the scarcest and most valuable asset on the planet. The “balancing loop” we are looking for might just be us.
