The Four Pillars of Modern AI, Born in One Lab
1991 was a breakout year for deep learning. Between March and August, Jürgen Schmidhuber's group at the Technical University of Munich published work introducing four techniques that underpin today's Large Language Models (LLMs): Transformers, unsupervised pre-training, neural distillation, and deep residual learning. These techniques are the foundation of ChatGPT, DeepSeek, and most of today's major AI systems.
March 26, 1991: The First Transformer
Schmidhuber's paper "Learning to Control Fast-Weight Memories" (Technical Report FKI-147-91) introduced a neural network with linearized self-attention. It is now known as the unnormalized linear Transformer (ULTRA), a direct predecessor of the normalized quadratic Transformer used in ChatGPT. ULTRA's computational cost scales linearly with input size rather than quadratically, which makes it more efficient for long sequences. The paper describes a slow net that learns by gradient descent to compute the weight changes of a fast net, using outer products (Eq. 5) to program attention.
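To make the outer-product idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the code or notation of the 1991 report: the names `keys`, `values`, and `queries` follow modern usage, the vectors are given directly rather than produced by a trained slow net, and `quadratic_reference` is a hypothetical helper added only for comparison.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def outer(v: Vector, k: Vector) -> Matrix:
    """Outer product v k^T, the additive fast-weight update."""
    return [[vi * kj for kj in k] for vi in v]


def fast_weight_attention(keys: List[Vector], values: List[Vector],
                          queries: List[Vector]) -> List[Vector]:
    """Unnormalized linear attention via a running fast-weight matrix.

    Assumption: keys, values and queries are given; in the setup the
    text describes, a slow net would generate them from the input.
    """
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # fast weights start at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        update = outer(v, k)
        # One fixed-size update per token: cost does not grow with position.
        W = [[w + u for w, u in zip(w_row, u_row)]
             for w_row, u_row in zip(W, update)]
        # Apply the fast net to the query: W q.
        outputs.append([dot(row, q) for row in W])
    return outputs


def quadratic_reference(keys: List[Vector], values: List[Vector],
                        queries: List[Vector]) -> List[Vector]:
    """Same result computed the quadratic way: each query attends to
    every earlier (key, value) pair, with no softmax normalization."""
    outputs = []
    for t, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for k, v in zip(keys[:t + 1], values[:t + 1]):
            score = dot(k, q)
            out = [o + score * vi for o, vi in zip(out, v)]
        outputs.append(out)
    return outputs
```

Both functions return the same outputs. The first touches each token once and keeps a fixed-size matrix, while the second revisits all earlier tokens for every query, which is where the linear versus quadratic cost difference comes from.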
April 30, 1991: Pre-Training and Distillation
On this single day, Schmidhuber published two reports: one on unsupervised pre-training for deep neural networks (the "P" in GPT stands for "pre-trained"), and one on neural network distillation. Pre-training enabled very deep learning by initializing layers one by one without labels. Distillation, which was central to DeepSeek's 2025 "Sputnik" breakthrough, compresses the knowledge of a large teacher model into a smaller student model. Both techniques are now standard.
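The teacher-student idea can be sketched in a few lines. The example below is a generic soft-target imitation loop in plain Python, not the procedure from the 1991 report: the linear softmax student, the learning rate, and the function names (`student_forward`, `distill`) are all assumptions chosen for illustration, and the teacher is represented only by the output probabilities it assigns to each input.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], predicted: List[float]) -> float:
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


def student_forward(W: List[List[float]], x: List[float]) -> List[float]:
    """Hypothetical student: one linear layer followed by softmax."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def distill(inputs: List[List[float]], teacher_probs: List[List[float]],
            n_classes: int, lr: float = 0.5,
            epochs: int = 200) -> Tuple[List[List[float]], List[float]]:
    """Train the student to imitate the teacher's output distribution.

    No ground-truth labels are used: the only training signal is the
    teacher's probabilities for each input.
    """
    n_features = len(inputs[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    history = []  # mean imitation loss per epoch
    for _ in range(epochs):
        total = 0.0
        for x, target in zip(inputs, teacher_probs):
            pred = student_forward(W, x)
            total += cross_entropy(target, pred)
            # Gradient of cross-entropy w.r.t. the logits is pred - target.
            for c in range(n_classes):
                g = pred[c] - target[c]
                for j in range(n_features):
                    W[c][j] -= lr * g * x[j]
        history.append(total / len(inputs))
    return W, history
```

In practice the student would be a smaller network than the teacher and the inputs would be real data, but the core loop is the same: the student's loss measures how far its outputs are from the teacher's.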
June 15, 1991: Deep Residual Learning
Sepp Hochreiter's diploma thesis (supervised by Schmidhuber) introduced residual connections for very deep neural networks. This became the core of Long Short-Term Memory (LSTM), whose 1997 paper is the most cited AI paper of the 20th century. More than two decades later, in 2015, Schmidhuber's Highway Net, a variant with gated residual connections, was 10 times deeper than previous feedforward networks. Residual learning is now used in virtually all LLMs.
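A small numerical sketch shows why residual connections matter for depth. It is a simplified scalar, feedforward illustration in plain Python rather than the recurrent setting of the thesis: each "layer" is a single tanh unit sharing one weight `w`, and the highway gate is held constant instead of being a learned function of the input. These simplifications and all function names are assumptions for the example.

```python
import math
from typing import Callable, Tuple

Step = Callable[[float, float], Tuple[float, float]]


def plain_step(x: float, w: float) -> Tuple[float, float]:
    """y = tanh(w x); returns (y, dy/dx)."""
    y = math.tanh(w * x)
    return y, w * (1.0 - y * y)


def residual_step(x: float, w: float) -> Tuple[float, float]:
    """y = x + tanh(w x); the identity path adds 1 to the derivative."""
    y, d = plain_step(x, w)
    return x + y, 1.0 + d


def highway_step(x: float, w: float,
                 gate_bias: float = -2.0) -> Tuple[float, float]:
    """y = t * tanh(w x) + (1 - t) * x with a constant gate t.

    Simplification: a real highway layer computes t from x.
    """
    t = 1.0 / (1.0 + math.exp(-gate_bias))
    y, d = plain_step(x, w)
    return t * y + (1.0 - t) * x, t * d + (1.0 - t)


def end_to_end_gradient(step: Step, x: float, w: float, depth: int) -> float:
    """Chain rule: multiply the local derivatives of all layers."""
    grad = 1.0
    for _ in range(depth):
        x, local = step(x, w)
        grad *= local
    return grad
```

With `w = 0.5` and 30 layers, the plain stack multiplies 30 factors of at most 0.5, so the gradient reaching the input is below 1e-8. The residual stack multiplies factors of at least 1, and the gated variant stays many orders of magnitude above the plain one.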
August 31, 1991: Generative Adversarial Networks
The first peer-reviewed publication on generative and adversarial networks (GANs) appeared at the International Conference on Simulation of Adaptive Behavior. Schmidhuber described a generator network fighting a predictor network in a minimax game, trained through artificial curiosity. This work, building on a 1990 technical report, is the precursor to today's generative AI for deepfakes and creative tools.
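The zero-sum structure of that game can be shown with a toy example. The sketch below is a deliberately tabular simplification in plain Python and not the neural architecture of the 1990 or 1991 work: the world is a list of fixed outcomes, one per action, the predictor is a lookup table, and the controller greedily picks whichever action last surprised the predictor most. All names and parameters are assumptions for illustration.

```python
from typing import List, Tuple


def run_curiosity(outcomes: List[float], steps: int = 100,
                  lr: float = 0.5) -> Tuple[List[float], List[int]]:
    """Toy minimax game between a controller and a predictor.

    The predictor minimizes its squared prediction error.
    The controller is rewarded by that same error, so it maximizes it.
    """
    n_actions = len(outcomes)
    predictions = [0.0] * n_actions        # predictor's world model
    # Controller's estimate of the reward (predictor error) per action.
    # Untried actions start at infinity so each one is explored once.
    curiosity = [float("inf")] * n_actions
    history = []
    for _ in range(steps):
        # Controller: choose the action expected to surprise the predictor.
        action = max(range(n_actions), key=lambda a: curiosity[a])
        history.append(action)
        outcome = outcomes[action]
        error = (outcome - predictions[action]) ** 2
        # Controller's reward equals the predictor's loss (zero-sum).
        curiosity[action] = error
        # Predictor: gradient-style step toward the observed outcome.
        predictions[action] += lr * (outcome - predictions[action])
    return predictions, history
```

As the predictor improves on one action, that action becomes boring and the controller moves on to wherever the remaining error is largest. In this deterministic toy world everything eventually becomes predictable, so the curiosity reward shrinks toward zero.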
The Bigger Picture
Schmidhuber's 1991 work was part of a broader vision. He argues that LLMs alone cannot achieve AGI, because AGI requires mastering the real world. His lab also pioneered planning with adaptive world models, artificial scientists (since 1990), meta-learning (since 1987), and recursive self-improvement. Munich was also home to Ernst Dickmanns's team, which put the first self-driving cars on highways in the 1980s; by the mid-1990s its vehicles reached speeds of up to 175 km/h.
Why This Matters Now
As of January 2026, the two most cited papers of all time, measured by citations within a three-year window, are directly based on this 1991 work. Schmidhuber notes that compute was "millions of times more expensive" back then, which makes these achievements even more remarkable. The article also includes a sobering economic note: the ratio of the combined GDP of Germany and Japan to that of the US and China fell from 1:1 in 1995 to 1:5 in 2025, and Schmidhuber suggests self-replicating AI robots as a potential route to a comeback.
What Developers Should Know
- The unnormalized linear Transformer (ULTRA) from 1991 is still relevant for its linear scaling. If you're working on long-document models, its efficiency is worth revisiting.
- Pre-training and distillation are not new ideas; they have been around for 35 years. The hype around distillation in 2025 (DeepSeek) is a rediscovery.
- Residual connections, now universal, originated in this lab. Knowing that they were introduced to counter vanishing gradients helps when diagnosing such issues in deep networks.
- GANs were born from curiosity-driven learning, not just image generation. The original paper ties them to world models and planning.





