<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://a-history-of-ai.github.io/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://a-history-of-ai.github.io/blog/" rel="alternate" type="text/html" hreflang="en" /><updated>2026-03-18T16:43:07+00:00</updated><id>https://a-history-of-ai.github.io/blog/feed.xml</id><title type="html">A History of AI</title><subtitle>A history of Artificial Intelligence Blog</subtitle><author><name>Jamie Stark</name></author><entry><title type="html">Bayesian Networks: Reasoning Under Uncertainty</title><link href="https://a-history-of-ai.github.io/blog/Bayesian-Networks/" rel="alternate" type="text/html" title="Bayesian Networks: Reasoning Under Uncertainty" /><published>2026-03-16T00:00:00+00:00</published><updated>2026-03-16T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Bayesian-Networks</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Bayesian-Networks/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/bayesian-network-belief-propogation.png" alt="A directed acyclic graph representing a Bayesian network" height="300" />
    <figcaption style="font-size: 10px;">Belief Propagation</figcaption>
</figure>

<p><em>“Probability theory is nothing but common sense reduced to calculation.”</em> — Pierre-Simon Laplace, 1814</p>

<hr />

<p>“When you have uncertainty, and you always have uncertainty,” Judea Pearl said, “rules aren’t enough.” Pearl had arrived at AI by accident. Semiconductors wiped out his job at a memory research lab in 1969, he called a friend at UCLA, and took whatever position was available.</p>

<p>He spent years reading the field’s approaches to uncertain reasoning — fuzzy logic, belief functions, certainty factors — and became convinced it was avoiding something it already had the mathematics for. Bayes’ theorem had been available since 1763. It told you exactly how to update a belief when evidence arrived. Why was nobody using it?</p>

<p>The answer he arrived at was Bayesian networks: a graph where each node is a variable and each edge represents how one variable influences another. A disease node connects to symptom nodes, each connection carrying a probability table derived from data. When evidence arrives, it propagates through the graph and the mathematics tells you the most probable explanation.</p>
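<p>The update along a single edge is just Bayes’ theorem. A minimal sketch in Python, with illustrative numbers that come from no real diagnostic system:</p>

```python
# A one-edge Bayesian network: Disease -> Symptom.
# All numbers are invented for illustration.

p_disease = 0.01                  # prior P(D)
p_symptom_given_d = 0.90          # P(S | D), from the edge's probability table
p_symptom_given_not_d = 0.05      # P(S | not D)

# Evidence arrives: the symptom is observed. Update by Bayes' theorem:
#   P(D | S) = P(S | D) P(D) / P(S)
p_symptom = (p_symptom_given_d * p_disease
             + p_symptom_given_not_d * (1 - p_disease))
p_disease_given_symptom = p_symptom_given_d * p_disease / p_symptom

print(round(p_disease_given_symptom, 3))  # a 1% prior jumps to roughly 15%
```

<p>In a full network the same calculation runs along every edge, with messages combined at each node; that is the propagation Pearl worked out.</p>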

<p>The idea came from cognitive science. He had been reading a 1976 paper by David Rumelhart on how children read, which showed how neurons at multiple levels pass messages back and forth to resolve ambiguity, for example whether a smudged word is “FAR” or “CAR” or “FAT.” Pearl realised these messages had to be conditional probabilities. Bayesian reasoning was essential for passing them up and down the network and combining them correctly.</p>

<p>David Heckerman at Stanford built PATHFINDER using Pearl’s framework, a system for diagnosing lymph-node diseases across 60 conditions and 130 symptoms. Get the diagnosis wrong and a patient could miss life-saving treatment. Experienced specialists frequently disagreed with each other on exactly these calls. PATHFINDER matched the accuracy of the leading expert whose knowledge it had been built from.</p>

<p>But Bayesian networks only answered one kind of question: given what I observe, what is most probable? Pearl called this the first rung of a ladder, and spent the next decade climbing it.</p>

<ul>
  <li>
    <p><strong>Seeing.</strong> What is the probability of X given I observe Y? This is what Bayesian networks do, and what all statistics does. You watch a thousand smokers and note how many get cancer.</p>
  </li>
  <li>
    <p><strong>Doing.</strong> What would happen if I intervened and set X to a particular value? Observational data cannot answer this, however much of it you have. Watching smokers tells you nothing about whether making someone stop smoking will reduce their cancer risk, which is why Pearl formalised do-calculus to handle it.</p>
  </li>
  <li>
    <p><strong>Imagining.</strong> What would have happened if things had been different? The patient died. Would they have survived with a different drug? No dataset contains the answer to a counterfactual. Pearl developed the mathematics to reason about it anyway.</p>
  </li>
</ul>
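<p>The gap between the first two rungs can be made concrete with a toy simulation (every probability below is invented): a hidden confounder makes smoking look more dangerous in observational data than an intervention would show.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy structural model with a hidden confounder G:
# G -> S (G makes smoking likely) and G -> C (G raises cancer risk directly).
g = rng.random(n) < 0.3
s_obs = rng.random(n) < np.where(g, 0.8, 0.2)      # seeing: S follows G
c_obs = rng.random(n) < (0.05 + 0.10 * g + 0.05 * s_obs)

seeing = c_obs[s_obs].mean()                       # P(C | S=1): rung one

# Doing: intervene with do(S=1), which cuts the G -> S edge for everyone.
s_do = np.ones(n, dtype=bool)
c_do = rng.random(n) < (0.05 + 0.10 * g + 0.05 * s_do)
doing = c_do.mean()                                # P(C | do(S=1)): rung two

print(f"P(C | S=1)     = {seeing:.3f}")  # inflated by the confounder
print(f"P(C | do(S=1)) = {doing:.3f}")   # the true causal effect
```

<p>The observed conditional comes out near 0.163, the interventional probability near 0.130: same dataset, different questions, different answers.</p>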

<p>Pearl argued they were fundamentally different kinds of question, and that conflating them had caused enormous confusion in science, medicine, and economics for over a century.</p>

<p>Pearl received the Turing Award in 2011 for the first revolution. By then he was deep into the second, and later wrote that “fighting for the acceptance of Bayesian networks in AI was a picnic compared with the fight I had to wage for causal diagrams.”</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Pearl, J. (1982). <a href="https://bayes.cs.ucla.edu/aaai-bayes.pdf">“Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach.” AAAI-82 Proceedings.</a></p>
  </li>
  <li>
    <p>Heckerman, D., Horvitz, E., &amp; Nathwani, B. (1992). <a href="https://erichorvitz.com/Toward_Normative_Systems_MIM.pdf">“Toward Normative Expert Systems: The Pathfinder Project.” Methods of Information in Medicine.</a></p>
  </li>
  <li>
    <p>Pearl, J. (2018). <a href="https://arxiv.org/abs/1801.04016">“Theoretical Impediments to Machine Learning with Seven Sparks from the Causal Revolution.” arXiv.</a></p>
  </li>
  <li>
    <p>Pearl, J. (2018). <a href="https://ftp.cs.ucla.edu/pub/stat_ser/r476.pdf">“A Personal Journey into Bayesian Networks.” UCLA Cognitive Systems Laboratory.</a></p>
  </li>
  <li>
    <p>Pearl, J. (2023). <a href="https://magazine.amstat.org/blog/2023/09/01/judeapearl/">“Judea Pearl, AI, and Causality: What Role Do Statisticians Play?” Amstat News.</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Belief Propagation “Probability theory is nothing but common sense reduced to calculation.” — Pierre-Simon Laplace, 1814 “When you have uncertainty, and you always have uncertainty,” Judea Pearl said, “rules aren’t enough.” Pearl had arrived at AI by accident. Semiconductors wiped out his job at a memory research lab in 1969, he called a friend at UCLA, and took whatever position was available. He spent years reading the field’s approaches to uncertain reasoning — fuzzy logic, belief functions, certainty factors — and became convinced it was avoiding something it already had the mathematics for. Bayes’ theorem had been available since 1763. It told you exactly how to update a belief when evidence arrived. Why was nobody using it? The answer he arrived at was Bayesian networks: a graph where each node is a variable and each edge represents how one variable influences another. A disease node connects to symptom nodes, each connection carrying a probability table derived from data. When evidence arrives, it propagates through the graph and the mathematics tells you the most probable explanation. The idea came from cognitive science. He had been reading a 1976 paper by David Rumelhart on how children read, which showed how neurons at multiple levels pass messages back and forth to resolve ambiguity, for example whether a smudged word is “FAR” or “CAR” or “FAT.” Pearl realised these messages had to be conditional probabilities. Bayesian reasoning was essential for passing them up and down the network and combining them correctly. David Heckerman at Stanford built PATHFINDER using Pearl’s framework, a system for diagnosing lymph-node diseases across 60 conditions and 130 symptoms. Get the diagnosis wrong and a patient could miss life-saving treatment. Experienced specialists frequently disagreed with each other on exactly these calls. 
PATHFINDER matched the accuracy of the leading expert whose knowledge it had been built from. But Bayesian networks only answered one kind of question: given what I observe, what is most probable? Pearl called this the first rung of a ladder, and spent the next decade climbing it. Seeing. What is the probability of X given I observe Y? This is what Bayesian networks do, and what all statistics does. You watch a thousand smokers and note how many get cancer. Doing. What would happen if I intervened and set X to a particular value? Observational data cannot answer this, however much of it you have. Watching smokers tells you nothing about whether making someone stop smoking will reduce their cancer risk, which is why Pearl formalised do-calculus to handle it. Imagining. What would have happened if things had been different? The patient died. Would they have survived with a different drug? No dataset contains the answer to a counterfactual. Pearl developed the mathematics to reason about it anyway. Pearl argued they were fundamentally different kinds of question, and that conflating them had caused enormous confusion in science, medicine, and economics for over a century. Pearl received the Turing Award in 2011 for the first revolution. By then he was deep into the second, and later wrote that “fighting for the acceptance of Bayesian networks in AI was a picnic compared with the fight I had to wage for causal diagrams.” Further Reading Pearl, J. (1982). “Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach.” AAAI-82 Proceedings. Heckerman, D., Horvitz, E., &amp; Nathwani, B. (1992). “Toward Normative Expert Systems: The Pathfinder Project.” Methods of Information in Medicine. Pearl, J. (2018). “Theoretical Impediments to Machine Learning with Seven Sparks from the Causal Revolution.” arXiv. Pearl, J. (2018). “A Personal Journey into Bayesian Networks.” UCLA Cognitive Systems Laboratory. Pearl, J. (2023). 
“Judea Pearl, AI, and Causality: What Role Do Statisticians Play?” Amstat News.]]></summary></entry><entry><title type="html">Someone Said This Would Happen</title><link href="https://a-history-of-ai.github.io/blog/Dark-Ages/" rel="alternate" type="text/html" title="Someone Said This Would Happen" /><published>2026-03-08T00:00:00+00:00</published><updated>2026-03-08T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Dark-Ages</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Dark-Ages/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/AIvsES.png" alt="Google Ngram chart of AI and expert systems mentions 1950-2022" height="300" />
    <figcaption style="font-size: 10px;">Mentions of 'expert systems' and 'artificial intelligence' in books, 1950–2022. Google Ngram Viewer.</figcaption>
</figure>

<p>Here is the opening from a session called “The Dark Ages of AI,” at the world’s leading AI conference.</p>

<p><em>“In spite of all the commercial hustle and bustle around AI these days, there’s a mood that I’m sure many of you are familiar with of deep unease among AI researchers who have been around more than the last four years or so. This unease is due to the worry that perhaps expectations about AI are too high, and that this will eventually result in disaster.”</em></p>

<p>The year was 1984, at the height of the expert systems boom, when companies were standing up AI groups staffed by people who had read one book and attended a one-day tutorial. Drew McDermott asked the room to imagine the worst case: in five years’ time, the big strategic bets have gone nowhere, the startups have all failed, and everybody has hurriedly changed the names of their research projects to something else.</p>

<p>Eerily, the future unfolded almost exactly like that, though it took three years, not five. Expert systems had a structural problem the boom years had papered over. They were only as good as the rules you could extract from experts, and experts struggled to articulate what they actually knew. Systems were brittle outside their narrow domains, and the more rules you added, the harder they became to maintain. XCON, the system that had saved DEC $25 million a year, now required 59 technical staff to maintain it and still couldn’t keep pace with the ever-changing product line.</p>

<p>When cheaper hardware arrived, the economics collapsed. Sun workstations undercut LISP machines on price and Symbolics went bankrupt. Government programmes wound down, funding dried up, and companies across the industry quietly shut their AI groups. The word itself became professionally toxic. Usama Fayyad, finishing his PhD in AI in 1991, later recalled that no company would hire anyone who worked in the field.</p>

<p>What looked like a collapse was closer to a dispersal. The researchers scattered into adjacent fields, took their ideas with them, and kept working under names that didn’t attract attention or scepticism. The foundations of everything we now call AI were laid during a period when nobody wanted to fund it. In 2002, Brooks noted: “There’s this stupid myth that AI has failed, but AI is all around you all the time.”</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>McDermott, D., Waldrop, M., Schank, R., Chandrasekaran, B., &amp; McDermott, J. (1985). <a href="https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/494">“The Dark Ages of AI: A Panel Discussion at AAAI-84.”</a></li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Mentions of 'expert systems' and 'artificial intelligence' in books, 1950–2022. Google Ngram Viewer. Here is the opening from a session called “The Dark Ages of AI,” at the world’s leading AI conference. “In spite of all the commercial hustle and bustle around AI these days, there’s a mood that I’m sure many of you are familiar with of deep unease among AI researchers who have been around more than the last four years or so. This unease is due to the worry that perhaps expectations about AI are too high, and that this will eventually result in disaster.” The year was 1984, at the height of the expert systems boom, when companies were standing up AI groups staffed by people who had read one book and attended a one-day tutorial. Drew McDermott asked the room to imagine the worst case. In five years time, imagine if the big strategic bets went nowhere, the startups all failed, and everybody hurriedly changed the names of their research projects to something else. Eerily, the future unfolded almost exactly like that, though it took three years, not five. Expert systems had a structural problem the boom years had papered over. They were only as good as the rules you could extract from experts, and experts struggled to articulate what they actually knew. Systems were brittle outside their narrow domains, and the more rules you added, the harder they became to maintain. XCON, the system that had saved DEC $25 million a year, now required 59 technical staff to maintain it and still couldn’t keep pace with the ever-changing product line. When cheaper hardware arrived, the economics collapsed. Sun workstations undercut LISP machines on price and Symbolics went bankrupt. Government programmes wound down, funding dried up, and companies across the industry quietly shut their AI groups. The word itself became professionally toxic. 
Usama Fayyad, finishing his PhD in AI in 1991, later recalled that no company would hire anyone who worked in the field. What looked like a collapse was closer to a dispersal. The researchers scattered into adjacent fields, took their ideas with them, and kept working under names that didn’t attract attention or scepticism. The foundations of everything we now call AI were laid during a period when nobody wanted to fund it. In 2002, Brooks noted: “There’s this stupid myth that AI has failed, but AI is all around you all the time.” Further Reading McDermott, D., Waldrop, M., Schank, R., Chandrasekaran, B., &amp; McDermott, J. (1985). “The Dark Ages of AI: A Panel Discussion at AAAI-84.”]]></summary></entry><entry><title type="html">Expert Systems: When Knowledge Became the Product</title><link href="https://a-history-of-ai.github.io/blog/Expert-Systems/" rel="alternate" type="text/html" title="Expert Systems: When Knowledge Became the Product" /><published>2026-03-04T00:00:00+00:00</published><updated>2026-03-04T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Expert-Systems</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Expert-Systems/"><![CDATA[<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Symbolics3640_Modified.JPG" alt="Expert systems console" />
    <figcaption style="font-size: 10px;">Expert systems console, Wikimedia Commons</figcaption>
</figure>

<p><em>“Knowledge is the only instrument of production that is not subject to diminishing returns.”</em> — J.M. Clark, 1923</p>

<p>Your best salesperson just resigned. Twenty years of knowing which people matter, who to speak to when, and how decisions actually get made. None of that knowledge is in the CRM, so when they leave, it goes with them.</p>

<p>Every organisation carries knowledge it can’t locate. It lives in the people who’ve been there longest, embedded in the culture and inherited processes. It isn’t written down because no one can tell you exactly what it is.</p>

<p>In 1965, Edward Feigenbaum, a computer scientist at Stanford, thought intelligence required knowledge. Within a decade, leading researchers were publishing findings using a system he designed, without thinking twice about the AI behind it.</p>

<p>Early AI could follow rules and search, but it had little knowledge of the world it was operating in. It failed at everyday problems for exactly that reason. To test the idea, Feigenbaum sat down with Nobel Laureate Joshua Lederberg and started asking questions. How do you identify an unknown molecule from a mass spectrometry output? What do you look for? How do you rule things out? The sessions went on for months, with Feigenbaum diligently capturing how an expert thinks.</p>

<p>The program they built, DENDRAL, did something no AI had done before. DENDRAL knew one thing: chemistry. It analysed mass spectrometry data and determined molecular structures by capturing specific rules about bond energies, fragmentation patterns, and molecular stability, drawn from Lederberg’s decades at the bench. Once captured, the knowledge outlasted the conversation that produced it and was now available for others to use.</p>

<p>Knowledge was the engine, Feigenbaum concluded, and reasoning was just the vehicle. Over the next decade, systems were built that made the case, including:</p>

<ul>
  <li>
    <p><strong>MYCIN</strong> diagnosed bacterial blood infections and recommended antibiotics. It held the knowledge of infectious disease specialists, encoded as several hundred if-then rules weighted by confidence. When tested head-to-head against specialists at Stanford Medical School in a blinded evaluation, it matched the performance of the most experienced physicians.</p>
  </li>
  <li>
    <p><strong>PROSPECTOR</strong> encoded the expertise of economic geologists to evaluate mineral deposits. In 1980 it predicted that a site in Washington state contained a molybdenum deposit worth over a hundred million dollars. The geologists had surveyed the area and missed it.</p>
  </li>
  <li>
    <p><strong>R1, later renamed XCON</strong>, was built for Digital Equipment Corporation to configure customer computer orders. Every order was different, and getting the configuration right was a job only a handful of engineers could do. R1 encoded their expertise as production rules. By 1986 it was processing 80,000 orders a year with 95 to 98 percent accuracy, saving DEC an estimated $25 million annually.</p>
  </li>
</ul>
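<p>MYCIN’s style of rule can be caricatured in a few lines. The rules and certainty factors below are invented; only the formula for combining two positive certainty factors follows MYCIN’s actual scheme:</p>

```python
# A toy MYCIN-style rule base. Each rule: (required findings, conclusion, CF).
rules = [
    ({"gram_negative", "rod_shaped"}, "e_coli", 0.6),
    ({"gram_negative", "hospital_acquired"}, "e_coli", 0.4),
]

def diagnose(findings):
    """Fire every rule whose conditions hold; combine positive certainty
    factors with MYCIN's rule: cf = cf1 + cf2 * (1 - cf1)."""
    combined = {}
    for conditions, conclusion, cf in rules:
        if conditions <= findings:  # all required findings present
            prev = combined.get(conclusion, 0.0)
            combined[conclusion] = prev + cf * (1 - prev)
    return combined

print(diagnose({"gram_negative", "rod_shaped", "hospital_acquired"}))
# both rules fire: 0.6 + 0.4 * (1 - 0.6) = 0.76
```

<p>Two weak pieces of evidence reinforce each other without the combined confidence ever exceeding 1, which is exactly the behaviour the physicians wanted.</p>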

<p>The idea spread fast. If you could bottle a chemist’s expertise, you could bottle anyone’s. Entire hardware companies — Symbolics, Lisp Machines Incorporated, Texas Instruments — built specialised machines just to run expert systems faster. Universities created knowledge engineering programmes. Consulting firms built practices around eliciting expert knowledge. Companies invested billions in systems that were already working. Hundreds of expert systems were deployed across financial planning, oil drilling and military logistics.</p>

<p>Expert systems collapsed in the late 1980s and the field moved on. Feigenbaum’s intuition about knowledge went with it. Most organisations still carry more expertise than they have ever captured. That problem is still there.</p>

<p>Your best salesperson resigned. The problem is that you never captured what they knew.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Feigenbaum, E. A. (1977). <a href="https://stacks.stanford.edu/file/druid:bg342cm2034/bg342cm2034.pdf">“The Art of Artificial Intelligence: Themes and Case Studies of Knowledge Engineering.”</a></p>
  </li>
  <li>
    <p>Feigenbaum, E. A. (1980). <a href="https://stacks.stanford.edu/file/druid:vf069sz9374/vf069sz9374.pdf">“Expert Systems in the 1980s.”</a></p>
  </li>
  <li>
    <p>Yu, V. L., et al. (1984). <a href="https://people.dbmi.columbia.edu/~ehs7001/Buchanan-Shortliffe-1984/Chapter-31.pdf">“An Evaluation of MYCIN’s Advice.”</a></p>
  </li>
  <li>
    <p>ACM. (1994). <a href="https://amturing.acm.org/award_winners/feigenbaum_4167235.cfm">“Edward A. Feigenbaum — A.M. Turing Award Laureate.”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Expert systems console, Wikimedia Commons “Knowledge is the only instrument of production that is not subject to diminishing returns.” — J.M. Clark, 1923 Your best salesperson just resigned. Twenty years of knowing which people matter, who to speak to when, and how decisions actually get made. None of that knowledge is in the CRM, so when they leave, it goes with them. Every organisation carries knowledge it can’t locate. It lives in the people who’ve been there longest, embedded in the culture and inherited processes. It isn’t written down because no one can tell you exactly what it is. In 1965, Edward Feigenbaum, a computer scientist at Stanford, thought intelligence required knowledge. Within a decade, leading researchers were publishing findings using a system he designed, without thinking twice about the AI behind it.]]></summary></entry><entry><title type="html">AlexNet: When Deep Learning Became AI</title><link href="https://a-history-of-ai.github.io/blog/AlexNet/" rel="alternate" type="text/html" title="AlexNet: When Deep Learning Became AI" /><published>2026-02-22T00:00:00+00:00</published><updated>2026-02-22T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/AlexNet</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/AlexNet/"><![CDATA[<figure>
    <img src="https://viso.ai/wp-content/uploads/2024/04/image-classsification.jpg" alt="ImageNet test images" />
    <figcaption style="font-size: 10px;">ImageNet test images </figcaption>
</figure>

<p><em>“That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time.”</em> — Fei-Fei Li, 2024</p>

<p>In 2012, a graduate student trained a neural network in his bedroom on two gaming GPUs. It beat every major AI lab in the world.</p>

<p>The competition was the ImageNet Large Scale Visual Recognition Challenge. AlexNet, the network he built, won by an unprecedented 10.8 percentage points. No other result in the competition’s history came close. Every other team used hand-engineered features fed into traditional classifiers. AlexNet learned its own features from the data.</p>

<p>Alex Krizhevsky built it. He was a graduate student at the University of Toronto, working under Geoffrey Hinton. Ilya Sutskever, another of Hinton’s students, recognized that Krizhevsky’s GPU code could tackle ImageNet. Between the three of them, they beat every major lab in the world.</p>

<p>The breakthrough happened because three conditions aligned for the first time.</p>

<ul>
  <li>
    <p><strong>Data.</strong> By 2012, Fei-Fei Li’s ImageNet project had assembled over a million labelled images across a thousand categories. Data at this scale was needed to make training deep networks possible. Li had spent years building this dataset, hiring workers through Amazon Mechanical Turk to label millions of images by hand. Most of the field thought it was wasted effort.</p>
  </li>
  <li>
    <p><strong>Compute.</strong> GPUs were designed to render graphics fast enough for gaming. In 2007, NVIDIA made them programmable for general computation. Krizhevsky saw the opportunity and wrote custom code that mapped neural network training onto GPU architecture. Each card had only 3GB of memory, so he split the network across two GPUs and designed communication between them at specific layers. He made it work on consumer hardware that cost $500.</p>
  </li>
  <li>
<p><strong>Algorithm.</strong> Backpropagation had existed since 1986, so why hadn’t anyone built this before? During training, error signals pass backward through the network layer by layer. With the activation functions used at the time, those signals shrank at every layer; by the time they reached the early layers, they had vanished, and deep networks were impractical. AlexNet used the ReLU activation, which passes positive inputs through and zeroes out negative ones. The gradient no longer shrank at each layer and training ran about six times faster. It also used dropout, randomly switching off half the neurons during each training pass, which forced the network to learn robust patterns rather than rely on any single pathway.</p>
  </li>
</ul>
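<p>Both fixes are simple enough to sketch directly in numpy (an illustration of the two ideas, not AlexNet’s actual code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Why deep sigmoid nets stalled: the logistic sigmoid's derivative is at most
# 0.25, so a gradient passed back through 8 layers shrinks by up to 0.25 ** 8.
sigmoid_grad_bound = 0.25 ** 8
print(f"worst-case sigmoid shrinkage over 8 layers: {sigmoid_grad_bound:.2e}")

# ReLU: pass positives through, zero out negatives. Its gradient is exactly 1
# wherever a unit is active, so error signals survive arbitrary depth.
def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))       # negatives zeroed, positives unchanged
print(relu_grad(x))  # gradient is 0 or 1, never a shrinking fraction

# Dropout: randomly silence a fraction p of units on each training pass,
# rescaling the survivors so the expected activation is unchanged.
def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1 - p)

print(dropout(np.ones(10)))  # roughly half zeroed, survivors scaled up
```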

<p>Winning by 10.8 percentage points was impossible to ignore. Researchers who had spent careers designing handcrafted image descriptors were confronted with a network that learned better representations on its own. Yann LeCun, who had been working on convolutional networks since the 1980s, called AlexNet “an unequivocal turning point in the history of computer vision.” Within two years, every competitor at ImageNet used deep learning.</p>

<p>Hinton, LeCun, and Yoshua Bengio won the Turing Award in 2018. Hinton won the Nobel Prize in Physics in 2024 and later joked about the division of labour: “Ilya thought we should do it, Alex made it work, and I got the Nobel Prize.”</p>

<p>Deep learning started in a bedroom in Toronto, with two graphics cards and a graduate student who made it work. The field shifted fast: within two years, most major labs had reorganised around it. When people said AI, they now meant deep learning.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Krizhevsky, A., Sutskever, I., &amp; Hinton, G. E. (2012). <a href="https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html">“ImageNet Classification with Deep Convolutional Neural Networks.”</a></p>
  </li>
  <li>
    <p>Deng, J., Dong, W., Socher, R., Li, L., Li, K., &amp; Fei-Fei, L. (2009). <a href="https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf">“ImageNet: A Large-Scale Hierarchical Image Database.”</a></p>
  </li>
  <li>
    <p>Quartz (2022). <a href="https://qz.com/1307091/the-inside-story-of-how-ai-got-good-enough-to-dominate-silicon-valley">“The Inside Story of How AI Got Good Enough to Dominate Silicon Valley.”</a></p>
  </li>
  <li>
    <p>Pinecone (2023). <a href="https://www.pinecone.io/learn/series/image-search/imagenet/">“AlexNet and ImageNet: The Birth of Deep Learning.”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[ImageNet test images “That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time.” — Fei-Fei Li, 2024 In 2012, a graduate student trained a neural network in his bedroom on two gaming GPUs. It beat every major AI lab in the world.]]></summary></entry><entry><title type="html">SVMs: Practical Theory</title><link href="https://a-history-of-ai.github.io/blog/SVMs/" rel="alternate" type="text/html" title="SVMs: Practical Theory" /><published>2026-02-15T00:00:00+00:00</published><updated>2026-02-15T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/SVMs</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/SVMs/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/Fig4.jpg" alt="Results from 1992 paper" />
    <figcaption style="font-size: 10px;">Original results from 1992 paper</figcaption>
</figure>

<p><em>“Nothing is more practical than a good theory.”</em> — Vladimir Vapnik</p>

<p>Vladimir Vapnik arrived at Bell Labs from Moscow in the early 1990s already in his 50s. He brought three decades of statistical learning theory the Western world had never seen. From 1961 to 1990, he had worked on one question. Under what conditions can you guarantee a learning algorithm generalises from training data? Mathematics that the Cold War had kept invisible.</p>

<p>His timing was perfect. Machine learning systems were hitting production and failing spectacularly. Models that worked in the lab struggled in real-world applications, and training accuracy alone said nothing about how a system would behave on unseen data. Companies needed to know whether their systems would hold up, and Vapnik had spent thirty years proving exactly when you could make that promise.</p>

<p>In 1995, Vapnik and Corinna Cortes published the paper that changed machine learning. The insight was to maximise the margin between classes when drawing decision boundaries. Wider margins meant better generalisation. The mathematics let you calculate performance bounds before deploying the model.</p>

<p>Guarantees matter. Deny someone a loan and you have to explain why. Gene sequencing data cost thousands of dollars per sample. The mathematics of SVMs guaranteed how well the system would generalise from the training data to new cases.</p>

<p>The kernel trick made it work. Everyone assumed that separating data in high dimensions demanded supercomputers: text documents generated thousands of features, and gene expression arrays carried tens of thousands of measurements. The kernel trick proved them wrong. Compute in low dimensions; make decisions as if you were working in high dimensions. Vapnik initially resisted, because the trick came from his rivals in Moscow, but his colleagues tried it anyway, and it worked. This was the breakthrough that made SVMs practical.</p>
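<p>The trick rests on an identity. For a degree-two polynomial kernel, the kernel value computed in the original low-dimensional space equals an ordinary dot product in an explicit higher-dimensional feature space. A numpy check of that identity (a sketch, not production SVM code):</p>

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map: lifts a 2-D point into 3 dimensions."""
    x1, x2 = v
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in 2 dimensions."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

print(poly_kernel(x, y))       # the cheap 2-D computation
print(np.dot(phi(x), phi(y)))  # the same number, via the 3-D feature space
```

<p>An SVM only ever needs these dot products, so it can draw a linear boundary in the lifted space without ever constructing it.</p>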

<p>SVMs took over, and by the early 2000s they were being used to tackle countless real-world problems. Banks used them for credit scoring. Pharmaceutical companies designed drug discovery pipelines around them. Email providers deployed them to filter spam. Performance guarantees you could calculate in advance were a remarkable innovation: SVMs provided decisions you could explain to regulators, were computationally efficient, and worked when data was scarce.</p>

<p>Vapnik created a profession. Before SVMs, you built systems and hoped they worked. After SVMs, you could calculate bounds on performance before building anything. Data science had arrived. You could hire mathematicians and statisticians and teach machine learning as a discipline.</p>

<p>Every data scientist now learns what Vapnik spent his career developing. SVMs still power fraud detection, spam filters, medical diagnostics. They stopped being AI the moment they became essential.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Boser, B.E., Guyon, I., &amp; Vapnik, V.N. (1992). <a href="https://dl.acm.org/doi/epdf/10.1145/130385.130401">“A Training Algorithm for Optimal Margin Classifiers.”</a> COLT ‘92.</p>
  </li>
  <li>
    <p>Cortes, C. &amp; Vapnik, V. (1995). <a href="https://link.springer.com/article/10.1007/BF00994018">“Support-Vector Networks.”</a> Machine Learning, 20(3), 273-297.</p>
  </li>
  <li>
    <p>Law, H. (2025). <a href="https://www.learningfromexamples.com/p/bell-labs-last-trick">“Bell Labs’ last trick: Support Vector Machines.”</a> Learning From Examples.</p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Original results from 1992 paper “Nothing is more practical than a good theory.” — Vladimir Vapnik Vladimir Vapnik arrived at Bell Labs from Moscow in the early 1990s already in his 50s. He brought three decades of statistical learning theory the Western world had never seen. From 1961 to 1990, he had worked on one question. Under what conditions can you guarantee a learning algorithm generalises from training data? Mathematics that the Cold War had kept invisible.]]></summary></entry><entry><title type="html">Case-Based Reasoning: Intelligence Needs Memory</title><link href="https://a-history-of-ai.github.io/blog/CBR/" rel="alternate" type="text/html" title="Case-Based Reasoning: Intelligence Needs Memory" /><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/CBR</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/CBR/"><![CDATA[<figure>
    <img src="https://upload.wikimedia.org/wikipedia/commons/0/0f/2011_Library_of_Congress_USA_5466788868_card_catalog.jpg" alt="Library of Congress card catalog" />
    <figcaption style="font-size: 10px;">Image: Ted Eytan, CC BY-SA 2.0, via Wikimedia Commons</figcaption>
</figure>

<p><em>“Artificial intelligence must be based on real human intelligence, which consists largely of applying old situations—and our narratives of them—to new situations.”</em> — Roger Schank</p>

<hr />

<p>Intelligence requires memory. You cannot expect machines to learn without remembering what worked. Everyone building enterprise AI is grappling with the same problem. How do we give AI systems persistent memory without the bloat problem?</p>

<p>AI researchers have been working on this problem for decades, building commercially successful systems that got better every time they were used. Have we forgotten these approaches?</p>

<p>Roger Schank’s core insight that started it all was that intelligence builds from lived experience. We learn from specific experiences, what happened, what worked, what failed. This was the 1970s and Schank was proposing something radical. Why encode general rules when you could learn from what actually happened?</p>

<p>Schank’s dynamic memory idea spread and the field of Case-Based Reasoning was born. In 1980, Janet Kolodner built CYRUS, a system that stored episodic memories of diplomatic meetings. She showed that specific experiences beat general rules.</p>

<p>The field formalized around what became known as the 4 Rs: Retrieve similar past cases. Reuse what worked before. Revise the solution for the new situation. Retain the experience for future use. Sound familiar? It should. Every RAG system follows this exact pattern. Take a help desk system for example. First, we Retrieve similar past support tickets from a vector database using semantic matching. Then we Reuse and Revise by adapting the previous fix through an LLM prompt. Finally, we Retain by embedding the case back into the vector database for next time.</p>
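<p>The help-desk loop above can be sketched in a few lines. This is a toy: word-overlap (Jaccard) similarity stands in for a vector database, and simple string templating stands in for the LLM's adaptation step. The case base and ticket text are invented for illustration.</p>

```python
# The 4 Rs as a toy help-desk loop: Retrieve, Reuse, Revise, Retain.
case_base = [
    {"problem": "printer not responding over network", "fix": "restart the print spooler"},
    {"problem": "email client rejects password", "fix": "reset the account password"},
]

def similarity(a, b):
    """Jaccard word overlap -- a stand-in for vector-database semantic matching."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def solve(ticket):
    # Retrieve: find the most similar past case.
    best = max(case_base, key=lambda c: similarity(c["problem"], ticket))
    # Reuse + Revise: adapt the old fix to the new ticket (an LLM would do this).
    proposal = f"For '{ticket}': try '{best['fix']}' (worked for '{best['problem']}')"
    # Retain: store the new experience for next time.
    case_base.append({"problem": ticket, "fix": best["fix"]})
    return proposal

proposal = solve("printer not printing over the network")
print(proposal)
```

<p>Each call grows the case base, which is exactly why the bloat problem discussed next matters.</p>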

<p>But memory creates a problem. How do you prevent bloat? As the case base grows, retrieval slows. Worse, redundant cases add noise without adding capability. In 1995, Smyth and Keane published “Remembering To Forget”, which solved the bloat problem. They devised a competence-preserving deletion policy that keeps useful memories and forgets the rest. The problem described in that paper reads like it was written in 2025, not 30 years earlier.</p>

<p>Case-Based Reasoning worked commercially. GE invested in a customer support system that preserved institutional knowledge that would have walked out the door when expert mechanics retired. Lockheed used the techniques to successfully configure aircraft manufacturing layouts from past projects. These systems were good long term investments and got better with use because they remembered what worked.</p>

<p>In July 2024, a Case-Based Reasoning conference highlighted how existing methods outperformed LLMs on structured problems and how hybrid approaches could substantially reduce hallucinations. Decades old research suddenly became relevant again. Following the conference, Ian Watson, who has studied AI memory since the 1980s, issued a challenge. Case-Based Reasoning should step up and influence how modern AI handles memory.</p>

<p>Maybe the rest of us building with modern AI should also step up. Let’s Retrieve what has already been discovered then Reuse and Revise it, Retaining what works.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Kolodner, J.L. (1983). <a href="https://onlinelibrary.wiley.com/doi/abs/10.1207/s15516709cog0704_2">“Reconstructive Memory: A Computer Model.” Cognitive Science, 7(4), 283-304.</a></p>
  </li>
  <li>
    <p>Aamodt, A. &amp; Plaza, E. (1994). <a href="https://www.iiia.csic.es/~enric/papers/AICom.pdf">“Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches”</a></p>
  </li>
  <li>
    <p>Smyth, B. &amp; Keane, M.T. (1995). <a href="https://www.researchgate.net/publication/2500869_Remembering_To_Forget_A_Competence-Preserving_Case_Deletion_Policy_for_Case-Based_Reasoning_Systems">“Remembering To Forget: A Competence-Preserving Case Deletion Policy for Case-Based Reasoning Systems”</a></p>
  </li>
  <li>
    <p>Cheetham, W. (2006). <a href="https://www.semanticscholar.org/paper/Lessons-Learned-using-CBR-for-Customer-Support-Cheetham/e8e860082b9f10e47e5e791828223d5000d817c4">“Case-Based Reasoning for General Electric Appliance Customer Support.” AAAI 2006.</a></p>
  </li>
  <li>
    <p>Watson, I. (2024). <a href="https://arxiv.org/abs/2310.08842">“A Case-Based Persistent Memory for a Large Language Model”</a></p>
  </li>
  <li>
    <p>Bach, K., Bergmann, R., Brand, F., Watson, I., Wilkerson, K., Wiratunga, N. et al. (2025). <a href="https://hal.science/hal-05006761v1/document">“Case-Based Reasoning Meets Large Language Models: A Research Manifesto”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Image: Ted Eytan, CC BY-SA 2.0, via Wikimedia Commons “Artificial intelligence must be based on real human intelligence, which consists largely of applying old situations—and our narratives of them—to new situations.” — Roger Schank Intelligence requires memory. You cannot expect machines to learn without remembering what worked. Everyone building enterprise AI is grappling with the same problem. How do we give AI systems persistent memory without the bloat problem?]]></summary></entry><entry><title type="html">Frames: Representing Stereotyped Situations</title><link href="https://a-history-of-ai.github.io/blog/Frames/" rel="alternate" type="text/html" title="Frames: Representing Stereotyped Situations" /><published>2026-01-25T00:00:00+00:00</published><updated>2026-01-25T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Frames</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Frames/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/NullStern.jpg" alt="The Null Stern hotel" />
    <figcaption style="font-size: 10px;">The Null Stern hotel</figcaption>
</figure>

<p><em>“When one encounters a new situation one selects from memory a structure called a Frame. This is a remembered framework to be adapted to fit reality by changing details as necessary.”</em> - Marvin Minsky, 1974</p>

<hr />

<p>Building enterprise AI means teaching an LLM the messy reality of your business. You need to explain standard contracts but also the edge cases. You need to describe typical customers and the exceptions that break the pattern. You need to capture default processes and how they vary by region. In 1974, Marvin Minsky wrote an essay about exactly this problem. His ideas shaped knowledge representation and drove the expert systems boom of the 1980s.</p>

<p>Minsky argued that minds work from templates. You walk into a room and before you consciously process what you see, you already expect walls, ceiling, floor, probably a door, maybe windows. When something violates those expectations, you notice immediately. A room with no ceiling feels wrong because you were working from a remembered framework.</p>

<p>Minsky proposed that this is how we handle new situations. You don’t build understanding from scratch but select a remembered framework and adapt it. His insight was that knowledge about stereotyped situations should be represented as structured templates he called frames. A frame has two kinds of knowledge: the top levels are fixed and represent things always true about the situation, while the lower levels have slots that must be filled by specific instances.</p>

<p>Consider how this works in practice. A sales deal frame captures what is always true: deals have customers, have values, and require approval. But it also has slots for details that vary: deal size, discount justification, escalation path, competitive context. Each slot can have default values, so deals assume standard pricing unless you specify otherwise. Slots can have constraints that define valid values. The deal size slot expects small, medium, large, or strategic.</p>
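<p>The sales deal frame above can be sketched directly. This is a minimal illustration with invented slot names: fixed structure in the frame itself, slots filled per instance, defaults for what goes unspecified, and constraints on valid values.</p>

```python
# A minimal frame: defaults fill unspecified slots, constraints reject bad values.
class Frame:
    def __init__(self, defaults, constraints):
        self.defaults = defaults          # slot -> default value
        self.constraints = constraints    # slot -> set of allowed values

    def instantiate(self, **slots):
        filled = {**self.defaults, **slots}   # unfilled slots fall back to defaults
        for slot, value in filled.items():
            allowed = self.constraints.get(slot)
            if allowed is not None and value not in allowed:
                raise ValueError(f"{slot}={value!r} violates frame constraint")
        return filled

sales_deal = Frame(
    defaults={"pricing": "standard", "escalation_path": "sales manager"},
    constraints={"deal_size": {"small", "medium", "large", "strategic"}},
)

deal = sales_deal.instantiate(customer="Acme", deal_size="large")
print(deal["pricing"])  # the default kicks in: 'standard'
```

<p>Passing <code>deal_size="huge"</code> would raise an error, because the slot only accepts the four sizes the frame declares.</p>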

<p>Frames addressed fundamental limitations of semantic networks, the prevailing approach at the time. Semantic networks represented knowledge as graphs where nodes were concepts and edges were relationships. Simple and visual, but they had problems frames could solve:</p>

<ul>
  <li>
<p><strong>Defaults and exceptions.</strong> Semantic networks couldn’t handle typical cases with exceptions. “A bird can fly, Tweety is a bird, so Tweety can fly” works until you encounter a penguin. Frames solved this through inheritance, where birds fly by default but penguins override this.</p>
  </li>
  <li>
    <p><strong>Context sensitivity.</strong> Semantic networks struggled with words that change meaning based on context. The meaning of “bank” depends on whether you are discussing rivers or money. Frames handled this by activating different templates depending on context.</p>
  </li>
  <li>
    <p><strong>Typical values.</strong> Semantic networks represented what was explicitly stated and nothing more. Frames provided slots with default values that could be filled or overridden. This allowed reasoning about incomplete information without requiring every detail to be specified.</p>
  </li>
</ul>

<p>Frames enabled the expert systems boom of the 1980s, where knowledge engineers captured diseases, symptoms, and equipment configurations as structured templates. Software engineers arrived at similar insights with object-oriented programming. The pattern Minsky proposed for knowledge representation became the pattern for organizing all code.</p>

<p>If you are building enterprise AI today, you are likely making use of Minsky’s ideas. When you write a JSON schema with required fields and constraints, you are building a frame. If you are defining stereotyped situations in your business and specifying how to handle their exceptions, Minsky’s essay is well worth a read.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>Minsky, M. (1974). <a href="https://courses.media.mit.edu/2004spring/mas966/Minsky%201974%20Framework%20for%20knowledge.pdf">“A Framework for Representing Knowledge.” MIT AI Laboratory Memo 306.</a></li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[The Null Stern hotel “When one encounters a new situation one selects from memory a structure called a Frame. This is a remembered framework to be adapted to fit reality by changing details as necessary.” - Marvin Minsky, 1974 Building enterprise AI means teaching an LLM the messy reality of your business. You need to explain standard contracts but also the edge cases. You need to describe typical customers and the exceptions that break the pattern. You need to capture default processes and how they vary by region. In 1974, Marvin Minsky wrote an essay about exactly this problem. His ideas shaped knowledge representation and drove the expert systems boom of the 1980s.]]></summary></entry><entry><title type="html">AI: Still Searching</title><link href="https://a-history-of-ai.github.io/blog/AI-Still-Searching/" rel="alternate" type="text/html" title="AI: Still Searching" /><published>2026-01-19T00:00:00+00:00</published><updated>2026-01-19T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/AI-Still-Searching</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/AI-Still-Searching/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/hanoig.jpg" alt="The Towers of Hanoi illustrated in La Nature" />
    <figcaption style="font-size: 10px;">The Towers of Hanoi illustrated in La Nature</figcaption>
</figure>

<p><em>“Physical symbol systems are capable of intelligent action, and search is the essence of heuristic problem solving.”</em> — Allen Newell &amp; Herbert Simon, 1976</p>

<hr />

<p>Problem solving involves considering different options, breaking things down, searching for a solution. Playing chess you explore possible moves, consider what your opponent may do in reaction, and weigh alternatives as you go. It quickly becomes apparent that you can’t look at everything, so you focus attention and search intelligently. You rely on patterns and tactics picked up along the way.</p>

<p>Alongside knowledge representation, replicating intelligent search became the focus of early AI efforts. When we left Simon and Newell after Logic Theorist, they had recognized the core problem: you cannot search everything, so you must search intelligently. Inspired by George Polya’s work on problem solving, they built the General Problem Solver, demonstrating two important innovations.</p>

<p>The first was means-ends analysis, a framework for breaking problems down:</p>

<ul>
  <li><strong>Identify the difference</strong> between where you are and where you want to be.</li>
  <li><strong>Select a move</strong> that reduces that difference.</li>
  <li><strong>If the move has preconditions</strong> you cannot yet meet, those become subgoals.</li>
  <li><strong>Repeat recursively</strong> until you reach something you can solve directly.</li>
</ul>

<p>Breaking big problems into smaller ones seems obvious now, but Simon and Newell were the first to demonstrate it could work as an algorithm.</p>
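<p>The steps above can be sketched as a tiny recursive solver. The errand domain here is invented for illustration: each operator lists preconditions, and any unmet precondition becomes a subgoal, exactly as in means-ends analysis.</p>

```python
# Means-ends analysis on a toy domain: unmet preconditions become subgoals.
operators = {
    "at_airport":   {"needs": ["in_car"], "action": "drive to airport"},
    "in_car":       {"needs": ["car_unlocked"], "action": "get in the car"},
    "car_unlocked": {"needs": [], "action": "unlock the car"},
}

def achieve(goal, state, plan):
    if goal in state:                 # no difference left to reduce
        return
    op = operators[goal]              # select a move that reduces the difference
    for pre in op["needs"]:           # unmet preconditions become subgoals
        achieve(pre, state, plan)
    plan.append(op["action"])         # apply the move
    state.add(goal)

plan = []
achieve("at_airport", {"at_home"}, plan)
print(plan)  # ['unlock the car', 'get in the car', 'drive to airport']
```

<p>The recursion bottoms out at an operator with no preconditions, then the plan unwinds in executable order.</p>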

<p>The second was the use of heuristics to guide the search. A heuristic is a rule of thumb, a shortcut that helps focus attention on promising directions without guaranteeing success. GPS measured the differences between the current state and the goal and tried moves that reduced the biggest differences first.</p>

<p>GPS tackled well-defined problems like the Tower of Hanoi with ease. But not all problems are as easy to define, or gaps as easy to measure. The General Problem Solver couldn’t handle general problems.</p>

<p>Search became one of AI’s defining problems, alongside knowledge representation. McCarthy had asked how you represent what you know. GPS showed that even with perfect knowledge representation, you still need to search efficiently through possibilities. The two problems were intertwined. You cannot reason your way to answers without understanding what you are reasoning about. You cannot search infinite possibility spaces without guidance. Together, these challenges shaped AI research for decades.</p>

<p>The challenge of intelligent search remains. Out of the box, modern reasoning models still struggle with multi-step planning tasks over long horizons. Anyone successfully building with AI today is breaking problems into subtasks and routing them to specialised agents. Sound familiar?</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Newell, A. (1981). <a href="https://stacks.stanford.edu/file/druid:gj292yv5713/gj292yv5713.pdf">“The Heuristic of George Polya and Its Relation to Artificial Intelligence.”</a></p>
  </li>
  <li>
    <p>Newell, A., &amp; Simon, H. A. (1961). <a href="https://iiif.library.cmu.edu/file/Simon_box00064_fld04907_bdl0001_doc0001/Simon_box00064_fld04907_bdl0001_doc0001.pdf">“GPS: A Program that Simulates Human Thought.” In Lernende Automaten, pp. 109-124.</a></p>
  </li>
  <li>
    <p>Zhang, A. L., Kraska, T., &amp; Khattab, O. (2025). <a href="https://arxiv.org/pdf/2512.24601">“Recursive Language Models.”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[The Towers of Hanoi illustrated in La Nature “Physical symbol systems are capable of intelligent action, and search is the essence of heuristic problem solving.” — Allen Newell &amp; Herbert Simon, 1976 Problem solving involves considering different options, breaking things down, searching for a solution. Playing chess you explore options, consider possible moves, what your opponent may do in reaction, weighing up options along the way. It quickly becomes apparent that you can’t look at all options, so you focus attention and search intelligently. You rely on patterns and tactics picked up along the way. Alongside knowledge representation, replicating intelligent search became the focus of early AI efforts. When we left Simon and Newell after Logic Theorist, they had recognized the core problem: you cannot search everything, so you must search intelligently. Inspired by George Polya’s work on problem solving, they built the General Problem Solver, demonstrating two important innovations.]]></summary></entry><entry><title type="html">The Question That Defined a Generation</title><link href="https://a-history-of-ai.github.io/blog/Question-That-Defined-A-Generation/" rel="alternate" type="text/html" title="The Question That Defined a Generation" /><published>2026-01-09T00:00:00+00:00</published><updated>2026-01-09T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Question-That-Defined-A-Generation</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Question-That-Defined-A-Generation/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/McCarthy.jpg" alt="John McCarthy" />
    <figcaption style="font-size: 10px;">John McCarthy</figcaption>
</figure>

<p><em>“The central problem of artificial intelligence involves how to express the knowledge about the world that is necessary for intelligent behavior.”</em> — John McCarthy</p>

<hr />

<p>Arguably no one has had such a long lasting impact on AI as John McCarthy. He shaped the field in ways few others did and his contributions cast a long shadow for decades. He gave the field its name, posed central questions and introduced the programming language that made exploring them possible. His key insight was to consider how knowledge should be represented.</p>

<p>McCarthy was there from the very start. He was one of the organizers of the Dartmouth conference in the summer of 1956 that gave the field its name. Naming mattered. It declared that machine intelligence was a legitimate research program, not science fiction.</p>

<p>In 1958, McCarthy described the Advice Taker. You would tell this machine facts using formal logic, and it would reason about them. Ask how to get from your house to the airport, and it would figure out cars, roads, and what makes something drivable. The Advice Taker was never built, because the technology wasn’t ready. Computers in 1958 had tiny memories and no interactive interfaces.</p>

<p>But in describing what such a program would need, McCarthy posed questions he couldn’t solve. Three of these problems would go on to define the field:</p>

<ul>
  <li>
    <p><strong>The frame problem.</strong> What stays the same when you paint a room? The color changes but the temperature doesn’t. A machine needs to know which facts remain true after an action. Specifying everything that does not change is infinite. Assuming nothing changes creates paradoxes.</p>
  </li>
  <li>
    <p><strong>The symbol grounding problem.</strong> The symbol “car” means something to humans because we have seen cars, sat in them, driven them. To a computer, it is just a string of letters. How do you connect symbols to what they represent?</p>
  </li>
  <li>
    <p><strong>The knowledge representation problem.</strong> How do you capture what experts know in a form machines can use? Medical diagnosis requires knowing thousands of facts about diseases, symptoms, and treatments. How do you organize this so a program can reason with it?</p>
  </li>
</ul>

<p>McCarthy had not only named the field and posed questions. He built tools to explore them. In 1960, he created LISP, a language where code and data were the same thing. Lists could represent facts, rules, or even other programs. You could even write programs that reasoned about programs. This was radical: code that modified itself, that examined its own logic. LISP became the language of AI, with most major systems for the next thirty years written in it.</p>

<p>McCarthy’s questions proved more durable than any answers found in his lifetime. The frame problem, symbol grounding, and knowledge representation resisted solution for decades, spawning entire subfields and countless PhD theses.</p>

<p>The focus of AI is very different now. Large language models seemed to sidestep these questions by learning patterns from billions of words. But have they? Can a system that cannot update a single fact without retraining truly represent knowledge? Can symbols be grounded without sensory experience? Can statistical patterns spot what doesn’t happen from observing only what does?</p>

<p>McCarthy gave AI its name in 1956. Seventy years later, the field is still wrestling with the problems he posed in 1958.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>McCarthy, J., Minsky, M. L., Rochester, N., &amp; Shannon, C. E. (1955). <a href="http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf">“A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence.”</a></p>
  </li>
  <li>
    <p>McCarthy, J. (1959). <a href="http://www-formal.stanford.edu/jmc/mcc59.html">“Programs with Common Sense.” Proceedings of the Teddington Conference on the Mechanization of Thought Processes.</a></p>
  </li>
  <li>
    <p>McCarthy, J. (1960). <a href="http://www-formal.stanford.edu/jmc/recursive.pdf">“Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.” Communications of the ACM, 3(4), 184-195.</a></p>
  </li>
  <li>
    <p>Mitchell, M. &amp; Krakauer, D. C. (2023). <a href="https://www.pnas.org/doi/10.1073/pnas.2215907120">“The Debate Over Understanding in AI’s Large Language Models.” Proceedings of the National Academy of Sciences.</a></p>
  </li>
  <li>
    <p>Stanford Computer Science. <a href="https://legacy.cs.stanford.edu/memoriam/professor-john-mccarthy">“Professor John McCarthy.”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[John McCarthy “The central problem of artificial intelligence involves how to express the knowledge about the world that is necessary for intelligent behavior.” — John McCarthy Arguably no one has had such a long lasting impact on AI as John McCarthy. He shaped the field in ways few others did and his contributions cast a long shadow for decades. He gave the field its name, posed central questions and introduced the programming language that made exploring them possible. His key insight was to consider how knowledge should be represented.]]></summary></entry><entry><title type="html">Neural Networks: The Wilderness Years</title><link href="https://a-history-of-ai.github.io/blog/Neural-Networks-The-Wilderness-Years/" rel="alternate" type="text/html" title="Neural Networks: The Wilderness Years" /><published>2026-01-05T00:00:00+00:00</published><updated>2026-01-05T00:00:00+00:00</updated><id>https://a-history-of-ai.github.io/blog/Neural-Networks-The-Wilderness-Years</id><content type="html" xml:base="https://a-history-of-ai.github.io/blog/Neural-Networks-The-Wilderness-Years/"><![CDATA[<figure>
    <img src="https://a-history-of-ai.github.io/blog/assets/images/hinton.png" alt="Geoffrey Hinton" />
    <figcaption style="font-size: 10px;">Image: Canadian Institute for Advanced Research / Associated Press</figcaption>
</figure>
<p><em>“Give me another six months and I’ll prove to you that it works.”</em> — Geoffrey Hinton</p>

<p>As we saw in the Perceptron post, by the end of the 1960s funding for neural network research dried up and most researchers moved on to more promising approaches. Most, but not all. A small group of researchers believed.</p>

<p>Geoffrey Hinton was one of them. His PhD supervisor at the University of Edinburgh urged him weekly to abandon neural networks. Hinton refused, insisting he just needed more time.</p>

<p>In the early 1980s, John Hopfield at Caltech published work on networks that could store and recall patterns. Show the network part of a pattern and it could reconstruct the whole. Hopfield’s work reignited interest in what neural networks might achieve.</p>

<p>In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper in Nature describing a new algorithm. Backpropagation solved the problem that had defeated Rosenblatt and showed that you could train networks with multiple layers.</p>

<p>As the name suggests, backpropagation works backward. During training, the network measures output error. Then, working backward layer by layer, the algorithm calculates how much each connection contributed to that error. It adjusts each weight by a small amount to reduce the error. Repeat this across thousands of examples and, if the weights settle down, the network learns. Backpropagation lets networks learn representations at multiple levels of abstraction. Simple features in early layers combine into complex patterns in later layers.</p>
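<p>The loop above can be shown end to end on a tiny 2-2-1 sigmoid network. The weights, input, and target here are invented for illustration; the point is the shape of the algorithm: forward pass, measure error, push blame backward layer by layer, nudge every weight.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One invented training example and hand-picked starting weights.
x, target = [0.05, 0.10], 0.01
w_hidden = [[0.15, 0.20], [0.25, 0.30]]   # two hidden neurons, two inputs each
b_hidden = [0.35, 0.35]
w_out, b_out = [0.40, 0.45], 0.60
lr = 0.5                                  # learning rate: how big each nudge is

errors = []
for step in range(20):
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(w_hidden, b_hidden)]
    out = sigmoid(sum(w * hi for w, hi in zip(w_out, h)) + b_out)
    errors.append(0.5 * (out - target) ** 2)

    # Backward pass: measure output error, then work backward layer by layer.
    delta_out = (out - target) * out * (1 - out)
    delta_h = [delta_out * w_out[i] * h[i] * (1 - h[i]) for i in range(2)]

    # Adjust each weight a little in the direction that reduces the error.
    for i in range(2):
        w_out[i] -= lr * delta_out * h[i]
        for j in range(2):
            w_hidden[i][j] -= lr * delta_h[i] * x[j]
        b_hidden[i] -= lr * delta_h[i]
    b_out -= lr * delta_out

print(f"error: {errors[0]:.4f} -> {errors[-1]:.4f}")  # error shrinks as weights settle
```

<p>The two problems discussed next are visible even here: the hidden-layer signal <code>delta_h</code> is the output signal multiplied by more small factors, which is why deeper networks saw their early layers barely learn.</p>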

<p>Some notable applications emerged. In 1987, Terry Sejnowski and Charles Rosenberg built NETtalk, a neural network that learned to pronounce English text. They trained it on examples and it started like a babbling baby, gradually improving until it could read text aloud. The system learned pronunciation rules without anyone programming them explicitly. This was a demonstration of what backpropagation could do.</p>

<p>In 1989, Yann LeCun at AT&amp;T Bell Labs trained a neural network to recognise handwritten digits. The US Postal Service provided 9,298 scanned postcodes from mail sorting offices. LeCun used 7,291 to train the network and 2,007 to test it. The system achieved a 1% error rate on the digits it classified, with a 9% rejection rate on ambiguous cases. This solved expensive operational problems. Postal services deployed it for sorting mail. Banks used similar systems to read cheques, processing millions per day.</p>

<p>But problems emerged:</p>

<ul>
  <li>
    <p><strong>Training was slow.</strong> Using the computers available at the time, networks needed days or weeks to learn from thousands of examples.</p>
  </li>
  <li>
    <p><strong>Networks with many layers were hard to train.</strong> The error signal weakened as it passed backward through the layers, so the early layers barely learned.</p>
  </li>
  <li>
    <p><strong>Training was unstable.</strong> Small changes to initial settings could cause training to fail completely. Networks required careful tuning to work at all.</p>
  </li>
  <li>
    <p><strong>Networks were data hungry.</strong> They needed thousands of examples to learn effectively. In the late 1980s and early 1990s, large labeled datasets were rare. So researchers could only work within narrow domains.</p>
  </li>
</ul>

<p>By the early 1990s, a new approach emerged. It was easier to use, worked better on small datasets, and solved problems that neural networks struggled with. More on the success of SVMs in a later post.</p>

<p>The connectionist community that had survived now faced another exodus. Funding and researchers moved on. Working on neural networks in the 1990s became career suicide all over again.</p>

<p>But the faithful kept going. They had seen backpropagation work and built systems that learned things no one had programmed. They just needed the right conditions.</p>

<p>Those conditions were coming. They just didn’t know it yet.</p>

<hr />

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li>
    <p>Hopfield, J. J. (1982). <a href="https://www.pnas.org/doi/10.1073/pnas.79.8.2554">“Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the National Academy of Sciences</a></p>
  </li>
  <li>
    <p>Nielsen, M. A. (2015). <a href="http://neuralnetworksanddeeplearning.com/chap2.html">“How the backpropagation algorithm works.” Neural Networks and Deep Learning, Chapter 2</a></p>
  </li>
  <li>
    <p>Sejnowski, T. J., &amp; Rosenberg, C. R. (1987). <a href="https://content.wolfram.com/sites/13/2018/02/01-1-10.pdf">“Parallel networks that learn to pronounce English text.” Complex Systems</a></p>
  </li>
  <li>
    <p>LeCun, Y., et al. (1989). <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf">“Backpropagation Applied to Handwritten Zip Code Recognition.” Neural Computation</a></p>
  </li>
  <li>
    <p>Stanford HAI (2024). <a href="https://hai.stanford.edu/news/from-brain-to-machine-the-unexpected-journey-of-neural-networks">“From Brain to Machine: The Unexpected Journey of Neural Networks.”</a></p>
  </li>
</ul>]]></content><author><name>Jamie Stark</name></author><summary type="html"><![CDATA[Image: Canadian Institute for Advanced Research / Associated Press “Give me another six months and I’ll prove to you that it works.” — Geoffrey Hinton As we saw in the Perceptron post, by the end of 1960s, funding for neural network research dried up and most researchers moved on to other promising approaches. Most, but not all. A small group of researchers believed.]]></summary></entry></feed>