The WatchTower: 19th Edition
Welcome to the captivating world of Artificial Intelligence!
Welcome to the 19th edition of the WatchTower! In this edition, we take a look at AI pioneer Rich Sutton’s “Bitter Lesson” learned from the history of AI and begin a series of articles exploring the role of probability in modern machine learning systems.
📰 Featured in This Edition:
The Bitter Lesson - Computation over Cognition
Probability in Machine Learning - Part 1
The Bitter Lesson - Computation over Cognition
In 2019, Rich Sutton, widely regarded as one of the founding figures of reinforcement learning, published an article titled “The Bitter Lesson”, outlining what he believes is the biggest lesson from the last seventy years of AI research: the most effective AI systems have been the ones that leverage and scale with computational power, rather than the ones that rely on explicitly encoding human knowledge into machines. To illustrate what he means, let's consider an example of the latter type, known as expert systems.
Expert systems, developed from the 1960s and popularised in the 1980s, were AI systems designed to emulate the decision-making abilities of human experts by encoding vast amounts of domain-specific knowledge into a set of rules. For example, an expert system designed for medical diagnosis might follow these rules:
If the patient has a fever and cough, then diagnose the flu.
If the patient has a fever, cough, and sore throat, then diagnose the flu or another respiratory infection.
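To make this concrete, here is a minimal sketch in Python of how such hard-coded rules might look. The rules and symptom names are purely illustrative, not drawn from any real medical system:

```python
# A minimal sketch of a rule-based expert system for the diagnosis example above.
# The rules and symptom names are purely illustrative.

def diagnose(symptoms):
    if {"fever", "cough", "sore throat"} <= symptoms:
        return "flu or other respiratory infection"
    if {"fever", "cough"} <= symptoms:
        return "flu"
    return "no diagnosis: no rule matches these symptoms"

print(diagnose({"fever", "cough"}))                 # flu
print(diagnose({"fever", "cough", "sore throat"}))  # flu or other respiratory infection
print(diagnose({"headache"}))                       # no diagnosis: no rule matches these symptoms
```

Every behaviour the system can exhibit has to be written down as another branch like these.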
A system like this can only address exact scenarios that match its predefined rules - it can’t learn or adapt to new cases it hasn’t seen before. You might think that we could just add more rules and facts to the expert system and it would continue to improve, but in reality, doing this makes the system more inconsistent, inflexible, and harder to maintain. Formalising and encoding all of human knowledge into a logic-based system has proven to be an impossible task.
Instead, methods that utilise computational power to learn and search for solutions themselves without restriction have consistently proven more successful and have come to dominate the field. To understand how this happened, let’s take a look at a phenomenon known as Moore’s Law.
Moore’s Law
Moore’s Law, first articulated by Gordon Moore in 1965, is the observation that the density of transistors (tiny electronic switches that turn a current on or off) on integrated circuits tends to double approximately every two years, resulting in a doubling of available computational power and a significant reduction in cost. For AI systems, this means that capabilities for processing datasets and training and operating complex models can roughly double every two years.
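As a rough back-of-the-envelope illustration (assuming an exact two-year doubling and an arbitrary, non-historical starting count), the compounding effect looks like this:

```python
# Back-of-the-envelope illustration of Moore's Law: a doubling every two years.
# The starting transistor count is illustrative, not a historical figure.
base_transistors = 2_300

for years in (0, 10, 20, 30, 40, 50):
    count = base_transistors * 2 ** (years / 2)
    print(f"after {years:2d} years: ~{count:,.0f} transistors")
```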
Although originally an empirical observation, Moore’s Law has continuously driven the pace of innovation in the semiconductor industry and has now held true for several decades. While continuing to shrink transistors and increase their density is becoming increasingly difficult, there remain many ways to keep improving computational capabilities, including advances in parallel computation, quantum computing, and algorithmic efficiency. So, in short, we can expect further increases in the compute available for our AI systems to benefit from.
Let’s now look at some areas in AI where Moore’s Law and computational approaches have outpaced and superseded knowledge-based methods.
Moore’s Law in AI
Computer Vision
Early approaches in computer vision relied on identifying manually designed features believed to be important for recognising objects in images. Since humans may notice features such as edges, shapes, and specific patterns, early AI systems were taught to look out for these too. However, the number of features humans can define is finite, so this approach was fundamentally restrictive. But thanks to Moore’s Law, modern deep learning systems have improved significantly at vision tasks while largely disregarding the traditional approach. Today’s models, consisting of millions or even billions of parameters, autonomously learn from big data sets, discovering countless abstract features that haven’t been explicitly defined and that humans may not even understand.
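To give a flavour of what a manually designed feature looks like, here is a small, illustrative sketch of the classic Sobel edge filter applied to a toy image; the example is ours, not taken from any particular system:

```python
import numpy as np

# Toy 6x6 grayscale "image": dark left half, bright right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# The Sobel kernel for vertical edges: a classic handcrafted feature.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Naive sliding-window filtering (valid region only), enough to show the idea.
h, w = image.shape
response = np.zeros((h - 2, w - 2))
for i in range(h - 2):
    for j in range(w - 2):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

print(response)  # largest values sit where the dark-to-bright edge is
```

A deep network learns millions of filters like this on its own, most of which correspond to no feature a human would ever have thought to write down.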
Natural Language Processing (NLP)
Similar to computer vision, the field of NLP previously consisted largely of rule-based systems and handcrafted linguistic features. As we saw earlier, expert systems would rely solely on these rules to respond to textual inputs. The field has since transformed and is now dominated by statistics and computation. Unlike expert systems, today’s large language models (LLMs) generate nuanced and original responses by autonomously learning patterns and context from massive datasets and leveraging enormous computational power.
Game AI (Chess and Go)
In the early days of computer chess, programs relied on encoded rules and heuristic strategies to mimic human-like thinking. These approaches were unable to handle the game’s complexity and had limited success. IBM’s Deep Blue, the program that first defeated world champion Garry Kasparov in 1997, instead used the most advanced computing power available at the time, employing brute-force search techniques to look deeply into possible future moves.
Similarly in the game of Go, it wasn’t until 2016 that DeepMind’s AlphaGo, utilising advanced search algorithms and self-play learning - a computationally expensive process where the system repeatedly plays against itself to refine its strategy - was able to defeat a world champion.
Today, AI systems for both games greatly exceed human abilities thanks to the combined power of compute and reinforcement learning.
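To give a flavour of the brute-force search mentioned above, here is a tiny, generic minimax sketch over a hand-built toy game tree; it is purely illustrative and bears no resemblance to Deep Blue's or AlphaGo's actual engines:

```python
# A tiny, generic minimax over a hand-built toy game tree: search ahead and
# assume both sides play their best move.
def minimax(node, maximising=True):
    if isinstance(node, (int, float)):                       # leaf: its score
        return node
    scores = [minimax(child, not maximising) for child in node]
    return max(scores) if maximising else min(scores)

# Depth-2 toy tree: our two candidate moves each lead to two possible replies,
# scored from our point of view.
tree = [[3, 12], [2, 8]]
print(minimax(tree))   # 3: the best outcome we can guarantee against perfect play
```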
Looking Ahead
The lesson here is a vital one, and one Sutton believes AI as a field has not yet understood. Going forward, Sutton urges us to focus on building general learning systems that scale in capability with computational power, rather than static ones restricted to a finite set of built-in human knowledge.
The capabilities of the human brain are incredibly complex, and any individual would struggle to articulate even a tiny fraction of their interactions with the environment in each moment. So, rather than trying to find simple ways to describe and encode our knowledge of things like space, objects, language, and agents, we should instead build into our systems only the meta-methods that enable them to find and capture these complexities on their own.
Published by Jonas Macken, July 01 2024
Probability in Machine Learning - Part 1
Image Credit: Bing AI
Written by Stephen Elliott
This article is the beginning of a series on how probability powers today's most capable artificial intelligence technologies. This week we discuss frequency modelling and chance. Next week, we will explore how these concepts are applied in neural networks. We will finish by investigating how misunderstanding probability can derail and corrupt our artificial intelligence, from both engineering and fairness perspectives.
We often lack the senses we would need to completely understand another organism's choices. Perhaps we know that some animal generally digs burrows uphill of a water source. We may create something resembling the organism's behaviour, but reproducing the exact behaviour of that particular organism, like for like, in some simulation is usually out of reach. We lack an understanding of its starting conditions and the factors affecting it over time. We do not have a good understanding of that animal's internal state; we do not know why it acts the way it does.
Many human processes are similarly difficult to explain in definite terms. We have all had the experience of changing our order at the last second, and probably also of wanting to change our order when the waiter is already gone. What separates successfully changing our order from being unable to is a thin and dubious line. Thin, for it is a matter of seconds, a matter of distraction on the part of the waiter. Dubious, for we are unlikely ever to reproduce that line if we model only the waiter and ourselves, for there is so much else happening. There are patrons entering and leaving, the maître d’ is calling out, and the waiter has a hangover. All of these things will affect the outcome of our process.
For a system of machines in an enclosed box, or a factory, we should be able to ascribe a cause and effect to pretty much everything that occurs. This is how classical mechanics builds its world view. Strangely, so does much of economic theory. But for many prediction tasks we must turn away from deterministic models that say that A always leads to B, always leads to C, and towards a more flexible representation of reality.
Instead of modelling a faithful, definite reproduction of reality, let us consider the possibilities at play in a situation. Take the two possible outcomes of the restaurant problem. Either we change our order, or we do not. If we can accurately say how frequently we will change our order, and how frequently we will be unable to, then we have modelled some important part of the system. Observe the restaurant 500 times. Some fraction of the tables change their order, perhaps 40 in 500. Then 8% of the tables change their order. We have a frequency representation of the system. In the future, we would expect that 8% of tables in any group will change their order, under the same circumstances in which we made the measurements. Though circumstances may differ from table to table, we have considered many tables, and so these little differences should not sway our measured frequency too much. Then we can expect that our representation will provide us roughly accurate expectations of the future. So we can rely on our model to tell us the chance of changing our order at any given visit in the future. The restaurant might also use the model to better understand their service offering.
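As a toy version of this frequency estimate, suppose we have simply recorded, for each of the 500 observed tables, whether it changed its order (the numbers below are the illustrative ones from the text):

```python
# Toy frequency model: estimate the chance of a table changing its order
# from 500 observed tables (40 changes, 460 non-changes).
observations = [True] * 40 + [False] * 460   # True = the table changed its order

frequency = sum(observations) / len(observations)
print(f"Estimated chance of a table changing its order: {frequency:.0%}")   # 8%
```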
In the animal burrowing example, we can similarly observe a large number of burrows, say 150, perhaps recording the distance from a body of water. Then we shall have burrows appearing about 10 metres away from the river, with some frequency; about 20 metres away, with some frequency; and so on. It is easy to imagine that these frequency models, with their implied chance of future events and summary information about a system as a whole, could be used to represent many other interesting systems. Weather patterns, sporting performance, and illness rates in a community needing vaccines might all benefit.
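A sketch of the burrow version might look like this, using made-up distances purely to show the binning-and-counting step:

```python
import random

random.seed(0)
# 150 made-up burrow-to-river distances in metres, just to have something to count.
distances = [abs(random.gauss(20, 8)) for _ in range(150)]

# Count how often burrows fall into each 10-metre band: 0-10 m, 10-20 m, ...
counts = {}
for d in distances:
    band = int(d // 10) * 10
    counts[band] = counts.get(band, 0) + 1

for band in sorted(counts):
    share = counts[band] / len(distances)
    print(f"{band:2d}-{band + 10:2d} m: {share:.0%} of burrows")
```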
The frequency approach to modelling seems doomed in one aspect, though. It must be very imprecise, not integrating much information particular to the situation on the day. It is particularly inflexible. With fewer staff on, for example, we should expect that our waiter will be darting off as soon as the orders are taken. The chance that we may change our order is surely lower on such a day. The probabilities of all outcomes in our scenario are changed significantly. Is the original model not useless?
Indeed, the system’s behaviour changes under this new condition of reduced staffing. Then we may create two new models to cover the new conditions. We have one “conditional model” for full staffing, and one for understaffing. We also have the original “unconditional” frequency model, which represents the restaurant in general, under whatever staffing conditions.
If the restaurant is trying to make decisions based on our model, then they should surely prefer the conditional model over the unconditional model. The conditional probabilities will be a more accurate representation of the system, because they are specific to the restaurant under a much narrower, more precise set of conditions.
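A toy sketch of the conditional versus unconditional models, with illustrative numbers chosen so that the overall rate stays at the 8% from before, might look like this:

```python
# Toy conditional frequency model: the same 500 observations as before, now
# split by staffing level. All numbers are illustrative.
observations = (
    [("full", True)] * 35 + [("full", False)] * 315 +    # 350 fully staffed tables
    [("under", True)] * 5 + [("under", False)] * 145     # 150 understaffed tables
)

def change_rate(rows):
    return sum(changed for _, changed in rows) / len(rows)

full = [row for row in observations if row[0] == "full"]
under = [row for row in observations if row[0] == "under"]

print(f"Unconditional:        {change_rate(observations):.1%}")   # 8.0% overall
print(f"Given full staffing:  {change_rate(full):.1%}")           # 10.0%
print(f"Given understaffing:  {change_rate(under):.1%}")          # 3.3%
```

The first number answers "how often do tables change their order in general?", while the other two answer the same question under a specific, known condition.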
This simple concept of probability and conditioning sets the stage for understanding neural networks, which power the artificial intelligences currently taking the world by storm and which have found unprecedented success in several impressive image, video and language tasks. In next week's newsletter we will explore how probability underpins the learning, storage and retrieval of information in neural networks, and in doing so gain a firmer grasp of how modern artificial intelligences work at a foundational level.
Published by Stephen Elliott, July 01 2024
Sponsors
Our ambitious projects would not be possible without the support of our GOLD sponsor, UNOVA.
Closing Notes
We welcome any feedback / suggestions for future editions here or email us at [email protected].
Stay curious,
Sources
The Bitter Lesson - Computation over Cognition: Rich Sutton, “The Bitter Lesson” (2019), incompleteideas.net/IncIdeas/BitterLesson.html