The WatchTower: 22nd Edition

Welcome to the 22nd edition of the WatchTower! In this edition, we’ll learn about the basic types of machine learning algorithms, including their purposes and applications. We’ll also provide a hands-on guide to building your own MBTI personality prediction machine learning model, similar to those used in the algorithms of apps like TikTok and Instagram.

📰 Featured in This Edition:

  • Foundational Machine Learning Algorithms

  • How to Build Your Own TikTok Algorithm

Foundational Machine Learning Algorithms

Image Credit: MIT Technology Review

We’ve already discussed how machine learning extends software engineering by learning algorithms for problems which are too complicated for us to write instructions for by hand. We’re now going to zero in on algorithms, exploring what they are, how they work, and what sorts of algorithms we use to learn solutions to complicated problems. The first part of this article is an introduction to algorithms; computer science students might want to skip to the second bold line, where our discussion of machine learning algorithms begins.

Computation is made up of two components: data and instructions. The data is a “window” into some world of interest: perhaps a set of observations of the prices of some financial assets. The algorithm is the set of instructions which takes that data in, does some work on it, and puts out different data. Algorithms work at all levels of a computer. An algorithm keeps an eye on unused memory and processing power, and when the user launches a new program, an algorithm is responsible for allocating system resources to it.

In fact, algorithms are the only way to make a computer do anything. Computers are built from the ground up to be directed by sets of instructions. There is no random behaviour in a computer: everything it does is the result of an instruction given to it, directly or indirectly, by a human. In that sense, computers are thoroughly artificial: they are largely a product of the human mind, governed less by natural laws than by human-made rules.

Computer programs consist of many algorithms working together to turn data into something useful. In the case of a video player, for example, the video data is stored as a file. An algorithm runs to turn the ones and zeroes in the file into light values for the millions of pixels on the screen, so the file can be displayed as intended. The data in the video file is a window into the world that the filmmaker saw; the algorithm turns that data into a human-interpretable stream of light.

When we construct algorithms, we want to make sure they are fast, correct, and resilient. At the lowest level, a computer performs only a few basic instructions: it can add, subtract, and move numbers around. It turns out that shifting and adding numbers cleverly also lets the computer multiply using those few basic instructions (see Booth’s multiplication algorithm), rather than by repeated addition, which would involve an enormous number of additions for large numbers.

Although repeated addition is a viable method for multiplying one number by a small second number, like one million by two, multiplying one million by one million by repeated addition would take far, far longer. This suggests that designing fast algorithms is really about designing algorithms which do not get much slower as the data gets bigger. For an algorithm to be useful in practice, we want to minimise how quickly it slows down as its input grows: if we put in one million and one million, the algorithm shouldn’t take much longer than when we put in two and two. Otherwise, we would need massive computational power (and hence expensive electricity) to handle larger inputs. It is much better to design an algorithm which doesn’t get much more expensive as the size of its task grows.
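
To make the difference concrete, here is a small Python sketch. It is our own illustration (not from the article or the Colab guide, and the function names are made up): it compares multiplication by repeated addition with the shift-and-add idea that Booth-style multipliers are built around.

    # Multiply a by b using repeated addition: the work grows in step with b itself.
    def multiply_repeated_addition(a, b):
        total = 0
        for _ in range(b):
            total += a
        return total

    # Multiply a by b using shift-and-add: the work grows with the number of binary
    # digits in b, so multiplying a million by a million stays cheap.
    def multiply_shift_and_add(a, b):
        total = 0
        while b > 0:
            if b & 1:      # if the lowest bit of b is 1, add the current a
                total += a
            a <<= 1        # double a (shift left)
            b >>= 1        # halve b (shift right), dropping the lowest bit
        return total

    print(multiply_shift_and_add(1_000_000, 1_000_000))  # 1000000000000

The second version needs only about twenty loop iterations to multiply a million by a million, where the first needs a million additions.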

We also wish to write algorithms that are correct. We want our algorithms to give the right answer every time, without fail, so that we can make assumptions about the inputs to other algorithms. If we know our algorithm will always be correct, then we can build other algorithms on top of it without worrying about what would happen if the first algorithm is wrong. An algorithm can be proved correct using mathematics; these proofs generally boil down to showing that the algorithm cannot be wrong, because it makes the right decision at every step along the way. Now that we have an understanding of what an algorithm is, and why we need fast and correct algorithms, let’s discuss machine learning algorithms.

Fundamental machine learning algorithms work to distill useful information from very big, very complicated data. The task of recognising a friend’s face in a train station is far too complicated for us to manually construct a set of instructions to identify the face. So we turn to machine learning algorithms, which can construct the algorithm for us. At the most fundamental level, there are four species of machine learning algorithms. Each species has a different purpose, works on different data, and is used differently in practical applications. Today, we will discuss linear regression, dimensionality reduction, density estimation, and classification. These species are largely conceptual. In later articles, we will explore some basic implementations (ways of coding) of these algorithm types, such as generalised least squares parameter estimation, principal component analysis, kernel density estimation, and neural networks. For a deeper understanding of these algorithm species, I point you to the excellent textbook Mathematics of Machine Learning.

Linear regression algorithms aim to find a function which accurately represents the data. An accurate function is one which summarises the relationship between different parts of the incoming data. For example, we might want to predict a person’s height based on their age. Our line of best fit is unlikely to be very accurate, since age alone does not tell us much about height. But there are many interesting mathematical properties of such a line which allow us to draw conclusions about different items of the data. We might be able to use the line to tell whether a particular person is unusually tall for their age, for example, and how confident we can be about that conclusion, given the variation in height across the whole dataset. Linear regression is heavily used in data analysis because it is easily interpretable and well supported in software. It is also a mainstay of traditional frequentist statistical inference, which is taught in undergraduate courses in economics, science, and engineering. Linear regressions are also a key component of neural networks, which sit at the foundation of recent advances in generative AI.
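
As a rough sketch of the idea (the ages and heights below are invented purely for illustration, not real measurements), fitting and using such a line takes only a few lines of Python with NumPy:

    import numpy as np

    # Invented toy data: ages in years and heights in centimetres.
    ages    = np.array([4, 7, 10, 13, 16, 19, 25, 40])
    heights = np.array([102, 121, 138, 156, 172, 177, 178, 176])

    # Fit the line of best fit: height ≈ slope * age + intercept.
    slope, intercept = np.polyfit(ages, heights, deg=1)

    # Use the line to predict, and to judge how unusual an observation is.
    predicted_12 = slope * 12 + intercept
    residuals = heights - (slope * ages + intercept)
    print(f"predicted height at age 12: {predicted_12:.1f} cm")
    print(f"typical error of the line:  {residuals.std():.1f} cm")

A person whose height sits well above the fitted line, relative to that typical error, is unusually tall for their age, which is exactly the kind of conclusion described above.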

Dimensionality reduction algorithms extract the most important features of input data through clever mathematics. It is generally true that any complicated data contains a lot of duplicated information. Image data, for example, contains many redundant pixels. We can still recognise a human face with the colour removed; probably also with most of the detail removed and only an outline left. Dimensionality reduction is a computational method of condensing information in this way. It is often used to crush data down to its bare essentials before feeding it to another machine learning algorithm. That means the second algorithm deals with less data, and so is faster, and thus cheaper. Dimensionality reduction may also make models more accurate by discarding information which is not important, forcing the second model to “focus” on the important aspects of the data. We will discuss this more in our article on generalisation. Dimensionality reduction is also the key principle behind autoencoders, a building block of many modern generative AI systems.
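
A minimal sketch of this crushing-down step, using principal component analysis from scikit-learn (one of the implementations mentioned above) on synthetic data we generate ourselves for the example:

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic data: 200 points with 50 features each, deliberately built so that
    # most of the variation comes from just 3 hidden factors plus a little noise.
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(200, 3))
    mixing = rng.normal(size=(3, 50))
    data = hidden @ mixing + 0.1 * rng.normal(size=(200, 50))

    # Crush the 50 features down to 3, keeping as much variation as possible.
    pca = PCA(n_components=3)
    compressed = pca.fit_transform(data)

    print(compressed.shape)                     # (200, 3)
    print(pca.explained_variance_ratio_.sum())  # close to 1.0: very little is lost

Any model trained on the compressed version now handles 3 numbers per data point instead of 50, which is exactly the speed and cost saving described above.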

Density estimation algorithms use data to create a probability distribution representing the data. New data can be compared to the probability distribution to determine whether the original and new data are similar. This comparison process is useful for identifying outliers in data and for creating new, “synthetic” data with the same characteristics as the original. This is one of the more abstract species, so we will limit our discussion for today.
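
A small sketch, using a kernel density estimate from SciPy on invented temperature readings, shows both uses described above: scoring new data against the learned distribution, and drawing new synthetic data from it.

    import numpy as np
    from scipy.stats import gaussian_kde

    # Invented original data: typical daily temperatures in degrees Celsius.
    temperatures = np.array([14.2, 15.1, 15.8, 16.0, 16.4, 17.1, 17.5, 18.0, 18.3, 19.0])

    # Estimate a probability distribution from the data.
    density = gaussian_kde(temperatures)

    # New data can be compared against that distribution...
    print(density(16.5))   # relatively high density: looks like the original data
    print(density(35.0))   # near-zero density: an outlier

    # ...and new "synthetic" data with the same characteristics can be drawn from it.
    print(density.resample(5))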

Classification algorithms take data and identify which of several user-defined classes the data fits into. Classification is the discrete sibling of regression: where regression predicts (continuous) height based on age, classification might be used to predict (the discrete category of) eye colour based on geographic location. Classification algorithms may also be responsible for deciding whether a photograph contains a dog or a muffin, for example. Like dimensionality reduction algorithms, classification algorithms must take data and turn it into a much smaller output. However, while dimensionality reduction algorithms aim to summarise the key elements of the data, classification algorithms aim to decide which of several classes a piece of data fits into. These models are often used in computer vision applications, reporting what objects appear in a video feed for other algorithms to respond to.
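
To caricature the eye-colour example (the coordinates and labels below are invented, and real eye colour obviously depends on far more than location), here is one simple classifier from scikit-learn, a k-nearest-neighbours model, which assigns a new data point to the class of the most similar points it has already seen:

    from sklearn.neighbors import KNeighborsClassifier

    # Invented data: (latitude, longitude) of a person's home town, and their eye colour.
    locations  = [[60.2, 24.9], [59.3, 18.1], [41.9, 12.5],
                  [40.4, -3.7], [35.7, 139.7], [37.6, 127.0]]
    eye_colour = ["blue", "blue", "brown", "brown", "brown", "brown"]

    # The classifier "learns" by remembering the examples; prediction then picks the
    # majority class among the 3 nearest neighbours of the new point.
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(locations, eye_colour)

    print(model.predict([[55.7, 12.6]]))   # the predicted class for a new location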

We have discussed what algorithms are, what makes a good algorithm, and the four fundamental types of machine learning algorithms. From the second half of our discussion, it starts to become clear what it means for a machine to “learn” from data. We will expand our discussion to implementations of these algorithms in future weeks, along with discussions on the nature of data and models, human methods of data analysis, and applications of machine learning in the real world. Later, we will delve into technical analysis of the machine learning algorithms which run the world.

Published by Stephen Elliott, July 22 2024.

How to Build Your Own TikTok Algorithm

Following on from last week: ever wondered how you could make your own TikTok algo? Well, fear not; we've put together a beginner-friendly (and somewhat goofy) step-by-step guide to building your own MBTI detector!

Since the guide includes code, we’ve hosted it on Google Colab for easy access and execution. Find the full guide here!

Published by Zac, July 22 2024.

Sponsors

Our ambitious projects would not be possible without the support of our GOLD sponsor, UNOVA.

Closing Notes

We welcome any feedback or suggestions for future editions here, or email us at [email protected].

Stay curious,