The WatchTower: 26th Edition

Welcome to the 26th edition of the WatchTower! In this edition, we have a fantastic lineup of events, including our upcoming game night and industry night, our new reading group, and an article on how neural networks really work.

📰 Featured in This Edition

  • Event: AI Society Game Night (this Thursday)

  • Event: AI Industry Night (Week 9)

  • Event: AI Reading Group (Tuesdays and Thursdays, odd weeks)

  • Article: How Neural Networks Really Work

🗓 Upcoming Events

🎮 READY FOR A GAME NIGHT? 🎮

We’re bringing you a night full of fun, games, and a little friendly competition! 👾🔥 Whether you're a seasoned gamer or just looking to relax and unwind, we've got you covered!

📅 Date: Thursday, 19th of September

🕒 Time: 6pm - 9pm

📍 Location: Quad G025

The AI Society is holding an AI Industry Night in Week 9 of this term, where you’ll be able to meet industry professionals and learn about opportunities in AI!

📅 Date: Week 9, Term 3 (Date TBD)

🕒 Time: TBD

📍 Location: TBD

If you’re interested in attending, please fill out the following form to help us plan the event!

The AI Society runs a fortnightly reading group open to anyone interested in AI, from those considering research in the field to those just curious about the latest developments.

We meet fortnightly (Weeks 1, 3, 5, 7, and 9) on campus to discuss influential papers and books related to AI. Follow the link below to sign up!

How Neural Networks Really Work

Neural networks are powerful function approximation models. They allow us to find a best guess, in a reproducible form, at the forces generating some data. In many modern learning systems, neural networks are a sub-component of a larger mathematical information-processing machine. Neural layers are critically important in many modern machine learning applications. They contain many trainable parameters, giving the engineer nuanced influence over how information is mixed and processed. Developments in training algorithms, architectures and computer hardware mean neural networks can now be used to build convergent approximations (very accurate models) of extremely complex distributions, like language.

A dense neural network is one where each node (or neuron) in one layer is connected to every node in the next. Consider a dense network with one hidden layer, described in the image below. It is “hidden” because its outputs are not directly interpreted by the mechanism used to train the model: the training objective only sees the network’s final output, not the outputs of intermediate layers. The representations formed by intermediate layers are akin to building blocks, which are combined in intricate ways by the model’s mathematics to construct the final output.

Though the training algorithm does not directly see this intermediate machinery, it is of great concern to the architect of the model. The structure of the intermediate layers determines what information is learnable by the model. For this article, however, we will restrict ourselves to the simple dense neural network, which allows a deeper analysis of the mechanisms at play in these function approximators.

Consider a functional representation of such a dense network. Observed features of some event, such as values in a financial time series or pixel values in an image, are fed into a linear function. Such a function takes the form

hᵢ = w₀ᵢ + w₁ᵢx₁ + w₂ᵢx₂ + …

This linear transformation of the inputs is then passed through an activation function, producing the hidden layer’s output vector. Such an activation function may take almost any form. To exploit modern hardware capabilities in the training process, it is best to use smooth or piecewise linear activation functions, both of which work well with backpropagation at low computational cost.
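
To make this concrete, here is a minimal sketch of one dense hidden layer in Python with NumPy. The variable names and random values (relu, W, b, x) are illustrative choices of ours, not something prescribed by the article.

import numpy as np

# Minimal sketch of one dense hidden layer: each hidden unit i computes
# h_i = relu(w_0i + w_1i*x_1 + w_2i*x_2 + ...). Names and values are illustrative.

def relu(z):
    # A piecewise linear activation function.
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])   # input features, e.g. values from a time series
W = rng.normal(size=(4, 3))      # one row of weights per hidden unit
b = rng.normal(size=4)           # the w_0i ("bias") terms
h = relu(W @ x + b)              # hidden layer output, one value per unit
print(h)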

The output of the hidden layer is passed through an output function, which transforms the hidden layer’s output into a useful form. Softmax, for example, is used to represent the possible output classes as a discrete probability distribution. When we instead wish to regress on the inputs (predict a continuous value), we might simply use the identity function at the output, which is equivalent to using the raw outputs of the hidden layer as an estimate in the original units.
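
A small sketch of the two output functions mentioned above, softmax for classification and the identity for regression; the softmax implementation and the example scores are our own illustration.

import numpy as np

def softmax(z):
    # Subtracting the max before exponentiating is a common numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

h = np.array([2.0, -1.0, 0.5])   # pretend this is the hidden layer's output

probs = softmax(h)               # classification: a discrete probability distribution
print(probs, probs.sum())        # the probabilities sum to 1

estimate = h                     # regression: the identity output keeps the original units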

Neural networks take basic linear models, which use straight lines to separate data, and add complexity by introducing curves and multiple layers, making them better at handling real-world data. For the technically minded, neural networks are a non-linear generalisation of linear models. This is a powerful generalisation and corresponds to an interesting spatial abstraction in the underlying mathematics. In the intuitive case of a binary (logistic) linear regression, we can represent the model’s score for one of the two categories as a flat hyperplane over the n-dimensional input space, where n is the number of inputs; with two input features, this is a plane sitting in three-dimensional space. An equivalent special case of a neural network is a dense network with a single hidden layer, linear activation functions, and softmax at the output.

Each hidden node computes a linear transformation of the inputs, creating one line (in higher dimensions, one plane) for each of the n hidden nodes. With linear activations, a weighted sum of these lines is still a line, so after the softmax this model is no different from a (logistic) linear regression. With piecewise linear activations, the same non-linearly transformed and summed set of lines instead forms a piecewise linear approximation to a smooth surface.
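
To illustrate the special case above, the short sketch below builds a single hidden layer with linear (identity) activations and shows that it collapses to an ordinary logistic regression with combined weights. The weights are random and the variable names are ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer (linear activation)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output layer

# Network prediction with identity activations in the hidden layer.
p_network = sigmoid(W2 @ (W1 @ x + b1) + b2)

# The same prediction from a plain logistic regression with collapsed weights,
# because the composition of two linear maps is itself linear.
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2
p_logistic = sigmoid(W_eq @ x + b_eq)

print(np.allclose(p_network, p_logistic))   # True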

Figure: reproduced from Willemson (2008); the paper itself is unrelated.

Using the binary case for its 3D visualisation, we have a creased, piecewise-flat surface in 3D space. The decision surface is like rolling countryside, with hills, valleys, saddles and pits. Each flat segment of the surface corresponds to one combination of linear transformations of the inputs. The surface does not overlap or cut itself: the machine is never indifferent between two outcomes.

Increasing the number of nodes in a hidden layer increases the resolution at which the model approximates the underlying distribution of the classes. The creases in the surface get closer together and ultimately disappear as the number of linear pieces approximating the surface becomes infinite. Notice that this resolution does not grow by a fixed amount per added node; the number of linear pieces grows polynomially in layer width. Consider the effect of adding a node to the second layer of a 1x1 network, compared to the effect of adding a node to the second layer of a 100x100 network.
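
One way to see this growth, under the assumption of ReLU-style piecewise linear activations, is to count how many distinct on/off patterns the hidden units take over a grid of inputs: each pattern corresponds to one flat piece of the surface. The widths, grid and random weights below are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(2)

# A grid of 2D inputs on which to probe the surface.
grid = np.stack(np.meshgrid(np.linspace(-3, 3, 200),
                            np.linspace(-3, 3, 200)), axis=-1).reshape(-1, 2)

for width in (2, 8, 32):
    W = rng.normal(size=(width, 2))
    b = rng.normal(size=width)
    patterns = (grid @ W.T + b) > 0                 # which units are "on" at each input
    n_pieces = len(np.unique(patterns, axis=0))     # distinct patterns = flat pieces seen
    print(f"width={width:3d}  flat pieces seen on the grid: {n_pieces}")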

Hence, even dense networks with only a single hidden layer of non-linear activations can approximate the function underlying the input data to arbitrary precision. This result is known as the Universal Approximation Theorem. Increasing the number of nodes in the layer splits each line/plane/hyperplane in the decision space into smaller pieces: we shrink each flat region to get a smoother surface, or equivalently, we increase the precision of the model’s approximation of the input data. In the limit of infinitely many nodes, the surface becomes perfectly smooth. Though it is possible in principle to build a model of arbitrary precision this way, it is not feasible (too expensive and slow) on complex, real-world data, because of the cost of fitting such a model to highly conditional data.
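
The sketch below illustrates the same idea in one dimension: fix a hidden layer of random ReLU features and fit only the output weights by least squares to a smooth target. As the width grows, the piecewise linear fit typically gets tighter. The target function, widths and random features are assumptions of ours, chosen purely for the demonstration.

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-np.pi, np.pi, 400)
target = np.sin(x)                                   # a smooth function to approximate

for width in (4, 16, 128):
    w = rng.normal(size=width)
    b = rng.uniform(-np.pi, np.pi, size=width)
    hidden = np.maximum(0.0, np.outer(x, w) + b)     # random ReLU features (hidden layer)
    features = np.column_stack([hidden, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)   # fit output weights only
    error = np.max(np.abs(features @ coef - target))
    print(f"width={width:4d}  worst-case error: {error:.3f}")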

Using an elegant geometric abstraction, we can also alter the structure of the space in which the decision boundary lives. Adding layers to a neural network creates nested non-linear functions in the prediction/decision space. The (unintuitive) result is high-dimensional decision surfaces whose spatial structure morphs as the values of the inputs change. Imagine a kitchen where the distance and angle between the stove and the dishwasher change depending on how dim the lights are. Non-linear transformations allow the prediction/decision surface to completely change its shape given only a small change in a single input value. In our shallow (single hidden layer) dense network, the value of the decision surface might change, but we would not see it change from a flat surface to a rising-and-falling surface as a single input value changes.
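
As a small, hand-built illustration of this morphing (the weights below are chosen by hand for the demonstration, not learned and not from the article), the two-hidden-layer sketch uses one unit as a gate: when x1 is small the output is flat in x2, and when x1 is larger the very same network produces a rising-and-falling bump in x2.

import numpy as np

relu = lambda z: np.maximum(0.0, z)

# First hidden layer: three units build a triangular bump in x2, a fourth acts as a gate on x1.
W1 = np.array([[0., 1.],      # relu(x2)
               [0., 1.],      # relu(x2 - 1)
               [0., 1.],      # relu(x2 - 2)
               [-1., 0.]])    # relu(0.5 - x1): the gate
b1 = np.array([0., -1., -2., 0.5])

# Second hidden layer: combine the bump and heavily penalise it when the gate is active.
w2 = np.array([1., -2., 1., -10.])

def net(x1, x2):
    h = relu(W1 @ np.array([x1, x2]) + b1)
    return float(relu(w2 @ h))

x2s = np.linspace(-1, 3, 9)
print([round(net(0.0, x2), 2) for x2 in x2s])   # flat in x2: the gate switches the bump off
print([round(net(1.0, x2), 2) for x2 in x2s])   # a rising-and-falling bump in x2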

We see that adding depth has dramatically altered the network’s mechanism of approximation, and hence improved its ability to approximate complex functions. This is intuitive when we consider again the nested functional representation of the neural network. The first layer’s output is linearly transformed and then non-linearly “activated”. This is akin to putting an input, or some part of the processing machinery, on a slider. Each slider is dialled up or down by a weighted function of the inputs from the previous layer. Intermediate layers may compute “flag” values which increase or decrease values in subsequent layers.

Non-linear transformations bend relationships in the data, where linear transformations can only amplify or rotate them. Non-linearity in the activation functions gives the model the ability to switch part of its machinery off when it sees certain inputs. The decision space learned by a neural network is therefore a high-dimensional non-linear subspace of the input space: a space of linear subspaces, each a shrunk and bent version of its input space.
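
A quick sketch of this switching-off behaviour, assuming ReLU activations and random weights of our own choosing: units whose pre-activation is negative output exactly zero, and so contribute nothing to the next layer for that particular input.

import numpy as np

rng = np.random.default_rng(4)
W, b = rng.normal(size=(5, 3)), rng.normal(size=5)

for x in (np.array([1.0, 0.0, 0.0]), np.array([-2.0, 1.0, 3.0])):
    h = np.maximum(0.0, W @ x + b)                       # ReLU hidden layer
    print("input", x, "-> active units:", np.flatnonzero(h))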

Though a neural network’s decision surface can be made arbitrarily precise and covers an infinite space, we are rarely able to model any given process perfectly. Practical problems such as noisy data, computational expense (AWS bills and human lifespans), and imperfections in training/fitting methods prevent it. Consider also some theoretical complications: the network will produce outputs for inputs it never observed during training, and it produces estimates even for unrealistic or impossible input values. These issues point to the data-quality, fitting-efficiency, and out-of-sample generalisation challenges that are central to neural architecture design.

Understanding how neural networks transform data at different layers helps us see why they are so effective for complex tasks like image recognition or language modelling. In later articles, we will analyse fitting/training methods and the effect of different neural architectures on the types of data our models can approximate. We will continue exploring the relationship between theoretical and practical concepts in machine learning. Thank you for reading.

Published by Lucy Lu, September 09 2024.

Sponsors

Closing Notes

We welcome any feedback or suggestions for future editions here, or email us at [email protected].

Stay curious,