Chapter 10: Deep Learning
Neural networks, originally conceived in the 1940s, have grown from a modest idea about mimicking brain functions into the powerhouse behind modern AI. Today neural networks are everywhere — they help AI recognize faces, recommend music, drive autonomous cars and even diagnose diseases.
In this chapter we start with basic neural networks, and then introduce some advanced structures. We will also explore two of the most famous use cases of modern neural networks.
Single Layer Neural Networks
As with every algorithm we have seen so far, the objective of a neural network is to estimate the response Y. It takes an input vector X of p variables and builds a non-linear function f(X) to estimate Y. What makes neural networks different is their particular structure.
A single layer neural network. (Source : ISL Python)
The arrows indicate that each of the inputs from the input layer feeds into each of the K (5 in this case) hidden units in the next layer. Finally, each of the activations from the hidden layer feeds into the output layer. The neural network model has the following form:
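In ISL's notation:

$$
f(X) \;=\; \beta_0 + \sum_{k=1}^{K} \beta_k h_k(X)
\;=\; \beta_0 + \sum_{k=1}^{K} \beta_k \, g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_j\Big)
$$

where the $h_k(X)$ are the activations of the K hidden units and g is a non-linear activation function.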
We notice that the output is just a linear combination of the activations in the previous layer. Peeling back one layer, the activations themselves are just a non-linear transformation of the inputs. If g were linear, we would end up with nothing more than layered linear regression!
In a neural network, each layer's outputs serve as inputs to the following layer, creating a sequential flow that continues until the final layer is reached. This structure sets it apart from the other algorithms we've encountered. Working backwards from the formula for f(X) is like peeling back the layers of an onion!
The machine learning problem, then, is to ‘learn’ the right weights and biases that give the closest possible approximation to the response Y. This is done by minimizing a cost function: RSS for quantitative responses & the misclassification rate for qualitative responses.
Now, what exactly is g? In the early days, the sigmoid function (yes, the one from logistic regression) was favored, but in modern neural networks the rectified linear unit (ReLU) is preferred.
Activation functions. (Source : ISL Python)
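As a quick illustration, the two activations can be written in a few lines of NumPy (a minimal sketch of my own, not the book's code):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1); the activation favored in early networks
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified linear unit: passes positive values through, zeroes out the rest
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z))  # approximately [0.12 0.5  0.95]
print(relu(z))     # [0. 0. 3.]
```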
Multilayer Neural Networks
While we started with an easy example, most modern neural networks have more than one hidden layer, each with many units. Although in theory a single hidden layer with enough units can approximate any function, the optimization task is made easier with multiple moderately sized layers.
Let’s take the classic example for neural networks: handwritten digit recognition. Here we take an image as input and classify it as one of 10 digits. We use this problem to explain a multilayer neural network. Note: this is a very brief introduction to the problem; we will revisit it in much greater detail at a later stage.
A multilayer neural network. (Source : ISL Python)
The idea behind the layers remains the same; however, it is important to note that the activations in the second hidden layer are non-linear transformations of the activations in the preceding layer. Again, we can formulate f_m(X):
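In ISL's notation, with two hidden layers and a 10-class output:

$$
A_k^{(1)} = g\Big(w_{k0}^{(1)} + \sum_{j=1}^{p} w_{kj}^{(1)} X_j\Big), \qquad
A_\ell^{(2)} = g\Big(w_{\ell 0}^{(2)} + \sum_{k=1}^{K_1} w_{\ell k}^{(2)} A_k^{(1)}\Big)
$$

$$
Z_m = \beta_{m0} + \sum_{\ell=1}^{K_2} \beta_{m\ell} A_\ell^{(2)}, \qquad
f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}
$$

The last step is the softmax, which turns the ten scores Z_0, …, Z_9 into class probabilities.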
Notice again that the output Z_m is a linear combination of the activations in the previous layer; the only difference is that we have now performed two non-linear transformations! We simply assign an input x to the output unit with the maximum probability. This process is illustrated below.
Image recognition process. (Source : 3b1b)
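For readers who want to see this in code, here is a minimal Keras sketch of such a network (my own illustration; the layer sizes are arbitrary choices, and the MNIST digits are loaded from keras.datasets):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load the handwritten digit images (28x28 grayscale) and flatten them to 784-length vectors
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# Two ReLU hidden layers followed by a 10-unit softmax output layer
model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)
print(model.evaluate(x_test, y_test))  # [test loss, test accuracy]
```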
What we saw above was the most basic (though by no means easy) neural network possible. To build more specialized neural networks, we can change the hidden layers or the architecture itself. We will see examples of both below.
Convolutional Neural Networks (CNN)
One special type of neural network is the convolutional neural network (CNN). CNNs gained popularity around 2010 and have been immensely successful at the task of image classification. The way they classify images is similar to how humans perform the task: they identify low-level features and then combine them to form high-level features, which are then used to predict an output class.
A CNN classifying a tiger. (Source : ISL Python)
In the above example, we see low-level features like stripes, pupils, parts of the eye, and so on. These are then combined to form high-level features like a mouth. Based on all these high-level features, the neural network predicts the image to be that of a tiger.
There are two types of hidden layers in a CNN: (i) convolution layers and (ii) pooling layers. The convolution layer acts as a filter applied to small chunks of the image: it rewards sections that are similar to the filter and penalizes ones that are not. A pooling layer is a way of condensing a large image into a smaller summary image. There are many ways to achieve this, but we will use the max pooling method. Both of these layers are illustrated below.
Horizontal and vertical stripes filter applied to a tiger. (Source : ISL Python)
A max pool layer on a 4x4 matrix
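To make the two operations concrete, here is a small hand-rolled sketch (my own illustration, not the book's code) of a single convolution filter and 2x2 max pooling applied to a toy 4x4 image:

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the filter over every patch of the image; large outputs mean the patch
    # "looks like" the filter, small or negative outputs mean it does not
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    # Condense each non-overlapping size x size block to its maximum value
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4x4 "image"
vertical_edge = np.array([[1., -1.], [1., -1.]])   # a tiny vertical-stripe filter

print(convolve2d(image, vertical_edge))  # 3x3 feature map
print(max_pool(image))                   # 2x2 summary: [[ 5.  7.] [13. 15.]]
```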
For input we supply an image represented as a matrix of its constituent pixels. We apply many of these convolution and pooling layers to the red, green and blue channels of the image. Since the size of the image is reduced after each pooling layer, we add more convolution filters to compensate.
We repeat this stacking until the pooling layers have reduced each image to just a few pixels. At this point we flatten the three-dimensional feature maps into individual units and feed them into a fully connected layer before the final output layer. An example of this architecture is shown below.
An example CNN. (Source : ISL Python)
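A minimal Keras sketch of this style of architecture (my own illustration; the input size, filter counts and number of classes are placeholders, not the values from the figure):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolution + pooling blocks: the image shrinks after every pooling layer while
# the number of filters grows, then everything is flattened and fed into a fully
# connected layer before the softmax output.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),            # RGB image: red, green and blue channels
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(100, activation="softmax"),   # e.g. 100 output classes
])
model.summary()
```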
There are many tuning parameters to be selected in constructing such a network, apart from the number, nature, and sizes of each layer. Many pre-trained CNNs exist that work very well for different types of images; these can be incorporated and adapted to our use case instead of starting from scratch. One very popular example is the ResNet-50 classifier.
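For example, a ResNet-50 pretrained on ImageNet ships with Keras and can be applied directly (a sketch; "tiger.jpg" is a placeholder path):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load ResNet-50 with weights trained on ImageNet (1000 classes)
model = ResNet50(weights="imagenet")

# "tiger.jpg" is a placeholder path; ResNet-50 expects 224x224 RGB input
img = image.load_img("tiger.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])  # top 3 (class, label, probability) triples
```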
Recurrent Neural Networks (RNN)
Many data sources are sequential in nature, and we need to account for the relative positions of their elements when building predictive models. Some examples are listed below:
- Documents such as book and movie reviews, newspaper articles, and tweets. The sequence and relative positions of words in a document capture the narrative, theme and tone, and can be exploited in tasks such as topic classification, sentiment analysis, and language translation.
- Time series of temperature, rainfall, wind speed, air quality, and so on. We may want to forecast the weather several days ahead, or climate several decades ahead.
- Recorded speech, musical recordings, and other sound recordings. We may want to give a text transcription of a speech, or perhaps a language translation. We may want to assess the quality of a piece of music, or assign certain attributes.
In a recurrent neural network the input object X is a sequence. For textual data, X = {X1, X2, …, Xj}, where each element is a word. The order of the words and the closeness of certain words to one another in a sentence convey semantic meaning. RNNs are designed to accommodate and take advantage of this sequential nature, similar to how CNNs are designed to work well with images.
The structure of an RNN. (Source : ISL Python)
In the above image, each Xi is a word. The sequence is processed one word Xi at a time: the network updates the activations Ai in the hidden layer, taking as input the word Xi and the activation vector Ai−1 from the previous step in the sequence. Each activation Ai feeds into the final layer and produces an output Oi, but the most important of these is the last output.
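In ISL's notation (using the index i for the position in the sequence):

$$
A_{ik} = g\Big(w_{k0} + \sum_{j=1}^{p} w_{kj} X_{ij} + \sum_{s=1}^{K} u_{ks} A_{i-1,\,s}\Big),
\qquad
O_i = \beta_0 + \sum_{k=1}^{K} \beta_k A_{ik}
$$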
Here W denotes the input layer weights, U the hidden-to-hidden layer weights, and B the output layer weights. Note: the same weights W, U and B are used as we process each element in the sequence, i.e. they are not functions of i. This is a form of weight sharing used by RNNs.
The final output of an RNN is shaped by a combination of previous activations and inputs. As we proceed from beginning to end, the activations accumulate a history of what has been seen before, so that the learned context can be used for prediction.
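A tiny NumPy sketch of this forward pass (my own illustration; the shapes, the tanh activation and the omitted bias terms are simplifications) makes the weight sharing explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 8, 4                      # input dimension and number of hidden units
W = rng.normal(size=(K, p))      # input -> hidden weights (shared across all steps)
U = rng.normal(size=(K, K))      # hidden -> hidden weights (shared across all steps)
B = rng.normal(size=K)           # hidden -> output weights (shared across all steps)

def rnn_forward(X_seq):
    # X_seq: a sequence of length L, each element a vector of p features
    A = np.zeros(K)              # activations start at zero
    outputs = []
    for X_i in X_seq:
        # the new activation depends on the current input AND the previous activation
        A = np.tanh(W @ X_i + U @ A)
        outputs.append(B @ A)    # output O_i at this step (bias terms omitted for brevity)
    return outputs               # typically only the last output is used for prediction

X_seq = rng.normal(size=(10, p)) # a toy "sentence" of 10 word vectors
print(rnn_forward(X_seq)[-1])
```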
RNN forecast of the trading volume of stocks. Orange = prediction, Black = observation. (Source : ISL Python)
When to use deep learning
Given that neural networks work so well for image recognition, language translation & document modelling, a question arises: should we discard all our older tools and use deep learning on every problem involving data?
To answer this question, let’s revisit the problem of predicting baseball players’ salaries using performance data. We apply simple linear regression, lasso regression & a (highly tuned) neural network to this problem. The results are listed below.
Model comparison. (Source : ISL Python)
We see that the simple model performs better than the neural network, which took lots of time & tinkering to parametrize. In such a case, we stick to the principle of Occam’s razor: when faced with several methods that give roughly the same performance, pick the simplest.
There are, however, a few cases where deep learning is a good choice:
- A large number of training examples is available
- Interpretability of the model is not a priority
- A high signal-to-noise ratio, i.e. less noisy data
I have, uncharacteristically, not explained the actual learning algorithm this time around. This is because I don’t understand it well enough to explain it at this stage. I will do my due diligence once I construct a neural network from scratch. So stay tuned!
With that I wrap up my study of this great textbook (for now). It’s been a very good guide to help build my fundamentals in machine learning & statistics, and I will no doubt come back to this resource multiple times.
Thank you for reading along. I hope you learnt something from my posts & have some interest in exploring the world of machine learning further 🤖