As a general rule of thumb for learning, you will only understand something in detail once you are able to build it on your own. This is particularly true for technical subjects like machine learning and neural networks. During your first encounter with deep learning, you can take a few large steps by learning and mastering a high level framework such as TensorFlow or PyTorch. However, if you want to go really deep, you need to go back to square one: you need to build your own neural network and optimizer from scratch! This is a common rite of passage for deep learning engineers and researchers. Embracing the fine details of these algorithms will give you a very valuable advantage in the field.
When I undertook this journey recently, succeeding was a level-up moment for me. In this post, we will walk this path together, eventually arriving at a fully functional neural net. If you are familiar with the subject on a high level and know what gradient descent is, you are good to go! (If you are not familiar with gradient descent, check out this post!)
One important note, however: instead of reading this post right away, I encourage you to bookmark it and first try to build a neural network completely on your own. In my life as a developer and scientist, very few moments carried the same feeling of accomplishment and intellectual satisfaction as watching my first neural network learn.
The code for this post can be found in my GitHub repository https://github.com/cosmic-cortex/neural-networks-from-scratch. (Here, I go a little bit beyond this post: additional layers such as batch normalization and convolutional layers are implemented.)
What is a neural network anyway?
On a high level, a neural network is simply a function, mapping an input (such as an image) to a prediction (for example, a probability distribution over the possible labels). Fundamentally, there are two simple operations you want to do with a function: calculate the output and the derivative, given an input. The first is needed to obtain predictions, while the latter is needed to train your network with gradient descent. In neural network terminology, calculating the output is called the forward pass, while the gradient with respect to the input is called the local gradient.
This is very simple to model in Python.
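A minimal sketch of such a class might look like the following. (The exact class in the accompanying repository may differ in details.)

```python
class Function:
    """A differentiable function: computes its output and its local gradient."""

    def __call__(self, *args, **kwargs):
        # the forward pass: calculate the output of the function
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def local_grad(self, *args, **kwargs):
        # returns a dictionary of local gradients
        raise NotImplementedError
```

Concrete functions are then defined by subclassing and filling in the two methods.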
In the local_grad() method, we will return a dictionary of gradients instead of a single NumPy array. The reason for this will become apparent when we implement layers for our neural network. Since layers have tunable parameters, these parameters have gradients of their own, so it is useful to keep the gradients with respect to different variable groups separate.
To give a simple example, let’s consider the famous Sigmoid function!
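A sketch of the Sigmoid as a Function subclass might look like this. (The minimal base class is repeated so the snippet runs on its own; details may differ from the repository.)

```python
import numpy as np

class Function:
    # minimal base class, repeated here so the snippet is self-contained
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def local_grad(self, *args, **kwargs):
        raise NotImplementedError

class Sigmoid(Function):
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def local_grad(self, x):
        s = self.forward(x)
        # derivative of the Sigmoid: s * (1 - s)
        return {"x": s * (1.0 - s)}
```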
Note that this function will work with NumPy array inputs as well. Currently, you can think of this as a function of a single variable. Later on however, we will treat multivariate functions as a function of single variable with a tensor input. This layer of abstraction will be essential to write compact building blocks for neural networks.
Layering the fun(ctions)
In essence, a neural network is simply the repeated application of functions, where each function can be thought of as a layer. Calculating the forward pass is simple for composed functions. However, function composition complicates things if we want to calculate the gradient, which is essential for our purposes. To see what happens, let’s consider a simple example! Suppose that the function N(x) represents our neural network, defined as the composition of a linear function f with the Sigmoid.
For simplicity, we don’t have any weights yet. To calculate its derivative, we can use the chain rule:
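Writing N(x) = Sigmoid(f(x)), the chain rule gives

```
N'(x) = Sigmoid'(f(x)) · f'(x)
```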
To summarize, we need three things to calculate this.
- The derivative function of Sigmoid.
- The derivative function of f.
- The output of f at x.
Let’s ponder on this from a practical and computational viewpoint. We could always compute the derivative of N(x) by hand and hard-code it into our program:
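As an illustration, take the hypothetical fixed linear function f(x) = 2x + 1 (an example of my choosing, not one from the post). Hard-coding N and its derivative would look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# N(x) = Sigmoid(f(x)) with the hypothetical choice f(x) = 2x + 1
def N(x):
    return sigmoid(2 * x + 1)

# the derivative, applied by hand with the chain rule and hard-coded:
# Sigmoid'(f(x)) * f'(x)
def N_prime(x):
    s = sigmoid(2 * x + 1)
    return s * (1 - s) * 2
```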
However, this wouldn’t be a good idea. First, we want more complicated functions than this, potentially composed of hundreds of layers. Calculating the derivative for the composition of hundreds of functions by hand is practically impossible. Second, we want our code to be modular, so instead of defining a single function for our network, we want to define it as a composition of functions.
Because of these, we are better off with the chain rule. However, notice that evaluating the derivatives themselves also relies on calculating the value of f(x), so we cannot get away with simply calculating the derivative functions. To make sure everything is right, we can implement the following algorithm, building upon our Function class. (Note the prime ’ after a function name, which denotes its derivative. It is hard to see sometimes, so I emphasize it in advance.)
- Take x. Pass it to the Function object representing f, calculate f(x) AND the local derivative f’(x). Store both results in a cache. Return f(x).
- Take f(x). Pass it to the object representing Sigmoid, calculate Sigmoid(f(x)) AND the local derivative Sigmoid’(f(x)). Store both results in a cache. Return Sigmoid(f(x)).
- Retrieve the local derivative Sigmoid’(f(x)) from the cache and return it. (Yes, I know, this is not the most complicated step ever, but to be algorithmically correct, it must be done.)
- Pass Sigmoid’(f(x)) to the object representing f, retrieve the local derivative f’(x) from the cache, and multiply the two together. Return the product.
If you take another look, you can see that steps 1. and 2. are forward pass + local gradient calculation + caching for f and Sigmoid, chained together. Steps 3. and 4. are something new: they are called the backward pass for Sigmoid and f. The return value of the backward pass is the gradient. Note that this is not simply the local gradient of the specific layer; it is the global gradient of the layer and all layers after it. In technical literature, it is often not explicitly distinguished from the local gradient (because they are both gradients), but I like to emphasize the difference by using the global and local prefixes.
To reflect these methods in our Function class, we add the caching and backward methods to it. It is important to note that when the object is called (i.e. the Function.__call__ method is used), the local gradient is automatically cached.
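A sketch of the extended base class might look like this (names are my own choices; the repository’s version may differ):

```python
class Function:
    """Differentiable function with cached local gradients and a backward pass."""

    def __init__(self):
        self.cache = {}   # values saved during the forward pass
        self.grad = {}    # local gradients, cached automatically

    def __call__(self, *args, **kwargs):
        # forward pass + automatic caching of the local gradient
        output = self.forward(*args, **kwargs)
        self.grad = self.local_grad(*args, **kwargs)
        return output

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def backward(self, *args, **kwargs):
        # returns the global gradient, using the cached local gradient
        raise NotImplementedError

    def local_grad(self, *args, **kwargs):
        raise NotImplementedError
```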
With this new model, we can define our Sigmoid function as the following.
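One possible implementation is sketched below. (The base class with automatic local-gradient caching is repeated so the snippet is self-contained.)

```python
import numpy as np

class Function:
    # minimal base class, repeated here so the snippet runs on its own
    def __init__(self):
        self.grad = {}

    def __call__(self, *args, **kwargs):
        output = self.forward(*args, **kwargs)
        self.grad = self.local_grad(*args, **kwargs)
        return output

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def local_grad(self, *args, **kwargs):
        raise NotImplementedError

class Sigmoid(Function):
    def forward(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def local_grad(self, x):
        s = self.forward(x)
        return {"x": s * (1.0 - s)}

    def backward(self, global_grad):
        # global gradient = gradient flowing from the layers after,
        # times the cached local gradient
        return global_grad * self.grad["x"]
```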
This is now not only composable, but we also have a way to calculate the gradient of composed functions. This algorithm we are using is called automatic differentiation. This is the core of modern deep learning frameworks. Without it, gradient descent would be computationally infeasible.
Note that caching plays a very important role. For instance, we save the input during the forward pass to save time during the gradient calculations. Consider this: if a particular fully connected linear layer is the 109th layer in a neural net, its input is the output of the composition of the previous 108 functions. We really don’t want to recalculate this every time, since this would incur a huge computational cost.
Introducing tunable parameters
In the previous examples, we have only seen functions which are static in a sense, like the Sigmoid. To fit a model to the training data, we need a set of parameters to tune. A parametrized differentiable function is what we call a layer. In theory, the tunable parameters (called weights) can be thought of as another set of inputs. In practice, however, it is easier to handle these parameters separately, since
- we won’t specify their values directly, but rather store a set of current values,
- we need methods to modify these after each backward pass to perform gradient descent.
To do that, we create the Layer class by extending Function with weight-initializing and weight-updating methods:
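A sketch of such a Layer class (method names are my own choices, and the base class is repeated for self-containedness):

```python
import numpy as np

class Function:
    # minimal base class, repeated here so the snippet runs on its own
    def __init__(self):
        self.grad = {}

    def __call__(self, *args, **kwargs):
        output = self.forward(*args, **kwargs)
        self.grad = self.local_grad(*args, **kwargs)
        return output

    def forward(self, *args, **kwargs):
        raise NotImplementedError

    def local_grad(self, *args, **kwargs):
        raise NotImplementedError

class Layer(Function):
    """A Function with tunable parameters (weights)."""

    def __init__(self):
        super().__init__()
        self.weight = {}        # current values of the parameters
        self.weight_grad = {}   # gradients with respect to the parameters

    def _init_weights(self, *args, **kwargs):
        raise NotImplementedError

    def update_weights(self, lr):
        # one gradient descent step on every parameter
        for name, w in self.weight.items():
            self.weight[name] = w - lr * self.weight_grad[name]
```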
For instance, a simple fully connected linear layer representing the function f(x) = Wx + b, with W and b as parameters, can be implemented as the following.
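A sketch of such a linear layer, using the row-vector convention f(x) = xW + b so that it works on batches out of the box. (It is written standalone here for brevity; in the framework it would extend the Layer class, and the initialization scheme is just one simple choice.)

```python
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        # x has shape (n, in_dim); W: (in_dim, out_dim); b: (1, out_dim)
        self.weight = {
            "W": np.random.randn(in_dim, out_dim) * 0.1,
            "b": np.zeros((1, out_dim)),
        }
        self.weight_grad = {}
        self.cache = {}

    def forward(self, x):
        self.cache["x"] = x                     # saved for the backward pass
        return x @ self.weight["W"] + self.weight["b"]

    def backward(self, global_grad):
        x = self.cache["x"]
        # gradients with respect to the parameters, summed over the batch
        self.weight_grad["W"] = x.T @ global_grad
        self.weight_grad["b"] = global_grad.sum(axis=0, keepdims=True)
        # gradient with respect to the input, passed to the previous layer
        return global_grad @ self.weight["W"].T

    def update_weights(self, lr):
        for name in self.weight:
            self.weight[name] -= lr * self.weight_grad[name]
```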
You may observe that although this function maps from the in_dim-dimensional space to the out_dim-dimensional space, the forward pass takes a single variable. That is because our code is vectorized: instead of operating on distinct variables for each dimension, we use linear algebraic operations to compress our code. Vectorized code also executes faster in general, so this is a really indispensable trick in the machine learning arsenal.
In addition, our linear layer also takes inputs in batches. That is, if you have n data points in d-dimensional space, you pass an n x d matrix instead of calling the function n times. Figuring out the correct implementations of the forward and backward passes with batches may take some mental gymnastics, but this is something I recommend working out in detail yourself at least once. Of course, if you are stuck, you can always consult the code above!
Among the building blocks of a neural network, the loss function plays a somewhat special role. Instead of only taking the output of the previous layer as input, a loss function also takes the ground truth, outputting a single number that measures how well the network fits the training data. The smaller the loss, the better the fit. Otherwise, loss functions are simply differentiable functions with no tunable parameters.
In practice, this means two important differences.
- The forward pass requires two arguments: the output of the final layer and the ground truth.
- Since it is the last layer, the global gradient is equal to the local gradient.
For example, a frequently used loss function is the cross-entropy loss, which can be implemented as below.
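A common implementation folds the softmax into the loss; a sketch is below (details such as class and method names may differ from the repository). X holds the raw scores of shape (n, n_classes), and y holds the integer ground-truth labels.

```python
import numpy as np

class CrossEntropyLoss:
    def __init__(self):
        self.cache = {}

    def forward(self, X, y):
        # numerically stable softmax of the raw scores
        exp = np.exp(X - X.max(axis=1, keepdims=True))
        probs = exp / exp.sum(axis=1, keepdims=True)
        self.cache = {"probs": probs, "y": y}
        # average negative log-likelihood of the ground truth classes
        n = X.shape[0]
        return -np.log(probs[np.arange(n), y] + 1e-12).mean()

    def backward(self):
        # since this is the last layer, the global gradient equals the local one
        probs, y = self.cache["probs"], self.cache["y"]
        n = probs.shape[0]
        grad = probs.copy()
        grad[np.arange(n), y] -= 1.0
        return grad / n
```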
Composing the layers
Up until now, we have only created the building blocks of a neural network, not the actual network itself. To use such a DIY deep learning framework comfortably, we would like a simple interface: construct a Net from a list of layers and a loss function, call it on the data to obtain the loss, then call its backward and weight-updating methods.
We have done most of the work, so Net is only a simple container, composing the layers and the loss together. This object just loops over the list of layers, passing the output of each one to the next.
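A sketch of such a container (method names are illustrative; layer objects are assumed to expose forward/backward and, if they have weights, update_weights):

```python
class Net:
    """Container composing the layers and the loss function."""

    def __init__(self, layers, loss):
        self.layers = layers
        self.loss_fn = loss

    def forward(self, x):
        # pass the output of each layer to the next one
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def loss(self, out, y):
        return self.loss_fn.forward(out, y)

    def backward(self):
        # start from the loss, then walk the layers in reverse order
        grad = self.loss_fn.backward()
        for layer in reversed(self.layers):
            grad = layer.backward(grad)
        return grad

    def update_weights(self, lr):
        # only layers with tunable parameters need updating
        for layer in self.layers:
            if hasattr(layer, "update_weights"):
                layer.update_weights(lr)
```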
Testing our framework
Now that we have built the necessary tools, it is time to put them to work! We will use a simple 2-layer fully connected neural net with ReLU activations to classify points in the two-dimensional plane.
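Here is a condensed, self-contained sketch of such an experiment. (The dataset and hyperparameters are illustrative choices of mine, not those of the linked example, which builds the same network from the Layer and Net classes.)

```python
import numpy as np

np.random.seed(42)

# toy dataset: two Gaussian blobs in the plane
n = 100
X = np.vstack([np.random.randn(n, 2) + 2.0, np.random.randn(n, 2) - 2.0])
y = np.array([0] * n + [1] * n)
idx = np.arange(len(y))

# 2-layer net: Linear -> ReLU -> Linear, with softmax + cross-entropy loss
in_dim, hidden, n_classes = 2, 16, 2
W1 = np.random.randn(in_dim, hidden) * 0.1
b1 = np.zeros((1, hidden))
W2 = np.random.randn(hidden, n_classes) * 0.1
b2 = np.zeros((1, n_classes))

lr = 0.1
for step in range(1000):
    # forward pass
    h = np.maximum(0.0, X @ W1 + b1)              # ReLU activation
    scores = h @ W2 + b2
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    loss = -np.log(probs[idx, y]).mean()

    # backward pass: softmax + cross-entropy, then the chain rule
    grad = probs.copy()
    grad[idx, y] -= 1.0
    grad /= len(y)
    dW2 = h.T @ grad
    db2 = grad.sum(axis=0, keepdims=True)
    dh = grad @ W2.T
    dh[h <= 0.0] = 0.0                            # gradient of ReLU
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0, keepdims=True)

    # gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = (probs.argmax(axis=1) == y).mean()
```

Since the two blobs are well separated, the loss should drop quickly and the training accuracy should approach 1.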
The full example can be found here; you can run it locally on your own computer and experiment with it, which is highly encouraged. I suggest setting a random seed (to keep the weight initialization deterministic) and playing around with the learning rate parameter, for example.
Going beyond fully connected layers
Of course, this simple architecture is just the tip of the iceberg. There are much more complicated architectures available, for instance convolutional networks for computer vision, recurrent networks for sequence learning, etc.
In the accompanying code, you can find implementations of convolutional and pooling layers, along with a 2D batch normalization layer, which you can use to study and experiment with these components.
During my quest, the famous CS231n: Convolutional Neural Networks for Visual Recognition course from Stanford helped me a lot, so if you decide to take on this journey and implement a convolutional network from scratch, I can recommend this to you.
Have fun and keep on learning!