As a general rule of thumb for learning, you will only understand something in detail once you are able to build it on your own. This is particularly true for technical subjects like machine learning and neural networks. During your first encounter with deep learning, you can take a few large steps by learning and mastering a high level framework such as TensorFlow or PyTorch. However, if you want to go really deep, you need to go back to square one: you need to build your own neural network and optimizer from scratch! This is a common rite of passage for deep learning engineers and researchers. Embracing the fine details of these algorithms will give you a very valuable advantage in the field.

When I undertook this journey recently, succeeding was a *level up moment* for me. In this post, we will walk through this path together, eventually arriving to a fully functional neural net. If you are familiar with the subject on a high level and know what gradient descent is, you are good to go! (If you are not familiar with gradient descent, check out this post!)

**An important note, however. **Instead of reading this post right away, I encourage you to bookmark this post, then try to build a neural network completely on your own. In my life as a developer and scientist, very few moments carried the same feeling of accomplishment and intellectual satisfaction as watching my first neural network learn.

The code for this post can be found in my GitHub repository https://github.com/cosmic-cortex/neural-networks-from-scratch. (Here, I go a little bit beyond this post: additional layers such as batch normalization and convolutional layers are implemented.)

## What is a neural network anyway?

On a high level, a neural network is simply a function, mapping an input (such as an image) to a prediction (a probability distribution on the possible labels for example). Fundamentally, there are two simple operations you want to do with a function: calculate the output and the derivative, given an input. The first one is needed to obtain predictions, the latter one is to train your network with gradient descent. In neural network terminology, calculating the output is called the *forward pass*, while the gradient with respect to the input is called a *local gradient*.

This is very simple to model Python.

In the `local_grad()`

method, we will return a gradient dictionary instead of a NumPy array of gradients. The reason for this will became apparent when we are implementing layers for our neural network. Since layers have tunable parameters, these parameters will have their gradients on their own. So, it is useful to treat the gradients with respect to variable groups distinct.

To give a simple example, let’s consider the famous Sigmoid function!

class Sigmoid(Function): | |

def forward(self, X): | |

return 1/(1 + np.exp(–X)) | |

def local_grad(self, X): | |

s = 1/(1 + np.exp(–X)) | |

grads = {'X': s * (1 – s)} | |

return grads |

Note that this function will work with NumPy array inputs as well. Currently, you can think of this as a function of a single variable. Later on however, we will treat multivariate functions as a function of single variable with a tensor input. This layer of abstraction will be essential to write compact building blocks for neural networks.

## Layering the fun(ctions)

In essence, a neural network is simply the repeated application of functions, where each function can be thought of as a layer. Calculating the forward pass is simple for composed functions. However, function composition complicates things if we want to calculate the gradient, which is essential for our purposes. To see what happens, let’s consider a simple example! Suppose that the *N(x)* function represents our neural network, defined by composition of a linear function with Sigmoid.

For simplicity, we don’t have any weights yet. To calculate its derivative, we can use the chain rule:

To summarize, we need three things to calculate this.

- The derivative function of Sigmoid.
- The derivative function of
*f*. - The output of
*f*at*x*.

Let’s ponder on this from a practical and computational viewpoint. We could always compute the derivative of *N(x)* by hand and hard-code it into our program:

However, this won’t be a good idea. First, we want more complicated functions than this, potentially composed of hundreds of layers. Calculating the derivative for the composition of hundreds of functions is practically impossible. Second, we want our code to be modular, so instead of defining a single function for our network, we want to define it as a composition of functions.

Because of these, we are better off with the chain rule. However, if you notice, evaluating the derivatives themselves also rely on calculating the value of *f(x)*, so we cannot get away with simply calculating the derivative functions. To make sure everything is right, we can implement the following algorithm, building upon our `Function`

. (Note the prime ‘ after a function, which denotes its derivative. It is hard to see sometimes, so I emphasize it in advance.)

- Take
*x*. Pass it to the`Function`

object representing*f*, calculate*f(x)*AND the local derivative*f’(x)*. Store both results in a cache. Return*f(x)*. - Take
*f(x).*Pass it to the object representing*Sigmoid*, calculate*Sigmoid(f(x))*AND the local derivative*Sigmoid’(f(x)).*Store both results in a cache. Return*Sigmoid(f(x))*. - Take
*Sigmoid’(f(x))*.and return*Sigmoid’(f(x))*. (Yes I know, this is not the most complicated step ever, but to be algorithmically correct, this must be done.) - Pass it to the object representing
*f*, retrieve the local derivative*f’(x)*from the cache, finally multiply it with*Sigmoid’(f(x))*. Return the product.

If you take another look, you can see that steps 1. and 2. are *forward pass + local gradient calculation + caching *for *f* and *Sigmoid* chained together. Steps 3. and 4. are something new: they are called the *backward pass* for *Sigmoid* and *f*. The return value of the backward pass is the gradient. Note that this is not just simply the local gradient of the specific layer, it is the global gradient of the layer and all layers after it. In technical literature, it is often not explicitly distinguished from the local gradient (because they are both gradients), but I like to emphasize the difference by using the *global *and *local *prefixes.

To reflect these methods in our `Function`

, we add the caching and backward methods to it. It is important to note that when the object is called (i.e. the `Function.__call__ `

method is used), the local gradient is automatically cached.

class Function: | |

""" | |

Abstract model of a differentiable function. | |

""" | |

def __init__(self, *args, **kwargs): | |

# initializing cache for intermediate results | |

# helps with gradient calculation in some cases | |

self.cache = {} | |

# cache for gradients | |

self.grad = {} | |

def __call__(self, *args, **kwargs): | |

# calculating output | |

output = self.forward(*args, **kwargs) | |

# calculating and caching local gradients | |

self.grad = self.local_grad(*args, **kwargs) | |

return output | |

def forward(self, *args, **kwargs): | |

""" | |

Forward pass of the function. Calculates the output value and the | |

gradient at the input as well. | |

""" | |

pass | |

def backward(self, *args, **kwargs): | |

""" | |

Backward pass. Computes the global gradient at the input value | |

after forward pass. | |

""" | |

pass | |

def local_grad(self, *args, **kwargs): | |

""" | |

Calculates the local gradients of the function at the given input. | |

Returns: | |

grad: dictionary of local gradients. | |

""" | |

pass |

With this new model, we can define our Sigmoid function as the following.

class Sigmoid(Function): | |

def forward(self, X): | |

return 1/(1 + np.exp(–X)) | |

def backward(self, dY): | |

return dY * self.grad['X'] | |

def local_grad(self, X): | |

s = 1/(1 + np.exp(–X)) | |

grads = {'X': s * (1 – s)} | |

return grads |

This is now not only composable, but we also have a way to calculate the gradient of composed functions. This algorithm we are using is called *automatic differentiation*. This is the core of modern deep learning frameworks. Without it, gradient descent would be computationally infeasible.

Note that cacheing plays a very important role. For instance, we save the input during the forward pass for later, to save time during the gradient calculations. Consider this: if the particular fully connected linear layer is the 109th layer in a neural net, its input is the output of the composition of the previous 108 functions. We really don’t want to recalculate this every time, since this would incur a huge computational cost.

## Introducing tunable parameters

In the previous examples, we have only seen functions which are static in a sense, like the Sigmoid. To fit a model to the training data, we need a set of parameters to tune. A parametrized differentiable function is what we call a *layer*. In theory, the tunable parameters (called *weights*) can be thought as another set of inputs. In practice however, it is easier to handle these parameters separately, since

- we won’t specify their values directly, but rather store a set of current values,
- we need methods to modify these after each backward pass to perform gradient descent.

To do that, we create the `Layer`

class by extending `Function`

with weight initializing and weight updating methods:

class Layer(Function): | |

""" | |

Abstract model of a neural network layer. In addition to Function, a Layer | |

also has weights and gradients with respect to the weights. | |

""" | |

def __init__(self, *args, **kwargs): | |

super().__init__(*args, **kwargs) | |

self.weight = {} | |

self.weight_update = {} | |

def _init_weights(self, *args, **kwargs): | |

""" | |

Initializes the weights. | |

""" | |

pass | |

def _update_weights(self, lr): | |

""" | |

Updates the weights using the corresponding _global_ gradients computed during | |

backpropagation. | |

Args: | |

lr: float. Learning rate. | |

""" | |

for weight_key, weight in self.weight.items(): | |

self.weight[weight_key] = self.weight[weight_key] – lr * self.weight_update[weight_key] |

For instance, a simple fully connected linear layer representing the *f(X) = Wx + b* functions with *W* and *b* as parameters can be implemented as the following.

class Linear(Layer): | |

def __init__(self, in_dim, out_dim): | |

super().__init__() | |

self._init_weights(in_dim, out_dim) | |

def _init_weights(self, in_dim, out_dim): | |

scale = 1 / sqrt(in_dim) | |

self.weight['W'] = scale * np.random.randn(in_dim, out_dim) | |

self.weight['b'] = scale * np.random.randn(1, out_dim) | |

def forward(self, X): | |

""" | |

Forward pass for the Linear layer. | |

Args: | |

X: numpy.ndarray of shape (n_batch, in_dim) containing | |

the input value. | |

Returns: | |

Y: numpy.ndarray of shape of shape (n_batch, out_dim) containing | |

the output value. | |

""" | |

output = np.dot(X, self.weight['W']) + self.weight['b'] | |

# caching variables for backprop | |

self.cache['X'] = X | |

self.cache['output'] = output | |

return output | |

def backward(self, dY): | |

""" | |

Backward pass for the Linear layer. | |

Args: | |

dY: numpy.ndarray of shape (n_batch, n_out). Global gradient | |

backpropagated from the next layer. | |

Returns: | |

dX: numpy.ndarray of shape (n_batch, n_out). Global gradient | |

of the Linear layer. | |

""" | |

# calculating the global gradient, to be propagated backwards | |

dX = dY.dot(self.grad['X'].T) | |

# calculating the global gradient wrt to weights | |

X = self.cache['X'] | |

dW = self.grad['W'].T.dot(dY) | |

db = np.sum(dY, axis=0, keepdims=True) | |

# caching the global gradients | |

self.weight_update = {'W': dW, 'b': db} | |

return dX |

You may observe that although this function maps from the `in_dim `

dimensional space to the `out_dim`

dimensional space, but the forward pass takes one variable. That is because our code is *vectorized*: instead of operating on distinct variables for each dimension, we use linear algebraic operations to compress our code. Vectorized code also executes faster in general, so this is a really indispensable trick in the machine learning arsenal.

In addition to this, our linear layer also takes inputs in batches. That is, if you have *n* data points in the *d-*dimensional space, you pass an *n x d* matrix instead of calling the function *n* times. Figuring out the correct implementations for forward and backward passes using batches may take some mental gymnastics, but this is something I also recommend to working out in detail by yourself for at least once. Of course if you are stuck, you can always consult the code above!

## Loss functions

Among the building blocks of a neural network, the loss function plays a somewhat special role. Instead of only taking the output of the previous layer as input, a loss function also takes the ground truth, outputting a single number which reflects the function’s ability to fit the training data. The smaller the loss, the better the fit. In any case, they are simply differentiable functions with no tunable parameters.

In practice, this means two important differences.

- The forward pass requires two arguments: the output of the final layer and the ground truth.
- Since it is the last layer, the global gradient is equal to the local gradient.

class Loss(Function): | |

def forward(self, X, Y): | |

""" | |

Computes the loss of x with respect to y. | |

Args: | |

X: numpy.ndarray of shape (n_batch, n_dim). | |

Y: numpy.ndarray of shape (n_batch, n_dim). | |

Returns: | |

loss: numpy.float. | |

""" | |

pass | |

def backward(self): | |

""" | |

Backward pass for the loss function. Since it should be the final layer | |

of an architecture, no input is needed for the backward pass. | |

Returns: | |

gradX: numpy.ndarray of shape (n_batch, n_dim). Local gradient of the loss. | |

""" | |

return self.grad['X'] | |

def local_grad(self, X, Y): | |

""" | |

Local gradient with respect to X at (X, Y). | |

Args: | |

X: numpy.ndarray of shape (n_batch, n_dim). | |

Y: numpy.ndarray of shape (n_batch, n_dim). | |

Returns: | |

gradX: numpy.ndarray of shape (n_batch, n_dim). | |

""" | |

pass |

For example, a frequently used loss function is the *cross-entropy loss*, which can be implemented as below.

class CrossEntropyLoss(Loss): | |

def forward(self, X, y): | |

""" | |

Computes the cross entropy loss of x with respect to y. | |

Args: | |

X: numpy.ndarray of shape (n_batch, n_dim). | |

y: numpy.ndarray of shape (n_batch, 1). Should contain class labels | |

for each data point in x. | |

Returns: | |

crossentropy_loss: numpy.float. Cross entropy loss of x with respect to y. | |

""" | |

# calculating crossentropy | |

exp_x = np.exp(X) | |

probs = exp_x/np.sum(exp_x, axis=1, keepdims=True) | |

log_probs = –np.log([probs[i, y[i]] for i in range(len(probs))]) | |

crossentropy_loss = np.mean(log_probs) | |

# caching for backprop | |

self.cache['probs'] = probs | |

self.cache['y'] = y | |

return crossentropy_loss | |

def local_grad(self, X, Y): | |

probs = self.cache['probs'] | |

ones = np.zeros_like(probs) | |

for row_idx, col_idx in enumerate(Y): | |

ones[row_idx, col_idx] = 1.0 | |

grads = {'X': (probs – ones)/float(len(X))} | |

return grads |

## Composing the layers

Up until now, we have only created the building blocks of a neural network, not the actual network itself. To be able to use such a DIY deep learning framework comfortably, we would like to have an interface like this.

net = Net(layers=[Linear(2, 4), ReLU(), Linear(4, 2)], | |

loss=CrossEntropyLoss()) | |

out = net(X) | |

loss = net.loss(out, Y_labels) | |

grad = net.backward() | |

net.update_weights(lr=0.1) |

We have done most of the work, so `Net`

is only a simple container, composing the layers and the loss together. This object just loops over the list of layers, passing the output of each one to the next one.

class Net: | |

def __init__(self, layers, loss): | |

self.layers = layers | |

self.loss_fn = loss | |

def __call__(self, *args, **kwargs): | |

return self.forward(*args, **kwargs) | |

def forward(self, x): | |

""" | |

Calculates the forward pass by propagating the input through the | |

layers. | |

Args: | |

x: numpy.ndarray. Input of the net. | |

Returns: | |

output: numpy.ndarray. Output of the net. | |

""" | |

for layer in self.layers: | |

x = layer(x) | |

return x | |

def loss(self, x, y): | |

""" | |

Calculates the loss of the forward pass output with respect to y. | |

Should be called after forward pass. | |

Args: | |

x: numpy.ndarray. Output of the forward pass. | |

y: numpy.ndarray. Ground truth. | |

Returns: | |

loss: numpy.float. Loss value. | |

""" | |

loss = self.loss_fn(x, y) | |

return loss | |

def backward(self): | |

""" | |

Complete backward pass for the net. Should be called after the forward | |

pass and the loss are calculated. | |

Returns: | |

d: numpy.ndarray of shape matching the input during forward pass. | |

""" | |

d = self.loss_fn.backward() | |

for layer in reversed(self.layers): | |

d = layer.backward(d) | |

return d | |

def update_weights(self, lr): | |

""" | |

Updates the weights for all layers using the corresponding gradients | |

computed during backpropagation. | |

Args: | |

lr: float. Learning rate. | |

""" | |

for layer in self.layers: | |

if isinstance(layer, Layer): | |

layer._update_weights(lr) |

## Testing our framework

Now that we have built the necessary tools, it is time to put them into work! We will use a simple 2-layer fully connected neural net with ReLU activations to classify points in the two-dimensional plane.

The full example can be found here, which you can run locally in your own computer and experiment with it, which is highly encouraged. I suggest setting a random seed (to keep the weight initialization deterministic) and playing around with the learning rate parameter for example.

## Going beyond fully connected layers

Of course, this simple architecture is just the tip of the iceberg. There are much more complicated architectures available, for instance convolutional networks for computer vision, recurrent networks for sequence learning, etc.

In the accompanying code, you can find the implementations of convolutional and pooling layers, along with a 2d batch normalization layer, which you can use for studying and experimenting with the certain components.

During my quest, the famous CS231n: Convolutional Neural Networks for Visual Recognition course from Stanford helped me a lot, so if you decide to take on this journey and implement a convolutional network from scratch, I can recommend this to you.

Have fun and keep on learning!