Even the most common neural network architectures have *a lot* of parameters. ResNet50, which is a frequently used baseline model, has ~25 million. This means that during training, we perform a search in a 25 million dimensional parameter space.

To put this number in perspective, let’s take a look at a cube in this space. An n-dimensional cube has 2ⁿ vertices, so in 25 million dimensions, we are talking about 2²⁵⁰⁰⁰⁰⁰⁰ points. In a search grid, this would be just a single element. For comparison, the number of atoms in the observable universe is estimated to be around 10⁸². It is safe to say that the magnitude of this problem is incomprehensible to us humans.

Thus, reducing the number of parameters would have several benefits. A sparse network is not only smaller, it is faster to train and use. Where hardware is limited, such as in embedded devices or smartphones, speed and size can make or break a model. In addition, more complex models are more prone to overfitting. So, restricting the search space also acts as a regularizer.

However, this is not a simple task, as reducing the model’s capacity can also lead to loss in accuracy. Thus, there is a delicate balance between complexity and performance. In this post, we are going to take a deeper look into what the challenges and potential solutions are.

## Weight pruning

To start off, the simplest problem would be to aim for reducing model parameters *after* training. This would not help with the training itself, but would reduce computational requirements for inference.

The process of eliminating weights is called *pruning*. (I will use weights and parameters interchangeable from now.) Its origins go back to the famous paper *Optimal Brain Damage* by Yann LeCun, John S. Denker and Sara A. Solla.

They proposed the following iterative pruning method.

- Train the model.
- Estimate the
*saliency*of each weight, which they define by the change in the loss function upon perturbing the weight. The smaller the change, the less effect the weight has on the training. - Remove weights with the lowest saliency. (That is, set their value to zero and keep it there for the rest of the process.)
- Go to Step 1. and retrain the pruned model.

Continuing the training with pruned weights is necessary. The authors observed that without it, the objective function (a.k.a. the loss) increases significantly when a large portion of the weights are removed.

One particular challenge arises with this method when the pruned network is retrained. It turned out that due to its decreased capacity, retraining was much more difficult. The solution to this problem arrived later, along with the so-called Lottery Ticket Hypothesis, which put this problem into a whole another perspective.

### The Lottery Ticket Hypothesis

Winning the jackpot of a lottery has very small chances. For example, if you are playing Powerball, you have exactly 1 in 292,201,338 odds to win per ticket. What are your chances if you purchase *n* tickets? So, if the probability of winning is

then the probability of *not winning* is

When we buy *n* tickets, the probability that *none of them win*s is

From this, it follows that the probability of *at least one of them win*s is simply

If *n* is large, this can be arbitrarily close to one.

What does this have to do with neural networks? Before training, the weights of a model are initialized *randomly*. Can it happen that there is a subnetwork of a randomly initialized network which *“won the initialization lottery”*? In their work, Jonathan Frankle and Michael Carbin state the **Lottery Ticket Hypothesis**:

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.

The most significant problem is achieving the same accuracy as the full network with at most the same number of training steps. As we have seen before, this is the biggest challenge of pruning: training smaller networks can be more difficult due to the decreased modelling capacity.

How do we find such initialization lottery winners? Do they even exist? The authors had found a way to consistently identify such subnetworks in certain architectures. Their method was the following.

- Randomly initialize a neural network.
- Train the network for
*n*training steps. - Remove
*k%*of the weights with the lowest magnitude. - Reset the remaining weights to the value they had during the random initialization.
- Go to Step 2. and iterate the training and pruning.

There are two key steps here compared to previous methods. First, the weights are simply removed according to their magnitude. Second, **the weights of the pruned network are not reinitialized but reset to the state after the first initialization**. This had proven to be essential: random reinitialization of the pruned network resulted in significantly worse results when more than 80% of the weights was removed.

However, there are significant limitations to this method. For one, it does not perform well for larger scale problems and architectures. In the original paper, the authors stated that for more complex datasets (like ImageNet) and deeper architectures (like ResNet), the method fails to identify the winners of the initialization lottery.

In general, achieving a good sparsity-accuracy tradeoff is a difficult problem. It is a very active research field and the state of the art keeps moving.

Although this method was a huge improvement, it did not fix one significant issue: the pruned subnetworks still required retraining. Surprisingly, this was solved by turning the entire problem upside down.

## Weight Agnostic Neural Networks

Up until this point, we have started from a large neural network, trimming it down iteratively to compress it. However, there is an alternative approach with the opposite logic: instead of pruning, we could start with a minimal architecture and add to it incrementally.

This was the starting point of Adam Gaier and David Ha in their recent paper Weight Agnostic Neural Networks. In addition to obtaining a minimal network, they had a more ambitious goal: *the resulting architectures should perform well with random weights*. If you recall, the biggest weakness of weight pruning methods was that the resulting subnetworks were difficult to train. This way, no training would be needed.

In essence, their method is an evolutionary search in the space of possible architectures. Compared to previous methods, their search prefers simplicity and weight-agnostic property.

- Create a set of minimal neural network architectures, which
**can be embedded into a single parent architecture**. There are no weight values at this point, just connections. Basically, each network is equivalent to a specific pruning of the parent architecture. - Test out the networks
**with sets of shared weights**. Sharing the weights is essential, since this is what forces the search to prefer weight-agnostic architectures. - Rank the networks according to performance and complexity. Because the weights were shared between the networks, the performance also reflects the weight-agnostic capability.
- Create new networks by taking the top ranking networks and randomly modifying them.
- Take the obtained networks and go to Step 2.

The authors summarize the process in the following figure.

With this, they were able to find architectures performing well with random weights. This is quite an impressive result.

The method works for several machine learning tasks, not just for handwritten digit recognition. They were able to engineer minimal weight-agnostic architectures for several continuous control problems, such as pole balancing and cart racing.

So far, this method brings us the closest to finding minimal architectures which doesn’t require training. Yet, there is much to do. For more complex tasks, like ImageNet classification, finding accurate weight-agnostic networks is still unsolved.

### Exciting new developments

Just as I am writing this, a new paper appeared and claimed to achieve the theoretical maximum compression — without looking at data! In the paper *Pruning neural networks without any data by iteratively conserving synaptic flow* by Hidenori Tanaka et al., their algorithm SynFlow is capable to handle those complex tasks that previous methods were not able to. I am definitely looking forward to learn more about this and dive into the paper!

## Pruning in practice

Up until now, we have talked about only theoretical aspects of pruning. I would like to finish up by collecting some available tools which you can start using right away.

If you are a TensorFlow user, there are several built-in tools for model optimization, including pruning. Here is a guide to the Pruning API!

Analogously, PyTorch has similar functionalities. You can find their tutorial about weight pruning here, but I have also written a quick summary of this and other available tools like model quantization.

## Summary

Machine learning doesn’t stop at training accurate models. When deploying to production, speed and size matters a lot. To fit complex architectures into production constraints, one can reduce the size of the models by pruning certain weights. (Along with many other methods, which we haven’t talked about.)

This is not a simple task, since loss in accuracy must be taken into account. In addition, smaller architectures require retraining, which can be more difficult due to the decreased modelling capacity. Here, we have reviewed two weight pruning methods and a complementary architecture search approach which progressively address these challenges.

However, this is just the tip of the iceberg. Weight pruning is an active research area, especially for more complex tasks, like large-scale image recognition. (Think ImageNet size.)

Pruning methods are available in modern deep learning frameworks such as TensorFlow and PyTorch, so you can start experimenting with them right away in your work, perhaps even attempting to push the boundaries of the state of the art. So, if you have found this subject interesting, go and put these awesome methods into practice 🙂

## References

**[1]** *Optimal Brain Damage* by Yann LeCun, John S. Denker and Sara A. Solla

**[2]** *The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks* by Jonathan Frankle and Michael Carbin

**[3]** *Weight Agnostic Neural Networks* by Adam Gaier and David Ha

**[4]** *Pruning neural networks without any data by iteratively conserving synaptic flow** by *Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins and Surya Ganguli