Neural networks and deep learning are not recent methods. In fact, they are quite old. Perceptrons, the first neural networks, were created in 1958 by Frank Rosenblatt. Even the ubiquitous building blocks of deep learning architectures were mostly invented near the end of the 20th century. For example, convolutional networks were introduced in 1989 in the landmark paper "Backpropagation Applied to Handwritten Zip Code Recognition" by Yann LeCun et al.
Why did the deep learning revolution have to wait decades?
One major reason was the computational cost. Even the smallest architectures can have dozens of layers and millions of parameters, so repeatedly calculating gradients during training is computationally expensive. On large enough datasets, training used to take days or even weeks. Nowadays, you can train a state-of-the-art model on your notebook in a few hours.
There were three major advances which brought deep learning from a research tool to a method present in almost all areas of our life. These are backpropagation, stochastic gradient descent and GPU computing. In this post, we are going to dive into the last of these and see that neural networks are actually embarrassingly parallel algorithms, which can be leveraged to reduce computational costs by orders of magnitude.
A big pile of linear algebra
Deep neural networks may seem complicated at first glance. However, if we zoom into them, we can see that their components are pretty simple in most cases. As the always brilliant xkcd puts it, a network is (mostly) a pile of linear algebra.
During training, the most commonly used functions are the basic linear algebra operations such as matrix multiplication and addition. The situation is simple: if you call a function a bazillion times, shaving off just the tiniest amount of the time from the function call can compound to a serious amount.
GPUs do not just provide a small improvement here; they supercharge the entire process. To see how it is done, let’s consider activations for instance.
Suppose that φ is an activation function such as ReLU or Sigmoid. Applied to the output of the previous layer
x = (x₁, x₂, …, xₙ),
the result is
φ(x) = (φ(x₁), φ(x₂), …, φ(xₙ)).
(The same goes for multidimensional input such as images.)
This requires looping over the vector and calculating the value for each element. There are two ways to make this computation faster. First, we can calculate each φ(xᵢ) faster. Second, we can calculate the values φ(x₁), φ(x₂), …, φ(xₙ) simultaneously, in parallel. In fact, this is embarrassingly parallel, which means that the computation can be parallelized without any significant additional effort.
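To make the two options concrete, here is a minimal sketch in Python with NumPy. The function names `relu_loop` and `relu_vectorized` are made up for this illustration. The first visits the elements one by one, the second expresses the whole activation as a single elementwise operation, which is the form a parallel device can exploit. Note that NumPy itself runs on the CPU; the point is only that no element depends on any other.

```python
import numpy as np

def relu_loop(x):
    """Apply ReLU one element at a time, as a sequential loop would."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = max(x[i], 0.0)
    return out

def relu_vectorized(x):
    """Apply ReLU to every element in a single array operation."""
    return np.maximum(x, 0.0)

# A small example vector standing in for the output of the previous layer.
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu_loop(x))
print(relu_vectorized(x))
```

Both functions return the same values. Since each output element is computed from exactly one input element, the work can be split among as many processors as are available.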
Over the years, doing things faster became much more difficult. Processor clock speeds used to increase rapidly from year to year, but they have plateaued. Modern processor design has reached a point where packing more transistors into the units runs into quantum-mechanical barriers.
However, calculating the values in parallel does not require faster processors, just more of them. This is how GPUs work, as we are going to see.
The principles of GPU computing
Graphics Processing Units, or GPUs for short, were developed to create and process images. Since the value of every pixel can be calculated independently of the others, it is better to have a lot of weaker processors than a single very strong one doing the calculations sequentially.
This is the same situation we have for deep learning models. Most operations can be easily decomposed into parts which can be completed independently.
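As an illustration of such a decomposition, consider a matrix-vector product, the core of a dense layer. The sketch below uses only the Python standard library, and the helper names (`dot`, `matvec_sequential`, `matvec_parallel`) are hypothetical. Each output entry depends on a single row of the matrix, so the rows can be handed to separate workers. Python threads are used here only to show the structure; they are not expected to speed up this pure Python code.

```python
from concurrent.futures import ThreadPoolExecutor

def dot(row, vector):
    """Dot product of one matrix row with the vector."""
    return sum(a * b for a, b in zip(row, vector))

def matvec_sequential(matrix, vector):
    """Compute each output entry one after another."""
    return [dot(row, vector) for row in matrix]

def matvec_parallel(matrix, vector, workers=4):
    """Hand each row to a worker; no row needs the result of another row."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: dot(row, vector), matrix))

# A tiny 3x2 weight matrix and an input vector, chosen only for illustration.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
print(matvec_sequential(W, x))
print(matvec_parallel(W, x))
```

Both versions produce the same result, because the order in which the rows are processed does not matter.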
To give you an analogy, let’s consider a restaurant that has to produce French fries on a massive scale. To do this, workers must peel, slice and fry the potatoes. Hiring people to peel the potatoes costs much more than purchasing kitchen robots capable of performing this task. Even if the robots are slower, you can buy many more of them with the same budget, so overall the process will be faster.
Modes of parallelism
When talking about parallel programming, one can classify computing architectures into four different classes. This classification was introduced by Michael J. Flynn in 1966 and has been in use ever since.
- Single Instruction, Single Data (SISD)
- Single Instruction, Multiple Data (SIMD)
- Multiple Instructions, Single Data (MISD)
- Multiple Instructions, Multiple Data (MIMD)
A multi-core processor is MIMD, while GPUs are SIMD. Deep learning is a problem for which SIMD is very well suited. When you calculate the activations, the exact same operation needs to be performed, with different data for each call.
Latency vs throughput
To give a more detailed picture of what makes a GPU better than a CPU for this kind of work, we need to take a look at latency and throughput. Latency is the time required to complete a single task, while throughput is the number of tasks completed per unit time.
Simply put, a GPU can provide much better throughput, at the cost of latency. For embarrassingly parallel tasks such as matrix computations, this can offer an order of magnitude improvement in performance. However, it is not well suited for complex tasks, such as running an operating system.
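The trade-off can be sketched with some back-of-the-envelope arithmetic. The numbers below are hypothetical and do not describe any real hardware: one device has a few fast workers, the other has many slow ones. The helper `total_time` is also made up for this illustration and assumes that all tasks are independent and take the same time.

```python
import math

def total_time(n_tasks, latency, n_workers):
    """Time to finish n_tasks when each worker handles one task per `latency` time units."""
    rounds = math.ceil(n_tasks / n_workers)
    return rounds * latency

# Hypothetical devices, not measurements of real hardware.
few_fast = {"latency": 1, "n_workers": 4}       # low latency, low throughput
many_slow = {"latency": 10, "n_workers": 1000}  # high latency, high throughput

for n_tasks in (1, 100_000):
    t_fast = total_time(n_tasks, **few_fast)
    t_slow = total_time(n_tasks, **many_slow)
    print(f"{n_tasks} tasks: few fast workers {t_fast}, many slow workers {t_slow}")
```

Under these assumptions, a single task finishes ten times sooner on the low-latency device (1 versus 10 time units), but 100,000 independent tasks finish 25 times sooner on the high-throughput one (1,000 versus 25,000 time units).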
CPUs, on the other hand, are optimized for latency, not throughput. They can do much more than floating point calculations.
General purpose GPU programming
In practice, general purpose GPU programming was not available for a long time. GPUs were restricted to graphics, and if you wanted to leverage their processing power, you needed to learn graphics APIs such as OpenGL. This was not very practical and the barrier to entry was high.
This was the case until 2007, when NVIDIA launched the CUDA framework, an extension of C, which provides an API for GPU computing. This significantly flattened the learning curve for users. Fast forward a few years: modern deep learning frameworks use GPUs without us explicitly knowing about it.
GPU computing for deep learning
So, we have talked about how GPU computing can be used for deep learning, but we haven’t seen the effects. The following table shows a benchmark, which was made in 2017. Although it was made three years ago, it still demonstrates the order of magnitude improvement in speed.
How modern deep learning frameworks use GPUs
Programming directly in CUDA and writing kernels by yourself is not the easiest thing to do. Thankfully, modern deep learning frameworks such as TensorFlow and PyTorch don’t require you to do that. Behind the scenes, the computationally intensive parts are written in CUDA, using NVIDIA’s deep learning library cuDNN. These are called from Python, so you don’t need to use them directly at all. Python is really strong in this aspect: it can be combined with C easily, which gives you both the power and the ease of use.
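As a rough sketch of what this looks like from the user’s side, the PyTorch snippet below computes one dense layer with a ReLU activation. It assumes PyTorch is installed, and the tensor sizes are hypothetical. The code falls back to the CPU when no CUDA device is visible, so it should run on either kind of machine.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical layer sizes chosen only for illustration.
x = torch.randn(64, 128, device=device)  # a batch of 64 inputs
w = torch.randn(128, 32, device=device)  # weights of one dense layer

# The same line runs on either device. When the tensors live on a GPU,
# the matrix product and the activation are executed there, without any
# CUDA code on our side.
y = torch.relu(x @ w)
print(y.shape, y.device)
```

The only GPU-specific part is the choice of `device`; the computation itself is written once.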
This is similar to how NumPy works behind the scenes: it is blazing fast because its functions are written directly in C. If you are interested, we have written an article where we explain how to get the most out of NumPy!
Do you need to build a deep learning rig?
If you want to train deep learning models on your own, you have several choices. First, you can build a GPU machine for yourself. However, this can be a significant investment. Thankfully, you don’t need to do that: cloud providers such as Amazon and Google offer remote GPU instances to work on. If you want to access resources for free, check out Google Colab, which offers free access to GPU instances.
Deep learning is computationally very intensive. For decades, training neural networks was limited by hardware. Even relatively small models had to be trained for days, and training large architectures on huge datasets was impossible.
However, with the appearance of general purpose GPU programming, deep learning exploded. GPUs excel at parallel computation, and since deep learning algorithms can be parallelized very efficiently, GPUs can accelerate training and inference by several orders of magnitude.
This has opened the way for rapid growth. Now, even relatively cheap commercially available computers can train state of the art models. Combined with the amazing open source tools such as TensorFlow and PyTorch, people are building awesome things every day. This is truly a great time to be in the field.