Not surprisingly, several groundbreaking results in machine learning and artificial intelligence originated from competitions. In this article, we highlight the most notable cases where open challenges inspired significant advances.

For many years, the ImageNet competition was one of the main driving forces behind the innovation in computer vision. Remember when convolutional networks first exploded in popularity? It was because of the annual ImageNet competition.

In short, the goal of the annual ImageNet Large Scale Visual Recognition Challenge, held between 2010 and 2017, was to build an accurate classifier for a massive dataset: more than 1.4 million images, categorized into 1000 classes. Before the challenge series launched, achieving even acceptable performance on such a complex task was a formidable feat.

However, 2012 was a turning point. In their landmark paper titled ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton introduced a novel architecture (later named *AlexNet*) that represented a hyperspace jump in performance. Its top-5 error was as low as 15.3%, a whopping 10.8 percentage point improvement over the runner-up. Below, you can see the results visualized between 2011 and 2016.

AlexNet was the first convolutional network to win the competition, and the first to shed light on what such architectures are capable of. After this landmark submission, all subsequent winners utilized the power of convolutional networks.

The competition served as a benchmark throughout its run, with many famous entries such as VGG, GoogLeNet, and ResNet.

By 2017, the ImageNet challenge was considered solved. Most entries in the final competition reached 95% accuracy, a threshold previously considered extremely difficult. After its great successes, the challenge was discontinued in this form; however, the organizers announced that it would return in a renewed form, focusing on 3D vision.

For a long time, the protein folding problem was the Holy Grail of bioinformatics. Predicting the three-dimensional protein structure from the sequence of its amino acids is an extremely complex task, involving the deep understanding of thermodynamics and the interactions between molecules.

AlphaFold by Google DeepMind solved this 50-year-old problem. After decades of painfully slow progress, the improvement over the previous state of the art was unprecedented. In the words of John Moult, quoted in the AlphaFold announcement post,

“We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment.”

John Moult, Co-founder and Chair of CASP, University of Maryland

The method debuted in the 13th Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition, which was a driving force of development in the field.

Like AlexNet, AlphaFold ended up revolutionizing a field. Without CASP, this would probably have happened much later.

The four most common tasks in computer vision are, in increasing difficulty,

- classification,
- detection,
- semantic segmentation,
- instance segmentation.

Instance segmentation, which requires identifying precisely which pixels belong to an object and which category the object belongs to, was an insurmountable challenge for a long time. COCO, short for Common Objects in Context, aimed to tackle that. Since 2015, when the dataset was published and the competition launched, the average precision has nearly doubled. Year after year, top-tier groups from places like Facebook, Alibaba, and Microsoft compete with each other to push the state of the art even further.

One notable submission for this challenge was the Mask R-CNN, published by Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick from Facebook AI (FAIR). For a while, it represented the pinnacle of region-based convolutional networks, with hundreds of applications throughout the spectrum. For example, part of our team used Mask R-CNN to develop a powerful method for cell nuclei segmentation with microscopy.

Competitive programming has been a significant part of computer science for a long time. However, with the rise of machine learning, such competitions have gained a surge in popularity. In essence, teamwork and competition can drive brilliant minds to develop ingenious solutions to various challenges. A fixed-term competition is a distillation of how research and development work on a larger timescale.

During the past decade, open machine learning competitions have been a significant driving force of development. To see this, it is enough to take a look at the ImageNet challenge, where AlexNet premiered in 2012, single-handedly popularizing convolutional neural networks in computer vision.

Besides pushing the state of the art, competitions can serve other purposes with great success. Companies often organize one to supercharge development, but competitions can lead to great results in classroom settings as well.

For us, the lesson is clear: if you want to solve interesting and hard problems, go participate in a competition. On the other hand, if you want to accelerate progress in a field, organize an open challenge.

Well, GPT-3 has around 175 billion parameters.

To give you a perspective on how large this number is, consider the following. Imagine one dollar per parameter, paid out in $100 bills; a $100 bill is approximately 6.14 inches wide. If you start laying the bills down right next to each other, the line will stretch 169,586 miles. For comparison, Earth’s circumference is 24,901 miles, measured along the equator. So, the line would circle the Earth about 6.8 times before we ran out of money.
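As a quick sanity check, here is the back-of-the-envelope arithmetic in Python (assuming one dollar per parameter, paid out in $100 bills):

```python
PARAMS = 175_000_000_000      # GPT-3 parameter count
BILL_WIDTH_IN = 6.14          # width of a $100 bill in inches
INCHES_PER_MILE = 63_360
EQUATOR_MILES = 24_901

bills = PARAMS / 100          # one dollar per parameter, in $100 bills
line_miles = bills * BILL_WIDTH_IN / INCHES_PER_MILE
trips = line_miles / EQUATOR_MILES
print(f"{line_miles:,.0f} miles, about {trips:.1f} laps around the equator")
# → 169,586 miles, about 6.8 laps around the equator
```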

Unfortunately, as opposed to money, more is sometimes not better when it comes to the number of parameters. Sure, more parameters seem to mean better results, but they also mean massive costs. According to the original paper, training GPT-3 required 3.14 × 10²³ floating-point operations, and the compute cost alone runs into the millions of dollars.

GPT-3 is so large that it cannot be easily moved to other machines. It is currently accessible through the OpenAI API, so you can’t just clone a GitHub repository and run it on your computer.

However, this is just the tip of the iceberg. Deploying much smaller models can also present a significant challenge for machine learning engineers. In practice, small and fast models are much better than cumbersome ones.

Because of this, researchers and engineers have put significant energy into compressing models. Out of these efforts, several methods have emerged to deal with the problem.

If we revisit GPT-3 for a minute, we can see how the number of parameters and the training time influence the performance.

The trend seems clear: more parameters lead to better performance and higher computational costs. The latter not only impacts the training time but the server costs and the environmental effects as well. (Training large models can emit more CO2 than a car in its entire lifetime.) However, training is only the first part of the life cycle of a neural network. In the long run, inference costs take over.

To optimize these costs by compressing the models, three main methods have emerged:

- weight pruning,
- quantization,
- knowledge distillation.

In this article, my goal is to introduce you to these and give an overview of how they work.

Let’s get started!

One of the oldest methods for reducing a neural network’s size is *weight pruning*, eliminating specific connections between neurons. In practice, elimination means that the removed weight is replaced with zero.

At first glance, this idea might be surprising. Wouldn’t this eliminate the knowledge learned by the neural network?

Sure, removing all of the connections would undoubtedly result in losing all that is learned. On the other end of the spectrum, pruning only a single connection probably wouldn’t decrease accuracy at all.

The question is, how much can you remove until the predictive performance starts to suffer?

The first to study this question were Yann LeCun, John S. Denker, and Sara A. Solla, in their 1990 paper *Optimal Brain Damage*. They developed the following iterative method.

- Train a network.
- Estimate the importance of each weight by watching how the loss would change upon perturbing the weight. A smaller change means less importance. (This importance is called the *saliency*.)
- Remove the weights with low importance.
- Go back to Step 1 and retrain the network, permanently fixing the removed weights to zero.

During their experiments with pruning the LeNet for MNIST classification, they found that a significant portion of the weights can be removed without a noticeable increase in the loss.

However, retraining was necessary after pruning. This proved to be quite tricky, since *a smaller model means a smaller capacity*. Besides, as mentioned above, training accounts for a significant portion of the computational costs; this compression only helps at inference time.

Is there a method requiring less post-pruning training, but still reaching the unpruned model’s predictive performance?

One essential breakthrough came in 2018 from researchers at MIT. In their paper titled *The Lottery Ticket Hypothesis*, Jonathan Frankle and Michael Carbin stated their hypothesis:

A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.

Such subnetworks are called *winning lottery tickets*. To see why, suppose that you buy 10¹⁰⁰⁰ lottery tickets. (This is more than the number of atoms in the observable universe, but we’ll let that one slide.) Because you have so many, there is only a tiny probability that none of them is a winner; almost surely, at least one ticket wins. This is similar to training a neural network, where we randomly initialize the weights.

**If this hypothesis is true**, and such subnetworks can be found, training could be done much faster and cheaper, since a single iteration step would take less computation.

The question is, does the hypothesis hold, and if so, how can we find such subnetworks? The authors proposed the following iterative method.

- Randomly initialize the network and store the initial weights for later reference.
- Train the network for a given number of steps.
- Remove a percentage of the weights with the lowest magnitude.
- Restore the remaining weights to their values from the initial random initialization.
- Go to Step 2. and iterate the pruning.
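The prune-and-rewind step at the heart of this loop can be sketched in a few lines of NumPy. This is a minimal illustration (the function and variable names are mine, not the paper’s); the training between pruning rounds is elided.

```python
import numpy as np

def prune_and_rewind(trained_w, init_w, mask, fraction=0.2):
    """Zero out the lowest-magnitude surviving weights, rewind the rest.

    trained_w: weights after training; init_w: stored initial weights;
    mask: boolean array marking weights that are still alive.
    """
    alive = np.abs(trained_w[mask])
    k = int(alive.size * fraction)                 # how many weights to remove
    threshold = np.sort(alive)[k]
    new_mask = mask & (np.abs(trained_w) >= threshold)
    # Surviving weights are reset to their original initialization
    return np.where(new_mask, init_w, 0.0), new_mask
```

In a full loop, you would retrain the rewound weights and call `prune_and_rewind` again, shrinking the mask each round.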

On simple architectures trained on simple datasets, such as LeNet on MNIST, this method offered significant improvement, as shown in the figure below.

However, although it showed promise, it did not perform well on more complex architectures like ResNets. Moreover, pruning still happens after training, which is a significant problem.

The most recent algorithm to prune before training was published in 2020. (Which is the year I am writing this.) In their paper, Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli from Stanford developed a method that goes much further and does the pruning *without training*.

First, they introduce the concept of *layer collapse*,

the premature pruning of an entire layer making a network untrainable,

which plays a significant part in the theory. Any pruning algorithm should avoid layer collapse. The hard part is identifying a class of algorithms that satisfies this criterion.

For this purpose, the authors introduce the *synaptic saliency score* for a given weight in the network, defined by

$$\mathcal{S}(w) = \frac{\partial \mathcal{L}}{\partial w}\, w,$$

where *L* is the loss function given by the network’s output, and *w* is a weight parameter. Each neuron *conserves this quantity*: under certain constraints on the activation functions, the sum of incoming synaptic saliency scores equals the sum of outgoing synaptic saliency scores.

This score is used to select which weights are pruned. (Recall that for this purpose, the *Optimal Brain Damage* method used a perturbation-based quantity, while the authors of the Lottery Ticket Hypothesis paper used the magnitude.)

It turns out that synaptic saliency scores are conserved between layers, and roughly speaking, if an iterative pruning algorithm respects this layer-wise conservation, layer collapse can be avoided.

The SynFlow algorithm is an iterative pruning algorithm similar to the previous ones, but the selection is based on the synaptic saliency scores.
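Roughly speaking, the score multiplies each weight by the gradient of the objective with respect to it. Below is a hypothetical finite-difference sketch of computing such scores; the actual implementation uses backpropagation, and SynFlow scores a data-free objective rather than the training loss.

```python
import numpy as np

def saliency_scores(W, loss_fn, eps=1e-6):
    """Estimate S_ij = (dL/dW_ij) * W_ij with central finite differences."""
    S = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        up = loss_fn(W)
        W[idx] = orig - eps
        down = loss_fn(W)
        W[idx] = orig                        # restore the weight
        S[idx] = (up - down) / (2 * eps) * orig
    return S
```

For a quadratic loss L = Σ w², the score of each weight comes out as 2w², matching the analytic gradient 2w times the weight itself.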

However, the work is far from done. As Jonathan Frankle and co-authors point out in their recent paper, there is no universal state-of-the-art solution: each method shines in specific scenarios but is outperformed in others. Moreover, although pre-training pruning methods outperform the random pruning baseline, they still don’t perform as well as some post-training algorithms, especially magnitude-based pruning.

Pruning is available both in TensorFlow and PyTorch.

Next, we are going to take a look at another tool for neural network compression: *quantization*.

In essence, a neural network is just a bunch of linear algebra and some other operations. By default, most systems use *float32* types to represent the variables and weights.

However, in general, computations in other formats such as *int8* can be faster than in *float32*, with a smaller memory footprint. (Of course, this depends on the hardware, but we are not trying to be overly specific here.)

*Neural network quantization* is the suite of methods aiming to take advantage of this. For instance, if we would like to go from *float32* to *int8* as mentioned, and our values are in the range *[-a, a]* for some real number *a*, we could use the transformation

$$x_{\mathrm{int8}} = \mathrm{round}\!\left(\frac{127}{a}\, x_{\mathrm{float32}}\right)$$

to convert the weights and proceed with the computations in the new form.

Of course, things are not that simple. Multiplying two *int8* numbers can easily overflow the *int8* range, requiring *int16* to hold the product, and so on. During quantization, care must be taken to avoid errors due to this.
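To make this concrete, here is a minimal sketch of symmetric int8 quantization, with products accumulated in a wider integer type to sidestep the overflow issue. This is an illustration under the [-a, a] assumption above, not a production scheme.

```python
import numpy as np

def quantize(x, a):
    """Map float values in [-a, a] to int8 in [-127, 127]."""
    return np.clip(np.round(x * 127.0 / a), -127, 127).astype(np.int8)

def dequantize(q, a):
    """Map int8 values back to approximate floats."""
    return q.astype(np.float32) * (a / 127.0)

def int8_dot(q1, q2):
    """Dot product of int8 vectors, accumulated in int32 to avoid overflow."""
    return np.dot(q1.astype(np.int32), q2.astype(np.int32))
```

Round-tripping through `quantize` and `dequantize` loses at most half a quantization step per value, which is the information loss the trade-off discussion below refers to.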

As with all compression methods, this comes with a loss of information and possibly predictive performance. The problem is the same as before: to find an optimal trade-off.

Quantization has two primary flavors: *post-training quantization* and *quantization-aware training*. The former is more straightforward but can result in more significant accuracy loss than the latter.

As you can see in the table above, this can cut the inference time in half in some instances. However, converting from *float32* to *int8* is not a smooth transformation; thus, it can lead to suboptimal results when the gradient landscape is wild.

With quantization-aware training, this method has the potential to improve training time as well.

Similarly to weight pruning, quantization is also available both in TensorFlow and PyTorch.

At the time of writing, the feature is experimental in PyTorch, meaning it is subject to change; you should expect breaking changes in upcoming versions.

So far, the methods we have seen share the same principle: train the network and discard some information to compress it. As we will see, the third one, *knowledge distillation*, differs from these significantly.

Although quantization and pruning can be effective, they are destructive in the end. An alternative approach was developed by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their paper Distilling the Knowledge in a Neural Network.

Their idea is simple: train a big model (*teacher*) to achieve top performance and use its predictions to train a smaller one (*student*).

Their work showed that this way, large ensemble models can be compressed into simpler architectures, more suitable for production.

**Knowledge distillation improves the inference time of the distilled models, not the training time.** This is an essential distinction from the other two methods, since training time often carries a high cost. (If we think back to the GPT-3 example, it was millions of dollars.)

You might ask, why not just use a compact architecture from the start? The secret sauce is to teach the student model to generalize like the teacher by using its predictions. Here, the student model not only sees the training data for the big one, but new data as well, where it is fitted to approximate the output of the teacher.

The smaller a model is, the more training data it needs to generalize well. Thus, it might require a complex architecture such as an ensemble model to reach the state of the art performance on challenging tasks. Still, its knowledge can be used to push the student model’s performance beyond the baseline.

One of the first use cases for knowledge distillation was compressing ensembles and making them suitable for production. Ensembles were notorious in Kaggle competitions: several winning models were composed of many smaller ones, offering outstanding results but being unusable in practice.

Besides the baseline distillation approach by Hinton et al., there are several other ones, trying to push the state of the art. If you would like to get an overview of those, I recommend the survey paper by Jianping Gou et al.

Since knowledge distillation does not require the manipulation of weights like pruning or quantization, it can be performed in any framework of your choice.

Here are some examples to get you started!

As neural networks get larger and larger, compressing the models is becoming ever more critical. As the complexity of the problems and architectures increases, so does the computational cost and the environmental impact.

This trend only seems to accelerate: GPT-3 contains 175 billion parameters, a roughly 10× jump compared to the previous largest models. Thus, compressing these networks is a fundamental problem, one that will become even more important in the future.

Are you ready to tackle this challenge?

To illustrate the point, this is the number of parameters for the most common architectures in NLP, as summarized in the recent State of AI Report 2020 by Nathan Benaich and Ian Hogarth.

In Kaggle competitions, the winner models are often ensembles, composed of several predictors. Although they can beat simple models by a large margin in terms of accuracy, their enormous computational costs make them utterly unusable in practice.

Is there any way to somehow leverage these powerful but massive models to train state of the art models, without scaling the hardware?

Currently, there are three main methods out there to compress a neural network while preserving the predictive performance:

- *weight pruning*,
- *quantization*,
- and *knowledge distillation*.

In this post, my goal is to introduce you to the fundamentals of *knowledge distillation*, which is an incredibly exciting idea, building on training a smaller network to approximate the large one.

Let’s imagine a very complex task, such as image classification with thousands of classes. Often, you can’t just slap on a ResNet50 and expect it to achieve 99% accuracy. So, you build an ensemble of models, balancing out the flaws of each one. Now you have a huge model which, although it performs excellently, cannot be deployed into production to serve predictions in a reasonable time.

However, the model generalizes pretty well to the unseen data, so it is safe to trust its predictions. (I know, this might not be the case, but let’s just roll with the thought experiment for now.)

What if we use the predictions from the large and *cumbersome* model to train a smaller, so-called *student* model to approximate the big one?

This is knowledge distillation in essence, which was introduced in the paper Distilling the Knowledge in a Neural Network by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean.

In broad strokes, the process is the following.

- Train a large model that performs and generalizes very well. This is called the *teacher model*.
- Take all the data you have, and compute the predictions of the teacher model. The total dataset with these predictions is called the *knowledge*, and the predictions themselves are often referred to as *soft targets*. This is the *knowledge distillation* step.
- Use the previously obtained knowledge to train the smaller network, called the *student model*.

To visualize the process, you can think of the following.

Let’s focus on the details a bit. How is the knowledge obtained?

In classifier models, the class probabilities are given by a *softmax* layer, converting the *logits* to probabilities:

$$p_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)},$$

where the $z_i$ are the logits produced by the last layer. Instead of these, a slightly modified version is used:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$

where *T* is a hyperparameter called the *temperature*. These values are called *soft targets*.

If *T* is large, the class probabilities are “softer”, that is, they will be closer to each other. In the extreme case, when *T* approaches infinity,

$$\lim_{T \to \infty} p_i = \frac{1}{K},$$

where $K$ is the number of classes, so the distribution becomes uniform.

If *T = 1*, we obtain the standard softmax function. For our purposes, the temperature is set higher than 1, hence the name *distillation*.
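A minimal sketch of the softened softmax (illustrative; it subtracts the maximum logit for numerical stability):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits softened by temperature T (T=1 is standard softmax)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

Raising `T` visibly flattens the distribution: the same logits produce probabilities that are closer to each other, and as `T` grows very large, the output approaches uniform.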

Hinton, Vinyals, and Dean showed that a distilled model can perform as well as an ensemble composed of 10 large models.

You might ask, why not train a smaller network from the start? Wouldn’t it be easier? Sure, but it wouldn’t necessarily work.

Empirical evidence suggests that more parameters result in better generalization and faster convergence. For instance, this was studied by Sanjeev Arora, Nadav Cohen, and Elad Hazan in their paper On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization.

For complex problems, simple models have trouble learning to generalize well on the given training data. However, we have much more than the training data: the teacher model’s predictions for all the available data.

This benefits us in two ways.

First, the teacher model’s knowledge can teach the student model how to generalize via available predictions outside the training dataset. Recall that we use the teacher model’s predictions **for all available data** to train the student model, instead of the original training dataset.

Second, the soft targets provide more useful information than class labels alone: **they indicate whether two classes are similar to each other**. For instance, if the task is to classify dog breeds, information like *“Shiba Inu and Akita are very similar”* is extremely valuable for model generalization.
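Putting the pieces together, the student’s training objective is typically a weighted combination of a soft-target term and the ordinary cross-entropy on the true labels. Here is a hypothetical NumPy sketch; the weighting `alpha` and temperature values are illustrative.

```python
import numpy as np

def _softmax(z, T=1.0):
    z = np.asarray(z, dtype=np.float64) / T
    z -= z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label, T=4.0, alpha=0.9):
    """Weighted sum of a soft-target term and an ordinary cross-entropy term."""
    # Soft part: cross-entropy between softened teacher and student distributions
    soft = -np.sum(_softmax(teacher_logits, T) * np.log(_softmax(student_logits, T) + 1e-12))
    # Hard part: cross-entropy against the ground-truth label
    hard = -np.log(_softmax(student_logits)[true_label] + 1e-12)
    # Hinton et al. suggest scaling the soft term by T**2 to balance gradient magnitudes
    return alpha * (T ** 2) * soft + (1 - alpha) * hard
```

A student whose logits mimic the teacher’s incurs a lower loss than one that disagrees with it, which is exactly the pressure that transfers the teacher’s “dark knowledge”.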

As noted by Hinton et al., one of the earliest attempts to compress models by transferring knowledge was to reuse some layers of a trained ensemble, as done by Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil in their 2006 paper titled Model compression.

In the words of Hinton et al.,

“…we tend to identify the knowledge in a trained model with the learned parameter values and this makes it hard to see how we can change the form of the model but keep the same knowledge. A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.”

— Distilling the Knowledge in a Neural Network

Thus, knowledge distillation doesn’t use the learned weights directly, as opposed to transfer learning.

If you want to compress the model even further, you can try using even simpler models like decision trees. Although they are not as expressive as neural networks, their predictions can be explained by looking at the nodes individually.

This was done by Nicholas Frosst and Geoffrey Hinton, who studied this in their paper Distilling a Neural Network Into a Soft Decision Tree.

They showed that distilling indeed helped a little, although even simpler neural networks have outperformed them. On the MNIST dataset, the distilled decision tree model achieved 96.76% test accuracy, which was an improvement from the baseline 94.34% model. However, a straightforward two-layer deep convolutional network still reached 99.21% accuracy. Thus, there is a trade-off between performance and explainability.

So far, we have only seen theoretical results instead of practical examples. To change this, let’s consider one of the most popular and useful models in recent years: BERT.

Originally published in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin et al. from Google, it soon became widely used for various NLP tasks like document retrieval or sentiment analysis. It was a real breakthrough, pushing state of the art in several fields.

There is one issue, however. BERT contains ~110 million parameters and takes a lot of time to train: the authors reported that training required 4 days on 4 Cloud TPUs (16 TPU chips in total). Calculated with the currently available TPU pricing per hour, training would cost around 10,000 USD, not to mention environmental costs like carbon emissions.

One successful attempt to reduce the size and computational cost of BERT was made by Hugging Face. They used knowledge distillation to train DistilBERT, which is 60% the original model’s size while being 60% faster and keeping 97% of its language understanding capabilities.

The smaller architecture requires much less time and computational resources: 90 hours on 8 16GB V100 GPUs.

If you are interested in more details, you can read the original paper, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, or the summarizing article written by one of the authors. It is a fantastic read, so I strongly recommend it!

**Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT**

Knowledge distillation is one of the three main methods to compress neural networks and make them suitable for less powerful hardware.

Unlike weight pruning and quantization, the other two powerful compression methods, knowledge distillation does not reduce the network directly. Rather, it uses the original model to train a smaller one called the *student model*. Since the teacher model can provide its predictions even on unlabelled data, the student model can learn how to generalize like the teacher.

Here, we have looked at two key results: the original paper, which introduced the idea, and a follow-up, showing that simple models such as decision trees can be used as student models.

If you are interested in a broader overview of the field, I recommend the paper Knowledge Distillation: A Survey by Jianping Gou et al.!

**This post is the third one in the series about model compression. If you are interested in other techniques, check out the following articles!**

How to accelerate and compress neural networks with quantization

Can you remove 99% of a neural network without losing accuracy?

However, this is only a small part of the entire machine learning pipeline. As an engineer or data scientist, your task rarely begins and ends with method development. Rather, most time is spent on *data engineering* and *model serving infrastructure management*.

As the community of professionals realized this, increasingly large efforts were put into managing machine learning operations throughout the entire life cycle. Thus, by analogy with DevOps, the field of MLOps slowly emerged.

During the evolution of a technical field, its accessibility goes through three major phases. First, upon the field’s inception, knowledge is not readily available unless you are at the forefront of the efforts. Second, the first textbooks are written and courses are created, but best practices are still unclear and information is scattered across several places. Finally, the field reaches a certain level of maturity and becomes part of a standard curriculum. Deep learning and machine learning are already there.

However, MLOps is still in the second phase. There are several great learning resources out there, but it can take quite a while to find and filter them. This post aims to do this work for you: we are going to take a look into three of the best places to learn the fundamentals of MLOps.

Let’s get started!

Originally taught as a boot camp at Berkeley, the Full Stack Deep Learning course has become one of the most comprehensive introductions to the more practical side of machine learning.

Recently, they have made the entire lecture series available online, along with the projects.

Instead of the theory and model training, their curriculum contains the following lectures:

- **Setting up machine learning projects**
- **Infrastructure and tooling**
- **Data management**
- **Machine learning teams**
- **Training and debugging**
- **Testing and deploying**

Overall, this is the best introduction to the field in my opinion. The material runs wide rather than deep, but by the end, you’ll realize how vast MLOps is and how much you don’t know.

The Machine Learning Engineering book, written by Andriy Burkov, perfectly complements the Full Stack Deep Learning course. The book is distributed according to the “read first, buy later” principle: if it provides you value, you can support the author by purchasing it.

Instead of going into the toolkit of MLOps, the book offers more of a *“theory of the practice”* approach, providing an overview of the problems, questions, and best practices of machine learning projects.

If you are interested, you should check out The Hundred-Page Machine Learning Book, which is a more theory-focused reading from the same author.

You have probably encountered the concept before, but if this is the first time: an Awesome list is a thematically curated catalog of resources, hosted as a GitHub repository containing only a README file.

In our case, two very useful lists are the Awesome MLOps and the Awesome Production Machine Learning. While the former focuses on learning resources, the latter complements it with an emphasis on tooling.

These lists are useful when you already have a comprehensive view of the MLOps field and you would like to specialize in a given subdomain, such as model serving and monitoring.

As you can see, managing machine learning projects throughout their entire life cycle is astoundingly complex. However, with this knowledge under your belt, you’ll be ready to tackle many of its challenges. So, go and learn awesome things!

If you are interested in MLOps, check out our other articles about the topic!

Linus Pauling, the only scientist who won two unshared Nobel Prizes and one of the greatest chemists of all time, was a well-organized person. Among other things, he was known for meticulously keeping notebooks, containing his experiments, conclusions, and ideas.

During his life’s work, he left 46 research notebooks, which is an impressive number.

Pauling was not the only scientist who did this: Charles Darwin, Alexander Graham Bell, Thomas Edison, and practically every scientist before and after them kept notebooks as well.

Notebooks provide an excellent tool to help reproduce experiments, formulate hypotheses, and draw conclusions. When an experiment has several sensitive parameters (like humidity, temperature, and light conditions for a plant biologist), reproducing results is impossible without keeping track.

Fast forward to the present day.

Although the tools have evolved, experimental scientists still do this. The notebooks may be digital, but the point is the same.

You may ask, what does this have to do with data science or machine learning? Let me explain. Does the following folder look familiar?

```
models/
├── model_v1
├── model_v1_final
├── model_v1_final_l1_regularization
├── model_v2_adam_final
├── model_v2_final
├── model_v2_final_finetuning
└── model_v2_final_on_augmented_dataset
```

A month later, it turns out that `model_v1_final_l1_regularization` was the best. Now you have to go back and reproduce it. What were the hyperparameters? The initial learning rate? Which dataset did you use? (Because judging from the name of the last model, you have several.) What was the model architecture? These are all essential questions.

If you did not keep track of this, you are in trouble. Thankfully, you are not the first one to face this problem, and because of this, there are already well-established solutions.

Recording the parameters and the metrics has benefits beyond reproducibility. Evaluating the results can give valuable insight into how the model can be improved and which hyperparameters matter. On top of this, since most projects are done by teams, communication can be simplified with beautiful charts and reports.

Introducing experiment tracking into your workflow requires a time investment, which will pay off exponentially later. Getting rid of old habits requires conscious effort, but it is worth the time.

So, you have decided to move beyond using only Jupyter Notebooks in your machine learning development workflow. There are several tools to choose from, ranging from simple loggers to complex full-stack solutions.

To give a concrete example, suppose that you are building a ResNet-based model for classification.

Even disregarding the hyperparameters arising from the model architecture (like the number of filters learned by the convolutional layers), you still have a lot. For instance, the

- learning rate,
- learning rate decay strategy and rate,
- optimizer and its parameters,
- data augmentation methods,
- batch size,
- number of epochs

are all important and can influence the result.

In this post, we are going to take a look at two tools that you can use to track your machine learning experiments: *MLFlow* and *Weights and Biases*.

Before we move on, I would like to mention that if you want to implement a solution fast, you should just keep a record of your experiments in a table.

I know, keeping track of things manually is far from optimal. Logging is tedious and if you are not clear enough in your notes, interpretation can be difficult later. Despite this, there are multiple arguments for using an Excel table (or Google Sheet or whatever you use) to record the hyperparameters and the experimental results.

This method is extremely low-effort, and you can start right now, without learning anything new or browsing tutorials for hours. You don't always need an advanced solution; sometimes you just have to whip something up quickly. Opening up a spreadsheet and recording your experiments there instantly brings some order to your workflow.

However, you often need much more. Let's see what you can use to be as organized as a Kaggle Grandmaster!

One of the first and most established tools for experiment tracking is MLFlow, which consists of three main components: Tracking, Projects, and Models.

*MLFlow Tracking* offers an interactive UI to record hyperparameters, metrics, and other data for experiment tracking. We are going to focus on this part; however, it goes in tandem with *Projects*, a tool for packaging your code in a reproducible format, and *Models*, which lets you deploy models to production. These are designed to integrate seamlessly, adding up to a complete *machine learning workflow*. (Hence the name.) MLFlow can be used with the major machine learning frameworks, like *TensorFlow*, *PyTorch*, *scikit-learn*, and others.

Once you have installed it with `pip install mlflow`, you can start adding it to your code without any extra steps.

MLFlow tracks your experiments in *runs*. Each run is saved on your hard drive, and you can review it in an interactive dashboard. To start a run, wrap the training code in the `mlflow.start_run()` context. To give an example, this is how it works.

The logs are saved in the working directory. To check the results of the runs, the dashboard can be launched with

```
mlflow ui
```

from the same directory where the runs are saved. By default, the UI can be opened at `localhost:5000` in your browser.

Plots for the logged quantities are also available, where the different runs can be compared with each other.

A tutorial example with linear regression can be found here.

To use MLFlow, you don't have to record all experiments locally: you can use a remote tracking server, either through the managed MLFlow service on Databricks or by hosting your own. (The Databricks Community Edition, the free version of the managed platform, also comes with access to a remote tracking server.)

One particular feature which MLFlow lacks is collaboration and team management. Machine learning projects are rarely done in isolation, and communicating the results can be difficult.

**To summarize,** MLFlow

- is an entirely open-source tool,
- comes with features to manage (almost) the entire machine learning development cycle, such as model packaging and serving,
- is easy to install and integrate into your experiments,
- provides a simple UI to track your experiments, either locally or remotely,
- but for remote tracking, you either have to host your own server or use the Databricks platform,
- and collaborative features are not available.

In recent years, several tools were created to improve the user experience and make the machine learning development process even more seamless. These are similar in functionality, so we are going to single out just one: the *Weights and Biases* tool.

One of the newest applications on the list is Weights and Biases, or *wandb* for short. It is also free to use, but certain functionalities are only available with a paid membership. Similarly to MLFlow, it provides an interactive dashboard, accessible online and updated in real time. However, the tracking is done by a remote service.

With Weights and Biases, it is also very easy to get started, and adding the experiment tracking to an existing codebase is as simple as possible.

After registering, installing the package with `pip install wandb`, and getting your API key, you can log in by typing

```
wandb login
```

into the command line, which prompts you to authenticate yourself with the key.

Now you are ready to add tracking to the code!

For simplicity, let's assume that we only have two parameters: batch size and learning rate. First, you have to initialize the `wandb` object, which will track the hyperparameters and communicate with the web app.

```python
import wandb

wandb.init(
    project="wandb-example",
    config=config,  # the config dict contains the hyperparameters
)
```

Here, the `config` dictionary stores the hyperparameters, like

```python
config = {"batch_size": 4, "learning_rate": 1e-3}
```

These are recorded, and when you later review runs with different hyperparameters, you can filter and group by these variables in the dashboard.

When this is set, you can use the `wandb.log` method to record metrics, like the training and validation losses after an epoch:

```python
wandb.log({"train_loss": train_loss, "val_loss": val_loss})
```

Every call to this method logs the metrics to the interactive dashboard. Besides scalars, sample predictions like images or *matplotlib* plots can also be saved.

If you use PyTorch, the `wandb.watch` method can be used to register your model and keep track of all its parameters.

Weights and Biases collects much more data than you specify. First, it records all system data it can access, like GPU and CPU utilization, memory usage, and even GPU temperature.

In addition to this, the plots and visualizations present in the logs can be used to create research-paper-quality reports and export them to PDF. It even supports LaTeX for mathematical formulas. (I love this feature.) See this one for example.

The free version gives a ton of value already, but if you want to access the collaborative and team management tools, you have to pay a monthly subscription fee.

Compared to MLFlow, some features stand out. For instance,

- the UI is beautifully designed and the user experience is significantly better,
- tracking is done remotely by default in the web app,
- you can create beautiful reports and export them to PDF.

Besides *MLFlow* and *Weights and Biases*, there are many tools with similar functionality. Some of them even go beyond and provide a full stack MLOps platform. Without any preference, here are some of them.

Similarly to software development, creating a machine learning solution is not a linear process.

In software development, as features are added, the product is constantly tested and improved. As user feedback, feature requests, and bug reports pour in, engineers often go back to the drawing board and rethink components.

This is where things often go sideways.

With every change, things may break. Bugs can be introduced, and things can get very confusing. Managing a single codebase shared between multiple developers would be impossible without version control tools like *git*.

In machine learning, complexity is taken to another level. Not only does the code to train and serve the models change rapidly, but model development itself is an experimental process. Without specialized tools to track the results, data scientists are lost.

Two of the best tools out there (among many other excellent ones) are the *MLFlow* suite and the *Weights and Biases* tool. If you haven't yet spent the extra time organizing your work, these can introduce you to a whole new world. They are easy to learn and bring so many benefits that you shouldn't miss out on them.

A few years ago, if you wanted to make money with code, you had two options. The classical low-risk, low-reward method was to get a 9–5 job. However, if you wanted more control, the only other way was to go self-employed and take on much higher risk.

This is not the case anymore. As the online ecosystem evolved, more and more companies realized that they would rather pay to use a properly developed and maintained external solution than develop it internally at many times the cost.

On the supply side, this created ample opportunities for developers to turn their tools into fully fledged services and charge users a subscription fee. Digital channels such as Twitter and Facebook made it simpler for coders with an entrepreneurial side to reach customers. Creating an app from your bedroom has never been easier, and the SaaS (Software as a Service) model was thriving.

Thus, starting a business became an aspiration for many. However, building an excellent product alone is not enough. As product distribution channels became accessible to everyone, competition intensified. Coding, DevOps, product development, sales, marketing, branding: they all require different skill sets. To be successful in a saturated market, all must be extraordinary.

SaaS was built to match the demand of non-technological companies for advanced software solutions they don't have to build themselves. On top of this, developers intending to build SaaS products created a new demand of their own: services for setting up and monetizing their products, allowing the creators to focus on the software.

Enter API marketplaces.

To be successful as a SaaS business, there are several things one must pay attention to. For instance,

- product development,
- hosting and devops,
- documentation,
- customer support,
- marketing,
- sales,
- branding,
- market research,
- pricing and monetization.

This raises the barrier to entry, as doing all of it requires a larger team rather than just a few coders. However, entrepreneurs and hackers noticed two key things that shifted the balance significantly.

First, *most services can be delivered in the form of an API*. Let's consider an example from finance. If you do algorithmic trading, you want a magic function you can call that returns the data and performs the quantitative analysis, instead of having to do it yourself from raw data. Or, if you want to do social network analysis on Twitter and gain insights into what is happening, you don't want to crawl the site to build a database of tweets. Instead, you want direct programmatic access to search and analyze the data.

Second, *there is a need among developers to set up a service quickly, without having to worry about anything on the list above except coding.* This led to the rise of API marketplaces, which take care of most of these burdens for builders. Developers can focus solely on providing the best service possible, without worrying about how to monetize their API.

So, how do these API marketplaces work and what can they provide?

Simply speaking, an API marketplace takes care of two problems: *monetization* and *distribution*.

Suppose that you have built your service and have a hosting solution in place. To profit from your code, you would have to implement a solution to authenticate users and accept subscription payments. However, the less code you write, the less you have to maintain, and the more time you have to focus on the product.

API marketplaces such as RapidAPI take care of this: there, you can register your API and easily set up a payment plan. Users call your API through RapidAPI, which is responsible for authentication.

For the users, calling your service is as simple as possible. An API key is issued upon subscription, which is used to authenticate the callers without you having to do any extra work.

To demonstrate how simple this is for the users, here is an example code snippet, straight from the documentation of Dark Sky, one of the weather forecasting APIs at RapidAPI.
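The original snippet is not included in this version of the post. As a rough sketch of what such a call typically boils down to (the endpoint, host, and query parameters below are placeholders, not the actual Dark Sky API), everything reduces to a single authenticated HTTP request:

```python
import requests

def get_forecast(api_key: str, lat: float, lon: float) -> dict:
    # Placeholder endpoint and host -- not the real Dark Sky URL
    url = "https://example-weather.p.rapidapi.com/forecast"
    headers = {
        "X-RapidAPI-Key": api_key,  # the key issued upon subscription
        "X-RapidAPI-Host": "example-weather.p.rapidapi.com",
    }
    response = requests.get(url, headers=headers,
                            params={"lat": lat, "lon": lon})
    response.raise_for_status()
    return response.json()
```

The marketplace validates the key in the headers, so the API author never touches authentication code.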

Also, your service is listed in the marketplace, so it is searchable and discoverable by potential users.

On the flip side, RapidAPI charges a 20% fee on all payments made through it. This should be taken into account when working out the pricing plan, especially because you also have to cover the hosting costs.

Here are a few additional case studies and hands-on tutorials that will be useful if you use these platforms to make money with your code directly.

API as a product. How to sell your work when all you know is a back-end

Develop and sell a Python API — from start to end tutorial

Despite the significant improvements API marketplaces provided, things are still not quite there yet. As things stand, the responsibility of working out a hosting solution still lies with the developer. Depending on how much the API needs to scale, this can require serious expertise. (And if the API is successful, it *will* need scaling.)

Besides, you must pay the hosting fees upfront, even if your application is not successful. In certain cases, for example, when the service requires a running GPU instance, these costs can be significant.

Recently, a new contender has arrived to bring API marketplaces to the next level. Byvalue, an upcoming NoOps platform, promises to streamline the entire process of monetizing your code, without requiring you to work on optimizing hosting and paying any costs upfront.

At the moment, they are in Beta, and they are looking for early adopters to join their community and help build their marketplace. If you are interested in shaping the future of how developers monetize their code, you should definitely check them out!

Use Byvalue to Create Your Own API Business

We live in a time where opportunities are all around us. As the distribution channels of digital products evolved, making money by building software and selling it as a service gradually became possible for anyone with coding skills. Delivering services via APIs became a tried and tested business model for small and large enterprises alike.

Initially, the required technical expertise went far beyond the domain knowledge of the problem to be solved. Deploying and monetizing the product was time-consuming and difficult.

Thus, API marketplaces were created to address this need and do this work for the developer. If you are looking for ways to profit from your code, API marketplaces such as RapidAPI are definitely the way to go.

However, the problem is far from solved. New startups like Byvalue are challenging the current implementations of marketplaces, setting out to provide a one-stop-shop experience for everyone who wants to turn their code into profit.

If you want to diversify your income streams or maybe even go self-employed, offering services via APIs is a great way to go.

Later, as my experience grew and I became involved in large-scale projects, it dawned on me that this freedom is a blessing *and* a curse. As the number of contributors grows and the code is pushed closer to production grade, not having static typing or type checking can lead to nasty surprises.

This feeling was shared by many in the Python ecosystem. However, keeping the freedom allowed by dynamic typing while mitigating its negative impact is difficult.

Enter *type annotations*.

If there is one feature in Python that maximizes the positive impact on your code while requiring minimal effort, it is *type annotation*. It allows the developer to effectively communicate the expected argument and return types to tools and other developers, while keeping the advantages of dynamic typing.

So, what is dynamic typing anyway? To see this, let’s play around a little.
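The original snippet is missing from this version of the post; a minimal reconstruction of the idea might look like this:

```python
# The same name can point to objects of different types over time.
a = 42
print(type(a))  # <class 'int'>

a = "Hello!"
print(type(a))  # <class 'str'>

a = [0, 1, 2]
print(type(a))  # <class 'list'>
```

No declaration, no error: the name `a` simply gets rebound to a new object each time.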

Behind the scenes, variables in Python like `a` above are *pointers*, pointing to objects of a certain type. However, a name is not restricted to pointing at objects of a fixed type. This gives us a lot of freedom. For instance, functions can accept any type as an argument, because in the background, a pointer is passed.

Unfortunately, without proper care, this can go wrong really fast.

Just think of the following example.
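The function itself is missing from this version of the post; judging from the discussion that follows, it was something like this hypothetical `midrange` (the name is an assumption):

```python
import numpy as np

def midrange(x):
    """Return the midpoint of the smallest and largest element."""
    return (min(x) + max(x)) / 2

print(midrange([0, 1, 2, 3]))            # 1.5
print(midrange(np.array([0, 1, 2, 3])))  # 1.5
```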

This simple statistical function works for lists and NumPy arrays as well! How awesome!

Well, not entirely. There are plenty of ways to burn yourself with this. For instance, you can unintentionally pass something which causes the function to crash. This is what happens if we call the function with a string.
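Assuming the function under discussion is the hypothetical `midrange` computing `(min(x) + max(x)) / 2`, the call blows up like this:

```python
def midrange(x):  # hypothetical reconstruction of the function under discussion
    return (min(x) + max(x)) / 2

try:
    midrange("Python")
except TypeError as e:
    print(e)  # unsupported operand type(s) for /: 'str' and 'int'
```

`min` and `max` happily return characters, and the crash only happens at the division.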

The division operator is not defined between strings and integers, hence the error. Since Python is an interpreted language, this problem does not surface until the function is actually called with a bad argument, which can happen after weeks of runtime. Languages like C catch these errors at *compile time*, before anything goes wrong.

Things can get even worse. For instance, let’s call the function with a *dictionary*.
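With a dictionary whose keys happen to be numbers (again assuming the hypothetical `midrange` from the discussion), the call succeeds silently:

```python
def midrange(x):  # hypothetical reconstruction of the function under discussion
    return (min(x) + max(x)) / 2

data = {1: 10.0, 2: 20.0, 3: 30.0}
print(midrange(data))  # 2.0 -- computed from the keys, not 20.0 from the values
```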

The execution is successful, but *the result is wrong*. When a dictionary is passed, the `min` and `max` functions compute the minimum and maximum of the *keys*, not of the values as we want. This kind of bug can remain undetected for a long time, while you are under the impression that everything is alright.

Let's see what we can do to avoid problems like these!

In 2006, PEP 3107 introduced *function annotations*, which were extended in PEP 484 with *type hints*. (PEP is short for Python Enhancement Proposal, which is Python’s way of suggesting and discussing new language features.)

Function annotation is simply *“a syntax for adding arbitrary metadata annotations to Python functions”*, as PEP 3107 states. How does it look in practice?
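The original example is not shown in this version of the post; a minimal, hypothetical annotated function demonstrates the syntax:

```python
def greet(name: str = "World") -> str:
    # `name: str = "World"` hints the argument type and its default value,
    # `-> str` hints the return type
    return f"Hello, {name}!"

print(greet())              # Hello, World!
print(greet("Pythonista"))  # Hello, Pythonista!
```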

Types can be hinted with `argument: type = default_value` and return values with `def function(...) -> type`.

These are not enforced at all and are ignored by the interpreter. However, this does not mean that they are not mind-blowingly useful! To convince you, let's see what we can gain!

Have you ever tried to develop in a barebones text editor like Notepad? You have to type in everything and keep in mind what is what all the time. Even an IDE cannot help you if it has no idea about the object you are using.

Take a look at the example below.
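The original screenshot is not reproduced here; the gist of it, with a hypothetical `preprocess_data` function, is the following:

```python
import pandas as pd

def preprocess_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows. The `-> pd.DataFrame` annotation tells
    the IDE exactly what the caller gets back."""
    return raw.dropna()

data = preprocess_data(pd.DataFrame({"x": [1.0, None, 3.0]}))
# Because of the return annotation, the IDE knows `data` is a DataFrame
# and can autocomplete its methods:
print(data.describe())
```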

With function annotation, the IDE is aware of the type of the `data` object, which is the return value of the `preprocess_data` function. Thus, you get autocompletion, which saves a tremendous amount of time.

This also helps when using the function. Most often, the definition is in an entirely different module, far away from where you are calling it. By telling the IDE the type of arguments, it will be able to help you pass the arguments in the correct format without having to manually check the documentation or the implementation.

Developers spend much more time reading code than writing it. I firmly believe that great code is self-documenting. With proper structuring and variable naming, comments are rarely needed. Function annotation contributes significantly to this. Just a glance at the definition will reveal a lot on how to use it.

The annotations are accessible from outside the function.
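For instance, with a hypothetical annotated function, the hints live in its `__annotations__` attribute:

```python
def midrange(x: list) -> float:  # hypothetical annotated function
    return (min(x) + max(x)) / 2

print(midrange.__annotations__)
# {'x': <class 'list'>, 'return': <class 'float'>}
```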

This is not only useful for the programmer, but for the program itself as well! Before calling the function, you can check the validity of its arguments at runtime, if needed.

With PEP 484, type checking was taken to the next level: it introduced the `typing` module, *"providing a standard syntax for type annotations, opening up Python code to easier static analysis and refactoring, potential runtime type checking, and (perhaps, in some contexts) code generation utilizing type information"*, as stated in the PEP.

To give a more concrete example, the `typing` module contains `List`, so by using the annotation `List[int]`, you can tell that the function expects (or returns) a list of integers.
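For instance, a sketch of the earlier hypothetical function with a `typing`-style annotation:

```python
from typing import List

def midrange(x: List[int]) -> float:
    # the annotation says: a list of integers goes in, a float comes out
    return (min(x) + max(x)) / 2

print(midrange([1, 2, 3, 4]))  # 2.5
```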

Type checking opens up a lot of opportunities. However, doing it manually all the time is not so convenient.

If you want a stable solution, you should try *pydantic*, a data validation library. Using its `BaseModel` class, you can validate data at runtime.
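The original example is not included in this version of the post; a minimal sketch, with a hypothetical `TrainingConfig` model, could look like this:

```python
from pydantic import BaseModel, ValidationError

class TrainingConfig(BaseModel):  # hypothetical configuration model
    batch_size: int
    learning_rate: float

# Valid data is coerced into the annotated types
config = TrainingConfig(batch_size="32", learning_rate=1e-3)
print(config.batch_size)  # 32, coerced from the string "32"

# Invalid data raises a ValidationError at runtime
try:
    TrainingConfig(batch_size="not a number", learning_rate=1e-3)
except ValidationError as e:
    print(e)
```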

You can go even beyond this example, for instance by providing custom validators for *pydantic* models.

*pydantic* is one of the pillars of FastAPI, the rising star among Python web frameworks. There, *pydantic* makes it easy to define the JSON schemas for the endpoints.

So, I hope that I have convinced you by now. Type annotations require minimal effort, but they have a huge positive impact on your code. They

- make the code easier to read for you and your team,
- encourage you to keep types in mind,
- help to identify type-related issues,
- enable proper type checking.

If you are not using it already, you should start doing it now! This is one of the biggest code improvements you can do with only a small amount of work.

To study structure,

tear away all flesh so

only the bone shows.

Linear algebra. Graph theory. If you are a data scientist, you have encountered both of these fields in your study or work at some point. They are part of a standard curriculum, frequently used tools in the kit of every engineer.

What is rarely taught, however, is that they have a very close and fruitful relationship. Graphs can be used to prove strong structural results about matrices easily and beautifully.

To begin our journey, first, we shall take a look at how a matrix can be described with a graph.

Suppose that we have a square matrix *A = (aᵢⱼ)*. We say that the weighted and directed graph *G(A)*, with vertices *1, 2, …, n*, corresponds to *A* if there is an edge going from *i* to *j* with weight *aᵢⱼ* whenever *aᵢⱼ ≠ 0*.

If this sounds complicated, here is an example.

So, for every square matrix, we have a weighted and directed graph. In general, having distinct representations for the same object is colossally useful in mathematics. Sometimes, complex things can be significantly simplified the moment you start looking at things from a different perspective.

Such as the case of matrices and graphs.

A great feature of the graph representation is the ability to visualize matrix powers.

Suppose that the *k*-th power of *A* is denoted by *Aᵏ = (aᵢⱼ⁽ᵏ⁾)*, where *k* can be any positive integer. For *k = 2*, its entries are calculated by *aᵢⱼ⁽²⁾ = Σₗ aᵢₗ aₗⱼ*, which seems pretty mystical.

With graphs, there is a pretty simple explanation.

For any two nodes *i* and *j*, there may be several ways to go from one to the other. These are called *walks*. In general, a walk is defined by a sequence of vertices *v₀, v₁, …, vₖ*, where there must be an edge between *vᵢ* and *vᵢ₊₁*. Since the graph of the matrix is weighted, we can define the weight of the walk as well, by multiplying the weights of each edge traversed.

(To motivate the definition, suppose we are talking about the transition graph of a discrete Markov chain. In this case, the weight of a walk equals the probability of seeing the consecutive states prescribed by the walk.)

So, in graph terminology, *aᵢⱼ⁽ᵏ⁾* equals the sum of the weights of all possible *k*-long walks between *i* and *j*. This is easy to see for *k = 2*, and the general case follows by induction.

If you are familiar with some advanced topics in linear algebra, you must have encountered the concept of *similar matrices*. *A* and *B* are called *similar* if there is an invertible matrix *P* such that *B = P⁻¹AP* holds. This is a very important concept. If we think of matrices as linear transformations, similarity means that *A* and *B* are essentially the same transformation, only in a different coordinate system. (And *P* is the change of coordinates.)

For example, some matrices can be diagonalized with similarity transformations. So, if we look at them using a special coordinate system, a complicated matrix may just describe a simple stretching.

An important special case is where the similarity matrix *P* is a so-called *permutation matrix*.

In mathematics, any bijective mapping *π: X → X* is called a permutation. Specifically, if *X = {1, 2, …, n}*, then *π* is simply a reordering of the numbers.

Any such permutation can be represented with a matrix *Pπ*, defined by *(Pπ)ᵢⱼ = 1* if *π(i) = j*, and *0* otherwise.

If this is not easy to understand, no worries, here is an example. For the permutation defined by *π(1) = 2, π(2) = 3, π(3) = 1*, we have the permutation matrix *Pπ* with rows *e₂, e₃, e₁*, where *eᵢ* denotes the *i*-th standard basis (row) vector.

Why do we define the permutation matrix this way? To see this, consider what happens in a product: multiplying a matrix with a permutation matrix shuffles its rows or columns, depending on whether we multiply from the left or from the right. That is, *PπA* is *A* with its rows shuffled, while *APπ* is *A* with its columns shuffled.

(Recall that matrix multiplication is *not* commutative.)

The inverse of a permutation matrix is its transpose. This is easy to see once you explicitly calculate the product *PπPπᵀ* by hand.

So, let's go back to graphs and matrices. For a given permutation matrix *Pπ* and a matrix *A*, what does the similarity transform do? If you think about it, the matrix *PπᵀAPπ* contains the same elements as *A*; only its rows and columns are shuffled. In fact, their corresponding graphs are isomorphic. (Which is a fancy expression for being the same after relabeling certain vertices.) Although showing this might look difficult, it can be done by simply noting three key things.

- The graph of *APπ* can be obtained from the graph of *A* by taking all edges *(i, j)* and replacing them with *(i, π(j))*.
- Similarly, the graph of *PπᵀA* can be obtained by replacing all edges *(i, j)* with *(π⁻¹(i), j)*.
- Every permutation matrix can be written as a product of permutation matrices in which only two elements are swapped. These are called *transpositions*, and their inverses are themselves.

A central question in the theory of matrices is to simplify and reveal their underlying structure by some kind of transformation, like similarity.

For example, as mentioned, certain matrices can be diagonalized by a similarity transformation: for these, *P⁻¹AP* is diagonal for a suitable *P*. **Note that this is not true for all matrices.** (Check out the spectral theorem if you are interested.) Diagonal matrices are special and easy to work with, so when diagonalization is possible, our job is much simpler.

Another special form is the block-triangular form. The matrix *A* is upper block-triangular if there are submatrices *B*, *C*, *D* such that, in block form, *A = [B C; 0 D]*. (Note that *0* is a matrix with all zeros here.)

**Definition.** A nonnegative matrix *A* is called *reducible* if it can be upper block-triangularized with a similarity transform using a permutation matrix *P*, that is, if *PᵀAP = [B C; 0 D]* for some submatrices *B*, *C*, *D*. If this cannot be done, the matrix is called *irreducible*. From a graph-theoretical perspective, reducibility is equivalent to partitioning the nodes into two subsets *S* and *T* such that there are no outgoing edges from *T* to *S*.

Imagine that you are randomly walking along the edges of this graph, like a Markov chain. Reducibility means that once you enter *T*, you cannot leave it. (And, if there is a nonzero probability to enter, *you will enter eventually*.)

With irreducible and reducible matrices, nonnegative matrices can be significantly simplified, as we shall see next.

Before going into explanations, let's just state the theorem.

**Theorem.** (Frobenius normal form) For every nonnegative matrix *A*, there is a permutation matrix *P* such that *PᵀAP* is block upper-triangular, with irreducible matrices *A₁, A₂, …, Aₖ* in its diagonal blocks.

This theorem is hard to prove using only the tools of algebra. (That is how it was done originally.) However, with graphs, it is almost trivial.

To do this, let's introduce the concept of *strongly connected nodes* in the graph. The nodes *i* and *j* are strongly connected if there is a directed walk from *i* to *j* AND a directed walk from *j* to *i*. That is, they are mutually reachable from each other.

In the language of algebra, the relation "*i* and *j* are strongly connected" is an equivalence relation. This means that it partitions *V* into subsets *V₁, V₂, …, Vₖ* such that all vertices within a *Vᵢ* are strongly connected with each other, but NOT with any vertex outside of it.

For illustration, the graph looks something like this.

Believe it or not, this proves the Frobenius theorem about the normal form. To see this, we just have to renumber the vertices such that each strongly connected set of vertices is numbered consecutively. As we have seen, renumbering the vertices is equivalent to applying a permutation similarity transform. Hence, the transformed matrix *PᵀAP* is of the desired form.

This theorem illustrates the use of graph theory in linear algebra. By simply drawing a picture, so many structural patterns are revealed.

Of course, this is just the tip of the iceberg. If you are interested, check out the book A Combinatorial Approach to Matrix Theory and Its Applications by Richard A. Brualdi and Dragos Cvetkovic, which is full of beautiful mathematics regarding this topic.

Graphs and matrices go hand in hand. Specifically, graph theory provides a new way to think about matrices. Although this is usually not part of a standard curriculum in linear algebra, it is a fruitful connection between the two. With it, certain structural aspects of matrices become trivial.

A small one-liner solves a problem that makes a function work. The function is needed for a data processing pipeline. The pipeline is integrated into a platform that enables a machine-learning-driven solution for its users.

Problems are everywhere. Their magnitude and impact might be different, but the general problem solving strategies are the same.

As an engineer, developer, or data scientist, being effective at problem-solving *can really supercharge your results and put you ahead of your peers*. Some can do this instinctively after years of practice; others have to put in conscious effort to learn it. However, no matter who you are, you can and you must improve your problem-solving skills.

Having a background in research-level mathematics, I had the opportunity to practice problem solving and observe the process. Surprisingly, this is not something which you have to improvise each time. Rather,

a successful problem solver has several standard tools and a general plan under their belt, adapting as they go.

In this post, my aim is to give an overview of these tools and use them to create a process, which can be followed at any time. To make the situation realistic, let’s place ourselves into the following scenario: we are deep learning engineers, working on an object detection model. Our data is limited, so we need to provide a solution for *image augmentation*.

Augmentation is the process of generating new data from the available images by applying random transformations like crops, blurs, brightness changes, etc. See the image above which is from the readme of the awesome albumentations library.

We need to deliver the feature by next week, so we have to get working on it right away. How should we approach the *problem*?

(As a mathematician myself, my thinking process is heavily influenced by the book How to Solve It by George Pólya. Although a mathematical problem is different from real-life coding problems, this is a must-read for anyone who wishes to get good at problem solving.)

Before attempting to solve whatever problem you have in mind, there are some questions that need to be answered. Not understanding some details properly can lead to wasted time. You definitely don’t want to do that. For instance, it is good to be clear about the following.

- **What is the scale of the problem?** In our image augmentation example, will you need to process thousands of images per second in production, or is it just for you to experiment with some methods? If a production-grade solution is needed, you should be aware of this in advance.
- **Will other people use your solution?** If people are going to work with your code extensively, significant effort must be put into code quality and documentation. On the other end of the spectrum, if it is for your use only, there is no need to work as much on this. (I can already see people disagreeing with me. However, I firmly believe in minimizing the amount of work. So, if you only need to quickly try out an idea and experiment, feel free to *not* consider code quality.)
- **Do you need a general or a special solution?** A lot of time can be wasted on implementing features no one will ever use, including you. In our example, do you need a wide range of image augmentation methods or just vertical and horizontal flips? In the latter case, flipping the images in advance and adding them to your training set can also work, which requires minimal effort.
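To illustrate the last point, the "flip the images in advance" shortcut can be sketched in a few lines. This is a hypothetical NumPy snippet for illustration, not the pipeline's actual code:

```python
import numpy as np

# A hypothetical sketch of the "flip the images in advance" shortcut:
# augment a training set with horizontal flips only, before training.
def augment_with_flips(images):
    """images: array of shape (n, height, width, channels)."""
    flipped = images[:, :, ::-1, :]  # reverse the width axis
    return np.concatenate([images, flipped], axis=0)

batch = np.zeros((8, 32, 32, 3), dtype=np.uint8)
augmented = augment_with_flips(batch)
print(augmented.shape)  # (16, 32, 32, 3)
```

If flips are all you need, doubling the dataset once like this may spare you an entire augmentation dependency.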

A good gauge of your degree of understanding is your ability to explain and discuss the problem with others. Discussion is also a great way to discover unexpected approaches and edge cases.

When you understand your constraints and have a somewhat precise problem specification, it is time to get to work.

The first thing you must always do is look for existing solutions. Unless you are pushing the very boundaries of human knowledge, someone else has already encountered the issue, created a thread on Stack Overflow, and possibly built an open-source library around it.

Take advantage of this. There are several benefits to using well-established tools instead of creating your own.

- **You save a tremendous amount of time and work.** This is essential when operating under tight deadlines. (One of my teachers used to say ironically that "you can save an hour of Google search with two months of work". Spot on.)
- **Established tools are more likely to be correct.** Open-source tools are constantly validated and checked by the community. Thus, they are less likely to contain bugs. (Of course, this is not a guarantee.)
- **There is less code for you to maintain.** Again, we should always strive to reduce complexity, and preferably the amount of code. If you use an external tool, you don't have to worry about its maintenance, which is a great deal. Every line of code has a hidden maintenance cost, to be paid later. (Often when it is the most inconvenient.)

Junior developers and data scientists often overlook these points and prefer to write everything from scratch. (I certainly did, but I quickly learned better.) The most extreme case I have seen was a developer who wrote his own deep learning framework. You should **never** do that unless you are a deep learning researcher and you have an idea of how to do significantly better than the existing frameworks.

Of course, not all problems require an entire framework; maybe you are just looking for a one-liner. Looking for existing solutions can certainly be beneficial, though you need to be careful in this case. Finding and using code snippets from Stack Overflow is only fine if you take the time to understand how and why they work. Not doing so may result in unpleasant debugging sessions later, or even serious security vulnerabilities in the worst case.

For these smaller problems, looking for existing solutions consists of browsing tutorials and best practices. In general, there is a balance between ruthless pragmatism and outside-the-box thinking. When you implement something the way it is usually done, you are doing a favor for the developers who are going to use and maintain that piece of code. (Often including you.)

Suppose that on your path towards delivering image augmentation for your data preprocessing pipeline, you have followed my advice, looked for existing solutions, and found the awesome albumentations library. Great! What next?

As always, there is a wide range of things to consider. Unfortunately, just because you have identified an external tool which can be a potential solution, it doesn’t mean that it will be suitable for your purposes.

- **Is it working well and supported properly?** There is one thing worse than not using external code: using buggy and unmaintained external code. If a project is not well documented and not maintained, you should avoid it. For smaller problems, where answers can generally be found on Stack Overflow, the *working well* part is essential. (See the post I have linked above.)
- **Is it directly adaptable?** For example, if you use an image processing library that is not compatible with albumentations, you have to do additional work. Sometimes, this can be too much, and you have to look for another solution.
- **Does it perform adequately?** If you need to process thousands of images per second, performance is a factor. A library might be totally convenient to use, but if it fails to perform, it has to go. This might not be a problem in all cases (for instance, if you are just looking for a quick solution for experiments), but if it is, it should be discovered early, before putting much work into it.
- **Do you understand how it works, and what its underlying assumptions are?** This is especially true for Stack Overflow code snippets, for the reasons I have mentioned above. For more complex issues like the image augmentation problem, you don't need to understand every piece of external code line by line. However, you do need to be aware of the requirements of the library, for instance, the expected format of the input images.

This, of course, is applicable only if you can actually find an external solution. Read on to see what to do when this is not the case.

Sometimes you have to develop a solution on your own. The smaller the problem, the more often this happens. These are great opportunities for learning and building. In fact, this is the actual *problem solving* part, the one which makes many of us the most excited.

There are several strategies to employ, all of them should be in your toolkit. If you read carefully, you’ll notice that there is a common pattern.

- **Can you simplify?** Sometimes it is enough to solve only a special case. For instance, if you know for a fact that the inputs for your image augmentation pipeline will always have the same format, there is no need to spend time on handling several input formats.
- **Isolate the components of the problem.** Solving one problem can be difficult, let alone two at the same time. You should always make things easy for yourself. When I was younger, I used to think that solving hard problems was the thing to do to earn dev points. Soon, I realized that the people who solve hard problems always do it by solving many small ones.
- **Can you solve it for special cases?** Before you go and implement an abstract interface for image augmentation, you should get a single method working in your pipeline. Once you discover the finer details and map out the exact requirements, a more general solution can be devised.

In essence, problem solving is an iterative process, where you pick the problem apart step by step, eventually reducing it to easily solvable pieces.

There is a common trait I have noticed in many excellent mathematicians and developers: they enjoy picking a solution apart and analyzing what makes it work. This is how you learn, and how you build robust yet simple code.

Breaking things can be part of the problem-solving process. Going from a special case to the general one, you usually discover solutions by breaking what you already have.

Depending on the magnitude of the problem, you should consider open-sourcing your solution, if you are allowed to. Solving problems for other developers is a great way to contribute to the community.

For instance, this is how I built modAL, one of the most popular active learning libraries for Python. I started with a very specific problem: building active learning pipelines for bioinformatics. Since building complex methods always requires experimentation, I needed a tool that enabled rapid iteration. This was difficult to achieve with the frameworks available at the time, so I slowly transformed my code into a tool that could be easily adopted by others.

What used to be “just” a solution became a library, with thousands of users.

Contrary to popular belief, effective problem solving is not the same as coming up with brilliant ideas all the time. Rather, it is a thinking process with some well-defined and easy-to-use tools, which can be learned by anyone. Smart developers use these instinctively, making them look like magic.

Problem-solving skills can be improved with deliberate practice and awareness of your thinking habits. There are several platforms where you can find problems to work on, like Project Euler or HackerRank. However, even if you just start applying these methods to the issues you encounter during your work, you'll see your skills improve rapidly.

Even though the commercially available computational resources increase day by day, optimizing the training and inference of deep neural networks is extremely important.

If we run our models in the cloud, we want to minimize the infrastructure costs and the carbon footprint. When we are running our models on the edge, network optimization becomes even more significant. If we have to run our models on smartphones or embedded devices, hardware limitations are immediately apparent.

Since more and more models are moving from servers to the edge, reducing size and computational complexity is essential. One particularly fascinating technique is *quantization*, which replaces the floating-point numbers inside the network with integers. In this post, we are going to see why it works and how you can apply it in practice.

The fundamental idea behind quantization is that if we convert the weights and inputs into integer types, we consume less memory and on certain hardware, the calculations are faster.

However, there is a trade-off: with quantization, we can lose significant accuracy. We will dive into this later, but first let’s see *why* quantization works.

As you probably know, you can't simply store numbers in memory, only ones and zeros. So, to keep numbers and use them for computation, we must encode them.

There are two fundamental representations: *integers* and *floating point numbers.*

**Integers** are represented by their form in the base-2 numeral system. Depending on the number of digits used, an integer can take up different amounts of space. The most important types are

- *int8* (ranges from -128 to 127),
- *uint8* (ranges from 0 to 255),
- *int16* (ranges from -32768 to 32767),
- *uint16* (ranges from 0 to 65535).
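If NumPy is at hand, these ranges can be verified directly from its type metadata:

```python
import numpy as np

# The integer ranges listed above, queried from NumPy's type metadata.
for dtype in (np.int8, np.uint8, np.int16, np.uint16):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)
```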

If we would like to represent real numbers, we have to give up perfect precision. To give an example, the number *1/3* written in decimal form is *0.33333…*, with infinitely many digits, which *cannot* be stored in memory. To handle this, **floating-point** numbers were introduced.

Essentially, a float is the scientific notation of the number, written in the form

*significand* × *base*^*exponent*,

where the base is most frequently 2, but can also be 10. (For our purposes it doesn't matter, but let's assume it is 2.)

Similarly to integers, there are different types of floats. The most commonly used are

- *half* or *float16* (1 sign bit, 5 exponent bits, 10 significand bits, so **16 bits in total**),
- *single* or *float32* (1 sign bit, 8 exponent bits, 23 significand bits, so **32 bits in total**),
- *double* or *float64* (1 sign bit, 11 exponent bits, 52 significand bits, so **64 bits in total**).
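These bit layouts can also be read off from NumPy's type metadata, and the limited significand shows up immediately as a loss of precision:

```python
import numpy as np

# Bit layouts of the float types above: total bits, exponent bits,
# and significand bits, read from NumPy's type metadata.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.bits, info.nexp, info.nmant)

# The 10-bit significand of float16 limits its precision:
print(np.float16(np.pi))  # 3.140625 instead of 3.14159...
```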

If you try to add or multiply two numbers in scientific notation, you can see that float arithmetic is slightly more involved than integer arithmetic. In practice, the speed of each calculation very much depends on the actual hardware. For instance, a modern desktop CPU does float arithmetic about as fast as integer arithmetic. On the other hand, GPUs are optimized towards single-precision float calculations, since this is the most prevalent type in computer graphics.

Without being completely precise, it can be said that using *int8* is typically faster than *float32*. However, *float32* is used by default for training and inference for neural networks. (If you have trained a network before and didn’t specify the types of parameters and inputs, it was most likely *float32*.)

So, how can you convert a network from *float32* to *int8*?

The idea is very simple in principle. (Not so much in practice, as we'll see later.) Suppose that you have a layer with outputs in the range *[-a, a)*, where *a* is a positive real number.

First, we scale the output to *[-128, 128)*, then we simply round down. That is, we use the transformation

*x* → ⌊128 *x* / *a*⌋.

To give a concrete example, let's consider a matrix-vector product where all the values fall in the range *(-1, 1)*. If we quantize the matrix and the input, every entry becomes an *int8*.

However, the result of the multiplication is not an *int8*: the product of two 8-bit integers is a 16-bit integer. So, we de-quantize the result with the transformation

*y* → *y* / 128²

to obtain an approximation of the original output.

This is not exactly what we had originally. This is expected, as quantization is an approximation and we lose information in the process. However, this can be acceptable sometimes. Later, we will see how the model performance is impacted.
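This quantize–multiply–de-quantize round trip can be sketched in NumPy. The matrix and input below are made up for illustration; all values lie in *(-1, 1)*, so *a* = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy matrix-vector product whose entries all lie in (-1, 1).
W = rng.uniform(-1, 1, size=(3, 3)).astype(np.float32)
x = rng.uniform(-1, 1, size=3).astype(np.float32)

def quantize(t, a=1.0):
    """Map values from [-a, a) to int8 via x -> floor(128 * x / a)."""
    return np.floor(128 * t / a).astype(np.int8)

W_q = quantize(W)
x_q = quantize(x)

# Accumulate the integer product in a wider type to avoid overflow,
# then de-quantize by dividing with 128**2.
y_q = W_q.astype(np.int32) @ x_q.astype(np.int32)
y_approx = y_q / 128**2

y_exact = W @ x
print(np.max(np.abs(y_exact - y_approx)))  # small, but nonzero
```

The reconstruction error is bounded by the quantization step, which is exactly the information we gave up.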

We have seen that quantization basically happens operation-wise. Going from *float32* to *int8* is not the only option; there are others, like going from *float32* to *float16*. These can be combined as well. For instance, you can quantize matrix multiplications to *int8* while keeping activations in *float16*.

Quantization is an approximation. In general, the closer the approximation, the less performance decay you can expect. If you quantize everything to *float16*, you cut the memory footprint in half and probably won't lose accuracy, but you won't really gain speedup. On the other hand, quantizing with *int8* can result in much faster inference, but the performance will probably be worse. In extreme scenarios, it won't work at all and may require quantization-aware training.

There are two principal ways to do quantization in practice.

- **Post-training quantization:** train the model using *float32* weights and inputs, then quantize the weights. Its main advantage is that it is simple to apply. The downside is that it can result in accuracy loss.
- **Quantization-aware training:** quantize the weights already during training. Here, even the gradients are calculated for the quantized weights. When applying *int8* quantization, this gives the best results, but it is more involved than the other option.

In practice, the performance strongly depends on the hardware. A network quantized to *int8* will perform much better on a processor specialized to integer calculations.

Although these techniques look very promising, one must take great care when applying them. Neural networks are extremely complicated functions, and even though they are continuous, they can change very rapidly. To illustrate this, let’s revisit the legendary paper *Visualizing the Loss Landscape of Neural Nets* by Hao Li et al.

Below is the visualization of the loss landscape of a ResNet56 model without skip connections. The independent variables represent the weights of the model, while the dependent variable is the loss.

The figure above illustrates the point perfectly. Even by changing the weights just a little, the differences in loss can be enormous.

This is exactly what we do upon quantization: approximating the parameters, sacrificing precision for a compressed representation. There is no guarantee that this won't totally mess up the model as a result.

As a consequence, if you are building deep networks for tasks where safety is critical and the loss of a wrong prediction is large, you have to be *extremely* careful.

If you would like to experiment with these techniques, you don’t have to implement things from scratch. One of the most established tools is the model optimization toolkit for TensorFlow Lite. This is packed with methods to squeeze down your models as small as possible.

You can find the documentation here and an introductory article, if you are interested in the details.

PyTorch also supports several quantization workflows. Although the feature is currently marked experimental, it is fully functional. (But expect the API to change until it leaves the *experimental* state.) There are also tutorials about dynamic quantization, more specifically on an LSTM model and BERT.

Aside from quantization, there are other techniques to compress your models and accelerate inference.

One particularly interesting technique is *weight pruning*, where the connections of a network are iteratively removed during training. (Or after training, in some variations.) Surprisingly, in some cases you can remove even 99% of the weights and still get adequate performance.
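The simplest variant, magnitude-based pruning, can be sketched in a few lines of NumPy. This is a hypothetical illustration on a random weight matrix, not a production pruning schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical sketch of magnitude-based weight pruning: zero out
# the 90% of weights with the smallest absolute values.
w = rng.normal(size=(100, 100))
sparsity = 0.9

threshold = np.quantile(np.abs(w), sparsity)
mask = np.abs(w) >= threshold
w_pruned = w * mask

print(f"weights removed: {1 - mask.mean():.2%}")
```

In practice, this pruning step is interleaved with further training so the remaining weights can compensate for the removed ones.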

If you are interested, I wrote a detailed summary about the milestones in the field, with a discussion of the state of the art.

The second major network optimization technique is *knowledge distillation*. Essentially, after the original model is trained, a significantly smaller *student* model is trained to mimic the predictions of the original one.

This method was introduced by Geoffrey Hinton, Oriol Vinyals and Jeff Dean in their paper *Distilling the Knowledge in a Neural Network*.
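The core idea of the paper, matching the teacher's temperature-softened output distribution, can be sketched in NumPy. The logits below are made up for illustration; this is a sketch of the soft-target loss, not the authors' code:

```python
import numpy as np

# Softmax with a temperature T: T > 1 "softens" the distribution,
# exposing the teacher's relative confidence between classes.
def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([4.0, 1.0, 0.5])  # hypothetical values
student_logits = np.array([3.0, 1.5, 0.2])

T = 2.0
p_teacher = softmax(teacher_logits, T)  # soft targets
p_student = softmax(student_logits, T)

# Cross-entropy between the softened distributions: the quantity the
# student minimizes (usually alongside the usual hard-label loss).
distill_loss = -np.sum(p_teacher * np.log(p_student))
print(distill_loss)
```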

Distillation has been successfully applied to compress BERT, a huge language representation model with applications all across the spectrum. With distillation, the model can become small enough to run on the edge, for example on smartphones.

One of the leaders in these efforts is the awesome Hugging Face, who are the authors of the DistilBERT paper.

If you are interested in a practical example, you can check out how to perform the above mentioned BERT distillation using Catalyst, a library from the PyTorch ecosystem.

As neural networks move from servers to the edge, optimizing speed and size is extremely important. Quantization is a technique that can achieve this. It replaces *float32* parameters and inputs with other types, such as *float16* or *int8*. With specialized hardware, inference can be made much faster compared to unquantized models.

However, since quantization is an approximation, care must be taken. In certain situations, it can lead to significant accuracy loss.

Along with other model optimization methods such as weight pruning and knowledge distillation, quantization can be the quickest one to apply. With this tool under your belt, you can achieve results without retraining your model. In a scenario where post-training optimization is the only option, quantization can go a long way.
