Do Machine Learning Like an Experimental Scientist

Let’s travel back in time a few decades.

Linus Pauling, the only scientist who won two unshared Nobel Prizes and one of the greatest chemists of all time, was a well-organized person. Among other things, he was known for meticulously keeping notebooks, containing his experiments, conclusions, and ideas. 

Over the course of his career, he filled 46 research notebooks, which is an impressive number.

Pauling was not the only scientist who did this: Charles Darwin, Alexander Graham Bell, Thomas Edison, and practically every scientist before and after their time kept notebooks as well.

Notebooks provide an excellent tool to help reproduce experiments, formulate hypotheses, and draw conclusions. When an experiment has several sensitive parameters (like humidity, temperature, and light conditions for a plant biologist), reproducing results is impossible without keeping track of them.

Fast forward to the present day.

Although the tools have evolved, experimental scientists are still doing this. The notebooks may be digital, but the point is the same.

You may ask, what does this have to do with data science or machine learning? Let me explain. Does the following folder look familiar?

├── model_v1
├── model_v1_final
├── model_v1_final_l1_regularization
├── model_v2_adam_final
├── model_v2_final
├── model_v2_final_finetuning
└── model_v2_final_on_augmented_dataset

A month later, it turns out that model_v1_final_l1_regularization was the best. Now you have to go back and reproduce it. What were the hyperparameters? The initial learning rate? Which dataset did you use? (Judging from the name of the last model, you have several.) What was the model architecture? These are all essential questions.

If you did not keep track of this, you are in trouble. Thankfully, you are not the first one to face this problem, and because of this, there are already well-established solutions.
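Before looking at those solutions, here is the core idea in a few lines of plain Python: store the settings and results next to each run instead of encoding them in a folder name. This is only a minimal sketch. The function names, the run.json file name, and the timestamp-based naming scheme are all assumptions made for illustration, not part of any tool discussed in this post.

```python
import json
import time
from pathlib import Path


def save_run(root, config, metrics):
    """Write one run's settings and results to <root>/<run_id>/run.json."""
    # Hypothetical naming scheme: a timestamp instead of names like "v2_final".
    # Two runs started within the same second would share a folder.
    run_id = time.strftime("run_%Y%m%d_%H%M%S")
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"run_id": run_id, "config": config, "metrics": metrics}
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return run_dir


def load_run(run_dir):
    """Read back the record written by save_run."""
    return json.loads((Path(run_dir) / "run.json").read_text())
```

With something like this in place, the questions above (hyperparameters, dataset, architecture) can be answered by opening a single file, provided you put those values into the config dictionary when the run starts.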

Recording the parameters and the metrics has other benefits besides reproducibility. Evaluating the results can give valuable insight into how the model can be improved and which hyperparameters are important. On top of this, since most projects are done by teams, communication can be simplified with beautiful charts and reports.

Introducing experiment tracking into your workflow requires a time investment, which will pay off many times over later. Getting rid of old habits requires conscious effort, but it is worth the time.

How to track experiments

So, you have decided to move beyond using only Jupyter Notebooks in your machine learning development workflow. There are several tools, ranging from simple to complex full-stack solutions.

To give a concrete example, suppose that you are building a ResNet-based model for classification.

Even disregarding the hyperparameters arising from the model architecture (like the number of filters learned by the convolutional layers), you still have a lot to keep track of. For instance, the

  • learning rate,
  • learning rate decay strategy and rate,
  • optimizer and its parameters,
  • data augmentation methods,
  • batch size,
  • number of epochs

are all important and can influence the result.
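One way to keep these settings together is a single configuration object that can be turned into a flat dictionary and handed to whatever tracking tool you use. The sketch below is an illustration only: the class name, the field names, and every default value are placeholders chosen for this example, not recommendations.

```python
from dataclasses import dataclass, asdict, field


@dataclass
class TrainConfig:
    # Field names and default values are illustrative placeholders.
    learning_rate: float = 1e-3
    lr_decay: str = "step"
    lr_decay_rate: float = 0.1
    optimizer: str = "sgd"
    optimizer_params: dict = field(default_factory=lambda: {"momentum": 0.9})
    augmentations: tuple = ("random_crop", "horizontal_flip")
    batch_size: int = 32
    num_epochs: int = 10


def flatten(d, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. optimizer_params.momentum."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


# Override only what changes between experiments; the rest stays explicit.
config = flatten(asdict(TrainConfig(batch_size=64)))
```

Flattening is a design choice here: a flat dictionary maps directly onto table columns or key-value parameter logs, so nested optimizer settings do not get lost.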

In this post, we are going to take a look at two tools that you can use to track your machine learning experiments: MLFlow and Weights and Biases.

The quick fix: Excel tables

Before we move on, I would like to mention that if you want to implement a solution fast, you should just keep a record of your experiments in a table.

I know, keeping track of things manually is far from optimal. Logging is tedious and if you are not clear enough in your notes, interpretation can be difficult later. Despite this, there are multiple arguments for using an Excel table (or Google Sheet or whatever you use) to record the hyperparameters and the experimental results.

This method is extremely low effort and you can start it right now, without learning anything new or browsing tutorials for hours. You don’t always need an advanced solution. Sometimes you just have to whip something up quickly. Opening a spreadsheet and recording your experiments there instantly brings some order to your workflow.
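If typing rows by hand gets tedious, the same table can be filled from the training script. The following is a rough sketch using only the Python standard library. It writes a CSV file that Excel or Google Sheets can open. The function names and the column names you pass in are up to you and are assumptions of this example.

```python
import csv
from pathlib import Path


def log_experiment(path, row):
    """Append one experiment (a flat dict) to a CSV file.

    Assumes every call uses the same keys in the same order,
    since the header is written only once.
    """
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def best_run(path, metric="val_loss"):
    """Return the row with the lowest value of `metric`."""
    with Path(path).open(newline="") as f:
        rows = list(csv.DictReader(f))
    return min(rows, key=lambda r: float(r[metric]))
```

Calling log_experiment once at the end of each training run, with the hyperparameters and the final metrics in one dictionary, gives you one row per experiment. Note that values come back as strings when the file is read, which is why best_run converts the metric to a float before comparing.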

However, often you need much more. Let’s see what you can use to be as organized as a Kaggle Grandmaster!


MLFlow

One of the first and most established tools for experiment tracking is MLFlow, which consists of three main components: Tracking, Projects, and Models.

MLFlow Tracking offers an API and an interactive UI to record hyperparameters, metrics, and other data for experiment tracking. We are going to focus on this part. However, it works in tandem with Projects, a format for packaging your code so that runs are reproducible, and Models, which lets you package models and deploy them to production. These are designed to be seamlessly integrated, summing up to a complete machine learning workflow. (Hence the name.) It can be used with the major machine learning frameworks, like TensorFlow, PyTorch, scikit-learn, and others.

Once you have installed it with pip install mlflow, you can start adding it to your code without any extra steps.

MLFlow tracks your experiments in runs. Each run is saved on your hard drive, and you can review it in an interactive dashboard. To start a run, you should wrap the training code in the mlflow.start_run() context manager. To give an example, this is how it works.

import mlflow

with mlflow.start_run():
    model = ...  # define model here
    mlflow.log_params(config)  # log hyperparameter dictionary
    for epoch_idx in range(num_epochs):
        train_loss = train(model, data)
        val_loss = validate(model, data)
        # metrics change every epoch, so use log_metric with a step;
        # log_param is meant for values that are set once per run
        mlflow.log_metric("train_loss", train_loss, step=epoch_idx)
        mlflow.log_metric("val_loss", val_loss, step=epoch_idx)

By default, the logs are saved in an mlruns folder inside the working directory. To check the results of the runs, the dashboard can be launched with

mlflow ui

from the same directory where the runs are saved. By default, the UI can be opened at localhost:5000 in your browser.

Plots for the logged quantities are also available, where the different runs can be compared with each other.

Comparing validation losses between runs in MLFlow

A tutorial example with linear regression can be found here.

To use MLFlow, you don’t have to record all experiments locally. You can use a remote tracking server, either through the managed MLFlow service on Databricks or by hosting your own. (The Databricks Community Edition, the free version of the managed platform, comes with access to a remote tracking server.)

One particular feature which MLFlow lacks is collaboration and team management. Machine learning projects are rarely done in isolation, and communicating the results can be difficult.

To summarize, MLFlow

  • is an entirely open source tool,
  • comes with features to manage (almost) the entire machine learning development cycle, such as model packaging and serving,
  • is easy to install and integrate into your experiments,
  • provides a simple UI where you can track experiments locally or remotely,
  • requires you to either host your own server or use the Databricks platform for remote tracking,
  • and lacks collaborative features.

In recent years, several tools were created to improve the user experience and make the machine learning development process even more seamless. These are similar in functionality, so we are going to single out just one: the Weights and Biases tool.

Weights and Biases

One of the newest applications on the list is Weights and Biases, or wandb for short. It is also free to use, but certain functionalities are only available with a paid membership. Similarly to MLFlow, it provides an interactive dashboard that is accessible online and updated in real time. However, the tracking is done by a remote service.

With Weights and Biases, it is also very easy to get started, and adding the experiment tracking to an existing codebase is as simple as possible.

After registering, installing the package with pip install wandb, and getting your API key, you can log in by typing

wandb login

into the command line, which prompts you to authenticate yourself with the key.

Now you are ready to add tracking to the code!

For simplicity, let’s assume that we only have two parameters: batch size and learning rate. First, you have to initialize the wandb object, which will track hyperparameters and communicate with the web app.

import wandb

wandb.init(
    config=config,  # config dict contains the hyperparameters
)

Here, the config dictionary stores the hyperparameters, like

config = {"batch_size": 4, "learning_rate": 1e-3}

These are recorded, and when you later review runs with different hyperparameters, you can filter and group by these variables in the dashboard.

Visualization of the logged metrics for a given run in Weights and Biases. This example is available to explore online.

When this is set, you can use the wandb.log method to record metrics, like the training and validation losses after an epoch:

wandb.log({"train_loss": train_loss, "val_loss": val_loss})

Every call to this method logs the metrics to the interactive dashboard. Besides scalars, sample predictions like images or matplotlib plots can also be saved.

Weights and Biases dashboard. This example is available to explore online.

If you use PyTorch, the wandb.watch method can be used to register your model and keep track of all its parameters.

Weights and Biases collects much more data than you specify. First, it records all system data it can access, like GPU and CPU utilization, memory usage, and even GPU temperature. 

In addition to this, the plots and visualizations present in the logs can be used to create research paper quality reports and export them to PDF. It even supports LaTeX for mathematical formulas. (I love this feature.) See this one for example.

The free version gives a ton of value already, but if you want to access the collaborative and team management tools, you have to pay a monthly subscription fee.

To get a more detailed look at the tool, check this article, written by the founder of Weights and Biases.

Compared to MLFlow, some features stand out. For instance,

  • the UI is beautifully designed and the user experience is significantly better,
  • tracking is done remotely by default in the web app,
  • you can create beautiful reports and export them to PDF.

Besides MLFlow and Weights and Biases, there are many tools with similar functionality. Some of them even go beyond and provide a full stack MLOps platform. Without any preference, here are some of them.


Similarly to software development, creating a machine learning solution is not a linear process. 

In software development, as features are added, the product is constantly tested and improved. As user feedback, feature requests, and bug reports pour in, engineers often go back to the drawing board and rethink components.

This is where things often go sideways.

With every change, things may break. Bugs can be introduced and things can get very confusing. Managing a single shared codebase between multiple developers would be impossible without version control like git.

In machine learning, complexity is taken to another level. Not only does the code to train and serve the models change rapidly, but the model development itself is more of an experimental process. Without specialized tools to track the results, data scientists are lost.

Two of the best tools out there (among many other excellent ones) are the MLFlow suite and the Weights and Biases tool. Depending on your needs, these can introduce you to a whole new world if you haven’t spent the extra time to organize your work. These are easy to learn and bring so many positive things that you shouldn’t miss out on them.
