Developing machine learning algorithms is easier than ever. There are several high-level libraries like TensorFlow, PyTorch or scikit-learn to build upon, and thanks to the amazing effort of many talented developers, these are really easy to use and require only a superficial familiarity with the underlying algorithms. However, this comes at a cost of deep understanding. Without proper theoretical foundations, one quickly gets overwhelmed with the complex technical details.

My aim is demonstrate how seemingly obscure and ad-hoc methods in machine learning can be explained with really simple and natural ideas when we have the appropriate perspective. Our mathematical tool for that is going to be probability theory and statistics, which lies at the foundation of predictive models. Using the theory of probability is not just an academic exercise, it can actually provide very deep insight into how machine learning works, giving you the tools to improve on the state of the art.

Before we begin our journey in the theoretical foundations of machine learning, let’s look at a toy problem!

## Fitting models

Suppose that we have two numerical quantities, say *x* and *y*, that are in relation with each other. For instance, *x* is the the square footage of a real estate and *y* is its price in the market for a given city. Our goal is to predict *y* from *x*. In mathematical terms, we can say that we are looking for a function *f(x)* such that

is as close as possible to the ground truth. The simplest assumption is that the function is linear, that is, *f(x) = ax + b* for some *a* and *b* constants. This model surprisingly fits well to a lot of real-life problems, moreover it is a fundamental building block of more complex models such as neural networks.

Our example dataset looks like the following.

Suppose that our dataset consists of data points and observations

Since data is noisy, it is impossible to fit a linear model to our data for which

holds. Instead, we measure how well a particular model fits the data and try to find the best fit possible according to that measurement.

Most frequently, this is is done by calculating the prediction error for each point, then averaging the pointwise error. The lower the error, the better our model is. The simplest way to measure this is to take the square of the differences between prediction and ground truth. This is called the mean square error and it is defined by

The best fit is the one where the mean square error assumes its minimum, or in mathematical terms,

For our concrete dataset, this looks like the following.

How to minimize this is not particularly important for our purposes. What matters to us is how can the model be interpreted and what kind of information it conveys. By simply looking at it, one can tell that there are many missing things. Sure, the data clearly follows a linear trend, but can this even be described with our model? A function is deterministic: you plug in *x*, you get the prediction *f(x)*. However, the data seems to show some noise which is clearly not captured by our linear model properly.

Where else might we look? One obvious answer would be to look for a more complex function, say a higher degree polynomial. In mathematical terms, we are looking for a function defined by

which can hopefully describe the data more precisely. Unfortunately, this is not the case.

When we try to brute force it and excessively increase the expressive power of our model, we see that in a sense, things are even worse: our concerns are not addressed and other issues are introduced, like the wild behavior of the model outside our domain. We have to look in other directions for the solution. Instead of looking for a single value as a prediction, we should aim for a model explaining the probability of certain outcomes! This way, we can make a more informed decision such as calculating risks.

## The language of probability theory

Suppose that I have a coin, which has 1/2 probability to come up heads and 1/2 probability to come up tails upon a coin toss. If I toss the coin 100 times, how many heads will I get? It might be tempting to quickly answer 50, but the reality is not so simple. See, you can get any number of heads from 0 to 100, but these are not equally probable. In one experiment, you might get 34, in the next, 52, etc. The probability of heads being 1/2 means that if you keep tossing the coins infinitely, the proportion of heads vs all tosses will get closer and closer to 1/2.

Let’s stick to the example of tossing a coin *n* times! Suppose that *X* denotes the number of heads. *X* is a random variable, for which we do not know the exact value, only the probability that it will assume a given value. The probability that *X* is *k* is denoted by *P(X = k)*, which is always a number between 0 and 1. Moreover, since we know that we can get either 0, 1, 2, and so on up until *n* heads, these probabilities sum to 1, that is,

Every experiment can be described with an *n*-long sequence of *H*-s and *T*-s. To calculate the exact probability for a given *k*, we need to count how many ways can we have *k* heads among *n* tosses. This is mathematically equivalent of selecting a subset of *k* elements from a set of *n* elements, which can be done in

ways. Moreover, each configuration has *(1/2)^n* probability. (For example, consider *n = 3. *The configuration *HTH* has probability 1/8, since each particular toss has 1/2 probability in itself and the tosses are independent.) So putting this together, we have

Together, these numbers are referred to as the probability distribution of the experiment. (Which is tossing *n* coins and observing the number of heads among them.)

To summarize, a probability distribution encompasses all information about the outcome of the experiment. Our model in the previous section only gives us a single number, however, we should be looking for a probability distribution instead to make a fully informed decision.

### The general Bernoulli and Binomial distributions

Our two example above can be generalized. First, suppose that we toss an unfair coin, that is, a coin where heads and tails don’t have equal probability. Say, the probability of heads is *p* and denote the number of heads with *X* as before. (Which can be zero or one.)

is called the Bernoulli distribution with parameter *p*, or *Bernoulli(p)* in short. Following this line of thinking, if we toss up this unfair coin *n* times, we have

which is called the binomial distribution with parameters *n* and *p*, or *b(n, p)* in short. Note that the Bernoulli distribution is just the binomial distribution with *n = 1*.

### Continuous probability distributions

In the previous coin tossing examples, probability distributions were fully described by the numbers *P(X = k)* for all feasible *k*. This particular random variable can only assume integers as its values, of which there are countably many. Such random variables are called *discrete*. But what about the previous case, where we estimated real estate prices? In general, random variables can also assume all real numbers as its values for instance. In this case they are called *continuous*. Say *X* can be any random number between zero and one with equal probability. What is its probability distribution? Intuitively, we can see that

for any fixed *x* in *[0, 1]*. So, how would we describe this distribution? We use the so-called probability density function (PDF in short), which describes the probability that *X* falls into a certain range, say *a ≤ x ≤ b*. The probability itself can be calculated by measuring the area under the curve of the PDF between *a* and *b*. In the case of selecting a random number between zero and one, we have

for the density function. Note that the total area under the graph of a density function is always *1*, due to the fact it expresses the probability of all outcomes.

### The normal distribution

A continuous distribution of great importance is the so-called normal distribution. (Or Gaussian in another name.) You have definitely encountered this at some point, even though you may not have known that it was a normal distribution. We say that *X* is normally distributed with mean μ and variance σ² variance, or *N(μ, σ²)* in short, if its density function is

This comes up very frequently in real life, for instance the height of people tend to show this distribution. We will encounter this many times later in our journey. The mean describes the center of the bell curve.

Notationwise, the PDF of *N(μ, σ²)* is frequently denoted as *N(x | μ, σ²)*. We will use this later.

### Conditional probability

Let’s consider a different example now. Instead of tossing coins, we will throw with a single six-sided dice. If we denote the outcome of the throw with *X*, it is reasonable to assume that

Suppose that you want to make a bet. Your friend throws the dice, and if if the outcome is less or equal than 3, you win. Otherwise, you lose the same amount. It is easy to see that by default, your probability of winning is 1/2. However, your friend tells you after throwing the dice that the outcome is an even number. What about your chances of winning now? Intuitively, it feels like that you have lower chances now, because you can only win with a 2 and you will lose if it is 4 or 6. Thus, additional information on the experiment has changed the underlying probability distribution.

This concept can be formalized mathematically as the *conditional probability*. Let’s suppose you have two events, *A* and *B*. In our concrete example, *A* is that the outcome of the throw is less or equal than 3, while *B* is that the outcome of the throw is an even number. The probability of *A* given that *B* has happened is called the conditional probability of *A* given *B*. It is denoted with *P(A | B)* and can be calculated as

In our case, *P(A and B) = 1/6*, while *P(B) = 1/2*, so our chance of winning is *P(A | B) = 1/3*.

### The statistical foundations of machine learning

To see how conditional probability fits into our machine learning perspective, we need to take another conceptual jump. Let’s revisit our coin tossing example again! This time however, the coin is not fair, so the probability of heads is not 1/2. Let’s assume that

for some *p* in *[0, 1]*. The catch is, we don’t know its exact value, we have to guess from the data. In other words, we want to estimate its probability distribution. *p* is a parameter in our coin toss experiment. (Just to be clear, we have two distributions here: one describes the outcome of a coin toss, while the second one describes our belief about the probability of heads for our given coin.)

Suppose that we have the particular coin in our hands and we toss it up in the air ten times, obtaining the result

that is, three tails and seven heads. In the language of probability, let *E* describe the event “seven heads out of ten”. So, what we really want, is

This is called the *posterior*, since it describes our belief about the coin after observing some data. Note that this is a continuous probability distribution, since *p* can assume any value between zero and one. How would we calculate this? A fundamental property of conditional probability comes to the rescue here. If *A* and *B* are general events, then

In other words, the probability of *A* conditioned on *B* can be expressed with the probability of *B* conditioned on *A*. This is called the Bayes theorem, which holds true for probability density functions as well. How does this help us? Now we have

which is great for us! There are three components in play here.

i)* P(E | p)*, which is called the *likelihood*. It is easy to calculate, and we did so in the previous section. In our current case with seven heads, we have

which means they are proportional to each other, that is, equal up until a multiplicative constant. This constant doesn’t really matter for us for reasons to be explained later.

ii) *P(p)*, which is called *prior*, because expresses our knowledge about the coin before we have observed any data. It is reasonable to assume that every parameter is equally probable, so

which is already familiar for us.

iii) *P(E)*, which does not need to be calculated, since the area under the function *P(p | E)* is always one. (In most cases calculating *P(E)*, it is computationally intractable even.) Because of this exact reason we do not really care about the multiplicative constant in the likelihood function.

From these, we can easily give a good estimation on the probability distribution of *p*, which describes the probability that a single coin toss will result in heads.

### Maximum likelihood estimation

Even though we have a probability distribution, it is generally useful to provide a concrete number as our estimate. In this case, we would like to have a parameter estimate for the probability of our coin turning up heads. Although the Bayesian route was easy for coin tossing, it can be impossible to calculate analytically. One may ask the question: given our observed outcome, which parameter is the most likely? If you think about it, this is described by the likelihood function

where *E* described the event *HHTHTHHHHT*. We calculated this by multiplying the probabilities of events we observed together.

In general, if we have the observations

with results

written in vector form for notational convenience, the likelihood function is defined by

which is a function of the variable θ. θ denotes all parameters of our probability distribution, thus it can be a scalar or a vector variable even. In our repeated coin tossing example, the *Y*-s denote the *i*-th coin toss experiment and *y* denotes the outcome. Here, *y* is one for heads and zero for tails. Also keep in mind that in general, *Y* can be discrete but continuous also.

Intuitively (which is a phrase we use quite a lot 🙂 ), the particular θ value where the likelihood function assumes its maximum would be a reasonable choice for our parameter estimation. This method is called the *maximum likelihood estimation* or MLE in short. In mathematical terms,

In the concrete example above with seven heads and three tails, this is 0.7. Although MLE is not as desirable as the full Bayesian treatment, it is often reasonable. Note that when the prior distribution is uniform, MLE is equivalent to estimating the parameter by maximizing the posterior distribution

This latter is called the *maximum a posteriori estimation* or MAP in short.

With all of these mathematical tools under our belt, we are ready to revisit the original regression example!

### Regression revisited

To recap, we have the observations

and we would like to predict the *y*-s from the *x*-es. Previously, we were looking for a function *f(x) = ax + b* such that

is reasonably close to the ground truth.

This time, let’s look from a probabilistic viewpoint! Now we have data points

coming from the distributions

For each data point, we have ground truth observations

coming from the distributions

It is reasonable to assume that all *Y*-s can be modelled as a normal distribution *N(μ(x), σ(x)²)*. For simplicity, we can assume that the variance is constant and that *μ(x) = ax + b* is a linear function. That is, we are looking to fit a model

Given all of our observations, the likelihood function can be written as

We want to maximize this function. It might seem difficult, but there is a standard mathematical trick: maximizing a function is the same as maximizing its logarithm. (As long the logarithm is properly defined, like here.) So, we have

We see the light at the end of the tunnel here. The first term can be omitted (since it is a constant), while minimizing a function is same as maximizing its negative. Thus,

It is no accident if it seems familiar: the right hand side is the mean square error! This says that the maximum likelihood estimate for fitting a model in the form of

is the general case of the linear regression we did earlier. There is one major difference however: the probabilistic model explains much more than a simple linear function. It is only the tip of the iceberg, there are many ways to generalize this simple probabilistic model to obtain more general fit for our data. One obvious method is to account for σ as a parameter and drop the constant assumption.

### Classification

Before we finish up, we will take a detailed look at classification, another major class of problems in machine learning. Here, we have a slightly different kind of problem. Our training data is again

where this time, the *y*-s are labels instead of real numbers. Thus, an *Y* is a discrete probability distribution. Our previous linear regression model is unable to properly capture this problem. In addition, although mean square error can be used when the labels are encoded with integers, it doesn’t really make sense.

For illustrative purposes, let’s consider a simple binary classification problem in one dimension with two classes, encoded with 0 and 1.

If you are familiar with these type of problems, you probably know that the usual solution is to fit a function of the form

where

is the well-known sigmoid function. This model is called *logistic regression*. Heuristically, the linear function is fitted such that it assumes positive values for *x*-es belonging to the first class and negative values for the opposite class. Then, the sigmoid converts these real numbers into probabilities. The higher *ax + b* is, the closer its sigmoid value is to 1 and similarly, the lower *ax + b* is, the closer its sigmoid value is to 0. Thus, *f(x)* effectively models the probability that *x* belongs to class 1.

To fit this model, the so-called cross-entropy loss is minimized, which is defined by

What does this loss function mean? In the regression example at the very beginning, mean square error was intuitively clear. This is not the case with cross-entropy. Have you ever thought about why cross-entropy loss is defined this way? By looking at this formula only, it is almost impossible to figure out.

If we look at the classification problem from the perspective we have gained in the previous sections, this binary classification problem can be solved with fitting a Bernoulli distribution to each *x*, that is, we model our data with

For the likelihood function, we have

After taking logarithms like before, we obtain

which is the negative of cross-entropy loss! All of a sudden, this mysterious formula filled with logarithms has a clear explanation. Minimizing the cross-entropy loss is simply modelling our data with Bernoulli distributions and taking the maximum likelihood estimate.

## Summary

When working with machine learning algorithms for problems such as classification and regression, we usually formulate questions in our spoken language like “What will be the price of this stock tomorrow?”, “Is this a negative or positive comment?” and so on. Here, my aim was to show that many of these algorithms are actually doing much more: they provide a deep statistical understanding of the underlying problem. Instead of simply estimating the stock price tomorrow, they are capable to provide much more information than a simple prediction. We have seen that the fundamental objects of machine learning such as linear regression, binary classification with logistic regression, mean square error and cross-entropy loss are all arising from very natural ideas in a statistical setting.

This is just the tip of the iceberg. Although we have seen only the most basic models, even the most advanced deep neural networks are built on these foundations. If you have understood these fundamentals, you are now one big step closer to master machine learning.