# Machine Learning, Machine Learning Theory To study structure,
tear away all flesh so
only the bone shows.

Linear algebra. Graph theory. If you are a data scientist, you have encountered both of these fields in your study or work at some point. They are part of a standard curriculum, frequently used tools in the kit of every engineer.

What is rarely taught, however, is that they have a very close and fruitful relationship. Graphs can be used to prove strong structural results about matrices easily and beautifully.

To begin our journey, first, we shall take a look at how a matrix can be described with a graph.

## Matrices as graphs

Suppose that we have a square matrix

We say that the weighted and directed graph

corresponds to A if

If this sounds complicated, here is an example.

So, for every square matrix, we have a weighted and directed graph. In general, having distinct representations for the same object is colossally useful in mathematics. Sometimes, complex things can be significantly simplified the moment you start looking at things from a different perspective.

Such as the case of matrices and graphs.

### Walking around the graph

A great feature of the graph representation is the ability to visualize matrix powers.

Suppose that the k-th power of A is denoted by

where k can be any positive integer. For k = 2, it is calculated by

which seems pretty mystical.

With graphs, there is a pretty simple explanation.

For any two nodes i and j, there are several ways to go from one to another. These are called walks. In general, a walk is defined by the sequence of vertices

where there must be an edge between vᵢ and vᵢ₊₁. Since the graph of the matrix is weighted, we can define the weight of the walk as well by multiplying the weights of each edge traversed:

(To motivate the definition, let’s suppose we are talking about a transition graph of a discrete Markov chain. In this case, the weight of a walk would equal to the probability of having consecutive states corresponding to the prescribed walk.)

So, in graph terminology,

equals to the sum of weights for all possible k-long paths between i and j. This is easy to see for k = 2, and the general case follows from this using induction.

## Mixing things up (literally)

If you are familiar with some advanced topics in linear algebra, you must have encountered the concept of similar matrices. A and B are called similar if there is a matrix P such that

holds. This is a very important concept. If we think of matrices as linear transformations, similarity means that A and B are essentially the same transformations, only in a different coordinate system. (And P is the change of coordinates.)

For example, some matrices can be diagonalized with similarity transformations. So, if we look at them using a special coordinate system, a complicated matrix may just describe a simple stretching.

An important special case is where the similarity matrix P is a so-called permutation matrix.

### Permutation matrices

In mathematics, any bijective mapping

is called a permutation. Specifically, if X = {1, 2, …, n}, then π is simply a reordering of the numbers.

Any such permutation can be represented with a matrix, defined by

If this is not easy to understand, no worries, I’ll give an example right now. For the permutation defined by

we have

Why do we define the permutation matrix this way? To see this, consider the following:

Multiplying a matrix with a permutation matrix shuffles its rows or columns, depending on whether we multiply from the left or right, so we have

and

(Recall that matrix multiplication is not commutative.)

The inverse of a permutation matrix is its transpose. This is easy to see, once you explicitly calculate the product

by hand.

#### Similarity transforms with permutation matrices

So, let’s go back to graphs and matrices. For a given permutation matrix and a matrix A, what does the similarity transform do? If you think about it, the matrix

contains identical elements, just its rows and columns are shuffled. In fact, their corresponding graphs are isomorphic with each other. (Which is a fancy expression for being the same after relabeling certain vertices.) Although showing this might look difficult, it can be done by simply noting three key things.

• The graph of APπ can be obtained from the graph of A by taking all edges (i, j) and replacing them with (i, π(j)).
• Similarly, the graph of PπᵀA can be obtained by replacing (i, j) with
(π⁻¹(i), j ).
• Every permutation graph can be written as a product of permutation matrices where only two elements are swapped. These are called transpositions and their inverses are themselves.

## The structure of matrices

A central question in the theory of matrices is to simplify and reveal their underlying structure by some kind of transformation, like similarity.

For example, as mentioned, certain matrices can be diagonalized by a similarity transformation:

Note that this is not true for all matrices. (Check out the spectral theorem if you are interested.) Diagonal matrices are special and easy to work with, so when diagonalization is possible, our job is much simpler.

Another special form is the block-triangular form. The matrix A is upper block-triangular, if there are submatrices B, C, D such that

(Note that 0 is a matrix with all zeros here.)

Definition. A nonnegative matrix A is called reducible, if it can be upper block-triangularized with a similarity transform using a permutation matrix P:

If it cannot be done, the matrix is called irreducible. From a graph-theoretical perspective, reducibility is equivalent to partitioning the nodes to two subsets S, T such that there are no outgoing edges from T to S.

Imagine that you are randomly walking along the edges of this graph, like a Markov chain. Reducibility means that once you enter T, you cannot leave it. (And, if there is a nonzero probability to enter, you will enter eventually.)

With irreducible and reducible matrices, nonnegative matrices can be significantly simplified, as we shall see next.

### The Frobenius normal form

Before going into explanations, let’s

just state the theorem.

Theorem. (Frobenius normal form) For every nonnegative matrix A, there is a permutation matrix P such that

where the Aᵢ –s are irreducible matrices.

This theorem is hard to prove using only the tools of algebra. (As it was done originally.) However, with graphs, it is almost trivial.

To do this, let’s introduce the concept of strongly connectedness nodes in the graph. The nodes i and j are strongly connected if there is a directed walk from i to j AND a directed walk from j to i. That is, they are mutually reachable from each other.

In the language of algebra, the relation “i and j are strongly connected” is an equivalence relation. This means that they partition V into subsets V₁, V₂, …, Vₖ such that all vertices in Vᵢ are strongly connected, but NOT strongly connected with any other vertex.

For illustration, the graph looks something like this.

Believe it or not, this proves the Frobenius theorem about the normal form. To see, we just have to renumber the vertices such that each strongly connected set of vertices are numbered consecutively. As we have seen, the renumbering of vertices is equivalent to applying a permutation transform. Hence,

is of the desired form.

This theorem illustrates the use of graph theory in linear algebra. By simply drawing a picture, so many structural patterns are revealed.

Of course, this is just the tip of the iceberg. If you are interested, check out the book A Combinatorial Approach to Matrix Theory and Its Applications by Richard A. Brualdi and Dragos Cvetkovic, which is full of beautiful mathematics regarding this topic.

### Conclusion

Graphs and matrices go hand in hand. Specifically, graph theory provides a new way to think about matrices. Although this is usually not part of a standard curriculum in linear algebra, it is a fruitful connection between the two. With it, certain structural aspects of matrices become trivial.