How to Build a Relational Database From Scratch

Relational databases are amazing. They are not only very effective at storing data, but they also have a beautiful underlying mathematical theory as well. However, especially when one is using a database through an ORM, it can seem like a black box. Opening the black boxes is an excellent way to master a subject, so I have decided to go into the details and build a minimal but functional relational database system on my own. In this post, I am sharing my journey with you.

Instead of trying to follow how SQL works under the hood, my aim is to show the fundamental principles behind relational databases. Even if one wants to be more specific, there are several SQL dialects like PostgreSQL, MySQL, and many more, which can have significant differences.

Before we dive deep into the implementation, we should have a solid understanding of the underlying mathematical objects. So, let’s talk about what relations really are!

(If you would like to follow interactively and play around with the code, you can find the repository on GitHub at https://github.com/cosmic-cortex/relational-databases-from-scratch!)

Relations

“Mathematicians are like Frenchmen: whatever you say to them they translate into their own language and forthwith it is something entirely different.” — Johann Wolfgang von Goethe

To understand relational databases, first, we shall define what is a relation. Mathematically speaking, a relation R over sets

is a subset of their Cartesian product:

To give you an example, the < operator is a relation. You can think of it as a set of all tuples of natural numbers where the first element in the tuple is less than the second. This is how actually < is defined in mathematics.

A more concrete example will help grasp this abstract definition. Suppose that we are talking about employees in a company. Each employee has a unique id, a name, position, and a salary:

In this model, an employee is a vector of four dimensions:

Employee relation is a set of these. Using relational databases terminology, a relation corresponds to a table and an element of a relation is a record in that table. In the following, I will use these terms interchangeably.

In Python, we are going to model records with special dictionaries and tables with a set of records. We could go with something more complex (like pandas data frames for instance) but these will be perfect. Note that by going with sets, duplicate records are not allowed.

class Record(dict):
def __hash__(self):
proxy = tuple(self.items())
return hash(proxy)
def __setitem__(self, key, value):
raise NotImplemented("Modifying values is not supported.")
view raw record.py hosted with ❤ by GitHub

Because dictionaries are unhashable by default, they cannot be inserted into sets. To circumvent this, I have added a hash method for the Record class, which returns the hash of the tuple obtained by the dict.

This is dangerous to do, so we should be careful. Because dictionaries are mutable, their “hash” (as I defined above) can change. Thus, if you put this into a set and change one of its value, the hash changes and because sets in Python use hashes under the hood, this can mess things up. So, to avoid this, I have explicitly disabled modifying the record by overwriting the __setitem__ special method.

To have some more data to play around with, we shall also add a table for tasks with the following columns: id, employee_id, completed. The employee_id will describe which employee does the task with id id belongs to, and completed is simply a Boolean, indicating its status.

def make_employee(id: int, name: str, position: str, salary: int):
return Record({"id": id, "name": name, "position": position, "salary": salary})
def make_task(id: int, employee_id: int, completed: bool):
return Record({"id": id, "employee_id": employee_id, "completed": completed})
view raw tables.py hosted with ❤ by GitHub

Our tables are going to be sets of records, which we create manually for now.

employees = {make_employee(0, "Michael Scott", "Regional Manager", 100000),
make_employee(1, "Dwight K. Schrute", "Assistant to the Regional Manager", 65000),
make_employee(2, "Pamela Beesly", "Receptionist", 40000),
make_employee(3, "James Halpert", "Sales", 55000),
make_employee(4, "Stanley Hudson", "Sales", 55000)}
tasks = {make_task(0, 0, False),
make_task(1, 0, False),
make_task(2, 1, True),
make_task(3, 1, True),
make_task(4, 1, True),
make_task(5, 2, True),
make_task(6, 3, False),
make_task(7, 3, False),
make_task(8, 3, True),
make_task(9, 3, False)}

Now that we have laid a proper foundation, it is time to look at what makes relational databases work: the operations.

Relational algebra: operations on relations

A relational database would be practically useless if we wouldn’t have ways to retrieve and structure information within the database. This task is achieved by applying operators on relations. The relational algebra is the mathematical structure defined by these operations. To be precise, elements of the relational algebra are functions whose inputs and outputs are relations.

If this sounds too abstract for you, let’s see some concrete examples!

Selection

One of the most fundamental operators is the selection operation, which filters the relation based on certain conditions. It is denoted by

where C denotes the conditions. In our example database, a condition can be that “the employee’s salary is more than 60000”. It can be thought of as a function that takes a record and returns a Boolean indicating whether the condition has been met.

def select(table: Set[Record], conditions: List[Callable]) -> Set[Record]:
table_out = {record for record in table if all(cond(record) for cond in conditions)}
return table_out
view raw select.py hosted with ❤ by GitHub

If we apply the condition “salary is more than 60000” for our employees table, this is what we get.

Projection

The select operator is used to select records in our table. However, we might wish to apply filters to the columns as well for removing unnecessary information. This can be done with the projection operator. It is denoted with

where S denotes the list of columns which should remain. This is also pretty simple to implement in our setting.

def project(table: Set[Record], columns: List[str]) -> Set[Record]:
table_out = {Record({column: record[column] for column in columns}) for record in table}
return table_out
view raw project.py hosted with ❤ by GitHub

This is what happens when we project the employees table to the id and name columns.

Rename

Renaming columns is often very useful. For instance, both of our tables have an id column, which might introduce some ambiguity, so renaming it is needed.

def _columns_in_table(table: Set[Record]) -> set:
return set.union(*[set(record.keys()) for record in table])
def rename(table: Set[Record], columns: dict) -> Set[Record]:
table_columns = _columns_in_table(table)
table_out = {
Record({columns.get(old_name, old_name): record[old_name] for old_name in table_columns})
for record in table
}
return table_out
view raw rename.py hosted with ❤ by GitHub

Cross-product

This is where things start to get interesting. The true strength of relational databases is the fact that you are able to combine information within several tables and answer complicated questions by querying these combined tables. The simplest method to do this is taking the cross product, which merges records together in all of the possible ways. I am going to show an example before the implementation to make sure it is clear.

Here is where we see that colliding column names can become an issue. Before the cross product is taken, each column is prefixed with the table name for every table.

def _prefix_record(row: dict, prefix: str) -> Record:
return Record({f"{prefix}.{key}": value for key, value in row.items()})
def _prefix_columns(table: Set[Record], prefix: str) -> Set[Record]:
return {_prefix_record(row, prefix) for row in table}
def cross_product(left: Set[Record], right: Set[Record]) -> Set[Record]:
# prefixing columns with table name
left = _prefix_columns(left, "left")
right = _prefix_columns(right, "right")
table_out = {Record({**row_l, **row_r}) for row_l, row_r in product(left, right)}
return table_out
view raw cross_product.py hosted with ❤ by GitHub

The cross-product operator is denoted with

You have probably observed that cross-product combines information without regarding logical coherence. Some records in the cross-product table shows an employee and a task which belongs to another employee. We will fix this later when we are using joins.

Union

So, we have seen that with the cross-product, we can combine tables “horizontally”. It is natural to expect the same “vertically”, as was the case with filtering rows (select) and columns (project). This can be done with the union operator.

Strictly speaking, this operator doesn’t really make sense when the tables have distinct columns. However, there is a solution for this: the missing columns can be padded with null values such as None.

def _pad_table(table: Set[Record], with_cols: List):
padding_row = {col: None for col in with_cols}
padded_table = {Record({**row, **padding_row}) for row in table}
return padded_table
def union(left: Set[Record], right: Set[Record]) -> Set[Record]:
# padding
left_cols = _columns_in_table(left)
right_cols = _columns_in_table(right)
left = _pad_table(left, right_cols.difference(left_cols))
right = _pad_table(right, left_cols.difference(right_cols))
table_out = left.union(right)
return table_out
view raw union.py hosted with ❤ by GitHub

Difference

Our last operator is the difference, which is again similar to the set-theoretic difference. It simply eliminates the rows of one table from another.

def difference(left: Set[Record], right: Set[Record]) -> Set[Record]:
return left.difference(right)
view raw difference.py hosted with ❤ by GitHub

Because tables in our implementation are sets, we can simply use the built-in method.

You might notice that the intersection is left out. It is because intersection can be expressed with a difference by

or

difference(left, difference(left, right))

in code.

The case of the intersection operator is not unique. In fact, all relevant operators can be obtained as a combination of the previous six operators, as we will see later.

SQL queries in terms of relational algebra

Now that we have all these seemingly abstract operations, we can begin to see how SQL queries translate to relational algebra. Suppose that we want to query the names of all employees who have a salary larger than 60k USD. In SQL, this would look like

SELECT name FROM employees WHERE salary > 60000

In our implementation, this is equivalent to

temp_table = select(employees, [lambda x: x['salary'] > 60000])
result = project(temp_table, ['name'])

So, we can see that our tools are powerful enough to express these type of queries. However, the neatest things are still ahead of us. To be able to answer more complex questions regarding our data (like what is the average salary of those employees who have an incomplete task), we need to be able to intelligently combine information between tables. These are done with joins.

Joins

There are several methods to combine tables together in the relational algebra. So far, we have seen one: the cross-product, which combined all records together, even when they were logically not consistent. However, combining tables horizontally can be done such that the resulting table will contain meaningful information. Again, there are several methods for this. We will begin with the most fundamental one: the theta join.

Theta join

The theta join operator is simply the combination of the cross-product and the select operations. It is denoted with

where θ denotes the condition of the select. For instance, the theta join of employees and tasks on the condition that employee['id'] and task['employee_id'] match looks like the following.

As you can see, this solves the problem we had for the cross-product, where the task in a given record did not necessarily belong to the employee in the same record.

Outer joins

In some occasions, it is useful to preserve some information during a join even if there is no logically corresponding element in the other table. This is what the left/right/full outer join operators do. We are not going to go into detail, but these operators take the rows in the left/right/both tables which are missing from the theta join and union them together.

Queries as elements of the relational algebra

We have seen that most queries, even joins, can be expressed with only six operators:

  • Select
  • Project
  • Rename
  • Cross-product
  • Union
  • Difference

To see a more complex query, let’s introduce a third table, containing the clients of Dunder Mifflin Scranton. Each client has an id, a name and a contact_id, which is the contact employee’s id.

def make_client(id: int, name: str, contact_id: int):
return Record({"id": id, "name": name, "contact_id": contact_id})
clients = {make_client(0, "Dunmore High School", 3),
make_client(1, "Lackawanna County", 0),
make_client(2, "Mr. Deckert", 1),
make_client(3, "Phil Maguire", 3),
make_client(4, "Harper Collins", 1),
make_client(5, "Apex Technology", 1)}
view raw clients.py hosted with ❤ by GitHub

Now consider the following query: “what are the names of the clients whose contacts have at least one incomplete task?”

To answer this, we have to combine information from all three tables. It is useful to think through the steps we need to take. Here, we are going to

  1. find the employees who have incomplete tasks by joining the employees and tasks tables together,
  2. join the resulting table to the clients table to match client names to contact employees,
  3. select the client names by projecting to the client name column.

In code, this would look like the following.

cp_1 = cross_product(left=employees, right=tasks)
s_1 = select(cp_1, [lambda x: x["left.id"] == x["right.employee_id"],
lambda x: x["right.completed"] == False])
cp_2 = cross_product(left=clients, right=s_1)
s_2 = select(cp_2, [lambda x: x["left.contact_id"] == x["right.left.id"]])
result = project(s_2, ["left.name"])
view raw query.py hosted with ❤ by GitHub

No matter how complicated, every query can be represented as a graph called the expression tree.

Expression tree for the query “what are the names of the clients whose contacts have at least one incomplete task?”

In fact, this is not the only possible way to do this query, there are alternative solutions. For example, we could create the cross-product for all three tables right away and filtered out the required records using a single select. When using SQL, the engine tries to optimize the query by estimating the required cost for a given execution plan and choosing the most optimal for you.

From relational algebra to SQL

Relational algebra is just the tip of the iceberg. It gives us the building blocks to represent queries, however, it does not give us the tools to find how a query can be optimally represented. Under the hood, SQL actually uses relational calculus, which essentially a declarative language based on relational algebra. This means that you only describe what you want, not how you want it. The latter part is figured out by the engine.

We are missing core SQL functionalities like for example aggregates, i.e. functions mapping relations to elements, such as the averaging function. So, there is still a long way to go for a fully functional relational database. However, the fundamentals are laid and the path towards it is clear.

Resources

During my quest, I had two excellent resources helping me to build a relational database:

Share on facebook
Share on twitter
Share on linkedin

Related posts