Hierarchical Data in Relational Databases Part 1

Scroll this

If you ever had to store hierarchical data (for example, tree of categories) in the relational database (for example MySQL), very soon, you realized it is not easy and efficient. In fact, relational databases are not designed for storing hierarchical data.
In this tutorial you will learn how to store hierarchical data in efficient way. In the first part we will learn about theory and how storing the hierarchical data in the database really works without using 3rd party libraries. In the second part we will demonstrate how to use stofDoctrineExtensionsBundle (which among other things support hierarchical data) with Symfony 4 and Doctrine.

Example

This is the example of data that we will use in this tutorial. Category tree of the bicycle shop (not so different from www.bicycle-discounts.com).

 

The adjacency list model

The adjacency list model is the simplest and straightforward way (but not very efficient) to represent hierarchical data. Each node has parent id. Here is the above category tree in MySQL table:

In order to build the tree, you need to retrieve the table rows and recursively build the tree in your client side code (PHP for example). It is impossible to build the tree using one SQL statement. Also if you want to fetch all the products that are in the category Bikes and all its subcategories – again it is impossible to do it in one SQL query. You would probably need one SQL statement to fetch the items, build the tree in your client side code, store the ids of the Bikes and its subcategories. And then again execute another SQL statement to fetch the products that belong to those categories.

So, for such a simple task we need to execute 2 SQL queries and have extra processing on the client side code. Far better solution is to use the nested set model.

The nested set model

Represent the categories as nested containers:

The term “Nested set model” was introduced by Joe Celko.

Hierarchy is still maintained. For example category Bikes still includes Mountain (and its subcategories) and Road & Time Trail. Also, observe that the category “Home “includes all categories between its left and right container border. The category “Mountain” includes all categories between its container left and right border. Even leafs (categories without children) includes everything between its container left and right border, true, they include only them self, but now we have the general rule. Category when represented as nested set include all subcategories between its left and right container border.

Lets show this on the diagram and enumerate containers borders. As you can see the borders are enumerated in a simple way, from left to right, and numbers are increased by one at every container border.

 

Every category now has its left and right values (as shown on the picture above). So all what is needed to fetch category and its subcategories is one simple query which will say “Give me all categories whose left and right values are between parent left and right values”. For example to fetch all subcategories of category Bike we will ask for all categories which have left value > 2 and right value < 13.

The nested set model as a tree

The nested set mode can be represented as a tree which you traverse preorder. It is called modified preorder tree traversal – MPTT, modified because each node is visited twice. On the first visit left value is set, on the second visit right value is set:

Nested sets are easier to understand – and we will use them when writing queries in this tutorial.

The category table now looks like this:

Please note that parent_id is not needed by the nested set model, but it is left for convenience. For simplicity we will assume that categories name are unique.

Retrieve the data

Retrieve the full tree (or full subtree)

Retrieve the “Bikes” category and all its subcategories.

Checkout the nested set model image: query simple selects all categories whose left and right values are between parent’s container left and right values.

Retrieve all parents of the node (and the node itself)

Retrieve all the parents (path to the category) of the “Fat bike” category (and the “Fat bike” category itself):

Take a look at the nested set model image – the query is very simple. First we find the node with the name “Fat bike” and then from the image it is clear that its parents will have left value smaller and the right value larger then the node in question.

Retrieve leaf nodes (nodes without children)

If you take another look at the nested set model image you will notice that difference between left and right values of leaf nodes is always 1. 🙂

Retrieve the depth of the nodes

This one might be a little bit tricky, so lets go through it step by step. The goal is to calculate the depth for each node, and depth of the node is related to the number of parents. Plan is to use previous query that returns parents of the node and count them. 🙂
So lets start with the query which will return all the parents of the node:

As you remember from previous examples this will return:

“Fat bike” has 3 parents and depth is 4 (if we count from 1, which usually we don’t 🙂 ).

Lets change so it prints only the node name and the parent name:

And now lets count the number of parent.name . 1 is subtracted because we want depth do be calculated from 0.

To get the depths for all nodes, we need to group by node.id or node.name and remove the WHERE condition.

Final SQL query which returns depth of the nodes:

Indented tree

Using MySQL string functions it is easy to retrieve indented tree (this query is based on query that returns depth of the nodes):

Joins between the nested set model hierarchical data and its items

Or, in other words, how to work with categories and products that belong to that categories. Sample products table and data:

Number of products in each category

Goal: retrieve the number of products from each category, parent categories should display the sum of products from all its subcategories.
This query might not be so straightforward, so lets write it step by step. The plan is to LEFT JOIN table product and display a product each time for its own category and its all parents categories and then group and count.

Retrieve all the combinations between subcategory and all its parent categories.

This query will return all parents for each node.

For example, category “Fat bike” is displayed 4 times – one for itself and another 3 times for each of its parent category (“Mountain”, “Bikes”, “Home”).

Now LEFT JOIN the product table:

Each product is retrieved several time – its own category, and for every parent category. For example “Kona Wo Fat Bike” from “Fat bike” is also displayed in parents categories (“Mountain”, “Bikes” and “Home”). This is exactly what we want – because each product should be counted in its own category and in each of its parent categories.

Final query

Changes: Remove node from SELECT, GROUP BY  parent.id and COUNT products and sort categories:

Quick check:

  • “Mountain”, “Enduro” and “XC” each has 2 products – so its parent “Mountain” has 6.
  • Road & Time Trail” has 3 products. So “Bikes” has 9 products.
  • “Components” has 4 products.
  • Wheels & Tyres” has 2 products from “Hubs” and 3 from “Rims” and “Tyres”. Total of 8 products.
  • So “Home” has 9+4+8 = 21 products.

Insert a new node

Lets add a new node “Clothing & Footwear” between “Bikes” and “Components”:

  • Left/Right values that will not be changed are displayed in black squares
  • Old values are displayed in red circles
  • New values are displayed in green squares

Since the “Components” category has right value 15, a new category “Clothing & Footwear” will have values 16 for left and 17 for right.
All categories on the right side of “Components” will have to be adjusted – their left and right values will have to be increased by two.
“Components” and all other categories on the left will remain unchanged.

SQL statements for node insert:

Compare data in the table with latest nested set model image:

Table lock is necessary to avoid race condition if more categories are being  edited at the same time. Also using transaction would be a good idea in case one of the statements fails – so we could rollback to the previous state.

Delete a node

Deleting a node is a little bit more complicated, we need to delete the all children of the that node and update the remaining nodes. This are our categories after “Clothing & Footwear” was added:

Lets delete category “Mountain”. Updated nested set model image (remove “Mountain” category), so it would be clear what needs to be done:

  • Left/Right values that will not be changed are displayed in black squares
  • Old values are displayed in red circles
  • New values are displayed in green squares

As “Mountain” category and its children are deleted – nodes values on the right side needs to be recalculated. Their values will be changed by width of “Mountain” container. Width is number of borders of “Mountain” container and its children containers.
There are 8 borders (count them by hand if you want), or 10 (right value of Mountain node) – 3 (left value of Mountain node) + 1 = 8.

With this in mind, it lets write SQL statements to delete the node.

Since I am using FOREIGN key with “NO ACTION” on update and delete in product table – I first need to delete products which belong to “Mountain” category and its subcategories. Also, there is similar situation with nested_set_category.product_id.  I have to sort nodes to be deleted by “lft DESC” in order to delete childrens first and then parents so that foreign key error is avoided.

Continue to 2nd part: Hierarchical Data in Relational Databases with Symfony 4 and Doctrine.

Links

4 Comments

  1. This is awesome. You know what you’re talkin’ about.
    Thank’s

  2. I was looking for tips on representing and querying hierarchical data, and especially found your description of the nested sets model useful.

    For querying hierarchical data in the adjacency list model I suggest looking up recursive common table expressions (CTEs). These are standard (ISO/IEC) SQL, although, depending on the DBMS, the implementation may not be complete or was only added relatively recently — they appeared in MySQL in version 8.0, the documentation is at https://dev.mysql.com/doc/refman/8.0/en/with.html

    Modern SQL indicates support for various DBMS offerings: https://modern-sql.com/caniuse/with_recursive_(top_level)

    I also use Oracle Database, which has had support for hierarchical or linked queries in the form of the non-standard CONNECT BY for quite some time.

Submit a comment