Using the Ltree Datatype in Postgres
— 08:54 reading time
Postgres has supported the ltree datatype as far back as 8.3 (potentially even further back but 8.3 is the oldest version of docs that were available). The ltree datatype is Postgres’ implementation of materialized paths, allowing the storage of label trees (hence the name) that represent the path to a record in a hierarchical structure. Materialized paths, adjacency lists and nested sets represent the three primary methods for modeling hierarchical data in a relational database. I’ll give a brief overview of those methods but will not delve too deeply as there are plenty of online resources that discuss the varying approaches at length. Instead, I’ll talk about how and why I decided to use ltree for FOX, our content management system.
Modeling a Hierarchy in a Relational Datastore
A problem that every software engineer will encounter in their lifetime is how to model hierarchical data in a relational database. They will have to make the choice between three modeling methods: adjacency lists, nested sets and materialized paths. All three have benefits and drawbacks, and it is important to recognize and evaluate how they will affect an application’s performance before making a decision.
Adjacency lists are possibly the simplest way to represent a data hierarchy in a relational database. At a minimum, a table implementing an adjacency list will need a
parent_id column that is used to reference another record in the same table. For a given record, you can find all of its children. Each child is associated with their own children until you get to a point where no more children are found.
The trade-off you make with an adjacency list is that writes may be fast but retrieving a record and all of its descendants can be time consuming. After getting the children for a record, each of those children needs to be iterated over in order to find the next set of children. The process repeats, generating a large number of queries and slowing down the application while it waits for the database to respond, until no more children are found. There is hope in Postgres as it has a mechanism that will do the iterative queries internally instead of requiring the application to do iteration.
If you need very fast reads you will generally choose to model your data with a nested set implementation. The exact details for each implementation vary but the basic idea is to define a left and right value on each record that encompasses the left and right values of all of its children. A drawing will help.
By using the values of left and right for each record you can write a single query to retrieve a record and all of its descendants. The query for the diagram above would look something like…
SELECT * FROM nested_set_table WHERE left >= 1 AND right <= 10;
Most implementations will also have a parent id field to allow for the left and right values to be rebuilt and a depth field to add some additional sorting options.
Materialized Paths rely on a pre-calculated path that describes the ancestry of a record. Again, a drawing will help.
Updating the diagram from the nested set discussion you can see how materialized paths work. In this example, the IDs of the record and its ancestors are used to construct the paths that are period separated lists of the ids. To retrieve the entire graph, the query would look something like…
SELECT * FROM materialized_path_table WHERE path LIKE '1.%';
Similar to nested set implementations, a parent_id column is normally present to rebuild the paths if needed and an order column to order siblings correctly. In Postgres the materialized path concept is implemented as the ltree datatype. Using ltree the query would look something like…
SELECT * FROM materialized_path_table WHERE path <@ '1'
How I Decided to Use Ltree
Picking a modeling technique can be critical to the success of your project. I’ll tell you how I eventually settled on ltree for FOX, but there are a few items to address before I can start. For the rest of the discussion, when I refer to a “node” I will be referring to an entry in the table that is storing hierarchical data. Additionally, the following list contains requirements and behavior observations that were known up front or identified as the application matured.
- Pages need to render quickly so it is critical to pull content back in as few queries as possible.
- Writes also need to be quick to reduce time wasted waiting for the application to respond.
- New nodes are added frequently.
- Moving a node to a new parent never happens.
- Trees within the table will be relatively shallow but they will be numerous.
- Old nodes are generally not deleted.
When the project started, the primary objective was speed, so I decided to use a nested set implementation. It was also helpful that the awesome nested set gem existed, making it easier to implement the pattern.
The Trouble With Nested Sets
When FOX was small it only supported a limited number of pages and the application performed well. However, as the number of pages increased, users started experiencing issues. The first issue identified was a problem related to the nested set methodology and how FOX uses the nodes table. This manifested itself as a slowdown when editing pages, particularly old ones.
Understanding how FOX manages content edits will provide you with some insight into the causes for the observed editing performance degradation of older pages. FOX manages content change in a very binary, read-only fashion. Once a piece of content is live it cannot be changed and it is consider read-only. To update the content, new records are added to the database and actions are recorded to activate/deactivate the record appropriately at publish time. This means FOX adds nodes constantly.
The performance problem materialized because nested set implementations need to keep the left and right value of all of the records in sync and sequential. In FOX, the oldest pages are the ones with nodes that have the lowest left values. Any new pages and their nodes are added to the right of the existing ones. Over time more and more nodes are added to the right of the original nodes. In a nested set implementation if you want to insert a new node, the left and right values of all of the nodes to the right must be updated. This results in a query that has a very long execution time. The query execution time coupled with only running a sole backend worker for the editor resulted in an awful experience for users. In response I increased the number of backend workers so the long running queries would not block others from using the application.
What seemed like a no-brainer actually contained a hidden problem. After increasing the number of workers, integrity issues with the left and right values began to surface. Seemingly at random, nodes would have left and right values that overlapped with other nodes and what should be a very orderly structure got distorted. Doing full rebuilds would sometimes fix the issues but they kept coming back. After investigating and some trial and error experiments, I determined that increasing the number of workers was allowing for a race condition to occur with the long running node queries. If two queries were created at relatively the same time and affected relatively the same area of the node tree, they would be run sequentially and result in integrity problems because the second query would have been run using old IDs that were updated by the first query.
With query times continuing to rise and the identified concurrency issues nested set could no longer be used to model our hierarchical data. As a stopgap I reduced the number of workers to one to ensure the integrity of the nodes table and I started looking for alternatives.
Ltree to the Rescue
I found the solution to the problem in materialized paths and the ltree datatype in Postgres. Materialized paths have several key advantages that make it a very good choice for the modeling needs of FOX.
- Insertion is very fast anywhere in the tree
- Nodes are not reassigned (move) to new parents, avoiding the performance problems associated with rebuilding the paths
- Reads are just as fast as nested sets
- The chance of for race conditions is lessened (though I do not think it is eliminated entirely)
After doing some initial investigation and experimentation with ltree I decided that it would probably solve the current problem with FOX and moved forward. I wrote a very basic library that implements the methods FOX was using from the nested set library with ltree and migrated the nodes table to ltree. The long running queries disappeared, much to our DBA’s delight, and I was able to boost the number of backend workers without having to worry too much about race conditions. It has been over six months since the switch and everything is going well.
There is no one best method of modeling hierarchical data in a relational table. The choice is situational, relying heavily on the requirements/behavior of the application. Even within the same project, multiple modeling approaches could be used. FOX uses both materialized paths and adjacency lists, but for different things. Given the success of ltree and materialized paths in FOX, we have started using it on other projects to model hierarchical data and it is proving quite useful.
Director, Software Engineering