Wednesday, March 24, 2010

Rendering Trees with Closure Tables

I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I've given this presentation at the MySQL Conference & Expo in the past.

I'd also like to mention that I've developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

Here's the reader's question:
I would like to ask if there's a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

rootree
- 1stbranch
- midbranch
- corebranch
- leafnodes
- lastbranch
- lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader's example, one could define and populate two tables like this:
drop table if exists closure;
drop table if exists nodes;

create table nodes (
node int auto_increment primary key,
label varchar(20) not null
);

insert into nodes (node, label) values
(1, 'rootree'),
(2, '1stbranch'),
(3, 'midbranch'),
(4, 'corebranch'),
(5, 'leafnodes'),
(6, 'lastbranch'),
(7, 'lastleaf');

create table closure (
ancestor int not null,
descendant int not null,
primary key (ancestor, descendant),
foreign key (ancestor) references nodes(node),
foreign key (descendant) references nodes(node)
);

insert into closure (ancestor, descendant) values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),
(3,3), (3,4), (3,5),
(4,4), (4,5),
(5,5),
(6,6), (6,7),
(7,7);
What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL's useful GROUP_CONCAT() function to build this list for us.
select group_concat(n.label order by n.node separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

Here's the output in the MySQL client. It looks like what the reader asked for:
+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+
I do assume for the purposes of ordering that all of a node's ancestors have a lower node number. You could alternatively use a pathlength column to the closure table and sort by that.

The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity. By using indexes, the EXPLAIN report shows that MySQL query optimizer does a pretty good job on it (I've omitted a few columns for brevity):
+-------+--------+-------------------+--------------------------+
| table | type | ref | Extra |
+-------+--------+-------------------+--------------------------+
| d | range | NULL | Using where; Using index |
| a | ref | test.d.descendant | |
| n | eq_ref | test.a.ancestor | |
+-------+--------+-------------------+--------------------------+

22 comments:

xxx said...

Wow. It seems your book is what I need. Ordered a PDF version -)

info said...

Hi Bill,

How would you do this and taking all the structure into an array instead of a big string?

Thanks!

Vani said...

I came across your slides while looking for better patterns for storing hierarchical data. It's definitely one of the best articles i came across.

Closure tables just got my attention. I am planning to use it for a folders design for my project.Folders in my project are heavily shared, moved, created, deleted in the system. They are also permission-ed.

Do you think closure tables is a good idea? or Path enumeration is better?

I really appreciate your feedback.

Thanks!

Bill Karwin said...

@Vani: Hi thanks for your comment.

Closure Table has many advantages, but if one of the common operations you want to do is to move a subtree, this may not be the best design.

Path Enumeration may be easier for that particular kind of change. To move a subtree, you can do it by doing a string-replace on the prefix of a path.

In the example in this blog, suppose I want to move 'corebranch' to be a child of 'lastbranch':

UPDATE Folders
SET path = REPLACE(path, 'rootree/midbranch/', 'rootree/lastbranch/)
WHERE path LIKE 'rootree/midbranch/corebranch/%'

The best tree design for your app depends on what kinds of operations you need to do with your tree.

Vani said...

@Bill: Thank you Sir for your prompt reply.

My app simply resembles the google docs for example. I let users organize their files in folders and share them with fellow users.I was looking at hierarchical tags(labels, as in gmail/google docs). But the problem with that is sharing, i am not sure if hierarchical tags works well with permissions.

Sheldon said...

Hi Bill

First, thanks a lot for the slides from your presentation on closure trees. They are great, finally I'll be able to get fast trees working with referential integrity.

Anyway, I wanted to ask, when you mention that you depend on the id order in this post, and the alternative would be to depend on some kind of length field, I'm really not sure what you mean, or how that could be used. Would you mind elaborating, I'm trying to get this working with preorder traversal like in the example, but as far as I can tell the order wouldn't be correct if I say, moved a subtree about?

Would really appreciate the help.

Regards
Sheldon

milehighrockstar said...

Bill, thanks for pointing me to your blog! I'm reading your book and it's very helpful (as well as Joe Celko's two SQL for Smarties books on hierarchical data). I'm sold on the versatility and integrity of closure tables; however, I'm finding the queries for returning the data I need to be less than intuitive.

Using the above example, how would one restructure the SELECT statement to list only rows with complete paths, i.e., exclude the rows containing sub-paths of succeeding rows?

Desired result:
rootree -> 1stbranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch -> lastleaf

Thanks again for your help!

Jordi said...

Hi Bill, first of all, I am reading your and I think it is a treasure! Its full of real examples.

My question is how I can implement a threaded comment system, where the comments are paginated?

I don't know how I can get it done with closure tables.

For example: Show 5 comments and their replies

Comment 1
Reply
Reply
Reply
Comment 2
Reply
Comment 3
Comment 4
Comment 5
Reply
Reply
Reply
Reply

Jordi said...

I think I'll do 2 queries. The first to return all 5 comments and then make a query that returns me the answers to these 5 comments.

In a column of the table where comment information resides, will store if it is a comment or a response.

Sergio said...

Hi Bill. I've noticed in your closure table samples each node (even leaf nodes) are referencing to themselves. There's no way to distinguish root nodes from branches and last segments of the tree, if yo do it this way. I would rather let root nodes to have a reference to themselves only. In this case they can be distinguished from the rest by either length, which is 0 or equal ancestor and descendant values.

On another note, wondering if you have any suggestions regarding the most effective way to find all last segments of a particular tree?

Bill Karwin said...

Hi @Sergio, thanks for the questions.

The root node of a tree is the node that has no ancestors, besides itself.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS anc
ON anc.descendant = c.descendant AND anc.ancestor <> c.ancestor
WHERE anc.ancestor IS NULL

Finding leaves is the complementary query, finding nodes with no descendent besides themselves.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS des
ON des.ancestor = c.ancestor AND des.descendant <> c.descendant
WHERE des.descendant IS NULL

Russ said...

I've looked into the closure tables and implemented a version using path_length, and so far I love it.

However, there is one thing I have yet to solve. I can grab all ancestors or descendants of a node easily with path length. However, in my tree some nodes are in multiple subtrees, and I haven't figured out how to enumerate a single path. That is, when I say give me all ancestors and sort by path_length, this does not give me a single path to that particular node (as it would if the node was only in a single path). Instead, it's a mix of two or more paths, but I can't match a level 2 with a level 3 correctly (which ancestor of the level 3 is the correct match for that particular path?).

Is there a way around this in SQL, or must it be left up to the application layer. The path chosen isn't all that important - just that it's a correct path (or alternatively, return all paths and then leave it up to the application layer to pick one).

Altec said...

Hi, interesting topic.
I would like ask which models (for hierarchical data) it is best for which application.
(I mean shich model is the best for e-shop, blog, ,....)
Thanks.

Altec said...
This comment has been removed by the author.
Bill Karwin said...

Hi Altec, thanks for your question, but I can't answer it at that level. There are many types of tasks you might need to do with hierarchical data in any given application. The choice of what design to use for the data has more to do with the specific queries you need to run, not so much with the purpose of the application.

Niko Brauc said...

I've used closure tables in some of my recent apps and have mixed feelings. I still like nested set very much but
I wanted to implement multiple parents. That's ideal for closure tables. And now everything works simple and fast.

But I wonder if self referencing nodes are really needed.
Isn't it simpler using table with null-able anestors, eg:

CREATE TABLE `category_closure` (
`category_id` int(10) unsigned NOT NULL,
`ancestor_id` int(10) unsigned DEFAULT NULL,
`distance` tinyint(3) unsigned DEFAULT NULL,
UNIQUE KEY `category_id` (`category_id`,`ancestor_id`),
KEY `ancestor_id` (`ancestor_id`)
);

In this case we needn't (shouldn't) inserting self referencing nodes, eg. (2, 2, 0),
only childrend and other descendants.

I know that now we should by hand prevent inserting multimple inserting
of the same root category, eg.
(1, null, null)
(1, null, null)

But on the other hand root categories manipulating can be much simpler.

What's your opinion? Is it worth using null-able ancestor?

Bill Karwin said...

@Nico: I have at least three reasons why the self-referencing nodes are useful.

1. Primary key columns must be non-nullable, and the two columns of ancestor, descendant in the closure table serve as the best primary key.

2. The self-referencing row makes it easier when you add a new child node. For example, if you have a path A-B-C-D and you want to add a new child of D, you just run:

SELECT ancestor, E FROM closure WHERE descendant = D

If you didn't have a self-referencing row (D,D), you'd have to add the last path (D,E) by hand anyway.

3. Most of the time when you want to query either a chain of ancestors of D, you'd want to include D in the result too. E.g. a breadcrumbs query. If you didn't have the self-referencing row, you'd get the ancestors of D, but not D itself, from this query:

SELECT ancestor FROM closure WHERE descendant = D

With the self-referencing row, you get ancestors and also D itself. You get a similar benefit when you want to query for a subtree and include the top node of that subtree.

SELECT descendant FROM closure WHERE ancestor = B

So yes, I do think the self-referencing row gives several benefits, even though it looks superfluous at first.

Niko Brauc said...

More than enough reasons for the symmetrical structure :-) Thanks for the quick and comprehensive response.

Justin Zaun said...

Hi Bill,

First, thanks for all your different posts and reticles on Closure Tables. They've helped a lot. I have everything just about working but I can't seem to select my tree in the correct order out of my table. I get all the correct nodes, just not in the correct order, even if ordering by pathlength.

You can see my full SQL here:

http://stackoverflow.com/questions/8252323/mysql-closure-table-hierarchical-database-how-to-pull-information-out-in-the-c

Thanks.

Tukki said...

@Bill, thanks for your great posts on closure tables.

And I made a closure table demo base on your posts, and choosing another solution to render sub-tree without care about the parents. I'm just a newbie, and I don't know if this absolutely right, but it worked.

Here this my note: http://goo.gl/UIXPA

anietog said...

Hi Bill,
first of all thanks a lot for this article, it is really useful at least for me.

I was wondering this desing needs to be changed slightly if we want to store more than one tree? i mean to have N root nodes? Or as you have answered @Sergio each node with no ancestors will be the root of a particular tree? Thanks a lot for your time

anietog said...

I have already found an answer to my question. http://stackoverflow.com/questions/4958570/how-to-find-distinct-root-nodes-of-newest-nodes-in-trees-hold-in-closure-table

Thanks a lot