Monday, April 20, 2009

SQL Antipatterns Strike Back! Slides

I presented my tutorial at the MySQL Conference & Expo today. I have fun preparing it and presenting it, and I got many good questions and comments from the audience. Thanks to everyone for coming and participating!

I have uploaded my slides with a Creative Common 3.0 license to my SlideShare account:

For those who did not get to see my tutorial, I'm presenting some selections from it during a 45-minute session at the MySQL Camp on Wednesday at 2:00pm, under the title "Practical Object-Oriented Models in SQL."

See you next year!


Sheeri K. Cabral said...

Bill, can you edit to put a link to your slides?

erangalp said...

Hey Bill, great presentation! a lot of hidden gems in this one.

I have one question - you skipped quickly over the depth column in the closure table model, but how do you actually calculate it? it seems less straightforward than I thought initially

Bill Karwin said...

@erangalp: Calculate depth as you insert a node.

The depth of the degenerate path (a node's path to itself) is always zero.

Then when you copy its ancestors' paths, add 1 to the depth of each path.

pytrin said...

Though if I insert the ancestor path using the query you suggested in the presentation, it makes no distinction to depth. Is there a way to do this in a simple manner, or do I have to calculate per depth level per node insertion?

Bill Karwin said...

Here's a query to insert a new node #8 as a child of node #5 in the Closure Table design, including calculation of the depth:

INSERT INTO TreePaths (ancestor, descendant, depth)
SELECT 8, 8, 0
SELECT ancestor, 8, depth+1
FROM TreePaths

WHERE descendant = 5;

Or are you trying to fill in the depth values for a populate TreePaths table? That's a bit tricker than providing the depth value for each node as you insert it.

JamesFM said...

Hi Bill
Thanks for putting this on slideshare - it's really helping me.
I'm working with a tree of about 25,000 nodes, using a simple ParentID approach. The main load is querying for arbitrary subtrees, so I'm interested in nested sets and the closure table (which is new to me).
Do you know of accessible introductions to using closure tables? I'm not having much luck just googling.

Bill Karwin said...

It seems that Nested Sets has become the most popular to write about, among the "fancy" tree implementations.

Vadim Tropashko writes about the "Transitive Closure Relation" in his blog ( and also in his book "SQL Design Patterns" he covers it in the context of representing general graphs, not just trees.

This is the same thing as my Closure Table, but he approaches it from a much deeper theoretical direction.

Bill Karwin said...

Sorry - Tropashko also calls it the "Transitive Adjacency Relation."

jamesfm said...

Very helpful - thank you.
I'm moving towards the sense that I would use the classic adjacency list as the "canonical representation" of the tree, and then periodically populate a different representation.
Then I'd use the adjacency list for some queries (get parent, get children) and the other representation for other queries (get all descendants, get all ancestors).
So far I've found three other kinds of representation I could use - nested sets, materialised path, and closure table. Of which the latter looks to be simplest to both populate and query.
Do that seem a good way to go?

Bill Karwin said...

Sure, that seems fine. The plain Adjacency List design is good for certain types of queries, but bad for querying whole subtrees.

FWIW, I have written applications that used the Closure Table alone, without the traditional Adjacency List, and it worked out fine.

Another option if you need to query trees only from the root node (and not arbitrary internal subtrees), is to use a hybrid approach with Adjacency List (the parent_id column): add one more column "root_id" so every descendant knows its root ancestor.

So querying a whole tree is as simple: SELECT * FROM Comments WHERE root_id = 1;

jamesfm said...

Good thoughts.
I have to find arbitrary subtrees anywhere in the hierarchy.
In fact there's also an emerging need to find the "extended family" of a node, by which I mean "find all the descendants of an ancestor of this node, where the ancestor is of a particular type".
This was looking hard just with an adjacency list approach - but I think it will be simple with a closure table.

Jean-Rubin L said...

I found your slides on Slideshare after posting a question on phpfreaks and believe it I've been using it a lot. Not just for the closure table approach but for enums and other goodies that were in there so thank you for this document.

I have question regarding the closure table. I have a tree structure that is working well. Now I want to copy a node and all its children and I am having a hard time finding the right approach to do it. When I copy for instance node A with children B and C, it will create Notes A' B' C' with new IDs ( I have an auto-incrementprimary key) Reproducing the relationship OF A, B and C poses me a challenge however. In the closure table you have the relations regarding the original nodes but you cannot build reproduce the relation with the new nodes as they were just created and there's no information about them in the closure table. Do you have any suggestion as to how I should tackle this? Thanks in advance,
JR from Montreal

Jean-Rubin L said...
This comment has been removed by a blog administrator.
Bill Karwin said...

@Jean-Rubin: Good question. Relocating or cloning a subtree in the Closure Table or Nested Sets designs isn't very easy.

For the case you describe, I'd recommend creating a temp table that maps the node id's of the old subtree to the node id's of the cloned subtree.


Then if you can query an original subtree (rooted at node #4 for instance), you can insert a clone subtree with equivalent structure but using the cloned nodes:

INSERT INTO ClosureTable (ancestor, descendant)
SELECT n1.clone_id, n2.clone_id
FROM ClosureTable c
JOIN NodeMap n1 ON (c.ancestor = n1.original_id)
JOIN NodeMap n2 ON (c.descendant = n2.original_id)
WHERE c.ancestor = 4;

Warren Benedetto said...

Bill, awesome presentation. I have returned to your slides many times over the last few months, especially the section on closure tables.

I'm just curious ... suppose I wanted to sever a child tree (and all of its descendants) and move it to a new parent. If the new parent is at a different depth in the tree than the old one, all the depths will need to be recalculated.

It seems the easiest thing to do would be to DELETE the child and its descendants (from the TreePath table), then INSERT them with the new parent. Then there would be no need to re-calculate the depths.

However, I'm wondering if there's a more efficient way to do it, via an UPDATE query. Any thoughts on how such a query might be structured? If it's possible, would it be more efficient than the DELETE/INSERT method?

Jean-Rubin L said...

Warren, it's funny you should mention that. I am currently working on moving nodes around in the tree. Your question is very valid but just my two cents: even if you use a delete / update, the depth of the new nodes will probably change as the insertion may take place at a different level. So far I have systematically removed the child nodes and recalculated the ancestor-descendant-depth combination. Hopes this helps a little, JR

Bill Karwin said...

@Warren: Consider a transformation of a tree like (A (B (D (F (H, I))), C (E (G)))) -> (A (B (D), C (E (G (F (H, I))))))

That is, move a subtree F with its children H and I. Old parent is D. New parent is G.

Paths within the subtree (e.g. FF, HH, II, FH, and FI) don't need to change at all. Keep in mind the "depth" is really pathlength, not the level in the tree.

Paths from ancestor nodes into the subtree have to be recalculated, up to the common ancestor of the old parent and the new parent.

I can update paths from D and simply change the parent to G, and pathlengths stay the same.

I can delete paths from B into the moving subtree.

And I can insert new paths from C and E into the subtree, generating the paths I need by joining CE and CG to the paths inside the subtree, and adding 1 to the pathlengths.

Paths from A into the subtree need to add 1 to pathlength because the length from A to G is 1 greater than the length from A to D.

Hmm. I'm thinking I could write a whole book showing how to do each type of tree operation with each type of hierarchical data design. Maybe that'll be my next book.

Arthur Layese said...


Thanks for the great presentation, I learned a lot.

I would like to ask if there's a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

- 1stbranch
- midbranch
- corebranch
- leafnodes
- lastbranch
- lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

Is this possible with a single query alone? What are other alternatives to achieve this on a closure table approach?


Bill Karwin said...

Hi @Arthur, thanks for the question. I've posted a new blog entry to answer this:

Arthur Layese said...

Hi Bill,

That's a great solution. Thanks a lot.

Arthur Layese said...


I'm very much interested in enumeration table model and found out that there's a missing sample implementation on how to do the ordering of the leaf nodes. Again for a given example:

rootree -> 1stbranch
rootree -> midbranch
rootree -> lastbranch

if I'm going to move the 1stbranch to make it the last branch of the tree so that the new tree will be something like:

rootree -> midbranch
rootree -> lastbranch
rootree -> 1stbranch

Hopefully you may include a short discussion/topic about this in your upcoming book.

Bill Karwin said...

Hi @Arthur,

If you've studied graph theory in computer science classes you'll remember that the order of child nodes in a tree is irrelevant to the graph.

That is, (R (F) (M) (L)) is the same tree as (R (M) (L) (F)).

So keep in mind that display order data properly should be separate from the hierarchy data.

But since path enumeration is essentially a denormalized solution already, perhaps we can bend the rules.

Therefore you could just make sure to update the primary key values of tree nodes, so that the natural order of these values matches your desired display order.

This is admittedly a fragile solution, and pretty much a hack of coupling the sibling order to the primary key values (which should be arbitrary and independent of meaning).

This illustrates that each solution to represent trees in SQL has its strengths and weaknesses, and therefore which design you use should be based on the types of queries you need in a given application.

It would take a whole book to compare how to implement every scenario in all the designs of trees. In my book "SQL Antipatterns," I had to limit myself to 20 pages on hierarchical data, because there are other topics to cover.

Arthur Layese said...


Thank you very much for a well detailed explanation. I took a look at your upcoming book and it's pretty interesting and tempting to have a copy soon.

info said...
This comment has been removed by the author.
info said...
This comment has been removed by the author.
apren said...

Hi Bill, thanks a lot for the presentation.

I have two question about polymorphic associations.

if i have 5 entities with comments, what solution you recommend?. I like the solution "exclusive arcs" but it is good with 5 foreign keys?

In "reverse relationship", when i delete a bug (cascade delete), their comments are orphans in comments table.

Bill Karwin said...

@apren: Thanks for reading my presentation! I think the last solution, to create a common super-type table, is almost always the best way to represent polymorphic relationships.

marco.markl said...

Hi Bill!

I have a question about Closure Table.

I would like to ask if there's a way I can find all roots in a query?

- sub1.1
- sub1.2
- sub2.1
-- sub 2.1.1
- sub2.2



Bill Karwin said...

@marco.markl: Hi thanks for the question. You can get all root nodes by searching for the entries in TreePaths such that no path exists to that node as the descendant.

SELECT root.descendant
FROM TreePaths AS root
LEFT JOIN TreePaths AS parent
ON root.descendant = parent.descendant
AND parent.ancestor <> parent.descendant
WHERE parent.descendant IS NULL;

This will naturally find all the self-referencing (zero-length) paths of root nodes.

marco.markl said...

@Bill Karwin: Thanks for the Answer!

My Solution ;-)

SELECT descendant FROM TreePaths
GROUP BY descendant
HAVING count(descendant) = 1

Greets from Vienna / Austria

Bill Karwin said...

@marco.markl: I suggest benchmarking both solutions.

I've adopted the habit of avoiding GROUP BY when I use MySQL, because it optimizes so poorly in many cases.

No matter what brand of database you use, you should use EXPLAIN to analyze the optimization plan, and perhaps some profiling to measure the final performance.

Anyway, good job for thinking of another solution!

Lennart said...

Hi Bill. Your presentation is a very nice summary of different ways to represent an hierarchy in a rdbms. I have some notes on an implementation of a closure table:

It is a slight variation of your closure table, where I have both a parent relation, and an ancestor relation. The ancestor relation is maintained via triggers on the parents table.

The tables/triggers code and some sample data can be downloaded at the end of the article in case someone is interested in testing it.

Bill Karwin said...

@Lennart: Thanks very much! Your solution looks interesting and very expressive.

It would take a while for me to analyze your article fully, but I think I can do similar tree operations with my solution of a single closure table.

It's always good to have multiple solution to choose from. The best solution for a given application depends on the operations you need to do in that app.

Lennart said...

Hi Bill, it is indeed possible to express all operations with just the closure relation (since we can derive parents from it). The benefit (IMO) is that a developer can focus on how to update the parents relation, and the triggers will maintain the closure.

To be honest I feel that some of the operations discussed are a bit weird (I can't for example imagine when one would like to "Find the Immediate Subordinates of a Node" including the node under investigation), but I wanted to mimic the operations from the NS article for comparison.

I definitely agree on your last remark and I find your presentation very valuable for anyone who must decide on how to represent a tree.

chunlinyao said...

@Bill Karwin said...

> Or are you trying to fill in the depth values for a populate TreePaths table? That's a bit tricker than providing the depth value for each node as you insert it.
Cannot found a good solution to recalculate depth when I relocate a subtree.

dasersoft said...

Bill you are just awesome. I made true with your presentation on slideshare and it has helped me alot. However i posted this question sometimes ago in stackoverflow i dont know how u can throw more light.

4 forms are distributed, each form brings 4 people and each 4 people bring another four people. The network grows downword until 7 generations. How can i find downliners and upliners assuming i decide to pick a node at any level?


A brings A B C D
E then brings F G H J
F brings WXYZ

and so on and so forth. At 7th generation we experience an halt. Meanwhile each node begins a lineage downwords.

Question 1: How can i move upward like 7 levels upward to calculate the upliners of a node. Cos what i have been seeing around is how to query children of a node. This time around i want to be able to calculate upliners.

Emmanuel Bouton said...

Hello Bill,

Thanks for sharing your knowledge :)

I have a question on the Closure Table design ...
I wonder which request I can run to easily construct an object oriented tree.

This is my entity :
class Environment
protected $id;
protected $name;
protected $parent;
protected $children;

with the associated getters/setters.

getParent() and getChildren() are loaded lazily, but I'd like to load the full hierarchy on each request.

The problem with the following request (to get children) is that I cannot guess the ancestor of the environments at the second level ...

SELECT e.*, p.length FROM environments e JOIN environments_paths AS p ON = p.descendant_id WHERE p.ancestor_id = 7;

How would you do that (in php) ?

Bill Karwin said...

Hi Emmanuel,

You can get all descendants of 7 with their immediate parents:

SELECT p.ancestor_id AS parent_id, e.*
FROM environments e
JOIN environment_paths AS c ON = c.descendant_id
JOIN environment_paths AS p ON c.descendant_id = p.descendant_id AND p.length = 1
WHERE c.ancestor_id = 7;

Then as you fetch the result, populate an array of objects indexed by the child id.

while ($row = $stmt->fetch()) {
if (!isset($nodes[$row["id"]]) {
$nodes[$row["id"]] = new Environment($row);

Your constructor for Environment may have logic in it to make an SQL query, but I like to code my object constructors to allow an optional argument for the initial data. That way I can avoid making many SQL queries when I already got all the data for many nodes by joining in the tree query.

Then pass through the array again, assigning objects for parent and children:

foreach ($nodes as $node) {
$nodes[$node->parent_id]->children[] = $node;
$node->parent = $nodes[$node->parent_id];

Emmanuel Bouton said...

Amazing ! Thanks a lot for your quick answer !

I'll do the following :

* Query for the getParent() :
SELECT e.* FROM environments e LEFT JOIN environments_paths AS p ON = p.ancestor_id WHERE p.descendant_id = 7 and != 7 order by p.length;
=> I exclude the row for id 7 because my entity is already « hydrated ».
I'll just have to loop on the result set to set the parents recursively.

* Query for the getChildren() (as you said) :
SELECT p.ancestor_id AS parent_id, e.* FROM environments e LEFT JOIN environments_paths AS c ON = c.descendant_id LEFT JOIN environments_paths AS p ON c.descendant_id = p.descendant_id AND p.length = 1 WHERE c.ancestor_id = 7 AND id != 7;
=> I prefer to store the results by parent_id :

$children = [];
foreach ($rows as $row) {
$children[$row['parent_id']][] = $row;
setAllChildren($environment, $children);

function setAllChildren($environment, $allChildrenRows)
$childrenRows = $allChildrenRows[$environment->getId()];
if (isset($childrenRows)) {
foreach ($childrenRows as $row) {
$child = new Environment($row);
setAllChildren($child, $allChildrenRows);

Haven't tested it yet but i think it could do the job.
Thanks again for your help !!

Emmanuel Bouton said...

Just another question ...

In your slideshare models-for-hierarchical-data (page 63), how would you move the whole subtree #4 under #2 (in SQL) ?

Emmanuel Bouton said...

Ok I found my response in one of your post on mysqlperformanceblog :)

Thanks a lot !

Unknown said...

INSERT INTO TreePaths (ancestor, descendant, depth)
SELECT 8, 8, 0
SELECT ancestor, 8, depth+1
FROM TreePaths

WHERE descendant = 5;

That doesn't appear to work, at least not on my mySQL 5.6.16 build - the UNION ALL query provided actually gives two rows, one with a path depth of 0 and another with a path depth of 1, and as such causes a primary key error when it tries to insert the same ancestor/descendent twice.

Unless I'm misunderstanding something, which is entirely possible.

What would be helpful, if you had the time to providing it, would be a method of calculating the depth AFTER insert. You've said it's complicated, but having followed you around the Internet, I can't find an example of where you've done it, and this project I'm working on isn't necessarily able to guarantee inserting nodes in order.

Bill Karwin said...

Hi Unknown,

The idea is that 8 is just an example value, not a literal value you must use. It's the id of a new node, that is not in the TreePaths yet before you run the INSERT.

Since it's not in the tree, then "SELECT ancestor, 8" cannot generate the pair 8,8.

I think you already have an existing node 8 in your tree, which is one of the ancestors of 5.