Monday, April 20, 2009

SQL Antipatterns Strike Back! Slides

I presented my tutorial at the MySQL Conference & Expo today. I have fun preparing it and presenting it, and I got many good questions and comments from the audience. Thanks to everyone for coming and participating!

I have uploaded my slides with a Creative Common 3.0 license to my SlideShare account: http://www.slideshare.net/billkarwin

For those who did not get to see my tutorial, I'm presenting some selections from it during a 45-minute session at the MySQL Camp on Wednesday at 2:00pm, under the title "Practical Object-Oriented Models in SQL."

See you next year!

17 comments:

Sheeri K. Cabral said...

Bill, can you edit http://forge.mysql.com/wiki/MySQLConf2009MondayNotes to put a link to your slides?

erangalp said...

Hey Bill, great presentation! a lot of hidden gems in this one.

I have one question - you skipped quickly over the depth column in the closure table model, but how do you actually calculate it? it seems less straightforward than I thought initially

Bill Karwin said...

@erangalp: Calculate depth as you insert a node.

The depth of the degenerate path (a node's path to itself) is always zero.

Then when you copy its ancestors' paths, add 1 to the depth of each path.

pytrin said...

Though if I insert the ancestor path using the query you suggested in the presentation, it makes no distinction to depth. Is there a way to do this in a simple manner, or do I have to calculate per depth level per node insertion?

Bill Karwin said...

Here's a query to insert a new node #8 as a child of node #5 in the Closure Table design, including calculation of the depth:

INSERT INTO TreePaths (ancestor, descendant, depth)
SELECT 8, 8, 0
UNION ALL
SELECT ancestor, 8, depth+1
FROM TreePaths

WHERE descendant = 5;

Or are you trying to fill in the depth values for a populate TreePaths table? That's a bit tricker than providing the depth value for each node as you insert it.

JamesFM said...

Hi Bill
Thanks for putting this on slideshare - it's really helping me.
I'm working with a tree of about 25,000 nodes, using a simple ParentID approach. The main load is querying for arbitrary subtrees, so I'm interested in nested sets and the closure table (which is new to me).
Do you know of accessible introductions to using closure tables? I'm not having much luck just googling.

Bill Karwin said...

It seems that Nested Sets has become the most popular to write about, among the "fancy" tree implementations.

Vadim Tropashko writes about the "Transitive Closure Relation" in his blog (http://vadimtropashko.wordpress.com/nested-sets-and-relational-division/) and also in his book "SQL Design Patterns" he covers it in the context of representing general graphs, not just trees.

This is the same thing as my Closure Table, but he approaches it from a much deeper theoretical direction.

Bill Karwin said...

Sorry - Tropashko also calls it the "Transitive Adjacency Relation."

jamesfm said...

Very helpful - thank you.
I'm moving towards the sense that I would use the classic adjacency list as the "canonical representation" of the tree, and then periodically populate a different representation.
Then I'd use the adjacency list for some queries (get parent, get children) and the other representation for other queries (get all descendants, get all ancestors).
So far I've found three other kinds of representation I could use - nested sets, materialised path, and closure table. Of which the latter looks to be simplest to both populate and query.
Do that seem a good way to go?

Bill Karwin said...

Sure, that seems fine. The plain Adjacency List design is good for certain types of queries, but bad for querying whole subtrees.

FWIW, I have written applications that used the Closure Table alone, without the traditional Adjacency List, and it worked out fine.

Another option if you need to query trees only from the root node (and not arbitrary internal subtrees), is to use a hybrid approach with Adjacency List (the parent_id column): add one more column "root_id" so every descendant knows its root ancestor.

So querying a whole tree is as simple: SELECT * FROM Comments WHERE root_id = 1;

jamesfm said...

Good thoughts.
I have to find arbitrary subtrees anywhere in the hierarchy.
In fact there's also an emerging need to find the "extended family" of a node, by which I mean "find all the descendants of an ancestor of this node, where the ancestor is of a particular type".
This was looking hard just with an adjacency list approach - but I think it will be simple with a closure table.

Jean-Rubin L said...

Hi,
I found your slides on Slideshare after posting a question on phpfreaks and believe it I've been using it a lot. Not just for the closure table approach but for enums and other goodies that were in there so thank you for this document.

I have question regarding the closure table. I have a tree structure that is working well. Now I want to copy a node and all its children and I am having a hard time finding the right approach to do it. When I copy for instance node A with children B and C, it will create Notes A' B' C' with new IDs ( I have an auto-incrementprimary key) Reproducing the relationship OF A, B and C poses me a challenge however. In the closure table you have the relations regarding the original nodes but you cannot build reproduce the relation with the new nodes as they were just created and there's no information about them in the closure table. Do you have any suggestion as to how I should tackle this? Thanks in advance,
JR from Montreal

Jean-Rubin L said...
This post has been removed by a blog administrator.
Bill Karwin said...

@Jean-Rubin: Good question. Relocating or cloning a subtree in the Closure Table or Nested Sets designs isn't very easy.

For the case you describe, I'd recommend creating a temp table that maps the node id's of the old subtree to the node id's of the cloned subtree.

CREATE TEMPORARY TABLE NodeMap (
original_id BIGINT PRIMARY KEY,
clone_id BIGINT NOT NULL
);

Then if you can query an original subtree (rooted at node #4 for instance), you can insert a clone subtree with equivalent structure but using the cloned nodes:

INSERT INTO ClosureTable (ancestor, descendant)
SELECT n1.clone_id, n2.clone_id
FROM ClosureTable c
JOIN NodeMap n1 ON (c.ancestor = n1.original_id)
JOIN NodeMap n2 ON (c.descendant = n2.original_id)
WHERE c.ancestor = 4;

Warren Benedetto said...

Bill, awesome presentation. I have returned to your slides many times over the last few months, especially the section on closure tables.

I'm just curious ... suppose I wanted to sever a child tree (and all of its descendants) and move it to a new parent. If the new parent is at a different depth in the tree than the old one, all the depths will need to be recalculated.

It seems the easiest thing to do would be to DELETE the child and its descendants (from the TreePath table), then INSERT them with the new parent. Then there would be no need to re-calculate the depths.

However, I'm wondering if there's a more efficient way to do it, via an UPDATE query. Any thoughts on how such a query might be structured? If it's possible, would it be more efficient than the DELETE/INSERT method?

Jean-Rubin L said...

Warren, it's funny you should mention that. I am currently working on moving nodes around in the tree. Your question is very valid but just my two cents: even if you use a delete / update, the depth of the new nodes will probably change as the insertion may take place at a different level. So far I have systematically removed the child nodes and recalculated the ancestor-descendant-depth combination. Hopes this helps a little, JR

Bill Karwin said...

@Warren: Consider a transformation of a tree like (A (B (D (F (H, I))), C (E (G)))) -> (A (B (D), C (E (G (F (H, I))))))

That is, move a subtree F with its children H and I. Old parent is D. New parent is G.

Paths within the subtree (e.g. FF, HH, II, FH, and FI) don't need to change at all. Keep in mind the "depth" is really pathlength, not the level in the tree.

Paths from ancestor nodes into the subtree have to be recalculated, up to the common ancestor of the old parent and the new parent.

I can update paths from D and simply change the parent to G, and pathlengths stay the same.

I can delete paths from B into the moving subtree.

And I can insert new paths from C and E into the subtree, generating the paths I need by joining CE and CG to the paths inside the subtree, and adding 1 to the pathlengths.

Paths from A into the subtree need to add 1 to pathlength because the length from A to G is 1 greater than the length from A to D.

Hmm. I'm thinking I could write a whole book showing how to do each type of tree operation with each type of hierarchical data design. Maybe that'll be my next book.