Thursday, May 21, 2009

EAV FAIL

fail owned pwned pictures
see more Fail Blog

This illustrates (by counter-example) an important characteristic of a normalized database: each logical "type" of attribute belongs in a separate column.

Just because three values happen to be numeric doesn't mean it makes sense to SUM() them together. But if dissimilar attributes are stored in the same column, it's tempting to treat them as compatible in this way.

This also shows a fallacy of the Entity-Attribute-Value antipattern. In this design, all attribute values are stored in a single column.

CREATE TABLE EntityAttributeValue (
entity VARCHAR(20) NOT NULL,
attribute VARCHAR(20) NOT NULL,
value VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

INSERT INTO EntityAttributeValue (entity, attribute, value)
VALUES
('New Cuyama', 'Population', '562'),
('New Cuyama', 'Ft. above sea level', '2150'),
('New Cuyama', 'Established', '1951'),

SELECT SUM(value) FROM EntityAttributeValue
WHERE entity = 'New Cuyama';


The Entity-Attribute-Value design does not support or conform to rules of database normalization.

update: To be clear, the proper way to design a database is to put different attributes in different columns. Use column names, not strings, to identify the attributes.

CREATE TABLE Cities (
city_id SERIAL PRIMARY KEY,
city_name VARCHAR(100) NOT NULL,
population INT UNSIGNED NOT NULL,
feet_altitude SMALLINT UNSIGNED NOT NULL,
year_established SMALLINT UNSIGNED NOT NULL
);

17 comments:

Wenbert Del Rosario said...

what would be the correct way to structure your database using your example?

At first glance, I would do something like this:

Table: Entity
Fields: id, entity

Table: Attributes
Fields: id, attr

Table: EntityAttributes
Fields: id, entity_id, attr_id, value

But then again, I was in a hurry or being lazy in a not-so-important project, I'd follow the Entity-Attribute-Value design. Because querying that kind of database is easier. Less table joins, etc.

Roland Bouman said...

@Wenbert Del Rosario

" Because querying that kind of database is easier. Less table joins, etc"

You gotto be kidding...seriously, the main characteristic of EAV schemas is that they have a "store everything, query nothing" property. You can store what ever you like, because it's so generic. Because it's so generic, you can't query it, because nothing really ever means something.

And talking about joins...what do you think the query would look like to get one row for Cuyama showing Population, ft above sea level and Established?

I have used EAV too, but only in an isolated case where an extensible storage seemed to best fit the bill. We have not, and will not, solve the problem of querying that database. If we eed reports, we transform graphs of interest to regular table structures and query those)

Bill Karwin said...

@Wenbert: In your example, you *still* have all three values occupying the same column (EntityAttributes.value).

The point is that in a well-designed table, you shouldn't put different types of things into the same column.

The correct table would be:

Cities (city_id, city_name, population, feet_altitude, year_established)

Different attributes in different columns. Use column names, not strings, to identify the attributes.

Wenbert Del Rosario said...

@Bill Karwin Thanks. That cleared up things for me. :-)

Willo van der Merwe said...

Hi Bill,

As you know I'm pro EAV, but this article really had me in stitches... :D

Willo

Bill Karwin said...

@Willo: Yeah, I realize it's possible to make a functioning system with EAV.

But that's like saying one can build a house of bricks without using mortar. You can make it stand if you're careful enough, but you better not lean against it! ;-)

dbscience said...

I generally don't like this pattern either as it doesn't scale, but what rule of normalization does it violate?

Assuming the entity and attribute are immutable (and if they aren't, they can be by using entity and attribute ids), this doesn't violate the core rule of normalization that nothing is duplicated. There are no insertion, update, and deletion anomalies with this data model.

For example:

CREATE TABLE EntityAttributeValue (
entityId Int NOT NULL,
attributeId Int NOT NULL,
value VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

Which implies associated entity and attribute tables.

Plus, just because a column isn't additive doesn't mean a table isn't normalized. Think of an international financial system where the amounts are stated in different currencies and the amounts can't just be totaled.

Again, to be clear, I agree this is a bad general model, but it is normalized to the key, the whole key, and nothing but the key form.

Bill Karwin said...

@dbscience: The brief definition of 1NF is that the table faithfully represents a relation, and has no repeating groups.

What does it mean to faithfully represent a relation? One of the criteria is that each column & row intersection has exactly one value in the appropriate type.

What is the appropriate type? Well, in an EAV table the type varies per row, depending on what's in `attribute_id`. Though we may use a large varchar as a sort of universal type, the domain of permitted values is different for each attribute.

You can't do that with an attribute in a proper relation. The attribute must have one type (or domain) which applies on all rows. So the EAV table isn't a proper relation, therefore EAV fails to meet the criteria of 1NF.

The meaning of any given value in an EAV table also varies depending on what attribute the row represents. "1968" means something different as population, altitude, or year established. The attribute id is *part* of the key of the EAV table, not the whole key. Therefore the EAV table also violates 2NF.

Finally, though the EAV is implemented as a table, it also serves as a meta-model for another table.

A relation has a header, that names a set of attributes, followed by rows, where there must be a value for each of the attributes in the header.

If each entity can supplement its attributes at will, or omit some of its attributes (by not storing a row in the EAV table for a given attribute), this fails to be a relation.

So the actual EAV table is not a relation, because the domain and meaning of the `value` are variable, and the implicit table that EAV is modeling is not a relation because any entity is allowed to have more or fewer attributes than those named in the header.

dbscience said...

By this logic this table isn't normalized.

create table Order
(orderId int,
currencyId int,
amount number)

Because the amount column meaning depends on the value of the currencyId column. For example, 100 could mean 100 dollars or 100 yen depending on the currencyId value.

Bill Karwin said...
This post has been removed by the author.
Bill Karwin said...

@dbscience: Yes, that seems to be correct. Unless `currencyId` is part of a compound primary key with `orderId`, the table you describe isn't in 3NF (it violates "nothing but the key").

Likewise, an Order that has columns
UnitPrice, Quantity, TotalPrice is also not in normal form. The TotalPrice depends on the other two columns, which are not part of the table's key.

dbscience said...

My point is the order table doesn't have any insert, delete or update anomalies, the ultimate goal of normalization, and is therefore normalized. The amount column depends on the orderId key for its value if not its meaning, which is how I interprect normalization.

Like everything complicated, there are multiple ways to interprect something and I suspect we aren't going to convince each other of the other's viewpoint.

But I do agree that the eav pattern is a bad general concept.

Willo van der Merwe said...

@Bill
I disagree. I believe that EAVs do have a place and can accelerate development and implementation enormously. It's also brilliant in prototyping.

But it needs to be implemented carefully and not used a generic stop-gap.

EAVs need a specific structure and I tend to use systems as dbscience describes: with a separate attribute table.

I like to think of EAVs in the same way as pointers in C. You have to be careful, but the benefits can be enormous.

From my experience EAVs are completely different from from conventional database modeling and one should not attempt to apply traditional relational database theories to it. It's closer to Object Orientated database systems.

Bill Karwin said...

@Willo: Certainly EAV and other database designs have their place. I'm not saying the relational paradigm is the only one we should use. There are problems it doesn't solve easily.

SQL is a language intended to work on relational databases. If you need to manage data in a non-relational way, that's fine. But trying to manage a non-relational data structure with technology designed for relational data is bound to be awkward. For example, you can't model constraints like NOT NULL in EAV.

Hence the need to supplement SQL with code in the application layer, to make sure the data doesn't fall apart. EAV frameworks like yours can provide the mortar for this brick wall.

I think EAV is also like the Semantic Web paradigm, where any entity can have variable attributes, and you can query what attributes an entity has. Thus data and metadata are interchangeable from the DBMS perspective. A a new query language, SPARQL, was invented for this.

Willo van der Merwe said...

@Bill
Yes, certainly, but due to the lack of tools that will efficiently retrieve and filter large amounts of data, and given the ease of installation, and the shear proliferation of SQL server installations out there, using a traditional relational database engine as a storage mechanism makes sense.

So that is what it becomes, just a data store. If XML parsers were fast enough it would probably make more sense to store your EAV data in XML.

Even though RDF is very close to EAV it is not the same. There definitely is a relationship there, but they solve different problems. I had a long hard look at the RDF specification and one of the major differences between EAV and RDF is that, typically, in EAV you have a clear separation between schema and data, while in RDF schema is part of the data.

I have been looking at SPARQL and how it can be used for EAVs.

pompomru said...

If EAV is antiptter then how we can solve a problem when entities should be configured from admin panel?

Bill Karwin said...

@pompomru: Adding attributes should be done by adding columns to the table.