Keeping it Simple: EAV FAIL

Thursday, May 21, 2009

EAV FAIL

Photo by Mike Gogulski, used in accordance with the Creative Commons BY-SA 3.0 license

The photo above illustrates (by counter-example) an important characteristic of a normalized database: each logical "type" of attribute belongs in a separate column.

Just because three values happen to be numeric doesn't mean it makes sense to SUM() them together. But if dissimilar attributes are stored in the same column, it's tempting to treat them as compatible in this way.

This also shows a fallacy of the Entity-Attribute-Value antipattern. In this design, all attribute values are stored in a single column.

CREATE TABLE EntityAttributeValue (
entity        VARCHAR(20) NOT NULL,
attribute     VARCHAR(20) NOT NULL,
value         VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

INSERT INTO EntityAttributeValue (entity, attribute, value)
VALUES
('New Cuyama', 'Population',          '562'),
('New Cuyama', 'Ft. above sea level', '2150'),
('New Cuyama', 'Established',         '1951'),

SELECT SUM(value) FROM EntityAttributeValue
WHERE entity = 'New Cuyama';

The Entity-Attribute-Value design does not support or conform to rules of database normalization.

To be clear, the proper way to design a database is to put different attributes in different columns. Use column names, not strings, to identify the attributes.

CREATE TABLE Cities (
 city_id          SERIAL PRIMARY KEY,
 city_name        VARCHAR(100) NOT NULL,
 population       INT UNSIGNED NOT NULL,
 feet_altitude    SMALLINT UNSIGNED NOT NULL,
 year_established SMALLINT UNSIGNED NOT NULL
);

118 comments:

Wenbert Del Rosario said...: what would be the correct way to structure your database using your example?

At first glance, I would do something like this:

Table: Entity
Fields: id, entity

Table: Attributes
Fields: id, attr

Table: EntityAttributes
Fields: id, entity_id, attr_id, value

But then again, I was in a hurry or being lazy in a not-so-important project, I'd follow the Entity-Attribute-Value design. Because querying that kind of database is easier. Less table joins, etc.; 5/21/2009 6:34 PM
rpbouman said...: @Wenbert Del Rosario

" Because querying that kind of database is easier. Less table joins, etc"

You gotto be kidding...seriously, the main characteristic of EAV schemas is that they have a "store everything, query nothing" property. You can store what ever you like, because it's so generic. Because it's so generic, you can't query it, because nothing really ever means something.

And talking about joins...what do you think the query would look like to get one row for Cuyama showing Population, ft above sea level and Established?

I have used EAV too, but only in an isolated case where an extensible storage seemed to best fit the bill. We have not, and will not, solve the problem of querying that database. If we eed reports, we transform graphs of interest to regular table structures and query those); 5/21/2009 7:29 PM
Bill Karwin said...: @Wenbert: In your example, you *still* have all three values occupying the same column (EntityAttributes.value).

The point is that in a well-designed table, you shouldn't put different types of things into the same column.

The correct table would be:

Cities (city_id, city_name, population, feet_altitude, year_established)

Different attributes in different columns. Use column names, not strings, to identify the attributes.; 5/21/2009 7:33 PM
Wenbert Del Rosario said...: @Bill Karwin Thanks. That cleared up things for me. :-); 5/21/2009 7:40 PM
Willo van der Merwe said...: Hi Bill,

As you know I'm pro EAV, but this article really had me in stitches... :D

Willo; 5/22/2009 4:36 AM
Bill Karwin said...: @Willo: Yeah, I realize it's possible to make a functioning system with EAV.

But that's like saying one can build a house of bricks without using mortar. You can make it stand if you're careful enough, but you better not lean against it! ;-); 5/22/2009 11:36 AM
dbscience said...: I generally don't like this pattern either as it doesn't scale, but what rule of normalization does it violate?

Assuming the entity and attribute are immutable (and if they aren't, they can be by using entity and attribute ids), this doesn't violate the core rule of normalization that nothing is duplicated. There are no insertion, update, and deletion anomalies with this data model.

For example:

CREATE TABLE EntityAttributeValue (
entityId Int NOT NULL,
attributeId Int NOT NULL,
value VARCHAR(1000) NOT NULL,
PRIMARY KEY (entity, attribute)
);

Which implies associated entity and attribute tables.

Plus, just because a column isn't additive doesn't mean a table isn't normalized. Think of an international financial system where the amounts are stated in different currencies and the amounts can't just be totaled.

Again, to be clear, I agree this is a bad general model, but it is normalized to the key, the whole key, and nothing but the key form.; 5/22/2009 1:14 PM
Bill Karwin said...: @dbscience: The brief definition of 1NF is that the table faithfully represents a relation, and has no repeating groups.

What does it mean to faithfully represent a relation? One of the criteria is that each column & row intersection has exactly one value in the appropriate type.

What is the appropriate type? Well, in an EAV table the type varies per row, depending on what's in `attribute_id`. Though we may use a large varchar as a sort of universal type, the domain of permitted values is different for each attribute.

You can't do that with an attribute in a proper relation. The attribute must have one type (or domain) which applies on all rows. So the EAV table isn't a proper relation, therefore EAV fails to meet the criteria of 1NF.

The meaning of any given value in an EAV table also varies depending on what attribute the row represents. "1968" means something different as population, altitude, or year established. The attribute id is *part* of the key of the EAV table, not the whole key. Therefore the EAV table also violates 2NF.

Finally, though the EAV is implemented as a table, it also serves as a meta-model for another table.

A relation has a header, that names a set of attributes, followed by rows, where there must be a value for each of the attributes in the header.

If each entity can supplement its attributes at will, or omit some of its attributes (by not storing a row in the EAV table for a given attribute), this fails to be a relation.

So the actual EAV table is not a relation, because the domain and meaning of the `value` are variable, and the implicit table that EAV is modeling is not a relation because any entity is allowed to have more or fewer attributes than those named in the header.; 5/22/2009 3:33 PM
dbscience said...: By this logic this table isn't normalized.

create table Order
(orderId int,
currencyId int,
amount number)

Because the amount column meaning depends on the value of the currencyId column. For example, 100 could mean 100 dollars or 100 yen depending on the currencyId value.; 5/22/2009 4:31 PM
Bill Karwin said...: @dbscience: Yes, that seems to be correct. Unless `currencyId` is part of a compound primary key with `orderId`, the table you describe isn't in 3NF (it violates "nothing but the key").

Likewise, an Order that has columns
UnitPrice, Quantity, TotalPrice is also not in normal form. The TotalPrice depends on the other two columns, which are not part of the table's key.; 5/22/2009 5:34 PM
dbscience said...: My point is the order table doesn't have any insert, delete or update anomalies, the ultimate goal of normalization, and is therefore normalized. The amount column depends on the orderId key for its value if not its meaning, which is how I interprect normalization.

Like everything complicated, there are multiple ways to interprect something and I suspect we aren't going to convince each other of the other's viewpoint.

But I do agree that the eav pattern is a bad general concept.; 5/23/2009 7:26 AM
Willo van der Merwe said...: @Bill
I disagree. I believe that EAVs do have a place and can accelerate development and implementation enormously. It's also brilliant in prototyping.

But it needs to be implemented carefully and not used a generic stop-gap.

EAVs need a specific structure and I tend to use systems as dbscience describes: with a separate attribute table.

I like to think of EAVs in the same way as pointers in C. You have to be careful, but the benefits can be enormous.

From my experience EAVs are completely different from from conventional database modeling and one should not attempt to apply traditional relational database theories to it. It's closer to Object Orientated database systems.; 5/25/2009 12:02 AM
Bill Karwin said...: @Willo: Certainly EAV and other database designs have their place. I'm not saying the relational paradigm is the only one we should use. There are problems it doesn't solve easily.

SQL is a language intended to work on relational databases. If you need to manage data in a non-relational way, that's fine. But trying to manage a non-relational data structure with technology designed for relational data is bound to be awkward. For example, you can't model constraints like NOT NULL in EAV.

Hence the need to supplement SQL with code in the application layer, to make sure the data doesn't fall apart. EAV frameworks like yours can provide the mortar for this brick wall.

I think EAV is also like the Semantic Web paradigm, where any entity can have variable attributes, and you can query what attributes an entity has. Thus data and metadata are interchangeable from the DBMS perspective. A a new query language, SPARQL, was invented for this.; 5/25/2009 11:44 AM
Willo van der Merwe said...: @Bill
Yes, certainly, but due to the lack of tools that will efficiently retrieve and filter large amounts of data, and given the ease of installation, and the shear proliferation of SQL server installations out there, using a traditional relational database engine as a storage mechanism makes sense.

So that is what it becomes, just a data store. If XML parsers were fast enough it would probably make more sense to store your EAV data in XML.

Even though RDF is very close to EAV it is not the same. There definitely is a relationship there, but they solve different problems. I had a long hard look at the RDF specification and one of the major differences between EAV and RDF is that, typically, in EAV you have a clear separation between schema and data, while in RDF schema is part of the data.

I have been looking at SPARQL and how it can be used for EAVs.; 5/26/2009 12:59 AM
game-street.ru said...: If EAV is antiptter then how we can solve a problem when entities should be configured from admin panel?; 9/02/2009 1:45 PM
Bill Karwin said...: @pompomru: Adding attributes should be done by adding columns to the table.; 9/27/2009 1:51 PM
Chris said...: Your example is an over-generalization. The purpose of the EAV pattern is to efficiently store sparse data in a RDMS. If you don't have sparse data, then EAV is unsuitable. However, if your data contains millions of features (i.e. "columns") but each sample (i.e. "row") only has a dozen non-null values, then its extremely poor database design NOT to use EAV.; 4/08/2010 7:18 AM
Bill Karwin said...: @Chris: Point taken, using EAV as sparse data storage is not too different from a simple many-to-many table. If a given attribute is applicable to the entity, store a row. If not applicable, store no row.

Then you can add supportive attributes to the many-to-many relationship. That's where we get the `value` column.

But many people use EAV for something other than sparse data. They use it to represent a non-sparse table when they haven't decided on the attribute columns, or when they want to make it user-extensible.

The difference that gets people into trouble is when you try to treat rows with distinct `attribute` values as if they were separate columns in a conventional table.; 4/08/2010 10:47 AM
Prakash M. Nadkarni said...: Your points are semi-valid - but to call EAV an "anti-pattern" is an extreme viewpoint. The fact is that the EAV approach is to be used, like any advanced modeling technique, only in EXCEPTIONAL circumstances where the attributes are sparse (for a given entity) and very numerous (across all entities). Obviously, it's such a pain to work with that you don't employ it when the components of your data model are mostly stable and well-known.

Further, an EAV design is unworkable without a supporting metadata infrastructure to ensure validity of the EAV tables' contents. The metadata tables have to employ 3NF plus features like declarative referential integrity and constraints maximally. If not, the 2nd law of thermodynamics takes over.; 4/20/2010 3:56 PM
Prakash M. Nadkarni said...: Follow-up: Oracle Corporation sells a product called Oracle Clinical for clinical trials data management and has recently spent $700M to acquire Phase Forward, the market leader in Clinical Trials software. Both products implement an EAV subschema for their clinical data, though the majority of the tables are in 3NF.

Is it just possible that the vendor of an RDBMS and a (fairly large) niche market leader know something about the judicious use of EAV that you haven't considered?; 4/21/2010 6:01 AM
Bill Karwin said...: @Prakash: The EAV model dates back to medical records systems in the early 1970's, when relational databases were still just research and no implementations existed.

So it's not surprising that the Phase Forward product uses a similar design. It might even have a direct lineage to one of those early medical systems.

As for Oracle's acquisition of Phase Forward, I'm sure they chose that company for its business potential, not for its use of relationally correct design.

The fact is that EAV is not a relational design, but as you point out, there are exceptional circumstances that make EAV the only option, given other requirements for a project.; 4/21/2010 12:40 PM
vlc said...: At a logical level are the system tables of a DBMS not EAV-based? When we create new tables are their definitions not stored using some variation of that model? Physically there may tend to be new files for new tables but not necessarily and only for the sake of performance. DBMSes now also offer various partitioning options to remove the problems of many rows in a table. If you focus on the fact that it is a metadata schema, I can't see why you think EAV model is not normalised. The fact that a typical query on the application-data is more complicated is not a big disadvantage and does not signify that it isn't normalised. The values stored in the values table, from a metadata point of view are all of type 'mixed'. Does that violate any rule of normalisation?; 5/27/2010 8:33 AM
Bill Karwin said...: @Luc: Thanks for your comment. In the relational model, there is no such thing as a "mixed" data type. Every column must have a single data type. You must be able to compare every value in that column to another value in that column for equality or inequality. In other words, every value in a given column comes from a single domain.

When you use EAV, values in the "value" column come from all sorts of different domains. You can't compare a value '1234' to another value '1234' and say whether they represent the same element from the same domain. You can't say whether those two values are equal even though they look identical.

The meaning of domains and the fact that each column in a table has exactly one domain is fundamental to being relational. EAV breaks this rule.

Can you give an example of where system tables do something similar? That is, where two values in a given column can look identical yet be logically different and incomparable?; 5/27/2010 10:31 AM
vlc said...: @BillK: I was perhaps using the term 'mixed' too loosely; from the viewpoint of the metadata it is just one logical type...'Value'. I don't have a better general word for it. In RDF terms it is 'object' (the other two being 'subject' and 'predicate'. Also, we should really distinguish between the normalised entities (logical data-stores) and the eventual physical tables, files, XML documents, binary-files, etc. which actually get used to implement the normalised model. Using a non-specialised DBMS to implement an EAV schema is likely to be quite in-efficient because the DBMS is itself holding metatdata about the the metadata which is the EAV; so the EAV is adding another layer of abstraction above the application-data-schema (which can also, of course be an inplementation of a fully normalised ER model). By the way (I can almost see you shaking your head already :-) ), I don't agree if you believe that an EAV ER model can't be normalised. EAV-based or schema-less databases which store these 'mixed' values by explicit logical design and may be more efficient, include Amazon SimpleDB, Google AppEngine datastore, Microsoft SQL Data Services and in-principle at least, RDF datastores in general. All this is just my opinion of course. It's not like I've spent time ever researching it, but I have used systems which run on this type of schema and they do a very good job. The schema is beautiful. The only 'weakness' is in speed of retrieving data from complex queries. But there are ways to mitigate against that problem. The huge benefits to application development are too good to ignore for most business solutions.; 5/29/2010 11:52 AM
Bill Karwin said...: @Luc: Thanks for your elaboration. I worked for a company that developed Seamark Navigator, a faceted search engine, sort of along the same lines as Jena. So I'm familiar with RDF.

You're right that EAV has two aspects, one the literal SQL table where attributes are stored one per row, and the abstract table it's meant to model, which has variable attributes per row. This layer of abstraction is called the Inner-Platform Effect (http://en.wikipedia.org/wiki/Inner-platform_effect).

It's true that data with variable attributes is useful and necessary for some applications. But my point is that data in this form is not relational.

So it's not surprising that representing non-relational data with a relational database must employ a level of abstraction, resulting in the Inner-Platform Effect.; 5/29/2010 1:14 PM
vlc said...: @BillK: hmm, did you write that wiki article? Almost like running an emulator? I'm still not convinced EAV is not relational or 3-NF. e.g. its data is always mutually accurate.; 5/29/2010 4:18 PM
Bill Karwin said...: @Luc: I recommend a book "SQL and Relational Theory" by C. J. Date, who is the world's leading authority on the relational model. Read it and I think you will be more likely to agree that types are fundamental to relations.; 5/29/2010 9:08 PM
vlc said...: lol... I'm quite sure I owned a copy of that book while at Uni. Tuples etc. etc. OK... I'll re-scan it, just for you :-); 5/30/2010 1:52 AM
vlc said...: @BillK: by the way, if the 'value' is stored as an id (foreign key) pointing to the actual value (stored in another table dedicated to that datatype), would that not satisfy your objection while still being an EAV model?; 5/30/2010 2:41 AM
Bill Karwin said...: @Luc: The book I mentioned is from 2009. But no doubt it contains similar content to Date's earlier books.

Re the "id" column, no, it makes no difference. You're still trying to store values of a different type in the same column in the EAV table. An extra layer of indirection doesn't help.

By "different type" I'm not talking about SQL data types. I'm talking about the logical domain of those values. You can use one domain per column, full stop.

Also, to do what you describe, that id would have to reference different tables depending on the attribute type. This is called polymorphic associations and it's also a non-relational design.

The fact that you can't declare a foreign key constraint to enforce polymorphic associations should be a clue that it doesn't fit the relational model.; 5/30/2010 10:04 AM
vlc said...: @BillK: Yes, I'm remembering his and Codds much older books... Date was easier to read I think. OK, I think Codd made a distinction between Domain and Type but Date views them as synonymous, however sticking with the term domain rather than data-type... that domain, for the EAV system is a special or user-defined one anyway (the 'attribute-value' domain). It says that here is a Value, how to use that value (outside of the EAV system) is explained by what attribute of what entity it belongs to. And those value can be very complex according to the context in which they are used (atomicity is dependant upon usage, yes?). In EAV relational algebra works fine for manipulating the system. I doubt that would be the case if it were not a relational (and normalised) system.; 5/31/2010 12:28 AM
Bill Karwin said...: @Luc: Consider that you can't enforce UNIQUE or FOREIGN KEY constraints on the values of a given attribute. You can't because all the entity's values are stored in the same column. You can't place any constraints for one attribute because they would affect all attributes.

This is the consequences of trying to store multiple domains in a single column.

You also can't enforce NOT NULL, because you can't make a constraint that a row must exist for each given entity-attribute pair.

Joe Celko explains it also:
http://www.simple-talk.com/sql/t-sql-programming/avoiding-the-eav-of-destruction/; 5/31/2010 1:36 AM
CmdrAndreyasn said...: I'll sum EAV up in a simple statement. Magento Performance. Gets better in a negative direction with every major update as the number of tables increases. Death by a thousand joins.; 9/07/2010 9:52 PM
Unknown said...: Yes. This is why Magento is struggling with scalability. Just ask their dev team how many hours they spent on inventing their own 'full page caching' ...

They are trying hard to move away from the EAV model. In fact I heard that the Enterprise version is already moving away from the EAV model for the order history tables..; 11/05/2010 9:57 AM
Bill Karwin said...: @Ron: Yeah, I thought Magento was a non-starter as soon as I took a look at the code and realized they were another case like "Vision" in the Bad CaRMa article.

If they need that degree of extensibility, they should switch to a document-oriented database.; 11/05/2010 11:30 AM
Willo van der Merwe said...: Again, sorry guys.

Yes, I agree. Implementing EAV in a conventional SQL database is a challenge and should not be taken lightly. There are much better tools out there that'll fit EAV like a glove.

It doesn't mean that EAV is an anti-pattern. Hell, the Universal Data Model is effectively an EAV.

All it means is: choose your supporting technology carefully.; 11/05/2010 1:55 PM
Unknown said...: I've been following the thread for a bit. Really interesting stuff considering that this question has irritated me for some time.

I don't want to take the conversation off track, but what solutions might one pose to the dilemma of disjoint subtypes. It seems like EAV fits here, but is bound to degrade over time and impose scalability constraints (which is why I view it as an antipattern).

I'm wondering what possible solutions exist in terms of RDMS if there are any at all and how can object/graph databases address the problem.; 11/16/2010 7:00 PM
Tom H Anderson said...: In regards to the fallacy of EAV: In a properly designed EAV database the decision comes to reference attributes to attribute sets through a many to many or one to many relationship. I believe 98% of developers choose one to many completely validating your point. However, in a one to many relationship columns are not treated as the same data since all attributes must be unique to each attribute set.

You also touch on values. Magento made a wise choice to use five values tables for each entity. I've found their example overly complicated and well founded.; 1/19/2011 12:10 AM
Bruce said...: Every discussion I see on the evils of EAV uses a brain-dead example (population, elevation of a city), and assumes we are going to try to use some kind of two-table solution for the entire domain.

NOT ONE TIME have I ever seen someone who throws rocks at this solution recommend how to solve the underlying problem. Which is, in a nutshell, using a concrete example:

How would you implement a product configurator? To use my own real-world example, what would you suggest the FeatureDescription table looks like when it's describing the features of a 4D Ultrasound system, but also the features of an aluminum fishing boat?

I can tell you from experience; I can log into the exact same web application (with a different subdomain) and configure both of these things - with all the correct rules guiding each one. Was this application easy to write? No. And in fact the database schema consists of almost 100 tables that exist to support customer records, sales, work orders, etc. But none of it would have been possible without EAVs for the product configuration data. Period.

Until I see some response that goes beyond the ridiculous suggestions from the Usual Suspect that I have 1000 different DB schemas for every possible product/feature domain, I'm simply ignoring those that whack on EAV and never offer an alternative.; 2/15/2011 2:36 PM
Bill Karwin said...: Hi @Bruce, I feel your frustration. I have asked relational database gurus how they would solve problems like that which you describe, and I seldom get a good answer -- or any response at all.

Suppose you have to manage variable attributes per subtype, and you have a fixed number of subtypes, you can use Class Table Inheritance (http://martinfowler.com/eaaCatalog/classTableInheritance.html) or Concrete Table Inheritance (http://martinfowler.com/eaaCatalog/concreteTableInheritance.html)

But sometimes that's not practical, because you have too many subtypes or subtypes that must be created at runtime without modifying the schema. Or you really have no subtypes at all, you allow every entity to have distinct attributes.

The best compromise I've seen to solve that problem in a relational database is to encode the variable attributes in a BLOB, using your choice of XML, JSON, etc. Then for each attribute, create an inverted index table to point back to rows where a given value in that attribute occurs.

This solution was written about in 2009 in "How FriendFeed uses MySQL to store Schema-less Data" (http://bret.appspot.com/entry/how-friendfeed-uses-mysql).

The other answer is: go ahead and store schema-less database, just don't do it in a relational database. Relational and schemaless are completely in opposition to each other.

So you can use Solr, CouchDB, MongoDB, Hadoop, whatever non-relational solution you find meets your needs for non-relational data. They have specialized ways of storing that data, and specialized ways of querying it.

But SQL is for relational data. Using EAV in a relational database and trying to query it with SQL is not the answer.; 2/15/2011 3:22 PM
Bruce said...: I'm not sure what is gained by using two different storage solutions. If the domain objects maintain the typing going in and out of the database (e.g., "Fabric Color" is a text attribute with choices of "Red, Green, Yellow"; "Additional Trolling Motor" is a boolean attribute) - why should this data be "query-friendly", per se?

Sure, maybe I won't be (easily) able to answer the question "how many Red Model XYZs did we sell last year that had the optional Hammeranger" - but I wouldn't be able to in any case with BLOB data, so it's a moot point.

At some point, I'm going to have to pass in known parameters - "get me all the used features between date X and date Y" - and then use a custom query engine that works on in-memory domain objects. How I populate them is immaterial.

Smart EAVs don't have only three columns - they have lots of columns that are universally applicable. Using my example, perhaps a "Phase-In date" and "Phase-Out date". It would be silly to store these kinds of things in EAV columns. This allows reasonable limits on the query and then the in-memory query engine can take over. Implemented correctly, the user never even needs to see the difference between the baked-in attributes and the dynamic ones.

Ad-hoc queries become less practical, of course - but that's a trade-off that I simply have to live with in this problem domain.; 2/15/2011 3:44 PM
Bill Karwin said...: @Bruce: Did you read the article about FriendFeed I linked to?; 2/15/2011 5:51 PM
Bruce said...: I did read it, thanks. A couple of times actually. If I understand it correctly, their BLOB data is a pickled Python dictionary that has some sort of key values in it (along with other arbitrary data).

So, the primary ID of this blob data is extracted and placed as the primary key in the ENTITY table, then all other key values are stored into separate index table(s) that reference back to the entity table.

For example, if we had a "part_id" in the BLOB, we would create an index table with "part_id, entity_id" as the composite primary key, and this would point back to the correct row in the ENTITY table.

I made sense of it, but I don't see how it solves the query problem. Maybe I'm missing something. Using their example, how could we write a query like "select all ENTITY where TITLE contains 'we just launched'"?

It seems they solved a performance problem, but I don't see how it's any better than an EAV for querying. Maybe I'm missing something and you can comment. Thanks.; 2/16/2011 9:09 AM
Bill Karwin said...: @Bruce: The difference from a relational perspective is that columns are stored in columns, not rows.

> how could we write a query like "select all ENTITY where TITLE contains 'we just launched'"

SELECT e.entity_id, e.attribute_xml
FROM Entities e
JOIN Titles t ON e.entity_id = t.entity_id
WHERE t.title CONTAINS 'we just launched'

Then you fetch the blob of XML and parse it in your application, to discover all the other attributes of that entity.

But the idea of the Titles table is that if it's in sync with your XML blobs, it points correctly to rows where the value you're looking for occurs in the blob.

So you can have a proper table whose metadata names the "title" attribute and uses the correct data type. There's no way you're going to store in that same table the postal code or the birthdate or the password of the same entity. Each attribute goes in its own separate table.

It's better than EAV because in EAV you need to do a join per attribute you want to fetch, and if you add an attribute, you need to re-code all your queries.

Whereas in inverted index, you do a join per attribute that you need to filter on, so if you add an attribute to the XML, you fetch it automatically with the rest of the blob, with no need for code changes.

(ps: CONTAINS is not a standard SQL operator, it exists in Microsoft SQL Server iirc.); 2/18/2011 8:29 AM
Bruce said...: Bill:

Yeah the CONTAINS thing is probably because I'm used to using my in-memory search parser language instead of SQL. =c)

I checked out Anchor Modeling (basically a separate table for each attribute, no updates, no nulls, and use views for querying). It seemed a little extreme to me for an OLTP scenario.

The reverse index makes sense for RI where you reference another entity. The parent entity can safely do a cascade delete, and the referenced entity can have a where-used that is fairly easy to implement, although it can't allow a cascade delete. I tried it for image references and it worked well.

It still doesn't solve the problem that EAVs do, however, which is dynamic data without schema changes.; 2/23/2011 8:19 AM
Bill Karwin said...: @Bruce: "It still doesn't solve the problem that EAVs do, however, which is dynamic data without schema changes."

Yes it does. You're free to put any dynamic data you want to into the XML blob, including new attributes at will. No schema changes needed.

But this means you have to run a table-scan and row-by-row XML parsing if you want to search for a given value in one of those new attributes.

If you want to make it efficient to search for a value in a particular attribute, then you have to create a new attribute table.

This is no different from document-oriented NoSQL databases e.g. CouchDB and MongoDB. They allow you to do ad hoc map/reduce queries, but it searches the hard way, over the whole data collection. If you want to run that search repeatedly, it pays to create an index, which is a DDL operation just like it is in SQL.

Or you could put the data into something like Solr, which allows documents to have distinct sets of attributes, and it implicitly indexes every attribute.; 2/23/2011 11:54 AM
A inflorit muscata-n glastra, dragostea e everywhere said...: Hello gentlement - may I first state that I have found your dialog most instructive. I would have been the most dedicated adversary of EAV models (except for obvious practical uses) - that is unless I had met with a dilemma of my own involving it. I need to be able to add dynamically any number of "fields" to a document, each field having a different data type. Since the data types are limited (say 5), an obvious solution would be to just add 5 nullable columns. But it's still wasteful (even though it is a recommended way to model class inheritance). Can you think of any other way?; 12/12/2011 10:31 PM
Bill Karwin said...: @Deroude, InnoDB doesn't store column values when they are NULL, so I don't think adding five new columns is so wasteful.

But it's very likely that you'll need to add more than five.

You should read "How FriendFeed uses MySQL to store schema-less data" (http://bret.appspot.com/entry/how-friendfeed-uses-mysql). This is a compromise design that allows you to add dynamically any number of "document fields" to a row, without changing the schema, and without breaking your relational model as badly as EAV does.; 12/18/2011 11:48 AM
Unknown said...: Wow, this blog post is epic fail in itself.; 7/06/2012 12:16 PM
Bill Karwin said...: James Kessler, negativity is not constructive. If you are only going to take pot shots, please keep them to yourself. If you're willing to elaborate and describe why you say that, please do.; 7/06/2012 1:52 PM
sharifi14 said...: Hi Bill

Interesting discussion. Your advice and comments on EAV patterns around the web are always useful reading.

My problem seems to be an issue that others have too (I've read some of your comments on Stack Overflow). Here's my situation:

I want to create a database of around 1000 products. However products will have different attributes depending on their category, i.e. Computers may have CPU speed, hard drive storage, SKU; whereas Books may have author, publisher, pages, ISBN.

I can see three possible methods:

1/ Use EAV pattern, i.e. have one Attribute table and one Value table, something similar to this: http://i43.tinypic.com/2j2defp.png. This method is quick to set up and quite easy to maintain too, but as you mentioned, 'The point is that in a well-designed table, you shouldn't put different types of things into the same column.'

2/ Use a single table called Products and have lots of columns for all possible attributes. However there would be many NULL values in this table, as a Book isn't going to have a value for CPU speed for example!

3/ Create a table for each Category, i.e category_computers, category_books. These Category tables contain all the relevant attributes, so we won't get lots of NULL values. The problem with this is that it may result in duplicate columns across multiple tables, and also we will end up with loads of different tables!

Which of these three methods would you recommend? If none, what would you use as an alternative?; 7/08/2012 10:33 AM
Bill Karwin said...: Hi sharifi14, thanks for you question.

Your option 2 is called Single Table Inheritance. Its downside is that if you have a lot of NULL columns, and if you ever add a new category you have to add more columns.

Your option 3 is called Concrete Table Inheritance. You end up with N tables for N categories, and you have the problem of duplicating columns that are common to all categories.

Another option is Class Table Inheritance
http://martinfowler.com/eaaCatalog/classTableInheritance.html
You'd get N+1 tables for N categories; the extra table has columns that are common to all categories.

Finally, Serialized LOB
http://martinfowler.com/eaaCatalog/serializedLOB.html
Where you have one table, and columns that are common get conventional columns. Other columns that are custom or category-specific are serialized together as XML or JSON or something and stored in a BLOB. This makes it inconvenient to query or update individual custom fields without fetching the whole blob, but it only requires one table.

You can index the Serialized LOB pattern with a technique described here:
http://backchannel.org/blog/friendfeed-schemaless-mysql; 7/09/2012 3:55 PM
Jon said...: @Bill Informative post. Like some posters before me, I happen to have a problem with growing schema. What do you think about the following approach? Products -> Products_PC, Products -> Products_Apparel, Products -> Products_Vehicle.

Basically, Products_PC will consist of all columns relevant to PC and will inherit other attributes from Products (ie price, cost, etc). As client requires more products, I shall get them to engage our team again to further customize the schema. Will this be a good approach?; 7/11/2012 1:55 AM
Bill Karwin said...: @Jon, Yes, you're basically describing the pattern that Martin Fowler calls Class Table Inheritance.

It can be an effective alternative to EAV, as long as you can take the time to create a new table every time you add a new product type.

But if you absolutely must support creation of new subtypes without schema changes (even the creation of a new empty table), it may not be flexible enough.

I recommend reading the book Patterns of Enterprise Application Architecture.
http://martinfowler.com/books/eaa.html; 7/11/2012 9:44 AM
Jon said...: @Bill
Thanks. I'm not much of a database guru, so bear with my ignorance, if any. I'll go through your link. Much thanks.; 7/12/2012 12:27 AM
DragonMan said...: Your missing the point of EAV. While denormalization is not in most cases proper for relational databases, it is necessary for datawarehousing. EAV is useful for the organization of heterogeneous data in which you might not know what the attributes will be. Contrary to the Cities example table you provided.; 9/07/2012 6:44 AM
Bill Karwin said...: Hi DragonMan, I see your point, but typically in data warehousing you *do* know what attributes will be.

A significant part of Ralph Kimball's books on data warehousing make the point that you *must* know what facts you need to store before you can design a DW schema for them.

The requirement to accommodate heterogeneous data is a challenge for any project. There are a number of enterprise design patterns designed to handle them:
- Single Table Inheritance
- Concrete Table Inheritance
- Class Table Inheritance
- Serialized LOB, optionally supported by inverted indexes.

Lastly, there are non-relational data stores that can handle data with variable attributes.; 9/07/2012 7:42 AM
Unknown said...: Excellent post and excellent explanations Bill. The EAV problem couldn't be explained better. Appart from other problems, as someone in the comments pointed, that's why, for example, Magento is so unbelievable slow. 400+ tables for a simple shopping cart. The database is horrible complex and the queries... man, you should see the queries, they are hilarious.

Keep up the good work Bill. You have some great articles! Cheers; 9/07/2012 4:40 PM
Matt C said...: Still one of the best EAV threads ever. I'm trying to get a feel for all the Magento forks that are out there in the world, and posted some EAV thoughts here: http://mwic.org/wp/?p=51

I'd be honored to hear anyone's comments on this --especially, of course, Bill's; 10/16/2012 8:04 AM
Bill Karwin said...: Matt C, Thanks for the kind words! I've posted a reply comment on your blog.; 10/16/2012 9:55 AM
HEALIS said...: Hi, I really enjoyed reading the pros and cons on EAV, especially your strong truthfull arguments Bill. What I would like to add here for your readers is the fact that Entity Attribute and Value are simply abstractions that we made in order to cope with complexity of real problems, the same is true for RDF and even ER and EER I dear to say. It is a simplistic model. In practice you discover that attributes are in fact entities and values can also be names for other entities. In that scenario relations may exist among entities and attributes and values. I am looking into ways of modeling that complex scenario with other alternative models as a better candidate for WEB 3.0 than RDF (see Neurorganon NULO); 12/30/2012 2:14 PM
Bill Karwin said...: Hi HEALIS, thanks for your comment.

I agree that there are complex real-world problems that are hard to map into the relational model. But my view is that using a relational database for such data is a misapplication of that model.

In a relational model, any attribute can have a lookup table that enumerates the set of values. The lookup table may have more columns that serve to annotate the discrete values within that attribute.

I'm not a fan of Semantic Web; I think it's an impractical abstraction. There's no way to optimize a dynamic data model in the way we do with a relatively static relational schema. The only other option to avoid scanning your full data store on every query is a navigational / graph database model.

A couple of products in this vein I've heard of (but I haven't used them) are Neo4j and OrientDB.; 12/31/2012 7:58 AM
Mike Parkin said...: Hi Guys,

What a great selection of comments and answers, I've been thinking about this issue a lot over the last few years and all of the things I have thought about are raised here.

We currently use a very simple Serialised LOB approach and are now looking at indexing options. We will be looking into the other design patterns and Elastic search that are mentioned. Will come back when we have made our decisions (perhaps a blog post or something).

Thanks very much!

-------------

One addition I wanted to make is that it is possible to enforce unique using EAV - magento does this as follows:

There are two tables, one contains the entity *definition* and the other the attributes for those entities.

Entities
------
entity_id
entity_name
entity_table

Attributes
--------
attribute_id
entity_id
attribute_name
attribute_type (varchar, text, int)

Then for each entity a concrete table is created, one for the entity itself and then one table for each attribute type - for example customer:

customer_entity
-------------
customer_id
created_at
updated_at

customer_entity_varchar
customer_entity_int
customer_entity_text
customer_entity_....etc
-----------------
customer_id
attribute_id
value

Let's say you wanted to add 'email' as a unique attribute for customers. It would now be possible using this approach to say that value on customer_entity_text must be unique (using a composite index against the attribute_id and value); 2/06/2013 7:18 AM
Bill Karwin said...: Hi Mike, thanks for your thoughts and comment!

Regarding the Magento EAV schema, how could email be unique if the customer entity has two text attributes, one that should be unique (email) and one that shouldn't be unique (postal address)? Wouldn't both attributes of the same type text be stored in the same `customer_entity_text` table with different attribute_ids?; 2/06/2013 7:52 AM
Mike Parkin said...: Hi Bill - you're absolutely correct - that wouldn't be possible! :-); 2/06/2013 10:09 AM
Mike Parkin said...: Hi Bill,

We have a new member of staff and because of this are re-visiting the love and joy that is EAV.

Whilst we were having a discussion, we realised that it is actually possible to have unique constraints, it would just require extra tables.

You would need to have a table for unique and non-unique:

customer_entity_varchar
customer_entity_int
customer_entity_text
customer_entity_unique_varchar
customer_entity_unique_int
customer_entity_unique_text

This way you can keep unique and non-unique attributes.

An obvious real-world problem we can see with this approach though is that updating the 'unique' constraint on an entity would be quite an expensive operation (copying data between tables).

:); 4/30/2013 1:58 AM
Bill Karwin said...: Hi Mike,

What if your entity requires the combination of two attributes to be unique? For example, how would you model this table:

CREATE TABLE Playoffs (
game_id INT PRIMARY KEY,
team1 INT NOT NULL,
team2 INT NOT NULL
UNIQUE KEY (team1, team2)
);; 4/30/2013 5:00 PM
Unknown said...: Just a thought on enforcing uniqueness in EAV for some attributes and not others:

Could you not just add a unique_key column with a trigger that looks up the attribute in the metadata, determines whether or not it is unique, and then either adds an incrementing number (not unique), or a constant 1 (unique)? You could then put a unique index across the columns attribute, value, unique_key. Whenever unique_key = 1, you have then effectively enforced unique values for that particular attribute; where unique_key increments, you can re-use the same value...

Does this make sense?

(You can't use this to enforce a unique constraint on a combination of two attributes, but you should only really be using constraints like that when you know that both attributes will definitely be present - in which case you could put them into standard columns, rather than into the dynamic EAV section of your schema.); 5/22/2013 4:06 PM
Bill Karwin said...: Hi Sam, thanks for your idea.

Yes, that would support uniqueness for some attributes but not others. But it just adds evidence that EAV is an example of the Inner-Platform Effect.

One definition of the Inner-Platform Effect is a result of designing a system to be so customizable that it ends becoming a poor replica of the platform it was designed with.; 5/23/2013 8:11 AM
ronnie said...: I'm making a product review website, wherein, the columns for various varieties of products are 15-20 in number. They could increase too, since we'll be taking feedback from users as to what else attribute that describes the product to add as a column in the corresponding product category's table.

my question is, does this EAV table good for my need?
Say, for CARS, i would have "power", "torque", "handling" as hard coded columns. But in the other model, I'll have 3 columns - car_model, attribute_name, and attrib_value. Then in that table I could insert power, torque, handling as data in the attribute_name column, and their corresponding data in the same row under the attrib_value column.

Which sounds better>; 8/21/2013 9:25 AM
Bill Karwin said...: Hi Ronnie,

I would choose the "Serialized LOB" pattern for your case. If you have numerous optional attributes of a product, and users may define new attributes, then you app should serialize all the product-specific attributes (think JSON or something) and store them in a BLOB or TEXT column.

See also my recent presentation Extensible Data Modeling with MySQL; 8/26/2013 11:00 AM
nyb said...: Bill,

How does one go about convincing "agile" developers that EAV is a poor choice, given that they see it as a way to speed up their development time. They complain that there is a lot of code they have to write if they have to define each entity in their code... Is there any way to convince them that if they have a table with the name of "account_attributes" they are doing it wrong? Columns *are* attributes.; 10/18/2013 6:23 AM
Bill Karwin said...: Hi @NWest, thanks for the question.

It might be worth pointing out to these developers that when they use EAV, they have to write a lot more custom application code to enforce data integrity.

If they have multiple applications using the same database, they have to implement similar data integrity code in each app (some projects even have multiple apps in multiple languages accessing the same data). That violates the agile principle of "don't repeat yourself."

The database enforces data integrity constraints with less code, since SQL is a declarative language, and enforce these rules consistently for all apps that use that database. This should be more appealing to agile development buffs.; 10/18/2013 2:38 PM
Bill Karwin said...: Data integrity constraints include FOREIGN KEYs, but also UNIQUE keys and even NOT NULL. There's no way to enforce NOT NULL in EAV, that is, to make a specific attribute mandatory.; 10/18/2013 2:41 PM
Cassiano Rabelo said...: Hello Bill.
Great article!

Unfortunately knowing what I shouldn't do, doesn't mean I know what to do :)

I need to create a database to store user generated surveys. Something very similar to Google Forms.
I'm completely lost on how to approach this problem. How would you design a scheme for something like this?
Thanks a lot!; 11/07/2013 4:19 PM
Bill Karwin said...: Hi Cassiano,

It depends a lot on how you will use the data you store.

Check out my presentation on Extensible Data Modeling, which covers the pros and cons of several solutions.

For example, you could use Serialized LOB and just store a JSON encoding of the whole form. That might be sufficient, if the only thing the database does is store and retrieve the whole form data.

But if you need to search it, for example to count how many people gave a specific answer, then you should have some way of indexing the form fields for searching.

My presentation gives several alternatives to EAV, which can be useful in different circumstances.; 11/07/2013 6:29 PM
Cassiano Rabelo said...: Thanks Bill! Very helpful presentation! Cheers!; 11/08/2013 5:00 PM
Jaimie Sirovich said...: @Bill

Worth a note that in addition to hot table changes, MariaDB has added a native Serialized LOB support. PostgreSQL also has H-Store, which is a LOB + supports indexing via GiST. And I think you're talking about MSSQL's sparse columns when you speak of variable attributes. Very cool feature, actually.

And I despise the concrete inheritance idea but Maria has CONNECT, which could help here. PostgreSQL also has native support for class table inheritance, no?

CREATE TABLE capitals (
state char(2)
) INHERITS (cities);

RE: enterprise design patterns designed to handle them:
- Single Table Inheritance
- Concrete Table Inheritance
- Class Table Inheritance
- Serialized LOB, optionally supported by inverted indexes.
- Non-relational data stores that can handle data with variable attributes.; 2/20/2014 11:53 AM
Jaimie Sirovich said...: Ah, I misread your last one and didn't realize you meant document stores. And you mentioned MariaDB's serialized LOB feature in your slide deck. PostreSQL never gets any love for its version with GiST indexing, though :) You can skip the hacky inverted index table. GiST does it for you generically.

Slide 69 is gold.

MS-SQL's compromise is actually interesting. I think MS-SQL allows for hot schema changes as well, so it's actually viable to just slap on a column and make it sparse if you know it's specialized. Then simply make your application dynamically add columns with a prefix or something to prevent collisions with built-ins.

What do you think of that?; 2/20/2014 10:46 PM
Mellow Belly said...: Good read: Incredibly relevant 5 years after posting, and probably well into the future. I was about make the mistake of going with the design you warn against, then I realized you are right. Most importantly you're right from a computational view: even if I end up with thousands of user generated tables, access is going to be fasted for just about any query that way.

But I also feel what many were saying - its absurd to log in to phMyAdmin an flip through thousands of pages of user generated tables to get to what I want. What really needs to happen is a new SQL construct like "TableList" or something, that allows for the functionality we need without cluttering up so many interfaces and views of the database, and can possibly even allow for EAV like queries, where many tables are queried as one when they share a logically identical column.; 1/02/2015 5:57 PM
Bill Karwin said...: Hi Luke,

Thanks for your words about this blog post.

I'm not sure I'd recommend any architecture that resulted in thousands of user-defined tables. Besides phpMyAdmin, there would be a lot of other bottlenecks created by having so much metadata.

If you have to support that much fluidity of properties, using a document store might be better, or else if you use an RDBMS, then store the custom properties in a BLOB.; 1/02/2015 10:38 PM
Mellow Belly said...: Hey Bill, thanks for your response.

My calculations were that if the info on the tables (internal to db) were properly indexed, it would be computationally faster than the EAV thing, especially for joins, and more robust. But you warn about bottlenecks - can you point me in a direction where I could learn more about that?

The kind of app I'm talking about: Call it myCoolEasyForms.com. People can log in and design forms with any of a diverse set of predefined elements, send them out to groups to get the info, Then query the info they collected, including things like row level access so people can edit their own form submissions, etc.

Would you really recommend a document store for this sort of thing? Also the link you've referenced in these comments to friend feed is dead. Do you know of another?; 1/09/2015 9:31 AM
Mellow Belly said...: Just to be a little clearer on what I just said, I'm no DB guru, I'm just thinking data structures: If I had a large list of 'table' objects indexed by a balanced binary tree, with a key indexed by a binary tree, it would take
log2(T)+log2(R) operations to retrieve a record, where T is table number and R is rows in that table, irregardless of the number of columns. But in the other model, to retrieve a record I have to do a join that takes
C * log2(sum(R)) operations, where C is the number of columns and sum(R) is the sum of ALL logically equivalent entries from all tables. Its doomed to be waay slower.; 1/09/2015 11:09 AM
Bill Karwin said...: Hi Luke,

The article on FriendFeed seems to live here now:
http://backchannel.org/blog/friendfeed-schemaless-mysql

I also cover the equivalent solution as "Inverted Indexes" in my presentation:
http://www.slideshare.net/billkarwin/extensible-data-modeling

The bottleneck of creating a large number of tables is that the metadata needs to be stored somewhere. A large data dictionary can consume an unbounded amount of memory. See the following two blogs about the effect of this in MySQL:

http://www.percona.com/blog/2010/05/06/how-much-memory-innodb-dictionary-can-take/
http://www.percona.com/blog/2009/02/11/limiting-innodb-data-dictionary/

Your example of a generic forms processing solution is a classic example of an application that tempts one to use EAV. For an application that demands that much flexibility, I would use either a semi-structured data technology (like a document store) or else an inverted index, and keep the form data stored in a BLOB.; 1/09/2015 11:57 AM
Mellow Belly said...: Excellent presentation, those links were really informative.

The inverted indexes don't sound too bad, actually. Just to make sure I have this right though, for the example site I described above what you would do is have a table with say JSON blobs that define the forms (for rendering and validation) then a table with all the input (in JSON) from all the forms, then an indexed table for each of the possible logical type of items that could be in the forms you might search by? (like email, searchable text, etc.)

Thanks for sharing your insight on this, I appreciate it.; 1/16/2015 7:30 PM
Bill Karwin said...: Luke,

Right, it would be good to have one table for form definitions, then a second table for form responses. There would of course be many rows in the responses table for each form definition.

And yes, for each field you want to be searchable (or sortable), create an inverted index table.; 1/16/2015 9:41 PM
Unknown said...: @Bill @Luke PostgreSQL now has some MongoDB-like behavior and will index JSON data. So you can stay relational where possible, and "break the rules" when needed without entirely losing your indexes. See http://www.postgresql.org/docs/9.4/static/datatype-json.html#JSON-INDEXING; 1/25/2015 4:03 AM
Unknown said...: Very interesting read Bill, I follow your EAV posts with interest. Our company has developed a halfway house of EAV modelling which is causing me all sorts of headaches but I'm struggling to convince the application team it is a bad idea.

In its simplest form we have implemented Key value pairs for each of our core dimensions and moved the attribute descriptions into an EAV table, but only for the dimension attributes. This is proving to be a bit of a support nightmare.

What are your thoughts on creating a dimension with generic attributes but still storing in columns.

Reason being we want to keep our ETL code consistent between clients but want to allow one retailer who sells electronics to use one of the product attributes to store RESOLUTION in one field but the other retailer who sells food products to use a generic attribute field to store a LOW_FAT_FLAG as one of their attributes. (There is then meta data explaining what field is what) I wonder if it's worth it and whether we should look at other solutions.; 3/25/2015 2:42 PM
Unknown said...: @Anthony You can do that, but then you're going to have first class and second class data in terms of how to and how easy it is to load. Personally, I would keep a configuration file to append columns or use something like PostgreSQL's table inheritance to get done what you want.; 3/25/2015 5:38 PM
Bill Karwin said...: @Anthony,

It sounds like you have used the EAV to implement another antipattern that I've seen called the One True Lookup Table.

This is not how data warehouse or star schema are typically designed. In a star schema, each dimension gets its own separate table. That way, you can use columns in the conventional way, with proper data types and constraints (which will be necessary if you want to extend out to use a snowflake schema). Also you can join the dimension tables back to your fact table in a straightforward way.

You should read one of the books written by Ralph Kimball about data warehousing, if you need some expert authority on the matter.

I understand the desire to keep the ETL code consistent between clients. Otherwise you get a different support nightmare, because every client would have custom code that needs to be maintained. But as it is now, it sounds like the code to manage the EAV is hard to maintain, so you really haven't gained anything, and you have an inefficient and fragile database too.

Here's an alternative design: store the data warehouse in a conventional star schema. But do not try to maintain lots of ad hoc custom code for each client. Instead, generate the code from the client's metadata. That is, each client would have a custom schema, and a set of classes to query that schema. But these classes are generated code. Then you just have to maintain the client's metadata, which is probably a lot easier. Then each time something changes in the metadata, re-run your code generation tool.

Another idea is to check out the product made by Looker.com, which could help manage complex queries against your data models. I visited that company some time ago and I was very impressed. Their query builder is very versatile, while still working with non-EAV schemas.; 3/25/2015 11:54 PM
Anonymous said...: Hi Bill,

Good post!

I've also written about EAV/MUCK here:
https://rodgersnotes.wordpress.com/2010/09/21/muck-massively-unified-code-key-generic-three-table-data-model/

There's a good point in here about table joins.
Joins, of course, are fundamental to making an RDBMS work.

It would be interesting to see a side by side comparison of a 5 table join in RDBMS vs EAV/MUCK.

Best,

Rodger; 7/28/2015 10:33 AM
Unknown said...: hi bill ,i have a website with eav and launched, let me introduce my website, my website is about a buying and selling between peoples , peoples can insert their second hand products to website and other peoples can see that product and call them , i have about 100 category in website like mobile ,tv,home pets, ....... and each category have some fileds forself , so i use 3 table like magento attribute_value_int[attributeid,product_id,value] attribute_value_decimal attribute_value_string so the main problem is sql time because for each attribute i should use a JOIN and it takes system slowly but the question is people can insert many products to website and we estimate products go to 3-4 million record in 2 years . can eav handle this ? now i have 68000 records and i want to know if i change my system find a way to port my data to new design .; 9/25/2015 7:05 AM
Bill Karwin said...: Hi babak faghihian, thanks for writing.

First of all, I recommend don't do all those JOINs for each EAV attribute. You don't need to pivot the EAV attributes into one row using SQL. Instead, query for all the attributes but fetch them as multiple rows, as they are stored in the database. In your application, you can fetch the rows sequentially and build a hash array or an object.

Here's an example in PHP of what I mean:

$sql = "SELECT product_id, attribute, value FROM product_eav WHERE product_id IN ( ... )";
$stmt = $pdo->query($sql);

$objects = array();
while ($row = $stmt->fetch()) {
if (!isset($objects[ $row["product_id"] ])) {
$object[ $row["product_id"] ] = new Product();
}
$objects[ $row["product_id"] ]->$row["attribute"] = $row["value"];
}

I believe this is the way Magento works, with a "Table Module" pattern to map multiple rows from the EAV table into one object instance in their application.

See also:
http://martinfowler.com/eaaCatalog/tableModule.html
http://stackoverflow.com/questions/433819/table-module-vs-domain-model

Second, 68000 rows is a small table by modern standards. 3-4 million rows in 2 years should not be a problem. I work on systems that have hundreds of millions of rows or even billions in a single table. You may need a powerful server with a lot of RAM dedicated to the database cache (for example in MySQL, this is the innodb_buffer_pool_size). But with carefully chosen indexes, you can have a very large table and still work on it efficiently.; 9/26/2015 10:03 AM
Unknown said...: hi bill,thank you for your response , in your way the question is i want to search product with their additional fields how can i search them in my queries? and every parameter related to specific table so in my web application i searched 20 product so 20 product * 3 table [attribute_value_int][attribute_value_varchar][attribute_value_decimal] means 60 query in every search request?

and how i search between 3 tables in same time because every field condition depend to their type table

so is it better to use a single query with many joins ?or fetching records by 60 queries?

and another question is can i safe in future with eav where we got 3-4 melion second hand products?; 9/27/2015 2:50 AM
Bill Karwin said...: Babak,

The only thing worse than an EAV table is *three* EAV tables! :-)

If you store int, decimal, and varchar data differently, then you should have a record somewhere else for each attribute, to remember what format the value is. This will help you query the correct table.

Suppose you search for books with 500 pages. First look up the attribute called "pages".

SELECT attribute_id, type FROM attributes WHERE name='pages'.

This returns for example 1234 and "int". Now you know which EAV table to query for the pages.

SELECT pages.* FROM attribute_value_int AS pages WHERE pages.attribute_id=1234 AND pages.value=500;

But that gets you only one attribute for the book -- the pages.

If you want all the other attributes for the same product, you can look up the pages attribute with the query above, and JOIN it to other rows in each EAV table matching the same product. Since you have three EAV tables, you may combine the results using UNION:

SELECT other.* FROM attribute_value_int AS pages
JOIN attribute_value_varchar AS other USING (product_id)
WHERE pages.attribute_id=1234 AND pages.value=500
UNION
SELECT other.* FROM attribute_value_int AS pages
JOIN attribute_value_int AS other USING (product_id)
WHERE pages.attribute_id=1234 AND pages.value=500
UNION
SELECT other.* FROM attribute_value_int AS pages
JOIN attribute_value_decimal AS other USING (product_id)
WHERE pages.attribute_id=1234 AND pages.value=500;

This returns attributes from all three tables, for any attribute that belongs to the same product.

I want to be clear that I DO NOT LIKE EAV. Supporting flexible attributes is fundamentally incompatible with a relational database, and trying to use EAV to do it is always more difficult than using normal columns for attributes. If you think you should use EAV, then you should consider using a NoSQL database for this project instead.

Another piece of advice I have is that any strategy for scalability must include a plan for archiving data. No solution can continue with good performance while you add more data for years and years. Your project is for selling second hand products (I have read about Mozando, which does something like eBay in Iran), I expect that once a product sale is complete, perhaps no one is interested in searching old products that have already been sold. Or at least the types of searches are different, they may want to know history but they aren't looking for something to buy.

So your application should copy the data to another table for archiving after the sale is finished. This is the time to change the EAV format into JSON and store all variable attributes in a BLOB, on one row per product. Keep common attributes like the seller, the buyer, the sale date and sale price in normal columns.

Then remove that sold product's attributes from your EAV table. That keeps the EAV table only for current products that are still for sale, and the table stays smaller and queries will be faster.; 9/27/2015 1:58 PM
Unknown said...: Thank you for your help and for your good and complete response , its for one attribute searching if attributes in search grow up then foreach attribute i should repeat this three queries? so if i want to search with 5 additional parameters the count of querise grow up ?yeap ?

So after reading your slides i find a better in my mind i want to know what your opinion ,
I make the attributes of each product json and push them into main table as extra_info column ,
so if i search with parameter i search in tables with select in like :
select from advertise where advertise.id in (select advertise_id from attribute_value_int where attr_id=2 and value =5 ) and advertise_id in (select advertise_id from attribute_value_int where attr_id=4 and value= 6 )

then for showing the complete fields i json decode extra_info column of main table and dont need to unnecessary joins for fetching all atributes to show

in fact i search eav tables to find which advertise have this parameters and after find that with foreach in my php code i decode json fields ,

whats your opinion? is it have better performance ?; 9/28/2015 1:56 AM
Bill Karwin said...: babak,

In your query above MySQL will use an index for each attribute lookup. But although it should use the primary key of the advertise table, in practice this type of usage of subqueries is a weakness of MySQL. I would recommend using joins instead.

select a.* from advertise as a
join attribute_value_int as i1 on a.id=i1.advertise_id
join attribute_value_int as i2 on a.id=i2.advertise_id
where i1.attr_id=2 and i1.value =5 and i2.attr_id=4 and i2.value= 6

I still can't endorse the use of EAV tables in SQL, because they're not relational. But I understand if you are stuck with them at least for now.; 9/30/2015 8:47 AM
Jaimie Sirovich said...: What about sparse columns in MS SQL? They really do look like a good solution to most of this muck.; 9/30/2015 10:34 AM
Bill Karwin said...: Jaimie,

Microsoft SQL Server sparse columns have at least these two limitations:

1. You must use Microsoft SQL Server. There's nothing wrong with using Microsoft, but it doesn't help people who use a different RDBMS.

2. You still have to declare the columns in your CREATE TABLE statement. The reason that people use EAV is that they can't declare all the columns they need, and they would have to add new columns with ALTER TABLE. Whether you use sparse columns or not, you'd have to declare your columns by name.

Sparse columns are a storage-level improvement to avoid taking space by storing NULLs. MySQL's InnoDB for example already does this optimization by default.; 9/30/2015 4:41 PM
Jaimie Sirovich said...: Given it's MS SQL. It's just a design I like. Doesn't it also allow for hot column addition? If so, the schema change isn't a big deal. Anything is better than EAV.; 9/30/2015 4:44 PM
Bill Karwin said...: Jaimie, I'm not a Microsoft user so I don't know from experience. But glancing at the docs, I get the impression that adding a sparse column to a table is done in exactly the same way as adding a non-sparse column, with ALTER TABLE.

The benefit of sparse columns is that they don't take any space if you store zero or null. You can have up to 100,000 sparse columns, and these don't count against the limit of 1024 conventional columns. They also impose a storage penalty for non-null values, and accessing sparse columns may have some performance penalty.

Nowhere have I read that you can declare new column names on the fly without an ALTER TABLE.

Some people count sparse columns as an alternative to EAV for the task of storing semi-structured data, but it doesn't seem to support dynamic creation of new sparse columns without DDL.

Here are some blogs I have read:
https://www.simple-talk.com/sql/t-sql-programming/null-friendly-using-sparse-columns-and-column-sets-in-sql-server/
http://stackoverflow.com/questions/7609608/sql-server-alter-table-how-to-add-sparse-definition
http://blog.sqlauthority.com/2008/07/10/sql-server-2008-introduction-to-sparse-columns/
http://blog.sqlauthority.com/2008/07/14/sql-server-2008-introduction-to-sparse-columns-part-2/; 9/30/2015 8:02 PM
Jaimie Sirovich said...: You're probably right, and I'm not an MSSQL user, either. It's just one of the better approaches I've seen to the EAV problem. It doesn't screw up locality. For the life of me I don't understand why people think EAV is better than just storing the rest of 'it' in a BLOB. You'll end up with some synchronization issues, but I've never seen an EAV query actually use an index in a meaningful way. MariaDB (and I think PostgreSQL) even has a way to pull stuff out of a BLOB. The data is scattered all over the drive, and even SSD benefits from locality. Rows are good. Even noSQL has "rows." . But I'm preaching to the choir, I think.; 9/30/2015 8:22 PM
Unknown said...: hi again bill,
after weeks i havent good sleep and i thing yet to eav model

one of my categories have 15 additional fields this field changed every year maybe in all categories (maybe add a column or remove a column every six months not in every second ) and when i use eav query it takes 16 join with main table what should i do is it better to port my data to another scenario ? ( i told u before that my product go to 3 - 4 millions ) another thing that you know its every of 15 additional fields of one of my categories is searchable (means a query with 16 left join and can have 15 where condition ) what u suggest; 10/28/2015 6:22 AM
Bill Karwin said...: Hi Jaimie,

You might like the new feature in MySQL 5.7. They now support JSON as a data type, with functions to reference individual fields within the JSON. They also support expression indexes, so in theory you can create an index on the search for a specific field in the JSON.; 11/01/2015 7:32 AM
Bill Karwin said...: Babak,

The best option is to convert your product data to a non-relational database such as MongoDB or Apache Solr.
Using EAV is never going to be an easy option.; 11/01/2015 7:43 AM
Jaimie Sirovich said...: @Bill

Maria has had that for awhile. Now I just need to see how different it looks. Unfortunately, it looks like they're forking a tiny bit. Babak should probably just use Solr or Sphinx. You can almost use Sphinx like a storage engine.; 11/01/2015 11:29 AM
Unknown said...: Bill,
thank you for your response can i use mysql 5.7 json field to store my additional fields in new column and search with that ?why you prefer to use mongo db than use json column of mysql 5.7 ?; 11/04/2015 1:11 AM
Bill Karwin said...: Babak,

I'm not a fan of MongoDB, mostly for its naive marketing claims that it's magic and makes everything easy and fast. That's not true. There are also some serious tradeoffs with performance if you care about data durability.

I haven't tested the MySQL 5.7 JSON fields, but as I understand the feature, searching the JSON will incur a table-scan unless you use the new expression indexing feature to index selection of a specific field from the JSON. That won't work when you need to search for multiple fields. Well, it will work to search for one field, and that might narrow down the search to a subset of matching records, so that it's tolerable to search the other fields without the help of an index. But you won't be able to search for multiple fields with the help of an index in the same query.

MySQL 5.7 is also very young currently. As with all major MySQL releases, it has a lot of new features not yet field-tested. It's certain to have bugs not yet discovered. I like to follow "the rule of .20" which means don't trust the new version in production until it reaches 5.x.20. This was a pretty good guess in the past, because 5.1.20, 5.5.20, 5.6.20 were much better than their first GA releases. The current GA release is 5.7.9, so we have some months to go. You can use that current version in an environment more tolerant of failure, but I wouldn't rely on it for production yet.; 11/04/2015 6:50 AM
Jaimie Sirovich said...: @Bill I don't see how it indexes an array either. PostgreSQL has GIN and GIST for a generic implementation of an inverted index, which basically means it's going to do an index merge across the entire array. AFAIK an expression-based index in MySQL would still only work for primitives. Am I missing something?

Not that it's a great idea, but you can model a JSON index with a bunch of carefully-named FTS tokens or something.; 11/04/2015 9:37 AM
Bill Karwin said...: @Jaimie: Right, the expression-based index can only index the result of a scalar expression. MySQL doesn't support an array type, but if you have an array in your JSON structure, you'd have to make an expression to return a specific fixed array element.

It also means you have to create a new GENERATED column for each JSON field you want to be indexed, so it doesn't work for arbitrarily diverse JSON documents. I.e. you can't make arbitrary new fields indexable.

But that's no different in MongoDB -- you still need to know which fields you want to index up front.

In a fulltext search solution like PostgreSQL's, you do get indexing of all words in the document, but it's not the same as searching for a field with a specific value in a JSON type document.

The rquirement to have both dynamic creation of fields, *and* automatic indexing is, as far as I know, impossible. Babak should not feel stressed about his project to the point of losing sleep. It's an unsolvable problem, which means any employer requiring that he solve it is being unreasonable.; 11/04/2015 6:49 PM
Jaimie Sirovich said...: @Bill Mongo lets you index arrays at least to the extent PostgreSQL does. I wish MySQL had arrays and something like GIN/GIST. But what I was getting at is that they're basically generalized FTS. I think one is an inverted index, and one uses a sigfile. That's all FTS really is, so you could use FTS to model a JSON array if you really need to. I think PostgreSQL's FTS is based on GIN&GIST anyway, which makes sense. IMO the lack of arrays makes the JSON functionality really limited unless you could generate an expression that maps to a 1:M relationship somehow. But that seems even less straighforward.

The MySQL scalar indexing is probably really great for JSON blobs unless you're dealing with arrays. PostgreSQL wins again, but I don't love using it. Maybe he should use PostgreSQL?; 11/05/2015 7:31 AM
Anonymous said...: What about removing attributes or renaming? What solutions, concepts exists for that?; 11/17/2016 12:54 AM
Bill Karwin said...: @chrigaa, thanks for your question. The proper way to add, remove, or rename attributes in a relational database is to use ALTER TABLE.; 11/17/2016 11:48 AM
Zen Force said...: Hi Bill
i went through your slide deck for extensible data modeling solutions.
* Extra Columns
• Entity-Attribute-Value
• Class Table Inheritance
• Serialized LOB & Inverted Indexes
• Online Schema Changes
• Non-Relational Databases

You may be already aware SalesForce uses one Big Table where as WorkDay uses EAV model.

Which solution you recommend for a SAAS application that need to run on MySQL Server?; 4/17/2017 1:05 PM
Bill Karwin said...: Hi Zen Force, thanks for the comment.

MySQL 5.7 has added support for a native JSON data type. You can now use it like a document-oriented database when you need to have flexible fields, and like a more traditional relational database for the rest of your data.

See https://dev.mysql.com/doc/refman/5.7/en/json.html; 4/17/2017 3:19 PM
Zen Force said...: This comment has been removed by a blog administrator.; 4/19/2017 11:53 AM
Unknown said...: Some of the solutions offered are worse than the problem(s).

Using XML/JSON blobs is not a good solution. Any read needs to de-serialize the entire blob. Any update needs to de-serialize and re-serialize the entire blob. That's a lot of work if you have thousands of potential attributes. What if your attributes contain snippets of html, xml or JSON?
This design also lacks UNIQUE, NULL and foreign key constraints which are presented as negatives of EAV.
If we're going to make arguments based on relational purity, how can we justify storing multiple data points in one column? And how about storing data only once which is violated by the Inverted Indexes solution?

Non-relational databases add a lot of complexity(loading, indexing, client-side code) while at the same time don't offer easy solutions to enforcing data integrity constraints.

Dynamically adding columns to tables works but there are limits. MSSQL has a hard 8060 bytes row size limit no matter how many columns you have, SPARSE or not. Makes INSERT/UPDATE behavior unpredictable. You will get input-dependent errors.
Class table inheritance works unless you need a lot of attributes in which case you're back to the row size limitation above.; 8/27/2018 12:23 PM
Bill Karwin said...: Hi Unknown, thanks for the comment.

I am exploring use of JSON in MySQL 5.7 more these days and I would agree with you. Almost every use of JSON I have seen has resulted in more complexity of the data model, SQL queries, and application code. It also sacrifices all the benefits that SQL schemas provide, such as data types and constraints, even NOT NULL, as you say.

Another issue with JSON is its complex logic. Developers are confused already by tri-value logic in SQL (true, false NULL). They will find that JSON introduces an additional complexity: now logic must handle JSON attributes that are true, false, present but NULL, and not present. That is, a missing attribute is different from an attribute that is present but has a NULL value.

JSON is like a form of denormalization. Any type of denormalization can provide optimization for one specific query type, but it comes at the expense of every other type of query you might do against the same data. Developers have to be mindful of this, and make sure they have analyzed all potential query types they might do against their data. If they can't predict the queries, they will find denormalized designs create more problems down the road.; 9/09/2018 3:22 PM

Keeping it Simple

Thursday, May 21, 2009

EAV FAIL

118 comments:

Subscribe Now

Blog Archive

About Me

Labels