Tuesday, January 13, 2009

The Next-Gen Databases

A user recently asked:
I'm learning traditional Relational Databases (with PostgreSQL) and doing some research I've come across some new types of databases. CouchDB, Drizzle, and Scalaris to name a few, what is going to be the next database technologies to deal with?
SQL is a language for querying and manipulating relational databases, and it is defined by an international standard. Although the standard is revised from time to time, it seems to always stay within the relational database paradigm.

Here are a few new data storage technologies that are getting attention currently:

  • CouchDB is a non-relational database. They call it a document-oriented database.
  • Amazon SimpleDB is also a non-relational database accessed in a distributed manner through a web service. Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services.
  • Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
  • BigTable is a proprietary data storage solution used by Google, implemented on top of their Google File System technology. Google's MapReduce framework can use BigTable as a data source and as an output target.
  • Hadoop is an open-source technology inspired by Google's MapReduce that serves a similar need: distributing work across very large-scale data stores.
  • Scalaris is a distributed transactional key/value store. It is also not relational and does not use SQL. It's a research project from the Zuse Institute in Berlin, Germany.
  • RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language SPARQL, which resembles SQL superficially, but is actually totally different.
  • Vertica is a highly scalable column-oriented analytic database designed for a distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon's Elastic Compute Cloud.
  • Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
  • XML isn't a DBMS at all; it's an interchange format. But some DBMS products work with data in XML format.
  • ODBMS, or Object Databases, are for managing complex data. There don't seem to be any dominant ODBMS products in the mainstream, perhaps because of lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
  • Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable "cloud computing" system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.
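
To make the difference between these data models concrete, here is a minimal sketch in Python that stores the same record three ways. It is only an illustration under stated assumptions: the `users` table, the `user:1` key, and the sample values are hypothetical, SQLite stands in for a relational database, and plain dictionaries stand in for a key/value store and a document store. None of this uses the actual APIs of the products listed above.

```python
import json
import sqlite3

# Relational: fixed columns declared up front, queried with SQL.
# (SQLite is used here only as a convenient stand-in.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Ada", "Berlin"))
relational_row = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (1,)
).fetchone()

# Key/value: an opaque value looked up by its key; there is no query language,
# and the store knows nothing about the value's internal structure.
kv_store = {}  # hypothetical stand-in for a key/value store
kv_store["user:1"] = "Ada|Berlin"
kv_value = kv_store["user:1"]

# Document-oriented: each value is a self-describing document, and
# different documents may carry different fields.
documents = {}  # hypothetical stand-in for a document store
documents["user:1"] = json.dumps({"name": "Ada", "city": "Berlin", "tags": ["admin"]})
doc = json.loads(documents["user:1"])
```

In the relational case the schema lives in the database and SQL can filter on any column. In the other two cases the application decides how to interpret the stored value.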

Relational databases have weaknesses, to be sure. People have been arguing that they don't handle all data modeling requirements since the day the relational model was first introduced.

Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don't fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers, instead of central database servers.
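
As a rough illustration of the kind of processing that gets spread across many servers, here is a single-process sketch of the map/reduce pattern in Python. It is not the API of Google's MapReduce or of Hadoop. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and everything runs in one process where a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document):
    # Emit a (word, 1) pair for every word in one input document.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group the emitted values by key, as a framework would do between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    # Each map_phase call depends only on its own document, so in a real
    # system these calls could run on different servers.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["SQL is not going away", "SQL is a language"])
```

Because each map call touches only its own input and each reduce call touches only one key, the work can be partitioned without a central database server coordinating every step.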

Even though these advanced technologies do great things to solve the specialized problems they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn't going away.

I'm posting to my blog the questions I've answered on StackOverflow, which earned the "Good Answer" badge. This was my answer to "The Next-Gen Databases."


JanL said...

Scalaris is not Google Technology: http://www.zib.de/CSR/Projects/scalaris/

Bill Karwin said...

@JanL: Thanks much! My mistake, I shouldn't have assumed it was a Google project just because the code is hosted at code.google.com. I've corrected the link above.

Jim said...

You should probably also add Dynomite (http://github.com/cliffmoon/dynomite/tree/master) and Kai (http://kai.wiki.sourceforge.net) to your list. Both are similar to Scalaris in scope/implementation, by which I mean Erlang-based distributed key-value stores similar to Dynamo (which is also missing from your list.)

What might be a more interesting organization of your list would be to remove things like RDF and XML, which have nothing much to do with the subject at hand, and organize the remainder on broader categories like column-oriented DBs, distributed key-value stores, document-oriented DBs, etc.

Bill Karwin said...

@Jim: Thanks, I've added the solutions you mentioned. I didn't intend this to be a comprehensive list of alternative data store technology, just a sampling.

I'm keeping XML and RDF in the list, even though I realize these are data formats and not query engines, because they receive at least as much attention as many of the other solutions.

Zenobia Godschalk said...

I would also add Aster Data Systems, which was an early adopter of MapReduce and also leverages SQL for their 'frontline data warehouse'. They count MySpace, Akamai and others as customers, and JAFCO/Sequoia just gave them $12M in Series B funding.

L. Mohan Arun said...

Here is related reading on the subject


I don't know how he included key-value stores based on Java, because anything based on Java is slow, disorganized and heavy.