Polyglot Persistence
Posted on October 15, 2008 by Scott Leberknight
In late 2006 Neal Ford wrote about Polyglot Programming and predicted the wave of language choice we are now seeing in the industry: using the right language for the specific job at hand. Instead of assuming a "default" language like Java or C# and then warring over the many available frameworks, polyglot programming is about choosing the right language for the job rather than just the right framework(s). For a while now I've thought that, paralleling Neal's description of polyglot programming, a relational database seems to be the accepted, default choice for persistence. Sometimes this is because organizations have standardized on an RDBMS and there simply isn't any other choice. Other times it is just what we're used to doing, and we may not even consider alternatives. But now, with things like Amazon SimpleDB, Google Bigtable, Microsoft SQL Server Data Services (SSDS), CouchDB, and lots more, it seems we're seeing the beginning of Polyglot Persistence in addition to polyglot programming.
Polyglot Persistence, like polyglot programming, is all about choosing the right persistence option for the task at hand. For example, some co-workers of mine on one project are effectively using Lucene as their primary datastore, since the application they've built mainly performs complex full-text searches very quickly against huge datasets. Most people probably don't think of Lucene as a data store and just consider it their full-text search engine. But for this particular application, which aggregates multiple disparate datasets, glues them together, and performs full-text search against the consolidated view of the data, it makes a good deal of sense. It also helped that in a bake-off against a very popular traditional RDBMS's full-text add-on product, the Lucene search solution blew the doors off the RDBMS in terms of performance, and that was even after a team of consultants from the vendor came in and tried to optimize the search performance. So, in this case a non-relational data store made more sense given the problem context, which was data aggregation and fast full-text search.
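To make the "aggregate, then search" idea a bit more concrete, here is a minimal sketch of an inverted index, the general data structure that full-text engines such as Lucene are built around. This is not Lucene's API and not my co-workers' code; the function names, the document ids, and the two sample "datasets" are all made up for illustration, and real engines add analysis, ranking, and on-disk storage on top of this.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Two hypothetical source datasets, merged into one consolidated view.
customers = {"crm:1": "Acme Corp renewal contract", "crm:2": "Globex support ticket"}
invoices = {"erp:7": "Invoice for Acme Corp contract"}
consolidated = {**customers, **invoices}

index = build_index(consolidated)
matches = search(index, "acme contract")
```

The lookup cost depends on the number of query terms and matching documents rather than on scanning every record, which is the basic reason this structure suits search-heavy workloads.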
Within the past few years we've started to see and hear about how companies like Amazon and Google are using non-traditional data stores such as SimpleDB and Bigtable for their own applications. Google App Engine in fact provides access to Bigtable, described as a "sparse, distributed multi-dimensional sorted map," as the sole persistent store for Google App Engine applications. Other organizations like the Apache Software Foundation have gotten into the non-relational data store market as well with things like CouchDB, which is described as "a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API." One of the common threads among all these non-relational stores is that they are distributed, designed for fault tolerance, embrace asynchronicity, and follow BASE (Basically Available, Soft State, Eventually Consistent) principles and the trade-offs described by the CAP theorem (Consistency, Availability, Partition Tolerance), as opposed to the ACID (Atomicity, Consistency, Isolation, Durability) properties found in a traditional RDBMS. In addition, they are almost all either "schemaless" or provide a flexible architecture that promotes ease of schema changes over time, again as opposed to the rigid and inflexible schemas of traditional relational databases.
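The phrase "sparse, distributed multi-dimensional sorted map" is easier to picture with a toy model. The sketch below assumes the map is keyed by row, column, and timestamp, as in Google's public description of Bigtable; the class and method names are hypothetical, and it is a single-process, in-memory illustration only, with none of the distribution or persistence of the real thing.

```python
from bisect import bisect_left, insort


class SparseSortedMap:
    """Toy (row, column, timestamp) -> value map kept in sorted key order."""

    def __init__(self):
        # Keys are (row, column, -timestamp) so newer versions sort first.
        self._keys = []
        self._values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return the newest value for a cell, or None if the cell is empty."""
        idx = bisect_left(self._keys, (row, column, float("-inf")))
        if idx < len(self._keys) and self._keys[idx][:2] == (row, column):
            return self._values[self._keys[idx]]
        return None

    def scan(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        idx = bisect_left(self._keys, (start_row,))
        while idx < len(self._keys) and self._keys[idx][0] < end_row:
            row, column, neg_ts = self._keys[idx]
            yield row, column, -neg_ts, self._values[self._keys[idx]]
            idx += 1


# "Sparse": each row stores only the cells it actually has.
table = SparseSortedMap()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")
table.put("com.example/about", "anchor:home", 1, "About")

latest = table.get("com.example/index", "contents:html")
scanned_rows = [row for row, _, _, _ in table.scan("com.example/a", "com.example/j")]
```

Note that there is no schema anywhere: a new column comes into existence simply by writing to it, which is the flexibility the "schemaless" stores are trading on.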
I don't think it's a coincidence that the companies creating and now offering these alternative data stores - free, commercial, or hybrid models like Google App Engine, which is free up to a certain point - are all giants in distributed computing and deal with data on a massive scale. My guess is that they initially deployed some things on traditional RDBMS products and outgrew them, or maybe they simply thought they could do it better for their own specific problems. As a result, I think that over time organizations are going to start thinking more and more about the type of persistence they need for different problems, and that ultimately the RDBMS will be but one of the available persistence choices.