How to get started with NoSQL

In the early days of the world-wide web, it was pretty common to see an animated GIF of a digging road worker along with the caption, “Under construction”. It was the unfinished website equivalent of asking guests to excuse the mess.

Introduction to NoSQL

As the web matured, those “Under construction” GIFs went away. However, their spiritual descendant hung around for quite some time: web applications that were unavailable, or perhaps read-only, for planned maintenance. And that planned maintenance often involved a database schema migration.

Now our expectations are different. If our mobile and web apps don’t work round the clock then it’s a big deal; in some cases it’s newsworthy.

It’s not just expectations of availability that have changed: the shape and volume of data touched by our applications are, at times, vastly different from just a few years ago.

So, why do relational databases still sit at the core of almost every web application?

Trade-offs

The answer, of course, is that we know relational databases. We understand the trade-offs, we know how to plan for them and how to query them. Relational databases work and are still relevant.

However, it’s worth noting that the benefits we get from relational databases don’t come for free. Understanding the trade-offs doesn’t make them go away.

The queryability we get from SQL comes, in part, thanks to enforced schemas that are easily indexed but not easily changed. The strong consistency we enjoy from relational databases makes scaling potentially costly and high write-availability a challenge.

NoSQL databases have plenty of trade-offs too. By understanding the basics of each type of NoSQL database we can better know when to use them and what those trade-offs will be.

Four basic types of non-relational database

NoSQL covers many types of system that store and retrieve data. It’s not a particularly accurate name, as many are adopting SQL-like query languages of their own. It’s more accurate to call them non-relational databases, as it’s in the data model — not necessarily the query language — that they differ from relational databases.

There are broadly four types of non-relational database in common usage:

key-value
document
column
graph.

As well as the data model, there is another key aspect that will influence how you architect your application and what trade-offs you need to make: how easy does that particular DBMS make it to run a single database across more than one server?

Distributed versus single instance

Only a handful of NoSQL databases in common usage are truly single server. Of those designed to scale out, however, each takes a different approach.

The primary impacts of those different approaches are:

consistency of your data versus its availability
whether the size of your dataset can be larger than the storage size of your largest machine
the introduction of single points of failure
how much work you need to do in the application layer.

We’ll touch on these briefly later, but covering them in full will take a separate post.

Key-value stores

Most of us are already familiar with key-value stores: they work much like hashes or dictionaries.

In non-relational database terms, the defining characteristics of key-value stores are:

a single value has a single key
the value is opaque to the database system
in a pure key-value system, the key is your only index.

So, why would you use a data store that seems to offer so little?

Precisely because they do so little, key-value stores can be usefully flexible:

You can store anything: binaries, POJOs, text, session data, whatever you choose.
Schema is unenforced: you can change it from one KV pair to the next.
They’re quick: there’s next to no overhead in storing and retrieving a key-value pair. There is no query to plan, no complex JOINs to resolve and no secondary indexes to maintain.
Distributing KV pairs is easy: they’re discrete units of data.
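To see why discrete key-value pairs are easy to spread across servers, here is a minimal sketch in Python. It assumes a fixed list of hypothetical node names and uses a simple hash-and-modulo scheme. It is not how any particular database does it; distributed stores typically use more careful schemes, such as consistent hashing, so that adding a node doesn't move most of the keys.

```python
import hashlib

# Hypothetical node names, used only for illustration.
NODES = ["node-a", "node-b", "node-c"]


def node_for(key, nodes=NODES):
    """Choose a node using nothing but a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the digest into an integer, then wrap it
    # around the number of nodes.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Each pair can be placed independently of every other pair.
placement = {key: node_for(key) for key in ("session:1", "session:2", "user:42")}
```

Because the placement depends only on the key, any client can work out where a pair lives without consulting the other pairs or a central directory.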

These make key-value stores a good choice for:

storing object state
storing session state
offloading/caching data from a more expensive source (e.g. a mainframe or data analysis tool)
storing high velocity data
storing unpredictable and varied data
storing large volumes of data by adding more servers.

Key-value stores are a poor fit when you need any kind of query beyond key look-up. This is one of the key trade-offs of pure key-value stores: beyond basic storage and retrieval, you must do everything else in the application layer.
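A rough sketch of what that means in practice, using a plain Python dictionary as a stand-in for the store. The put and get helpers, the key names and the sample data are all made up for illustration. The store only ever sees opaque bytes, so finding users by city means the application has to fetch and decode every value itself.

```python
import json

store = {}  # stands in for a key-value store: key -> opaque bytes


def put(key, value):
    store[key] = value


def get(key):
    return store.get(key)


# The application decides the value is JSON; the store neither knows nor cares.
put("user:1", json.dumps({"name": "Ford", "city": "Guildford"}).encode("utf-8"))
put("user:2", json.dumps({"name": "Arthur", "city": "Cottington"}).encode("utf-8"))


def users_in_city(city):
    """Query by value: scan and decode every pair in the application layer."""
    matches = []
    for key, raw in store.items():
        if json.loads(raw)["city"] == city:
            matches.append(key)
    return matches
```

Look-up by key is a single operation, while the query by city touches every pair in the store.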

Examples of key-value stores

LevelDB: simple, single-instance and requires few resources.
Riak: distributed eventually consistent database that runs across larger clusters of servers.
Couchbase: distributed strongly consistent database that runs on clusters and has strong document database functionality.

Document databases

Document databases share some things in common with key-value stores:

documents are easy to distribute
schema is unenforced
with most document databases you can use a key to look up the entire document.

Where they differ from key-value stores is most interesting, though. Document databases index and allow you to query the contents of the data. That means they’re much closer to what we’d think of as a database than just a basic storage and retrieval mechanism.
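The difference can be sketched in a few lines of Python. This is a toy in-memory model, not the API of any real document database: the insert and find functions and the choice to index a "city" field are assumptions made for the example, and it ignores updates and deletes entirely.

```python
from collections import defaultdict

documents = {}  # key -> document (a dict, as parsed from JSON)
# Secondary indexes: field name -> field value -> set of document keys.
indexes = {"city": defaultdict(set)}


def insert(key, doc):
    """Store the document and index the fields the database knows about."""
    documents[key] = doc
    for field, index in indexes.items():
        if field in doc:
            index[doc[field]].add(key)


def find(field, value):
    """Answer a query on document contents from the index, without a scan."""
    keys = indexes[field].get(value, set())
    return [documents[k] for k in sorted(keys)]


insert("u1", {"name": "Arthur", "city": "London"})
insert("u2", {"name": "Ford", "city": "Guildford"})
insert("u3", {"name": "Trillian", "city": "London"})

londoners = find("city", "London")
```

Because the database can see inside the value, it maintains the index at write time and the application no longer has to scan every document to answer the query.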

Most of today’s document databases index and query JSON data, while one or two work with XML instead. If you’ve used Lotus Notes and Domino, then you’ve used a form of document database; but don’t let that put you off.

Modern document databases are well suited to web applications:

Much of the data we work with in a web application is either already in JSON or is easily serialised to JSON.
Flexible schemas mean that downtime for schema migrations is a thing of the past, so you can iterate quickly on your app’s functionality.
Documents are easy to distribute, so scaling with rises and falls in demand becomes easy.
Structured document formats, such as JSON, are easy to index and so query.

When it comes to use cases, there is some cross-over with the key-value stores, but the queryability of the document database means it does a great deal more work for you rather than offloading that work to your application layer.

You might consider a document database for:

user profiles
session storage
any structured textual content: product catalogues, social media newsfeeds, content management and so on
object state (if easily serialised to JSON or XML).

There are two main trade-offs:

keeping schema in check becomes your job
querying isn’t yet as optimised as you’ll find in SQL databases.
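As a hedged illustration of the first trade-off, this is roughly the kind of check that moves into your application when the database no longer enforces a schema. The field names and the validate_profile function are hypothetical.

```python
# Hypothetical schema for a user profile document: field name -> expected type.
REQUIRED_FIELDS = {"name": str, "email": str}


def validate_profile(doc):
    """Return a list of problems; an empty list means the document is acceptable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"wrong type for {field}")
    return problems
```

You would run a check like this before every write, since the database itself will accept whatever it is given.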

It’s also worth noting that, as with all of these non-relational databases, the software itself and the surrounding ecosystem are less mature than you might be used to.

Examples of document databases

MongoDB: the most famous of the non-relational databases, it uses simple method chaining for queries, is easy to get started with and has a great community, but falls down at larger scales.
CouchDB: an early NoSQL pioneer, it’s not so great for larger datasets and relies on map-reduce for queries. However, all that should change with its 2.0 release.
Couchbase: the key-value store introduced map-reduce, sub-document updates and a SQL-like query language to become a document database.
Elasticsearch: while not primarily a document database, Elasticsearch allows you to store, index and search documents. If search is your only query mechanism, that may be enough for you.

Column stores

Column stores are at once both a little alien and somewhat familiar. They were designed to make it easy to store and query time-series data. What can make them confusing, at first, is that they share some terminology with relational databases, but with different meanings.

With column stores, rows can hold any number of columns and each column has its own arbitrary number of values associated with it. What makes column stores most interesting is that the data in one column is held sequentially on disk, making it fast to run range queries across that data.

Imagine an instrument that measures the flow of fluid through a pipeline. Every second it records the flow rate. In a column store, this particular instrument would have its own column and each second’s measurement would be another value in that column. Appending data is cheap in a column store, making them ideal for time series data.
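The flow meter example can be modelled with a small Python sketch. It stands in for the on-disk layout with two parallel lists and assumes readings arrive in timestamp order; the function names and sample values are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# One instrument's column: values kept in timestamp order, as they would be on disk.
timestamps = []
flow_rates = []


def append(ts, rate):
    """Appending is cheap: new readings always go on the end."""
    timestamps.append(ts)
    flow_rates.append(rate)


def range_query(start, end):
    """Return readings with start <= timestamp <= end as one contiguous slice."""
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return list(zip(timestamps[lo:hi], flow_rates[lo:hi]))


# One reading per second.
for second, rate in enumerate([4.0, 4.2, 4.1, 3.9, 4.3]):
    append(second, rate)

readings = range_query(1, 3)
```

Because the values sit next to each other in order, a range query is two binary searches followed by one sequential read.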

The main trade-off with a column store is that you lose some flexibility. You must think up-front about the shape of the data you’re storing but, most importantly, you need to consider how you’ll query the data as that’ll influence how you store it on disk.

Examples of column databases

Cassandra: the most famous column store also uses the Dynamo paper’s method of distributing data around a cluster, giving it high read/write availability at the expense of consistency.
HBase: coming from the Hadoop world, HBase is strongly consistent at the expense of availability.

Graph databases

The previous three database types start with the data first and treat query almost as a secondary consideration. Graph databases, instead, are all about query and they’re based on Euler’s graph theory.

In graph theory, a dataset consists of nodes on a graph and the connections between those nodes. What is interesting are the routes that the connections enable between the nodes.

A common example of a graph database use case is a social network. Each node is a person and the connections are the ways in which those people are connected to one another. So, two people who share a love of the work of Douglas Adams could have directional connections between them tagged with “Douglas Adams”. If those people each have non-mutual Douglas Adams connections, then the social network could help them to find other Douglas Adams fans. How? The graph database can traverse the graph of people looking for connections tagged “Douglas Adams”.
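A minimal sketch of that traversal in Python, using a plain dictionary as the graph and a breadth-first search. The people, the tags and the fans_reachable function are made up for the example; a real graph database would express this in its own query language.

```python
from collections import deque

# person -> list of (connected person, tag on the connection)
edges = {
    "alice": [("bob", "Douglas Adams")],
    "bob": [("alice", "Douglas Adams"), ("carol", "Douglas Adams"), ("dave", "Cricket")],
    "carol": [("bob", "Douglas Adams")],
    "dave": [("bob", "Cricket")],
}


def fans_reachable(start, tag):
    """Breadth-first traversal that only follows connections carrying the tag."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        person = queue.popleft()
        for other, edge_tag in edges.get(person, []):
            if edge_tag == tag and other not in seen:
                seen.add(other)
                found.append(other)
                queue.append(other)
    return found


suggestions = fans_reachable("alice", "Douglas Adams")
```

Here alice is connected directly only to bob, but the traversal also reaches carol through bob, which is the kind of non-mutual connection the social network could suggest.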

This makes graph databases ideal for dating sites and so on. Perhaps more interestingly it also helps to identify unusual elements in datasets, such as fraudulent activity in banking.

The big trade-off is that, to have queries run in an acceptable amount of time, connected data must live on the same server. That makes it harder to scale out a graph database. It might also be necessary to run some other kind of database to store the meat of the data about the nodes in the graph.

Examples of graph databases

Neo4j: by far the best-known graph database, Neo4j uses a query language, Cypher, based on ASCII art representations of the query.
ArangoDB: a multi-model database which handles documents and graph-type queries.

What next?

We’re now six, or maybe eleven, years into the current NoSQL movement. The hype has, more or less, died down and the reality is emerging of where each of these types of database is useful.

Relational databases will not be going anywhere, but it’s worth considering the trade-offs that you make with any type of data store and choosing a set of database technologies whose advantages for your use case make the trade-offs worthwhile.